NodePrime — Eval Team · 2026-04-24

Our framework made an agent +4.9% better —
automatically.

No engineer hand-wrote a new prompt. The system drafted candidates, scored each against a golden set, kept the winner. This is how every client's agents will improve between delivery visits.

run · hc_demo_weakbase · baseline 0.816 → best 0.857 · 4 iteration(s)
Step 1
Score what the agent already does
Run the current prompt on a golden set of real prospects. Measure mean quality.
Step 2
Propose a rewrite
An LLM-driven mutator rewrites the prompt while preserving its contract. The rewrite is stored as a candidate.
Step 3
Keep it only if it wins
Score the candidate on the same golden set. Adopt it only if it beats the baseline by at least our threshold. Repeat.
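The three steps above form a simple hill-climbing loop. A minimal sketch, with hypothetical `score` and `mutate` callables standing in for the real golden-set scorer and LLM mutator (the names and the toy stand-ins are illustrative, not NodePrime's actual API):

```python
def hill_climb(baseline, score, mutate, iterations, threshold):
    """Score -> propose -> keep: adopt a candidate only when it beats
    the current best by at least `threshold`, then repeat."""
    best_prompt, best_score = baseline, score(baseline)   # Step 1: score what the agent already does
    for _ in range(iterations):
        candidate = mutate(best_prompt)                   # Step 2: propose a rewrite
        candidate_score = score(candidate)                # Step 3: score on the same golden set
        if candidate_score >= best_score + threshold:
            best_prompt, best_score = candidate, candidate_score  # keep it only if it wins
    return best_prompt, best_score

# Toy stand-ins, purely illustrative: this scorer rewards shorter prompts
# and this mutator trims one character per iteration.
toy_score = lambda p: 1.0 - len(p) / 100
toy_mutate = lambda p: p[:-1]
best, quality = hill_climb("x" * 50, toy_score, toy_mutate, iterations=4, threshold=0.005)
# quality climbs 0.50 -> 0.54 over 4 accepted iterations
```

The threshold is what keeps the loop honest: a candidate that only matches the baseline, or beats it by noise-level margins, is discarded rather than adopted.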

Before / After — same prospect, different prompt generations

baseline prompt
Denver expansion + AI-assisted recruiting at Apex Talent
Hi there, I saw Apex Talent Solutions just opened a Denver office and posted 12 recruiter roles—clear signal you're scaling. Scaling recruiting ops usually means the speed-vs-quality squeeze: screen faster, but lose quality; hire more recruiters, but burn them out. We work with staffing firms to solve this differently. NodePrime builds AI systems that reduce recruiter overhead—better matching, faster screening, smarter pipeline triage. Here's the key difference: our systems keep working and earning money after we hand them over. Our first client recovered $240K/year. We offer a $5K diagnostic to map your biggest bottleneck and show the ROI. Interested in exploring? [Your name] NodePrime
score 0.804
optimized prompt
Placement velocity for Apex Talent's Denver expansion
Your Denver office launch is smart. With 12 open recruiter roles in 3 weeks, you're scaling headcount—which usually means trading velocity for quality. In staffing, that's the trade-off that hurts margins. We work with firms like you on exactly this. Our first client recovered $240K/year through better sourcing velocity and lower turnover. A $5K diagnostic maps your specific opportunity. We're outcomes-first: if the math doesn't work, we say so. No long consulting engagements on hope. Can we grab 15 minutes Wednesday?
score 0.881 · +0.077
Input: Apex Talent Solutions · both outputs from the exact same prospect context

What the system noticed on its own — pending review

prompt_edit · pending review
agent:email_outreach@1.0.0
Lower-scoring emails use generic, non-personalized subject lines; higher-scoring samples reference the prospect's specific signal (location, vertical, expansion detail), establishing immediate credibility and relevance.
confidence 74%
prompt_edit · pending review
agent:email_generic@0.1.0
Lower-scoring samples (0.82–0.84) spend excessive space diagnosing industry-specific problems before introducing NodePrime's solution, while higher-scoring samples (0.87–0.88) validate the prospect's action, name the constraint briefly, then immediately pivot to the value proposition.
confidence 71%
These are hypotheses, not changes. A human reviews each one and approves or rejects it.
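The pending-review records above can be modeled as plain data with a mandatory human gate. A sketch under assumed field names (`agent`, `rationale`, `confidence`, `status` are illustrative choices, not NodePrime's actual schema):

```python
from dataclasses import dataclass

@dataclass
class PromptEditHypothesis:
    """One thing the system noticed, held until a human decides."""
    agent: str                       # e.g. "agent:email_outreach@1.0.0"
    rationale: str                   # what the system noticed across scored samples
    confidence: float                # 0.0 - 1.0
    status: str = "pending_review"   # no change ships while pending

    def review(self, approved: bool) -> None:
        # The only path out of "pending_review" is a human decision.
        self.status = "approved" if approved else "rejected"

h = PromptEditHypothesis(
    agent="agent:email_outreach@1.0.0",
    rationale="Higher-scoring samples reference a prospect-specific signal.",
    confidence=0.74,
)
h.review(approved=True)  # status becomes "approved"
```

Keeping the record as data, separate from the act of applying it, is what makes "hypotheses, not changes" enforceable: nothing downstream acts on a record whose status is still pending.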

What this unlocks