NodePrime — Eval Team · 2026-04-24

Our framework made an agent +4.9% better —
automatically.

No engineer hand-wrote a new prompt. The system drafted candidates, scored each against a golden set, kept the winner. This is how every client's agents will improve between delivery visits.

run · hc_demo_weakbase · baseline 0.816 → best 0.857 · 4 iteration(s)
Step 1
Score what the agent already does
Run the current prompt on a golden set of real prospects. Measure mean quality.
Step 2
Propose a rewrite
An LLM-driven mutator rewrites the prompt while preserving its contract. The rewrite is stored as a candidate.
Step 3
Keep it only if it wins
Score the candidate on the same golden set. Adopt it only if it beats the baseline by at least our threshold. Repeat.
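The three steps above form a simple hill-climbing loop. A minimal sketch, with hypothetical `score` and `mutate` callables standing in for the real golden-set scorer and LLM mutator (the names and the toy stand-ins are illustrative, not NodePrime's actual API):

```python
def hill_climb(baseline, score, mutate, iterations, threshold):
    """Score -> propose -> keep: adopt a candidate only when it beats
    the current best by at least `threshold`, then repeat."""
    best_prompt, best_score = baseline, score(baseline)   # Step 1: score what the agent already does
    for _ in range(iterations):
        candidate = mutate(best_prompt)                   # Step 2: propose a rewrite
        candidate_score = score(candidate)                # Step 3: score on the same golden set
        if candidate_score >= best_score + threshold:
            best_prompt, best_score = candidate, candidate_score  # keep it only if it wins
    return best_prompt, best_score

# Toy stand-ins, purely illustrative: this scorer rewards shorter prompts
# and this mutator trims one character per iteration.
toy_score = lambda p: 1.0 - len(p) / 100
toy_mutate = lambda p: p[:-1]
best, quality = hill_climb("x" * 50, toy_score, toy_mutate, iterations=4, threshold=0.005)
# quality climbs 0.50 -> 0.54 over 4 accepted iterations
```

The threshold is what keeps the loop honest: a candidate that only matches the baseline, or beats it by noise-level margins, is discarded rather than adopted.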

Before / After — same prospect, different prompt generations

baseline prompt
Denver expansion + AI-assisted recruiting at Apex Talent
Hi there, I saw Apex Talent Solutions just opened a Denver office and posted 12 recruiter roles—clear signal you're scaling. Scaling recruiting ops usually means the speed-vs-quality squeeze: screen faster, but lose quality; hire more recruiters, but burn them out. We work with staffing firms to solve this differently. NodePrime builds AI systems that reduce recruiter overhead—better matching, faster screening, smarter pipeline triage. Here's the key difference: our systems keep working and earning money after we hand them over. Our first client recovered $240K/year. We offer a $5K diagnostic to map your biggest bottleneck and show the ROI. Interested in exploring? [Your name] NodePrime
score 0.804
optimized prompt
Placement velocity for Apex Talent's Denver expansion
Your Denver office launch is smart. With 12 open recruiter roles in 3 weeks, you're scaling headcount—which usually means trading velocity for quality. In staffing, that's the trade-off that hurts margins. We work with firms like you on exactly this. Our first client recovered $240K/year through better sourcing velocity and lower turnover. A $5K diagnostic maps your specific opportunity. We're outcomes-first: if the math doesn't work, we say so. No long consulting engagements on hope. Can we grab 15 minutes Wednesday?
score 0.881 · +0.077
Input: Apex Talent Solutions · both outputs from the exact same prospect context

What the system noticed on its own — pending review

prompt_edit · pending review
agent:email_outreach@1.0.0
Lower-scoring emails use generic, non-personalized subject lines; higher-scoring samples reference the prospect's specific signal (location, vertical, expansion detail), establishing immediate credibility and relevance.
confidence 74%
prompt_edit · pending review
agent:email_generic@0.1.0
Lower-scoring samples (0.82–0.84) spend excessive space diagnosing industry-specific problems before introducing NodePrime's solution, while higher-scoring samples (0.87–0.88) validate the prospect's action, name the constraint briefly, then immediately pivot to the value proposition.
confidence 71%
These are hypotheses, not changes. A human reviews each one and approves or rejects it.
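The pending-review records above can be modeled as plain data with a mandatory human gate. A sketch under assumed field names (`agent`, `rationale`, `confidence`, `status` are illustrative choices, not NodePrime's actual schema):

```python
from dataclasses import dataclass

@dataclass
class PromptEditHypothesis:
    """One thing the system noticed, held until a human decides."""
    agent: str                       # e.g. "agent:email_outreach@1.0.0"
    rationale: str                   # what the system noticed across scored samples
    confidence: float                # 0.0 - 1.0
    status: str = "pending_review"   # no change ships while pending

    def review(self, approved: bool) -> None:
        # The only path out of "pending_review" is a human decision.
        self.status = "approved" if approved else "rejected"

h = PromptEditHypothesis(
    agent="agent:email_outreach@1.0.0",
    rationale="Higher-scoring samples reference a prospect-specific signal.",
    confidence=0.74,
)
h.review(approved=True)  # status becomes "approved"
```

Keeping the record as data, separate from the act of applying it, is what makes "hypotheses, not changes" enforceable: nothing downstream acts on a record whose status is still pending.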

What this unlocks