Built on real data.
Research-backed prompts.

We tested every hypothesis systematically. Here's what works.

1,800 tests run
36 strategies evaluated
5 scoring dimensions

Methodology

An automated evaluation framework ran every strategy against every test case. No guesswork.
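A minimal sketch of such a framework, assuming the grid is simply every strategy paired with every test case (36 × 50 = 1,800 runs). All names and the stub evaluator below are illustrative, not the real harness:

```python
from itertools import product

# Hypothetical evaluation grid: 36 strategy variants x 50 test prompts.
STRATEGIES = [f"strategy_{i:02d}" for i in range(36)]
TEST_CASES = [f"case_{i:02d}" for i in range(50)]

def run_grid(evaluate):
    """Run every (strategy, test case) combination and collect scores."""
    return {
        (s, c): evaluate(s, c)   # each call returns a 1-10 judge score
        for s, c in product(STRATEGIES, TEST_CASES)
    }

results = run_grid(lambda s, c: 7.0)  # stub evaluator for illustration
print(len(results))                   # 1800 runs in total
```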

Test cases

50 real-world coding prompts across 5 domains.

General: broad tasks like "add dark mode," "refactor auth," "set up CI/CD"
Debug: bug reports like "it crashes on submit," "memory leak in production"
Backend: server-side work like "add a payments endpoint," "implement S3 upload"
Interface: UI tasks like "make the table sortable," "create an onboarding flow"
Agent: AI agent tasks like "build a support agent," "meeting summarizer"

36 prompting strategies

Each variant tested a different hypothesis.

Persona framing: expert prompt engineer, senior developer, tech lead, "you are the AI coder," or no persona
Methodology: 3-step, 6-step, 8-step, checklist, chain-of-thought, freeform, XML-structured, self-evaluation
Example count: 0 (zero-shot), 2 diverse, 4, or 6 examples
Rule strictness: very strict, loose/creative, anti-hallucination priority, conciseness focus, completeness focus
Temperature: 0, 0.3, 0.5, 0.7, and 1.0 (default)
Token limits: 512, 1024, and 2048 max tokens
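One way to picture a variant is as a config record over these axes. The schema below is an illustrative sketch, not the harness's real data model; note the 36 variants each change one axis against a shared baseline rather than crossing every axis:

```python
from dataclasses import dataclass, replace

# Hypothetical strategy record; field values mirror the axes above.
@dataclass(frozen=True)
class Strategy:
    persona: str          # e.g. "senior developer", or "" for no persona
    methodology: str      # e.g. "3-step", "chain-of-thought", "freeform"
    example_count: int    # 0 (zero-shot), 2, 4, or 6
    rule_focus: str       # e.g. "anti-hallucination", "conciseness"
    temperature: float    # 0, 0.3, 0.5, 0.7, or 1.0
    max_tokens: int       # 512, 1024, or 2048

# Each variant tweaks one axis of a shared baseline (values illustrative).
baseline = Strategy("senior developer", "3-step", 2,
                    "anti-hallucination", 1.0, 1024)
low_temp = replace(baseline, temperature=0.3)
```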

Scoring

Every output scored by an independent AI judge across 5 dimensions, each rated 1–10.

Specificity

Are concrete details like language, framework, and scope added?

Intent Preservation

Is the user's original ask preserved without adding unwanted features?

Conciseness

Is every word earning its place?

Completeness

Are scope, constraints, and edge cases covered?

No Hallucination

Are invented requirements or context avoided?
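The per-output score is the mean of the five dimension ratings. A sketch of that scoring step, assuming the judge replies with one "dimension: score" line per dimension (the response format and parsing here are illustrative):

```python
# Five rubric dimensions, each rated 1-10 by an independent judge model.
DIMENSIONS = ["specificity", "intent_preservation", "conciseness",
              "completeness", "no_hallucination"]

def parse_judge_scores(response: str) -> dict:
    """Parse 'dimension: score' lines from a judge response."""
    scores = {}
    for line in response.splitlines():
        name, _, value = line.partition(":")
        key = name.strip().lower().replace(" ", "_")
        if key in DIMENSIONS:
            scores[key] = int(value)
    return scores

def average_score(scores: dict) -> float:
    return sum(scores.values()) / len(scores)

reply = """specificity: 8
intent preservation: 9
conciseness: 7
completeness: 8
no hallucination: 9"""
print(average_score(parse_judge_scores(reply)))  # 8.2
```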

Results

All 36 strategies ranked by average score. Range: 6.10 to 7.83.

#1 7.83 Low-temperature baseline (temp 0.3)
#2 7.77 2 diverse examples (one vague, one already-good)
#3 7.77 4 few-shot examples
#4 7.72 Refined v2: anti-hallucination + conciseness + no-questions
#5 7.68 Refined v2 at temperature 0.3
All 36 rankings

#    Strategy                                                     Avg score
1    Low-temperature baseline (temp 0.3)                          7.83
2    2 diverse examples (one vague, one already-good)             7.77
3    4 few-shot examples                                          7.77
4    Refined v2: anti-hallucination + conciseness + no-questions  7.72
5    Refined v2 at temperature 0.3                                7.68
6    Original baseline (v1)                                       7.59
7    Extreme conciseness focus                                    7.56
8    Minimal instructions, 6 examples (learn by example)          7.52
9    No persona, just instructions                                7.40
10   Strong anti-hallucination emphasis                           7.36
11   Very detailed/verbose system prompt                          7.33
12   Senior developer persona                                     7.29
13   Max tokens 512 (shorter output)                              7.28
14   Temperature 0.7                                              7.28
15   Tech lead writing tickets persona                            7.27
16   Negative examples (what NOT to do)                           7.26
17   XML-tag structured internal reasoning                        7.20
18   Before/after framing in instructions                         7.18
19   Temperature 0 (deterministic)                                7.16
20   Expanded 8-step methodology                                  7.14
21   Max tokens 2048 (longer output)                              7.14
22   Temperature 0.5                                              7.08
23   Very strict output rules                                     7.08
24   Freeform, no numbered steps                                  7.07
25   Zero-shot (no examples)                                      7.04
26   Chain-of-thought reasoning                                   7.02
27   Simplified 3-step methodology                                6.98
28   Ask-then-answer methodology                                  6.94
29   Temperature 0.3 (different base)                             6.88
30   Self-evaluate before outputting                              6.84
31   Minimal rules, creative freedom                              6.80
32   Structured output with sections                              6.78
33   Checklist-style methodology                                  6.77
34   Ultra-terse system prompt                                    6.49
35   Persona: you ARE the AI coder                                6.41
36   Completeness over conciseness focus                          6.10
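The ranking itself is just a per-strategy mean over all judged runs, sorted descending. An illustrative aggregation step (the run data below is made up to show the shape, not real results):

```python
from collections import defaultdict

def rank_strategies(runs):
    """runs: iterable of (strategy, score) pairs -> [(strategy, avg), ...]."""
    by_strategy = defaultdict(list)
    for strategy, score in runs:
        by_strategy[strategy].append(score)
    averages = {s: sum(v) / len(v) for s, v in by_strategy.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: two runs per strategy instead of the real 50.
runs = [("low_temp_0.3", 7.90), ("low_temp_0.3", 7.76),
        ("zero_shot", 7.00), ("zero_shot", 7.08)]
ranked = rank_strategies(runs)
print(ranked[0][0])  # low_temp_0.3
```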

Performance by domain

Average scores for the winning strategy across each domain.

Agent      7.77
Backend    7.46
Interface  7.15
General    6.82
Debug      6.64

Key insights

Temperature matters more than prompt engineering

The same prompt, run at a lower temperature, produced the single biggest improvement over the v1 baseline of any change we tested.

temp 1.0 7.59
temp 0.3 7.83

Two examples beat zero. Four didn't help.

Zero-shot: 7.04. Two diverse examples: 7.77 (+10.4%). Adding more examples gave no further gain.
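A sketch of what the winning "2 diverse examples" variant might look like: one vague input that needs heavy improvement, one already-good input that should pass through mostly unchanged. The wording and function names here are illustrative, not the production prompt:

```python
# Hypothetical few-shot pair: one vague prompt, one already-good prompt.
FEW_SHOT = [
    {"input": "make it faster",
     "output": "Profile the /search endpoint, find the slowest query, and "
               "add an index or cache to cut p95 latency below 200 ms."},
    {"input": "Add a GET /users/{id} endpoint that returns 404 for unknown ids.",
     "output": "Add a GET /users/{id} endpoint that returns 404 for unknown ids."},
]

def build_system_prompt(base: str) -> str:
    """Append the two diverse examples to a base system prompt."""
    blocks = [f"Input: {ex['input']}\nOutput: {ex['output']}"
              for ex in FEW_SHOT]
    return base + "\n\nExamples:\n\n" + "\n\n".join(blocks)

prompt = build_system_prompt("You improve vague coding prompts.")
```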

Guardrails prevent catastrophic failures

The worst outputs weren't merely mediocre: the AI either asked clarifying questions instead of improving the prompt, or invented requirements the user never stated. Two rules fixed both failure modes: a no-questions rule and an anti-hallucination rule.

More instructions made it worse

The variant that piled on the most requirements (completeness over conciseness) scored last at 6.10. The winner kept it minimal: a 3-step method, strict guardrails, nothing else.

Persona framing has diminishing returns

"You ARE the AI coder" scored 6.41. A simple, credible persona works best.

There's a ceiling, and it's the starting prompt

No system prompt pushes past ~7.8. The algorithm can only work with what you give it. The real lever is the starting prompt itself.

The refined algorithm

The production algorithm combines the top-performing traits from our evaluation.

System prompt

v1 → v2

  • Anti-hallucination as rule #1: "Did the user imply this, or am I making it up?"
  • 3-step methodology (WHAT/HOW/DONE): replaced the 6-step approach
  • No-questions rule: improve, don't interrogate
  • Sentence budget: 2–4 simple, 4–6 complex
  • 2 diverse examples: down from 4
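A hedged sketch of how these v2 traits might compose into a single request. The prompt wording, function names, and request shape are illustrative; only the tuned values (temperature 0.3, 1024 max tokens) come from the evaluation:

```python
# Illustrative condensation of the v2 traits into one system prompt.
V2_SYSTEM_PROMPT = """You improve coding prompts.
Rule 1 (anti-hallucination): for every detail, ask "did the user imply
this, or am I making it up?" Never invent requirements.
Rule 2 (no questions): improve the prompt; do not interrogate the user.
Method: state WHAT to build, HOW to build it, and what DONE looks like.
Budget: 2-4 sentences for simple asks, 4-6 for complex ones."""

V2_PARAMS = {"temperature": 0.3, "max_tokens": 1024}  # tuned vs. v1's 1.0

def build_request(user_prompt: str) -> dict:
    """Assemble a provider-agnostic chat request for the refined algorithm."""
    return {
        "system": V2_SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": user_prompt}],
        **V2_PARAMS,
    }

request = build_request("add dark mode")
```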
Parameters

Tuned values

Temperature 1.0 → 0.3
Max tokens 1024 (unchanged)
Model Claude Haiku 4.5 (unchanged)
Original v1 7.59
Refined v2 7.83
+3.2% overall improvement