Built on real data.
Research-backed prompts.

We tested every hypothesis systematically. Here's what works.

1,800 tests run
36 strategies evaluated
5 scoring dimensions

Methodology

An automated evaluation framework ran every strategy against every test case. No guesswork.
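A minimal sketch of such a framework, assuming the grid is simply every strategy paired with every test case (36 × 50 = 1,800 runs). All names and the stub evaluator below are illustrative, not the real harness:

```python
from itertools import product

# Hypothetical evaluation grid: 36 strategy variants x 50 test prompts.
STRATEGIES = [f"strategy_{i:02d}" for i in range(36)]
TEST_CASES = [f"case_{i:02d}" for i in range(50)]

def run_grid(evaluate):
    """Run every (strategy, test case) combination and collect scores."""
    return {
        (s, c): evaluate(s, c)   # each call returns a 1-10 judge score
        for s, c in product(STRATEGIES, TEST_CASES)
    }

results = run_grid(lambda s, c: 7.0)  # stub evaluator for illustration
print(len(results))                   # 1800 runs in total
```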

Test cases

50 real-world coding prompts across 5 domains.

General: broad tasks like "add dark mode," "refactor auth," "set up CI/CD"
Debug: bug reports like "it crashes on submit," "memory leak in production"
Backend: server-side work like "add a payments endpoint," "implement S3 upload"
Interface: UI tasks like "make the table sortable," "create an onboarding flow"
Agent: AI agent tasks like "build a support agent," "meeting summarizer"

36 prompting strategies

Each variant tested a different hypothesis.

Persona framing: expert prompt engineer, senior developer, tech lead, "you are the AI coder," or no persona
Methodology: 3-step, 6-step, 8-step, checklist, chain-of-thought, freeform, XML-structured, self-evaluation
Example count: 0 (zero-shot), 2 diverse, 4, or 6 examples
Rule strictness: very strict, loose/creative, anti-hallucination priority, conciseness focus, completeness focus
Temperature: 0, 0.3, 0.5, 0.7, and 1.0 (default)
Token limits: 512, 1024, and 2048 max tokens
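One way to picture a variant is as a config record over these axes. The schema below is an illustrative sketch, not the harness's real data model; note the 36 variants each change one axis against a shared baseline rather than crossing every axis:

```python
from dataclasses import dataclass, replace

# Hypothetical strategy record; field values mirror the axes above.
@dataclass(frozen=True)
class Strategy:
    persona: str          # e.g. "senior developer", or "" for no persona
    methodology: str      # e.g. "3-step", "chain-of-thought", "freeform"
    example_count: int    # 0 (zero-shot), 2, 4, or 6
    rule_focus: str       # e.g. "anti-hallucination", "conciseness"
    temperature: float    # 0, 0.3, 0.5, 0.7, or 1.0
    max_tokens: int       # 512, 1024, or 2048

# Each variant tweaks one axis of a shared baseline (values illustrative).
baseline = Strategy("senior developer", "3-step", 2,
                    "anti-hallucination", 1.0, 1024)
low_temp = replace(baseline, temperature=0.3)
```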

Scoring

Every output scored by an independent AI judge across 5 dimensions, each rated 1–10.

Specificity

Are concrete details like language, framework, and scope added?

Intent Preservation

Is the user's original ask preserved without adding unwanted features?

Conciseness

Is every word earning its place?

Completeness

Are scope, constraints, and edge cases covered?

No Hallucination

Are invented requirements or context avoided?
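The per-output score is the mean of the five dimension ratings. A sketch of that scoring step, assuming the judge replies with one "dimension: score" line per dimension (the response format and parsing here are illustrative):

```python
# Five rubric dimensions, each rated 1-10 by an independent judge model.
DIMENSIONS = ["specificity", "intent_preservation", "conciseness",
              "completeness", "no_hallucination"]

def parse_judge_scores(response: str) -> dict:
    """Parse 'dimension: score' lines from a judge response."""
    scores = {}
    for line in response.splitlines():
        name, _, value = line.partition(":")
        key = name.strip().lower().replace(" ", "_")
        if key in DIMENSIONS:
            scores[key] = int(value)
    return scores

def average_score(scores: dict) -> float:
    return sum(scores.values()) / len(scores)

reply = """specificity: 8
intent preservation: 9
conciseness: 7
completeness: 8
no hallucination: 9"""
print(average_score(parse_judge_scores(reply)))  # 8.2
```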

Results

All 36 strategies ranked by average score. Range: 6.10 to 7.83.

#1 7.83 Low-temperature baseline (temp 0.3)
#2 7.77 2 diverse examples (one vague, one already-good)
#3 7.77 4 few-shot examples
#4 7.72 Refined v2: anti-hallucination + conciseness + no-questions
#5 7.68 Refined v2 at temperature 0.3
All 36 rankings

#    Strategy                                                     Avg score
1    Low-temperature baseline (temp 0.3)                          7.83
2    2 diverse examples (one vague, one already-good)             7.77
3    4 few-shot examples                                          7.77
4    Refined v2: anti-hallucination + conciseness + no-questions  7.72
5    Refined v2 at temperature 0.3                                7.68
6    Original baseline (v1)                                       7.59
7    Extreme conciseness focus                                    7.56
8    Minimal instructions, 6 examples (learn by example)          7.52
9    No persona, just instructions                                7.40
10   Strong anti-hallucination emphasis                           7.36
11   Very detailed/verbose system prompt                          7.33
12   Senior developer persona                                     7.29
13   Max tokens 512 (shorter output)                              7.28
14   Temperature 0.7                                              7.28
15   Tech lead writing tickets persona                            7.27
16   Negative examples (what NOT to do)                           7.26
17   XML-tag structured internal reasoning                        7.20
18   Before/after framing in instructions                         7.18
19   Temperature 0 (deterministic)                                7.16
20   Expanded 8-step methodology                                  7.14
21   Max tokens 2048 (longer output)                              7.14
22   Temperature 0.5                                              7.08
23   Very strict output rules                                     7.08
24   Freeform, no numbered steps                                  7.07
25   Zero-shot (no examples)                                      7.04
26   Chain-of-thought reasoning                                   7.02
27   Simplified 3-step methodology                                6.98
28   Ask-then-answer methodology                                  6.94
29   Temperature 0.3 (different base)                             6.88
30   Self-evaluate before outputting                              6.84
31   Minimal rules, creative freedom                              6.80
32   Structured output with sections                              6.78
33   Checklist-style methodology                                  6.77
34   Ultra-terse system prompt                                    6.49
35   Persona: you ARE the AI coder                                6.41
36   Completeness over conciseness focus                          6.10
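The ranking itself is just a per-strategy mean over all judged runs, sorted descending. An illustrative aggregation step (the run data below is made up to show the shape, not real results):

```python
from collections import defaultdict

def rank_strategies(runs):
    """runs: iterable of (strategy, score) pairs -> [(strategy, avg), ...]."""
    by_strategy = defaultdict(list)
    for strategy, score in runs:
        by_strategy[strategy].append(score)
    averages = {s: sum(v) / len(v) for s, v in by_strategy.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: two runs per strategy instead of the real 50.
runs = [("low_temp_0.3", 7.90), ("low_temp_0.3", 7.76),
        ("zero_shot", 7.00), ("zero_shot", 7.08)]
ranked = rank_strategies(runs)
print(ranked[0][0])  # low_temp_0.3
```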

Performance by domain

Average scores for the winning strategy across each domain.

Agent      7.77
Backend    7.46
Interface  7.15
General    6.82
Debug      6.64

Key insights

Temperature matters more than prompt engineering

The same prompt, run at a lower temperature, produced the single biggest improvement over the v1 baseline of any change we tested.

temp 1.0 7.59
temp 0.3 7.83

Two examples beat zero. Four didn't help.

Zero-shot: 7.04. Two diverse examples: 7.77 (+10.4%). Adding more examples gave no further gain.
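A sketch of what the winning "2 diverse examples" variant might look like: one vague input that needs heavy improvement, one already-good input that should pass through mostly unchanged. The wording and function names here are illustrative, not the production prompt:

```python
# Hypothetical few-shot pair: one vague prompt, one already-good prompt.
FEW_SHOT = [
    {"input": "make it faster",
     "output": "Profile the /search endpoint, find the slowest query, and "
               "add an index or cache to cut p95 latency below 200 ms."},
    {"input": "Add a GET /users/{id} endpoint that returns 404 for unknown ids.",
     "output": "Add a GET /users/{id} endpoint that returns 404 for unknown ids."},
]

def build_system_prompt(base: str) -> str:
    """Append the two diverse examples to a base system prompt."""
    blocks = [f"Input: {ex['input']}\nOutput: {ex['output']}"
              for ex in FEW_SHOT]
    return base + "\n\nExamples:\n\n" + "\n\n".join(blocks)

prompt = build_system_prompt("You improve vague coding prompts.")
```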

Guardrails prevent catastrophic failures

The worst outputs weren't merely mediocre: the AI either asked clarifying questions instead of improving the prompt, or invented requirements the user never stated. Two rules fixed both failure modes: a no-questions rule and an anti-hallucination rule.

More instructions made it worse

The variant that piled on the most requirements (completeness over conciseness) scored last at 6.10. The winner kept it minimal: a 3-step method, strict guardrails, nothing else.

Persona framing has diminishing returns

"You ARE the AI coder" scored 6.41. A simple, credible persona works best.

There's a ceiling, and it's the starting prompt

No system prompt pushes past ~7.8. The algorithm can only work with what you give it. The real lever is the starting prompt itself.

The refined algorithm

The production algorithm combines the top-performing traits from our evaluation.

System prompt

v1 → v2

  • Anti-hallucination as rule #1: "Did the user imply this, or am I making it up?"
  • 3-step methodology (WHAT/HOW/DONE): replaced the 6-step approach
  • No-questions rule: improve, don't interrogate
  • Sentence budget: 2–4 simple, 4–6 complex
  • 2 diverse examples: down from 4
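A hedged sketch of how these v2 traits might compose into a single request. The prompt wording, function names, and request shape are illustrative; only the tuned values (temperature 0.3, 1024 max tokens) come from the evaluation:

```python
# Illustrative condensation of the v2 traits into one system prompt.
V2_SYSTEM_PROMPT = """You improve coding prompts.
Rule 1 (anti-hallucination): for every detail, ask "did the user imply
this, or am I making it up?" Never invent requirements.
Rule 2 (no questions): improve the prompt; do not interrogate the user.
Method: state WHAT to build, HOW to build it, and what DONE looks like.
Budget: 2-4 sentences for simple asks, 4-6 for complex ones."""

V2_PARAMS = {"temperature": 0.3, "max_tokens": 1024}  # tuned vs. v1's 1.0

def build_request(user_prompt: str) -> dict:
    """Assemble a provider-agnostic chat request for the refined algorithm."""
    return {
        "system": V2_SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": user_prompt}],
        **V2_PARAMS,
    }

request = build_request("add dark mode")
```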
Parameters

Tuned values

Temperature 1.0 → 0.3
Max tokens 1024 (unchanged)
Model Claude Haiku 4.5 (unchanged)
Original v1 7.59
Refined v2 7.83
+3.2% overall improvement