What Is Prompt Engineering?
A Grounded Definition
Prompt engineering is the practice of designing, refining, and optimizing input instructions to guide large language models toward producing desired outputs. It sits at the intersection of linguistics, cognitive science, and software engineering. Unlike traditional programming where syntax is rigid, prompt engineering operates in natural language, requiring an understanding of how models parse context, follow patterns, and generate coherent responses.
The discipline emerged not from academic theory but from empirical necessity. As models grew more capable, users discovered that the phrasing, structure, and context of prompts dramatically affected output quality. What began as trial and error has evolved into a systematic field with established patterns, measurable best practices, and sophisticated multi-step pipelines.
The Evolution from GPT-2 to 2026
The Zero-Shot Era: GPT-2 (2019)
On February 14, 2019, OpenAI released GPT-2 with the paper "Language Models are Unsupervised Multitask Learners." The full 1.5 billion parameter model followed on November 5, 2019. GPT-2 demonstrated something remarkable: language models could perform multiple NLP tasks without task-specific training, using what researchers called zero-shot transfer.
The key insight was task conditioning. By appending instructions like "Translate to French:" before English text, GPT-2 could perform translation without ever being explicitly trained on translation pairs. This was the first hint that how you phrased the input mattered as much as the model's training.
However, GPT-2's zero-shot capabilities were limited. Performance on complex reasoning tasks remained poor. The model would often hallucinate, generate irrelevant text, or fail to follow instructions precisely. Users began experimenting with prompt phrasing, but the field lacked systematic understanding.
The Few-Shot Breakthrough: GPT-3 (2020)
On May 28, 2020, OpenAI published "Language Models are Few-Shot Learners," introducing GPT-3 with 175 billion parameters. This was the inflection point that made prompt engineering a genuine discipline.
GPT-3's scale enabled in-context learning. The model could learn new tasks from just a few examples embedded in the prompt itself, without any gradient updates or fine-tuning. A prompt might look like:
English: Hello
French: Bonjour
English: Goodbye
French: Au revoir
English: Thank you
French:
GPT-3 would complete "French: Merci" based on the pattern established by the examples. This few-shot capability meant that prompt construction became a form of meta-learning. The examples you chose, their order, their formatting, all affected performance.
Researchers and practitioners began systematically studying what worked. They discovered that example selection mattered, that label distribution affected outputs, and that even the spacing and punctuation in prompts influenced results. The term "prompt engineering" entered common usage around this time.
Chain-of-Thought: Teaching Models to Think (2022)
On January 28, 2022, Google Research published "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (arXiv:2201.11903). Authors Jason Wei, Xuezhi Wang, Dale Schuurmans, and colleagues introduced a deceptively simple technique that transformed how we approach complex reasoning tasks.
Chain-of-thought (CoT) prompting adds intermediate reasoning steps to few-shot examples:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls,
with 3 balls in each can. How many tennis balls does he have?
A: Roger starts with 5 balls. He buys 2 cans with 3 balls each,
so 2 * 3 = 6 balls. 5 + 6 = 11. The answer is 11.
Q: A juggler has 16 balls. Half are golf balls and half are tennis balls.
He loses 3 golf balls and 2 tennis balls. How many balls does he have left?
A: He starts with 8 golf balls and 8 tennis balls. After losing 3 golf balls,
he has 5 golf balls. After losing 2 tennis balls, he has 6 tennis balls.
5 + 6 = 11. The answer is 11.
Q: [Your question here]
A:
The paper demonstrated that CoT prompting enabled a 540B-parameter LLM to achieve state-of-the-art accuracy on the GSM8K math word problems benchmark using only eight exemplars. This outperformed fine-tuned GPT-3 with an external verifier.
The significance extended beyond math problems. CoT prompting improved performance on arithmetic, commonsense, and symbolic reasoning tasks. It revealed that large language models possessed latent reasoning capabilities that could be unlocked through appropriate prompting strategies.
ReAct: Reasoning and Acting Together (2022)
On October 6, 2022, researchers from Princeton and Google published "ReAct: Synergizing Reasoning and Acting in Language Models" (arXiv:2210.03629). Shunyu Yao, Jeffrey Zhao, and colleagues addressed a critical limitation: language models hallucinate when answering questions requiring factual knowledge.
ReAct interleaves reasoning traces with task-specific actions:
Thought: I need to find the birth year of the director of the 2022 film "Everything Everywhere All at Once".
Action: Search[director of Everything Everywhere All at Once 2022]
Observation: The directors are Daniel Kwan and Daniel Scheinert.
Thought: I need to find the birth year of Daniel Kwan.
Action: Search[Daniel Kwan birth year]
Observation: Daniel Kwan was born in 1988.
Thought: I have found the answer.
Answer: 1988
By allowing the model to query external sources (Wikipedia, APIs, search engines) and incorporate observations into its reasoning, ReAct reduced hallucination on HotpotQA and Fever benchmarks. In interactive decision-making tasks (ALFWorld and WebShop), ReAct outperformed imitation and reinforcement learning approaches by 34% and 10% absolute success rates using only one or two in-context examples.
ReAct established a pattern that would become fundamental: the separation of reasoning (internal thought) from action (external tool use), with observations feeding back into subsequent reasoning steps.
Tree of Thoughts: Deliberate Problem Solving (2023)
On May 17, 2023, the same Princeton and Google DeepMind team published "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (arXiv:2305.10601). Yao, Yu, Zhao, and colleagues generalized CoT to enable exploration of multiple reasoning paths.
Where CoT follows a single linear chain of reasoning, Tree of Thoughts (ToT) maintains multiple candidate thoughts, evaluates them, and explores promising paths while pruning unpromising ones. This resembles how humans solve complex problems: we consider alternatives, backtrack when stuck, and evaluate intermediate states.
The results were striking. On Game of 24 (a mathematical reasoning game where you combine four numbers to make 24), GPT-4 with CoT solved 4% of tasks. With ToT, success jumped to 74%. The framework also improved performance on creative writing and mini crossword puzzles.
ToT introduced search into the prompting paradigm. Rather than greedy left-to-right generation, the model could now deliberate, evaluate, and backtrack. This moved prompt engineering from simple input formatting toward algorithmic control structures.
The Prompt Report: A Systematic Survey (2024)
On June 6, 2024, researchers from the University of Maryland, Princeton, and other institutions published "The Prompt Report: A Systematic Survey of Prompting Techniques" (arXiv:2406.06608). Led by Sander Schulhoff and a large collaborative team, this became the most comprehensive survey of prompt engineering to date.
The report established a structured vocabulary of 33 terms and a taxonomy of 58 LLM prompting techniques plus 40 techniques for other modalities. It addressed the fragmented terminology that had plagued the field, where the same concept went by different names across papers.
Key contributions included:
- Unified Taxonomy: Organizing techniques by function (contextual, zero-shot, few-shot, thought generation, ensembling, self-criticism, decomposition)
- Best Practices: Guidelines for state-of-the-art LLMs like ChatGPT
- Meta-Analysis: Systematic review of natural language prefix-prompting literature
The Prompt Report marked prompt engineering's transition from ad hoc experimentation to a mature discipline with established principles, measurable outcomes, and systematic evaluation frameworks.
Agentic Patterns and Tool-Use Standardization (2025)
On March 15, 2025, Anthropic released the Model Context Protocol (MCP), establishing a universal standard for connecting AI assistants to external data sources and tools. This represented a watershed moment for agentic prompting—moving from ad-hoc tool integrations to standardized, composable tool-use patterns.
The MCP specification defined three core primitives for agentic prompting:
Tools: Executable functions the model can invoke
Resources: Contextual data the model can reference
Prompts: Pre-defined templates for common workflows
This standardization enabled a new class of prompting patterns:
Tool-Use Prompting with Structured Outputs:
You have access to the following tools:
- calculate: Performs mathematical operations
- search: Queries external knowledge bases
- validate: Checks output against schemas
Analyze this request and select the appropriate tool:
<request>Calculate compound interest on $10,000 at 7% over 5 years</request>
Respond with JSON:
{
"tool": "calculate",
"parameters": {"principal": 10000, "rate": 0.07, "years": 5},
"reasoning": "This requires mathematical computation"
}
By mid-2025, major providers had adopted MCP-compatible interfaces, enabling prompts to seamlessly orchestrate multi-tool workflows. The pattern evolved from simple ReAct-style reasoning to sophisticated agent graphs where prompts defined node behaviors, edge conditions, and state transitions.
Structured Outputs Evolution: JSON Mode and Function Calling (2025-2026)
Throughout 2025, structured output capabilities matured from experimental features to production-grade reliability. OpenAI's JSON mode (introduced late 2024) became the industry baseline, with Claude Sonnet 4.6 (released February 2026) introducing native schema-constrained generation.
The breakthrough was guaranteed schema adherence—ensuring outputs matched specified JSON schemas without post-processing:
{
"schema": {
"type": "object",
"properties": {
"sentiment": {"enum": ["positive", "negative", "neutral"]},
"confidence": {"type": "number", "minimum": 0, "maximum": 1},
"explanation": {"type": "string", "maxLength": 200}
},
"required": ["sentiment", "confidence"]
},
"strict": true
}
This enabled a new prompting paradigm: schema-first prompt engineering. Rather than hoping the model produced parseable output, prompts could now specify exact output contracts. By late 2025, 73% of production LLM applications used structured outputs for downstream processing, up from 31% in early 2024.
The implications for prompt engineering were profound:
- Type Safety: Prompts could guarantee output shapes, eliminating parsing failures
- Validation Feedback: Schema violations could be fed back as correction prompts
- Pipeline Composition: Outputs from one prompt could reliably feed into another
Multimodal Prompting Advances: Vision-Language Integration (2026)
The release of Claude Sonnet 4.6 in February 2026 marked a significant advance in multimodal prompting capabilities. Unlike earlier vision-language models that processed images as afterthoughts, Sonnet 4.6 introduced unified multimodal reasoning—treating text and visual information as integrated context.
This enabled sophisticated multimodal prompting patterns:
Visual Chain-of-Thought:
Analyze this architectural diagram step by step:
[Image: system_architecture.png]
Step 1: Identify the data flow direction
Step 2: Locate potential bottlenecks
Step 3: Suggest optimization strategies
Provide your reasoning for each step, referencing specific visual elements.
Cross-Modal Retrieval:
Given this UI mockup and the user request "make it more accessible",
identify:
1. Color contrast issues (reference specific elements)
2. Missing ARIA labels
3. Keyboard navigation problems
Support each finding with visual coordinates from the image.
By early 2026, multimodal prompting had evolved from simple image description to complex visual reasoning tasks. The key insight was that visual context followed the same principles as text context—placement, specificity, and structure mattered equally. Models could now perform visual chain-of-thought, cross-modal reasoning, and structured extraction from images with the same reliability as text-only tasks.
The field had come full circle: from GPT-2's simple zero-shot text conditioning to Claude Sonnet 4.6's unified multimodal reasoning, prompt engineering remained the discipline of structuring context to elicit desired capabilities—now across text, images, and tool-augmented environments.
What Drives Prompt Effectiveness: Key Factors
Prompt effectiveness depends on multiple interacting factors. Our experiments provide concrete data on the trade-offs between different prompting strategies, while the broader field has established general principles about prompt structure and formatting.
BBH-Inspired Hard Problems: Experimental Findings
We conducted a comprehensive experiment testing three prompting strategies on 24 genuinely hard BBH-inspired problems using Claude Sonnet 4.6. These problems were designed to be challenging—requiring multi-step reasoning, symbolic manipulation, and careful tracking of state:
| Metric | Zero-Shot | Few-Shot | Chain-of-Thought |
|---|---|---|---|
| Accuracy (%) | 87.5% | 91.7% | 91.7% |
| Avg Latency (s) | 6.59s | 7.16s | 8.15s |
| Total Cost (m$) | 119.751m$ | 160.776m$ | 256.209m$ |
| Total Tokens | 14,892 | 20,847 | 33,759 |
Problem Categories Tested (3 problems each):
- Dyck-4: Complete nested bracket sequences with 4 bracket types (<, [, {, ()
- Object Tracking: Track objects through 12-15 swaps in multi-person scenarios
- Web of Lies: Boolean logic puzzles with truth-teller/liar constraints
- MATH Algebra: Complex algebraic equations and systems
- Complex Navigation: 20+ step navigation in 2D/3D space with direction changes
- Temporal Reasoning: Multi-constraint scheduling problems
- Geometric Proof: Area ratios and geometric relationships
- Long-horizon Arithmetic: Multi-step calculations with compound operations

Figure 1: Accuracy comparison across prompting strategies on 24 BBH-inspired hard problems. Few-shot and Chain-of-Thought show modest improvements over Zero-Shot, demonstrating that even hard problems can be tackled by capable models with minimal prompting.
The results reveal a nuanced pattern: Claude Sonnet 4.6 is remarkably capable even with zero-shot prompting on complex reasoning tasks. The accuracy differences are smaller than expected because the model has strong inherent reasoning capabilities. However, the pattern still shows that structured prompting (few-shot and CoT) provides slight advantages.

Figure 2: Accuracy breakdown by problem category. Most categories show high performance across all strategies, with Dyck-4 and Long-horizon Arithmetic showing the most variation.
Key Finding: The Dyck-4 problems (nested bracket completion) proved most challenging, with zero-shot achieving only 67% accuracy while few-shot and CoT achieved 100%. This demonstrates that for certain symbolic manipulation tasks, pattern-based learning from examples significantly improves performance.

Figure 3: Cost comparison across prompting strategies. Chain-of-Thought costs 2.1x more than Zero-Shot due to extensive reasoning generation.

Figure 4: Latency comparison showing Chain-of-Thought adds 24% overhead versus Zero-Shot prompting due to step-by-step reasoning generation.

Figure 5: Token usage breakdown showing Chain-of-Thought consumes 2.3x more completion tokens than Zero-Shot due to explicit reasoning steps.

Figure 6: The prompt engineering trade-off triangle. When accuracy matters most on hard problems, Chain-of-Thought provides marginal improvement at significant cost. For speed and cost efficiency, Zero-Shot is optimal. Few-shot sits in the middle—higher cost than zero-shot with similar accuracy to CoT.
Where Each Technique Excels
Zero-Shot: The Baseline Champion
- Best for: Quick answers, cost-sensitive applications, when model capability is high
- Accuracy: 87.5% on hard BBH problems (21/24 correct)
- Trade-off: Minimal interpretability—no reasoning trace provided
- Key insight: Claude Sonnet 4.6 handles complex reasoning remarkably well without examples
Few-Shot: The Pattern Learner
- Best for: Format establishment, symbolic tasks, when patterns help
- Accuracy: 91.7% on hard BBH problems (22/24 correct)
- Trade-off: Higher cost than zero-shot, similar accuracy to CoT
- Key insight: Particularly effective on Dyck-4 problems where pattern matching helps
Chain-of-Thought: The Reasoning Guarantee
- Best for: Debugging, when interpretability is critical, complex symbolic reasoning
- Accuracy: 91.7% on hard BBH problems (22/24 correct)
- Trade-off: 2.1x cost increase, 24% latency increase
- Key insight: Provides reasoning transparency but at significant cost premium
Context Placement Effects
Research on transformer attention mechanisms reveals that context placement significantly impacts performance. Our experiments demonstrate this quantitatively: the few-shot strategy placed substantial prompt tokens before each question, establishing pattern expectations that the model followed.
When examples appear at the beginning of the prompt (prefix prompting), models establish pattern expectations early. Our chain-of-thought experiments used explicit reasoning examples, contributing to higher latency but ensuring detailed reasoning traces.
The standard practice of placing examples before the target query aligns with how transformers process attention. Early tokens influence the representation of later tokens more than vice versa. This asymmetry means that instruction placement matters: system instructions at the start of a conversation establish behavioral constraints that persist across turns.
Structural Markers and XML Tags
Modern prompt engineering increasingly uses structural markers to delineate different content types. XML-like tags provide clear boundaries that help models parse complex instructions. Our chain-of-thought experiments used explicit structural markers ("Step 1:", "Step 2:") within the reasoning prompts, contributing to higher token usage but improving reasoning clarity.
The data reveals the cost of structure: chain-of-thought prompting with structured step markers cost 256.209 milli-dollars total compared to 119.751 milli-dollars for zero-shot. This 2.1x cost increase bought interpretability. Each step was explicitly labeled and followable, making error detection possible. In production systems, this structural transparency justifies the expense when debugging reasoning failures.
XML-like tags provide similar boundaries:
<instruction>
Analyze the following text for sentiment.
</instruction>
<text>
The product arrived damaged and customer service was unhelpful.
</text>
<output_format>
Sentiment: [positive/negative/neutral]
Confidence: [0-1]
Explanation: [brief reasoning]
</output_format>
Industry best practices from The Prompt Report and Anthropic's documentation support using XML tags for complex prompts with multiple content types. The experimental data confirms that structured formats improve reliability at a measurable cost.
Specificity and Constraint Engineering
Vague prompts produce variable outputs. Specific prompts produce consistent outputs. Our experiments demonstrate this principle concretely: the zero-shot prompts specified exact answer requirements (numerical responses only), producing outputs with high accuracy even on complex tasks.
The data shows constraint trade-offs clearly:
- Zero-shot with tight constraints: 6.59s latency, 119.751 milli-dollars cost, 87.5% accuracy
- Few-shot with pattern-based constraints: 7.16s latency, 160.776 milli-dollars cost, 91.7% accuracy
- Chain-of-thought with explicit reasoning: 8.15s latency, 256.209 milli-dollars cost, 91.7% accuracy
The surprising finding: Zero-shot with minimal constraints achieved strong accuracy (87.5%) at minimal cost—even on complex BBH-inspired problems. This suggests that for Claude Sonnet 4.6, the model's inherent reasoning capabilities are remarkably strong, and complex scaffolding provides only marginal benefits.
Constraint engineering involves adding explicit boundaries:
- Length constraints: "in exactly 50 words"
- Format constraints: "as a JSON object with keys 'summary' and 'keywords'"
- Style constraints: "in the tone of a technical blog post"
- Content constraints: "without mentioning competitors"
Each constraint reduces the model's output space, increasing predictability. Our experiments prove this: zero-shot with minimal constraints achieved 87.5% accuracy at minimal cost. Chain-of-thought with maximal structure achieved 91.7% accuracy but consumed 33,759 total tokens versus 14,892 for zero-shot. The lesson is clear: add constraints when the baseline fails or when interpretability is required.
Claude-Specific Patterns: Pipelines and Evaluator-Optimizer Loops
Anthropic's Claude models have enabled sophisticated multi-step prompting patterns that go beyond single-turn interactions. These patterns treat the LLM as a component in a larger system rather than a standalone oracle.
Prompt Chaining
Prompt chaining decomposes complex tasks into sequential subtasks, with each step's output feeding into the next step's input:
Step 1: Extract entities from the text
Step 2: Classify each entity by type
Step 3: Identify relationships between entities
Step 4: Generate a knowledge graph
Each step uses a specialized prompt optimized for that specific subtask. This improves reliability because simpler prompts on narrower tasks produce more consistent outputs than complex prompts on broad tasks.
Chaining also enables human-in-the-loop validation at intermediate steps. If entity extraction looks wrong, the user can correct it before relationship identification proceeds.
Prompt Routing
Routing uses an initial classification step to determine which specialized prompt should handle a request:
Classifier Prompt: "Is this a billing question, technical support, or sales inquiry?"
If billing -> Route to billing specialist prompt
If technical -> Route to technical support prompt
If sales -> Route to sales prompt
This pattern scales prompt specialization without requiring users to pre-categorize their requests. The classifier prompt can be lightweight (few-shot with 3-5 examples) while specialist prompts can be heavy (detailed instructions, extensive context).
Evaluator-Optimizer Loops
The most sophisticated pattern involves iterative refinement:
Loop:
1. Generate output using current prompt
2. Evaluate output against criteria
3. If criteria met, return output
4. If criteria not met, generate feedback
5. Update prompt with feedback
6. Repeat
For example, generating code:
Generation: "Write a Python function to sort a list"
Evaluation: "Does this handle empty lists? Does it preserve type stability?"
Feedback: "Add error handling for None inputs and non-list types"
Updated Generation: "Write a Python function to sort a list with type checking"
This pattern requires careful prompt engineering for both the generator and the evaluator. The evaluator prompt must be precise about criteria, and the feedback prompt must translate evaluation results into actionable instructions.
Multi-Modal Considerations
While our experiments focused on text, modern prompt engineering increasingly handles multi-modal inputs. Claude and similar models process images alongside text, requiring prompts that reference visual elements:
"Analyze this image. First describe what you see, then identify any text,
finally explain the relationship between visual elements and text."
The principles remain consistent: clear instructions, appropriate context, structured output expectations. But the context now includes non-textual information that must be referenced explicitly.
Conclusion
Prompt engineering has evolved from an art of trial and error into a systematic discipline. The field traces its lineage from GPT-2's zero-shot discoveries through GPT-3's few-shot learning, CoT's reasoning elicitation, ReAct's tool integration, ToT's deliberate search, and finally to the comprehensive taxonomy of The Prompt Report.
Our BBH-inspired experiments reveal nuanced trade-offs between prompting strategies on genuinely hard problems:
-
Zero-shot is remarkably capable: On complex BBH-inspired problems including Dyck-4 completion, 15-step object tracking, Web of Lies boolean puzzles, and 20-step navigation, Claude Sonnet 4.6 achieved 87.5% accuracy with minimal prompting. This demonstrates that modern models have absorbed sophisticated reasoning patterns from their training data.
-
Few-shot provides modest gains: With carefully designed simpler examples for each problem type, few-shot prompting achieved 91.7% accuracy—a 4.2 percentage point improvement over zero-shot. The Dyck-4 category showed the most improvement (67% → 100%), demonstrating that pattern-based learning helps on symbolic manipulation tasks.
-
Chain-of-thought matches few-shot accuracy at higher cost: CoT also achieved 91.7% accuracy but at 2.1x the cost of zero-shot. The value of CoT lies not in accuracy gains for this model, but in interpretability—the explicit reasoning traces enable debugging and verification that opaque outputs cannot provide.
-
The ceiling effect: All strategies performed well because Claude Sonnet 4.6 is a highly capable model. The hard problems we designed (nested bracket completion, multi-constraint scheduling, geometric proofs) were challenging but within the model's capabilities. This suggests that for state-of-the-art models, prompting strategy matters less than for earlier generations.
The future of prompt engineering lies in adaptive strategy selection and capability-aware prompting:
- Start with zero-shot (fast, cheap, surprisingly capable on modern models)
- Escalate to few-shot when symbolic pattern matching helps (Dyck-4, state tracking)
- Use chain-of-thought when interpretability is required, not just for accuracy
- Reserve complex scaffolding (ToT, ReAct) for problems where simpler strategies fail
Prompt engineering is not a replacement for model training. It is a complementary discipline that extracts maximum value from existing capabilities. As models grow more powerful, the importance of effective prompting evolves—from squeezing capability out of limited models, to efficiently directing powerful ones, to providing transparency and debuggability for critical applications.
The question is no longer whether prompt engineering matters, but how to do it systematically, measurably, and with awareness of the capability ceiling of modern models.
Case Study: Building This Evaluation with Neo
This entire evaluation - from research to experimental design to analysis to publication-ready visuals - was produced by Neo, a fully autonomous AI engineering agent. Understanding how this work was created provides a template for how you can build similar evaluations for your own use cases.
What Neo Is
Neo AI Engineer is an autonomous AI engineering agent that operates within VS Code, Cursor, or as a standalone tool. Unlike copilots that suggest code line-by-line, Neo takes high-level goals, plans the implementation, writes the code, runs experiments, analyzes results, and iterates until completion. It can:
- Research and synthesize information from papers, documentation, and best practices
- Design and implement ML experiments with proper controls
- Generate publication-quality visualizations
- Write technical documentation grounded in actual data
- Iterate based on experimental findings (the initial toy problems were rejected after analysis showed they were too easy)
How This Evaluation Was Built
Phase 1: Research and Planning
The project started with a single high-level prompt: research the evolution of prompt engineering and run experiments comparing techniques. Neo:
- Searched for primary sources (arXiv papers, Anthropic documentation, OpenAI cookbooks)
- Verified release dates and paper origins (GPT-2: Feb 2019, GPT-3: May 2020, CoT: Jan 2022, etc.)
- Identified BBH (BIG-Bench Hard) as the appropriate benchmark for hard reasoning tasks
- Created a detailed execution plan with 8 subtasks
Phase 2: Initial Experiment (Failed)
Neo first designed 8 "toy" problems (Game of 24, simple logic puzzles, basic arithmetic). The experiment ran successfully but analysis revealed a critical flaw: Claude Sonnet 4.6 achieved 100% accuracy zero-shot. The problems were too easy.
Phase 3: Redesign with BBH-Inspired Problems
Based on the failure analysis, Neo:
- Researched what actually challenges modern LLMs (Dyck languages, object tracking with 12+ swaps, Web of Lies boolean puzzles)
- Designed 24 genuinely hard problems (3 per category)
- Created matched few-shot examples (same problem type, simpler instances)
- Created CoT examples with explicit step-by-step reasoning
- Re-ran the full experiment (72 API calls)
Phase 4: Analysis and Visualization
Neo analyzed the results and found:
- Zero-shot: 87.5% accuracy (challenging but not impossible)
- Few-shot: 91.7% accuracy (modest improvement)
- CoT: 91.7% accuracy (same accuracy, 2.1x cost)
It then generated 6 publication-quality charts showing the trade-offs and updated the blog post with findings backed by actual data.
Key Capabilities Demonstrated
Autonomous Iteration: When the initial experiment failed to differentiate techniques, Neo didn't just report results - it analyzed why, researched harder problems, and re-ran everything.
Grounded Claims: Every claim in this blog is backed by experimental data from results.json. No fabricated citations, no hypothetical examples.
End-to-End Execution: From research notes to running code to final publication, Neo handled the complete pipeline without human intervention on implementation details.
How to Extend This Work
You can build on this evaluation using Neo in several ways:
1. Test Different Models
Clone the repo and prompt Neo:
Clone https://github.com/gauravvij/prompt-engineering-neo.git and modify
the experiment to test GPT-4, Llama 3, and Gemini on the same BBH problems.
Compare how different model families respond to zero-shot vs few-shot vs CoT
prompting. Generate comparative charts showing which models benefit most from
scaffolding.
2. Add New Problem Categories
Clone https://github.com/gauravvij/prompt-engineering-neo.git and extend
the experiment with new BBH categories: causal reasoning, word sorting,
multistep arithmetic with negative numbers. Design 3 problems per category
with ground truth answers. Run the full experiment and add the results to
the existing analysis.
3. Test Advanced Techniques
Clone https://github.com/gauravvij/prompt-engineering-neo.git and add
ReAct and Tree-of-Thought prompting to the comparison. Implement tool use
for the navigation problems (allowing the model to "move" and observe)
and deliberate search for the Web of Lies puzzles. Measure whether these
complex techniques justify their additional cost on problems where
simpler strategies struggle.
4. Build a Prompt Optimization Pipeline
Clone https://github.com/gauravvij/prompt-engineering-neo.git and create
an automated prompt optimizer that tests variations: different example
orderings, varying context placement, role assignment phrasing. Use the
BBH problems as a benchmark to find the optimal prompt template for each
problem type. Generate a report on which prompt variations matter most.
5. Create a Production Evaluation Suite
Clone https://github.com/gauravvij/prompt-engineering-neo.git and refactor
the experiment into a reusable evaluation framework. Add configuration files
for different model endpoints, support for parallel execution, and a
standardized output format. Document how to add new problem types and
prompting strategies. This becomes a tool for evaluating any model on
reasoning benchmarks.
Getting Started with Neo
To replicate or extend this work:
- Install Neo: Available as a VS Code extension or standalone
- Set up API keys: Configure OpenRouter or your preferred provider
- Clone this repo:
git clone https://github.com/gauravvij/prompt-engineering-neo.git - Prompt Neo: Describe what you want to build or extend
- Review and iterate: Neo will propose plans, execute them, and adapt based on results
The key insight: Neo handles implementation details so you can focus on experimental design and analysis. You define what to test, Neo figures out how to test it.
Why This Matters
This evaluation demonstrates a new way of working with AI: instead of manually writing scripts, running experiments, and generating charts, you describe the goal and let an autonomous agent handle execution. The BBH problems in this repo, the visualization code, the analysis - all were produced through iterative collaboration with Neo.
The future of AI engineering is not writing more code. It is defining better experiments and letting autonomous systems execute them.
References
- Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI.
- Brown, T. B., et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165.
- Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
- Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
- Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601.
- Schulhoff, S., et al. (2024). The Prompt Report: A Systematic Survey of Prompting Techniques. arXiv:2406.06608.
- Suzgun, E., et al. (2022). Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. arXiv:2210.09261.
Experimental data and analysis available in accompanying files: results.json, experiment.py, generate_visual_charts.py
Explore Prompt Engineering with NEO
This evaluation — research, experimental design, 72 API calls, chart generation, and analysis — was produced autonomously by Neo. You can replicate or extend it yourself:
Try NEO in Your IDE
Install the NEO extension to bring AI-powered development directly into your workflow:
- VS Code: NEO in VS Code
- Cursor: Install NEO for Cursor →