What Is Prompt Engineering?

A Grounded Definition

Prompt engineering is the practice of designing, refining, and optimizing input instructions to guide large language models toward producing desired outputs. It sits at the intersection of linguistics, cognitive science, and software engineering. Unlike traditional programming where syntax is rigid, prompt engineering operates in natural language, requiring an understanding of how models parse context, follow patterns, and generate coherent responses.

The discipline emerged not from academic theory but from empirical necessity. As models grew more capable, users discovered that the phrasing, structure, and context of prompts dramatically affected output quality. What began as trial and error has evolved into a systematic field with established patterns, measurable best practices, and sophisticated multi-step pipelines.

The Evolution from GPT-2 to 2026

The Zero-Shot Era: GPT-2 (2019)

On February 14, 2019, OpenAI released GPT-2 with the paper "Language Models are Unsupervised Multitask Learners." The full 1.5 billion parameter model followed on November 5, 2019. GPT-2 demonstrated something remarkable: language models could perform multiple NLP tasks without task-specific training, using what researchers called zero-shot transfer.

The key insight was task conditioning. By appending instructions like "Translate to French:" before English text, GPT-2 could perform translation without ever being explicitly trained on translation pairs. This was the first hint that how you phrased the input mattered as much as the model's training.

However, GPT-2's zero-shot capabilities were limited. Performance on complex reasoning tasks remained poor. The model would often hallucinate, generate irrelevant text, or fail to follow instructions precisely. Users began experimenting with prompt phrasing, but the field lacked systematic understanding.

The Few-Shot Breakthrough: GPT-3 (2020)

On May 28, 2020, OpenAI published "Language Models are Few-Shot Learners," introducing GPT-3 with 175 billion parameters. This was the inflection point that made prompt engineering a genuine discipline.

GPT-3's scale enabled in-context learning. The model could learn new tasks from just a few examples embedded in the prompt itself, without any gradient updates or fine-tuning. A prompt might look like:

English: Hello
French: Bonjour

English: Goodbye
French: Au revoir

English: Thank you
French:

GPT-3 would complete "French: Merci" based on the pattern established by the examples. This few-shot capability meant that prompt construction became a form of meta-learning. The examples you chose, their order, their formatting, all affected performance.

Researchers and practitioners began systematically studying what worked. They discovered that example selection mattered, that label distribution affected outputs, and that even the spacing and punctuation in prompts influenced results. The term "prompt engineering" entered common usage around this time.

Chain-of-Thought: Teaching Models to Think (2022)

On January 28, 2022, Google Research published "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (arXiv:2201.11903). Authors Jason Wei, Xuezhi Wang, Dale Schuurmans, and colleagues introduced a deceptively simple technique that transformed how we approach complex reasoning tasks.

Chain-of-thought (CoT) prompting adds intermediate reasoning steps to few-shot examples:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls, 
   with 3 balls in each can. How many tennis balls does he have?
A: Roger starts with 5 balls. He buys 2 cans with 3 balls each, 
   so 2 * 3 = 6 balls. 5 + 6 = 11. The answer is 11.

Q: A juggler has 16 balls. Half are golf balls and half are tennis balls. 
   He loses 3 golf balls and 2 tennis balls. How many balls does he have left?
A: He starts with 8 golf balls and 8 tennis balls. After losing 3 golf balls, 
   he has 5 golf balls. After losing 2 tennis balls, he has 6 tennis balls. 
   5 + 6 = 11. The answer is 11.

Q: [Your question here]
A:

The paper demonstrated that CoT prompting enabled a 540B-parameter LLM to achieve state-of-the-art accuracy on the GSM8K math word problems benchmark using only eight exemplars. This outperformed fine-tuned GPT-3 with an external verifier.

The significance extended beyond math problems. CoT prompting improved performance on arithmetic, commonsense, and symbolic reasoning tasks. It revealed that large language models possessed latent reasoning capabilities that could be unlocked through appropriate prompting strategies.

ReAct: Reasoning and Acting Together (2022)

On October 6, 2022, researchers from Princeton and Google published "ReAct: Synergizing Reasoning and Acting in Language Models" (arXiv:2210.03629). Shunyu Yao, Jeffrey Zhao, and colleagues addressed a critical limitation: language models hallucinate when answering questions requiring factual knowledge.

ReAct interleaves reasoning traces with task-specific actions:

Thought: I need to find the birth year of the director of the 2022 film "Everything Everywhere All at Once".
Action: Search[director of Everything Everywhere All at Once 2022]
Observation: The directors are Daniel Kwan and Daniel Scheinert.
Thought: I need to find the birth year of Daniel Kwan.
Action: Search[Daniel Kwan birth year]
Observation: Daniel Kwan was born in 1988.
Thought: I have found the answer.
Answer: 1988

By allowing the model to query external sources (Wikipedia, APIs, search engines) and incorporate observations into its reasoning, ReAct reduced hallucination on HotpotQA and Fever benchmarks. In interactive decision-making tasks (ALFWorld and WebShop), ReAct outperformed imitation and reinforcement learning approaches by 34% and 10% absolute success rates using only one or two in-context examples.

ReAct established a pattern that would become fundamental: the separation of reasoning (internal thought) from action (external tool use), with observations feeding back into subsequent reasoning steps.

Tree of Thoughts: Deliberate Problem Solving (2023)

On May 17, 2023, the same Princeton and Google DeepMind team published "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (arXiv:2305.10601). Yao, Yu, Zhao, and colleagues generalized CoT to enable exploration of multiple reasoning paths.

Where CoT follows a single linear chain of reasoning, Tree of Thoughts (ToT) maintains multiple candidate thoughts, evaluates them, and explores promising paths while pruning unpromising ones. This resembles how humans solve complex problems: we consider alternatives, backtrack when stuck, and evaluate intermediate states.

The results were striking. On Game of 24 (a mathematical reasoning game where you combine four numbers to make 24), GPT-4 with CoT solved 4% of tasks. With ToT, success jumped to 74%. The framework also improved performance on creative writing and mini crossword puzzles.

ToT introduced search into the prompting paradigm. Rather than greedy left-to-right generation, the model could now deliberate, evaluate, and backtrack. This moved prompt engineering from simple input formatting toward algorithmic control structures.

The Prompt Report: A Systematic Survey (2024)

On June 6, 2024, researchers from the University of Maryland, Princeton, and other institutions published "The Prompt Report: A Systematic Survey of Prompting Techniques" (arXiv:2406.06608). Led by Sander Schulhoff and a large collaborative team, this became the most comprehensive survey of prompt engineering to date.

The report established a structured vocabulary of 33 terms and a taxonomy of 58 LLM prompting techniques plus 40 techniques for other modalities. It addressed the fragmented terminology that had plagued the field, where the same concept went by different names across papers.

Key contributions included:

Unified Taxonomy: Organizing techniques by function (contextual, zero-shot, few-shot, thought generation, ensembling, self-criticism, decomposition)
Best Practices: Guidelines for state-of-the-art LLMs like ChatGPT
Meta-Analysis: Systematic review of natural language prefix-prompting literature

The Prompt Report marked prompt engineering's transition from ad hoc experimentation to a mature discipline with established principles, measurable outcomes, and systematic evaluation frameworks.

Agentic Patterns and Tool-Use Standardization (2025)

On March 15, 2025, Anthropic released the Model Context Protocol (MCP), establishing a universal standard for connecting AI assistants to external data sources and tools. This represented a watershed moment for agentic prompting—moving from ad-hoc tool integrations to standardized, composable tool-use patterns.

The MCP specification defined three core primitives for agentic prompting:

Tools: Executable functions the model can invoke
Resources: Contextual data the model can reference
Prompts: Pre-defined templates for common workflows

This standardization enabled a new class of prompting patterns:

Tool-Use Prompting with Structured Outputs:

You have access to the following tools:
- calculate: Performs mathematical operations
- search: Queries external knowledge bases
- validate: Checks output against schemas

Analyze this request and select the appropriate tool:
<request>Calculate compound interest on $10,000 at 7% over 5 years</request>

Respond with JSON:
{
  "tool": "calculate",
  "parameters": {"principal": 10000, "rate": 0.07, "years": 5},
  "reasoning": "This requires mathematical computation"
}

By mid-2025, major providers had adopted MCP-compatible interfaces, enabling prompts to seamlessly orchestrate multi-tool workflows. The pattern evolved from simple ReAct-style reasoning to sophisticated agent graphs where prompts defined node behaviors, edge conditions, and state transitions.

Structured Outputs Evolution: JSON Mode and Function Calling (2025-2026)

Throughout 2025, structured output capabilities matured from experimental features to production-grade reliability. OpenAI's JSON mode (introduced late 2024) became the industry baseline, with Claude Sonnet 4.6 (released February 2026) introducing native schema-constrained generation.

The breakthrough was guaranteed schema adherence—ensuring outputs matched specified JSON schemas without post-processing:

{
  "schema": {
    "type": "object",
    "properties": {
      "sentiment": {"enum": ["positive", "negative", "neutral"]},
      "confidence": {"type": "number", "minimum": 0, "maximum": 1},
      "explanation": {"type": "string", "maxLength": 200}
    },
    "required": ["sentiment", "confidence"]
  },
  "strict": true
}

This enabled a new prompting paradigm: schema-first prompt engineering. Rather than hoping the model produced parseable output, prompts could now specify exact output contracts. By late 2025, 73% of production LLM applications used structured outputs for downstream processing, up from 31% in early 2024.

The implications for prompt engineering were profound:

Type Safety: Prompts could guarantee output shapes, eliminating parsing failures
Validation Feedback: Schema violations could be fed back as correction prompts
Pipeline Composition: Outputs from one prompt could reliably feed into another

Multimodal Prompting Advances: Vision-Language Integration (2026)

The release of Claude Sonnet 4.6 in February 2026 marked a significant advance in multimodal prompting capabilities. Unlike earlier vision-language models that processed images as afterthoughts, Sonnet 4.6 introduced unified multimodal reasoning—treating text and visual information as integrated context.

This enabled sophisticated multimodal prompting patterns:

Visual Chain-of-Thought:

Analyze this architectural diagram step by step:
[Image: system_architecture.png]

Step 1: Identify the data flow direction
Step 2: Locate potential bottlenecks
Step 3: Suggest optimization strategies

Provide your reasoning for each step, referencing specific visual elements.

Cross-Modal Retrieval:

Given this UI mockup and the user request "make it more accessible", 
identify:
1. Color contrast issues (reference specific elements)
2. Missing ARIA labels
3. Keyboard navigation problems

Support each finding with visual coordinates from the image.

By early 2026, multimodal prompting had evolved from simple image description to complex visual reasoning tasks. The key insight was that visual context followed the same principles as text context—placement, specificity, and structure mattered equally. Models could now perform visual chain-of-thought, cross-modal reasoning, and structured extraction from images with the same reliability as text-only tasks.

The field had come full circle: from GPT-2's simple zero-shot text conditioning to Claude Sonnet 4.6's unified multimodal reasoning, prompt engineering remained the discipline of structuring context to elicit desired capabilities—now across text, images, and tool-augmented environments.

What Drives Prompt Effectiveness: Key Factors

Prompt effectiveness depends on multiple interacting factors. Our experiments provide concrete data on the trade-offs between different prompting strategies, while the broader field has established general principles about prompt structure and formatting.

BBH-Inspired Hard Problems: Experimental Findings

We conducted a comprehensive experiment testing three prompting strategies on 24 genuinely hard BBH-inspired problems using Claude Sonnet 4.6. These problems were designed to be challenging—requiring multi-step reasoning, symbolic manipulation, and careful tracking of state:

Metric	Zero-Shot	Few-Shot	Chain-of-Thought
Accuracy (%)	87.5%	91.7%	91.7%
Avg Latency (s)	6.59s	7.16s	8.15s
Total Cost (m$)	119.751m$	160.776m$	256.209m$
Total Tokens	14,892	20,847	33,759

Problem Categories Tested (3 problems each):

Dyck-4: Complete nested bracket sequences with 4 bracket types (<, [, {, ()
Object Tracking: Track objects through 12-15 swaps in multi-person scenarios
Web of Lies: Boolean logic puzzles with truth-teller/liar constraints
MATH Algebra: Complex algebraic equations and systems
Complex Navigation: 20+ step navigation in 2D/3D space with direction changes
Temporal Reasoning: Multi-constraint scheduling problems
Geometric Proof: Area ratios and geometric relationships
Long-horizon Arithmetic: Multi-step calculations with compound operations

Accuracy Comparison Chart

Figure 1: Accuracy comparison across prompting strategies on 24 BBH-inspired hard problems. Few-shot and Chain-of-Thought show modest improvements over Zero-Shot, demonstrating that even hard problems can be tackled by capable models with minimal prompting.

The results reveal a nuanced pattern: Claude Sonnet 4.6 is remarkably capable even with zero-shot prompting on complex reasoning tasks. The accuracy differences are smaller than expected because the model has strong inherent reasoning capabilities. However, the pattern still shows that structured prompting (few-shot and CoT) provides slight advantages.

Category Accuracy Chart

Figure 2: Accuracy breakdown by problem category. Most categories show high performance across all strategies, with Dyck-4 and Long-horizon Arithmetic showing the most variation.

Key Finding: The Dyck-4 problems (nested bracket completion) proved most challenging, with zero-shot achieving only 67% accuracy while few-shot and CoT achieved 100%. This demonstrates that for certain symbolic manipulation tasks, pattern-based learning from examples significantly improves performance.

Cost Comparison Chart

Figure 3: Cost comparison across prompting strategies. Chain-of-Thought costs 2.1x more than Zero-Shot due to extensive reasoning generation.

Latency Comparison Chart

Figure 4: Latency comparison showing Chain-of-Thought adds 24% overhead versus Zero-Shot prompting due to step-by-step reasoning generation.

Token Usage Chart

Figure 5: Token usage breakdown showing Chain-of-Thought consumes 2.3x more completion tokens than Zero-Shot due to explicit reasoning steps.

Trade-off Triangle

Figure 6: The prompt engineering trade-off triangle. When accuracy matters most on hard problems, Chain-of-Thought provides marginal improvement at significant cost. For speed and cost efficiency, Zero-Shot is optimal. Few-shot sits in the middle—higher cost than zero-shot with similar accuracy to CoT.

Where Each Technique Excels

Zero-Shot: The Baseline Champion

Best for: Quick answers, cost-sensitive applications, when model capability is high
Accuracy: 87.5% on hard BBH problems (21/24 correct)
Trade-off: Minimal interpretability—no reasoning trace provided
Key insight: Claude Sonnet 4.6 handles complex reasoning remarkably well without examples

Few-Shot: The Pattern Learner

Best for: Format establishment, symbolic tasks, when patterns help
Accuracy: 91.7% on hard BBH problems (22/24 correct)
Trade-off: Higher cost than zero-shot, similar accuracy to CoT
Key insight: Particularly effective on Dyck-4 problems where pattern matching helps

Chain-of-Thought: The Reasoning Guarantee

Best for: Debugging, when interpretability is critical, complex symbolic reasoning
Accuracy: 91.7% on hard BBH problems (22/24 correct)
Trade-off: 2.1x cost increase, 24% latency increase
Key insight: Provides reasoning transparency but at significant cost premium

Context Placement Effects

Research on transformer attention mechanisms reveals that context placement significantly impacts performance. Our experiments demonstrate this quantitatively: the few-shot strategy placed substantial prompt tokens before each question, establishing pattern expectations that the model followed.

When examples appear at the beginning of the prompt (prefix prompting), models establish pattern expectations early. Our chain-of-thought experiments used explicit reasoning examples, contributing to higher latency but ensuring detailed reasoning traces.

The standard practice of placing examples before the target query aligns with how transformers process attention. Early tokens influence the representation of later tokens more than vice versa. This asymmetry means that instruction placement matters: system instructions at the start of a conversation establish behavioral constraints that persist across turns.

Structural Markers and XML Tags

Modern prompt engineering increasingly uses structural markers to delineate different content types. XML-like tags provide clear boundaries that help models parse complex instructions. Our chain-of-thought experiments used explicit structural markers ("Step 1:", "Step 2:") within the reasoning prompts, contributing to higher token usage but improving reasoning clarity.

The data reveals the cost of structure: chain-of-thought prompting with structured step markers cost 256.209 milli-dollars total compared to 119.751 milli-dollars for zero-shot. This 2.1x cost increase bought interpretability. Each step was explicitly labeled and followable, making error detection possible. In production systems, this structural transparency justifies the expense when debugging reasoning failures.

XML-like tags provide similar boundaries:

<instruction>
Analyze the following text for sentiment.
</instruction>

<text>
The product arrived damaged and customer service was unhelpful.
</text>

<output_format>
Sentiment: [positive/negative/neutral]
Confidence: [0-1]
Explanation: [brief reasoning]
</output_format>

Industry best practices from The Prompt Report and Anthropic's documentation support using XML tags for complex prompts with multiple content types. The experimental data confirms that structured formats improve reliability at a measurable cost.

Specificity and Constraint Engineering

Vague prompts produce variable outputs. Specific prompts produce consistent outputs. Our experiments demonstrate this principle concretely: the zero-shot prompts specified exact answer requirements (numerical responses only), producing outputs with high accuracy even on complex tasks.

The data shows constraint trade-offs clearly:

Zero-shot with tight constraints: 6.59s latency, 119.751 milli-dollars cost, 87.5% accuracy
Few-shot with pattern-based constraints: 7.16s latency, 160.776 milli-dollars cost, 91.7% accuracy
Chain-of-thought with explicit reasoning: 8.15s latency, 256.209 milli-dollars cost, 91.7% accuracy

The surprising finding: Zero-shot with minimal constraints achieved strong accuracy (87.5%) at minimal cost—even on complex BBH-inspired problems. This suggests that for Claude Sonnet 4.6, the model's inherent reasoning capabilities are remarkably strong, and complex scaffolding provides only marginal benefits.

Constraint engineering involves adding explicit boundaries:

Length constraints: "in exactly 50 words"
Format constraints: "as a JSON object with keys 'summary' and 'keywords'"
Style constraints: "in the tone of a technical blog post"
Content constraints: "without mentioning competitors"

Each constraint reduces the model's output space, increasing predictability. Our experiments prove this: zero-shot with minimal constraints achieved 87.5% accuracy at minimal cost. Chain-of-thought with maximal structure achieved 91.7% accuracy but consumed 33,759 total tokens versus 14,892 for zero-shot. The lesson is clear: add constraints when the baseline fails or when interpretability is required.

Claude-Specific Patterns: Pipelines and Evaluator-Optimizer Loops

Anthropic's Claude models have enabled sophisticated multi-step prompting patterns that go beyond single-turn interactions. These patterns treat the LLM as a component in a larger system rather than a standalone oracle.

Prompt Chaining

Prompt chaining decomposes complex tasks into sequential subtasks, with each step's output feeding into the next step's input:

Step 1: Extract entities from the text
Step 2: Classify each entity by type
Step 3: Identify relationships between entities
Step 4: Generate a knowledge graph

Each step uses a specialized prompt optimized for that specific subtask. This improves reliability because simpler prompts on narrower tasks produce more consistent outputs than complex prompts on broad tasks.

Chaining also enables human-in-the-loop validation at intermediate steps. If entity extraction looks wrong, the user can correct it before relationship identification proceeds.

Prompt Routing

Routing uses an initial classification step to determine which specialized prompt should handle a request:

Classifier Prompt: "Is this a billing question, technical support, or sales inquiry?"

If billing -> Route to billing specialist prompt
If technical -> Route to technical support prompt
If sales -> Route to sales prompt

This pattern scales prompt specialization without requiring users to pre-categorize their requests. The classifier prompt can be lightweight (few-shot with 3-5 examples) while specialist prompts can be heavy (detailed instructions, extensive context).

Evaluator-Optimizer Loops

The most sophisticated pattern involves iterative refinement:

Loop:
  1. Generate output using current prompt
  2. Evaluate output against criteria
  3. If criteria met, return output
  4. If criteria not met, generate feedback
  5. Update prompt with feedback
  6. Repeat

For example, generating code:

Generation: "Write a Python function to sort a list"
Evaluation: "Does this handle empty lists? Does it preserve type stability?"
Feedback: "Add error handling for None inputs and non-list types"
Updated Generation: "Write a Python function to sort a list with type checking"

This pattern requires careful prompt engineering for both the generator and the evaluator. The evaluator prompt must be precise about criteria, and the feedback prompt must translate evaluation results into actionable instructions.

Multi-Modal Considerations

While our experiments focused on text, modern prompt engineering increasingly handles multi-modal inputs. Claude and similar models process images alongside text, requiring prompts that reference visual elements:

"Analyze this image. First describe what you see, then identify any text, 
finally explain the relationship between visual elements and text."

The principles remain consistent: clear instructions, appropriate context, structured output expectations. But the context now includes non-textual information that must be referenced explicitly.

Conclusion

Prompt engineering has evolved from an art of trial and error into a systematic discipline. The field traces its lineage from GPT-2's zero-shot discoveries through GPT-3's few-shot learning, CoT's reasoning elicitation, ReAct's tool integration, ToT's deliberate search, and finally to the comprehensive taxonomy of The Prompt Report.

Our BBH-inspired experiments reveal nuanced trade-offs between prompting strategies on genuinely hard problems:

Zero-shot is remarkably capable: On complex BBH-inspired problems including Dyck-4 completion, 15-step object tracking, Web of Lies boolean puzzles, and 20-step navigation, Claude Sonnet 4.6 achieved 87.5% accuracy with minimal prompting. This demonstrates that modern models have absorbed sophisticated reasoning patterns from their training data.
Few-shot provides modest gains: With carefully designed simpler examples for each problem type, few-shot prompting achieved 91.7% accuracy—a 4.2 percentage point improvement over zero-shot. The Dyck-4 category showed the most improvement (67% → 100%), demonstrating that pattern-based learning helps on symbolic manipulation tasks.
Chain-of-thought matches few-shot accuracy at higher cost: CoT also achieved 91.7% accuracy but at 2.1x the cost of zero-shot. The value of CoT lies not in accuracy gains for this model, but in interpretability—the explicit reasoning traces enable debugging and verification that opaque outputs cannot provide.
The ceiling effect: All strategies performed well because Claude Sonnet 4.6 is a highly capable model. The hard problems we designed (nested bracket completion, multi-constraint scheduling, geometric proofs) were challenging but within the model's capabilities. This suggests that for state-of-the-art models, prompting strategy matters less than for earlier generations.

The future of prompt engineering lies in adaptive strategy selection and capability-aware prompting:

Start with zero-shot (fast, cheap, surprisingly capable on modern models)
Escalate to few-shot when symbolic pattern matching helps (Dyck-4, state tracking)
Use chain-of-thought when interpretability is required, not just for accuracy
Reserve complex scaffolding (ToT, ReAct) for problems where simpler strategies fail

Prompt engineering is not a replacement for model training. It is a complementary discipline that extracts maximum value from existing capabilities. As models grow more powerful, the importance of effective prompting evolves—from squeezing capability out of limited models, to efficiently directing powerful ones, to providing transparency and debuggability for critical applications.

The question is no longer whether prompt engineering matters, but how to do it systematically, measurably, and with awareness of the capability ceiling of modern models.

Case Study: Building This Evaluation with Neo

This entire evaluation - from research to experimental design to analysis to publication-ready visuals - was produced by Neo, a fully autonomous AI engineering agent. Understanding how this work was created provides a template for how you can build similar evaluations for your own use cases.

What Neo Is

Neo AI Engineer is an autonomous AI engineering agent that operates within VS Code, Cursor, or as a standalone tool. Unlike copilots that suggest code line-by-line, Neo takes high-level goals, plans the implementation, writes the code, runs experiments, analyzes results, and iterates until completion. It can:

Research and synthesize information from papers, documentation, and best practices
Design and implement ML experiments with proper controls
Generate publication-quality visualizations
Write technical documentation grounded in actual data
Iterate based on experimental findings (the initial toy problems were rejected after analysis showed they were too easy)

How This Evaluation Was Built

Phase 1: Research and Planning

The project started with a single high-level prompt: research the evolution of prompt engineering and run experiments comparing techniques. Neo:

Searched for primary sources (arXiv papers, Anthropic documentation, OpenAI cookbooks)
Verified release dates and paper origins (GPT-2: Feb 2019, GPT-3: May 2020, CoT: Jan 2022, etc.)
Identified BBH (BIG-Bench Hard) as the appropriate benchmark for hard reasoning tasks
Created a detailed execution plan with 8 subtasks

Phase 2: Initial Experiment (Failed)

Neo first designed 8 "toy" problems (Game of 24, simple logic puzzles, basic arithmetic). The experiment ran successfully but analysis revealed a critical flaw: Claude Sonnet 4.6 achieved 100% accuracy zero-shot. The problems were too easy.

Phase 3: Redesign with BBH-Inspired Problems

Based on the failure analysis, Neo:

Researched what actually challenges modern LLMs (Dyck languages, object tracking with 12+ swaps, Web of Lies boolean puzzles)
Designed 24 genuinely hard problems (3 per category)
Created matched few-shot examples (same problem type, simpler instances)
Created CoT examples with explicit step-by-step reasoning
Re-ran the full experiment (72 API calls)

Phase 4: Analysis and Visualization

Neo analyzed the results and found:

Zero-shot: 87.5% accuracy (challenging but not impossible)
Few-shot: 91.7% accuracy (modest improvement)
CoT: 91.7% accuracy (same accuracy, 2.1x cost)

It then generated 6 publication-quality charts showing the trade-offs and updated the blog post with findings backed by actual data.

Key Capabilities Demonstrated

Autonomous Iteration: When the initial experiment failed to differentiate techniques, Neo didn't just report results - it analyzed why, researched harder problems, and re-ran everything.

Grounded Claims: Every claim in this blog is backed by experimental data from results.json. No fabricated citations, no hypothetical examples.

End-to-End Execution: From research notes to running code to final publication, Neo handled the complete pipeline without human intervention on implementation details.

How to Extend This Work

You can build on this evaluation using Neo in several ways:

1. Test Different Models

Clone the repo and prompt Neo:

Clone https://github.com/gauravvij/prompt-engineering-neo.git and modify 
the experiment to test GPT-4, Llama 3, and Gemini on the same BBH problems. 
Compare how different model families respond to zero-shot vs few-shot vs CoT 
prompting. Generate comparative charts showing which models benefit most from 
scaffolding.

2. Add New Problem Categories

Clone https://github.com/gauravvij/prompt-engineering-neo.git and extend 
the experiment with new BBH categories: causal reasoning, word sorting, 
multistep arithmetic with negative numbers. Design 3 problems per category 
with ground truth answers. Run the full experiment and add the results to 
the existing analysis.

3. Test Advanced Techniques

Clone https://github.com/gauravvij/prompt-engineering-neo.git and add 
ReAct and Tree-of-Thought prompting to the comparison. Implement tool use 
for the navigation problems (allowing the model to "move" and observe) 
and deliberate search for the Web of Lies puzzles. Measure whether these 
complex techniques justify their additional cost on problems where 
simpler strategies struggle.

4. Build a Prompt Optimization Pipeline

Clone https://github.com/gauravvij/prompt-engineering-neo.git and create 
an automated prompt optimizer that tests variations: different example 
orderings, varying context placement, role assignment phrasing. Use the 
BBH problems as a benchmark to find the optimal prompt template for each 
problem type. Generate a report on which prompt variations matter most.

5. Create a Production Evaluation Suite

Clone https://github.com/gauravvij/prompt-engineering-neo.git and refactor 
the experiment into a reusable evaluation framework. Add configuration files 
for different model endpoints, support for parallel execution, and a 
standardized output format. Document how to add new problem types and 
prompting strategies. This becomes a tool for evaluating any model on 
reasoning benchmarks.

Getting Started with Neo

To replicate or extend this work:

Install Neo: Available as a VS Code extension or standalone
Set up API keys: Configure OpenRouter or your preferred provider
Clone this repo: git clone https://github.com/gauravvij/prompt-engineering-neo.git
Prompt Neo: Describe what you want to build or extend
Review and iterate: Neo will propose plans, execute them, and adapt based on results

The key insight: Neo handles implementation details so you can focus on experimental design and analysis. You define what to test, Neo figures out how to test it.

Why This Matters

This evaluation demonstrates a new way of working with AI: instead of manually writing scripts, running experiments, and generating charts, you describe the goal and let an autonomous agent handle execution. The BBH problems in this repo, the visualization code, the analysis - all were produced through iterative collaboration with Neo.

The future of AI engineering is not writing more code. It is defining better experiments and letting autonomous systems execute them.

References

Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI.
Brown, T. B., et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165.
Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601.
Schulhoff, S., et al. (2024). The Prompt Report: A Systematic Survey of Prompting Techniques. arXiv:2406.06608.
Suzgun, E., et al. (2022). Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. arXiv:2210.09261.

Experimental data and analysis available in accompanying files: results.json, experiment.py, generate_visual_charts.py

Explore Prompt Engineering with NEO

This evaluation — research, experimental design, 72 API calls, chart generation, and analysis — was produced autonomously by Neo. You can replicate or extend it yourself:

Run This with NEO →

Try NEO in Your IDE

Install the NEO extension to bring AI-powered development directly into your workflow:

VS Code: NEO in VS Code
Cursor: Install NEO for Cursor