Building a Self-Healing AI Pipeline: Claude Code vs Claude Code + Neo MCP

The Question

Most AI coding assistants can generate code.

The more interesting question is:

Can they improve how AI systems are engineered?

To test this, we asked both systems to build a production-grade self-healing AI pipeline capable of:

Detecting failures
Diagnosing root causes
Recovering autonomously
Maintaining observability
Supporting production deployment

The Benchmark Task

Both implementations were asked to solve the same compound goal against a deliberately flaky tool environment:

Search for customer data, calculate the average lifetime value from the database, and write the output file.

The mock toolchain includes three tools: WebSearch, DatabaseQuery, and SecureWrite. DatabaseQuery fails roughly half the time with transient JSONDecodeError or RateLimitException responses. That setup forces the agent to survive real-world failure modes, not just happy-path demos.

We compared two builds from the same starting brief:

Implementation	How it was built	Entry point
`claudecode/`	Claude Code only (`ReflectiveAgent`)	`python3 run.py`
`neo-mcp/`	Claude Code + Neo MCP (`SelfHealingAgent`)	`python3 main.py`

Both run offline with zero runtime dependencies on the deterministic path (Python 3.10+). Set ANTHROPIC_API_KEY and either agent transparently switches to a real claude-opus-4-8 planner with no code changes.

What We Expected

We expected both systems to generate:

Retry logic
Error handling
Monitoring code

And they did.

The surprising difference was not in the code itself.

It was in the engineering workflow.

What Claude Code Built

Claude generated a working self-healing agent: a pure-Python ReflectiveAgent with a bounded retry loop, reflective exception handlers, and recovery metrics.

The implementation included:

Planning and execution via reasoner.py (offline deterministic reasoner, or AnthropicReasoner when an API key is present)
Two interchangeable planners (offline + Anthropic), swapped by shared method signature rather than a formal interface ABC
Three mock tools in tools.py with a deliberate exception taxonomy
Reflection on failure: full stack traces appended back into context as [TOOL_ERROR] system messages
Retry-vs-repair classification (RateLimitException / JSONDecodeError → transient; ToolInputError → repair arguments)
Exponential backoff with auditable wait-time metrics
A failure-heavy test harness in run.py that runs two resilience scenarios end to end

The result was functional and solved the task. Both scenarios in the harness end with the output file on disk (average_lifetime_value = 8200.0 over 5 customers) and OVERALL: PASS.

However, most of the system was organized around a single implementation path. The planning logic, recovery logic, and orchestration logic live in one agent class. As the system grows, modifying one area requires understanding multiple parts of the implementation.

How the Claude Code loop works

Each logical step (search → aggregate → write) runs up to max_retries attempts (default 3). On any exception:

Capture the stack trace via traceback.format_exc()
Append it to message history as a reflection signal
Inject a [SELF_CORRECTION] prompt for the reasoner
Classify transient vs deterministic failure and either back off or repair arguments
Retry under the budget, or halt cleanly (gave_up_steps increments) instead of crashing

A representative recovery metrics block from a full run:

"failures": 2, "failures_by_type": {"RateLimitException": 2},
"recoveries": 1, "retries": 2, "backoff_seconds": 0.3

What Changed With Neo MCP

Neo MCP shifted the problem from:

"Build a self-healing agent"

"Build a self-healing platform."

Instead of focusing solely on task completion, the generated architecture emphasized:

Separation of concerns across planner, agent, tools, and exceptions modules
Formal PlannerInterface ABC for planner swapping (both implementations ship two planners; Neo's edge is the explicit contract)
Structured recovery workflows with explicit classify → repair → verify stages
Failure classification via decide_next_action with five named actions: retry, modify, switch, abort, proceed
Recovery observability: both emit a metrics dict; Neo also records recovery_path
Testable interfaces with a pytest suite and multi-seed harness (seeds 1–10, with seed 7 codified to produce status == "partial")

This fundamentally changed the development experience.

Neo MCP architecture

planner_interface.py        ← ABC: plan_goal / reflect / decide_next_action
├── llm_sim.py              ← LLMSimulator (offline, deterministic)
└── anthropic_planner.py    ← AnthropicPlanner (real LLM via API)
agent.py                    ← SelfHealingAgent with retry, repair, metrics
tools.py                    ← WebSearch, DatabaseQuery (flaky), SecureWrite
exceptions.py               ← RateLimitException, ToolExecutionError
main.py                     ← Entry point with both demo scenarios

The self-healing flow is explicit:

Plan: decompose the goal into ordered tool-call steps
Execute: attempt each step; transient failures trigger jittered exponential backoff
Reflect: append a reflection message to context on failure
Classify: decide_next_action picks one of five named recovery actions
Repair: _repair_args fixes bad arguments (e.g. unknown column → known-good column) without backoff
Metrics: record every failure, retry, backoff second, and recovery step in recovery_path

Architecture: from a tightly coupled monolithic agent to a modular, interface-based platform with pluggable planning engines

Workflow Transformation

Where reliability enters the workflow: bolted on after the fact versus built into the architecture from day one

Before

Typical engineering workflow:

Build agent
Add retries
Add logging
Add metrics
Add tests
Refactor architecture
Discover scaling issues
Add observability

Reliability becomes something added later.

With Neo MCP

Typical workflow:

Define agent capabilities
Plug into framework
Get recovery workflows
Get observability
Get failure classification
Get testing primitives
Extend through interfaces

Reliability becomes part of the architecture from day one.

The Most Interesting Difference

Claude treated self-healing as:

Recover when something breaks.

Neo MCP treated self-healing as:

Understand why something broke, classify the failure, choose a recovery strategy, verify the outcome, and record the incident.

That difference sounds subtle.

In production systems it is enormous, but both implementations recover. The gap is in how many named recovery actions exist and how explicitly each step is recorded.

Example: Tool Failure

Claude Approach

Tool fails
→ Classify (transient vs deterministic)
→ Backoff and retry (transient)
→ Reflect and repair args (deterministic)
→ Escalate if budget exhausted

Neo MCP Approach

Tool fails
→ Classify failure (decide_next_action)
→ Determine if transient
→ Repair arguments if possible (_repair_args)
→ Retry with jittered backoff
→ Verify output
→ Record recovery path
→ Escalate if required

Tool failure recovery: Claude Code uses binary classification with implicit argument repair; Neo MCP adds five named decide_next_action paths, dedicated _repair_args, jittered backoff, and a recorded recovery_path

Neo MCP's advantage here is breadth: five named recovery actions, a dedicated repair function, jitter on backoff, and a recorded recovery_path. Claude Code already classifies and repairs (Scenario B proves it), but the logic is embedded in the reasoner loop rather than split into named steps.

What the test run proves

The harness exercises two failure modes that show up constantly in production agents:

Scenario A (flaky environment): DatabaseQuery throws transient rate-limit and JSON decode errors. The Claude Code agent backs off (0.1s → 0.2s) and recovers, then writes ltv_report.json. The Neo MCP agent classifies each failure, applies jittered backoff on transients, and verifies the write succeeded.

Scenario B (hallucination repair): both implementations recover. The planner's first database call uses plausible-but-wrong arguments (operation='average', no customer IDs). Claude Code's reasoner mutates the arguments implicitly from the injected stack trace and recovers on attempt 2. Neo MCP's _repair_args applies explicit column remapping and recovers the same way, writing ltv_report_b.json.

Both implementations pass. The difference is what you inherit when you extend the system: one gives you a working loop to refactor; the other gives you named interfaces, a test suite, and a recorded recovery_path out of the box.

What This Means For Experienced Engineers

Both implementations retry, classify failures, repair arguments, and emit metrics. Neo MCP did not eliminate those problems. It formalized them.

The four differences you can verify in source without argument:

Multi-class decide_next_action (retry / modify / switch / abort / proceed) vs binary transient/deterministic
Backoff with jitter vs plain exponential backoff
Dedicated _repair_args with explicit column remapping vs implicit reasoner mutation
pytest + multi-seed harness vs a manual run.py harness only

Those are the rows worth leaning on in any summary. Everything else is a matter of code organization, not capability presence.

Capability Comparison

Both implementations pass the benchmark. The table below compares how recovery is structured, not whether it exists. Every cell is checkable against the repo.

Capability	Claude Code	Claude Code + Neo MCP
Functional agent	Yes	Yes
Retry handling	Yes	Yes
Failure classification	Binary (transient vs deterministic)	Multi-class (`decide_next_action`: retry / modify / switch / abort / proceed)
Backoff strategy	Exponential	Exponential with jitter
Argument repair	Implicit (reasoner mutates args)	Dedicated `_repair_args` with explicit remapping
Planner swapping	By shared method signature	Formal `PlannerInterface` ABC
Recovery observability	Metrics dict	Metrics dict plus recorded `recovery_path`
Automated tests	None (manual harness)	pytest suite plus multi-seed harness (seed 7 codified partial)

Run it yourself

Clone the repository and run either implementation:

# Implementation A (Claude Code only)
cd claudecode
python3 run.py

# Implementation B (Claude Code + Neo MCP)
cd neo-mcp
python3 main.py

Run the Neo MCP test suite:

cd neo-mcp
pip install -r requirements-dev.txt
python3 -m pytest tests/ -v

Try a different flakiness draw with AGENT_SEED=8 python3 run.py (seeds 8/9 can surface the give-up path). Full execution logs and per-attempt repair traces are captured in each folder's execution_log.txt / execution_log.md.

The Biggest Takeaway

The benchmark did not show that Neo MCP writes dramatically different code. Both agents recover from flaky tools and hallucinated arguments.

What Neo MCP adds are four concrete, source-verifiable edges: multi-class decide_next_action, jittered backoff, a dedicated _repair_args function, and a pytest multi-seed test suite. Those are the claims that survive technical diligence.

Beyond that, Neo MCP formalizes recovery into named interfaces and recorded recovery_path steps. Claude Code does the same work ad hoc inside a single agent loop. For teams extending the system, that formalization is the practical difference, not a missing capability on either side.

Try NEO in Your IDE

Install the NEO extension to bring AI-powered development directly into your workflow:

VS Code: NEO in VS Code
Cursor: Install NEO for Cursor