CoT Surgeon: Surgical Editing for LLM Reasoning Chains

The Problem
LLMs produce chain-of-thought reasoning as unstructured prose. When one step is wrong, you re-run the entire prompt and hope the model self-corrects. There is no way to pinpoint the bad step, fix it, and propagate the correction without touching the rest.
NEO built CoT Surgeon to parse reasoning chains into a typed graph, expose individual nodes for editing, and recalculate only the affected downstream path.
The ReasoningGraph Structure
CoT Surgeon converts LLM output into a ReasoningGraph, a directed acyclic graph where every node has a type and a confidence score.
Three node types cover the full reasoning structure:
- FACT: grounded, verifiable premises. Example: "The atmosphere contains N2, O2, and suspended particles."
- REASONING: inferential steps derived from facts. Example: "Rayleigh scattering causes shorter wavelengths to scatter more."
- CONCLUSION: the final answer derived from the reasoning chain.
Each node carries a confidence float between 0.0 and 1.0, estimated by the LLM during generation. Nodes below the CONFIDENCE_THRESHOLD (default 0.7) are flagged as low-confidence and highlighted in both the Streamlit UI and Mermaid exports.
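To make the node model concrete, here is a minimal sketch of a typed node and a low-confidence filter. It is not CoT Surgeon's actual code: the NodeType, Node, parents, and low_confidence names are assumptions made for illustration, as are the confidence values and the conclusion text. Only the three type names and the 0.7 default come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    FACT = "fact"
    REASONING = "reasoning"
    CONCLUSION = "conclusion"


@dataclass
class Node:
    # Hypothetical field layout; the real library may store nodes differently.
    id: str
    type: NodeType
    content: str
    confidence: float  # expected to lie in [0.0, 1.0]
    parents: list[str] = field(default_factory=list)  # ids this node depends on


CONFIDENCE_THRESHOLD = 0.7  # default value named in the text


def low_confidence(nodes, threshold=CONFIDENCE_THRESHOLD):
    """Return the nodes whose confidence is strictly below the threshold."""
    return [n for n in nodes if n.confidence < threshold]


# Illustrative graph; the confidence values are made up for the example.
nodes = [
    Node("node_1", NodeType.FACT,
         "The atmosphere contains N2, O2, and suspended particles.", 0.95),
    Node("node_2", NodeType.REASONING,
         "Rayleigh scattering causes shorter wavelengths to scatter more.", 0.62,
         ["node_1"]),
    Node("node_3", NodeType.CONCLUSION, "The sky appears blue.", 0.90, ["node_2"]),
]

flagged = [n.id for n in low_confidence(nodes)]
print(flagged)  # ['node_2']
```

Storing parent ids on each node is one simple way to keep the graph acyclic and easy to traverse; an adjacency map would work equally well.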
The Edit and Recalculate Workflow
The core operation is surgical: fix one node, regenerate only the path that depends on it.
from cot_surgeon import ReasoningEngine

engine = ReasoningEngine(mode="mock")
graph = engine.generate_cot("Why is the sky blue?")

# Find low-confidence nodes
weak = graph.low_confidence_nodes(threshold=0.75)
for node in weak:
    print(f"{node.id} conf={node.confidence:.2f} {node.content}")

# Fix the flawed step
graph.update_node("node_3", "Rayleigh scattering causes shorter wavelengths to scatter more strongly.")

# Recalculate everything downstream; upstream nodes are untouched
graph = engine.recalculate_from_node(graph, "node_3")
Nodes that do not depend on node_3 are never touched. The graph's version counter increments on every edit. Every mutation pushes a snapshot onto an internal history stack with a default depth of 20, so graph.undo() steps back through changes.
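The two mechanisms described here, finding the affected downstream path and keeping a bounded undo history, can be sketched independently of the library. The code below is an illustration under assumptions, not CoT Surgeon's implementation: downstream_of, History, and the edge map are hypothetical names, and the graph is represented as a plain dict from node id to child ids.

```python
import copy
from collections import deque


def downstream_of(edges, start):
    """Return the ids of every node reachable from `start`, excluding `start`.

    `edges` maps a node id to the ids of nodes that depend on it.
    """
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


class History:
    """Bounded snapshot stack; the oldest snapshot is dropped when full."""

    def __init__(self, depth=20):
        self._stack = deque(maxlen=depth)

    def push(self, state):
        # Deep-copy so later mutations do not alter the stored snapshot.
        self._stack.append(copy.deepcopy(state))

    def undo(self):
        return self._stack.pop() if self._stack else None


# Illustrative dependency map: node_1 and node_2 sit upstream of node_3.
edges = {
    "node_1": ["node_3"],
    "node_2": ["node_4"],
    "node_3": ["node_4", "node_5"],
    "node_4": ["node_5"],
}
affected = downstream_of(edges, "node_3")
print(sorted(affected))  # ['node_4', 'node_5']

state = {"version": 1, "node_3": "old text"}
history = History(depth=20)
history.push(state)           # snapshot before mutating
state["node_3"] = "new text"
state["version"] += 1
restored = history.undo()
print(restored)               # {'version': 1, 'node_3': 'old text'}
```

Only the ids in affected would need regeneration after an edit to node_3; node_1 and node_2 never enter the set. Using deque(maxlen=depth) gives the bounded history for free, at the cost of silently discarding the oldest snapshot.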
Confidence Scoring and Graph Stats
After generation or recalculation, graph.stats() returns a summary:
stats = graph.stats()
# {
#     "node_count": 5,
#     "avg_confidence": 0.89,
#     "low_confidence_count": 1,
#     "edit_count": 1,
#     "version": 2
# }
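A summary with these keys could be computed roughly as follows. This is a sketch under assumptions: summarize is a hypothetical helper, the confidence values are invented, and whether the real edit_count counts edits or edited nodes is not stated in the source, so it is simply passed in.

```python
from statistics import mean


def summarize(confidences, edit_count, version, threshold=0.7):
    """Build a stats dict from a {node_id: confidence} mapping."""
    values = list(confidences.values())
    return {
        "node_count": len(values),
        # Guard against an empty graph, where mean() would raise.
        "avg_confidence": round(mean(values), 2) if values else 0.0,
        "low_confidence_count": sum(1 for c in values if c < threshold),
        "edit_count": edit_count,
        "version": version,
    }


# Illustrative values only.
confidences = {
    "node_1": 0.95,
    "node_2": 0.92,
    "node_3": 0.65,
    "node_4": 0.90,
    "node_5": 0.93,
}
stats = summarize(confidences, edit_count=1, version=2)
print(stats)
```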
Low-confidence nodes receive a distinct color in Mermaid exports. Edited nodes are rendered in purple so you can see exactly which parts of a graph were modified.
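One way to produce such an export is to emit Mermaid flowchart text with a classDef per style. The sketch below is an assumption about how this could look, not the library's exporter: to_mermaid, the tuple layout, the class names, and the specific colors are all hypothetical, and letting the edited style take precedence over the low-confidence style is a design choice made here for simplicity.

```python
def to_mermaid(nodes, edges, threshold=0.7):
    """Render a graph as Mermaid flowchart text.

    nodes: {node_id: (label, confidence, edited)}
    edges: list of (source_id, target_id) pairs
    """
    lines = ["flowchart TD"]
    for node_id, (label, confidence, _edited) in nodes.items():
        safe = label.replace('"', "'")  # keep the quoted label well-formed
        lines.append(f'    {node_id}["{safe} ({confidence:.2f})"]')
    for src, dst in edges:
        lines.append(f"    {src} --> {dst}")
    # Style definitions: colors are illustrative choices.
    lines.append("    classDef low fill:#ffe0b2,stroke:#e65100")
    lines.append("    classDef edited fill:#e1bee7,stroke:#6a1b9a")
    for node_id, (_label, confidence, edited) in nodes.items():
        if edited:
            lines.append(f"    class {node_id} edited")
        elif confidence < threshold:
            lines.append(f"    class {node_id} low")
    return "\n".join(lines)


nodes = {
    "node_1": ("Atmosphere contains N2 and O2", 0.95, False),
    "node_2": ("Shorter wavelengths scatter more", 0.62, False),
    "node_3": ("The sky appears blue", 0.90, True),
}
edges = [("node_1", "node_2"), ("node_2", "node_3")]
diagram = to_mermaid(nodes, edges)
print(diagram)
```

The resulting text can be pasted into any Mermaid renderer to check the coloring.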
LLM Backend Priority
Three backends are supported in auto mode:
- OpenRouter: used when OPENROUTER_API_KEY is set
- Local llama.cpp: used when LLAMA_MODEL_PATH is set
- Mock: always available, uses built-in templates, no API key needed
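The priority order above amounts to a short chain of environment checks. The following is a sketch of that logic, assuming a hypothetical detect_backend helper; the environment variable names and mode strings come from this article, but the real auto-detection code may differ.

```python
import os


def detect_backend(env=None):
    """Pick a backend name by checking environment variables in priority order."""
    env = os.environ if env is None else env
    if env.get("OPENROUTER_API_KEY"):
        return "openrouter"
    if env.get("LLAMA_MODEL_PATH"):
        return "local"
    return "mock"  # always available, needs no configuration


# Passing a plain dict makes the function easy to exercise without
# touching the real environment.
print(detect_backend({}))                                    # mock
print(detect_backend({"LLAMA_MODEL_PATH": "/models/m.gguf"}))  # local
print(detect_backend({"OPENROUTER_API_KEY": "sk-example",
                      "LLAMA_MODEL_PATH": "/models/m.gguf"}))  # openrouter
```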
Pass mode explicitly to bypass auto-detection:
engine = ReasoningEngine(mode="openrouter") # cloud
engine = ReasoningEngine(mode="local") # llama.cpp GGUF
engine = ReasoningEngine(mode="mock") # no key needed
How to Build This with NEO
Open NEO in VS Code or Cursor and describe what you want to build. A good starting prompt for this project:
"Build a Python library called CoT Surgeon that parses LLM chain-of-thought output into a typed directed acyclic graph. Each node has a type (FACT, REASONING, or CONCLUSION), a confidence score between 0 and 1, and an ID. When a node is edited, recalculate only the downstream subgraph and leave upstream nodes untouched. Track a version counter on every edit, maintain an undo history stack of 20 snapshots, and flag nodes below a configurable confidence threshold. Support three LLM backends in priority order: OpenRouter, local llama.cpp GGUF, and a mock mode with built-in templates. Export graphs as Mermaid diagrams with low-confidence nodes colored distinctly and edited nodes in purple. Provide a Streamlit UI with Single Analysis and Batch Compare tabs."
NEO generates the project structure and core implementation from that prompt. From there you iterate: ask it to implement the selective recalculation logic that traverses only the affected downstream subgraph, add the undo/redo history stack with configurable depth, or build out the Batch Compare tab that runs multiple prompts in parallel and displays graphs side by side. Each request builds on what's already there, so you don't have to re-explain the context.
To run the finished project:
git clone https://github.com/dakshjain-1616/cot-surgeon
cd cot-surgeon
pip install -r requirements.txt
streamlit run app.py
The Single Analysis tab lets you generate a reasoning graph, inspect and edit individual nodes, trigger downstream recalculation, and export to Mermaid. The Batch Compare tab is useful for regression testing prompt changes across multiple reasoning chains.
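For the batch case, running several prompts concurrently can be sketched with the standard library's thread pool. This is an illustration only: batch_generate and fake_generate are hypothetical names, and the stand-in generator returns a toy summary rather than a real reasoning graph. How the actual Batch Compare tab schedules its work is not described in the source.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_generate(prompt):
    """Stand-in for a real generation call; returns a tiny summary dict."""
    return {"prompt": prompt, "word_count": len(prompt.split())}


def batch_generate(prompts, generate=fake_generate, max_workers=4):
    """Run `generate` over all prompts concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the prompts were given.
        return list(pool.map(generate, prompts))


prompts = ["Why is the sky blue?", "Why is the ocean salty?"]
results = batch_generate(prompts)
for result in results:
    print(result["prompt"], result["word_count"])
```

Threads suit this workload when each call mostly waits on a network or model response; because results come back in input order, they line up directly with the prompts for side-by-side comparison.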
NEO built a structured reasoning editor that treats LLM chain-of-thought as inspectable, editable, and version-controlled data. See what else NEO ships at heyneo.com.
Try NEO in Your IDE
Install the NEO extension to bring AI-powered development directly into your workflow:
- VS Code: NEO in VS Code
- Cursor: Install NEO for Cursor