LLM Behavior Diff: Detect Meaningful Output Changes Across Model Updates

Model upgrades often look safe in smoke tests and then quietly alter behavior in ways that matter for production users. This project makes that drift visible before rollout.
It compares two model/provider configurations over the same prompt suite and classifies differences by severity instead of leaving reviewers to eyeball raw text.
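As a rough illustration of what classifying differences by severity can mean, here is a minimal sketch. The band names and cutoffs are hypothetical, not values taken from the project, and `difflib` is only a standard-library stand-in for whatever similarity measure the tool applies.

```python
from difflib import SequenceMatcher

# Hypothetical severity bands: (lower bound of similarity, label).
# These cutoffs are illustrative assumptions, not the project's values.
SEVERITY_BANDS = [
    (0.95, "none"),
    (0.85, "minor"),
    (0.60, "moderate"),
    (0.00, "major"),
]


def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a real scorer."""
    return SequenceMatcher(None, a, b).ratio()


def classify(similarity: float) -> str:
    """Map a similarity score to the first band whose lower bound it meets."""
    for lower_bound, label in SEVERITY_BANDS:
        if similarity >= lower_bound:
            return label
    return "major"  # scores below every bound (e.g. negative) are treated as major


def diff_severity(output_a: str, output_b: str) -> str:
    """Classify how far two model outputs have drifted apart."""
    return classify(lexical_similarity(output_a, output_b))
```

Reviewers can then sort a report by label and read the "major" rows first instead of scanning every pair.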
What You Get
A reproducible command-line workflow, HTML reports for review, and an optional judge path when embedding-only similarity is not enough for your domain.
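To make the embedding-versus-judge distinction concrete, the sketch below compares two embedding vectors with cosine similarity and escalates to a judge only when the score is close to the threshold. The default of 0.85 mirrors the `--threshold` value used in the run example; the `margin` parameter, the `judge` callable, and the escalation rule itself are assumptions for illustration, not the project's documented behavior.

```python
import math
from typing import Callable, Optional, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as sharing no direction with anything
    return dot / (norm_u * norm_v)


def outputs_match(
    emb_a: Sequence[float],
    emb_b: Sequence[float],
    threshold: float = 0.85,
    judge: Optional[Callable[[], bool]] = None,  # hypothetical second opinion
    margin: float = 0.05,  # hypothetical width of the "too close to call" band
) -> bool:
    """Return True when two outputs are considered behaviorally equivalent."""
    score = cosine_similarity(emb_a, emb_b)
    # Assumed rule: only borderline scores are worth the cost of a judge call.
    if judge is not None and abs(score - threshold) <= margin:
        return judge()
    return score >= threshold
```

Keeping the judge behind a margin is one way to limit extra model calls to the pairs where an embedding score alone is least trustworthy.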
These are most useful when you are changing providers, context windows, or model versions in automation-heavy products.
Run the Project
git clone https://github.com/dakshjain-1616/-LLM-Behavior-Diff-Model-Update-Detector
cd ./-LLM-Behavior-Diff-Model-Update-Detector
pip install -e .
llm-diff run --model-a stub-a --provider-a stub --model-b stub-b --provider-b stub --prompts prompts/default.yaml --output output/report.html --no-use-embeddings
export OPENROUTER_API_KEY=sk-or-...
llm-diff run --model-a meta-llama/llama-3.2-3b-instruct --provider-a openrouter --model-b google/gemini-2.0-flash-lite-001 --provider-b openrouter --prompts prompts/default.yaml --output output/or_emb.html --use-embeddings --threshold 0.85
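The first run uses the stub providers with embeddings disabled; the second calls two models through OpenRouter with embeddings enabled and --threshold 0.85. The current CLI expects --provider-a/--provider-b and --prompts (not --provider and --suite), and the run examples above use those flags.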
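If you drive the tool from a script or CI job, assembling the argument list in one place helps keep the flag names straight. The helper below is a hypothetical wrapper, not part of the project; it uses only the flags that appear in the run examples.

```python
import shlex
from typing import List, Optional


def build_run_command(
    model_a: str,
    provider_a: str,
    model_b: str,
    provider_b: str,
    prompts: str,
    output: str,
    use_embeddings: bool,
    threshold: Optional[float] = None,
) -> List[str]:
    """Build the argv for `llm-diff run` from the flags shown in the examples."""
    cmd = [
        "llm-diff", "run",
        "--model-a", model_a, "--provider-a", provider_a,
        "--model-b", model_b, "--provider-b", provider_b,
        "--prompts", prompts,
        "--output", output,
    ]
    cmd.append("--use-embeddings" if use_embeddings else "--no-use-embeddings")
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]
    return cmd


cmd = build_run_command(
    "stub-a", "stub", "stub-b", "stub",
    "prompts/default.yaml", "output/report.html",
    use_embeddings=False,
)
print(shlex.join(cmd))  # shell-quoted form of the stub run
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting issues and raises if the command exits with a non-zero status.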
Architecture Walkthrough
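The LLM Behavior Diff repository is organized around a single pipeline. Going by the CLI above, a run takes a prompt suite and two model/provider configurations, collects both sets of outputs, scores each pair (with embeddings, or with the optional judge), and writes an HTML report. You can trace that flow from input handling to final output, which makes onboarding easier for new contributors and speeds up debugging when behavior changes after an update.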
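A minimal sketch of such a pipeline, using in-process stub models so it runs without any API key. The names `DiffRecord`, `run_suite`, and `summarize` are hypothetical and do not describe the repository's actual modules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

ModelFn = Callable[[str], str]  # prompt -> completion


@dataclass
class DiffRecord:
    prompt: str
    output_a: str
    output_b: str
    changed: bool


def run_suite(
    prompts: List[str],
    model_a: ModelFn,
    model_b: ModelFn,
    same: Callable[[str, str], bool],
) -> List[DiffRecord]:
    """Send every prompt to both models and record whether the outputs differ."""
    records = []
    for prompt in prompts:
        out_a, out_b = model_a(prompt), model_b(prompt)
        records.append(DiffRecord(prompt, out_a, out_b, changed=not same(out_a, out_b)))
    return records


def summarize(records: List[DiffRecord]) -> Dict[str, int]:
    """Aggregate counts that a report header could display."""
    changed = sum(1 for r in records if r.changed)
    return {"total": len(records), "changed": changed, "unchanged": len(records) - changed}


# Stub models standing in for real providers.
def stub_a(prompt: str) -> str:
    return prompt.upper()


def stub_b(prompt: str) -> str:
    return prompt.upper() if "hello" in prompt else prompt.lower()


records = run_suite(["hello", "Bye"], stub_a, stub_b, same=lambda a, b: a == b)
print(summarize(records))
```

The comparison is passed in as a function so that exact match, embedding similarity, or a judge can be swapped without touching the loop.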
Practical Use Cases
If you are evaluating LLM Behavior Diff for production, start with a small real-world dataset, run the included commands end to end, and compare output quality, latency, and operational complexity. That gives a stronger practical signal than a toy demo.
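For the latency part of that comparison, one simple approach is to time each call and report the median and the worst case. This is a sketch under the assumption that a model is reachable as a plain Python callable; `measure_latency` is a hypothetical helper, not something the tool ships.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(model: Callable[[str], str], prompts: List[str]) -> Dict[str, float]:
    """Time one call per prompt and return median and max wall-clock seconds."""
    if not prompts:
        raise ValueError("need at least one prompt to measure latency")
    durations = []
    for prompt in prompts:
        start = time.perf_counter()
        model(prompt)  # the response itself is ignored here; only timing matters
        durations.append(time.perf_counter() - start)
    return {"median_s": statistics.median(durations), "max_s": max(durations)}
```

Running it once per configuration over the same prompts gives two comparable summaries; the median is less sensitive to a single slow request than the mean.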
Implementation Notes
The project is useful as both a standalone tool and a reference implementation. You can copy patterns from this codebase into your own stack, especially around evaluation discipline, reproducibility, and operator visibility.
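One reproducibility pattern worth borrowing is to store a small manifest next to each report so a run can be repeated or audited later. The sketch below is an assumption about how that could look, not a description of what the project writes; the field names are made up for illustration.

```python
import hashlib
import json
from typing import Dict, List


def run_manifest(prompts: List[str], config: Dict[str, str]) -> Dict[str, object]:
    """Describe a run by its configuration and a digest of the exact prompt suite."""
    # Hash the JSON encoding so prompt boundaries are part of the digest.
    encoded = json.dumps(prompts, ensure_ascii=False).encode("utf-8")
    return {
        "config": dict(sorted(config.items())),
        "prompt_count": len(prompts),
        "suite_sha256": hashlib.sha256(encoded).hexdigest(),
    }


manifest = run_manifest(
    ["Summarize this ticket.", "Translate to French: hello"],
    {"model_a": "stub-a", "provider_a": "stub", "model_b": "stub-b", "provider_b": "stub"},
)
print(json.dumps(manifest, indent=2, sort_keys=True))
```

If two reports carry the same digest and configuration, any difference between them comes from the models rather than from an edited prompt file.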