Local Model Behavior Prober: Systematic Behavioral Testing for Local LLMs

LLM Evaluation & Benchmarking·HeyNEO Team·May 5, 2026·8 min


Pipeline Architecture

The Problem

You download a new model, run a few prompts, and it feels good. You swap the quantization level, run the same prompts, and it still feels good. But "feels good" is not a regression test. When you need to pick between four local model variants for a production use case (different sizes, different quantization levels, different fine-tunes), you need a repeatable, structured behavioral comparison, not a vibe check.

NEO built Local Model Behavior Prober to give local model evaluation the same rigor you'd apply to software testing: structured probe suites, captured baselines, and diff-style regression reports you can run entirely on-device.

Structured Probe Suites

The prober runs models through categorized probe sets that test specific behavioral dimensions:

  • Instruction following: does the model do what it's told, in the format requested?
  • Factual accuracy: does it get basic facts right on your domain?
  • Refusal calibration: does it refuse the right things and not over-refuse?
  • Format compliance: does the format hold for JSON output, markdown, and structured lists?
  • Edge case handling: empty inputs, ambiguous requests, very long contexts.

Each probe is a YAML-defined prompt paired with expected properties rather than exact outputs. The scorer checks those properties: does the output contain valid JSON? Does it mention the requested format? Did the model follow the instruction?
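
To make property-based scoring concrete, here is a minimal sketch in plain Python. It is not the project's actual scorer. The probe is written as a dict, roughly as it might look after loading from YAML, and the field names (`id`, `category`, `expect`) and property names (`valid_json`, `contains`, `max_words`) are assumptions chosen for illustration.

```python
import json

# Hypothetical property checkers: each takes the model's raw output plus an
# optional argument from the probe definition and returns True or False.
def is_valid_json(output, _arg=None):
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def contains(output, needle):
    return needle.lower() in output.lower()

def max_words(output, limit):
    return len(output.split()) <= limit

CHECKERS = {"valid_json": is_valid_json, "contains": contains, "max_words": max_words}

# A probe as it might look after loading from YAML (field names are assumptions).
probe = {
    "id": "format-json-01",
    "category": "format_compliance",
    "prompt": "Return the capital of France as JSON with a single key 'capital'.",
    "expect": [
        {"valid_json": None},
        {"contains": "paris"},
        {"max_words": 20},
    ],
}

def score(probe, output):
    """Return {property_name: passed} for every expected property of the probe."""
    results = {}
    for prop in probe["expect"]:
        for name, arg in prop.items():
            results[name] = CHECKERS[name](output, arg)
    return results

print(score(probe, '{"capital": "Paris"}'))
```

Because the probe asserts properties rather than an exact string, a response with different whitespace or key order can still pass, while a prose answer such as "The capital is Paris." fails only the `valid_json` check.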

Baseline Capture and Regression Diff

The first run establishes a behavioral baseline for a given model. Subsequent runs produce a diff against it: which probes regressed, which improved, and which are stable. This is the workflow for:

  • Comparing a 4-bit quantization against the full-precision baseline
  • Checking whether a fine-tune improved the target domain without breaking general capability
  • Validating that a model update from the same family preserved expected behaviors

prober baseline --model llama3.2:3b --suite default
prober run --model llama3.2:3b-q4 --suite default --compare baseline
prober diff baseline.json q4-run.json
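
The regressed / improved / stable classification can be sketched in a few lines. This is an illustration of the idea, not the tool's implementation: it assumes each run can be reduced to a mapping from probe id to a pass/fail boolean, and the probe ids and the extra `added` / `removed` buckets are hypothetical.

```python
def diff_runs(baseline, current):
    """Classify each probe id by comparing pass/fail between two runs."""
    report = {"regressed": [], "improved": [], "stable": [], "added": [], "removed": []}
    for probe_id in sorted(set(baseline) | set(current)):
        if probe_id not in current:
            report["removed"].append(probe_id)      # probe only in the baseline
        elif probe_id not in baseline:
            report["added"].append(probe_id)        # probe only in the new run
        elif baseline[probe_id] and not current[probe_id]:
            report["regressed"].append(probe_id)    # passed before, fails now
        elif not baseline[probe_id] and current[probe_id]:
            report["improved"].append(probe_id)     # failed before, passes now
        else:
            report["stable"].append(probe_id)       # same outcome in both runs
    return report

# Hypothetical results for a full-precision baseline and a 4-bit run.
baseline = {"json-01": True, "refusal-02": True, "facts-03": False}
q4_run = {"json-01": False, "refusal-02": True, "facts-03": True}

report = diff_runs(baseline, q4_run)
for status in ("regressed", "improved", "stable"):
    print(f"{status}: {', '.join(report[status]) or '-'}")
```

Tracking added and removed probes separately keeps a change to the suite itself from being mistaken for a change in model behavior.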

Python Library Integration

The prober is a pip-installable Python package, usable in test suites and CI pipelines:

from local_model_prober import Prober, Suite

prober = Prober(model="llama3.2:3b", backend="ollama")
suite = Suite.load("default")
results = prober.run(suite)
print(results.summary())

This makes behavioral testing a first-class step in your model evaluation pipeline, not a manual step you do when something breaks.
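
In CI, the summary usually has to become a pass/fail signal. The sketch below shows one way to do that with per-category pass-rate thresholds. The structure of the real `results` object is not documented here, so the example works on plain `(category, passed)` pairs; the function names, category keys, and threshold values are all assumptions.

```python
from collections import defaultdict

def category_pass_rates(results):
    """results: iterable of (category, passed) pairs -> {category: pass rate}."""
    totals, passes = defaultdict(int), defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}

def gate(results, thresholds):
    """Return a process exit code: 1 if any category is below its threshold, else 0.

    Categories without an explicit threshold must pass every probe (1.0).
    """
    rates = category_pass_rates(results)
    failing = {c: r for c, r in rates.items() if r < thresholds.get(c, 1.0)}
    for category, rate in sorted(failing.items()):
        required = thresholds.get(category, 1.0)
        print(f"FAIL {category}: pass rate {rate:.2f} is below {required:.2f}")
    return 1 if failing else 0

# Hypothetical per-probe outcomes and thresholds.
results = [
    ("format_compliance", True),
    ("format_compliance", True),
    ("refusal_calibration", True),
    ("refusal_calibration", False),
]
thresholds = {"format_compliance": 1.0, "refusal_calibration": 0.75}

exit_code = gate(results, thresholds)
print("exit code:", exit_code)
```

A CI job would pass the returned value to `sys.exit` so that a drop below any threshold fails the build.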

How to Build This with NEO

Open NEO in VS Code or Cursor and describe what you want to build. A good starting prompt for this project:

"Build a Python package for probing local LLM behavior. Define probe suites in YAML with categorized prompts for instruction following, factual accuracy, refusal calibration, format compliance, and edge case handling. Score responses against behavioral properties (not exact outputs). Capture baselines per model and produce diff-style regression reports comparing two runs. Support Ollama as the backend. Package as pip-installable with both a CLI and a Python library interface. Support custom probe suites via YAML. Run entirely offline with no external API calls."

NEO scaffolds the YAML probe format, the property-based scorer, the baseline capture and diff logic, the Ollama backend integration, and the pip package structure. From there you iterate: add a new probe category for your specific use case, wire the prober into GitHub Actions so every model PR gets a behavioral regression check, or extend the scorer to use an LLM judge for harder-to-specify properties.

To run the finished project:

git clone https://github.com/dakshjain-1616/Local-Model-Behavior-Prober
cd Local-Model-Behavior-Prober
pip install -e .

prober baseline --model llama3.2:3b --suite default
prober run --model llama3.2:3b-q4 --suite default --compare baseline
prober diff baseline.json latest-run.json

NEO built a structured behavioral testing framework for local LLMs with YAML probe suites, property-based scoring, baseline capture, and regression diffs, all running on-device without API calls. See what else NEO ships at heyneo.com.
