All articles

From Dataset Generation to Dataset Engineering: What Changed When We Added NEO MCP

From Dataset Generation to Dataset Engineering: What Changed When We Added NEO MCP
LLM Evaluation & Benchmarking·HeyNEO Team·June 17, 2026·14 minGitHub

From Dataset Generation to Dataset Engineering: What Changed When We Added Neo MCP

View on GitHub

The Problem

Synthetic agent failure datasets are easy to start and hard to trust.

A thousand JSONL rows with the right field names can look complete on first inspection. The real risk shows up later: duplicate fingerprints that slip through post-hoc dedup, audit reports that cannot gate a CI job, and pipelines that produce different files on every rerun. Teams building evaluation harnesses do not just need records. They need dataset engineering: reproducibility, validation, auditability, and traceability from spec to shipped artifact.

That is the gap this benchmark targeted.

Teams do not just need records. They need dataset engineering: reproducibility, validation, auditability, and traceability from spec to shipped artifact.

The Benchmark Task

We asked two automated workflows to produce the same contractual dataset:

  • 1,000 records of synthetic AI agent failures
  • 10 failure categories with balanced severity and difficulty
  • 15-field JSONL schema with full audit coverage
  • Evidence artifacts a reviewer can replay without manual inspection

Claude Code orchestrated both runs. The difference was whether Neo MCP sat in the loop as a deterministic engineering layer on top of generation.

Benchmark overview: generation quality versus engineering quality across 1,000 records

Starting Point

The first workflow used Claude Code as a multi-subagent orchestrator: publish a spec, fan out parallel authors, merge batches, and run validation after the fact. It delivered a rich dataset and met the contract. What it did not deliver was a replayable pipeline you could pin in CI, rerun on demand, and gate on a machine-readable audit score.

What Neo MCP Changed

Neo MCP reframed the job from "generate 1,000 rows" to "operate a three-script dataset pipeline with a published plan, deterministic seeds, and a final audit gate."

Neo MCP shifts the question from "did we hit 1,000 rows?" to "can we rerun, re-audit, and gate this in CI on every release?"

Agent failure dataset pipeline: Claude dispatches, Neo MCP runs generate, audit, and report scripts

Deterministic reproducibility

Generation seeds at 42. Parameterized template filling reproduces the same dataset and metrics byte-for-byte on rerun. Methodology documents the seed contract explicitly so reviewers can verify claims without trusting a one-off export.

Generation-time duplicate guard

Near-duplicate rejection happens during generate_dataset.py, not in a cleanup pass after merge. That shifts dedup from a forensic exercise into a pipeline invariant.

CI-gatable audit infrastructure

audit_pipeline.py emits final_audit.json with nine scored components. The composite lands at 98.4 / 100, high enough to treat as a release gate rather than a narrative footnote.

Three-script modular pipeline

ScriptRole
generate_dataset.pyBuild 1,000 records with seed-42 guards
audit_pipeline.pyScore schema, balance, and quality components
gen_reports.pyPublish methodology and dataset summary artifacts

Planning lives in versioned docs. Execution is script-bound. Reviewers read JSON, not chat logs.

Results

Both workflows met the published contract. The engineering distinction shows up in how you operate the dataset afterward:

DimensionNeo MCP outcome
ReproducibilitySeed-42 replay documented in methodology
Audit gatefinal_audit.json, 98.4 / 100 composite
DedupGeneration-time guard in generate_dataset.py
OperabilityThree-script pipeline suitable for CI wiring
TraceabilityPlan + audit + summary artifacts shipped together

The headline is not "more prose per record." It is a dataset you can version, re-audit, and ship with evidence.

Neo MCP is built for the second phase: turning a successful generation sprint into a maintainable dataset operation.

Lessons for Dataset Engineers

Generation quality and engineering quality are different optimization targets. Volume and variety matter for exploration. Reproducibility, audit gates, and modular pipelines matter when a dataset becomes infrastructure.

If your team is moving from one-off synthetic exports to governed evaluation data, invest in:

  1. Deterministic seeds and documented rerun contracts
  2. Machine-readable audits that CI can fail on
  3. Duplicate guards at generation time, not merge time
  4. Script-bound pipelines with explicit planning artifacts

Neo MCP is built for that second phase: turning a successful generation sprint into a maintainable dataset operation.

Practical Implications

  • Platform teams can wire final_audit.json into release checks the same way they wire unit tests.
  • Eval owners get a rerun story when stakeholders ask "prove this still holds."
  • Data engineers inherit a three-script surface area instead of a one-off merge notebook.

Conclusion

Adding Neo MCP did not change the contract. It changed the class of system producing the dataset: from orchestrated generation to an engineered, auditable pipeline. For teams scaling synthetic eval data, that is the difference between a demo export and a dataset you can run in production.

Neo MCP is how you graduate from "we generated a dataset once" to "we operate a dataset program."

Try NEO in Your IDE

Install the NEO extension to bring AI-powered development directly into your workflow:

Want to try what NEO built?

Try Neo AI Engineer →
← Back to Blog