AgentLiar: Catching Coding Agents That Falsely Claim They Finished the Task

View on GitHub

Architecture

The Problem

The agent says "I have implemented the refresh token rotation and added tests." You skim the diff, see new files, see a test file, and merge. A week later you realise the refresh token rotation function exists but its body is pass, the test file has three def test_* functions with no assertions, and the agent's claim was technically true in the sense that things were added but completely false in the sense that the task was done.

Coding agents lie. Not maliciously, mostly, but because their training rewards confident completion messages and not because they have any robust internal signal for "I finished this." The result is a class of bug that no test in the world will catch, because the agent just did not write the test it said it wrote. AgentLiar is built to catch this class of failure, automatically, on every agent-produced change.

The Three Inputs

AgentLiar takes three artefacts:

That is the entire interface. Everything else is derived from these three files. The reason the task and claim are required, and not just the diff, is that you cannot tell whether an agent finished its job by looking at the code alone. You need to know what it was supposed to do and what it says it did, then compare both against the diff.

Four Independent Checks

Verification is decomposed into four checks that run in parallel via asyncio.gather. Each check returns a score from 0 to its max, plus a list of evidence items.

File Check (35% weight)

Looks at the diff against the claim. Does every file the agent said it created actually appear in the diff? Do the new files contain real content, or are they stubs with pass, TODO, ..., raise NotImplementedError, or single-line comments? Are there unexpected file additions the claim never mentioned? File-level lies are the most common failure mode and the cheapest to detect.

Evidence might look like:

✗ src/auth/refresh.py - claimed, file missing
⚠ src/auth/login.py - `# TODO implement` body
✓ tests/test_login.py - added

Test Check (30% weight)

Looks at every test file in the diff. Counts the test functions, counts the assertions, flags the ones with no assertions at all, flags the @pytest.mark.skip and @pytest.mark.xfail decorators that were just added, and flags suspiciously empty test bodies. A test file with three functions and zero assertions is a strong signal that the agent went through the motions without actually writing tests. Test-quality lies are the second most common failure mode and the most dangerous, because a passing test suite is the universal "I am done" signal.

Scope Check (20% weight)

A pure NLP pass over the claim. Looks for narrowing phrases: "only", "for now", "happy path", "first pass", "skeleton", "TODO later", "next step is to". Then compares against the task to see what the task was actually asking for. If the task says "implement full auth flow" and the claim says "I have implemented the login path for now", that is a scope narrowing the agent slipped into the message hoping you would not notice.

This check is intentionally simple. The point is not to do deep semantic analysis but to flag the linguistic tells that show up when an agent is hedging.

LLM Judge (15% weight, optional)

For teams that want a final independent adjudication, AgentLiar can call out to OpenRouter and ask a different frontier model to judge the task / claim / diff bundle directly. The judge gets the three artefacts and a strict prompt asking for a 0-100 score and a one-line verdict. You can configure the judge model in YAML, and use multiple judges if you want a consensus.

The judge is optional because it costs money and adds latency, and the first three checks already catch most of what matters. But on a high-stakes change you turn it on, run two judges, and use the consensus.

Weighted Score and Verdict

The four checks produce sub-scores. AgentLiar combines them with the configurable weights into a single 0-100 confidence score:

score = 0.35 * file + 0.30 * test + 0.20 * scope + 0.15 * judge

The verdict thresholds are:

A real run might come back as:

score: 47/100  verdict: LIKELY OVERCLAIMED
file 12/35   test 8/30   scope 7/20   judge 20/15
report -> report.md

The per-check breakdown matters because the score alone is not actionable. You want to know which check failed so you know what to do next.

Reports

AgentLiar emits two output formats. JSON is for downstream tooling, structured with the score, per-check evidence blocks, and metadata. Markdown is for human consumption, formatted to drop into a PR comment or a Slack message.

The Markdown report leads with the verdict, then a one-paragraph summary, then the per-check evidence in collapsible sections. Reviewers scan the verdict, expand the sections that failed, and decide what to do.

Four Ways to Run It

AgentLiar is built to be embedded everywhere agents touch a codebase:

What This Actually Replaces

The current state of the art for verifying agent claims is "a human reads the PR." That works at low volume. It does not work when an agent is opening twenty PRs a day, or when an autonomous loop is making decisions every few seconds based on whether the last step succeeded. AgentLiar is built for the high-throughput case, where a fast, cheap, deterministic verification step in front of a human review can filter out the obvious failures and let reviewers spend their attention on the changes that actually need it.

It is not a substitute for review. It is a filter that means review covers the things review is good at.

How to Build This with NEO

Open NEO in VS Code or Cursor and describe what you want to build. A good starting prompt for this project:

"Build a Python verification system called AgentLiar that detects when coding agents falsely claim task completion. Inputs: task.md (what was asked), claim.md (what the agent says it did), and changes.diff (the actual code). Implement four checks that run in parallel via asyncio.gather: (1) File Check 35%, verify claimed files exist in the diff, flag placeholder bodies (pass, TODO, NotImplementedError), flag unexpected additions. (2) Test Check 30%, count test functions, count assertions, flag tests with no assertions, flag newly-added skip/xfail decorators. (3) Scope Check 20%, pure NLP scan of the claim for narrowing phrases (only, for now, happy path, skeleton) and compare against the task. (4) LLM Judge 15%, optional, call OpenRouter with a configurable model, get a 0-100 score and verdict. Combine into weighted confidence 0-100 with thresholds: 80+ PASS, 60-79 WARN, <60 LIKELY OVERCLAIMED. Emit JSON and Markdown reports. Provide CLI (Click), Python library, GitHub Action, and FastAPI HTTP API surfaces. Use Pydantic for config and result schemas."

Build with NEO →

NEO scaffolds the orchestrator, the four checks, the scoring logic, the report formatters, and the four interface surfaces. From there you tune the weights to what failure mode your team cares about most, add a domain-specific check (for example a "schema check" that compares claimed database migrations against the actual migration file), or wire it into your agent eval pipeline alongside your other graders.

NEO built a verifier that catches agent overclaim before it lands in production. See what else NEO ships at heyneo.com.


Try NEO in Your IDE

Install the NEO extension to bring AI-powered development directly into your workflow: