Prompt Compression Benchmarker: Measure Token Reduction vs Quality Loss Across Five Algorithms

View on GitHub

Pipeline Architecture

The Problem

Prompt compression libraries all promise 50%+ token reduction with "minimal" quality loss - but "minimal" is measured on their benchmarks, not your RAG context, your summarization prompts, or your code. Picking the wrong compressor can silently cost accuracy while the token savings look great on the invoice.

NEO built Prompt Compression Benchmarker (pcb) to run every major compression technique against realistic RAG, summarization, and coding workloads and report token reduction, proxy quality, latency, and optional LLM-as-judge scores side by side.

Five Compressors Across Three Task Types

Prompt Compression Benchmarker ships implementations of five compressors under src/pcb/compressors/ and evaluates each against three task modules under src/pcb/tasks/.

CompressorMechanismBest For
tfidfSentence-level TF-IDF scoringGeneral documents, factual passages
selective_contextGreedy token-budget selection from document startOrdered narratives where prefix matters
llmlinguaSentence-level coarse pruningNarratives and transcripts
llmlingua2Word-level stopword and low-content token removalCode and RAG contexts
no_compressionPassthrough baselineQuality reference

The bundled datasets under /data include rag_samples.jsonl (20 real factual passages, 400-450 tokens), summarization_samples.jsonl (10 news-style articles), and coding_samples.jsonl (10 Python code contexts, 370-800 tokens), so the numbers you see are from real prompts, not synthetic filler.

Metrics That Actually Matter

Every run reports five numbers per (compressor, task) cell:

The terminal reporter under src/pcb/reporters/ colours each row by quality loss severity - cyan for improvements, green below 5%, yellow 5-15%, red above 15% - so Pareto-dominant compressors jump off the screen. The HTML reporter renders Chart.js scatter plots with Pareto highlights so you can share the results without anyone installing Python.

# Full benchmark with daily-traffic cost projection
pcb run --daily-tokens 3000000 --cost-model claude-sonnet-4-6

# Compress a single file with stats
pcb compress context.txt --compressor llmlingua2 --rate 0.45 --stats

# Add LLM-as-judge scoring on a sample
pcb run --llm-judge --judge-model claude-sonnet-4-6 --max-samples 10

# Point at your own JSONL dataset
pcb run --data-dir ./my_data --task rag --max-samples 50

Trade-off Numbers From a Real Run

On the shipped RAG benchmark the numbers tell a clear story - more aggressive compression bleeds quality faster than token savings justify, and the right choice depends on what your downstream LLM is actually doing.

CompressorToken ReductionProxy Drop
selective_context56.9%34.4%
llmlingua245.0%27.3%

selective_context is cheaper per token but drops quality harder; llmlingua2 holds up better at a smaller reduction. The whole point of benchmarking is that this verdict flips by task - llmlingua2 dominates on coding, tfidf competes on factual passages, and selective_context can win on narrative summarization where prefix bias is accurate.

Outputs land in three formats under src/pcb/reporters/: JSON with per-sample detail, CSV with one row per (compressor, task) pair for spreadsheets and pandas, and the HTML interactive report with scatter plots. An MCP server (src/pcb/mcp_server.py) exposes compress_text, estimate_savings, list_compressors, and recommend tools so Claude or any MCP client can compress prompts inline before shipping them to an expensive model.

How to Build This with NEO

Open NEO in VS Code or Cursor and describe what you want to build. A good starting prompt for this project:

"Build a Python CLI called pcb using Typer that benchmarks five prompt compression techniques (TF-IDF, selective context, LLMLingua, LLMLingua2, no-compression baseline) across RAG, summarization, and coding tasks loaded from JSONL. Report token reduction %, proxy score (F1/ROUGE-L/BM25), proxy drop %, latency, and an optional LLM-as-judge score through OpenRouter. Render results to terminal with colour-coded quality loss, JSON, CSV, and an interactive HTML report with Chart.js scatter plots and Pareto highlights. Include daily-token cost projection against configurable model rate cards and an MCP server exposing compress_text, estimate_savings, list_compressors, and recommend tools."

Build with NEO →

NEO generates the project structure and compressor plugins. From there you iterate - add a new compressor (e.g. sentence-embedding clustering), swap in a different judge model, plug a custom task module for your production prompts, or extend the cost projection to a multi-tier rate card so the recommendation reflects cache hits versus cold calls. Each request builds on what's already there.

To run the finished project:

git clone https://github.com/dakshjain-1616/Prompt-Compression-Benchmarker
cd Prompt-Compression-Benchmarker
pip install -r requirements.txt
pcb run --daily-tokens 3000000 --cost-model claude-sonnet-4-6

The terminal prints the decision matrix with colour-coded quality loss, results.html opens the interactive Pareto scatter, and the MCP server is ready to wire into any MCP-compatible client for inline compression.

NEO built a benchmarking harness that makes "which compressor should we use" a measured answer on your own data, not a vendor claim. See what else NEO ships at heyneo.com.


Try NEO in Your IDE

Install the NEO extension to bring AI-powered development directly into your workflow: