MoE Cost Analyzer: Benchmarking Dense vs Mixture-of-Experts Models on Your Own Workload

The Problem

Everyone claims Mixture-of-Experts models are cheaper and faster than dense models of comparable quality — but nobody tells you whether those claims hold on your prompts, your latency SLA, and your production traffic shape.

NEO built MoE Cost Analyzer to run hard numbers against OpenRouter — real latency, real tokens, real dollars — on a benchmark you define, then recommend whether switching is worth it.

Live API Benchmarking Against Your Own Prompts

MoE Cost Analyzer takes a JSON benchmark file describing the prompts you actually run in production and fires them through both a dense and a MoE model on OpenRouter concurrently. The included reference benchmark uses 100 sentiment-analysis tasks comparing google/gemma-4-31b-it (dense) against google/gemma-4-26b-a4b-it (MoE).

The benchmark file format is intentionally minimal — an id, the prompt, and optional expected labels for correctness checks:

{
  "name": "My Production Benchmark",
  "tasks": [
    {
      "id": "task_001",
      "prompt": "Extract all dates mentioned in this support ticket: ...",
      "expected_labels": ["2026-01-15", "2026-02-03"]
    },
    {
      "id": "task_002",
      "prompt": "Classify the urgency of this message as Low / Medium / High: ...",
      "expected_labels": ["High"]
    }
  ]
}
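
As a rough illustration of how a file in this shape might be read, here is a minimal loader sketch. The function name `load_tasks` and the specific validation rules (non-empty string `id` and `prompt`, unique ids, `expected_labels` defaulting to an empty list) are assumptions for illustration, not the project's actual code.

```python
import json


def load_tasks(text):
    """Parse benchmark JSON text and return its task list after basic checks."""
    benchmark = json.loads(text)
    tasks = benchmark["tasks"]
    seen_ids = set()
    for task in tasks:
        # "id" and "prompt" are the two required fields in the format above.
        for field in ("id", "prompt"):
            value = task.get(field)
            if not isinstance(value, str) or not value:
                raise ValueError(f"task is missing required field {field!r}: {task}")
        if task["id"] in seen_ids:
            raise ValueError(f"duplicate task id: {task['id']}")
        seen_ids.add(task["id"])
        # expected_labels is optional; normalise its absence to an empty list.
        task.setdefault("expected_labels", [])
    return tasks
```

Failing fast on a malformed task is cheaper than discovering it halfway through a paid benchmark run.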

The runner uses asyncio.gather with a semaphore capped at 5 concurrent requests to stay inside OpenRouter rate limits, and wraps each call in a retry loop that handles 429s and 5xx errors with exponential backoff (1s, 2s, 4s). A --dry-run flag simulates calls for local iteration without spending real tokens.
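
The concurrency pattern described above can be sketched in a few lines. This is a sketch under stated assumptions rather than the project's runner: `send` is a hypothetical coroutine standing in for the OpenRouter call, and `RetryableError` is a hypothetical exception standing in for a 429 or 5xx response. Only the cap of 5 and the 1s, 2s, 4s delays come from the text.

```python
import asyncio

MAX_CONCURRENCY = 5          # cap stated in the text
BACKOFF_SECONDS = (1, 2, 4)  # backoff delays stated in the text


class RetryableError(Exception):
    """Hypothetical stand-in for a 429 or 5xx response."""


async def call_with_retry(send, prompt, sleep=asyncio.sleep):
    """Await send(prompt), waiting 1s, 2s, then 4s between failed attempts."""
    for delay in BACKOFF_SECONDS:
        try:
            return await send(prompt)
        except RetryableError:
            await sleep(delay)
    # Final attempt after the last backoff; any error now propagates.
    return await send(prompt)


async def run_all(send, prompts, sleep=asyncio.sleep):
    """Run every prompt with at most MAX_CONCURRENCY calls in flight."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def guarded(prompt):
        async with semaphore:
            return await call_with_retry(send, prompt, sleep)

    # gather preserves input order, so results line up with prompts.
    return await asyncio.gather(*(guarded(p) for p in prompts))
```

Passing `sleep` as a parameter lets tests replace the real delay with a recorder. Because the semaphore is held across the retries in this sketch, a request that is backing off still counts against the limit of 5.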

Latency, Cost, and SLA Decision Matrix

After the run, analyzer.py computes per-model statistics — average, P50, and P95 latency; total and per-query cost; total tokens; error rate — and assembles a decision matrix. Token cost is computed from a ModelPricing dataclass keyed on the OpenRouter rate card, with separate input and output per-million prices.
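
To make the cost and latency arithmetic concrete, here is a minimal sketch. The field names on `ModelPricing`, the `cost` method, and the helpers `percentile` and `summarize` are assumptions for illustration; the source only says the dataclass holds separate input and output per-million prices. The percentile uses the nearest-rank definition, which is one of several common conventions and may differ from what the tool uses.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ModelPricing:
    """Per-million-token prices in USD (field names are assumptions)."""
    input_per_million: float
    output_per_million: float

    def cost(self, prompt_tokens: int, completion_tokens: int) -> float:
        """Dollar cost of one call, pricing input and output tokens separately."""
        total = (prompt_tokens * self.input_per_million
                 + completion_tokens * self.output_per_million)
        return total / 1_000_000


def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n), 1-indexed."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Average, P50, and P95 latency for one model's successful calls."""
    return {
        "avg": mean(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
    }
```

With only 100 tasks, P95 is set by the handful of slowest calls, so it will move more between runs than the average or median.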

On the published 100-task reference run, the MoE variant beat the dense model on latency and cost while matching it on token count and error rate:

Metric	Dense gemma-4-31b-it	MoE gemma-4-26b-a4b-it	Delta
Avg Latency	1,721 ms	1,283 ms	-25.5%
P50 Latency	833 ms	605 ms	-27.3%
P95 Latency	6,748 ms	5,879 ms	-12.9%
Cost per Query	$0.00000494	$0.00000395	-20.0%
Total Tokens	4,939	4,939	0.0%
Error Rate	0.0%	0.0%	—

The recommend() function compares both models against configurable SLAs and emits one of three decisions: USE MoE (meets both latency and cost caps), MARGINAL (meets one), or STICK WITH DENSE (fails both). Note how the gap narrows at P95: MoE cuts median latency by 27.3% but P95 latency by only 12.9%, so its advantage shrinks in the tail, which matters for latency-sensitive traffic.
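
The three-way decision can be sketched as below. This is a simplified illustration, not the project's recommend(): the signature is an assumption, it checks only the MoE model's figures, and the source does not say which latency statistic (average, P50, or P95) is compared against the cap, so the caller picks one here.

```python
def recommend(moe_latency_ms, moe_cost_per_1k, sla_latency_ms, sla_cost_per_1k):
    """Return a decision based on how many SLA caps the MoE model meets."""
    meets_latency = moe_latency_ms <= sla_latency_ms
    meets_cost = moe_cost_per_1k <= sla_cost_per_1k
    caps_met = int(meets_latency) + int(meets_cost)
    if caps_met == 2:
        return "USE MoE"
    if caps_met == 1:
        return "MARGINAL"
    return "STICK WITH DENSE"


# Reference-run MoE figures: $0.00000395 per query is $0.00395 per 1k queries.
MOE_COST_PER_1K = 0.00000395 * 1000

# Judged on average latency (1,283 ms) against a 1,500 ms cap.
print(recommend(1283, MOE_COST_PER_1K, sla_latency_ms=1500, sla_cost_per_1k=0.005))
# Judged on P95 latency (5,879 ms) against the same cap.
print(recommend(5879, MOE_COST_PER_1K, sla_latency_ms=1500, sla_cost_per_1k=0.005))
```

Under these assumptions the first call prints USE MoE and the second prints MARGINAL, so the choice of latency statistic can flip the recommendation when the tail is long.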

Rich Terminal Output and CSV Export

Results render as a Rich-formatted table with a coloured recommendation line, and every raw data point is written to CSV for downstream analysis.

python analyze.py my_benchmark.json \
  --sla-latency-ms 1500 \
  --sla-cost-per-1k 0.005 \
  --output my_results.csv

The CSV has one row per (task_id, model_id) pair with latency_ms, prompt_tokens, completion_tokens, total_tokens, cost_usd, and error columns — a shape that drops straight into pandas for per-prompt regression analysis, outlier hunting, or extrapolation to production volume. On a 1M-query/day workload the reference delta projects to roughly $30/month in savings; at 100M queries/day it extrapolates to almost $3K/month.
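
The projection is straightforward arithmetic on the per-query costs from the reference table. The sketch below reproduces it, assuming a 30-day month and a constant per-query cost at every volume. The function name `monthly_savings` is hypothetical.

```python
DENSE_COST_PER_QUERY = 0.00000494  # from the reference table
MOE_COST_PER_QUERY = 0.00000395    # from the reference table
DAYS_PER_MONTH = 30                # assumption


def monthly_savings(queries_per_day, days=DAYS_PER_MONTH):
    """Dollars saved per month by serving every query from the MoE model."""
    saving_per_query = DENSE_COST_PER_QUERY - MOE_COST_PER_QUERY
    return saving_per_query * queries_per_day * days


for volume in (1_000_000, 100_000_000):
    print(f"{volume:>11,} queries/day -> ${monthly_savings(volume):,.2f}/month")
```

This works out to about $29.70 per month at 1M queries/day and about $2,970 per month at 100M queries/day, matching the figures quoted above. The per-query costs come from short sentiment-analysis prompts, so workloads with longer prompts or completions would scale differently.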

How to Build This with NEO

Open NEO in VS Code or Cursor and describe what you want to build. A good starting prompt for this project:

"Build an async benchmarking tool that takes a JSON file of prompts, sends each through both a dense and a MoE model on OpenRouter with a semaphore-limited concurrency of 5 and retry with exponential backoff, computes avg/P50/P95 latency and per-query cost from a pricing dataclass, exports results to CSV, and prints a Rich-formatted decision matrix with a USE MoE / MARGINAL / STICK WITH DENSE recommendation against configurable SLA thresholds."

NEO generates the project structure and core implementation. From there you iterate — ask it to add a dry-run mode that simulates OpenRouter responses for offline testing, extend the analyzer with a correctness scorer that compares completions against expected_labels, or add monthly-cost extrapolation across configurable daily query volumes. Each request builds on what's already there.

To run the finished project:

git clone https://github.com/dakshjain-1616/MoE-Cost-Analyzer
cd MoE-Cost-Analyzer
pip install -r requirements.txt
python analyze.py benchmark.json --sla-latency-ms 2000 --sla-cost-per-1k 0.01

The terminal prints the decision matrix with the coloured recommendation, and results.csv holds every measurement for pandas-side analysis.

NEO built a benchmarking harness that turns the "should we switch to MoE" debate into hard numbers measured on your own workload, with SLA-aware recommendations. See what else NEO ships at heyneo.com.

Try NEO in Your IDE

Install the NEO extension to bring AI-powered development directly into your workflow:

VS Code: NEO in VS Code
Cursor: Install NEO for Cursor