LLM SLA Gatekeeper: Automated Deployment Gating for Language Models


The Problem

Teams swap LLM versions, adjust hardware, or change quantization levels, often with no automated check that the new configuration meets its latency and throughput requirements before traffic hits it. Manual spot-checks miss edge cases and are not reproducible.

NEO built LLM SLA Gatekeeper to run repeatable benchmarks against configurable SLA targets and return a deterministic PASS or FAIL with structured output and CI/CD-compatible exit codes.

The Validation Pipeline

LLM SLA Gatekeeper runs a straightforward four-step process:

Input — provide a Hugging Face model ID or local path plus an SLA config with max_latency_ms, min_throughput_tokens_per_sec, and max_cost_per_token.
Benchmark — run N inference passes (default 5), collect per-token latency samples, compute p50, p95, average latency, and throughput.
SLA Check — compare results against thresholds: average latency, p95 latency (using a configurable multiplier), and throughput.
Verdict — emit PASS (exit 0) or FAIL (exit 1), with a confidence score, recommendations, and results appended to a JSONL history file.
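
The Benchmark step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the dictionary keys, and the choice of linear-interpolation percentiles are all assumptions made for the example.

```python
import statistics


def percentile(samples, pct):
    """Linear-interpolation percentile over a non-empty list of samples."""
    ordered = sorted(samples)
    if len(ordered) == 1:
        return ordered[0]
    rank = (pct / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def summarize(latencies_ms, tokens_generated, total_seconds):
    """Reduce raw benchmark samples to the metrics the SLA check consumes.

    All names here are hypothetical; the real tool may structure this differently.
    """
    return {
        "avg_latency_ms": statistics.fmean(latencies_ms),
        "p50_latency_ms": percentile(latencies_ms, 50),
        "p95_latency_ms": percentile(latencies_ms, 95),
        "throughput_tokens_per_sec": tokens_generated / total_seconds,
    }


if __name__ == "__main__":
    # Five latency samples, matching the default run count of 5.
    print(summarize([40, 42, 45, 48, 60], tokens_generated=250, total_seconds=10.0))
```

With only five samples, p95 is dominated by the single slowest run, which is one reason a small run count yields a less trustworthy verdict.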

The exit codes integrate directly with CI/CD: 0 is PASS, 1 is FAIL, 2 is ERROR.
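
A rough sketch of how the SLA Check and Verdict steps could map onto those exit codes is shown below. The function names, the dictionary keys for the measured metrics, and the default multiplier of 1.5 are assumptions for illustration; the article only says the p95 multiplier is configurable, so this example assumes the p95 limit is the latency limit scaled by that multiplier.

```python
EXIT_PASS, EXIT_FAIL, EXIT_ERROR = 0, 1, 2


def check_sla(stats, sla, p95_multiplier=1.5):
    """Return the names of violated thresholds (empty list means PASS).

    p95_multiplier=1.5 is a placeholder default, not the tool's actual value.
    """
    failures = []
    if stats["avg_latency_ms"] > sla["max_latency_ms"]:
        failures.append("avg_latency")
    if stats["p95_latency_ms"] > sla["max_latency_ms"] * p95_multiplier:
        failures.append("p95_latency")
    # A profile may leave throughput unset (as `dev` does); skip the check then.
    min_throughput = sla.get("min_throughput_tokens_per_sec")
    if min_throughput is not None and stats["throughput_tokens_per_sec"] < min_throughput:
        failures.append("throughput")
    return failures


def gate(stats, sla):
    """Map the check outcome to a CI/CD exit code: 0 PASS, 1 FAIL, 2 ERROR."""
    try:
        failures = check_sla(stats, sla)
    except (KeyError, TypeError):
        # Malformed input is an ERROR, which is distinct from a failed SLA.
        return EXIT_ERROR, ["error"]
    return (EXIT_PASS if not failures else EXIT_FAIL), failures


if __name__ == "__main__":
    import sys

    chatbot = {"max_latency_ms": 150, "min_throughput_tokens_per_sec": 10}
    measured = {"avg_latency_ms": 120.0, "p95_latency_ms": 200.0,
                "throughput_tokens_per_sec": 12.0}
    code, failures = gate(measured, chatbot)
    print("PASS" if code == EXIT_PASS else f"FAIL: {failures}")
    sys.exit(code)
```

Keeping ERROR separate from FAIL lets a pipeline distinguish "the model is too slow" from "the benchmark itself broke".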

Built-in SLA Profiles

Five named profiles cover the common LLM deployment scenarios:

Profile	Max Latency	Min Throughput	Use Case
`chatbot`	150 ms	10 tok/s	Interactive chat, customer support
`realtime`	50 ms	50 tok/s	Streaming, voice assistants
`batch`	2000 ms	1 tok/s	Document processing, summarization
`edge`	500 ms	2 tok/s	IoT, on-device inference
`dev`	5000 ms	—	CI/CD dry runs, local development

Every threshold is overridable via environment variable, so you can tune SLA_PROFILE_CHATBOT_LATENCY_MS without touching the code.
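
One way such an override could work is sketched below. Only the variable name SLA_PROFILE_CHATBOT_LATENCY_MS comes from the text above; the `PROFILES` dictionary, the `load_profile` helper, and the assumption that other profiles follow the same naming pattern are hypothetical.

```python
import os

# Default thresholds copied from the profile table; None means "not enforced".
PROFILES = {
    "chatbot": {"max_latency_ms": 150.0, "min_throughput_tokens_per_sec": 10.0},
    "realtime": {"max_latency_ms": 50.0, "min_throughput_tokens_per_sec": 50.0},
    "batch": {"max_latency_ms": 2000.0, "min_throughput_tokens_per_sec": 1.0},
    "edge": {"max_latency_ms": 500.0, "min_throughput_tokens_per_sec": 2.0},
    "dev": {"max_latency_ms": 5000.0, "min_throughput_tokens_per_sec": None},
}


def load_profile(name, env=None):
    """Return a copy of a named profile with the latency override applied.

    Assumes the pattern SLA_PROFILE_<NAME>_LATENCY_MS for every profile;
    the source only confirms the chatbot variable.
    """
    env = os.environ if env is None else env
    profile = dict(PROFILES[name])  # copy so the defaults stay untouched
    raw = env.get(f"SLA_PROFILE_{name.upper()}_LATENCY_MS")
    if raw is not None:
        profile["max_latency_ms"] = float(raw)
    return profile


if __name__ == "__main__":
    print(load_profile("chatbot"))
    print(load_profile("chatbot", env={"SLA_PROFILE_CHATBOT_LATENCY_MS": "200"}))
```

Passing `env` explicitly makes the lookup easy to test without mutating the real process environment.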

Confidence Scoring

Every result includes a confidence_score between 0.0 and 1.0 that reflects how trustworthy the verdict is:

run_score   = min(n / 20.0, 1.0)
var_score   = max(0.0, 1.0 - (std / mean) × 2)
mode_factor = 1.0 (real hardware) | 0.75 (simulation)
confidence  = min(1.0, (run_score × 0.5 + var_score × 0.5) × mode_factor)

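Simulation mode caps confidence at 0.75 because synthetic latency figures are estimates. On real hardware, increase --runs to raise confidence: run_score reaches its maximum at 20 runs, so the default of 5 runs yields a run_score of only 0.25.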
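
The formula translates almost directly into Python. The sketch below follows the four lines above term by term; the function name, the use of the population standard deviation, and the guard for a non-positive mean are assumptions, since the article does not say which the tool uses.

```python
import statistics


def confidence_score(latencies_ms, simulated=False):
    """Combine run count and latency variability into a 0.0-1.0 confidence."""
    n = len(latencies_ms)
    mean = statistics.fmean(latencies_ms)
    std = statistics.pstdev(latencies_ms)  # assumption: population std deviation

    run_score = min(n / 20.0, 1.0)
    # Guard against a zero or negative mean (assumption: treat as no confidence).
    var_score = max(0.0, 1.0 - (std / mean) * 2) if mean > 0 else 0.0
    mode_factor = 0.75 if simulated else 1.0
    return min(1.0, (run_score * 0.5 + var_score * 0.5) * mode_factor)


if __name__ == "__main__":
    steady = [50.0] * 5  # default run count, zero variance
    print(confidence_score(steady))                   # 0.625
    print(confidence_score(steady, simulated=True))   # 0.46875
    print(confidence_score([50.0] * 20))              # 1.0
```

Even with perfectly stable latencies, five runs top out at 0.625 because run_score contributes only 0.25 of its possible 1.0.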

Simulation Mode

When no GPU is available, simulation mode generates synthetic benchmark data using a linear formula:

latency_ms = 5 × model_size_in_B + 15

A 7B model yields 50 ms simulated latency. A 1.7B model yields 23.5 ms. This makes the tool practical for CI/CD pipelines on CPU-only runners.
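
As a quick check of the arithmetic, the linear formula can be written as a one-line function. The function name is hypothetical, and the real simulation mode may add noise or other terms that the article does not describe.

```python
def simulated_latency_ms(model_size_b):
    """Synthetic latency in ms for a model with `model_size_b` billion parameters."""
    return 5.0 * model_size_b + 15.0


if __name__ == "__main__":
    for size in (1.7, 7, 8):
        print(f"{size}B -> {simulated_latency_ms(size):.1f} ms")
```

Under this formula an 8B model comes out at 55 ms, which is below the 150 ms `chatbot` limit but above the 50 ms `realtime` limit.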

How to Build This with NEO

Open NEO in VS Code or Cursor and describe what you want to build. A good starting prompt for this project:

"Build a Python LLM deployment gating tool that benchmarks a given Hugging Face model ID against configurable SLA profiles (chatbot, realtime, batch, edge, dev). The tool should run N inference passes, compute p50 and p95 latency plus throughput, compare results against thresholds, and return a PASS or FAIL verdict with a confidence score, exit code 0 or 1 for CI/CD integration, and a JSONL history file that accumulates results across runs. Include a simulation mode that uses a linear formula based on model size when no GPU is available."


NEO generates the project structure and core implementation. From there you iterate: ask it to implement the confidence scoring formula that combines run count and variance, add the five named SLA profiles with environment variable overrides for each threshold, or build the Gradio UI with Validate, Compare, History, and About tabs. Each follow-up builds on what's already there.

To run the finished project:

git clone https://github.com/dakshjain-1616/llm-sla-gatekeeper
cd llm-sla-gatekeeper
pip install -r requirements.txt
python run_validation.py --model=Qwen/Qwen3-8B --profile=chatbot --simulate

Add the GitHub Actions snippet from the README to your CI pipeline so that exit code 1 automatically blocks merges when a model swap fails the latency gate.

NEO built a deployment gate that gives LLM teams a repeatable, CI-compatible answer to the question "is this model fast enough to ship?" See what else NEO ships at heyneo.com.

Try NEO in Your IDE

Install the NEO extension to bring AI-powered development directly into your workflow:

VS Code: NEO in VS Code
Cursor: Install NEO for Cursor