RAG Pipeline Stress Tester: Find Where Your Retrieval Breaks Before Users Do

The Problem
RAG pipelines that nail a happy-path eval crumble on ambiguous queries, out-of-scope questions, multi-hop reasoning, and prompt injections - and the failure rarely shows up until production traffic reveals it.
NEO built RAG Pipeline Stress Tester to expose those failure modes under load before the first user hits them.
Seven Adversarial Query Categories
RAG Pipeline Stress Tester ships a curated suite of adversarial queries across seven categories: ambiguous, out_of_scope, multi_hop, needle_in_haystack, contradictory_context, prompt_injection, and long_context. Each category targets a specific failure mode that vector search plus naive prompting tends to miss. The suite is extensible - YAML files describe new categories and load seamlessly alongside the built-ins.
category: needle_in_haystack
queries:
- id: nih_017
prompt: "What is the serial number of the fridge in Appendix C, section 4?"
expected_behaviour: cite_appendix_c
grading: exact_match_token
The runner sends each query through the target pipeline via a standard HTTP contract, so any framework - LangChain, LlamaIndex, custom - can plug in by implementing one endpoint.
Concurrent Load with Async Workers
Tests run through an asyncio worker pool configurable up to 50+ concurrent virtual users. The runner records per-query latency, error rate, retrieval hit precision (when ground-truth chunk IDs are provided), and generator verdicts graded by either heuristic rules or an LLM judge. Error isolation keeps one worker's failure from cascading into the rest of the run.
| Load Profile | Users | Queries | Typical Duration |
|---|---|---|---|
| Smoke | 5 | 50 | 1-2 min |
| Standard | 20 | 300 | 5-8 min |
| Stress | 50 | 1000 | 20-30 min |
| Soak | 10 | 10k | 4-6 hrs |
The soak profile is the one most teams skip and regret later - it surfaces memory leaks and connection-pool exhaustion that shorter runs never trigger.
Health Score and Reports
At completion, every run is reduced to a 0-100 health score derived from category accuracy, error rate, p95 latency, and retrieval precision. The HTML report leads with the score, drills into per-category breakdowns, and lists the 20 worst-performing queries with full context and retrieval traces for triage. JSON output mirrors everything for downstream dashboards and CI gating.
python stress.py \
--endpoint https://rag.internal/query \
--suite suites/standard.yaml \
--users 20 --queries 300 \
--judge openrouter/meta-llama/llama-4-maverick \
--report report.html
The test harness ships with 58 pytest cases covering the runner, scorer, and report generation, so extensions to the suite stay safe.
How to Build This with NEO
Open NEO in VS Code or Cursor and describe what you want to build. A good starting prompt for this project:
"Build an async stress-testing framework for RAG pipelines. It should load YAML suites of adversarial queries across seven categories (ambiguous, out-of-scope, multi-hop, needle-in-haystack, contradictory, prompt-injection, long-context), run them concurrently against an HTTP endpoint using an asyncio worker pool up to 50 users, grade responses with heuristic rules and an LLM judge, and produce an HTML report with a 0-100 health score and per-category breakdowns plus a JSON artifact for CI."
NEO generates the project structure and core implementation. From there you iterate - add domain-specific query suites, wire the JSON output into a CI gate that fails builds when health drops below 80, or build a soak-mode dashboard that streams live metrics during multi-hour runs. Each request builds on what's already there.
To run the finished project:
git clone https://github.com/dakshjain-1616/RAG-pipeline-stress-tester
cd RAG-pipeline-stress-tester
pip install -r requirements.txt
python stress.py --endpoint http://localhost:8000/query --suite suites/smoke.yaml --users 5
Open report.html for the health score and category breakdown; pipe report.json into your observability stack for trend tracking.
NEO built a focused stress harness that exposes the RAG failure modes eval datasets never catch, so teams ship with confidence instead of surprises. See what else NEO ships at heyneo.com.
Try NEO in Your IDE
Install the NEO extension to bring AI-powered development directly into your workflow:
- VS Code: NEO in VS Code
- Cursor: Install NEO for Cursor →