Evaluate & Benchmark
Benchmarking LLMs on Real Tasks
An async LLM benchmarking platform that evaluates models from OpenAI, Anthropic, Google, and more across 150+ real-world tasks covering coding, reasoning, structured output, and...
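As a rough illustration of the async evaluation pattern described here, the sketch below fans out every task to every model with `asyncio.gather` and reports a pass rate per model. The `Task` type, the `fake_model` stub, and the exact-match scoring rule are assumptions made for this example; they stand in for real provider calls and are not NEO's actual implementation.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    expected: str


async def fake_model(name: str, prompt: str) -> str:
    # Hypothetical stand-in for a provider API call.
    await asyncio.sleep(0)
    return prompt.upper() if name == "model-a" else prompt


async def score_model(name: str, tasks: list[Task]) -> float:
    # Run all tasks for one model concurrently, then score by exact match.
    answers = await asyncio.gather(*(fake_model(name, t.prompt) for t in tasks))
    passed = sum(a == t.expected for a, t in zip(answers, tasks))
    return passed / len(tasks)


async def run_benchmark(models: list[str], tasks: list[Task]) -> dict[str, float]:
    # Evaluate every model concurrently and collect one score per model.
    scores = await asyncio.gather(*(score_model(m, tasks) for m in models))
    return dict(zip(models, scores))


tasks = [Task("abc", "ABC"), Task("xyz", "XYZ")]
results = asyncio.run(run_benchmark(["model-a", "model-b"], tasks))
print(results)
```

In a real harness the stub would be replaced by rate-limited client calls, and exact match would give way to task-specific graders.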
Use cases
Real projects built by NEO — from LLM benchmarks to agent swarms. Pick a workflow below to browse, or start with a featured use case.
Evaluate & Benchmark
Closed-loop system: an optimizer LLM writes prompts and reads failure summaries, a target LLM runs batches against synthetic data, and a JSON ledger tracks every iteration until scores converge.
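The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `propose` and `run_batch` are hypothetical callables standing in for the optimizer and target LLMs, the stubs below are deterministic fakes, and "converged" is taken to mean that the score changed by less than `tol` between consecutive iterations.

```python
import json
from typing import Callable


def optimize_prompt(
    propose: Callable[[str, str], str],            # optimizer: (prompt, failure summary) -> new prompt
    run_batch: Callable[[str, list], list[bool]],  # target: (prompt, batch) -> pass/fail per item
    batch: list,
    seed_prompt: str,
    max_iters: int = 10,
    tol: float = 1e-3,
) -> list[dict]:
    ledger = []
    prompt, prev_score = seed_prompt, None
    for i in range(max_iters):
        passes = run_batch(prompt, batch)
        score = sum(passes) / len(batch)
        failures = [item for item, ok in zip(batch, passes) if not ok]
        summary = f"{len(failures)} failed: {failures[:3]}"
        # Record every iteration so the run can be audited or resumed.
        ledger.append({"iteration": i, "prompt": prompt, "score": score, "failures": summary})
        if prev_score is not None and abs(score - prev_score) < tol:
            break  # score stopped moving: treat as converged
        prev_score = score
        prompt = propose(prompt, summary)
    return ledger


def stub_propose(prompt: str, summary: str) -> str:
    # Fake optimizer: appends one instruction per round.
    return prompt + " Be precise."


def stub_run_batch(prompt: str, batch: list) -> list[bool]:
    # Fake target: each added instruction makes one more item pass.
    n_pass = min(len(batch), prompt.count("Be precise."))
    return [i < n_pass for i in range(len(batch))]


ledger = optimize_prompt(stub_propose, stub_run_batch, ["a", "b", "c"], "Answer.")
print(json.dumps(ledger, indent=2))
```

Keeping the ledger as plain dicts means it serializes to JSON without custom encoders, which is what makes the per-iteration history easy to diff and replay.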
Build Agents
10 specialized agents coordinating over an async message bus: +4.62% returns across 250 days of S&P 500 data.
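To make the message-bus idea concrete, here is a small sketch of topic-based publish/subscribe built on `asyncio.Queue`. The `MessageBus` class, the two agents, and the price threshold are all hypothetical and exist only to show the coordination pattern; they do not reproduce the ten agents or the trading logic behind the figure above.

```python
import asyncio
from collections import defaultdict


class MessageBus:
    """Minimal in-process pub/sub: one queue per subscriber, grouped by topic."""

    def __init__(self) -> None:
        self._subs: dict[str, list[asyncio.Queue]] = defaultdict(list)

    def subscribe(self, topic: str) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._subs[topic].append(queue)
        return queue

    async def publish(self, topic: str, message) -> None:
        for queue in self._subs[topic]:
            await queue.put(message)


async def signal_agent(bus: MessageBus, prices: list[float]) -> None:
    # Illustrative rule only: emit "buy" below an arbitrary threshold.
    for price in prices:
        await bus.publish("signal", "buy" if price < 100 else "hold")
    await bus.publish("signal", None)  # sentinel: no more messages


async def trader_agent(inbox: asyncio.Queue) -> list[str]:
    orders = []
    while (msg := await inbox.get()) is not None:
        if msg == "buy":
            orders.append(msg)
    return orders


async def main() -> list[str]:
    bus = MessageBus()
    inbox = bus.subscribe("signal")
    _, orders = await asyncio.gather(
        signal_agent(bus, [95, 105, 90]),
        trader_agent(inbox),
    )
    return orders


orders = asyncio.run(main())
print(orders)
```

Because agents only share queues, adding another specialist is a matter of subscribing it to the topics it cares about.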
Agents with brittle tool calls. Prompts that need another pass. Evals before you trust a model swap. NEO lives in VS Code or Cursor and helps you turn that work into real code and runs, so you iterate on behavior, not boilerplate.