Qwen 3.5 27B GGUF: Automated Quantization with Quality Gating

The Problem

Quantizing a 27B model from fp16 to a 4-bit format cuts file size by roughly 70%, but it can silently degrade quality. Most quantization workflows hand you a smaller file without telling you whether the model still works, and benchmarking by hand after every quant is slow and does not integrate into CI.

NEO built this pipeline to automate the full cycle for Qwen3.5-27B: download the fp16 weights, quantize to multiple GGUF formats, benchmark each one, and exit non-zero if any format degrades beyond a threshold. The result is a CI-ready script that protects against silent accuracy regression.

The Quantization Formats

GGUF is the binary format used by llama.cpp for quantized model weights. Each format represents a different tradeoff between file size and model quality.

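The pipeline benchmarks eight formats in a single run: F16 (16-bit, unquantized baseline), Q8_0 (8-bit, near-lossless), Q6_K (6-bit, k-quant), Q5_K_M (5-bit medium), Q4_K_M (4-bit medium, the most popular tradeoff), Q4_0 (4-bit simple), Q3_K_M (3-bit medium), and Q2_K (2-bit, extreme compression). The k-quant variants (names containing _K) group weights into super-blocks whose per-block scales are themselves quantized, and the _M mixes keep a few sensitive tensors at higher precision. This generally gives better quality per bit than the simple per-block linear scheme used by Q4_0 and Q8_0. A calibration dataset is optional: llama.cpp can use one to build an importance matrix that guides quantization, but k-quants do not require it.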
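
To make "simple linear quantization" concrete, here is a minimal sketch of Q8_0-style block quantization in plain Python. It is illustrative only: llama.cpp implements this in C and stores one 16-bit scale per block of 32 weights, while the function names and sample weights below are hypothetical.

```python
BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32


def quantize_block(block):
    """Return (scale, integer codes) for one block of float weights."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0.0, [0] * len(block)
    scale = amax / 127.0  # the largest magnitude maps to +/-127
    codes = [round(x / scale) for x in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate float weights from the scale and codes."""
    return [scale * c for c in codes]


# Synthetic weights in [-0.32, 0.31], for illustration only.
weights = [((i * 37) % 64 - 32) / 100.0 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.6f} max_error={max_error:.6f}")
```

The rounding error per weight is bounded by half the scale, which is why formats with fewer bits (and therefore coarser steps) lose more accuracy.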

For Qwen3.5-27B, the fp16 model is approximately 54GB. Q4_K_M brings this to around 16GB, small enough to fit in 24GB of VRAM or run entirely in RAM on a machine with 32GB.
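
The sizes above follow from simple arithmetic: parameter count times bits per weight. The sketch below reproduces them under stated assumptions. The figures for F16, Q8_0, and Q4_0 come from their block layouts; the Q4_K_M figure is an approximate average that varies by model, because that format mixes tensor types. The helper name is hypothetical.

```python
def estimated_size_gb(n_params, bits_per_weight):
    """Approximate file size in GB (10^9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 27e9  # nominal parameter count for a 27B model

BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,     # 32 x 8-bit codes plus one 16-bit scale per block
    "Q4_0": 4.5,     # 32 x 4-bit codes plus one 16-bit scale per block
    "Q4_K_M": 4.85,  # approximate; depends on the model's tensor mix
}

for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name:7s} ~{estimated_size_gb(N_PARAMS, bpw):.1f} GB")
```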

Quality Gating

The key feature of this pipeline is the quality gate: a configurable threshold (default 5%) on accuracy degradation. After quantizing each format, the script runs a benchmark against the fp16 baseline. If any format degrades accuracy by more than the threshold, the process exits with a non-zero code.

This makes the script CI-compatible. Add it to a GitHub Actions workflow and every merge that changes the quantization configuration gets an automated quality check. A failed check means the quant did not meet your standards; a passing check means you have a verified artifact.

# The pipeline exits non-zero if degradation > 5%
python pipeline.py --model qwen3.5-27b --threshold 0.05
echo $?  # 0 = all formats passed, 1 = at least one failed
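
The gate logic itself can be sketched in a few lines of Python. This is not the project's actual code: the function names are hypothetical, the scores are made-up numbers for illustration, and it assumes the threshold is a relative drop against the fp16 baseline (the description above does not say whether the 5% is relative or absolute).

```python
def relative_degradation(baseline, score):
    """Fractional accuracy drop versus the baseline (assumed relative)."""
    return (baseline - score) / baseline


def failing_formats(baseline, scores, threshold=0.05):
    """Names of formats whose degradation exceeds the threshold."""
    return [
        name
        for name, score in scores.items()
        if relative_degradation(baseline, score) > threshold
    ]


def exit_code(failed):
    """0 when every format passed, 1 when at least one failed."""
    return 1 if failed else 0


# Illustrative numbers only, not benchmark results.
baseline_accuracy = 0.80
scores = {"Q8_0": 0.80, "Q4_K_M": 0.78, "Q2_K": 0.70}

failed = failing_formats(baseline_accuracy, scores)
print("failed:", failed, "exit code:", exit_code(failed))
# A real entry point would end with: sys.exit(exit_code(failed))
```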

The benchmark results are stored in a JSONL log, one entry per format per run. The regression detector reads this log and flags when a format that previously passed now fails. This catches cases where a llama.cpp update changes quantization behavior.
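
A minimal sketch of that log and the regression check follows. The field names (run_id, format, accuracy, passed) and both function names are assumptions, as is the rule that entries appear in chronological order; the sample accuracies are invented for the demonstration.

```python
import json
import os
import tempfile


def append_result(path, run_id, fmt, accuracy, passed):
    """Append one JSONL entry: one line per format per run."""
    entry = {"run_id": run_id, "format": fmt,
             "accuracy": accuracy, "passed": passed}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def find_regressions(path):
    """Formats that passed in an earlier run but fail in their latest run."""
    history = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                entry = json.loads(line)
                history.setdefault(entry["format"], []).append(entry["passed"])
    return sorted(
        fmt for fmt, runs in history.items()
        if any(runs[:-1]) and not runs[-1]
    )


with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "results.jsonl")
    append_result(log, "run-1", "Q4_K_M", 0.78, True)
    append_result(log, "run-1", "Q2_K", 0.70, False)
    append_result(log, "run-2", "Q4_K_M", 0.74, False)  # previously passed
    append_result(log, "run-2", "Q2_K", 0.70, False)    # never passed
    regressions = find_regressions(log)

print(regressions)
```

Q2_K is not reported here because it never passed; only a pass followed by a failure counts as a regression.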

GPU Hardware Detection

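Before running the benchmark, the pipeline detects available VRAM and recommends a quantization format. If you have 8GB of VRAM, it recommends Q4_K_M. If you have 16GB, it recommends Q5_K_M or Q6_K. If you have no GPU, it recommends running benchmarks on CPU with a small prompt set. Note that at 27B parameters these files are larger than the VRAM tiers that trigger them (Q4_K_M alone is around 16GB), so on such cards some layers would have to stay on the CPU.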

This detection is read-only. The pipeline never changes its quantization targets based on hardware. It just prints the recommendation and proceeds with all configured formats. You can override the recommendation by passing a specific format list.
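
One way such a detection step could look, as a hedged sketch: it queries nvidia-smi (NVIDIA GPUs only) and maps the result to the recommendations described above. The function names are hypothetical, the thresholds sit half a gigabyte below the nominal tiers because cards usually report slightly less than their advertised size, and the fallback for GPUs under 8GB is an assumption since that case is not described.

```python
import subprocess


def detect_vram_gb():
    """Total VRAM of the first NVIDIA GPU in GiB, or None if unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
        lines = out.strip().splitlines()
        return float(lines[0]) / 1024 if lines else None  # MiB -> GiB
    except (OSError, ValueError, subprocess.SubprocessError):
        return None  # no driver, no GPU, or unparseable output


def recommend_format(vram_gb):
    """Advisory only: the result is printed, never used to pick targets."""
    if vram_gb is None:
        return "CPU benchmark with a small prompt set"
    if vram_gb >= 15.5:
        return "Q5_K_M or Q6_K"
    if vram_gb >= 7.5:
        return "Q4_K_M"
    # Assumption: treat very small GPUs like the no-GPU case.
    return "CPU benchmark with a small prompt set"


print("Recommendation:", recommend_format(detect_vram_gb()))
```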

A dry-run mode skips all GPU operations and model downloads. It validates the pipeline configuration, checks that llama.cpp binaries are present, and prints what would run. Dry-run mode is how you test CI configuration without paying for GPU time.
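
A dry run of this kind can be sketched with nothing more than a PATH lookup. The binary names below are real llama.cpp tools, but which ones the pipeline actually requires is an assumption, and the function names are hypothetical.

```python
import shutil

# Assumed requirements; the real pipeline may check different binaries.
REQUIRED_BINARIES = ("llama-quantize", "llama-perplexity")


def missing_binaries(names=REQUIRED_BINARIES, which=shutil.which):
    """Return the required binaries that are not found on PATH."""
    return [name for name in names if which(name) is None]


def dry_run_plan(formats, which=shutil.which):
    """Validate the environment and list the steps, without running them."""
    missing = missing_binaries(which=which)
    steps = [f"would quantize and benchmark {fmt}" for fmt in formats]
    return missing, steps


missing, steps = dry_run_plan(["Q8_0", "Q4_K_M"])
for step in steps:
    print(step)
if missing:
    print("missing binaries:", ", ".join(missing))
```

Passing the lookup function as a parameter keeps the check testable on machines where llama.cpp is not installed.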

How to Build This with NEO

Open NEO in VS Code or Cursor and describe what you want to build. A good starting prompt for this project:

"Build a CI-ready quantization pipeline for Qwen3.5-27B using llama.cpp that benchmarks eight GGUF formats in a single run: F16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q4_0, Q3_K_M, and Q2_K. After quantizing each format, run a benchmark against the fp16 baseline and exit non-zero if any format degrades accuracy by more than a configurable threshold (default 5%). Log all results to JSONL, one entry per format per run. Add GPU VRAM detection that recommends the optimal format before running. Include a --check-regression flag that reads historical JSONL and flags formats that previously passed but now fail. Add dry-run mode that validates llama.cpp binary presence and pipeline config without downloading models or using GPU."

NEO generates the project structure and core implementation. From there you iterate: ask it to add a --formats flag that accepts a subset of the eight formats so CI jobs can target specific variants, to produce a Markdown table comparing size and accuracy across all formats, or to add a GitHub Actions workflow template that runs the quality gate on every config change.
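
For orientation, a command-line interface covering the flags mentioned in this article might look like the sketch below. It is not the generated project's code. The --model, --threshold, --formats, and --check-regression names come from the text; the --dry-run spelling and the space-separated form of --formats are assumptions.

```python
import argparse

FORMATS = ["F16", "Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q4_0", "Q3_K_M", "Q2_K"]


def build_parser():
    parser = argparse.ArgumentParser(prog="pipeline.py")
    parser.add_argument("--model", required=True)
    parser.add_argument("--threshold", type=float, default=0.05,
                        help="maximum allowed accuracy degradation")
    parser.add_argument("--formats", nargs="+", choices=FORMATS,
                        default=FORMATS, help="subset of formats to run")
    parser.add_argument("--check-regression", action="store_true",
                        help="compare against the historical JSONL log")
    parser.add_argument("--dry-run", action="store_true",
                        help="validate config without downloads or GPU use")
    return parser


# Parse an explicit argument list so the example runs anywhere.
args = build_parser().parse_args(
    ["--model", "qwen3.5-27b", "--formats", "Q4_K_M", "Q8_0"]
)
print(args.formats, args.threshold, args.dry_run)
```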

To run the finished project:

git clone https://github.com/dakshjain-1616/qwen3-5-27b-gguf
cd qwen3-5-27b-gguf
pip install -r requirements.txt
python examples/01_quick_start.py

The quick start runs without a GPU or model download. For real hardware, run pipeline.py with your target formats and a --threshold 0.05 quality gate.

NEO built a CI-ready quantization pipeline for Qwen3.5-27B that shrinks the model from 54GB to 16GB with automated quality gating on every format. See what else NEO ships at heyneo.com.