Gemma 4 Quantization: Running Google's Latest Model on Consumer Hardware

Pipeline Architecture

The Problem

Gemma 4 in full BF16 precision requires around 52 GB of VRAM — out of reach for most developer machines. Without a systematic pipeline, picking the right quantization level is guesswork: you lose too much quality at Q4 or leave memory on the table at Q8.

NEO built this quantization pipeline to automate the export, benchmark perplexity across bit depths, and generate a hardware-aware recommendation so you can make an informed tradeoff without manual trial and error.

GGUF Export Pipeline

The export pipeline converts a Gemma 4 checkpoint from HuggingFace format to GGUF at three quantization levels: Q4_K_M, Q5_K_M, and Q8_0. Each level uses llama.cpp's quantization tooling under the hood, invoked programmatically through a Python wrapper that handles the conversion steps, output naming, and file size verification automatically.

The K-quant variants (Q4_K_M, Q5_K_M) use importance-matrix–aware quantization, which preserves precision on the layers that matter most for output quality — attention projections and the final unembedding layer — while aggressively compressing feed-forward weights. Q8_0 is near-lossless and included as the quality ceiling against which the others are measured.

python export.py \
  --model-id google/gemma-4-9b \
  --quant-levels Q4_K_M Q5_K_M Q8_0 \
  --output-dir ./gguf/

Expected output sizes for Gemma 4 9B:

Quantization	File Size	VRAM Required
BF16 (base)	18.4 GB	~20 GB
Q8_0	9.8 GB	~11 GB
Q5_K_M	6.3 GB	~7 GB
Q4_K_M	4.9 GB	~6 GB

Perplexity Benchmarking and Quality Scoring

After export, the pipeline runs a perplexity benchmark on each quantized model against the WikiText-103 validation set. Perplexity is computed relative to the BF16 baseline, expressed as a percentage degradation. A well-quantized Q5_K_M typically shows less than 1% perplexity increase; Q4_K_M typically lands between 2–4%.

The pipeline also runs a task quality benchmark using a sample of 200 prompts across three categories: instruction following, coding, and factual QA. Each response is scored by a lightweight judge model on accuracy and coherence. This catches cases where perplexity looks acceptable but task performance has degraded — a known failure mode for models with large vocabulary sizes.

Results are written to report/quality_report.json:

{
  "Q4_K_M": { "perplexity_delta": "+3.1%", "task_score": 0.847 },
  "Q5_K_M": { "perplexity_delta": "+0.8%", "task_score": 0.901 },
  "Q8_0":   { "perplexity_delta": "+0.1%", "task_score": 0.934 }
}

Hardware-Aware Recommendation Engine

The final stage takes a hardware profile as input — total VRAM, whether Metal or CUDA is available, and target inference speed in tokens per second — and outputs a ranked recommendation. The engine models inference speed using empirical throughput measurements taken during the benchmark phase, not theoretical estimates.

Apple Silicon support is included via llama.cpp's Metal backend, which gives M-series chips near-native GPU acceleration for quantized GGUF models. CUDA support covers NVIDIA cards from the RTX 3000 series up. The recommendation engine accounts for unified memory on Apple Silicon, where VRAM and RAM share the same pool, giving M2/M3 Pro and Max chips a significant effective-memory advantage over discrete GPU setups.

How to Build This with NEO

Open NEO in VS Code or Cursor and describe what you want to build. A good starting prompt for this project:

"Build a Python quantization pipeline for Gemma 4 that exports to GGUF at Q4_K_M, Q5_K_M, and Q8_0 using llama.cpp, benchmarks perplexity on WikiText-103 relative to the BF16 baseline, runs a task quality benchmark across instruction following, coding, and QA prompts, and generates a hardware-aware recommendation for which quantization level to use given VRAM, Metal or CUDA availability, and target tokens-per-second."

Build with NEO →

NEO generates the project structure and core implementation. From there you iterate — ask it to add support for additional Gemma 4 model sizes, extend benchmarking to include inference latency percentiles, or build a web UI for the recommendation engine. Each request builds on what's already there.

To run the finished project:

git clone https://github.com/dakshjain-1616/gemma4-quantization
cd gemma4-quantization
pip install -r requirements.txt
python export.py --model-id google/gemma-4-9b --quant-levels Q4_K_M Q5_K_M Q8_0
python benchmark.py --hardware-profile vram=8gb backend=cuda target-tps=30

The pipeline exports all three quantized models, runs the full benchmark suite, and writes a recommendation report to report/.

NEO built a Gemma 4 quantization pipeline that exports to GGUF at three bit depths, benchmarks perplexity and task quality against full-precision, and recommends the right tradeoff for your specific hardware. See what else NEO ships at heyneo.com.

Try NEO in Your IDE

Install the NEO extension to bring AI-powered development directly into your workflow:

VS Code: NEO in VS Code
Cursor: Install NEO for Cursor