Q-Heatmap: Visualizing Quantization Error Across Model Layers


[Figure: Pipeline Architecture]

The Problem

Quantizing a model from FP16 to 4-bit cuts weight memory to roughly a quarter, but uniform quantization silently destroys accuracy in a handful of sensitive layers while leaving most layers nearly lossless.

NEO built Q-Heatmap to identify exactly which layers cannot survive aggressive compression and which can be pushed further without measurable degradation.

Per-Layer Error Metrics

Q-Heatmap loads both the full-precision model and its quantized counterpart and computes three complementary error signals for each layer. This multi-metric approach matters because different layers fail in different ways: some show a large weight distance but survive functionally, while others show a small weight shift but produce dramatically different activations.

The three metrics computed per layer are weight distance, activation MSE, and KL divergence. A typical audit looks like this:

from q_heatmap import QuantizationAuditor

auditor = QuantizationAuditor(
    fp16_model="meta-llama/Llama-3.1-8B",
    quantized_model="./models/llama-3.1-8b-Q4_K_M.gguf",
    format="gguf",
    calibration_data="./calib/c4_512samples.jsonl"
)

results = auditor.compute_layer_errors()
auditor.render_heatmap(results, output="heatmap.html")
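As a rough illustration of what the three per-layer signals measure, here is a minimal NumPy sketch. It is not Q-Heatmap's implementation: the function names are hypothetical, weight distance is assumed to be a relative L2 norm, and KL divergence is assumed to be taken between softmax distributions of the two models' outputs.

```python
import numpy as np

def weight_distance(w_fp, w_q):
    """Relative L2 distance between full-precision and dequantized weights (assumed definition)."""
    return float(np.linalg.norm(w_fp - w_q) / (np.linalg.norm(w_fp) + 1e-12))

def activation_mse(a_fp, a_q):
    """Mean squared error between the two models' activations on the same inputs."""
    return float(np.mean((a_fp - a_q) ** 2))

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def kl_divergence(logits_fp, logits_q):
    """Mean KL(P_fp || P_q) over rows, with P = softmax(logits) (assumed definition)."""
    lp = log_softmax(logits_fp)
    lq = log_softmax(logits_q)
    return float(np.mean(np.sum(np.exp(lp) * (lp - lq), axis=-1)))

# Toy data: rounding to a 0.25 grid stands in for real quantization.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w_q = np.round(w * 4) / 4
x = rng.normal(size=(8, 64))

print("weight distance:", weight_distance(w, w_q))
print("activation MSE: ", activation_mse(x @ w, x @ w_q))
print("KL divergence:  ", kl_divergence(x @ w, x @ w_q))
```

In a real audit the activations would come from running the calibration samples through both models; the random matrices here only show the shape of the computation.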

Interactive Heatmap Rendering

The heatmap renders as a Plotly HTML file with layers on the Y-axis and bit-depth configurations on the X-axis (Q2, Q3, Q4, Q5, Q6, Q8). Each cell is colored by the composite error score, a weighted average of normalized weight distance, activation MSE, and KL divergence. Hovering over a cell shows the raw values for all three metrics.
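One way such a composite score could be formed is sketched below. The min-max normalization and the equal weights are assumptions, since the exact scheme is not specified here, and `composite_score` is a hypothetical helper. The demo uses only the Q4 activation MSE and KL values from the table in this section, because weight distance is not listed there.

```python
import numpy as np

def min_max(values):
    """Scale a metric to [0, 1] across layers; a constant metric maps to all zeros."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    return np.zeros_like(v) if span == 0 else (v - v.min()) / span

def composite_score(metrics, weights):
    """Weighted average of min-max normalized metrics (assumed scheme)."""
    total = sum(weights[name] for name in metrics)
    score = sum(weights[name] * min_max(vals) for name, vals in metrics.items())
    return score / total

layers = ["layer.0", "layer.1", "layer.16", "layer.31"]
q4_metrics = {
    "activation_mse": [0.0041, 0.0312, 0.0019, 0.0298],
    "kl_div": [0.089, 0.241, 0.017, 0.198],
}
weights = {"activation_mse": 0.5, "kl_div": 0.5}  # illustrative equal weighting

scores = composite_score(q4_metrics, weights)
for name, s in zip(layers, scores):
    print(f"{name}: {s:.3f}")
```

Normalizing each metric across layers before averaging keeps a metric with a larger numeric range (here KL divergence) from dominating the cell color.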

Typical output reveals a clear pattern: the first two and last two transformer layers are consistently high-error under aggressive quantization, while middle layers tolerate Q3-Q4 with near-zero activation drift. Attention projection layers (q_proj, k_proj) tend to be more sensitive than MLP layers (gate_proj, down_proj):

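| Layer            | Q4 Activation MSE | Q4 KL Div | Q8 Activation MSE | Q8 KL Div |
|------------------|-------------------|-----------|-------------------|-----------|
| layer.0 (embed)  | 0.0041            | 0.089     | 0.0002            | 0.004     |
| layer.1 (attn)   | 0.0312            | 0.241     | 0.0008            | 0.011     |
| layer.16 (attn)  | 0.0019            | 0.017     | 0.0001            | 0.002     |
| layer.31 (final) | 0.0298            | 0.198     | 0.0007            | 0.009     |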

Mixed-Precision Recommendations

Q-Heatmap goes beyond visualization and outputs a mixed-precision configuration file compatible with llama.cpp's --tensor-type argument and AutoGPTQ's per-layer bit assignment. High-error layers are assigned Q6 or Q8; low-error layers drop to Q3 or Q2.
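The assignment rule described above can be sketched as a simple threshold lookup. The cutoffs, the score values, and the helper names below are all illustrative assumptions rather than Q-Heatmap's actual defaults.

```python
import json

# Hypothetical cutoffs on a composite error score in [0, 1]; tune per model.
THRESHOLDS = ((0.75, 8), (0.50, 6), (0.20, 4), (0.05, 3))

def assign_bits(score, thresholds=THRESHOLDS, floor_bits=2):
    """Return the bit depth of the first cutoff the score reaches, else the floor."""
    for cutoff, bits in thresholds:
        if score >= cutoff:
            return bits
    return floor_bits

def build_plan(scores):
    """Map each layer name to a recommended bit depth."""
    return {name: assign_bits(s) for name, s in scores.items()}

# Illustrative composite scores, not measured values.
scores = {"layer.0": 0.20, "layer.1": 1.00, "layer.16": 0.00, "layer.31": 0.88}
plan = build_plan(scores)
print(json.dumps(plan, indent=2))
```

Keeping the thresholds in one table makes it easy to trade size against accuracy: raising the cutoffs pushes more layers to lower bit depths.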

In practice, mixed-precision configurations produced by Q-Heatmap recover 80-90% of the perplexity lost in uniform Q4 quantization while adding only 5-8% to the total model size:

python q_heatmap.py \
  --fp16 meta-llama/Llama-3.1-8B \
  --quantized ./models/llama-3.1-8b-Q4_K_M.gguf \
  --format gguf \
  --calibration ./calib/c4_512samples.jsonl \
  --output heatmap.html \
  --export_config mixed_precision.json

The exported mixed_precision.json maps each tensor name to its recommended bit depth and can be passed directly to llama-quantize for re-quantization.
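If the config needs to be turned into command-line arguments by hand, a conversion along these lines may help. This is a sketch under assumptions: the JSON is taken to be a flat object of tensor name to bit depth as described above, the mapping from bit depth to quantization type names is a guess, and the tensor names in the example are placeholders. Check the type spellings and name patterns against `llama-quantize --help` for your llama.cpp build.

```python
import json

# Assumed mapping from bit depth to a llama.cpp quantization type name.
BITS_TO_GGML = {2: "q2_k", 3: "q3_k", 4: "q4_k", 5: "q5_k", 6: "q6_k", 8: "q8_0"}

def tensor_type_args(config):
    """Turn {tensor_name: bits} into repeated --tensor-type NAME=TYPE arguments."""
    args = []
    for name, bits in sorted(config.items()):
        args += ["--tensor-type", f"{name}={BITS_TO_GGML[int(bits)]}"]
    return args

# Placeholder config; a real one would be loaded from mixed_precision.json.
example = json.loads('{"attn_q": 8, "ffn_down": 3}')
cli_args = tensor_type_args(example)
print(" ".join(cli_args))
```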

How to Build This with NEO

Open NEO in VS Code or Cursor and describe what you want to build. A good starting prompt for this project:

"Build a quantization error analysis tool that loads a full-precision model and its GGUF or GPTQ quantized counterpart, computes per-layer weight distance, activation MSE, and KL divergence on a calibration dataset, then renders an interactive Plotly heatmap with layers on one axis and bit depths on the other, and exports a mixed-precision configuration file."


NEO generates the project structure and core implementation. From there you iterate: ask it to add AWQ format support, integrate perplexity benchmarking against Wikitext-2, or build a CLI batch mode that audits an entire model zoo and compares quantization sensitivity across architectures. Each request builds on what's already there.

To run the finished project:

git clone https://github.com/dakshjain-1616/q-heatmap
cd q-heatmap
pip install -r requirements.txt
python q_heatmap.py --fp16 meta-llama/Llama-3.1-8B --quantized ./models/model-Q4_K_M.gguf --format gguf

Open heatmap.html to explore per-layer quantization sensitivity and download the mixed-precision config for re-quantization.

NEO built a per-layer quantization error visualizer that identifies which transformer layers break under compression and generates mixed-precision configurations to recover accuracy. See what else NEO ships at heyneo.com.
