Quantization Tradeoff Chart Interactive

bits-per-byte vs artifact size · quant gap before/after STE QAT · click any scheme for details

val_bpb (y) vs artifact size MB (x) — all public entries, colored by quant scheme

Quantization schemes

int8 [-128, 127]
  size efficiency: 40% · bits/weight: 8 · quant gap: ~0.007 · status: baseline

int6 + STE QAT [-32, 31]
  size efficiency: 75% · bits/weight: 6 · quant gap: ~0.0001 · status: in base

int5 MLP + STE QAT [-16, 15]
  size efficiency: 90% · bits/weight: 5 · saves vs int6: ~1.86 MB · status: P1 target

FP16 embed (IEEE 754 half)
  quality preservation: 100% · bits/weight: 16 · quant gap reduction: ×32 · status: in base

Mixed (SOTA): int5 MLP / int6 attn / fp16 emb
  optimal balance: 95% · val bpb: 1.1428 · artifact: ~15.9 MB · status: SOTA

How quantization works in this contest

Why quantize at all?

The 16MB budget covers code + compressed weights. A full-precision (float32) transformer would easily exceed this. Quantization maps each weight to a small integer, then dequantizes at inference time. Fewer bits per weight = more weights = better model quality per byte.
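A back-of-envelope sketch makes the pressure concrete (the budget figure is from the text; ignoring code size and compression for simplicity):

```python
BUDGET_MB = 16

def params_that_fit(bits_per_weight: float, budget_mb: float = BUDGET_MB) -> float:
    """How many weights fit in the budget at a given precision
    (ignores the code's share of the budget and any compression)."""
    budget_bits = budget_mb * 1e6 * 8
    return budget_bits / bits_per_weight

print(f"fp32: {params_that_fit(32) / 1e6:.1f}M weights")  # 4.0M
print(f"int6: {params_that_fit(6) / 1e6:.1f}M weights")   # 21.3M
print(f"int5: {params_that_fit(5) / 1e6:.1f}M weights")   # 25.6M
```

Dropping from fp32 to int5 buys over 6× the parameter count in the same 16 MB.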

The quant gap problem

Naively quantizing a trained model creates a quantization gap — the difference between pre-quant bpb and post-quant bpb. For int8 this was ~0.007. The key breakthrough was STE QAT: training with simulated quantization so the model learns to work within integer constraints, shrinking the gap to ~0.0001.
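The gap is simply rounding noise the trained weights were never exposed to. A minimal sketch of naive post-training quantization on toy weights (not the contest model):

```python
import random

def quantize_dequantize(w, qmax):
    """Naive symmetric quantization: scale into [-qmax, qmax], round, scale back."""
    scale = max(abs(v) for v in w) / qmax
    return [round(v / scale) * scale for v in w]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1000)]
recon = quantize_dequantize(weights, qmax=127)  # int8-style round trip
mse = sum((a - b) ** 2 for a, b in zip(weights, recon)) / len(weights)
# The model was never trained to tolerate this rounding noise, so it
# surfaces directly as a bpb gap at evaluation time.
```

QAT does not remove the rounding; it lets training route around it, which is why the gap shrinks by roughly two orders of magnitude.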

Straight-Through Estimator (STE)

The rounding operation round(x) has zero gradient almost everywhere (and no gradient at all at the jumps) — you can't backprop through it. STE is a trick: during the forward pass, apply real quantization. During backward, pretend the round was the identity function (gradient passes straight through). This enables end-to-end training with quantization.
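A framework-free sketch of the two halves of the trick (the torch-style one-liner is shown in a comment; names here are illustrative):

```python
def fake_quant_forward(x, scale, qmax):
    """Forward pass: real quantization (round + clamp), then dequantize."""
    q = max(-qmax - 1, min(qmax, round(x / scale)))
    return q * scale

def fake_quant_backward(grad_out):
    """Backward pass: pretend round() was the identity, so the
    upstream gradient passes straight through unchanged."""
    return grad_out

# In PyTorch the same idea is usually written as
#   x_q = x + (quantize(x) - x).detach()
# so autograd sees the identity while the forward value is quantized.
```

The forward value is exactly what inference will see; only the gradient is a lie, and it is a useful one.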

Per-row scaling

Instead of one global scale factor for a whole layer (which gets dominated by outliers), each row of the weight matrix gets its own scale: scale = max(|row|) / 15 for int5. This dramatically reduces the representational error for rows with small magnitudes while still covering the full range of large-magnitude rows.
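The scheme sketched for int5 on a toy matrix (in practice the integer rows and their float scales are what get stored):

```python
def quantize_per_row(matrix, qmax=15):
    """Each row gets scale = max(|row|) / qmax; returns int rows + one scale per row."""
    q_rows, scales = [], []
    for row in matrix:
        scale = max(abs(v) for v in row) / qmax or 1.0  # guard all-zero rows
        scales.append(scale)
        q_rows.append([round(v / scale) for v in row])
    return q_rows, scales

def dequantize(q_rows, scales):
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

W = [[0.01, -0.02, 0.005],   # small-magnitude row keeps its own fine scale...
     [1.5, -3.0, 0.75]]      # ...instead of inheriting this row's coarse one
qW, s = quantize_per_row(W)
```

With a single global scale, the first row would collapse to a handful of integer levels; per-row scaling gives it the full [-15, 15] range.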

Why int5 for MLP, int6 for attention?

MLP weights are more numerous and less sensitive to the tighter range [-16,15] — they benefit from a smoother learned distribution. Attention weights carry position-sensitive geometry where the extra precision of int6 [-32,31] matters. The split scheme saves ~1.86 MB (funds a 10th layer) with negligible quality cost.
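The ~1.86 MB figure is just one bit per MLP weight. A back-of-envelope check — note the ~14.9M MLP parameter count below is back-solved from the stated saving, not given in the source:

```python
MLP_PARAMS = 14.9e6            # hypothetical; inferred from the ~1.86 MB saving
bits_saved_per_weight = 6 - 5  # int6 -> int5 on MLP weights only
saving_mb = MLP_PARAMS * bits_saved_per_weight / 8 / 1e6
print(f"{saving_mb:.2f} MB")   # 1.86 MB
```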

zstd-22 vs zlib

After quantizing to integer arrays, the weights are compressed before packaging. zstd at level 22 achieves better compression ratios than zlib on weight data (which has structured, near-uniform distributions from quantization). This is a free size saving — no model changes required.
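A sketch of the packaging step using stdlib zlib on toy int5-like weight bytes (the zstd side is assumed to run via the `zstd -22` CLI or the third-party `zstandard` package, neither of which is in the Python stdlib; it typically compresses the same payload tighter):

```python
import random
import struct
import zlib

random.seed(0)
# Toy quantized weights: values cluster near zero, so the byte stream
# carries well under 8 bits of entropy per byte and compresses.
q = [max(-16, min(15, round(random.gauss(0, 4)))) for _ in range(100_000)]
raw = struct.pack(f"{len(q)}b", *q)

packed = zlib.compress(raw, level=9)
ratio = len(packed) / len(raw)
# zstd at level 22 (e.g. `zstd -22 weights.bin`) usually beats this
# ratio on the same data — a free saving, no model changes required.
```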