[Figure: bits-per-byte vs artifact size, with the quantization gap before/after STE QAT shown per quantization scheme]
The 16MB budget covers code + compressed weights; a full-precision (float32) transformer would easily exceed it. Quantization maps each weight to a small integer, which is dequantized back to float at inference time. Fewer bits per weight means more weights fit in the budget, which means better model quality per byte.
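A minimal sketch of that round trip (NumPy, symmetric int8 with one global scale; the function names are illustrative, not from the actual artifact):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float32 weights to int8 plus one scale factor."""
    scale = np.abs(w).max() / 127.0           # largest magnitude maps to +/-127
    q = np.round(w / scale).astype(np.int8)   # the small integers that get stored
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(w)
print("max round-trip error:", np.abs(w - dequantize(q, scale)).max())  # at most scale/2
```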
Naively quantizing an already-trained model creates a quantization gap: the difference between pre-quantization and post-quantization bits-per-byte (bpb). For int8 this was ~0.007. The key breakthrough was STE QAT (quantization-aware training with a straight-through estimator): training with simulated quantization so the model learns to work within integer constraints, shrinking the gap to ~0.0001.
The rounding operation round(x) has zero gradient almost everywhere, so you can't backprop through it. The straight-through estimator (STE) is a trick: during the forward pass, apply real quantization; during the backward pass, pretend the round was the identity function, letting the gradient pass straight through. This enables end-to-end training with quantization in the loop.
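A minimal PyTorch sketch of one standard way to implement this, the detach pattern (int8 with a global scale here, purely for illustration):

```python
import torch

def fake_quantize(w: torch.Tensor) -> torch.Tensor:
    """Forward: real quantize + dequantize. Backward: identity (STE)."""
    scale = w.abs().max() / 127.0
    w_q = torch.round(w / scale).clamp(-128, 127) * scale
    # Detach trick: the forward value is w_q, but the gradient flows to w
    # as if the rounding were the identity function.
    return w + (w_q - w).detach()

w = torch.randn(4, 8, requires_grad=True)
loss = fake_quantize(w).square().sum()
loss.backward()
print(w.grad)  # nonzero: the gradient passed straight through round()
```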
Instead of one global scale factor for a whole layer (which gets dominated by outliers), each row of the weight matrix gets its own scale: scale = max(|row|) / 15 for int5. This dramatically reduces quantization error for rows with small magnitudes while still representing rows with large-magnitude weights.
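A sketch of the effect under that scale rule (NumPy; the matrix shape and the outlier row are contrived to show the failure mode of a single global scale):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32) * 0.05
w[0] *= 100  # one outlier row dominates any single global scale

def per_row_error(w, scale):
    q = np.clip(np.round(w / scale), -16, 15)  # int5 range
    return np.abs(w - q * scale).max(axis=1)   # worst-case error per row

global_scale = np.abs(w).max() / 15.0                     # one scale for the layer
row_scales = np.abs(w).max(axis=1, keepdims=True) / 15.0  # one scale per row

print("global: ", per_row_error(w, global_scale))  # small rows round to ~all zeros
print("per-row:", per_row_error(w, row_scales))    # error tracks each row's magnitude
```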
MLP weights are more numerous and less sensitive to the tighter range [-16,15] — they benefit from a smoother learned distribution. Attention weights carry position-sensitive geometry where the extra precision of int6 [-32,31] matters. The split scheme saves ~1.86 MB (funds a 10th layer) with negligible quality cost.
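A back-of-envelope check of that saving. The MLP weight count below is an assumption reverse-engineered from the ~1.86 MB figure, not a number from the source:

```python
# Hypothetical count, chosen so the arithmetic matches the quoted saving.
mlp_weight_count = 15_600_000

saved_bits = mlp_weight_count * (6 - 5)     # int5 saves 1 bit per MLP weight vs int6
print(saved_bits / 8 / 2**20, "MiB saved")  # ~1.86 MiB
```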
After quantizing to integer arrays, the weights are compressed before packaging. zstd at level 22 achieves better compression ratios than zlib on weight data (which has structured, near-uniform distributions from quantization). This is a free size saving — no model changes required.
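A sketch of that step with the python-zstandard bindings (storing one quantized value per byte as a stand-in; the artifact's actual serialization and bit packing are not specified here):

```python
import numpy as np
import zstandard  # pip install zstandard

# Stand-in for a quantized weight matrix: int5 values, one per byte.
rng = np.random.default_rng(0)
w = rng.standard_normal(1_000_000).astype(np.float32)
q = np.clip(np.round(w / (np.abs(w).max() / 15.0)), -16, 15).astype(np.int8)

raw = q.tobytes()
packed = zstandard.ZstdCompressor(level=22).compress(raw)
print(f"{len(raw):,} -> {len(packed):,} bytes ({len(packed) / len(raw):.1%})")

# Lossless round trip at load time.
restored = np.frombuffer(zstandard.ZstdDecompressor().decompress(packed), dtype=np.int8)
assert np.array_equal(restored, q)
```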