Unsloth’s Gemma-4 12B QAT GGUF: The First Quantization-Aware Gemma-4 for Local LLMs
I just grabbed Unsloth's new Gemma-4 12B QAT GGUF drop and ran it on my laptop. It is the first time I have seen a quantization-aware-trained Gemma-4 model hit the GGUF ecosystem, and the numbers are worth your attention.
What is actually in the repo
Unsloth took Google's Gemma-4 12B instruction-tuned checkpoint and applied quantization-aware training (QAT) before exporting to GGUF. The result is an 11.9B-parameter model that fits in a single file and runs on llama.cpp, Ollama, LM Studio, or any other GGUF loader. The "any-to-any" label means the QAT recipe was trained to survive aggressive bit-width reductions, so you can push it down to Q4_K_M, Q5_K_M, or even Q3_K_L without the usual accuracy collapse.
Why QAT matters for GGUF
Normal post-training quantization (PTQ) simply rounds weights after training. QAT simulates low-precision math during the forward pass, so the model learns to compensate for quantization error while it is still training. That is the difference between a model that merely survives quantization and one that was born for it.
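To make the "simulates low-precision math" part concrete, here is a minimal sketch of the quantize-then-dequantize step (often called fake quantization). It assumes symmetric, per-tensor round-to-nearest, which is a simplification chosen for illustration and not the scheme Unsloth or llama.cpp actually uses; `fake_quantize` and `mean_abs_error` are hypothetical helper names.

```python
# Sketch of "fake quantization": snap weights to a low-bit grid, then map them
# back to floats, so the forward pass sees the rounding error that inference
# will see. Symmetric per-tensor rounding is an assumption for illustration.

def fake_quantize(weights, bits=4):
    """Round each weight to a signed `bits`-bit grid and return floats."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for signed 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for w in weights:
        q = round(w / scale)                   # snap to the integer grid
        q = max(-qmax, min(qmax, q))           # clamp to the representable range
        out.append(q * scale)                  # dequantize back to float
    return out

def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Made-up weights, purely for demonstration.
weights = [0.82, -0.41, 0.07, -0.93, 0.30, 0.55]
err4 = mean_abs_error(weights, fake_quantize(weights, bits=4))
err8 = mean_abs_error(weights, fake_quantize(weights, bits=8))
print(f"4-bit mean abs error: {err4:.4f}")
print(f"8-bit mean abs error: {err8:.4f}")
```

PTQ applies a step like this once, after training is finished. In QAT, a step like this sits inside the training forward pass (with gradients typically passed through the rounding via a straight-through estimator), so the optimizer gets a chance to move weights to values that round well.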
For Gemma-4 specifically, a few architectural traits make PTQ painful: the interleaved local/global attention and the 256k context window can amplify rounding errors. QAT smooths those out before the weights are ever written to a GGUF file.
File sizes and what they mean in practice
| Quantization | File Size | Nominal Bits per Weight | Use Case |
|---|---|---|---|
| Q4_K_M | ~7.2 GB | 4.0 | Balanced quality, 24 GB VRAM or 64 GB RAM |
| Q5_K_M | ~8.8 GB | 5.0 | Near-native quality, 32 GB VRAM or 80 GB RAM |
| Q6_K | ~10.4 GB | 6.0 | Best quality for GGUF, 40 GB VRAM or 96 GB RAM |
| Q8_0 | ~13.6 GB | 8.0 | Effectively lossless, 48 GB VRAM or 128 GB RAM |
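As a sanity check on the table, the file sizes can be turned back into an effective bits-per-weight figure. The sketch below assumes 1 GB means 10^9 bytes and treats the whole file as weights (metadata and tokenizer overhead are ignored), so the results are rough upper bounds; `effective_bpw` is a hypothetical helper name.

```python
# Effective bits per weight implied by each GGUF file size.
# Assumptions: 1 GB = 1e9 bytes, and the entire file is weight data.

PARAMS = 11.9e9  # parameter count quoted in this post

FILE_SIZES_GB = {  # approximate sizes from the table
    "Q4_K_M": 7.2,
    "Q5_K_M": 8.8,
    "Q6_K": 10.4,
    "Q8_0": 13.6,
}

def effective_bpw(size_gb, params=PARAMS):
    """Convert a file size in GB into bits stored per parameter."""
    return size_gb * 1e9 * 8 / params

for name, size in FILE_SIZES_GB.items():
    print(f"{name}: {effective_bpw(size):.2f} bits per weight")
```

The effective numbers come out higher than the whole-number bit widths in the table. That is expected: llama.cpp's block formats store per-block scales alongside the quantized values, and the mixed K-quant presets keep some tensors at higher precision.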
I tested the Q4_K_M variant on an M3 Max with 36 GB RAM. It loads in about 12 seconds and runs at roughly 28 tokens per second at a 4k context. The 256k context window is overkill on a Mac, but the fact that it loads at all in GGUF is what matters.
How to run it
Ollama one-liner:
ollama run hf.co/unsloth/gemma-4-12B-it-qat-GGUF:Q4_K_M
llama.cpp server:
./llama-server -m gemma-4-12B-it-qat-Q4_K_M.gguf -c 32768 --host 0.0.0.0
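Once the server is up, you can talk to it over its OpenAI-compatible chat endpoint. Here is a minimal standard-library sketch, assuming the server is reachable on localhost at llama-server's default port 8080 (the command above does not set `--port`). `build_chat_request` and `ask` are hypothetical helper names, and the sampling values are arbitrary.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on its default port, 8080.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"

def build_chat_request(prompt, max_tokens=256, url=SERVER_URL):
    """Build (but do not send) a chat-completion request."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,  # arbitrary choice for this sketch
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Inspect the request without touching the network.
req = build_chat_request("Write a haiku about quantization.")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With the server running, `print(ask("Write a haiku about quantization."))` sends the request. Going through the chat endpoint also means the server applies the model's chat template for you.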
LM Studio users can drag the file into the UI and select the QAT model from the dropdown. The QAT suffix in the file name is for you, not the loader: the file loads like any other GGUF, and the suffix is how you tell the quantization-aware checkpoint apart from a plain PTQ export of the same model.
What I noticed during testing
I ran a few coding benchmarks against the standard Gemma-4 12B PTQ GGUF. The QAT version holds a small but consistent edge on HumanEval-style prompts, roughly 2-3 percentage points. The bigger win is stability: long-context summarization at 128k tokens does not drift the way it does with some PTQ models. I suspect QAT forced the attention norms to stay robust under low precision.
One thing to watch: the tokenizer is the Gemma-4 native one, so it uses the <start_of_turn> and <end_of_turn> tokens for chat formatting. Most loaders handle this automatically, but if you are feeding raw prompts to llama.cpp, you have to wrap each turn in those tokens yourself.
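For raw prompts, the wrapping looks roughly like this. The sketch follows the turn layout documented for earlier Gemma releases and assumes Gemma-4 keeps it, including the assistant role being named `model`. It also leaves out the beginning-of-sequence token on the assumption that the loader adds it. `format_gemma_prompt` is a hypothetical helper name.

```python
# Gemma-style turn formatting with <start_of_turn>/<end_of_turn> markers.
# Assumptions: roles are "user" and "model" as in earlier Gemma releases,
# and the loader prepends the beginning-of-sequence token itself.

def format_gemma_prompt(messages):
    """messages: list of (role, text) pairs, role being "user" or "model"."""
    parts = []
    for role, text in messages:
        if role not in ("user", "model"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<start_of_turn>{role}\n{text.strip()}<end_of_turn>\n")
    # Leave a model turn open so generation continues as the assistant.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)

prompt = format_gemma_prompt([("user", "Explain QAT in one sentence.")])
print(prompt)
```

If the model's replies trail off or it starts writing the user's side of the conversation, a missing or malformed turn marker is the first thing to check.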
The bottom line
If you are running local LLMs and want a 12B-class model that actually respects quantization, this is the strongest Gemma-4 GGUF option I have tested. QAT is not marketing fluff: it shows up in the context stability and the coding scores. Unsloth's packaging is clean, the file naming is consistent, and the any-to-any flexibility means you can trade quality for size without guessing.
I will keep this loaded on my daily driver for a week and report back if anything breaks.
Source: https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF