Google DeepMind released DiffusionGemma on June 10, 2026, an experimental open model that generates text in parallel blocks rather than one token at a time. The approach delivers up to 4x faster single-user generation on dedicated GPUs, hitting 1,000+ tokens per second on a single NVIDIA H100. It also lands with a trade-off Google does not hide: benchmark scores trail the standard autoregressive Gemma 4 across most of the published test suite.
The release targets the part of AI work that feels slow: one user, one local box, one streaming response. Most large language models in production are autoregressive, generating each token from the one before it, which leaves a single-user GPU underused. DiffusionGemma shifts the workload from memory-bound to compute-bound and lets the GPU denoise a 256-token block in a single forward pass, per the launch announcement introducing DiffusionGemma. The result is a model that thinks in blocks, and that comes with quality, length, and serving caveats Google flags in the same announcement.
A New Way to Write Text
Most language models behave like a typewriter. They emit one token, wait for the next, and let the GPU sit idle while memory fetches the next weights. Google DeepMind’s new DiffusionGemma inverts that pattern. The model starts with a 256-token block of random placeholder tokens and denoises the whole block in parallel, the way image diffusion starts from static and refines a picture. The difference is that the canvas is text, not pixels.
Each forward pass touches every token on the canvas at once. That bidirectional attention lets the model resolve constraints a left-to-right model cannot, like closing a markdown tag, filling a Sudoku grid, or infilling a function body. The mechanism draws on the same Gemini Diffusion research that preceded it, then layers a new diffusion head on the Gemma 4 backbone.
For developers, the practical shift is from memory-bound to compute-bound. The old model spent most of its time waiting on memory bandwidth. The new model pulls a full 256-token block through the transformer in parallel, exactly the workload NVIDIA Tensor Cores are tuned for. Sebastian Flennerhag, a research scientist at Google listed on the release, frames it in the launch post as a move from a typewriter to a printing press.
Speed, Hardware by Hardware
That speedup lands on a specific set of hardware. Google published the headline numbers on its launch post and NVIDIA matched them with the full hardware matrix in its own writeup the same day. The model is open under Apache 2.0 and runs locally with no per-token cost.
The 1,000+ tokens per second on a single NVIDIA H100 is the figure both announcements lead with. On a GeForce RTX 5090, the launch post reports 700+ tokens per second. NVIDIA’s breakdown adds the local-AI form factors: 150 tokens per second on the DGX Spark deskside supercomputer, and up to 2,000 tokens per second on the DGX Station with 748GB of coherent memory. Google labels the speedup “up to 4x” against an equivalent autoregressive model running in the same single-user regime. The table below shows the published figures and the conditions each requires.
| Hardware | Tokens/sec | Source |
|---|---|---|
| NVIDIA H100 Tensor Core GPU | 1,000+ | Google launch post |
| GeForce RTX 5090 | 700+ | Google launch post |
| DGX Spark | 150 | NVIDIA blog |
| DGX Station | up to 2,000 | NVIDIA blog |
Where Does the Model Fall Short?
Google does not pretend the trade-off isn’t there. The launch post states plainly that DiffusionGemma’s “overall output quality is lower than standard Gemma 4.” The DiffusionGemma model card with full benchmark tables backs that statement up with the full benchmark numbers, side by side with the autoregressive Gemma 4 26B A4B it is built on. The gaps run deep, and they are the cost of the 4x speed claim.
On knowledge and reasoning tests, DiffusionGemma lands five points or more below Gemma 4 in most places. MMLU Pro drops from 82.6% to 77.6%, GPQA Diamond falls from 82.3% to 73.2%, and the Tau2 tool-use average slips from 68.2% to 56.2%. AIME 2026 shows the widest gap, with the autoregressive version scoring 88.3% and DiffusionGemma scoring 69.1%. The benchmark table below captures the full published comparison.
| Benchmark | DiffusionGemma 26B A4B | Gemma 4 26B A4B |
|---|---|---|
| MMLU Pro | 77.6% | 82.6% |
| AIME 2026 (no tools) | 69.1% | 88.3% |
| LiveCodeBench v6 | 69.1% | 77.1% |
| Codeforces ELO | 1429 | 1718 |
| GPQA Diamond | 73.2% | 82.3% |
| Tau2 (avg over 3) | 56.2% | 68.2% |
| HLE (no tools) | 11.0% | 8.7% |
| HLE (with search) | 11.9% | 17.2% |
Math and code benchmarks show the same direction. LiveCodeBench v6 falls from 77.1% to 69.1%. Codeforces ELO drops from 1718 to 1429, a difference of 289 points in a competitive-programming rating system where 100 points is already a wide gap. HLE with search is a clear loss for DiffusionGemma, 11.9% to 17.2%; HLE no tools is the only published benchmark where DiffusionGemma narrowly beats its sibling, 11.0% to 8.7%. Ars Technica’s coverage of the release adds a structural caveat: diffusion wastes effort on very short outputs, since the model has to do the same parallel work to whittle down to a few tokens an autoregressive model can emit in just a few steps.
Because it prioritizes speed and parallel layout generation, DiffusionGemma’s overall output quality is lower than standard Gemma 4. For applications that demand maximum quality, we recommend deploying standard Gemma 4.
Sebastian Flennerhag, a research scientist at Google, made that statement in the DiffusionGemma launch post. He is the named author on the release, and his recommendation ends the suspense: deploy standard Gemma 4 for quality-sensitive production. The 4x headline ships with a clear, named ceiling on what the model is good for.
The Hardware NVIDIA Optimized For
DiffusionGemma’s design hits the strengths of NVIDIA’s GPU stack. The 256-token parallel block is a compute-bound workload, and NVIDIA Tensor Cores are tuned for dense parallel math. The CUDA software stack runs the model without bespoke tuning, so day-zero support is in place across Hugging Face Transformers, vLLM, and Unsloth. The hardware-side optimization breakdown for RTX and DGX systems frames the fit as the model playing directly to the GPU’s strengths.
Native NVFP4 support accelerates compute throughput while keeping accuracy close to higher precision. llama.cpp support is listed as coming soon, which would extend the model to the open-source inference community beyond the optimized paths. The four primary deployment targets are below.
- NVIDIA DGX Spark: Deskside personal AI supercomputer powered by the GB10 Grace Blackwell Superchip with 128GB of unified memory.
- NVIDIA RTX PRO 6000 workstations: For developers, researchers, and AI professionals running local low-latency generation and agentic loops.
- NVIDIA DGX Station: 748GB of coherent memory, up to 2,000 tokens/sec for local high-speed inference.
- GeForce RTX GPUs: Quantized for RTX 5090 and RTX 4090 out of the box, with llama.cpp coming soon.
How Developers Run It Today
Open weights and day-zero framework support mean developers can start without waiting on a custom build. The launch post links to the weights on Hugging Face under the Apache 2.0 license, with no cloud requirement and no per-token cost. The same release ships with serving code for vLLM and integration recipes for Unsloth and NVIDIA NeMo fine-tuning. The Gemma 4 26B A4B backbone means the model inherits the same multimodal input support the rest of the Gemma 4 family ships with.
The developer guide with serving and fine-tuning recipes published the same day includes the exact vLLM command for running the model locally. The flag list is short: a max model length of 262,144 tokens, a canvas length of 256, and an entropy-bound diffusion sampler. The model handles text, image, and video inputs through the same weights, so document parsing, screen understanding, and chart comprehension all run on the same checkpoint.
A demonstration of what the parallel design buys you: the developer guide includes a fine-tuned DiffusionGemma that solves Sudoku puzzles at an 80% success rate, after a simple JAX SFT recipe on a base model that scores near 0%. An autoregressive model cannot solve the same puzzles, because each token depends on tokens not yet generated. The frameworks and tools available today are below.
- Hugging Face Transformers: Runs DiffusionGemma on a GeForce RTX 5090 or DGX Spark out of the box.
- vLLM: Day-zero serving support with an OpenAI-compatible local server, with integration contributed by Red Hat.
- Unsloth: Fine-tuning recipe and adapter support for adapting the model to specific tasks.
- NVIDIA NeMo framework: Adapter support and DGX Spark playbooks for local development and deployment.
The Cases the 4x Figure Doesn’t Cover
DiffusionGemma’s speedup is regime-specific. Google’s launch post is explicit that in high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, and DiffusionGemma’s parallel decoding “offers diminishing returns and can result in higher serving costs.” The 4x figure is a single-user, single-accelerator number; the moment a server batches requests from many users, the comparison flips.
Apple Silicon is the other platform Google flags. The launch post includes a footnote that unified-memory architectures like Apple Silicon Macs are often memory-bandwidth-bound rather than compute-bound during inference, so they “may not see the same acceleration over autoregressive models like Gemma 4.” DiffusionGemma’s CUDA-first optimization means the fastest path to the 4x figure is an NVIDIA box. The model will not see the same wall-clock speedup on a Mac, where memory bandwidth, not compute, is the bottleneck. The same caveat applies to most non-NVIDIA accelerators in the consumer market today.
The Trade-Off Google Puts in Plain View
DiffusionGemma is not a Gemma 4 replacement. It is a Gemma 4 sibling tuned for the case where a single user, on a single box, wants a response faster than any autoregressive model can deliver. The launch post is careful to position it as experimental, faster, and lower in output quality than the autoregressive standard, with a recommendation to deploy Gemma 4 for any application where quality is the bottleneck. The framing is consistent across the launch post, the model card, and NVIDIA’s hardware writeup.
The four published facts that define the trade-off sit inside the announcement. Up to 4x faster on dedicated GPUs, against an equivalent autoregressive model in the same single-user regime. Open weights, Apache 2.0, no cloud required. And benchmark scores that trail standard Gemma 4 on every published metric except HLE no tools.
For developers, the practical question is whether the speed is worth the quality cost in this workflow. The release gives them the data to answer that question themselves, and the open license to build on whichever side of the trade-off fits their use case. The 4x number is the easy thing to repeat; the trade-off is the part that has to be repeated with it.
Frequently Asked Questions
What is DiffusionGemma?
DiffusionGemma is an experimental open-weights text generation model from Google DeepMind, released on June 10, 2026. It is built on the Gemma 4 26B A4B mixture-of-experts backbone and uses a diffusion-based decoder that denoises 256 tokens in parallel rather than predicting them one at a time. The result is single-user text generation that runs up to 4x faster than the equivalent autoregressive model on dedicated GPUs.
How fast is DiffusionGemma on local GPUs?
Google and NVIDIA published the same set of numbers on the day of release. The model delivers 1,000+ tokens per second on a single NVIDIA H100, 700+ tokens per second on a GeForce RTX 5090, 150 tokens per second on the NVIDIA DGX Spark deskside supercomputer, and up to 2,000 tokens per second on the DGX Station. The figures are measured in the same single-user regime where the 4x speedup applies.
Is DiffusionGemma open source?
Yes. The weights are released under the Apache 2.0 license, the same license used by the rest of the Gemma 4 family. The model can be downloaded from Hugging Face, served with vLLM, run with Hugging Face Transformers, and fine-tuned with Unsloth or NVIDIA NeMo. No cloud connection is required for local use, and there is no per-token cost.
What hardware does DiffusionGemma run on?
NVIDIA is the optimization partner, with day-zero support for GeForce RTX 5090 and RTX 4090 cards, the RTX PRO 6000 workstation, DGX Spark, and DGX Station. Native NVFP4 support accelerates compute throughput. llama.cpp support is listed as coming soon, which would broaden the open-source inference options. The launch post flags Apple Silicon Macs as a caveat, since their memory-bandwidth-bound design may not see the same speedup.
Should developers use DiffusionGemma or standard Gemma 4?
The two models answer different questions. Standard Gemma 4 is the recommendation for any application where output quality is the bottleneck; its benchmark scores are higher across the published test suite. DiffusionGemma is positioned for speed-critical, single-user work: in-line editing, agentic loops, and rapid local iteration. Google’s own researchers call the quality gap a trade-off and recommend the standard model when maximum quality is required.








