I was talking with a developer friend last week who works at a small startup building insurance claim processing tools. He asked me which open model I’d recommend for their backend. When I said Gemma, he almost immediately said “yeah but the license.” I had no good answer for that. So we spent most of that conversation looking at Mistral Small 4 and Alibaba’s Qwen options instead, and left it there without deciding anything.
That conversation was March 28. On April 2, Google released Gemma 4 under the Apache 2.0 license.
So. Timing.
The benchmark numbers are good and I’ll get to them. But I think most of the coverage is spending too much time on leaderboard positions and not enough on the thing that actually changes the decision for a lot of teams. That thing is the license, and I want to explain why before going into model sizes and token counts.

Why the License Switch Actually Matters
Google’s earlier Gemma releases came with a custom usage policy. The models were called “open-weight” but that’s not the same as open in the way most developers understand the word. The terms were Google’s to update at any time. They included restrictions that compliance teams at companies routinely flagged, and the wording had enough edge cases that you could not just assume your use case was fine without asking someone from legal.
And legal review adds friction. Friction kills adoption. Teams went elsewhere.
This is not a niche problem. The same developer friend I mentioned had looked at Gemma 3 three months ago with his team and the licensing question came up within about ten minutes of the conversation. They ended up on Qwen, not because Qwen was better for their specific use case, but because Qwen’s license was clean. I’ve heard a similar story from at least three other developers over the past year who were building products and couldn’t get Gemma through their compliance review. What’s frustrating is that Gemma 3 was actually a pretty capable model for structured output tasks. But “open” with asterisks is not the same as open.
Apache 2.0 is the standard open source license that developers already work with everywhere. You can modify the model, use it commercially, redistribute it, build products on top of it, integrate it into things you sell. No special terms. No “subject to Google’s usage policy.” No conditions that can change three months after you’ve shipped something and built your whole pipeline around the model. Clement Delangue, CEO of Hugging Face, called it “a huge milestone” when the announcement came out on April 2. I think he’s right. This change has been overdue since at least Gemma 3.
But there’s a second angle here that is even more practically important for a lot of teams. Organizations in healthcare, fintech, and government sectors often can’t use cloud APIs at all for data processing. Not because they don’t want to, but because regulations and client contracts don’t allow it. Data residency requirements. HIPAA. Government procurement rules. For these teams, self-hosting is the only path. With the old Gemma license, doing that commercially had enough ambiguity that legal teams flagged it. Apache 2.0 removes that problem entirely. You run it on your own servers, you modify it, you build on it. Google is not in that relationship anymore, at least not in a licensing sense.
Google said Gemma has been downloaded over 400 million times since the original release in February 2024. The community has built more than 100,000 variants of Gemma models, which Google calls the Gemmaverse. Those are large numbers for a model family with a complicated license. With Apache 2.0, that community should grow faster. And a lot of enterprise teams that wrote off Gemma due to the old licensing situation will now reconsider.
What the Four Models Actually Are
Gemma 4 comes in four sizes: E2B, E4B, 26B Mixture of Experts, and 31B Dense. All four are built on the same underlying research as Google’s proprietary Gemini 3, according to Google.
The naming confused me when I first read the announcement. E2B and E4B stand for “Effective 2B” and “Effective 4B,” where “effective” refers to the parameter count active during inference, not the total model footprint. So these are not just tiny models with limited capabilities; the naming is doing specific technical work. Keep that in mind when comparing parameter counts across model families, because the number in a Gemma 4 edge model’s name is not measuring the same thing as the headline number on most other models.
Google worked with the Pixel hardware team, Qualcomm, and MediaTek specifically on E2B and E4B. These are designed for mobile devices and IoT hardware. They run completely offline on smartphones, Raspberry Pi boards, and Jetson Nano modules. The context window is 128K tokens. They natively handle images, video, and audio input including speech recognition, all locally, with no server call required. If you’re building a voice assistant that processes audio on-device without sending anything to a server, E4B is probably where you start.
The bigger models work differently. The 26B Mixture of Experts model activates only 3.8 billion of its 26 billion parameters on each inference pass. This is the MoE architecture doing what it’s supposed to do: routing each input to the subset of the model most relevant to that particular task. The result is output quality roughly comparable to a dense model of similar total size, at token generation speeds much closer to a 4B model’s. For latency-sensitive applications, that is a real advantage over dense models at the same scale.
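The routing idea is easy to sketch. Here is a toy top-2 gate in plain numpy — hypothetical sizes and random weights for illustration, not Gemma’s actual router — showing the core mechanic: each token’s hidden state scores every expert, only the top-k experts actually run, and their outputs are mixed by the gate weights.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN, NUM_EXPERTS, TOP_K = 16, 8, 2

# One tiny linear "expert" per slot (random weights, illustration only).
experts = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.1 for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.1  # gating projection

def moe_forward(x):
    """Route one token through its top-k experts only."""
    logits = x @ router                   # score every expert
    top = np.argsort(logits)[-TOP_K:]     # indices of the k best-scoring experts
    gate = np.exp(logits[top])
    gate /= gate.sum()                    # softmax over the selected experts
    # Only TOP_K expert matmuls run here, not NUM_EXPERTS of them.
    out = sum(g * (x @ experts[i]) for g, i in zip(gate, top))
    return out, top

token = rng.standard_normal(HIDDEN)
out, used = moe_forward(token)
print(f"experts used: {sorted(used.tolist())} of {NUM_EXPERTS}")
```

Scale that idea up and you get the 26B MoE’s economics: all the experts’ weights sit in memory, but each token only pays the compute cost of the experts it routes to.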
The 31B Dense uses all its parameters on every inference pass. It’s slower than the MoE but more consistent across different types of tasks, and Google has specifically positioned it as a fine-tuning base rather than a daily-use model. The idea is you take the 31B Dense, adapt it to your specific domain with your own data, and deploy the fine-tuned version. Given how the Gemmaverse grew to 100,000+ community variants under a restrictive license, I expect the 31B Dense to generate a lot of specialized fine-tunes once Apache 2.0 removes the commercial friction from that process.
All four models natively support function calling, structured JSON output, and system instructions. This is not an add-on. It’s built into the base training. For anyone building agentic workflows where models need to call tools, interact with APIs, and produce reliable structured output, this matters a lot. You’re not fighting the model to produce clean JSON. It knows how to do that from the start.
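In practice, native structured output means you hand the runtime a JSON schema and parse the reply instead of regex-scraping it. A minimal sketch using Ollama’s chat API shape — the model tag `gemma4:26b` is my assumption, check the registry for the real name — with Ollama’s `format` field constraining output to a schema:

```python
import json

# JSON schema the model's reply must satisfy.
claim_schema = {
    "type": "object",
    "properties": {
        "claim_id": {"type": "string"},
        "amount": {"type": "number"},
        "approved": {"type": "boolean"},
    },
    "required": ["claim_id", "amount", "approved"],
}

# Request body for Ollama's /api/chat endpoint. "gemma4:26b" is a
# hypothetical tag -- look up the actual one before using this.
request_body = {
    "model": "gemma4:26b",
    "messages": [
        {"role": "user",
         "content": "Summarize claim #4417: $1,250 windshield repair, approved."}
    ],
    "format": claim_schema,  # Ollama constrains the output to this schema
    "stream": False,
}

# With schema-constrained output, parsing the reply is just json.loads:
reply = '{"claim_id": "4417", "amount": 1250.0, "approved": true}'  # example output
claim = json.loads(reply)
print(claim["claim_id"], claim["amount"], claim["approved"])
```

The point is that the parse step stops being the fragile part of the pipeline; the schema is enforced at generation time rather than cleaned up afterward.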
The Benchmark Numbers
On the Arena AI text leaderboard as of this week, the 31B Dense is ranked #3 among all open models in the world. The 26B MoE is at #6. The Arena AI ranking is based on aggregated human preference votes across many different prompt types, not a fixed academic benchmark, so it at least partially reflects real usage patterns.
On Artificial Analysis’s GPQA Diamond benchmark, which tests scientific reasoning, the 31B Dense scored 85.7%. That puts it second among all open models under 40 billion parameters, just behind Qwen3.5 27B at 85.8%. Effectively tied. The 26B MoE scored 79.2% on the same test. To give that a reference point: OpenAI’s gpt-oss-120B scored 76.2% on GPQA Diamond. A 26 billion parameter model beating a 120 billion parameter model on scientific reasoning is not a trivial result. Both Gemma 4 models in this Artificial Analysis evaluation were running on a single H100 GPU — single-machine deployments, not multi-GPU clusters.
I want to be honest about benchmark numbers though. They tell you how a model performs on the specific tasks that specific benchmark designers thought were worth measuring. They don’t tell you how the model handles your actual use case, your prompting style, your edge cases. Every model I’ve used that looked good on leaderboards had something it was quietly bad at that didn’t show up in the rankings. With Gemma 4 being released literally yesterday, the community hasn’t had nearly enough time to find those failure cases yet. I’d hold off on making any strong production decisions until independent testing catches up.
But the numbers are competitive enough that this model family should be on your evaluation list, not skipped.
Running These Models on Real Hardware
Here’s where the clean marketing story gets messy.
The 31B Dense, running unquantized in bfloat16 format, requires an 80GB H100 GPU. That card costs roughly $20,000. Technically local, but not laptop-local, not consumer-local. NVIDIA has optimized Gemma 4 for RTX-series GPUs and the DGX Spark personal AI supercomputer, which helps at the high end. But for the unquantized 31B, you’re still looking at data-center-class hardware.
If you quantize to 4-bit via Ollama or GGUF format, the 31B model fits on a machine with 16GB of RAM. An RTX 4090 has 24GB VRAM and handles it comfortably. An RTX 4080 in most configurations can manage it too. But quantization always costs something in quality. How much depends on the task. For code completion, document summarization, and structured output generation, the quality drop from 4-bit quantization is usually acceptable based on what I’ve seen with similarly sized models. For the complex multi-step reasoning tasks where the GPQA Diamond numbers look best, quantization can hurt more noticeably. I’d want to run specific tests on my use case before making any production commitment here.
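The memory arithmetic behind those hardware claims is worth doing explicitly. Weight storage scales linearly with bits per parameter, so you can sanity-check the numbers yourself (this ignores KV cache and activation overhead, which add more on top):

```python
def weight_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (ignores KV cache / activations)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

# 31B dense in bfloat16 (16 bits/param): ~62 GB of weights alone,
# which is why it wants an 80GB-class card once overhead is added.
print(round(weight_gb(31, 16), 1))   # 62.0

# The same model at 4-bit: ~15.5 GB, hence the "fits in 16GB" claim.
# That's tight: quantization block metadata and the KV cache eat into it.
print(round(weight_gb(31, 4), 1))    # 15.5
```

The same arithmetic explains why the 4-bit 31B is comfortable on a 24GB RTX 4090 and marginal on a 16GB card.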
The 26B MoE is probably the better practical choice for most developers anyway, with one caveat on the common framing: MoE mainly buys you speed, not memory. Only 3.8B parameters are active per token, so generation is fast, but all 26 billion parameters still have to be loaded, so the resident footprint is close to a dense 26B’s; quantization is what makes it fit on modest hardware. What you get is strong output quality at much faster token generation for the same scale. Google positioned the 31B Dense as a fine-tuning base, and I think that framing is correct. Use the 26B MoE if you want something good out of the box right now. Use the 31B Dense if you’re planning to adapt it to something specific.
NVIDIA worked with Google to optimize Gemma 4 for their GPU lineup. They also collaborated with Ollama and llama.cpp for day-one local deployment support, so pulling the model locally is straightforward. Unsloth has quantized versions ready for fine-tuning as of April 2. The tooling ecosystem responded fast to this release, which is a good sign.
One thing I still can’t fully square: the Google AI for Developers documentation page shows a rollout date of March 31, but the public announcement was April 2. The model weights appeared on Hugging Face a few hours after that announcement. In my experience there’s always some documentation weirdness in the first few days after a major model release. Pages are incomplete, some links 404, benchmarks sometimes get revised. It’s probably all settled by the time you read this. But if something seems inconsistent, check again in a day or two.
The Multimodal Stuff
All four models handle images and video natively. OCR, chart reading, document understanding, visual question answering — these work across the whole family, not just the bigger models.
So you could run E4B on a phone, point it at a receipt or a form, and process the content completely offline. No network required. For field workers, healthcare applications in areas with no reliable connectivity, or any setting where internet access can’t be assumed, this is a practical option that didn’t exist cleanly before at this model quality level. Six months ago you were either making an API call or using a much weaker local model. The E4B changes that equation.
Audio input is only on E2B and E4B, which I found a bit odd at first. The bigger models don’t have native audio support. Maybe the on-device voice assistant use case is more obvious for edge hardware, or maybe there was a resource overhead decision for the 26B and 31B. The documentation doesn’t explain it. Worth watching whether Google adds audio to the large models in a future update.
Interleaved input is also supported on the edge models. You can mix text and images in any order within a single prompt. Text, image, more text, another image, all in one context. This sounds like a minor feature but it matters a lot for any application where you need to reference visual content mid-conversation rather than putting all images at the start or end of a prompt.
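Concretely, interleaved input just means the message body is an ordered list of parts rather than a single string. A sketch of the shape — the part structure here follows the common content-parts convention, and the exact field names for Gemma 4’s local runtimes may differ:

```python
import base64
from pathlib import Path

def image_part(path: str) -> dict:
    """Inline an image file as a base64 content part (hypothetical field names)."""
    data = base64.b64encode(Path(path).read_bytes()).decode()
    return {"type": "image", "data": data}

# Text and images interleaved in the order the conversation needs them,
# instead of all images pinned to the start or end of the prompt.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Here is the invoice:"},
        # image_part("invoice.png"),  # uncomment with a real file
        {"type": "text", "text": "And the matching receipt:"},
        # image_part("receipt.jpg"),
        {"type": "text", "text": "Do the line items agree?"},
    ],
}
print(len(message["content"]), "parts, order preserved")
```

The runtime sees the parts in exactly the order you listed them, which is what makes mid-conversation visual references work.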
The 256K context window on the large models gets mentioned in most coverage but I don’t think people fully think through what it enables. 256K tokens is enough to pass an entire medium-sized code repository in one prompt. I’ve been doing this kind of thing with Gemini via API for a few months and it genuinely saves time on large refactoring tasks, especially when a change in one file has ripple effects across many others. Whether Gemma 4’s retrieval quality across 256K tokens matches Gemini in practice, I honestly can’t say yet. That kind of careful testing takes more than 24 hours to do properly. But the window is there.
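A rough way to check whether your repo actually fits in that window: concatenate the source files and estimate tokens at about four characters per token. That ratio is a common heuristic for English text and code, not a real tokenizer count, so treat the result as order-of-magnitude only.

```python
from pathlib import Path

def estimate_repo_tokens(root: str, exts=(".py", ".md", ".toml")) -> tuple[str, int]:
    """Concatenate matching files under root and crudely estimate the token count."""
    chunks = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in exts:
            # Label each file so the model can cite locations in its answer.
            chunks.append(f"\n--- {path} ---\n{path.read_text(errors='ignore')}")
    blob = "".join(chunks)
    return blob, len(blob) // 4  # ~4 chars/token heuristic, not a tokenizer count

# blob, tokens = estimate_repo_tokens("my_project")
# print(f"~{tokens:,} tokens -- fits in 256K context: {tokens < 256_000}")
```

If the estimate comes in anywhere near the limit, run the real tokenizer before committing to a single-prompt approach; the heuristic drifts for dense code and non-English text.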
Language Support and Who This Helps Globally
Gemma 4 is pretrained on more than 140 languages. Not a translation wrapper over English-first training — the multilingual data is in the base training from the start.
For developers building applications in non-English markets, this matters more than it might seem at first. Models primarily trained in English and then adapted for other languages often have subtle quality gaps in those languages. Grammar quirks, idioms, cultural context, register differences. A model natively trained on multilingual data doesn’t have that same underlying imbalance in the base weights.
Google specifically mentioned that Gemma is being used for Project Navarasa, which works across India’s 22 official languages. They also mentioned Gemma is being used to automate state licensing processes in Ukraine, which is a different kind of use case but shows the model being deployed in real government infrastructure. If you’re building anything for regional Indian languages like Telugu, Tamil, Kannada, Bengali, or others — the native multilingual training is worth testing rather than assuming it works. I don’t have personal test results on regional language quality to share here. I’m not going to make confident claims about something I haven’t actually measured. But the architecture is set up correctly for this work.
The edge models can handle this offline and on-device at 128K context. So the combination of multilingual support, audio input, and fully offline operation on cheap hardware is actually a meaningful capability set for certain deployment scenarios that didn’t have a good local option before.
Who Should Actually Care About This Release
So who is Gemma 4 actually for? I keep seeing coverage that says “developers and enterprises” which is not very useful. Let me be more specific.
The most direct beneficiaries are teams that were evaluating Gemma 3 and walked away because of the license. That group is bigger than the announcement coverage seems to assume. Teams that chose Mistral or Qwen not because the model was better but because the licensing was cleaner — those teams should now reconsider. Apache 2.0 is the same license they’re already using for their other open source dependencies. The decision is now a technical one, not a legal one.
People building for edge hardware are also a main audience. If you’re working with Jetson Nano modules, Android devices, industrial equipment, medical devices, or IoT setups in low-connectivity environments — the E4B with multimodal support and 128K context running completely offline is a real production option. Six months ago this level of capability wasn’t available at this size with this license. Now it is.
Self-hosting teams in regulated sectors have been waiting for something like this. Healthcare startups that can’t route patient data through Google’s servers. Fintech companies with data residency requirements. Government developers who need full infrastructure control. The combination of Apache 2.0 plus competitive model performance plus solid local deployment tooling covers most of what these teams need to get past their compliance review.
And fine-tuning researchers and teams building specialized models. The 31B Dense is specifically designed as a fine-tuning base. With Apache 2.0, you can fine-tune it on proprietary data, serve the result commercially, share derivatives if you want, all without asking Google’s permission. The 100,000+ variants that the community already built with earlier Gemma models happened under a more restrictive license. With Apache 2.0, that number should grow faster and into more commercial territory than before.
What We Still Don’t Know
The models went live yesterday. The community hasn’t had time to find the failure cases.
Every model has them. Confident wrong answers on certain edge cases. Tasks where benchmark performance doesn’t translate well to real-world quality. Inputs where the model produces clean-looking output that’s actually wrong in a subtle way. The r/LocalLLaMA community will spend the next two weeks stress-testing Gemma 4 across different tasks and hardware configurations, and that’s when the real picture starts forming. I’d wait at least two weeks before making any strong production decisions based on this release.
The 26B MoE’s long-context performance is my main specific open question. When most parameters are inactive and you’re doing retrieval across 200,000 tokens of context, does the routing mechanism handle that correctly? The theory is that expert selection adapts to the input. But long-context retrieval in MoE models can sometimes miss things that a dense model catches, because the routing was trained on more typical input lengths. Maybe it’s fine. I don’t know. I want independent tests.
And the “near-zero latency” claim for edge models needs actual numbers on actual devices. Near-zero latency for a short text classification task on a Pixel 9 is one thing. Near-zero for generating a few hundred tokens in response to a complex multimodal prompt on a Jetson Nano is a different thing. What does that actually look like in milliseconds across different device types? I’d want external benchmarks before building any product where latency is part of the value proposition.
The models are on Hugging Face right now. You can also pull them through Ollama or access the 31B and 26B via Google AI Studio if you want to test before downloading anything. Unsloth has quantized fine-tuning-ready versions.
So when I talk to my developer friend about this — probably this weekend — I’ll finally have a clean answer. The license problem is gone. The model is competitive. The hardware story works for most setups with quantization. And if his startup needs something fine-tuned or something that runs completely offline, both options are now open without any legal friction. That’s a different conversation than the one we had in March.