The model dropped on April 2, 2026. Within hours, two of the biggest silicon vendors in AI had published detailed deployment guides, tested Docker images, and verified compatibility across their full product lines. That doesn’t happen by accident. It takes months of quiet engineering work before the announcement, early model access, and a degree of coordination that’s become standard practice between Google and the hardware companies building the infrastructure its models run on.
Google’s Gemma 4 family arrived with something developers have been quietly wanting for years: real, across-the-board hardware parity from launch day. AMD announced Day Zero support spanning Instinct data center GPUs, Radeon graphics cards, and Ryzen AI processors — all at once, no asterisks. NVIDIA confirmed the same for everything from Jetson Orin Nano edge modules to RTX-powered workstations and the DGX Spark personal AI supercomputer. For anyone who’s been burned before by “supported” models that took weeks to actually run well on specific hardware, this is a meaningful shift worth paying attention to.
That said, “supported” and “optimal” aren’t the same thing. The details of how each company actually handles Gemma 4 — which tools they back, which hardware tiers they prioritize, where the real engineering investment went — reveal quite a bit about where AI inference is heading in 2026 and who’s best positioned to capture it.

What Gemma 4 Actually Is
Before getting into the hardware, it’s worth understanding why this model family is genuinely interesting rather than just another Google release.
Gemma 4 ships in five variants: E2B, E4B, a 26B dense model, a 31B dense model, and a 26B Mixture of Experts (MoE) variant called 26B-A4B. MoE architecture means the model selectively activates only a portion of its parameters for any given input — roughly 4B of the 26B per token, per the A4B designation. In practical terms, you get the capacity of a nominally large model at the per-token compute cost of a much smaller one (the full weights still need to fit in memory, but each token exercises only a fraction of them) — which is exactly why the 26B-A4B exists alongside the 31B dense model rather than instead of it.
Context windows go up to 256K tokens. Most consumer-grade workflows will never come close to that ceiling, but it matters enormously for document analysis, long-form reasoning chains, and agentic tasks where a model needs to track many intermediate steps without losing the thread. That length wasn’t designed for chatbots. It was designed for agents.
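To see why a full-length context is a serious memory problem, a back-of-envelope KV-cache estimate helps. The layer and head counts below are hypothetical stand-ins, not Gemma 4's published dimensions:

```python
def kv_cache_gb(seq_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size: 2 tensors (K and V) per layer, per KV
    head, per head dimension, per token, at fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value / 1e9

# A 256K-token window on a hypothetical 48-layer, 8-KV-head, 128-dim model:
print(round(kv_cache_gb(256_000, 48, 8, 128), 1))  # → 50.3
```

Tens of gigabytes for the cache alone, before any weights, is why "full context length" claims are worth checking against the hardware tier they were validated on.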
The multimodal capabilities deserve more attention than they’re getting in the initial wave of coverage. Gemma 4 ingests text, images, and audio, and generates text. It handles optical character recognition, object recognition, and automatic speech recognition natively, not through bolted-on adapter modules. It was trained for thinking, coding, and function calling as first-class tasks. It understands up to 140 languages, with out-of-the-box fluency in more than 35 for generation tasks. For a family of models operating at the 2B-to-31B scale, that breadth is uncommon. Most compact open-weights models make hard tradeoffs and pick a specialty. Gemma 4 is attempting something closer to general competence at a size that actually fits on local hardware.
One architectural decision worth noting: Gemma 4 supports interleaved multimodal input, meaning you can mix text and images in any order within a single prompt rather than being forced to put images at the start. That sounds like a small quality-of-life feature, but it matters significantly for document analysis tasks where a natural prompt would reference an image mid-sentence, and for agentic workflows where the agent assembles a context window from heterogeneous sources incrementally.
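In the OpenAI-compatible chat format that most local inference servers expose, an interleaved prompt is just a content list with text parts on either side of the image. A minimal sketch — the URL and wording here are invented for illustration:

```python
# Build a single user message that mixes text and an image mid-prompt,
# using the OpenAI-style multimodal content-parts format.
def interleaved_message(intro: str, image_url: str, follow_up: str) -> dict:
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": intro},
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": follow_up},  # text *after* the image
        ],
    }

msg = interleaved_message(
    "The chart below shows Q3 revenue:",
    "https://example.com/q3-revenue.png",
    "Summarize the trend and flag any anomalies.",
)
```

The point is the third content part: the follow-up text refers back to the image naturally, the way a human-written document would.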
AMD’s engineering team called out something that Google’s marketing didn’t emphasize loudly enough: the architecture changes from Gemma 3 aren’t just incremental improvements. Improved long-context quality, enhanced efficiency, and entirely new architectures for vision and audio processing signal a deeper revision than a version bump would usually imply. Gemma 4 isn’t Gemma 3.5 with a fresh coat of paint.
AMD’s Full-Stack Approach
AMD’s Day Zero announcement covered every tier of its AI hardware portfolio without qualification. Instinct GPUs for cloud and enterprise data centers, Radeon GPUs for AI workstations, and Ryzen AI processors for consumer and commercial AI PCs — each with verified deployment paths, not placeholder documentation.
For production inference on AMD GPUs, vLLM is the primary path. Users pull the dedicated Docker image built for the Gemma 4 launch:
```shell
docker pull vllm/vllm-openai-rocm:gemma4
```
Then invoke the server with the Triton attention backend:
```shell
vllm serve vllm/vllm-openai-rocm:gemma4 --attention-backend TRITON_ATTN
```
The Triton backend is required — not optional — because of the bidirectional image-token attention Gemma 4 uses for multimodal input processing. This is the kind of implementation detail that bites people who try to run the model with default settings and then spend an afternoon figuring out why the vanilla attention backend fails. AMD published this explicitly rather than burying it in a compatibility matrix.
What makes vLLM worth using for production over simpler inference setups is its multi-request handling. When you’re serving actual users rather than running single-request benchmarks, the difference between naive sequential processing and vLLM’s batching optimizations shows up immediately in throughput numbers. AMD confirmed that multiple generations of both Instinct and Radeon GPUs are covered, not just the latest silicon.
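Taking advantage of that batching requires nothing special on the client side: fire requests concurrently and let the server merge them. A stdlib-only sketch, assuming a vLLM server on the default localhost port and a placeholder model name:

```python
import asyncio
import json
from urllib import request

BASE_URL = "http://localhost:8000/v1"  # vLLM's default OpenAI-compatible endpoint

def chat_payload(prompt: str, model: str = "gemma-4") -> dict:
    # Model name is a placeholder; use whatever identifier your server reports.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def post_chat(payload: dict) -> dict:
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

async def fire_concurrently(prompts: list[str]) -> list[dict]:
    # In-flight requests arrive together, so vLLM's continuous batching
    # serves them in shared GPU passes instead of strictly one at a time.
    return await asyncio.gather(
        *(asyncio.to_thread(post_chat, chat_payload(p)) for p in prompts)
    )

# With a server running: asyncio.run(fire_concurrently(["Hi!"] * 8))
```

Sequential loops leave the batching machinery idle; concurrency is what lets it earn its keep.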
SGLang support lands specifically on MI300X, MI325X, and MI350X hardware. A single MI300X GPU, carrying 192GB of HBM3, can run the 31B model at full context length with tensor parallelism set to 1 (TP=1). For higher-volume deployments where throughput matters more than raw context size, tensor parallelism scales to TP=2 or beyond. These aren’t theoretical specs — they’re the numbers AMD’s team validated before the launch announcement went out. Planning capacity around confirmed figures is different from planning around theoretical maximum specifications.
On the consumer and prosumer side, LM Studio integration brings Gemma 4 to Ryzen AI and Ryzen AI Max processors as well as Radeon and Radeon PRO cards through a GUI that most people already know. The workflow is simple: download LM Studio, install the current AMD Software: Adrenalin Edition drivers, and load the model. No command line required.
Lemonade Server deserves specific mention here because it solves a problem most discussions of local AI inference overlook. It’s an open-source local LLM server with an OpenAI-compatible API, which means any code already written for GPT-4 API calls can talk to a local Gemma 4 instance without modification. ROCm acceleration handles GPU deployment. The XDNA 2 NPU inside Ryzen AI processors gets Gemma 4 E2B and E4B support arriving with the next Ryzen AI software update, accessible to developers directly through OnnxRuntime APIs.
The ROCm deployment path through Lemonade involves one extra step that’s easy to miss: you need to point Lemonade at a platform-specific ROCm build of llama.cpp, not the generic build. AMD published pre-built binaries for specific GPU architectures — for example, the llama-windows-rocm-gfx1151-x64 build targets the Radeon RX 8060S. Getting that architecture string right is the difference between hardware-accelerated inference and falling back to CPU. It's the kind of setup detail that feels minor until it's the reason your throughput numbers look wrong.
Quantization is worth addressing directly since it affects which models you can actually run on which hardware. The E2B and E4B models at common quantization levels like Q4_K_M fit comfortably on mid-range Radeon cards with 16GB or more of VRAM. The 31B model at full precision is a different story — you need the MI300X’s 192GB HBM or similar. At 4-bit quantization, the 31B model becomes accessible to a much wider range of hardware, though with some quality tradeoff that’s more noticeable on tasks requiring precise mathematical reasoning than on general language tasks. Knowing that tradeoff exists and planning around it is more useful than treating quantization as a free performance upgrade.
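A quick way to sanity-check those fits is simple arithmetic: parameter count times bits per weight, plus some headroom. This is illustrative math, not a vendor-published figure, and it excludes the KV cache entirely:

```python
def approx_weight_gb(params_b: float, bits_per_weight: float,
                     overhead: float = 1.1) -> float:
    """Weights-only estimate with ~10% headroom for higher-precision
    embeddings and runtime buffers."""
    return params_b * bits_per_weight / 8 * overhead

# Q4_K_M averages roughly 4.5 bits per weight:
print(round(approx_weight_gb(31, 4.5), 1))   # → 19.2: tight on a 24GB card
print(round(approx_weight_gb(4, 4.5), 1))    # → 2.5: easy on a 16GB card
print(round(approx_weight_gb(31, 16), 1))    # → 68.2: MI300X territory
```

Run the same arithmetic for your target context length's KV cache before committing to hardware; weights are only part of the bill.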
That NPU path is quietly significant. Running a 2B or 4B model on the neural processing unit rather than the discrete GPU means the graphics card stays available for other work — gaming, video encoding, rendering — rather than being monopolized by background inference. It’s the difference between AI being a discrete application you launch and AI being ambient infrastructure that doesn’t compete with everything else you’re doing on the machine.

NVIDIA’s RTX and Edge Play
NVIDIA approached Gemma 4 from a different direction. Rather than documenting every inference framework that happens to work, the collaboration between Google and NVIDIA focused on deep GPU-level optimization — Tensor Core acceleration for AI inference workloads, the CUDA software stack providing broad compatibility from launch without extensive per-model engineering.
The benchmark methodology NVIDIA published is specific enough to be reproducible: Q4_K_M quantization, batch size of 1, input sequence length of 4,096, output sequence length of 128, measured via llama.cpp build b7789, on a GeForce RTX 5090. The comparison point was a Mac Studio with the M3 Ultra chip. Apple Silicon is probably the strongest consumer alternative for local AI inference right now, so that’s the right target to benchmark against rather than a less capable machine. If you want to verify the numbers on your own RTX hardware, you have enough detail to do so.
The Jetson Orin Nano plays a role in NVIDIA’s story that gets underemphasized in mainstream AI coverage. The E2B and E4B models run completely offline on Jetson modules with near-zero latency. That positions them as candidates for industrial edge deployment — embedded vision systems, real-time speech transcription in offline environments, sensor fusion at sites where cloud connectivity is either physically unavailable or a security liability. That’s a market most open-source AI conversations don’t spend much time on, but it’s substantial and growing fast.
For RTX PC users, Ollama and llama.cpp are the two main deployment paths, and both are mature enough that “mature” is no longer a polite word for “barely functional.” Verified optimized support from launch day means the usual gap between a model releasing and the community figuring out the optimal inference configuration doesn’t apply. You don’t have to wait for someone to publish the right command-line flags two weeks after launch.
Unsloth adds something the other tools don’t: day-one fine-tuning support with pre-quantized models optimized for local deployment via Unsloth Studio. If you need to adapt Gemma 4 to a specific domain — a particular codebase style, legal document formats, specialized terminology — you can start immediately rather than waiting for community-built training pipelines to catch up. That’s a meaningful practical difference for teams that need domain-specific rather than general-purpose behavior.
DGX Spark sits at the high end of NVIDIA’s Gemma 4 story. It’s NVIDIA’s personal AI supercomputer, a desktop-class machine with data center-grade specifications aimed at researchers and serious developers who need frontier-model performance without cloud latency or data privacy tradeoffs. The 26B and 31B Gemma 4 models running on DGX Spark make a genuine case for local alternatives to cloud-based models for agentic workflows. OpenClaw compatibility extends this further: always-on AI assistants that access local files, running applications, and active workflows in real time, building context from your actual computing environment rather than a static conversation history.
To be fair, DGX Spark sits at a price point that puts it well outside most developers’ personal budgets. Its relevance isn’t as a consumer product — it’s as an existence proof that the performance envelope for local inference keeps expanding. What runs on DGX Spark today is a useful indicator of what RTX hardware will comfortably run eighteen months from now as quantization techniques improve and model compression research matures.
The Shared Infrastructure Layer
When AMD and NVIDIA both publish Day Zero support for the same model family, it surfaces which tools have actually become load-bearing infrastructure in the open-source AI inference stack.
Ollama appears on both sides. llama.cpp appears on both sides. These aren’t experimental projects anymore. They’re the baseline that model releases get tested against first, the tools that matter to the largest number of actual users. A model that doesn’t run well in llama.cpp has a problem that needs fixing before almost anything else.
vLLM and SGLang show up more prominently in AMD’s documentation, which makes sense given the Instinct GPU focus on production serving. Both are legitimate production infrastructure — vLLM particularly has become standard for teams running multi-user inference servers rather than single-developer local deployments. Its request batching and scheduling are categorically better than simpler setups under real load.
What both companies support reflects a recognition that the open-source AI inference stack has matured. Two years ago, getting a new model working on AMD hardware with ROCm often meant waiting for community patches, tracking down compatibility flags, and hoping someone had tested your specific GPU generation. AMD’s simultaneous, full-portfolio support for Gemma 4 is evidence of how far ROCm has come — not just as a technology, but as a software ecosystem that hardware vendors and model developers both treat as a first-class deployment target.
NVIDIA’s advantage here remains real but is narrowing. The CUDA ecosystem’s depth — the years of accumulated tooling, optimization libraries, and compatibility guarantees — means new models genuinely do work better out of the box on NVIDIA hardware in most benchmarked scenarios. That breadth is the result of sustained investment, not an accident of hardware architecture. But the gap that once felt like a permanent structural advantage now looks more like a head start that’s being closed methodically.
What Agentic AI Actually Needs From Hardware
Both companies used the phrase “agentic AI” independently in their Gemma 4 announcements. That’s not marketing alignment. It reflects where the model was actually designed to operate.
Agentic workflows are different from conversational AI in ways that have direct hardware implications. An agent doesn’t respond to a single question and stop. It maintains context across many steps, calls external tools and processes their results, decides what to do next based on intermediate outputs, and may run for minutes or hours on a single task. The 256K context window wasn’t included because it tests well on benchmarks. It was included because agents need it.
Native function calling — the ability to emit structured tool-use requests rather than just text — is similarly architectural. It’s not a prompting trick. It’s a trained capability that changes how reliably an agent can interact with external APIs, databases, and code execution environments.
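Concretely, an OpenAI-compatible server surfaces that capability as structured `tool_calls` rather than prose the caller has to regex apart. A sketch with an invented tool schema and a hand-written response for illustration:

```python
import json

# An OpenAI-style tool definition the server advertises to the model.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def parse_tool_call(message: dict):
    """Extract (name, args) from an assistant message, or None for plain text."""
    calls = message.get("tool_calls") or []
    if not calls:
        return None
    fn = calls[0]["function"]
    return fn["name"], json.loads(fn["arguments"])

# The model emits a structured request instead of free text:
reply = {"role": "assistant", "tool_calls": [
    {"id": "call_1", "type": "function",
     "function": {"name": "get_weather", "arguments": '{"city": "Berlin"}'}}]}
print(parse_tool_call(reply))  # → ('get_weather', {'city': 'Berlin'})
```

The arguments arrive as machine-parseable JSON matching the declared schema, which is what makes the agent's next step deterministic rather than a parsing gamble.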
AMD’s vision for how this plays out at scale is worth quoting directly in spirit, if not word for word: small models running on the NPU for fast, cheap queries and ambient awareness; medium models on discrete GPUs for reasoning-intensive tasks; large models on Instinct data center hardware for fleet-level orchestration across many agents. That’s not one use case. That’s an entire deployment architecture covered by a single model family at different quantization levels and hardware tiers.
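That tiering reduces to a routing decision: send each request to the cheapest tier whose capability ceiling covers it. A toy sketch — the tier names and complexity thresholds are invented for illustration, not anything AMD published:

```python
# Tiers ordered cheapest-first, each with an arbitrary complexity ceiling.
TIERS = [
    ("npu-e4b", 2),       # ambient, low-cost queries on the Ryzen AI NPU
    ("gpu-26b-a4b", 5),   # reasoning-heavy tasks on a discrete GPU
    ("instinct-31b", 10), # fleet-level orchestration in the data center
]

def route(complexity: int) -> str:
    """Pick the first tier whose ceiling covers the estimated complexity."""
    for name, ceiling in TIERS:
        if complexity <= ceiling:
            return name
    return TIERS[-1][0]  # overflow lands on the largest tier

print(route(1))  # → npu-e4b
print(route(7))  # → instinct-31b
```

The hard, unsolved part in real deployments is estimating `complexity` before running the model; the routing itself is trivial once that signal exists.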
NVIDIA’s version of the same bet is OpenClaw running on DGX Spark, with Gemma 4 as the reasoning layer — an always-on agent with local context, processing power, and direct access to the user’s working environment. Whether this framing holds up under production conditions is a fair question. The demos are compelling. The real-world track record for multi-step AI agents completing complex tasks reliably, without hallucinating intermediate results or taking unintended actions, is still being established. Supporting a model designed for agentic use is not the same as agentic AI being reliable at scale. The infrastructure is ready. The software and the prompting practices that make agents actually trustworthy are still catching up.

What Day Zero Really Signals for Developers
Read past the press release phrasing and Day Zero support represents a genuine technical commitment. Working deployment guides, verified Docker images, and tested code on the same day a model releases means the hardware vendor had early model access and invested engineering time before anyone else was running it publicly. AMD validated the MI300X’s 192GB HBM capacity against the 31B model’s actual memory requirements. NVIDIA ran llama.cpp benchmarks on RTX 5090 hardware and published methodology specific enough to reproduce. These are the outputs of real work, not post-hoc documentation.
For developers making infrastructure decisions, this reduces a pain point that used to be significant. You don’t have to wait for community testing to confirm your hardware tier is actually supported. You don’t have to track down the community-maintained ROCm compatibility patch that makes the model run without crashing. You get working code from day one, across a wide range of hardware. That’s a lower bar than it sounds, but anyone who’s spent a weekend debugging a model that “supports” a GPU generation that was quietly only tested on one specific variant knows how much the difference matters in practice.
The broader signal is that coordinated multi-vendor releases on open-weight models are now standard practice rather than exceptional events. Google, AMD, and NVIDIA shipping Gemma 4 support simultaneously means the open-source AI inference ecosystem has matured enough to support the kind of launch coordination that proprietary model providers have always had with their own hardware. That’s a meaningful change from the period — not long ago — when open-weight releases landed and hardware support arrived weeks or months later, assembled from community contributions.
Gemma 4 model weights are available on Hugging Face under Google’s open-weights license. AMD’s full documentation lives on the ROCm AI Developer Hub. NVIDIA’s deployment guide, including the detailed benchmark methodology, is on the NVIDIA technical blog. If you’re running any AMD Instinct, Radeon, or Ryzen AI hardware, or any NVIDIA RTX card released in the last few years, there’s a working deployment path available right now. The infrastructure question has been answered. What you do with that access is the more interesting question.