Gemma 4 changes the self-hosted AI equation

Google's Gemma 4, released on 2 April 2026 under a fully permissive Apache 2.0 licence (Google Blog, MindStudio), is not just another open-weights refresh. It is the first open model family that credibly closes the gap with frontier proprietary systems while remaining deployable on hardware most technology teams already own. For CTOs evaluating whether to build or buy their AI capabilities, Gemma 4 shifts the calculus decisively.

Built from the same research underpinning Google's Gemini 3, the family spans four models — from edge devices drawing under 1.5 GB of memory to a flagship dense model ranking third among all open models globally on the Arena AI leaderboard (Google DeepMind). The Apache 2.0 licence removes every commercial restriction that constrained earlier Gemma releases, meaning firms can fine-tune on proprietary data and deploy commercially without licensing complexity (Beam AI).

Four models, from Raspberry Pi to data centre

The Gemma 4 family is designed around a simple insight: different business problems demand different compute budgets. Rather than shipping a single monolithic model, Google delivered four variants that share a common architecture but target radically different deployment envelopes (Labellerr, WaveSpeedAI).

Gemma 4 E2B packs 2.3 billion effective parameters into a footprint that runs on a Raspberry Pi 5 at 7.6 tokens per second. It handles text, images, audio, and video with a 128K-token context window — enough to process a short book in a single pass (MindStudio). Gemma 4 E4B doubles the effective parameter count to 4.5 billion, adding richer reasoning while still fitting comfortably on a laptop GPU. Both small models use Per-Layer Embeddings (PLE), a technique inherited from Gemma 3n that injects a lightweight conditioning signal at every transformer layer, squeezing more capability from fewer active parameters (Hugging Face).

Gemma 4 26B A4B introduces a Mixture-of-Experts (MoE) architecture with 128 fine-grained experts, of which only 8 (plus one oversized shared expert) activate per token (MindStudio). The result is a model with 25.2 billion total parameters but only 3.8 billion active at inference — delivering intelligence comparable to the 31B dense model at roughly the compute cost of a 4B model. At the top of the range, Gemma 4 31B is a conventional dense transformer with 30.7 billion parameters, a 256K context window, and benchmark scores that embarrass models five to twenty times its size (Gemma 4 Specs, NVIDIA NIM).

All four models share hybrid attention that alternates local sliding-window layers with global full-context layers, proportional RoPE for stable long-context reasoning, unified keys-and-values in global attention to halve KV-cache memory, and native function-calling tokens baked into the vocabulary during pre-training (Google AI Model Card, Datature). Compared to Gemma 3, the architectural refinements are targeted rather than radical — but the training recipe, data quality, and built-in chain-of-thought reasoning mode deliver a generational leap in output quality (Maarten Grootendorst).

Running a frontier-class model on a single GPU

The most consequential detail for enterprise adopters is where these models actually run. Gemma 4 31B quantised to 4-bit (Q4_K_M) requires approximately 16–20 GB of VRAM — within reach of an NVIDIA RTX 4090 or RTX 3090 (Oflight Hardware Guide). The 26B MoE variant fits in a similar envelope at roughly 15–18 GB quantised to INT4. The smaller E2B and E4B models need as little as 5 GB, making them viable on an RTX 3060 or even integrated laptop GPUs (LM Studio, Spheron).

For production serving, the tooling ecosystem has matured considerably. vLLM offers day-one support with PagedAttention, continuous batching, tensor parallelism, and an OpenAI-compatible API — including native Gemma 4 tool-call parsing (vLLM Recipes, GitHub vLLM PR). Ollama provides a single-command deployment path ideal for prototyping: ollama run gemma4:31b downloads, quantises, and serves the model on localhost (Ollama). llama.cpp and LM Studio support GGUF quantisation formats from 2-bit through 8-bit, with CPU offloading for machines without dedicated GPUs (Codex Blog).

On cloud infrastructure, the numbers are equally compelling. A g6.2xlarge instance on AWS (single NVIDIA L4, 24 GB VRAM) can serve the 31B model at INT4 for roughly $0.98 per hour — under $710 per month (LushBinary AWS Guide). Reserved instances bring that below $460. For organisations already running Kubernetes, deploying on AWS EKS with GPU-optimised node groups and Karpenter autoscaling provides elastic capacity that scales to zero during quiet periods (NVIDIA GPU Operator Docs). Google Cloud Run now offers serverless Gemma 4 deployment with cold starts of approximately three minutes for the 26B variant — you pay only when processing requests (Google Cloud Blog, Medium/Gwerzman).

The practical implication: a professional services firm spending $30,000–50,000 per month on API calls to frontier models can likely achieve 40–60% cost savings by self-hosting a fine-tuned Gemma 4, while simultaneously eliminating data egress to third parties (TokenMix Self-Host Analysis, Prem AI Guide).

Benchmark performance that punches well above its weight

Gemma 4's benchmark results represent a step-change, not an incremental improvement. The 31B dense model scores 89.2% on AIME 2026 (mathematical reasoning), up from 20.8% for Gemma 3 27B — a fourfold improvement (Google AI Model Card). On LiveCodeBench v6, it reaches 80.0% versus 29.1% previously. Its Codeforces ELO of 2,150 places it in the top tier of competitive programming. On the Arena AI text leaderboard, it holds an ELO of approximately 1,452, outranking Meta's Llama 4 Maverick (which deploys 400 billion total parameters, thirteen times more than Gemma 4's 31B) (Medium Benchmarks, Tech Insider).

The MoE variant tells an equally striking story. With only 3.8 billion active parameters, the 26B A4B scores 82.6% on MMLU Pro, 88.3% on AIME 2026, and 82.3% on GPQA Diamond — beating OpenAI's GPT-OSS 120B on science reasoning despite a 94-billion-parameter gap (Artificial Analysis).

Where Gemma 4 still trails is at the very top of the leaderboard. The Artificial Analysis Intelligence Index places it at 39 points versus 42 for Qwen 3.5 27B and leading proprietary models in the mid-50s (Artificial Analysis 31B Profile). The gap is concentrated in agentic task completion on third-party evaluations, not in raw reasoning or knowledge. Notably, Gemma 4 achieves its scores while consuming 2.5 times fewer output tokens than Qwen 3.5 — a meaningful efficiency advantage in production where token throughput directly translates to cost per query.

Agentic fine-tuning is where the real value unlocks

Raw benchmark performance matters, but the strategic opportunity for professional services firms lies in fine-tuning Gemma 4 for domain-specific agentic workflows. The model family was purpose-built for this. Six dedicated special tokens — <|tool>, <|tool_call>, <|tool_response> and their closing counterparts — are first-class vocabulary entries trained during pre-training, not fragile prompt-engineering conventions (LushBinary Agents Guide, Google Developers Blog). This makes structured function calling deterministic and parser-friendly in production pipelines.

The results on agentic benchmarks confirm the design intent. On τ2-bench (a multi-step tool-use evaluation), Gemma 4 31B scores 76.9% across three domains, with retail-specific agentic tasks hitting 86.4%. Gemma 3 scored 6.6% on the same retail benchmark (Google AI Model Card, Beam AI). That is not improvement — it is the difference between a capability that does not work and one that does.

QLoRA fine-tuning makes this accessible on modest hardware. The E4B model can be fine-tuned on an RTX 4090 in approximately three hours using 500–1,000 task-specific examples (LushBinary QLoRA Guide). Unsloth's optimised training kernels reduce memory consumption by 50–80%, enabling the E2B to train on a free Kaggle T4 GPU in under five minutes (Unsloth Fine-tuning Docs). A well-tuned E4B on a narrow domain task can match a prompted 31B model at one-seventh the inference cost — a powerful lever for firms deploying at scale (BSWEN QLoRA Guide, DataCamp Tutorial).

The practical recipe for an agentic workflow fine-tune involves curating examples of multi-step tool use (database queries, API calls, document retrieval, decision logic), formatting them in Gemma 4's conversational JSON schema with tool definitions in the system prompt, and running QLoRA with rank-16 adapters targeting all attention and feed-forward projections. The resulting adapter weights are roughly 50 MB — trivially deployable alongside the quantised base model (Softmax Data).

Why this matters for professional services and TIC firms

For firms in professional services, Testing, Inspection and Certification (TIC), and adjacent industries, Gemma 4 addresses three blockers that have kept AI adoption in pilot limbo (Firstminute Capital TIC thesis).

Data sovereignty becomes non-negotiable. TIC firms handle sensitive client data — product specifications, test results, compliance documentation, audit findings. Sending this to a third-party API creates regulatory exposure under GDPR, the EU AI Act, and sector-specific frameworks. A self-hosted Gemma 4 deployment keeps every token on-premise or within a controlled cloud tenancy. The Apache 2.0 licence ensures no usage data flows back to Google (Zammad Self-Hosting Guide, MindStudio Apache 2.0).

Multimodal capability matches the work. Inspection workflows combine photographs, scanned documents, sensor data, and narrative reports. Gemma 4's native vision encoder processes variable-aspect-ratio images at configurable resolution, while the E2B and E4B variants add audio understanding (Datature). A fine-tuned model can ingest an inspection photograph, cross-reference it against a regulatory standard, and draft a compliance finding — the kind of multi-step reasoning that previously required a human analyst or a brittle pipeline of single-purpose models.

The economics finally work at mid-market scale. A TIC firm processing 50,000 inspection reports per month no longer needs to choose between expensive API calls and a seven-figure ML infrastructure programme. A single RTX 4090 running a quantised, fine-tuned Gemma 4 E4B can serve that volume at under $500 per month in cloud compute — or on hardware that costs less than a single quarter's API bill (TokenMix, LushBinary AWS).

Conclusion

Gemma 4 is not the highest-scoring model on every benchmark, and it is not the fastest at raw token generation. What it represents is something more consequential: the point at which self-hosted, fine-tuned open models become the rational default for organisations with domain-specific workflows, sensitive data, and predictable inference volumes. The combination of frontier-adjacent performance, consumer-grade hardware requirements, native agentic architecture, and a genuinely permissive licence removes the last credible objection to bringing AI inference in-house (Google Blog, Build Fast with AI). For technology leaders at professional services firms, the question is no longer whether to self-host — it is which workflows to migrate first.