
Gemma 4: Byte for byte, the most capable open models

Beginner
Google DeepMind · 4/2/2026

Key Summary

  • Gemma 4 is a family of open AI models that are unusually smart for their size, so they run well on phones, laptops, and servers.
  • It comes in four sizes (E2B, E4B, 26B MoE, 31B Dense) and handles text, images, and video, plus audio on the edge models.
  • These models are built for agents: they can call functions, return perfectly structured JSON, and follow system instructions reliably.
  • Long context windows (128K on edge and up to 256K on larger models) let them read big documents or codebases in one go.
  • The 26B Mixture-of-Experts activates only part of the model each token for speed, while the 31B Dense aims for maximum quality.
  • Developers can fine-tune Gemma 4 easily and legally thanks to the Apache 2.0 license.
  • On Arena.ai’s leaderboard, the 31B ranks #3 and the 26B ranks #6 among open models, punching far above their sizes.
  • Edge-focused E2B and E4B run offline with low latency on devices like phones, Raspberry Pi, and Jetson Orin Nano.
  • Gemma 4 supports 140+ languages, strong reasoning, code generation, and reliable tool use for real agentic workflows.
  • The main idea: byte for byte, Gemma 4 delivers frontier-level capability without demanding frontier-level hardware.

Why This Research Matters

Gemma 4 puts powerful, trustworthy AI into everyday devices, so help arrives exactly where people work: in the field, on a laptop, or on a phone with bad internet. It enables private, offline workflows for health, finance, and government without shipping sensitive data to the cloud. It lets small teams and startups build real agent apps—calling tools, returning clean JSON, and handling long documents—without renting huge clusters. Its multimodal skills mean it can read charts, parse screenshots, and listen on-device, matching how humans actually communicate. The open Apache 2.0 license lowers legal friction, speeding up research and product development. By delivering top-tier scores per byte, Gemma 4 reduces costs and energy use while expanding who can access advanced AI.


Detailed Explanation


01 Background & Problem Definition

You know how some backpacks are huge but still feel disorganized, while a well-packed smaller backpack can carry exactly what you need? Before Gemma 4, many AI models were like those huge backpacks: very big and powerful but hard to carry around (they needed lots of expensive hardware) and tough to use everywhere, especially on phones or small devices.

The world before: Big, closed models did amazing things—reasoning through problems, writing code, and understanding pictures—but they often required special servers, fast internet, and strict licenses. Smaller open models existed, but they usually felt like "lite" versions: decent at chatting, less great at planning steps, less reliable at returning clean data formats, and not always able to see or hear the world (images, video, audio). If you wanted a handy helper that could also act like a teammate—call tools, fill forms perfectly, and think through tasks—you often had to choose size (big) over accessibility (small and open).

The problem: Developers needed models that were both capable and easy to deploy. They wanted strong reasoning and agent skills (like calling APIs and following instructions) on everyday hardware: phones, laptops, and single GPUs. They also wanted multimodal understanding (text + images + video, and for edge, audio), long memory for big documents, and an open license to build commercial products without legal headaches.

Failed attempts: People tried simply shrinking big models, but that often cut out the brainy parts—reasoning and tool use started wobbling. Others focused only on compressing weights (quantization) without redesigning the model for edge devices, which saved memory but sometimes hurt quality. Some models added tools in a bolted-on way, so agents felt fragile: they might return messy outputs or ignore instructions. And many models stayed text-only or had short memory, so they couldn’t digest an entire code repository or a long report in one go.

The gap: We needed a family of open models that felt “frontier-class” where it counts—reasoning, planning, and agent reliability—while still being small enough to run nearly anywhere. We also needed these models to be born multimodal, trained to follow system instructions, return structured data, and think over long contexts. Finally, we needed a permissive license so businesses, schools, and researchers could adopt them freely.

Real stakes: Imagine a doctor’s tablet that can summarize patient histories offline in a clinic with spotty internet; a farmer’s phone analyzing crop images on the field; a coder’s laptop assistant writing functions even on a plane; a local government building a private chatbot without sending data to the cloud; or a startup fine-tuning a strong, trustworthy agent without renting massive compute. That’s why Gemma 4 exists: to bring high-quality, agent-ready intelligence to the places people actually work and live, byte for byte, in a truly open way.

Concepts in the order you’ll need them (each with a quick sandwich explanation):

🍞 Hook: You know how a library card lets you borrow books and even use them in your own projects if the rules say it’s okay? 🥬 The Concept (Apache 2.0 License): It’s a permission slip that lets you use, change, and sell software with very few restrictions. How it works: 1) You can use it commercially, 2) you can modify it, 3) you include the license and notices, 4) no warranty is promised. Why it matters: Without it, businesses might be unsure if they’re allowed to ship products. 🍞 Anchor: A startup can fine-tune Gemma 4 and sell their app without special negotiations.

🍞 Hook: Imagine asking a friend to “look up the weather” and they actually open a weather app for you. 🥬 The Concept (Function Calling): The model can call tools (like a calculator or calendar) when needed. How it works: 1) You define tool functions, 2) the model chooses a tool, 3) it sends structured inputs, 4) it reads the tool’s reply, 5) it continues the plan. Why it matters: Without it, agents talk but can’t act. 🍞 Anchor: The model books a meeting by calling your calendar API.
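The five-step loop above can be sketched in a few lines of Python. Everything here is a made-up stand-in for illustration — the `get_weather` tool, its arguments, and the reply format are not Gemma 4's actual API:

```python
import json

# Hypothetical tool registry -- name, arguments, and reply shape are
# invented for illustration, not Gemma 4's actual API.
TOOLS = {
    "get_weather": lambda city: {"city": city, "forecast": "sunny", "high_c": 24},
}

def run_tool_call(model_reply: str) -> dict:
    """Execute a model reply of the form {"tool": ..., "arguments": {...}}."""
    call = json.loads(model_reply)
    return TOOLS[call["tool"]](**call["arguments"])

# Steps 2-4 of the loop: the model emits a structured call, the app runs it,
# and the result goes back into the conversation for the model to continue.
reply = '{"tool": "get_weather", "arguments": {"city": "Paris"}}'
result = run_tool_call(reply)
print(result["forecast"])  # → sunny
```

The key design point: the model never executes anything itself; it only emits a structured request, and your code stays in control of what actually runs.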

🍞 Hook: Think of a neatly labeled Bento box where every food goes in a clear slot. 🥬 The Concept (Structured JSON Output): The model returns data in a predictable format your code can read. How it works: 1) You give a JSON schema, 2) the model fills each field, 3) your app parses it safely. Why it matters: Without structure, apps break on messy text. 🍞 Anchor: A helpdesk agent returns {"ticket_id": "123", "priority": "high"} exactly as required.
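A minimal sketch of the parsing side, assuming a hypothetical ticket schema with `ticket_id` and `priority` fields; a production app might use a full validator such as the jsonschema library instead of this hand-rolled check:

```python
import json

# Hypothetical ticket schema -- field names are illustrative.
SCHEMA = {"ticket_id": str, "priority": str}

def parse_ticket(model_output: str) -> dict:
    data = json.loads(model_output)              # fails loudly on malformed text
    for field, expected_type in SCHEMA.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or mistyped field: {field}")
    return data

ticket = parse_ticket('{"ticket_id": "123", "priority": "high"}')
print(ticket["priority"])  # → high
```

Because the model is trained to respect the schema, this parse step succeeds reliably instead of being wrapped in retry-and-regex glue code.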

🍞 Hook: An artist can still draw during a flight with no internet. 🥬 The Concept (Offline Code Generation): The model writes code locally, with no cloud. How it works: 1) Load the model on your device, 2) provide files, 3) request code, 4) test locally. Why it matters: Without offline, planes and private labs are blocked. 🍞 Anchor: Your laptop suggests Python fixes while you’re camping.

🍞 Hook: When you say “Hey!” to your friend, they understand your words. 🥬 The Concept (Speech Recognition): The model turns speech into text it can use. How it works: 1) Capture audio, 2) convert sound to features, 3) decode words, 4) use them. Why it matters: Without it, voice assistants can’t listen. 🍞 Anchor: Your phone transcribes a lecture offline.

🍞 Hook: Zipping a photo makes it smaller but still useful. 🥬 The Concept (Quantization): Store model numbers in fewer bits to save memory and speed up. How it works: 1) Pick lower precision, 2) map weights, 3) run faster, 4) keep quality high. Why it matters: Without it, models might not fit on your device. 🍞 Anchor: A 31B model fits on a consumer GPU after quantization.
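The idea can be demonstrated with a toy symmetric int8 scheme in plain Python. Real deployments use per-channel or group-wise variants with calibration; this is just the core map-to-fewer-bits step, not Gemma 4's actual recipe:

```python
# Toy symmetric int8 quantization -- illustrates the idea only.
def quantize(weights, bits=8):
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]    # small integers instead of floats
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]              # approximate originals at runtime

weights = [0.82, -1.27, 0.003, 0.54]
q, scale = quantize(weights)                   # q = [82, -127, 0, 54]
restored = dequantize(q, scale)
# Each restored value lands within half a quantization step of the original.
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(weights, restored))
```

Storing `q` takes 8 bits per weight instead of 32, which is exactly why a 31B model can drop from ~62 GB (bf16) to a size that fits consumer GPUs.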

🍞 Hook: Remembering a whole chapter helps you answer questions better. 🥬 The Concept (Long Context Windows): The model can pay attention to very long inputs. How it works: 1) Break input into tokens, 2) attend across far-apart parts, 3) keep important bits, 4) answer using all of it. Why it matters: Without it, the model forgets earlier details. 🍞 Anchor: It reads a 200-page report and summarizes the key risks.
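A quick way to reason about whether an input fits in a window is to estimate tokens from character count. The 4-characters-per-token ratio below is a rough rule of thumb for English text, not a property of Gemma 4's tokenizer:

```python
# Back-of-envelope context check -- the 4 chars/token ratio is a common
# heuristic for English text, not an exact tokenizer property.
def fits_in_context(text: str, window_tokens: int, chars_per_token: int = 4) -> bool:
    estimated_tokens = len(text) // chars_per_token
    return estimated_tokens <= window_tokens

report = "risk analysis " * 40_000             # ~560K characters, ~140K tokens
print(fits_in_context(report, 128_000))        # → False (overflows edge models)
print(fits_in_context(report, 256_000))        # → True  (fits larger models)
```

In practice you would count real tokens with the model's tokenizer, but this estimate is good enough for deciding whether to chunk a document up front.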

🍞 Hook: Looking at photos or videos without needing instructions. 🥬 The Concept (Natively Processing Video and Images): The model directly understands visual content. How it works: 1) Encode frames/images, 2) link to text tokens, 3) reason jointly, 4) output insights. Why it matters: Without it, screenshots and charts are just noise. 🍞 Anchor: It reads a chart image and explains the trend.

🍞 Hook: A Swiss Army knife has tools for many jobs. 🥬 The Concept (Multimodal Capabilities): The model works with text, images, video, and on edge, audio. How it works: 1) Combine encoders, 2) align features, 3) reason across types, 4) respond. Why it matters: Without multimodality, you miss half the story. 🍞 Anchor: It watches a how-to clip and writes the steps in text.

🍞 Hook: A detective strings clues together step by step. 🥬 The Concept (Advanced Reasoning): The model follows logic to solve complex problems. How it works: 1) Break the problem, 2) plan steps, 3) check results, 4) refine. Why it matters: Without reasoning, answers sound nice but can be wrong. 🍞 Anchor: It explains why a math solution works, not just the answer.

🍞 Hook: A chess player thinks several moves ahead. 🥬 The Concept (Multi-step Planning): The model plans actions in order. How it works: 1) Set a goal, 2) list steps, 3) choose tools, 4) execute and adjust. Why it matters: Without planning, agents stall or loop. 🍞 Anchor: It outlines: find flights → book hotel → send itinerary.

🍞 Hook: A teammate who can carry out tasks without being told each tiny move. 🥬 The Concept (Agentic Workflows): The model acts autonomously using tools and plans. How it works: 1) Read system rules, 2) plan steps, 3) call functions, 4) return structured results. Why it matters: Without agency, you do all the glue work. 🍞 Anchor: A support bot triages tickets, queries a database, and emails summaries.

02 Core Idea

You know how a well-trained relay team can beat a heavier, slower team because they coordinate perfectly? The core idea of Gemma 4 is exactly that: byte for byte, make the model do more useful thinking, so smaller models feel frontier-class without frontier-class hardware.

The "Aha!" moment in one sentence: If you design for intelligence-per-parameter from the start—building in agent skills, long memory, multimodality, and efficient architectures—then even modestly sized open models can deliver state-of-the-art experiences.

Three ways to picture it:

  1. Packing smarter: Instead of stuffing a suitcase with random clothes (more parameters), fold and choose outfits you’ll actually wear (capabilities that compound: reasoning, planning, tool use). You carry less but live better.
  2. Orchestra with section leaders: Not everyone plays at once (Mixture-of-Experts); only the right experts play for each note, making the music both rich and efficient.
  3. Swiss Army teammate: A single teammate who can read, listen, and look, can also call tools, remember long stories, and follow house rules. That teammate outperforms a bigger but single-skill helper.

Before vs After:

  • Before: You often picked between big-and-closed (more capable, harder to deploy) or small-and-open (easier to deploy, but weaker reasoning and agents).
  • After: Gemma 4 offers open models that keep agent-grade reasoning, long context, and multimodality, yet run on a phone, laptop GPU, or a single 80 GB GPU for the largest sizes.

Why it works (intuition, no equations):

  • Selective compute (Mixture-of-Experts): For the 26B variant, only a small set of experts fire per token, so you get fast tokens-per-second without waking the entire model each time. This preserves quality where it counts while saving compute.
  • Dense quality path (31B): When you need maximum raw quality and a strong fine-tuning base, the dense model carries full capacity every step.
  • Agent-native training: Teaching the model to follow system instructions, call functions, and return structured JSON makes tool use reliable, not hacky.
  • Multimodal-first: Training to see images and video (and edge models to listen) means the model reasons over the same signals people use.
  • Long-context alignment: Optimizing attention and training data so the model stays coherent over long inputs lets it "read the whole chapter" before answering.
  • Edge awareness: Designing E2B and E4B to activate only a small effective footprint keeps latency low and battery life healthy.

Building blocks (each with a tiny sandwich):

🍞 Hook: Only calling the plumber you need saves time and money. 🥬 The Concept (Mixture of Experts, 26B): The model has many experts; a router picks a few per token to compute. How it works: 1) Router scores experts, 2) pick top ones, 3) combine outputs, 4) move on. Why it matters: Without MoE, latency and cost go up. 🍞 Anchor: Only 3.8B parameters activate per token for speed.
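The routing step can be sketched with toy experts; the expert names, scores, and top-2 choice here are invented for illustration and say nothing about Gemma 4's real router:

```python
# Toy top-k expert routing -- shows the idea, not Gemma 4's internals.
def route(scores: dict, k: int = 2) -> list:
    """Pick the k highest-scoring experts for the current token."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def moe_layer(token: float, experts: dict, scores: dict, k: int = 2) -> float:
    chosen = route(scores, k)
    # Only the chosen experts compute; the rest stay idle, saving work.
    return sum(experts[name](token) for name in chosen) / k

experts = {"math": lambda x: x * 2, "code": lambda x: x + 1, "prose": lambda x: x - 1}
scores = {"math": 0.9, "code": 0.7, "prose": 0.1}
out = moe_layer(3.0, experts, scores)   # math and code fire; prose never runs
print(out)  # → 5.0
```

The payoff is the ratio: the layer holds three experts' worth of parameters but pays for only two per token — the same logic that lets the 26B model activate only ~3.8B parameters per step.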

🍞 Hook: Carrying one heavy backpack for everything. 🥬 The Concept (Dense, 31B): Every part of the model works each step for maximum quality. How it works: 1) All layers on, 2) full capacity used, 3) rich signals, 4) strong base for fine-tuning. Why it matters: Without dense capacity, you might miss peak quality. 🍞 Anchor: The 31B ranks near the top among open models.

🍞 Hook: A recipe card that always comes out neat. 🥬 The Concept (Agent-native JSON + Function Calling): The model is trained to obey schemas and use tools. How it works: 1) Provide schema and tools, 2) follow system rules, 3) call tools, 4) return valid JSON. Why it matters: Without this, agents are brittle. 🍞 Anchor: A travel agent bot books flights and replies with perfect JSON details.

🍞 Hook: Reading glasses for long novels. 🥬 The Concept (Long Context): Designed to handle long inputs without losing the thread. How it works: 1) Attention over long ranges, 2) position strategies, 3) training on long docs, 4) evaluate retention. Why it matters: Without it, earlier facts vanish. 🍞 Anchor: Paste a whole code repo and get a refactor plan.

🍞 Hook: Watching a cooking video while reading the recipe. 🥬 The Concept (Multimodality): Jointly processing text, images, video (and edge: audio) improves understanding. How it works: 1) Encode each modality, 2) align spaces, 3) reason jointly, 4) output. Why it matters: Without it, you miss cross-signal clues. 🍞 Anchor: Upload a chart screenshot and get the exact trend explained.

Put together, these blocks make Gemma 4 feel like a careful, tool-using teammate that fits in your pocket or on a single GPU—while staying open and easy to adapt for your specific job.

03 Methodology

At a high level: Input (text, images, video, and on edge models, audio) → Encode (turn into tokens/features) → Think (Dense or MoE core with long-context attention and reasoning) → Act (optionally call functions, follow system instructions, output structured JSON) → Output (stream results, offline if you want) → Optional: Quantize and deploy on your target hardware.

Step-by-step, like a recipe:

  1. Input preparation
  • What happens: Your app sends text, and optionally images/video (all models), and audio (E2B/E4B) to the model.
  • Why this step exists: The model needs a clean, consistent way to receive different data types.
  • Example: A support agent gets a customer message (text) and a screenshot (image) of an error.
  2. Multimodal encoding
  • What happens: The system converts each modality into features the model can understand. Text becomes tokens; images/video frames become visual embeddings; audio (on edge) becomes acoustic features.
  • Why this step exists: Different signals speak different “languages.” Encoding translates them into one shared space.
  • Example: A chart image is turned into numbers that capture lines, axes, and labels; the model can now read the trend.
  3. Core model processing (Dense or MoE)
  • What happens: The model performs attention and transformations over the sequence. In Dense (31B), the full network participates every step. In MoE (26B), a router activates only a few experts per token (about 3.8B effective parameters), combining their outputs.
  • Why this step exists: This is the brain of the system—where reasoning, planning, and knowledge retrieval from its parameters happen.
  • Example: Given “Book a trip to Paris in June,” the model thinks: find dates → search flights (via tool) → ensure passport validity → propose itinerary.
  4. Long-context handling
  • What happens: The model uses position strategies and training tuned for long sequences, so it can attend to and recall information from far back in the prompt.
  • Why this step exists: Real tasks often require reading long docs, codebases, or multi-turn histories.
  • Example: You paste a 100-page API spec; the model finds the exact parameter a function needs and uses it correctly.
  5. Agent interface (function calling + system instructions + JSON)
  • What happens: The model follows system rules (like “always return JSON”), chooses tools to call when helpful, passes structured arguments, reads tool outputs, and then continues the plan. It finally returns clean results in your chosen schema.
  • Why this step exists: Without a reliable interface, agents break easily—messy text, wrong formats, or skipped tools.
  • Example: For an expense bot, the model calls a currency API, converts values, and outputs {"total": 841.23, "currency": "USD"} exactly.
  6. Output generation and streaming
  • What happens: The model generates tokens you can stream to your UI. For code assistants, it may emit fenced code blocks; for structured tasks, it sticks to valid JSON.
  • Why this step exists: Users want quick, readable answers or machine-parseable outputs, sometimes as they’re being generated.
  • Example: In an IDE, suggestions appear token-by-token as you type.
  7. Quantization and deployment
  • What happens: You optionally quantize the model to reduce memory and improve latency on your target device (edge phone, laptop GPU, workstation, or single 80 GB GPU for bf16 weights). Then you run it with your preferred runtime (e.g., vLLM, llama.cpp, MLX, Ollama, NVIDIA stacks, etc.).
  • Why this step exists: Right-sizing the model to your hardware makes it fast enough and battery-friendly.
  • Example: You pick a 4-bit quantized E4B for a phone app; for a workstation code assistant, you use a quantized 31B on a consumer GPU.
  8. Fine-tuning or instruction-tuning (optional)
  • What happens: You adapt Gemma 4 on your task data (with TRL, Keras, Unsloth, etc.) to specialize behavior.
  • Why this step exists: Tailoring boosts accuracy and reliability on your domain.
  • Example: A hospital fine-tunes for clinical note styles and billing codes (privately, on-prem).
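The recipe above, compressed into a runnable sketch where every stage is a stand-in function. The tool name, the exchange rate, and each function body are invented for illustration — the real model does the encoding and reasoning internally:

```python
import json

# Each stage is a placeholder for what the model actually does internally.
def encode(message: str) -> list:
    return message.lower().split()        # stand-in for tokenization

def think(tokens: list) -> dict:
    # stand-in for the Dense/MoE core deciding a currency tool is needed
    return {"tool": "convert_currency", "arguments": {"amount": 750, "to": "USD"}}

def act(plan: dict) -> dict:
    # hypothetical tool registry with a made-up exchange rate
    tools = {"convert_currency":
             lambda amount, to: {"total": round(amount * 1.12, 2), "currency": to}}
    return tools[plan["tool"]](**plan["arguments"])

def respond(result: dict) -> str:
    return json.dumps(result)             # structured JSON, ready for parsing

reply = respond(act(think(encode("Total my EUR expenses in USD"))))
print(reply)  # → {"total": 840.0, "currency": "USD"}
```

The shape is the point: input flows through encode → think → act → respond, and every hand-off between stages is structured data your application can inspect.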

Secret sauce (clever bits):

  • Intelligence-per-parameter focus: The models are trained and shaped so every parameter “pulls its weight,” resulting in small models acting big.
  • MoE routing for speed: The 26B MoE activates only a subset per token, delivering fast tokens-per-second while keeping quality strong.
  • Agent-native design: System instruction following, robust JSON emission, and function calling are first-class citizens, not afterthoughts.
  • Mobile-first edge variants: E2B/E4B are built to run offline with near-zero latency and native audio on mobile/IoT.
  • Long context at scale: Up to 256K on larger models (128K on edge) means one-pass understanding of long repositories and documents.
  • Broad ecosystem: Day-one support across popular frameworks and clouds lowers friction from prototype to production.

More tiny sandwiches for clarity:

🍞 Hook: Zipping files to fit them on a USB stick. 🥬 The Concept (Quantization for deployment): Fewer bits per weight means smaller, faster models. How it works: 1) Choose precision, 2) quantize weights, 3) load runtime, 4) test quality. Why it matters: Without it, you might not fit on-device. 🍞 Anchor: A 26B MoE runs smoothly on a laptop GPU after quantization.

🍞 Hook: Reading an entire book, not just a page. 🥬 The Concept (Context windows: 128K/256K): The model can keep track of very long inputs. How it works: 1) Tokenize long text, 2) position strategies, 3) train/eval long-range tasks, 4) answer using all of it. Why it matters: Without it, the model forgets early facts. 🍞 Anchor: Paste a full repository; get a correct dependency map.

🍞 Hook: A team that sees and hears, not just reads. 🥬 The Concept (Vision and audio support): Images/video for all, audio for edge. How it works: 1) Encode signals, 2) fuse with text, 3) reason jointly, 4) output. Why it matters: Without senses, agents miss context. 🍞 Anchor: Phone assistant listens to a meeting and summarizes action items offline.

04 Experiments & Results

The test: The team checked Gemma 4 on benchmarks that matter for real use: reasoning and instruction-following (can it think and obey?), coding (can it generate useful programs offline?), multimodal understanding (can it read images and videos, do OCR, understand charts?), and long-context tasks (can it remember earlier details across very long inputs?). They also looked at how reliably it returns structured JSON and how well it runs on different hardware.

The competition: Gemma 4 was compared with leading open models across sizes on public leaderboards and datasets, including Arena.ai’s chat arena. This matters because Arena.ai pools real human preferences at scale, giving a practical sense of which models people find more helpful.

The scoreboard (with context):

  • Gemma 4 31B Dense: Ranked #3 among open models on the Arena.ai text leaderboard. Think of this like placing bronze in a global open-model Olympics—impressive especially for a model designed to run on a single 80 GB GPU or even consumer GPUs when quantized.
  • Gemma 4 26B MoE: Ranked #6 among open models. This is like coming in the top 10 while sprinting in lighter shoes—because only a subset of experts fire per token, it’s fast while staying smart.
  • Intelligence-per-parameter: On Arena.ai, Gemma 4 beats models as much as 20x its size in some comparisons. That’s like a smaller car out-accelerating a truck because it’s tuned precisely where it counts.
  • Multimodal tasks: The models excel at OCR and chart understanding, and all support images and video (edge models also support audio). This is the difference between guessing from text and actually “seeing” the screenshot.
  • Long context: Edge models at 128K and larger models up to 256K let Gemma 4 keep track of book-length inputs. It’s the difference between skimming a chapter and mastering the whole book.

Surprising findings:

  • The 26B MoE achieves very high tokens-per-second while ranking near the top—showing that selective compute can deliver both speed and quality.
  • The edge E2B/E4B models handle speech recognition and on-device tasks with near-zero latency, which feels snappy even on handheld devices.
  • Structured outputs (JSON) and function calling work with remarkable reliability in practice, making agents far easier to build without custom post-processing.

Interpreting the numbers:

  • A top-3 or top-6 placement on a widely watched leaderboard is like getting an A to A- when most strong peers score B to B+. More importantly, it does this while being easier to deploy and fine-tune, boosting practical value.
  • Beating much larger models in head-to-head comparisons shows that Gemma 4’s design squeezes more brain out of each byte—this is the paper’s central promise delivered in real tests.

Hardware context:

  • Unquantized bfloat16 weights fit on a single 80 GB H100 for the largest variants, while quantized builds run on consumer GPUs.
  • Edge models run fully offline on phones and small boards (Raspberry Pi, Jetson Orin Nano) with low latency, proving the mobile-first design works in the wild.

05 Discussion & Limitations

Limitations:

  • While Gemma 4 ranks highly among open models, some closed, ultra-large models may still lead on niche or bleeding-edge tasks.
  • Long-context quality can still degrade over very long inputs; even with 128K–256K, models sometimes miss early details if prompts aren’t well structured.
  • Tool-use reliability is strong but not perfect; poorly designed tool schemas or APIs can cause failures.
  • Multilingual coverage (140+ languages) is broad, but depth varies by language and domain—rare languages or specialized jargon may need fine-tuning.
  • On-device performance depends on exact hardware and runtime; not all phones or boards will achieve the same latency.

Required resources:

  • For maximum quality: a single 80 GB GPU (e.g., H100) for bf16 weights on 26B/31B, or consumer GPUs for quantized variants.
  • For edge: modern Android devices or IoT boards (e.g., Jetson Orin Nano), plus runtimes like AICore Developer Preview or llama.cpp.
  • For fine-tuning: frameworks such as Transformers, TRL, Keras, Unsloth, and cloud or on-prem compute depending on dataset size.

When not to use:

  • Missions requiring provable correctness (e.g., high-stakes medical diagnosis) without retrieval, review, or guardrails.
  • Ultra-long document understanding beyond the 256K range, or tasks that need persistent memory across many sessions.
  • Situations where no local or cloud compute is available and latency must be near-instant on very old hardware.

Open questions:

  • How far can intelligence-per-parameter scale as we add more modalities and longer context—where are the diminishing returns?
  • What are the best training and alignment strategies to further stabilize function calling and JSON guarantees across all languages?
  • Can edge models support even richer audio/video pipelines without draining battery?
  • How do we measure and reduce hallucinations consistently across 140+ languages and many domains?
  • What new agent patterns (planning, tool orchestration, verification) unlock the next 10x reliability boost?

06 Conclusion & Future Work

Three-sentence summary: Gemma 4 is an open family of models designed for intelligence-per-parameter, so even relatively small models deliver strong reasoning, long-context understanding, and reliable agent skills. It comes in edge-friendly E2B/E4B and larger 26B MoE/31B Dense variants, supports text, images, video (and audio on edge), and returns clean structured outputs with function calling. It ranks near the top of open-model leaderboards while staying easy to deploy on phones, laptops, and a single 80 GB GPU—and it’s all under Apache 2.0.

Main achievement: Turning open, right-sized models into agent-grade teammates that run almost anywhere, without giving up multimodality, long context, or quality.

Future directions: Expand sizes and hardware targets, deepen multilingual and domain expertise via fine-tuning, strengthen verification and planning for agents, and push energy efficiency and latency even further on edge devices. Expect richer tool ecosystems, better long-context training, and improvements to video/audio understanding.

Why remember this: Gemma 4 shows that being smart per byte matters more than being merely big. It makes advanced AI practical and portable, putting trustworthy, agent-ready intelligence into the hands of developers, researchers, and everyday users across the world.

Practical Applications

  • Local-first code assistant in your IDE that runs on a laptop GPU and works on airplanes.
  • On-device voice note transcriber and meeting summarizer that respects privacy.
  • Field technician helper that reads equipment photos and manuals offline to suggest fixes.
  • Customer support agent that triages tickets, calls internal APIs, and replies with valid JSON.
  • Financial analyst tool that ingests long PDFs and spreadsheets to produce risk summaries.
  • Healthcare scribe that drafts structured notes from doctor dictations on a clinic tablet.
  • Education tutor that reads a student’s essay and a rubric, then gives step-by-step feedback.
  • Manufacturing IoT monitor on Jetson Orin Nano that detects anomalies from camera feeds.
  • Multilingual travel planner that books trips via function calls and returns a full itinerary.
  • Document intelligence system that OCRs receipts and fills accounting forms automatically.
#Gemma 4 · #open models · #Mixture of Experts · #dense transformer · #agentic workflows · #function calling · #structured JSON · #long context window · #multimodal AI · #quantization · #on-device inference · #Apache 2.0 · #edge AI · #bfloat16 · #Arena AI leaderboard