What Are Gemma 4 E2B and E4B?
E2B and E4B are the edge-optimized variants of Google's Gemma 4 model family, designed specifically for on-device deployment where memory and power are constrained. The "E" stands for edge, and the numbers indicate rough parameter scale: 2 billion and 4 billion respectively.
The architectural innovation behind both models is Per-Layer Embeddings (PLE). Traditional transformers feed a single embedding vector into the first decoder layer and pass residual states forward. PLE injects a secondary embedding signal into every decoder layer, giving the network richer contextual information at each stage of processing. The result: E2B carries 5.1 billion total parameters (PLE weights included) but keeps only 2.3 billion active during inference, so it performs like a 5B-class model on the compute budget of a 2.3B one.
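The mechanism can be illustrated with a toy decoder loop. This is a conceptual sketch only; the function names and the injection-by-addition step are simplifying assumptions, not Gemma 4's actual implementation:

```python
# Conceptual sketch of Per-Layer Embeddings (PLE), not real Gemma 4 code.
# A standard decoder embeds the token once; with PLE, every decoder layer
# also receives its own learned embedding for that token.

def decode_standard(token_emb, layers):
    """Baseline: the hidden state starts from one embedding, layers transform it."""
    h = list(token_emb)
    for layer in layers:
        h = layer(h)
    return h

def decode_with_ple(token_emb, layers, per_layer_embs):
    """PLE: a layer-specific embedding is injected before every layer."""
    h = list(token_emb)
    for layer, ple in zip(layers, per_layer_embs):
        h = [a + b for a, b in zip(h, ple)]  # inject the per-layer signal
        h = layer(h)
    return h

# Toy example: each "layer" just doubles the hidden state.
layers = [lambda h: [2 * x for x in h]] * 3
ple = [[0.1, 0.1]] * 3
print(decode_with_ple([1.0, 0.0], layers, ple))
```

The extra signal changes the hidden state at every depth, which is why the per-layer weights add capability without adding per-token compute of the same magnitude.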
E4B scales this up to 4 billion effective parameters and serves as the default Gemma 4 model in Ollama (`ollama run gemma4`). Both models accept text, image, audio, and video (as frame sequences) as input, and both support native function calling for agentic workflows.
Why PLE matters for mobile: Per-Layer Embeddings let Google pack 5.1B-parameter-class intelligence into a 2.3B-parameter runtime footprint. You get the depth of a much larger model without the memory cost. This is the key architectural trick that makes serious multimodal AI viable on a phone.
Running on iPhone & iPad
Both E2B and E4B support CPU and GPU inference on iOS, running completely offline with low latency once the model is loaded into memory. Google collaborated with Qualcomm, MediaTek, and Arm to optimize the models for mobile silicon, and the results are striking.
E2B quantized to 4-bit precision fits in under 1.5 GB of RAM. Every iPhone from the iPhone 15 Pro onward (6 GB+ RAM) can run it comfortably while keeping the rest of the system responsive. On an iPhone 16 Pro, expect prompt processing in under a second and token generation fast enough for real-time conversational use.
E4B at roughly 3 GB quantized is a better fit for iPads and iPhone Pro Max models with 8 GB of RAM. According to LLMCheck, E4B on an 8 GB iPhone achieves quality rivaling Gemma 3 12B while fitting in roughly one-third the memory. That is a generational leap in capability-per-byte.
What this unlocks on device:
- Offline voice assistant: Native audio input means E2B can process speech directly without a separate transcription step.
- Photo analysis: Point your camera, feed the frame to E4B, get a detailed description or answer questions about what you see.
- Private document processing: Summarize PDFs, extract data from receipts, or translate text without any data leaving the device.
Mac Mini & MacBook Air: The Perfect Hosts
Apple Silicon Macs with Unified Memory are ideal for running Gemma 4 edge models. A base Mac Mini with 16 GB of RAM can load both E2B and E4B simultaneously, with over 10 GB of headroom remaining for the OS and applications. This makes it a compelling always-on edge AI server.
An 8 GB MacBook Air runs E4B on its own, with enough memory left over for normal productivity tasks. For developers, this means you can test and iterate on Gemma 4 apps locally without needing a cloud API or a high-end workstation.
According to LLMCheck benchmarks, E2B hits approximately 155 tok/s on an M5 Max and E4B reaches around 125 tok/s on the same hardware. Even on a base M3 MacBook Air, E4B generates tokens fast enough for interactive use at 45-55 tok/s. The combination of Metal GPU acceleration and Unified Memory's zero-copy architecture means Apple Silicon extracts maximum performance from these already-efficient models.
Performance Benchmarks
How do Gemma 4's edge models compare to other small LLMs on Apple Silicon? Here is the data from LLMCheck testing:
| Model | Params | RAM (Q4) | M5 Max tok/s | Multimodal | Function Calling |
|---|---|---|---|---|---|
| Gemma 4 E2B | 2.3B (5.1B PLE) | ~1.5 GB | ~155 | Text+Image+Audio+Video | Native |
| Gemma 4 E4B | 4B | ~3 GB | ~125 | Text+Image+Audio+Video | Native |
| Phi-4 Mini | 3.8B | ~2.8 GB | ~130 | Text only | No |
| Qwen 3.5 4B | 4B | ~3 GB | ~120 | Text only | Basic |
| Gemma 3 4B | 4B | ~3 GB | ~118 | Text+Image | No |
The standout result is clear: Gemma 4 E2B delivers competitive speed at roughly half the memory of its peers, E4B matches or beats larger competitors on quality, and the two Gemma models are the only ones in this class with full multimodal input and native function calling.
Multimodal + Audio: What Other Small Models Can't Do
Before Gemma 4, getting multimodal capabilities in a sub-4B model meant accepting severe trade-offs. Gemma 3 4B supported text and images but not audio. Phi-4 Mini is text-only. Qwen 3.5 4B handles text only at this parameter scale.
Gemma 4 E2B and E4B are the smallest models to combine all four input modalities: text, image, audio, and video (processed as frame sequences). Native audio input is particularly significant for mobile deployment because it eliminates the need for a separate speech-to-text pipeline. The model processes raw audio waveforms directly, reducing latency and memory overhead.
256K context window on a phone: Both E2B and E4B support a 256,000 token context window, even at their tiny parameter counts. This means a mobile app can process an entire book, hours of meeting notes, or long conversation histories without truncation. No other sub-4B model offers this context length.
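Before stuffing a long document into that window, an app still needs a budget check. A minimal sketch follows; the 4-characters-per-token ratio is a rough heuristic for English prose, not the model's real tokenizer, and the reserve value is an illustrative assumption:

```python
# Rough check of whether a document fits in a 256K-token context window.
# CHARS_PER_TOKEN = 4 is a common English-text heuristic; use the model's
# actual tokenizer for production token counts.

CONTEXT_TOKENS = 256_000
CHARS_PER_TOKEN = 4

def fits_in_context(text: str, reserve_for_output: int = 2_000) -> bool:
    """Estimate token count and leave room for the model's response."""
    est_tokens = len(text) / CHARS_PER_TOKEN
    return est_tokens + reserve_for_output <= CONTEXT_TOKENS

# A ~300-page book at ~2,000 chars/page is ~600K chars, roughly 150K tokens.
book = "x" * 600_000
print(fits_in_context(book))  # → True
```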
Video understanding works by feeding frame sequences to the model. While not true real-time video processing, it enables practical applications like analyzing short video clips, processing security camera footage, or understanding screen recordings on-device.
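The frame-selection step can be sketched as follows, assuming the app extracts the actual frames elsewhere (e.g. via AVFoundation or ffmpeg); the sampling rate and frame cap are illustrative values, not Gemma 4 requirements:

```python
# Sketch: pick which frames of a clip to send to the model, sampling at a
# fixed rate and capping the total so the prompt stays small.

def sample_frame_indices(total_frames: int, video_fps: float,
                         sample_fps: float = 1.0, max_frames: int = 32):
    """Evenly spaced frame indices at ~sample_fps, capped at max_frames."""
    step = max(1, round(video_fps / sample_fps))
    indices = list(range(0, total_frames, step))
    if len(indices) > max_frames:  # downsample further for long clips
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices

# 10-second clip at 30 fps, sampled at 1 frame per second:
print(sample_frame_indices(300, 30.0))  # → [0, 30, 60, ..., 270]
```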
Building Apps with Gemma 4
The combination of Apache 2.0 licensing, native function calling, and mobile-optimized inference makes Gemma 4 edge models a practical foundation for commercial mobile apps. Unlike models with restrictive licenses, Apache 2.0 permits commercial use, modification, and redistribution with no royalties or usage fees.
Native function calling is what transforms these from chat models into agentic tools. You can define functions that the model can invoke — checking calendar events, querying a local database, controlling smart home devices, or making API calls. The model decides when and how to call these functions based on the user's natural language input, enabling sophisticated automation workflows that run entirely on-device.
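The app-side half of that loop can be sketched as a small tool registry plus a dispatcher. The tool name and the JSON call shape below are illustrative assumptions; the actual wire format depends on the inference runtime you use:

```python
# App-side dispatch for function calling: the model emits a tool call
# (name + JSON arguments) and the app executes it locally. Tool names and
# the call format here are hypothetical, not Gemma 4's wire format.

import json

def get_calendar_events(date: str) -> list:
    """Hypothetical local tool; a real app would query EventKit or a DB."""
    return [{"title": "Standup", "date": date, "time": "09:00"}]

TOOLS = {"get_calendar_events": get_calendar_events}

def dispatch(tool_call_json: str):
    """Execute a model-issued tool call against the local registry."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# Simulate a call the model might emit for "what's on my calendar today?":
result = dispatch('{"name": "get_calendar_events", "arguments": {"date": "2025-06-01"}}')
print(result[0]["title"])  # → Standup
```

The result is then fed back to the model as a tool response so it can compose the final natural-language answer.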
For Mac developers, Ollama provides the simplest path to integration. Running `ollama run gemma4` pulls E4B by default and exposes a local API that any application can call. For iOS developers, the Arm-optimized builds and Google's MediaPipe framework provide the inference runtime needed for App Store deployment.
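A minimal client against Ollama's local `/api/generate` endpoint might look like the sketch below. The `gemma4` model tag comes from the article's own example; the endpoint, port, and payload fields follow Ollama's standard HTTP API, with images passed as base64 strings:

```python
# Minimal client for Ollama's local HTTP API (default port 11434).
# Requires a running Ollama instance with the model pulled.

import json
import urllib.request

def build_request(prompt, model="gemma4", images_b64=None):
    """Assemble a non-streaming generate payload; images are base64 strings."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if images_b64:
        payload["images"] = images_b64
    return payload

def generate(prompt, **kwargs):
    """POST the payload to the local Ollama server and return the text."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_request(prompt, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("Summarize this receipt.", images_b64=[receipt_b64])  # needs Ollama running
```

Because the API is plain HTTP on localhost, the same pattern works from Swift, TypeScript, or any other language with an HTTP client.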
According to LLMCheck, the practical implication is this: a solo developer can now ship an iOS app with multimodal AI capabilities that, just eighteen months ago, would have required a cloud API costing thousands of dollars per month. The model runs on the user's device, the license is permissive, and the quality is competitive with much larger models.