What Are Gemma 4 E2B and E4B?
E2B and E4B are the edge-optimized variants of Google's Gemma 4 model family, designed specifically for on-device deployment where memory and power are constrained. The "E" stands for edge, and the numbers indicate rough parameter scale: 2 billion and 4 billion respectively.
The architectural innovation behind both models is Per-Layer Embeddings (PLE). Traditional transformers feed a single embedding vector into the first decoder layer and pass residual states forward. PLE injects a secondary embedding signal into every decoder layer, giving the network richer contextual information at each stage of processing. The result: E2B carries 5.1 billion total parameters (PLE weights included) but keeps only 2.3 billion active during inference, so it performs like a 5B-class model on the compute budget of a 2.3B one.
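The mechanism can be illustrated with a toy decoder loop. This is a conceptual sketch only; the function names and the injection-by-addition step are simplifying assumptions, not Gemma 4's actual implementation:

```python
# Conceptual sketch of Per-Layer Embeddings (PLE), not real Gemma 4 code.
# A standard decoder embeds the token once; with PLE, every decoder layer
# also receives its own learned embedding for that token.

def decode_standard(token_emb, layers):
    """Baseline: the hidden state starts from one embedding, layers transform it."""
    h = list(token_emb)
    for layer in layers:
        h = layer(h)
    return h

def decode_with_ple(token_emb, layers, per_layer_embs):
    """PLE: a layer-specific embedding is injected before every layer."""
    h = list(token_emb)
    for layer, ple in zip(layers, per_layer_embs):
        h = [a + b for a, b in zip(h, ple)]  # inject the per-layer signal
        h = layer(h)
    return h

# Toy example: each "layer" just doubles the hidden state.
layers = [lambda h: [2 * x for x in h]] * 3
ple = [[0.1, 0.1]] * 3
print(decode_with_ple([1.0, 0.0], layers, ple))
```

The extra signal changes the hidden state at every depth, which is why the per-layer weights add capability without adding per-token compute of the same magnitude.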
E4B scales this up to 4 billion effective parameters and serves as the default Gemma 4 model in Ollama (`ollama run gemma4`). Both models accept text, image, audio, and video (as frame sequences) as input, and both support native function calling for agentic workflows.
Why PLE matters for mobile: Per-Layer Embeddings let Google pack 5.1B-parameter-class intelligence into a 2.3B-parameter runtime footprint. You get the depth of a much larger model without the memory cost. This is the key architectural trick that makes serious multimodal AI viable on a phone.
Running on iPhone & iPad
Both E2B and E4B support CPU and GPU inference on iOS, running completely offline with low latency once the model is loaded into memory. Google collaborated with Qualcomm, MediaTek, and Arm to optimize the models for mobile silicon, and the results are striking.
E2B quantized to 4-bit precision fits in under 1.5 GB of RAM. Every iPhone from the iPhone 15 Pro onward (6 GB+ RAM) can run it comfortably while keeping the rest of the system responsive. On an iPhone 16 Pro, expect prompt processing in under a second and token generation fast enough for real-time conversational use.
E4B at roughly 3 GB quantized is a better fit for iPads and iPhone Pro Max models with 8 GB of RAM. According to LLMCheck, E4B on an 8 GB iPhone achieves quality rivaling Gemma 3 12B while fitting in roughly one-third the memory. That is a generational leap in capability-per-byte.
What this unlocks on device:
- Offline voice assistant: Native audio input means E2B can process speech directly without a separate transcription step.
- Photo analysis: Point your camera, feed the frame to E4B, get a detailed description or answer questions about what you see.
- Private document processing: Summarize PDFs, extract data from receipts, or translate text without any data leaving the device.
Mac Mini & MacBook Air: The Perfect Hosts
Apple Silicon Macs with Unified Memory are ideal for running Gemma 4 edge models. A base Mac Mini with 16 GB of RAM can load both E2B and E4B simultaneously, with over 10 GB of headroom remaining for the OS and applications. This makes it a compelling always-on edge AI server.
An 8 GB MacBook Air runs E4B on its own, with enough memory left over for normal productivity tasks. For developers, this means you can test and iterate on Gemma 4 apps locally without needing a cloud API or a high-end workstation.
According to LLMCheck benchmarks, E2B hits approximately 155 tok/s on an M5 Max and E4B reaches around 125 tok/s on the same hardware. Even on a base M3 MacBook Air, E4B generates tokens fast enough for interactive use at 45-55 tok/s. The combination of Metal GPU acceleration and Unified Memory's zero-copy architecture means Apple Silicon extracts maximum performance from these already-efficient models.
Performance Benchmarks
How do Gemma 4's edge models compare to other small LLMs on Apple Silicon? Here is the data from LLMCheck testing:
| Model | Params | RAM (Q4) | M5 Max tok/s | Multimodal | Function Calling |
|---|---|---|---|---|---|
| Gemma 4 E2B | 2.3B (5.1B PLE) | ~1.5 GB | ~155 | Text+Image+Audio+Video | Native |
| Gemma 4 E4B | 4B | ~3 GB | ~125 | Text+Image+Audio+Video | Native |
| Phi-4 Mini | 3.8B | ~2.8 GB | ~130 | Text only | No |
| Qwen 3.5 4B | 4B | ~3 GB | ~120 | Text only | Basic |
| Gemma 3 4B | 4B | ~3 GB | ~118 | Text+Image | No |
The standout result is clear: Gemma 4 E2B delivers competitive speed at roughly half the memory of its peers, E4B matches or beats larger competitors on quality, and the two Gemma models are the only ones in this class with full multimodal input and native function calling.
Multimodal + Audio: What Other Small Models Can't Do
Before Gemma 4, getting multimodal capabilities in a sub-4B model meant accepting severe trade-offs. Gemma 3 4B supported text and images but not audio. Phi-4 Mini is text-only. Qwen 3.5 4B handles text only at this parameter scale.
Gemma 4 E2B and E4B are the smallest models to combine all four input modalities: text, image, audio, and video (processed as frame sequences). Native audio input is particularly significant for mobile deployment because it eliminates the need for a separate speech-to-text pipeline. The model processes raw audio waveforms directly, reducing latency and memory overhead.
256K context window on a phone: Both E2B and E4B support a 256,000 token context window, even at their tiny parameter counts. This means a mobile app can process an entire book, hours of meeting notes, or long conversation histories without truncation. No other sub-4B model offers this context length.
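Before stuffing a long document into that window, an app still needs a budget check. A minimal sketch follows; the 4-characters-per-token ratio is a rough heuristic for English prose, not the model's real tokenizer, and the reserve value is an illustrative assumption:

```python
# Rough check of whether a document fits in a 256K-token context window.
# CHARS_PER_TOKEN = 4 is a common English-text heuristic; use the model's
# actual tokenizer for production token counts.

CONTEXT_TOKENS = 256_000
CHARS_PER_TOKEN = 4

def fits_in_context(text: str, reserve_for_output: int = 2_000) -> bool:
    """Estimate token count and leave room for the model's response."""
    est_tokens = len(text) / CHARS_PER_TOKEN
    return est_tokens + reserve_for_output <= CONTEXT_TOKENS

# A ~300-page book at ~2,000 chars/page is ~600K chars, roughly 150K tokens.
book = "x" * 600_000
print(fits_in_context(book))  # → True
```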
Video understanding works by feeding frame sequences to the model. While not true real-time video processing, it enables practical applications like analyzing short video clips, processing security camera footage, or understanding screen recordings on-device.
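The frame-selection step can be sketched as follows, assuming the app extracts the actual frames elsewhere (e.g. via AVFoundation or ffmpeg); the sampling rate and frame cap are illustrative values, not Gemma 4 requirements:

```python
# Sketch: pick which frames of a clip to send to the model, sampling at a
# fixed rate and capping the total so the prompt stays small.

def sample_frame_indices(total_frames: int, video_fps: float,
                         sample_fps: float = 1.0, max_frames: int = 32):
    """Evenly spaced frame indices at ~sample_fps, capped at max_frames."""
    step = max(1, round(video_fps / sample_fps))
    indices = list(range(0, total_frames, step))
    if len(indices) > max_frames:  # downsample further for long clips
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices

# 10-second clip at 30 fps, sampled at 1 frame per second:
print(sample_frame_indices(300, 30.0))  # → [0, 30, 60, ..., 270]
```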
Building Apps with Gemma 4
The combination of Apache 2.0 licensing, native function calling, and mobile-optimized inference makes Gemma 4 edge models a practical foundation for commercial mobile apps. Unlike models with restrictive licenses, Apache 2.0 permits commercial use, modification, and redistribution with no royalties or usage fees.
Native function calling is what transforms these from chat models into agentic tools. You can define functions that the model can invoke — checking calendar events, querying a local database, controlling smart home devices, or making API calls. The model decides when and how to call these functions based on the user's natural language input, enabling sophisticated automation workflows that run entirely on-device.
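The app-side half of that loop can be sketched as a small tool registry plus a dispatcher. The tool name and the JSON call shape below are illustrative assumptions; the actual wire format depends on the inference runtime you use:

```python
# App-side dispatch for function calling: the model emits a tool call
# (name + JSON arguments) and the app executes it locally. Tool names and
# the call format here are hypothetical, not Gemma 4's wire format.

import json

def get_calendar_events(date: str) -> list:
    """Hypothetical local tool; a real app would query EventKit or a DB."""
    return [{"title": "Standup", "date": date, "time": "09:00"}]

TOOLS = {"get_calendar_events": get_calendar_events}

def dispatch(tool_call_json: str):
    """Execute a model-issued tool call against the local registry."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# Simulate a call the model might emit for "what's on my calendar today?":
result = dispatch('{"name": "get_calendar_events", "arguments": {"date": "2025-06-01"}}')
print(result[0]["title"])  # → Standup
```

The result is then fed back to the model as a tool response so it can compose the final natural-language answer.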
For Mac developers, Ollama provides the simplest path to integration. Running `ollama run gemma4` pulls E4B by default and exposes a local API that any application can call. For iOS developers, the Arm-optimized builds and Google's MediaPipe framework provide the inference runtime needed for App Store deployment.
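A minimal client against Ollama's local `/api/generate` endpoint might look like the sketch below. The `gemma4` model tag comes from the article's own example; the endpoint, port, and payload fields follow Ollama's standard HTTP API, with images passed as base64 strings:

```python
# Minimal client for Ollama's local HTTP API (default port 11434).
# Requires a running Ollama instance with the model pulled.

import json
import urllib.request

def build_request(prompt, model="gemma4", images_b64=None):
    """Assemble a non-streaming generate payload; images are base64 strings."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if images_b64:
        payload["images"] = images_b64
    return payload

def generate(prompt, **kwargs):
    """POST the payload to the local Ollama server and return the text."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_request(prompt, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("Summarize this receipt.", images_b64=[receipt_b64])  # needs Ollama running
```

Because the API is plain HTTP on localhost, the same pattern works from Swift, TypeScript, or any other language with an HTTP client.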
According to LLMCheck, the practical implication is this: a solo developer can now ship an iOS app with multimodal AI capabilities that, just eighteen months ago, would have required a cloud API costing thousands of dollars per month. The model runs on the user's device, the license is permissive, and the quality is competitive with much larger models.