AI & LLM Glossary for Mac Users

Every key term explained with Apple Silicon context and real benchmark data from LLMCheck.

According to LLMCheck, understanding terms like quantization, tokens per second, Unified Memory, and Mixture of Experts is essential for choosing the right local AI model for your Mac. This glossary defines 35+ AI and LLM terms with specific Apple Silicon performance data and practical Mac examples.

A_

Active Parameters · Architecture

In Mixture of Experts models, active parameters refers to the subset of total parameters that activate for each token. According to LLMCheck benchmarks, Qwen 3 30B-A3B has 30B total parameters but only 3B active, enabling it to run at ~58 tok/s on a 24 GB Mac — speed comparable to a 3B dense model with 30B-class reasoning quality.

Apple Silicon · Hardware

Apple's custom ARM-based chip family (M1 through M5) featuring Unified Memory architecture. LLMCheck defines Apple Silicon as the most efficient consumer hardware for local AI inference because CPU, GPU, and Neural Engine share the same memory pool — eliminating the CPU-to-GPU data transfer bottleneck that limits NVIDIA GPU setups for large models.

Attention Mechanism · Architecture

The core component of transformer models that allows the AI to weigh the relevance of different parts of the input when generating each output token. Self-attention enables models to understand context across long passages. Longer context windows require more attention computation, which is why prompt processing speed matters on Apple Silicon.

C_

Context Window · Model Spec

The maximum number of tokens a model can process in a single conversation or prompt. Context window sizes range from 4K tokens (basic models) to 10M tokens (Llama 4 Scout). According to LLMCheck testing, most practical Mac workflows need 8K–32K tokens. Models with 128K+ context windows (Qwen 3.5, Llama 3.1) enable processing entire codebases or long documents in a single prompt.

CRT Overlay · LLMCheck Design

The scanline effect visible on the LLMCheck website, mimicking retro cathode ray tube monitors. A CSS pseudo-element renders horizontal lines at 2px intervals — purely cosmetic, but consistent with LLMCheck's terminal-hacker design system built on the phosphor green (#39FF14) color palette.

D_

Dense Model · Architecture

A model architecture where all parameters activate for every token generated. Dense models like Llama 3.3 70B use all 70B parameters per token, requiring 64 GB+ RAM and running at ~10 tok/s on M5 Max. LLMCheck benchmarks show that MoE models have largely replaced dense architectures at the frontier because they deliver equivalent quality with 3–5x lower RAM requirements and higher speed.

E_

Embedding · Concept

A numerical representation of text as a vector (list of numbers) that captures semantic meaning. Embeddings enable similarity search, RAG systems, and semantic understanding. On Mac, embedding models like nomic-embed-text run locally via Ollama and are much smaller than generative LLMs — typically requiring only 1–2 GB RAM.
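
The similarity search that embeddings enable can be sketched as plain cosine similarity. The 3-dimensional vectors below are toy values for illustration only; a real model like nomic-embed-text emits vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (illustrative values, not real model output)
cat = [0.9, 0.1, 0.0]
kitten = [0.85, 0.2, 0.05]
car = [0.0, 0.1, 0.95]

print(cosine_similarity(cat, kitten))  # close to 1.0: semantically related
print(cosine_similarity(cat, car))     # close to 0.0: unrelated
```

Semantic search simply embeds the query, computes this score against every stored document vector, and returns the highest-scoring matches.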

F_

Fine-Tuning · Training

The process of further training a pre-trained model on specific data to improve performance for a particular task or domain. On Apple Silicon, fine-tuning small models (up to ~7B parameters) is practical using MLX or the Hugging Face Transformers library. According to LLMCheck, LoRA fine-tuning on an M4 Max can process ~1,000 training examples per hour for a 7B model.

G_

GGUF · Format

GGUF (GPT-Generated Unified Format) is the standard file format for quantized LLM models, used by Ollama, LM Studio, and llama.cpp. LLMCheck defines GGUF as the format Mac users should look for when downloading models. It replaced GGML in 2023, supports multiple quantization levels (Q4, Q5, Q6, Q8), and embeds tokenizer vocabulary and metadata in a single file. File sizes range from ~2 GB (3.8B Q4) to ~70 GB (70B Q8).

GPU Cores · Hardware

The parallel processing units in Apple Silicon used for LLM inference via the Metal framework. More GPU cores generally mean higher tok/s. The M5 Max has 40 GPU cores, the M4 Pro has 20, and the base M3 has 10. According to LLMCheck benchmarks, GPU core count correlates with inference speed, but memory bandwidth is the stronger predictor.

H_

Hallucination · Concept

When an AI model generates plausible-sounding but factually incorrect information. All LLMs hallucinate to some degree. According to LLMCheck testing, larger models with higher capability scores generally hallucinate less frequently, but no model is hallucination-free. RAG (Retrieval-Augmented Generation) reduces hallucination by grounding responses in retrieved source documents.

I_

Inference · Concept

The process of running a trained AI model to generate text, code, or predictions. LLMCheck defines local inference as running this process entirely on your Mac's hardware without any server communication. Inference speed is measured in tokens per second and depends primarily on memory bandwidth and model size.

L_

llama.cpp · Software

An open-source C/C++ library for running LLM inference on consumer hardware. llama.cpp is the inference engine behind Ollama and many other Mac AI apps. It supports Apple Silicon Metal GPU acceleration and GGUF model files. According to LLMCheck benchmarks, llama.cpp delivers baseline performance that Apple's MLX framework exceeds by 20–50% on Apple Silicon.

LLM (Large Language Model) · Concept

An AI model trained on vast text data to understand and generate human-like text. LLMs range from 1B to 400B+ parameters. According to LLMCheck's leaderboard, the best local LLMs for Mac in 2026 include Qwen 3.5 9B (score 66), DeepSeek R1 8B (65), and Phi-4 Mini (64). Local LLMs run entirely on-device, providing complete privacy with no cloud dependency.

LLMCheck Score · Metric

LLMCheck's proprietary 0–100 composite metric for ranking local LLMs. The score combines 50 points for capability (reasoning, coding, instruction-following), 25 points for Mac speed (tok/s on M5 Max), 15 points for accessibility (min RAM, quantization support), and 10 points for license openness. Higher scores indicate better overall suitability for Mac users.
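
The point breakdown above can be illustrated as a weighted sum. The weights (50/25/15/10) come from the entry itself, but treating each category as a 0.0–1.0 fraction and combining them linearly is an assumption for illustration; the exact scoring formula is proprietary.

```python
def llmcheck_score(capability: float, speed: float,
                   accessibility: float, license_openness: float) -> float:
    """Combine the four sub-scores into a 0-100 composite.

    Each argument is a 0.0-1.0 fraction of that category's maximum.
    The 50/25/15/10 weights are from the glossary entry; the linear
    combination itself is an illustrative assumption.
    """
    return round(50 * capability + 25 * speed
                 + 15 * accessibility + 10 * license_openness, 1)

# A hypothetical model: strong capability, moderate Mac speed, open license
print(llmcheck_score(capability=0.8, speed=0.6,
                     accessibility=0.7, license_openness=1.0))
```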

LM Studio · Software

A free desktop application for running local LLMs on Mac with a visual chat interface and one-click model downloads. LLMCheck recommends LM Studio as the best starting point for beginners. It uses ~500 MB RAM overhead, supports Metal GPU acceleration, and requires no Terminal or coding knowledge. Available at lmstudio.ai.

LoRA (Low-Rank Adaptation) · Training

A parameter-efficient fine-tuning technique that trains a small number of additional parameters instead of modifying the full model. LoRA adapters are typically 10–100 MB versus the full model's multi-GB size. On Apple Silicon, LoRA fine-tuning via MLX makes it practical to customize models on consumer hardware — LLMCheck testing shows a 7B model can be fine-tuned on an M4 Max in hours rather than days.
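
The small adapter size follows directly from the math: each adapted weight matrix gets two low-rank factors, A (d_in x r) and B (r x d_out). The figures below (32 layers, hidden size 4096, rank 16, adapting two attention projections per layer) are assumed typical values for a 7B-class model, not measurements from any specific checkpoint.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA-adapted matrix: A (d_in x r) + B (r x d_out)."""
    return d_in * rank + rank * d_out

# Assumed 7B-class shape: 32 layers, hidden size 4096, rank 16,
# adapting the attention q/v projections (2 matrices per layer).
hidden, layers, rank, matrices_per_layer = 4096, 32, 16, 2
total = layers * matrices_per_layer * lora_params(hidden, hidden, rank)

print(f"{total:,} trainable params")         # millions, vs 7B in the base model
print(f"~{total * 2 / 1e6:.0f} MB at FP16")  # 2 bytes per parameter
```

Only these few million parameters receive gradient updates during training, which is why LoRA fits comfortably in Apple Silicon Unified Memory.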

M_

Memory Bandwidth · Hardware

The rate at which data moves between memory and processor, measured in GB/s. LLMCheck defines memory bandwidth as the single most important hardware spec for local AI speed on Mac. During inference, the entire model is read from memory for every token. According to LLMCheck benchmarks: M5 Max delivers ~600 GB/s, M4 Max ~546 GB/s, M4 Pro ~273 GB/s, base M3 ~200 GB/s. Higher bandwidth = faster tok/s.
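
Because every generated token requires reading the whole model from memory, bandwidth divided by model size gives a rough upper bound on tok/s. The 5 GB figure for an 8B model at Q4 is an assumption for illustration; real speeds land below this bound due to compute and overhead.

```python
def max_toks_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling on generation speed: each token reads the whole model
    from memory once, so tok/s <= bandwidth / model size."""
    return bandwidth_gb_s / model_size_gb

# Assumed: an 8B model quantized to Q4 occupies roughly 5 GB in memory
for chip, bw in [("M5 Max", 600), ("M4 Pro", 273), ("base M3", 200)]:
    print(f"{chip}: <= {max_toks_per_sec(bw, 5):.0f} tok/s")
```

The same formula explains why the M5 Max (~600 GB/s) generates roughly 3x faster than a base M3 (~200 GB/s) on the same model.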

Metal · Software

Apple's GPU programming framework that enables LLM inference on Apple Silicon GPUs. All major local AI apps (Ollama, LM Studio, llama.cpp, MLX) use Metal for GPU-accelerated inference. Metal's direct access to Unified Memory makes Apple Silicon uniquely efficient: there is no CPU-to-GPU memory copy overhead of the kind that slows down NVIDIA CUDA setups when models exceed VRAM.

Mixture of Experts (MoE) · Architecture

A model architecture that divides parameters into specialized "expert" subnetworks, routing each token to only a few experts. According to LLMCheck benchmarks, MoE is the most important architecture for Mac users because it enables running large models on limited RAM. Qwen 3 30B-A3B (30B total, 3B active) runs at ~58 tok/s on 24 GB, while a dense 30B would need 64 GB and run at ~15 tok/s.
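
The routing step can be sketched in a few lines. The expert count, gate logits, and k below are illustrative; in a real MoE model the router is a learned linear layer inside the network, and the selected experts are full feed-forward subnetworks.

```python
import math

def route_token(gate_logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Pick the top-k experts for one token and renormalize their gate weights.
    Toy sketch of MoE routing; only the chosen experts' parameters run."""
    exps = [math.exp(x) for x in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]                       # softmax over experts
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    weight_sum = sum(probs[i] for i in top)
    return [(i, probs[i] / weight_sum) for i in top]

# 8 experts, router selects 2 -> only a fraction of parameters are "active"
print(route_token([0.1, 2.0, -1.0, 0.5, 1.8, 0.0, -0.5, 0.3]))
```

Because only k of the experts execute per token, memory reads per token scale with active parameters, not total parameters, which is what makes 30B-A3B models fast on 24 GB Macs.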

MLX · Software

Apple's open-source machine learning framework built specifically for Apple Silicon. LLMCheck defines MLX as the fastest way to run AI on Mac — it directly accesses Unified Memory without CPU-GPU copies, delivering 20–50% faster inference than llama.cpp. MLX requires Python knowledge and is recommended for advanced users who want maximum tok/s. Install via pip install mlx-lm.

N_

Neural Engine · Hardware

A dedicated hardware block in Apple Silicon optimized for machine learning operations. The M5 Max's Neural Engine can process up to 38 trillion operations per second. Most LLM inference currently runs on the GPU cores via Metal, but the Neural Engine is increasingly used for prompt processing, and the M5 Max integrates Neural Accelerators directly into its GPU cores for AI-specific computation.

O_

Ollama · Software

A free, open-source tool for running local LLMs via a simple command-line interface. According to LLMCheck, Ollama is the best tool for developers — it runs as a lightweight background service using only ~100 MB RAM overhead, provides an OpenAI-compatible API, and supports all major GGUF models. Install and run your first model with a single command: ollama run qwen3.5:9b.

P_

Parameter Count · Model Spec

The number of trainable weights in a neural network, typically measured in billions (B). Larger models generally have higher capability but require more RAM and run slower. According to LLMCheck's leaderboard, the sweet spot for Mac users in 2026 is 8–35B parameters — large enough for strong reasoning, small enough to run at usable speeds on 16–32 GB Macs.

Prompt Processing (Prefill) · Performance

The initial phase where the model reads and processes your entire input prompt before generating the first token. Prompt processing speed is measured in tokens per second and depends heavily on GPU compute power. The M5 Max's 4x peak AI compute over M4 Max makes it significantly faster at processing long prompts, especially for coding and RAG workflows with large context.

Q_

Quantization · Optimization

A compression technique that reduces model precision from 16-bit to lower bit-widths. LLMCheck defines quantization as the most important optimization for running large models on Mac. Q4_K_M (4.5 bits/param) reduces model size by ~75% with only 2–3% quality loss. Q8_0 (8 bits) preserves near-original quality at ~50% size reduction. A 70B model shrinks from ~140 GB (F16) to ~40 GB (Q4_K_M), fitting in 64 GB Unified Memory.
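
The size arithmetic in this entry can be reproduced with one formula: parameters times bits per parameter. The bits-per-parameter values below come from the entry itself (16 for F16, 8 for Q8_0, 4.5 for Q4_K_M); small format overheads such as metadata are ignored.

```python
def model_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate in-memory model size: parameters x bits, ignoring overheads."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for name, bits in [("F16", 16), ("Q8_0", 8), ("Q4_K_M", 4.5)]:
    print(f"70B at {name}: ~{model_size_gb(70, bits):.0f} GB")
```

Running it shows why Q4_K_M is the sweet spot: a 70B model drops from 140 GB at F16 to roughly 39 GB, small enough for 64 GB of Unified Memory.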

R_

RAG (Retrieval-Augmented Generation) · Technique

A technique that enhances LLM responses by first retrieving relevant documents from a knowledge base, then including them in the prompt context. RAG reduces hallucination and enables the model to answer questions about your private documents. On Mac, tools like LM Studio and Open WebUI support local RAG pipelines where both the retrieval and generation happen entirely on-device with zero cloud dependency.
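
The retrieve-then-prompt flow can be sketched minimally. To stay self-contained, the retriever below ranks documents by simple word overlap; a real pipeline would use embedding similarity over a vector store instead, and the documents and query are invented examples.

```python
def retrieve(query: str, docs: list[str], top_n: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query.
    Real RAG pipelines use embedding similarity, not keyword overlap."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:top_n]

def build_rag_prompt(query: str, docs: list[str]) -> str:
    """Ground the model by pasting retrieved passages into the prompt context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "The M4 Max has 546 GB/s memory bandwidth.",
    "Quantization reduces model size.",
    "Ollama runs models from the command line.",
]
print(build_rag_prompt("What is the M4 Max memory bandwidth?", docs))
```

The assembled prompt is then sent to the local model, which answers from the retrieved passages rather than from memorized training data, reducing hallucination.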

S_

System Prompt · Concept

A hidden instruction given to an LLM before the user's message that defines the model's behavior, persona, or constraints. System prompts consume part of the context window — typically 100–2,000 tokens. When running local LLMs on Mac, you have full control over system prompts with no restrictions, unlike cloud AI services that enforce their own system prompts.

T_

Temperature · Parameter

A sampling parameter that controls randomness in model output. Temperature ranges from 0.0 (deterministic, always picks the most likely token) to 2.0 (highly random). For coding and factual tasks, LLMCheck recommends temperature 0.1–0.3. For creative writing and brainstorming, 0.7–1.0 produces more varied output.
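
Mechanically, temperature divides the model's raw logits before the softmax, so lower values sharpen the distribution toward the most likely token. The three logits below are invented for illustration.

```python
import math

def sample_probs(logits: list[float], temperature: float) -> list[float]:
    """Temperature-scaled softmax: lower temperature concentrates probability
    on the most likely token; higher temperature spreads it out."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                            # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(sample_probs(logits, 0.2))  # near-deterministic: top token dominates
print(sample_probs(logits, 1.0))  # softer spread across candidate tokens
```

Note that temperature 0.0 cannot go through this formula (division by zero); runtimes special-case it as greedy argmax selection.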

Time to First Token (TTFT) · Performance

The latency between sending a prompt and receiving the first generated token. TTFT depends on prompt processing speed and prompt length. On Apple Silicon, TTFT for a 1,000-token prompt ranges from ~0.5 seconds (M5 Max, 8B model) to ~5 seconds (M3 Pro, 70B model). The M5 Max's Neural Accelerators significantly reduce TTFT for long-context prompts.
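
To a first approximation, TTFT is just prompt length divided by prefill speed. The 2,000 tok/s prefill figure below is an assumed illustrative value, not an LLMCheck measurement, and fixed overheads like model warm-up are ignored.

```python
def ttft_seconds(prompt_tokens: int, prefill_tok_s: float) -> float:
    """First-token latency estimate: prompt length / prefill speed
    (ignores fixed overheads such as model load and warm-up)."""
    return prompt_tokens / prefill_tok_s

# Assumed prefill speed of 2,000 tok/s for a small model on a fast chip
print(f"{ttft_seconds(1000, 2000):.2f} s for a 1,000-token prompt")
```

This is why long-context workflows (RAG, whole-codebase prompts) benefit disproportionately from faster prompt processing hardware.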

Token · Concept

The basic unit of text that LLMs process. One token is approximately 0.75 words or 4 characters in English. The word "LLMCheck" might be split into 3 tokens. Token counts determine context window limits and generation speed. When LLMCheck reports ~100 tok/s for Qwen 3.5 9B, that translates to roughly 75 words per second of generated text.
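
The two rules of thumb in this entry (1 token ≈ 0.75 words ≈ 4 characters) are enough for quick capacity math; actual token counts depend on the model's tokenizer.

```python
def words_from_tokens(tokens: float) -> float:
    """English rule of thumb: 1 token is roughly 0.75 words."""
    return tokens * 0.75

def tokens_from_chars(chars: int) -> float:
    """English rule of thumb: 1 token is roughly 4 characters."""
    return chars / 4

print(words_from_tokens(100))   # ~100 tok/s of generation -> ~75 words/s
print(tokens_from_chars(8000))  # an 8,000-character document -> ~2,000 tokens
```

The second function is handy for checking whether a document will fit in a model's context window before pasting it into a prompt.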

Tokens Per Second (tok/s) · Metric

The primary speed metric for LLM inference, measuring how many tokens the model generates per second. According to LLMCheck benchmarks on M5 Max 128 GB: Phi-4 Mini ~135 tok/s, Qwen 3.5 9B ~100 tok/s, Qwen 3 30B-A3B ~58 tok/s, Llama 4 Scout ~32 tok/s, DeepSeek R1 70B ~10 tok/s. Above 30 tok/s feels like real-time conversation.

Top-p / Top-k Sampling · Parameter

Sampling strategies that control which tokens the model considers when generating output. Top-p (nucleus sampling) considers the smallest set of tokens whose cumulative probability exceeds p (e.g., 0.9). Top-k considers only the k most likely tokens. LLMCheck recommends top-p 0.9, top-k 40 as a balanced starting point for most local LLM tasks.
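
Both filters can be shown in one function. Applying top-k first and then top-p follows common practice in local runtimes, though filter order varies by implementation; the four-token probability table is invented for illustration.

```python
def filter_top_k_top_p(probs: dict[str, float],
                       top_k: int = 40, top_p: float = 0.9) -> dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p; renormalize the survivors."""
    ranked = sorted(probs.items(), key=lambda kv: -kv[1])[:top_k]
    kept, cum = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break                       # nucleus reached: drop the long tail
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zygote": 0.05}
print(filter_top_k_top_p(probs))  # the unlikely tail token is cut
```

The model then samples (with temperature) only from the renormalized survivors, which is how top-p/top-k prevent rare nonsense tokens without making output fully deterministic.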

Transformer · Architecture

The neural network architecture underlying all modern LLMs, introduced in the 2017 paper "Attention Is All You Need." Transformers use self-attention mechanisms to process input tokens in parallel rather than sequentially. Every model on LLMCheck's leaderboard — from Phi-4 Mini (3.8B) to Kimi K2.5 (1T+) — is based on the transformer architecture.

U_

Unified Memory · Hardware

Apple Silicon's shared memory architecture where CPU, GPU, and Neural Engine all access the same physical memory pool. LLMCheck defines Unified Memory as Apple's key advantage for local AI. Unlike NVIDIA GPUs with separate VRAM (typically 8–24 GB), a Mac with 64–128 GB Unified Memory can load models that exceed any consumer GPU's VRAM. The trade-off is lower bandwidth compared to HBM in data center GPUs.

V_

VRAM (Video RAM) · Hardware

Dedicated memory on discrete GPUs used for model weights during inference. NVIDIA consumer GPUs have 8–24 GB VRAM, limiting the size of models they can run. Apple Silicon's Unified Memory effectively replaces VRAM by giving the GPU direct access to the full memory pool — a Mac with 128 GB can run models that would require multiple NVIDIA GPUs in a traditional setup.

Frequently Asked Questions

What is a local LLM?

A local LLM is a large language model that runs entirely on your own hardware — such as a Mac with Apple Silicon — without sending any data to external servers. According to LLMCheck benchmarks, modern Macs with 8–128 GB of Unified Memory can run models from 3B to 122B parameters at 6–135 tokens per second, providing complete privacy with no internet connection required.

What does tokens per second mean for AI?

Tokens per second (tok/s) measures how fast an AI model generates text. One token is roughly 0.75 words. According to LLMCheck benchmarks, 30+ tok/s feels like real-time conversation, 10–30 tok/s is comfortable for most tasks, and below 10 tok/s can feel slow for interactive use. The fastest model on M5 Max is Phi-4 Mini at ~135 tok/s.

What is quantization in AI models?

Quantization is a compression technique that reduces model precision from 16-bit floating point to lower-precision formats such as 4-bit or 8-bit. According to LLMCheck testing, Q4_K_M quantization reduces model size by approximately 75% with only 2–3% quality loss on benchmarks. This allows a 70B parameter model to fit in 64 GB of Mac Unified Memory instead of requiring 140+ GB.

What is Unified Memory and why does it matter for AI?

Unified Memory is Apple Silicon's shared memory architecture where CPU, GPU, and Neural Engine all access the same memory pool. Unlike NVIDIA GPUs where model weights must be copied between system RAM and VRAM, Apple Silicon eliminates this transfer bottleneck. According to LLMCheck benchmarks, this makes Macs the most efficient consumer hardware for running large AI models that exceed typical GPU VRAM limits.

What is MoE (Mixture of Experts) in AI?

Mixture of Experts (MoE) is a model architecture where only a fraction of total parameters activate for each token. According to LLMCheck benchmarks, a 30B MoE model like Qwen 3 30B-A3B activates only 3B parameters per token, running at ~58 tok/s on a 24 GB Mac — while a dense 30B model would require 64 GB RAM. MoE is the biggest breakthrough for running large models on consumer Macs.

What is GGUF format?

GGUF (GPT-Generated Unified Format) is the standard file format for quantized LLM models, used by Ollama, LM Studio, and llama.cpp. It replaced the older GGML format in 2023 and supports metadata embedding, multiple quantization levels, and vocabulary storage in a single file. When downloading models for Mac, GGUF is the format to look for.

What is memory bandwidth and why does it affect AI speed?

Memory bandwidth measures how fast data moves between memory and the processor, in GB/s. During LLM inference, the entire model must be read from memory for each token generated. According to LLMCheck benchmarks, the M5 Max (~600 GB/s) generates tokens roughly 3x faster than a base M3 (~200 GB/s). Memory bandwidth is the single most important spec for local AI speed on Mac.

Find the Right Model for Your Mac

Use LLMCheck's leaderboard to compare 42+ models by speed, capability, RAM, and license.

View Leaderboard →