What is MLX?
MLX is Apple's answer to PyTorch and TensorFlow — but designed from the ground up for Apple Silicon chips. Released by Apple's machine learning research team, MLX takes full advantage of the hardware features that make M-series chips unique:
- Native Metal GPU acceleration. MLX uses Apple's Metal framework directly, bypassing the translation layers that other frameworks rely on. This means minimal overhead when dispatching computations to the GPU cores.
- Unified memory optimization. Unlike NVIDIA GPUs where data must be copied between CPU and GPU memory, Apple Silicon shares a single memory pool. MLX exploits this by avoiding unnecessary data copies, which is especially impactful for large models that barely fit in memory.
- Lazy evaluation. MLX only computes values when they are actually needed, reducing memory overhead and enabling more efficient execution graphs.
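The lazy-evaluation idea can be sketched in plain Python. This is a conceptual illustration, not MLX internals: expressions build a graph of deferred operations, and nothing is computed until a result is actually requested (in MLX, that happens when you call mx.eval() or inspect a value).

```python
# Conceptual sketch of lazy evaluation (not MLX's implementation):
# building expressions records a graph; work happens only on .eval().

class Lazy:
    def __init__(self, fn, *deps):
        self.fn = fn        # computation to run when the value is needed
        self.deps = deps    # upstream nodes this value depends on
        self._value = None
        self._done = False

    def eval(self):
        if not self._done:  # compute at most once, on demand
            args = [d.eval() for d in self.deps]
            self._value = self.fn(*args)
            self._done = True
        return self._value

def constant(v):
    return Lazy(lambda: v)

def add(a, b):
    return Lazy(lambda x, y: x + y, a, b)

def mul(a, b):
    return Lazy(lambda x, y: x * y, a, b)

# Building the graph does no arithmetic yet...
c = mul(add(constant(2), constant(3)), constant(4))
# ...the computation runs only when the value is requested.
print(c.eval())  # 20
```

Deferring work this way lets a framework see the whole computation graph before executing it, so it can skip values that are never used and fuse operations for the GPU.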
The practical result is that MLX delivers 20-50% faster token generation compared to llama.cpp (which powers Ollama) when running the same model at the same quantization level. The gap is largest on models that push memory limits, where unified memory efficiency matters most.
Prerequisites
Before installing MLX, make sure you have the following:
- Apple Silicon Mac — any M1, M2, M3, M4, or M5 Mac. MLX does not work on Intel Macs.
- Python 3.10 or later — check your version with python3 --version. If you need to install or update Python, the easiest method is through Homebrew: brew install python@3.12.
- pip — Python's package manager. It comes bundled with Python, but you can verify it with pip3 --version.
Tip: We recommend using a Python virtual environment to keep MLX dependencies isolated. Run python3 -m venv mlx-env && source mlx-env/bin/activate before proceeding with installation.
Install MLX
Install the MLX language model package with pip:
pip install mlx-lm
This installs both the core MLX framework and the language model utilities for downloading, converting, and running LLMs. The installation is lightweight — under 50MB for the framework itself.
Verify the installation:
python3 -c "import mlx.core as mx; print(mx.__version__)"
You should see a version number like 0.22.x or later. If you encounter any errors, make sure you are running on Apple Silicon and have Python 3.10+.
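If you would rather script these checks, a small helper like the following can confirm the prerequisites in one place. This is a hypothetical convenience function, not part of MLX — it only inspects the platform and whether the mlx package is importable:

```python
import importlib.util
import platform
import sys

def mlx_ready():
    """Return True if this machine looks able to run MLX."""
    # MLX requires macOS on Apple Silicon (arm64), Python 3.10+,
    # and the mlx package itself to be installed.
    on_apple_silicon = (
        platform.system() == "Darwin" and platform.machine() == "arm64"
    )
    python_ok = sys.version_info >= (3, 10)
    has_mlx = importlib.util.find_spec("mlx") is not None
    return on_apple_silicon and python_ok and has_mlx

if __name__ == "__main__":
    print("MLX ready" if mlx_ready() else "Prerequisites missing")
```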
Download a Model
MLX can convert any compatible Hugging Face model to its optimized format. Let us download and convert Qwen 3.5 9B as our first model:
mlx_lm.convert \
--hf-path Qwen/Qwen3.5-9B-Instruct \
--mlx-path ./qwen-9b-mlx \
-q
This command does three things:
- Downloads the model weights from Hugging Face (requires an internet connection and a free HF account for gated models).
- Converts the weights to MLX's native format for optimal Metal GPU utilization.
- Quantizes the model to 4-bit (the -q flag) to reduce memory usage and increase speed.
The converted model will be saved in the ./qwen-9b-mlx directory. For a 9B model at 4-bit quantization, expect about 5GB on disk.
You can also find pre-converted MLX models on Hugging Face. Search for model names with "mlx" in them — many community members publish ready-to-use MLX weights that skip the conversion step entirely.
Tip: To quantize at a different level, replace -q with --q-bits 8 for 8-bit or --q-bits 6 for 6-bit. Higher bit counts produce better quality at the cost of more memory.
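The size figures quoted above follow from simple arithmetic: weight size is roughly parameters × bits ÷ 8 bytes, plus a few percent of overhead for embeddings and quantization metadata. A quick back-of-the-envelope calculator (a sketch — actual on-disk sizes will be somewhat larger):

```python
def quantized_size_gb(params_billion, bits):
    """Rough weight size: params * bits / 8 bytes, expressed in GB."""
    return params_billion * 1e9 * bits / 8 / 1e9

# Estimate sizes for a 9B-parameter model at common quantization levels.
for bits in (4, 6, 8):
    print(f"9B model at {bits}-bit: ~{quantized_size_gb(9, bits):.1f} GB")
```

At 4-bit this gives roughly 4.5 GB of weights, which lines up with the "about 5GB on disk" figure above once overhead is included.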
Run Inference
Generate text from the command line:
mlx_lm.generate \
--model ./qwen-9b-mlx \
--prompt "Explain the difference between TCP and UDP in simple terms"
You will see the model's response stream to your terminal, along with performance metrics including tokens per second. On an M3 Pro with 18GB, expect around 35-45 tok/s for a 9B model at 4-bit quantization.
Using the Python API
For more control, use MLX directly in Python:
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("./qwen-9b-mlx")

prompt = "Write a Python function to check if a string is a palindrome."

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    # Recent mlx-lm releases take sampling settings via a sampler
    # object rather than a temp= keyword argument:
    sampler=make_sampler(temp=0.7),
)
print(response)
The Python API gives you full control over generation parameters including temperature, top-p sampling, repetition penalty, and maximum token count. It is also the foundation for building your own applications — chatbots, code assistants, or document processors — powered by local models.
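As a sketch of what building on the API looks like, here is a minimal multi-turn chat loop. To keep it self-contained and testable it takes the generation function as a parameter (generate_fn) — in a real application you would pass a closure around mlx_lm's load/generate. The plain "User:/Assistant:" framing is a placeholder; real chat models expect their own chat template (e.g. via the tokenizer's apply_chat_template):

```python
def chat_loop(generate_fn, max_turns=None, input_fn=input, output_fn=print):
    """Minimal multi-turn chat: accumulate history, re-prompt each turn.

    generate_fn: callable taking a full prompt string, returning a reply.
    The "User:/Assistant:" framing below is a generic placeholder,
    not a real model's chat template.
    """
    history = []
    turns = 0
    while max_turns is None or turns < max_turns:
        user = input_fn("You: ")
        if user.strip().lower() == "quit":
            break
        history.append(("User", user))
        # Re-send the whole conversation so the model sees prior turns.
        prompt = "\n".join(f"{role}: {text}" for role, text in history)
        prompt += "\nAssistant:"
        reply = generate_fn(prompt)
        history.append(("Assistant", reply))
        output_fn(f"Assistant: {reply}")
        turns += 1
    return history

if __name__ == "__main__":
    # Demo with a canned stub; swap in a closure over mlx_lm's
    # generate() to talk to a real model.
    canned = iter(["Hello!", "quit"])
    chat_loop(lambda prompt: "Hi there!", input_fn=lambda _: next(canned))
```

Re-sending the accumulated history on every turn is exactly what tools like mlx_lm.chat do for you under the hood; managing that history yourself is what turns a one-shot generate call into a conversation.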
Interactive Chat
For a ChatGPT-like interactive experience:
mlx_lm.chat --model ./qwen-9b-mlx
This opens an interactive chat session in your terminal where you can have multi-turn conversations with the model. Type your message and press Enter to get a response. Type quit to exit.
MLX vs Ollama vs llama.cpp
Each framework has its sweet spot. Here is when to use each one:
- MLX — maximum speed. Choose MLX when you want the fastest possible inference on Apple Silicon and you are comfortable with Python. MLX is ideal for developers building applications, running batch processing, or anyone who wants to squeeze every last token per second from their hardware. It typically delivers 20-50% faster generation than Ollama.
- Ollama — maximum convenience. Choose Ollama when you want a dead-simple setup with a one-line install and single-command model downloads. Ollama is perfect for beginners, for quickly testing new models, and for use as a backend server that other apps connect to. It handles model management, quantization, and API serving out of the box. See our Ollama setup guide to get started.
- llama.cpp — maximum customization. Choose llama.cpp when you need fine-grained control over inference parameters, want to run custom GGUF models, or need features like grammar-constrained generation, speculative decoding, or custom sampling strategies. It is the most configurable option but requires more technical knowledge to set up and tune.
Many power users run both Ollama and MLX. Ollama for quick testing and API access, MLX for production workloads where speed matters. Check our benchmarks to see how MLX compares on your specific hardware, and visit the software directory for compatible tools.
Pro tip: If you are already using Ollama and want to try MLX without re-downloading models, you cannot directly share model files between them — they use different formats (GGUF vs MLX). You will need to download or convert models separately for each framework.