What is MLX?

MLX is Apple's answer to PyTorch and TensorFlow — but designed from the ground up for Apple Silicon chips. Released by Apple's machine learning research team, MLX takes full advantage of the hardware features that make M-series chips unique, most notably unified memory: the CPU and GPU share a single memory pool, so model weights never need to be copied between separate device memories.

The practical result is that MLX delivers 20-50% faster token generation compared to llama.cpp (which powers Ollama) when running the same model at the same quantization level. The gap is largest on models that push memory limits, where unified memory efficiency matters most.

Prerequisites

Before installing MLX, make sure you have the following:

  1. A Mac with an Apple Silicon chip (M1 or later); MLX does not run on Intel Macs.
  2. Python 3.10 or later.
  3. Several gigabytes of free disk space for model weights.

Tip: We recommend using a Python virtual environment to keep MLX dependencies isolated. Run python3 -m venv mlx-env && source mlx-env/bin/activate before proceeding with installation.

Install MLX

Install the MLX language model package with pip:

pip install mlx-lm

This installs both the core MLX framework and the language model utilities for downloading, converting, and running LLMs. The installation is lightweight — under 50MB for the framework itself.

Verify the installation:

python3 -c "import mlx.core as mx; print(mx.__version__)"

You should see a version number like 0.22.x or later. If you encounter any errors, make sure you are running on Apple Silicon and have Python 3.10+.

Download a Model

MLX can convert any compatible Hugging Face model to its optimized format. As a first model, let's download and convert Qwen 3.5 9B:

mlx_lm.convert \
  --hf-path Qwen/Qwen3.5-9B-Instruct \
  --mlx-path ./qwen-9b-mlx \
  -q

This command does three things:

  1. Downloads the model weights from Hugging Face (requires an internet connection and a free HF account for gated models).
  2. Converts the weights to MLX's native format for optimal Metal GPU utilization.
  3. Quantizes the model to 4-bit (the -q flag) to reduce memory usage and increase speed.

The converted model will be saved in the ./qwen-9b-mlx directory. For a 9B model at 4-bit quantization, expect about 5GB on disk.

You can also find pre-converted MLX models on Hugging Face. Search for model names with "mlx" in them — many community members publish ready-to-use MLX weights that skip the conversion step entirely.

Tip: To quantize at a different level, add --q-bits 8 (8-bit) or --q-bits 6 (6-bit) alongside the -q flag. Higher bit counts produce better quality at the cost of more memory.
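To get a feel for how the quantization level translates into disk footprint, here is a back-of-envelope estimate. This is plain arithmetic, not part of the mlx-lm API, and the 10% overhead factor is an assumption covering metadata and tensors stored at higher precision:

```python
def approx_size_gb(n_params_billion, bits, overhead=1.10):
    """Rough on-disk size of a quantized model.

    bytes = parameters * bits / 8; `overhead` is an assumed ~10% allowance
    for metadata and layers kept at higher precision.
    """
    return n_params_billion * 1e9 * bits / 8 * overhead / 1e9

for bits in (4, 6, 8):
    print(f"9B model at {bits}-bit: ~{approx_size_gb(9, bits):.1f} GB")
```

The 4-bit estimate lands right around the 5GB figure quoted above; 8-bit roughly doubles it.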

Run Inference

Generate text from the command line:

mlx_lm.generate \
  --model ./qwen-9b-mlx \
  --prompt "Explain the difference between TCP and UDP in simple terms"

You will see the model's response stream to your terminal, along with performance metrics including tokens per second. On an M3 Pro with 18GB, expect around 35-45 tok/s for a 9B model at 4-bit quantization.
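Tokens per second translates directly into wall-clock latency. A quick illustration (plain arithmetic; the 40 tok/s figure is simply the midpoint of the range above):

```python
# Seconds to generate a response of n_tokens at a given generation speed.
def response_seconds(n_tokens, tok_per_s):
    return n_tokens / tok_per_s

# A full 512-token answer at ~40 tok/s:
print(response_seconds(512, 40))  # 12.8 seconds
```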

Using the Python API

For more control, use MLX directly in Python:

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("./qwen-9b-mlx")

prompt = "Write a Python function to check if a string is a palindrome."
# Recent mlx-lm releases take sampling settings through a sampler object
# (make_sampler) rather than a temp= keyword on generate().
sampler = make_sampler(temp=0.7)
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    sampler=sampler,
)
print(response)

The Python API gives you full control over generation parameters including temperature, top-p sampling, repetition penalty, and maximum token count. It is also the foundation for building your own applications — chatbots, code assistants, or document processors — powered by local models.
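As a sketch of what such an application can look like, here is a minimal multi-turn chat loop. The generation call is stubbed out with a placeholder (reply_fn) so the structure stands on its own; in a real script you would pass in a closure over mlx_lm's load()/generate():

```python
def chat_loop(get_input, reply_fn, max_turns=10):
    """Minimal multi-turn chat loop.

    get_input: callable returning the next user message (e.g. input).
    reply_fn:  callable taking the full message history and returning a
               reply string; a hypothetical stand-in for an mlx_lm-backed
               generate call.
    """
    history = []
    for _ in range(max_turns):
        user = get_input()
        if user is None or user.strip().lower() == "quit":
            break
        history.append({"role": "user", "content": user})
        history.append({"role": "assistant", "content": reply_fn(history)})
    return history

# Scripted demo with canned input and a trivial echo "model":
messages = iter(["hello", "quit"])
log = chat_loop(lambda: next(messages, None),
                lambda h: "echo: " + h[-1]["content"])
print(log)
```

The history list uses the same role/content message shape that chat templates expect, so swapping the stub for a real model call is a one-line change.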

Interactive Chat

For a ChatGPT-like interactive experience:

mlx_lm.chat --model ./qwen-9b-mlx

This opens an interactive chat session in your terminal where you can have multi-turn conversations with the model. Type your message and press Enter to get a response. Type quit to exit.

MLX vs Ollama vs llama.cpp

Each framework has its sweet spot: Ollama for convenience and one-command model management, llama.cpp for portability and fine-grained control across platforms, and MLX for raw generation speed on Apple Silicon.

Many power users run both Ollama and MLX. Ollama for quick testing and API access, MLX for production workloads where speed matters. Check our benchmarks to see how MLX compares on your specific hardware, and visit the software directory for compatible tools.

Pro tip: If you are already using Ollama and want to try MLX without re-downloading models, you cannot directly share model files between them — they use different formats (GGUF vs MLX). You will need to download or convert models separately for each framework.