Ask any developer who's tried to run a 30‑plus billion‑parameter model on a Mac and they'll tell you the same thing: it’s possible, but painfully close to the hardware limits. Ollama's newest preview release, 0.19, tries to push that boundary by running models on top of Apple’s MLX framework—yielding meaningful speed and memory gains on Apple silicon.
The headline: real speedups, narrow compatibility
Ollama says the MLX-backed runner speeds up prefill (prompt processing) by about 1.6× and nearly doubles decode (token generation) speed on supported Macs. The improvements are most dramatic on machines with Apple’s new M5 family because Ollama now uses the M5 GPU Neural Accelerators for both time‑to‑first‑token and steady generation throughput.
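To get an intuition for what those two multipliers mean end to end, here is a rough sketch. Only the 1.6× prefill and roughly 2× decode figures come from Ollama; the baseline throughput numbers are made-up placeholders for a 35B-class model, not measurements.

```python
# Rough illustration of how prefill and decode speedups combine.
# The baseline tokens/second figures are hypothetical, not Ollama benchmarks;
# only the 1.6x prefill and ~2x decode multipliers come from the release notes.

def response_time(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Total latency = time to process the prompt + time to generate the output."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Assumed (made-up) baseline throughput for a large local model:
base_prefill, base_decode = 200.0, 10.0  # tokens/second

before = response_time(2000, 500, base_prefill, base_decode)
after = response_time(2000, 500, base_prefill * 1.6, base_decode * 2.0)

print(f"before: {before:.1f}s, after: {after:.1f}s")  # prints "before: 60.0s, after: 31.2s"
```

Note that for chat-length outputs, decode time dominates, which is why the near-2× generation speedup matters more in practice than the prefill gain.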
That performance boost, however, is currently narrow in scope. The preview supports a single model so far: Alibaba's Qwen3.5 35B (the 35‑billion‑parameter variant), with some builds using Nvidia's NVFP4 compression format. Ollama has also improved caching behavior and memory handling in this release, which together help make larger models usable on unified‑memory Macs.
Why this matters (and why some people will care)
Local models are suddenly more than a hobbyist circus act. Community projects (and some viral experiments) have driven a wave of interest in running coding assistants and chatbots locally, partly to avoid the rate limits and subscription costs of cloud services. For macOS users, optimized local inference means you can run coding agents or chat assistants without sending everything to a server, and with fewer latency surprises.
There are privacy advantages too: inference that stays on your machine reduces exposure of prompts and files to third‑party cloud providers. That said, giving models broad system access—as some playfully reckless setups have done—remains risky and is not something to take lightly.
How Apple’s MLX and NVFP4 help
Apple’s MLX is built to take advantage of unified memory on Apple silicon, where GPU and CPU share the same pool. Ollama’s MLX runner maps model data and key‑value caches in ways that reduce memory copying and make better use of GPU neural accelerators. In practice that means the model spends less time waiting on memory transfers and more time computing tokens.
NVFP4 is a lower‑precision storage format that trades a bit of numeric fidelity for much smaller memory footprints. Combined with smarter caching, Ollama can squeeze the working set of a 35B model into machines that previously wouldn’t have been usable for that model size.
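The memory savings are easy to estimate. NVFP4 stores 4-bit values with one 8-bit scale shared by each block of 16 weights, so the effective cost is about 4.5 bits per weight (ignoring small per-tensor scales). A back-of-the-envelope comparison for 35 billion parameters:

```python
# Approximate memory footprint of 35B parameters at different precisions.
# NVFP4: 4-bit values plus an 8-bit scale per 16-element block (~4.5 bits/weight).
# This counts weights only; KV caches and activations add to the total.

PARAMS = 35e9
GIB = 1024**3

def footprint_gib(bits_per_weight):
    return PARAMS * bits_per_weight / 8 / GIB

fp16 = footprint_gib(16)
nvfp4 = footprint_gib(4 + 8 / 16)  # 4-bit values + FP8 scale shared by 16 weights

print(f"FP16:  {fp16:.1f} GiB")   # roughly 65 GiB — out of reach for most Macs
print(f"NVFP4: {nvfp4:.1f} GiB")  # roughly 18 GiB — fits in 32 GB unified memory
```

That gap, from about 65 GiB down to about 18 GiB for the weights alone, is what moves a 35B model from "impossible" to "tight but workable" on a 32GB machine.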
If you want to dig into the platforms themselves, see Ollama's site and Apple's MLX documentation.
Who can realistically run this now
Short answer: users with high‑memory Apple silicon Macs. Ollama recommends at least 32GB of unified memory, and community reports suggest 48GB (or more) is far more comfortable if you plan to run long agentic sessions or tools that grow context size (like some code assistants). On a 32GB M2 Pro or M2 Max you can get interactive chats working, but extended sessions hit swap or memory pressure quickly.
A Mac with an M5, M5 Pro, or M5 Max will show the biggest gains because of the dedicated Neural Accelerators. If you're in the market for a Mac largely to run local models, seasonal discounts make buying a higher‑memory machine more sensible; the current spring 2026 MacBook discounts are worth checking before you commit to a configuration.
Practical limits and developer realities
Ollama remains primarily a command‑line tool (though integrations with editors such as Visual Studio Code exist), and model support in this MLX preview is intentionally narrow. That means most users must be comfortable with CLI workflows and model plumbing to try this out.
Even with MLX acceleration, local models still trail the biggest cloud models in raw benchmark performance. They’re “good enough” for many interactive tasks—code completion, drafting, research assistants—but not yet a universal replacement for hosted frontier models. And for agentic frameworks that keep expanding context (think persistent memory, extensive tool use), the unified memory ceiling remains the main choke point.
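The context ceiling is easy to quantify with a rough KV-cache estimate. The architecture numbers below are plausible placeholders for a 35B-class model with grouped-query attention, not Qwen3.5's actual configuration:

```python
# Why expanding context is the choke point: a rough KV-cache estimate.
# Layer count, KV heads, and head dimension are assumed placeholders,
# not Qwen3.5's real architecture.

def kv_cache_gib(context_len, layers=64, kv_heads=8, head_dim=128, bytes_per_val=2):
    # Each token stores one key and one value vector per layer, per KV head.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
    return context_len * per_token / 1024**3

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):.1f} GiB of KV cache")
# prints 2.0, 8.0, and 32.0 GiB respectively
```

Under these assumptions the cache alone grows from about 2 GiB at an 8K context to about 32 GiB at 128K, on top of the model weights. That is why long agentic sessions exhaust a 32GB machine well before raw compute becomes the problem.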
If you're watching Apple's hardware and software arc, and how the Mac platform keeps being tuned for on‑device AI, this is another piece of a larger pattern. Apple's half‑century of shaping hardware and ecosystems has bearing on where on‑device AI can go next; a retrospective on Apple's first 50 years of design and ecosystem choices helps frame moves like MLX.
Try it or wait? A short guide
- If you have a 32GB Mac and curiosity: try the Ollama 0.19 preview with Qwen3.5 and watch for memory pressure; short interactive sessions will impress.
- If you plan to run long agentic workflows or multiple heavy models, aim for 48GB+ or an M5 Max configuration.
- Keep an eye on Ollama for expanded model support—MLX integration is new and wider compatibility will determine how broadly useful this becomes.
No single update solves every bottleneck, but this one narrows the gap between cloud convenience and on‑device control. For Mac users who want lower latency, better privacy, and fewer subscription bills, it’s a meaningful step forward.