Running Qwen3.5-397B on an M3 MacBook Pro

A while back, Apple published a paper entitled "LLM in a Flash: Efficient Large Language Model Inference with Limited Memory" [DOI].

From the abstract: "This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. … These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively."
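The core idea of keeping weights on flash and pulling them into DRAM on demand can be sketched with a memory-mapped file. This is a minimal illustration only, not the paper's implementation: the file path, tensor shape, and helper names are all hypothetical, and the real system adds a cost model, sparsity prediction, and careful read scheduling on top.

```python
# Illustrative sketch (not Apple's implementation): weights stay on flash
# via a memory map, and the OS pages chunks into DRAM on first access,
# evicting them again under memory pressure.
import numpy as np

def open_weights(path, shape, dtype=np.float16):
    # mode="r" keeps the tensor on disk; nothing is read until touched.
    return np.memmap(path, dtype=dtype, mode="r", shape=shape)

def load_rows(weights, first, last):
    # Materializing a contiguous block of rows turns into one large,
    # sequential flash read -- the access pattern the paper optimizes for,
    # as opposed to many small scattered reads.
    return np.asarray(weights[first:last + 1])
```

The two levers the paper names, transferring less data and reading it in larger contiguous chunks, both show up here: you only materialize the rows you need, and you fetch them as one block.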

The original Apple Silicon M1 chip (non-Pro/Max) has a memory bandwidth of 68 GB/s; the subsequent base M2 and M3 have around 100 GB/s (the M4, 120 GB/s); the Mx Pro chips range from 153 to 273 GB/s; the M4 Max has 410 or 546 GB/s; and the M3 Ultra reaches 819 GB/s.
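Memory bandwidth matters because every generated token has to stream the model's active weights through memory, so bandwidth divided by bytes-moved-per-token gives a rough ceiling on decode speed. A back-of-envelope sketch, with assumed (not sourced) numbers: 17B active parameters at 4-bit on a ~150 GB/s Mx Pro-class bus. Weights served from flash rather than DRAM would sit well below this ceiling.

```python
def tokens_per_second_ceiling(bandwidth_gb_s, active_params_billions, bits_per_weight):
    # Each generated token reads every active weight once, so the
    # bandwidth-bound ceiling is bandwidth / bytes-per-token.
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical example: 17B active params, 4-bit quantized, 150 GB/s bus.
ceiling = tokens_per_second_ceiling(150, 17, 4)  # ~17.6 tokens/s
```

This is only an upper bound from DRAM bandwidth; activations, KV-cache traffic, and any flash reads all push real throughput lower.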

In a paper on GitHub (https://github.com/danveloper/flash-moe/blob/main/paper/flash_moe.pdf), Daniel Woods (and Claude Opus 4.6) describes running a 397-billion-parameter Mixture-of-Experts language model (Qwen3.5-397B-A17B) on a laptop with only 48 GB of unified memory, at 5.74 tokens per second with production-quality output.

As Woods explains: "MoE models turned out to work really well for this because they're absurdly sparse at inference time. Qwen 3.5 397B has 512 experts per layer but only activates 10 per token, and we found you can prune that down to 4 with no quality degradation."
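The sparsity being exploited is top-k expert routing: a router scores all experts per token and only the k best run. A minimal sketch, assuming a plain softmax router (the function name and details are illustrative, not from the flash-moe code):

```python
import numpy as np

def route_topk(router_logits, k):
    # Select the k highest-scoring experts for this token and renormalize
    # their softmax weights so the kept gates still sum to 1.
    top = np.argsort(router_logits)[-k:][::-1]
    gates = np.exp(router_logits[top] - router_logits[top].max())
    return top, gates / gates.sum()

# With 512 experts per layer, dropping k from 10 to 4 means less than
# 1% of each layer's expert weights must be resident per token, and
# cuts per-token expert traffic by more than half versus k=10.
logits = np.random.randn(512)
experts, gates = route_topk(logits, 4)
```

Pruning k from 10 to 4 changes only the routing step; the expert weights themselves are untouched, which is why (per the paper's claim) quality can survive it.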

The other notable aspect of this paper is that the "co-author" wrote all the code: Claude Code was used in an auto-research pattern to produce highly optimized MLX, Objective-C, and Metal code. All of the code is available on GitHub at https://github.com/danveloper/flash-moe and was apparently written in 24 hours.
