How iPhone 17 Pro runs a 400B parameter language model on a smartphone

The iPhone 17 Pro recently managed to run a large language model (LLM) with 400 billion parameters, despite the enormous memory requirements. Even in compressed form, the model needs at least 200 GB of memory, yet the phone has just 12 GB of LPDDR5X RAM. So how is this possible? The answer lies in clever engineering.

An open-source project called Flash-MoE sidesteps the RAM limit by streaming model weights from the iPhone's flash storage to the graphics processor on demand, instead of holding them all in memory. The model's Mixture-of-Experts (MoE) architecture also helps: generating each token activates only a fraction of the 400 billion parameters, so only that fraction needs to be resident at any given moment.
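The two ideas above, weights kept on storage and loaded lazily plus sparse expert activation, can be sketched together in a few lines. Everything here (the expert count, dimensions, and router scores) is a toy stand-in for illustration, not Flash-MoE's actual implementation:

```python
import os
import tempfile

import numpy as np

# Toy mixture-of-experts layer: 8 experts, only the top-2 are read
# from "storage" per token (illustrative numbers, not the real model).
NUM_EXPERTS, TOP_K, DIM = 8, 2, 16

# Write all expert weight matrices to a file standing in for the SSD.
weights = np.random.default_rng(0).standard_normal(
    (NUM_EXPERTS, DIM, DIM)).astype(np.float32)
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
weights.tofile(path)

# Memory-map the file: expert weights stay on storage until touched.
experts = np.memmap(path, dtype=np.float32,
                    shape=(NUM_EXPERTS, DIM, DIM), mode="r")

def moe_forward(x, router_logits):
    """Compute the layer output while paging in only TOP_K experts."""
    top = np.argsort(router_logits)[-TOP_K:]   # indices of chosen experts
    gates = np.exp(router_logits[top])
    gates /= gates.sum()                       # softmax over the top-k
    # Only these slices of the mapped file are actually read from disk.
    return sum(g * (x @ np.asarray(experts[i])) for g, i in zip(gates, top))

x = np.ones(DIM, dtype=np.float32)
logits = np.arange(NUM_EXPERTS, dtype=np.float32)  # toy router scores
y = moe_forward(x, logits)
print(y.shape)  # (16,) — produced after reading 2 of the 8 experts
```

The key property is that the working set per token is bounded by the chosen experts, not the full parameter count, which is what makes a 200 GB model tractable on a 12 GB device, at the cost of storage bandwidth.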

Generation speed remains extremely slow: just 0.6 tokens per second, or roughly one word every 1.5 to 2 seconds. Despite this, the demonstration shows that such massive models can run on mobile devices. Using a local model ensures complete privacy and eliminates the need for a constant internet connection, but the iPhone 17 Pro's battery drains quickly under this workload.
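To put 0.6 tokens per second in perspective, a quick back-of-the-envelope calculation helps. The one-token-per-word approximation below is a rough assumption for English text; real tokenizers vary:

```python
# Reported throughput of the demonstration.
TOKENS_PER_SEC = 0.6

# Per-token latency follows directly.
seconds_per_token = 1 / TOKENS_PER_SEC
print(f"{seconds_per_token:.2f} s per token")    # 1.67 s per token

# Estimated wait for a ~100-word reply, assuming one token per word.
reply_minutes = 100 * seconds_per_token / 60
print(f"{reply_minutes:.1f} min for 100 words")  # ≈ 2.8 min
```

A few minutes per short answer explains why the demonstration is a proof of concept rather than a usable assistant.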

Overall, this experiment demonstrates that even highly resource-intensive LLMs can run on a smartphone with the right optimizations and storage streaming. However, the real-world practicality of such setups is limited by slow generation and the heavy load on storage and battery.