Inside Apple’s Manzano: one AI for image analysis and generation

Apple is developing a new AI model called Manzano that can not only analyze images but also generate them. Unifying these two capabilities is one of the toughest problems in the field, and most contenders still lag well behind commercial heavyweights such as OpenAI’s GPT-4o and Google’s Gemini 2.5 Flash Image (previously codenamed Nano Banana).

At the heart of Manzano is a hybrid tokenizer: a single shared encoder produces continuous tokens for visual understanding and discrete tokens for image generation. The design aims to reduce the friction between the two tasks so that one system can handle both well. The architecture pairs this tokenizer with a unified language model and a standalone image-decoding module. Several sizes are planned, from 900 million to 35 billion parameters, and the image decoder can output images at different resolutions.
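The split between continuous and discrete tokens can be illustrated with a toy sketch. Everything below is an assumption for illustration (the encoder, codebook size, and adapter names are invented, not Apple's actual design): a shared encoder maps image patches to features, a continuous adapter passes them through as embeddings for understanding, and a discrete adapter vector-quantizes them into codebook indices the language model could predict for generation.

```python
import numpy as np

# Hypothetical sketch of a hybrid tokenizer: one shared encoder,
# two adapters on top. All names and sizes here are illustrative.
rng = np.random.default_rng(0)

EMBED_DIM = 64
CODEBOOK_SIZE = 512
# Stand-in vector-quantization codebook (would be learned in practice).
codebook = rng.normal(size=(CODEBOOK_SIZE, EMBED_DIM))
# Stand-in encoder weights (a real encoder would be a deep network).
W_enc = rng.normal(size=(32, EMBED_DIM))

def shared_encoder(image_patches: np.ndarray) -> np.ndarray:
    """Shared vision encoder: patch vectors -> feature vectors."""
    return image_patches @ W_enc

def continuous_tokens(features: np.ndarray) -> np.ndarray:
    """Understanding path: keep features continuous (identity adapter here)."""
    return features

def discrete_tokens(features: np.ndarray) -> np.ndarray:
    """Generation path: map each feature to its nearest codebook entry's index."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # integer token ids an LLM could predict

patches = rng.normal(size=(16, 32))   # 16 fake image patches
feats = shared_encoder(patches)
cont = continuous_tokens(feats)       # (16, 64) float embeddings for understanding
disc = discrete_tokens(feats)         # (16,) int token ids for generation
```

The point of the sketch is the shared front end: both token types come from the same features, which is what lets a single model read and produce images without maintaining two separate visual vocabularies.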

Training unfolded in three stages across 1.6 trillion tokens, including 2.3 billion image–text pairs for visual understanding and 1 billion text–image pairs for generation. Parts of the dataset were produced synthetically, drawing on DALL-E 3 and ShareGPT-4o. In internal tests, Manzano delivered strong results on ScienceQA, MathVista, and MMMU, particularly when parsing charts and text-heavy documents. On the generative side, it follows complex instructions, handles style changes, and even performs depth estimation. As with any in-house benchmarks, the numbers are encouraging but deserve a measured reading.

Despite the progress, Apple concedes that its base models still trail the market leaders. As a result, iOS 26 will continue to use OpenAI’s GPT-5 within Apple Intelligence alongside Apple’s own models. In that light, Manzano looks like a strategic step toward reducing dependence on third-party technology and building out Apple’s own multi-task AI.