Google Releases Gemma 4 12B — Multimodal Model Without External Encoders

Google DeepMind has published the weights for Gemma 4 12B, a multimodal model that handles text, images, and audio natively, without relying on separate encoders. Dropping those external modules reduces latency and memory overhead.

The model runs locally on devices with just 16 GB of RAM, yet delivers benchmark results competitive with 26B-class models. Weights are available on Hugging Face, with support already integrated into Ollama and LM Studio. Licensed under Apache 2.0, Gemma 4 12B is free for commercial use.
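To put the 16 GB figure in context, here is a rough back-of-the-envelope sketch of how much memory the weights alone would need at different precisions. It assumes exactly 12 billion parameters and uses hypothetical precisions of 16, 8, and 4 bits per weight. The announcement does not say which precision the 16 GB claim refers to, and the estimate ignores activations, the KV cache, and runtime overhead.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Assumption: "12B" is taken as exactly 12 billion parameters.
PARAMS = 12e9

# Assumption: these bit widths are illustrative, not the released formats.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gib(PARAMS, bits):.1f} GiB")
```

Under these assumptions, 16-bit weights alone would exceed 16 GB, so fitting the model on such a device would likely depend on a quantized variant.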
