Microsoft Releases Multimodal Phi-4 Reasoning-Vision Model
Microsoft has launched a multimodal version of its Phi-4 model, called Phi-4-reasoning-vision-15B, built on the SigLIP-2 vision encoder and the Phi-4 reasoning architecture. The model features a hybrid inference mechanism that adapts its reasoning depth to task complexity: it performs deep, step-by-step analysis for math and logic problems, while handling simpler tasks such as image description and OCR directly, without extended reasoning.
The model is also designed for AI agents that operate computer interfaces: it can interpret on-screen content, identify interactive elements, and select actions within graphical user interfaces. The model weights are available under the MIT license on Hugging Face and Microsoft Foundry.