Alibaba has launched Qwen3.5-Omni, an AI model with a unified native architecture that processes text, images, audio, and video together from the first layer. Key innovations include Audio-Visual Vibe Coding, which lets users describe a task aloud and have the model generate working code for a website or game, and Script-Level Captioning, which converts videos into detailed scripts with timestamps and speaker annotations.

Benchmark results show Qwen3.5-Omni-Plus outperforming Gemini 3.1 Pro in most categories, with state-of-the-art scores in speech recognition, audio understanding, vision, and text tasks. The model supports automatic speech recognition (ASR) in 74 languages and text-to-speech (TTS) in 29, offers WebSearch and Function Calling, and handles interruptions and background noise robustly.

Qwen3.5-Omni is accessible via Qwen Chat, HuggingFace, and Alibaba Cloud API.
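As a rough illustration of what API access might look like, the sketch below assembles a multimodal chat request mixing text and audio parts, in the OpenAI-compatible style that Alibaba Cloud exposes for Qwen models. The model name `qwen3.5-omni`, the endpoint shape, and the content-part field names are assumptions for illustration, not confirmed API details.

```python
import json

def build_omni_request(text_prompt, audio_url=None, image_url=None,
                       model="qwen3.5-omni"):
    """Assemble a hypothetical chat-completions payload mixing text,
    audio, and image parts. Field names follow the common
    OpenAI-compatible convention; they are assumptions here."""
    content = [{"type": "text", "text": text_prompt}]
    if audio_url:
        content.append({"type": "audio_url", "audio_url": {"url": audio_url}})
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }

# Example: a spoken brief plus a text instruction, as in Audio-Visual
# Vibe Coding. The URL is a placeholder.
payload = build_omni_request(
    "Turn this spoken description into a working website.",
    audio_url="https://example.com/brief.wav",
)
print(json.dumps(payload, indent=2))
```

In practice the payload would be POSTed to the provider's chat-completions endpoint with an API key; consult the Alibaba Cloud documentation for the actual model identifier and request schema.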

For more information, visit Qwen Chat and the official blog.