Qwen-AgentWorld Surpasses Claude Opus and GPT-5.4 on Agentic Benchmark

Qwen has released Qwen-AgentWorld, a new agentic benchmark and a family of open-weight models that simulate real-world environments for agents, including web, terminal, coding, search, OS, and Android. The 397B-parameter model scored 58.71 on the benchmark, surpassing both Claude Opus 4.8 and GPT-5.4, while the 35B MoE variant outperformed Claude Sonnet 4.6. The largest gains were in coding, web, and terminal tasks.

Model weights are already available on Hugging Face. The benchmark environments are intended to make evaluation of agentic AI systems more realistic.

Paper · Blog · GitHub · Hugging Face