Gemini Gets Agentic Vision
Google has updated its Gemini model with agentic vision capabilities. Previously, the model perceived images passively; now it runs a "think-act-observe" cycle, writing and executing code to zoom into, crop, or annotate specific regions of an image.
This lets Gemini examine fine details more effectively, count objects by drawing and numbering bounding boxes, and reduce hallucinations on tables by performing real calculations in code instead of estimating from pixels. Google promises a 5–10% quality boost. The update is already rolling out in AI Studio, Vertex AI, and the Gemini app (Thinking mode).
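The announced behavior can be pictured as a small loop: the model decides a region deserves a closer look (think), generates code to crop it (act), then inspects the result (observe). The sketch below is purely illustrative — the function names, the grid representation, and the loop structure are assumptions, not Google's actual implementation:

```python
# Illustrative "think-act-observe" sketch (assumed structure, not
# Google's real code). The "image" is a toy 2D grid where 1 marks
# an object pixel and 0 marks background.

def crop(image, top, left, height, width):
    """Act: return a sub-region of the image for closer inspection."""
    return [row[left:left + width] for row in image[top:top + height]]

def count_marked(region, marker=1):
    """Observe: count object pixels inside the cropped region."""
    return sum(row.count(marker) for row in region)

# Synthetic 6x6 "image" with three objects clustered in the top-left
image = [
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Think: the top-left quadrant looks dense, so zoom in on it,
# then act (crop) and observe (count) instead of guessing from afar.
region = crop(image, top=0, left=0, height=3, width=3)
print(count_marked(region))  # 3
```

The point of the pattern is that counting happens via executed code over a concrete region, not via a single holistic guess over the full frame — which is what the bounding-box counting and table calculations in the announcement amount to.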
Source: Google Blog