Google has unveiled Gemini 2.5 Computer Use, a new AI model capable of interacting with web browsers and graphical interfaces just like a person—clicking, typing, scrolling, filling out forms, and navigating web pages.
What’s New: Human-Style Browser Control
Unlike AI that works via APIs or backend endpoints, Gemini 2.5 Computer Use operates visually and natively in user interfaces. It receives screenshots of the current UI, reasons about what needs to be done, and then issues function calls like click, type, drag, or scroll. After each action, the UI updates, a fresh screenshot is provided, and the loop continues until the goal is achieved or no further moves are possible.
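Concretely, that perceive-reason-act loop looks something like the minimal Python sketch below. The function names (take_screenshot, ask_model, execute) and the dict-shaped actions are illustrative assumptions standing in for your browser driver and the actual Gemini API call, not the model's real signatures or schema.

```python
from typing import Callable, Optional

Action = dict  # e.g. {"name": "click_at", "x": 412, "y": 88} -- assumed shape

def run_agent(
    goal: str,
    take_screenshot: Callable[[], bytes],
    ask_model: Callable[[str, bytes, list], Optional[Action]],
    execute: Callable[[Action], None],
    max_steps: int = 25,  # hard cap so a confused agent cannot loop forever
) -> None:
    """Run the perceive-reason-act loop until the model stops or the budget runs out."""
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()                 # capture the current UI state
        action = ask_model(goal, screenshot, history)  # model picks the next UI action
        if action is None:                             # model signals the goal is complete
            return
        execute(action)                                # apply the click/type/scroll for real
        history.append(action)                         # prior actions give context next turn
    raise TimeoutError("goal not reached within the step budget")
```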
The model is optimized for web browsers, with emerging support for mobile UI control, though it is not yet tuned for operating system–level control (e.g., the file system or desktop applications).
Performance & Benchmarks
Google claims Gemini 2.5 Computer Use “outperforms leading alternatives on multiple web and mobile benchmarks while maintaining low latency.” It reportedly leads browser-control tasks on the Online-Mind2Web benchmark, as measured in Browserbase’s harness.
Use Cases
Developers can access the model via the Gemini API, with support in Google AI Studio and Vertex AI. With a wrapper that executes the model’s actions in a real browser (sketched after this list), one can build agents to:
- Automate data entry or filling repetitive web forms
- Perform end-to-end UI testing across web flows
- Crawl or gather information across multiple sites
- Orchestrate multi-step workflows in web apps lacking an API interface
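As one concrete shape such a wrapper could take, the sketch below replays model-issued actions in a real browser via Playwright. Here plan_next_action is a hypothetical callable wrapping the Gemini API request, and the action names (click_at, type_text, scroll) are assumptions for illustration, not the model’s official action schema.

```python
from typing import Callable, Optional
from playwright.sync_api import sync_playwright

def run_browser_agent(url: str, plan_next_action: Callable[[bytes], Optional[dict]]) -> None:
    """Feed screenshots to a planner and replay its actions in a real browser."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # headful, so you can supervise
        page = browser.new_page()
        page.goto(url)
        while True:
            shot = page.screenshot()         # fresh view of the current UI
            action = plan_next_action(shot)  # hypothetical wrapper around the Gemini call
            if action is None:               # planner reports the task is finished
                break
            if action["name"] == "click_at":
                page.mouse.click(action["x"], action["y"])
            elif action["name"] == "type_text":
                page.keyboard.type(action["text"])
            elif action["name"] == "scroll":
                page.mouse.wheel(0, action["dy"])
        browser.close()
```

Running headful rather than headless makes it easy to watch each step, which matters while the model is still in preview.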
Because the model is still in preview, Google advises developers to supervise its behavior closely: it can make mistakes and may be vulnerable to adversarial inputs on the pages it visits.
Safety & Limitations
Safety mechanisms are integrated to block actions like bypassing CAPTCHAs, compromising security, or interfering with system integrity. The model’s scope is also limited to browser and UI actions, not full system control.
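Since Google advises close supervision during the preview, one simple complementary pattern on the developer’s side is to gate risky actions behind an operator prompt. The sketch below is purely illustrative: the SENSITIVE set and the requires_confirmation flag are assumed names, not part of any official schema.

```python
from typing import Callable

SENSITIVE = {"submit_payment", "delete_item", "send_message"}  # assumed action names

def guarded_execute(action: dict, execute: Callable[[dict], None]) -> None:
    """Ask a human before executing actions flagged as sensitive."""
    if action.get("name") in SENSITIVE or action.get("requires_confirmation"):
        answer = input(f"Allow sensitive action {action}? [y/N] ")
        if answer.strip().lower() != "y":
            print("Action blocked by operator.")
            return
    execute(action)  # non-sensitive actions pass straight through
```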
Final Word
With Gemini 2.5 Computer Use, Google edges closer to making AI agents that can act in the digital world much like human users do. While still in preview, its browser-centric approach and benchmark-leading performance make it a significant step in AI automation. The model opens a promising pathway for agents that can interact with complex interfaces—especially in scenarios where APIs are unavailable.