Hi all, I’m building a Computer Use Agent (CUA) with LangChain to control desktop apps (click/type/scroll) using screenshots + coordinate-based actions.
I’m struggling with click precision (DPI scaling, multi-monitor offsets, window focus), so the agent often clicks a few pixels off or lands on the wrong element.
Any suggestions to make this more accurate and production-reliable?
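To illustrate the coordinate problem: the model outputs coordinates in screenshot (physical-pixel) space, but the OS input API typically wants logical coordinates in the virtual desktop. A minimal sketch of the mapping I think is needed (the helper name, the `scale` factor, and `monitor_origin` are my assumptions, not anyone's real API):

```python
def to_logical(x_px: int, y_px: int, scale: float,
               monitor_origin: tuple[int, int]) -> tuple[int, int]:
    """Map screenshot pixel coords to logical desktop coords.

    scale: the monitor's DPI scale factor (e.g. 1.5 for 150% scaling).
    monitor_origin: that monitor's top-left corner in the virtual
                    desktop, in logical units (non-zero on multi-monitor).
    """
    ox, oy = monitor_origin
    # Divide out DPI scaling, then offset into the virtual desktop.
    return (round(x_px / scale) + ox, round(y_px / scale) + oy)

# e.g. a click at (300, 150) in a screenshot of a 150%-scaled second
# monitor whose logical origin is (1920, 0):
print(to_logical(300, 150, 1.5, (1920, 0)))  # → (2120, 100)
```

Getting `scale` and `monitor_origin` right per monitor is exactly where my setup goes wrong, so any pointers on querying those reliably per OS would help.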
- Should I combine vision + accessibility tree (Windows UIA / macOS AX / Linux AT-SPI) instead of pure coordinates?
- Which LLM/vision model has worked best for screen understanding + actions?
- Any recommended pattern like observe → act → verify → retry, or repos/examples?
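On that last point, the loop I have in mind is roughly the following (a sketch only; `action` and `verify` are hypothetical callables I'd plug my click and screenshot-check logic into):

```python
import time

def observe_act_verify(action, verify, max_retries: int = 3,
                       settle: float = 0.5) -> bool:
    """Generic observe → act → verify → retry wrapper.

    action: performs the UI action (e.g. a click at computed coords).
    verify: re-observes (fresh screenshot / accessibility query) and
            returns True if the expected state change happened.
    """
    for _ in range(max_retries):
        action()            # act
        time.sleep(settle)  # let the UI settle before re-observing
        if verify():        # verify against a fresh observation
            return True
        # fall through to retry the action
    return False
```

Is this the right shape, or do people re-plan (re-run the model on the new screenshot) between retries instead of blindly repeating the same action?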
Thanks!