CUA: how to improve click accuracy on desktop OSes

Hi all, I’m building a Computer Use Agent (CUA) with LangChain to control desktop apps (clicking, typing, scrolling) using screenshots plus coordinate-based actions.

I’m struggling with click precision (DPI scaling, multi-monitor offsets, window focus), so the agent often clicks a few pixels off or on the wrong element.
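For context, here’s the kind of coordinate mapping I think I’m missing — a hypothetical helper (the name and parameters are my own), assuming screenshots come back in logical (DPI-unaware) pixels and the target monitor’s origin in the virtual desktop is known:

```python
def screenshot_to_screen(x, y, scale, monitor_origin=(0, 0)):
    """Map screenshot-space pixel coords to physical screen coords.

    scale: OS DPI scale factor (e.g. 1.5 for 150% scaling on Windows)
    monitor_origin: top-left of the target monitor in the virtual desktop,
                    which is nonzero for secondary monitors
    """
    sx = monitor_origin[0] + round(x * scale)
    sy = monitor_origin[1] + round(y * scale)
    return sx, sy

# e.g. a click at (100, 200) in a screenshot of a 150%-scaled
# second monitor whose origin is (1920, 0):
print(screenshot_to_screen(100, 200, 1.5, (1920, 0)))  # -> (2070, 300)
```

Even something this simple has been enough to expose off-by-a-few-pixel errors for me, but I don’t know how to get `scale` and `monitor_origin` reliably across OSes.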

Any suggestions to make this more accurate and production-reliable?

  • Should I combine vision + accessibility tree (Windows UIA / macOS AX / Linux AT-SPI) instead of pure coordinates?

  • Which LLM/vision model has worked best for screen understanding + actions?

  • Any recommended pattern like observe → act → verify → retry, or repos/examples?
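To make the last bullet concrete, the verify-and-retry loop I have in mind would look roughly like this (all the callables are placeholders I made up, not LangChain APIs):

```python
import time

def act_with_verify(observe, act, verify, retries=3, delay=0.2):
    """Observe -> act -> verify loop with retries.

    observe: returns current state (e.g. takes a screenshot)
    act:     performs the action given that state (e.g. clicks a coordinate)
    verify:  returns True if the action visibly succeeded
             (e.g. re-screenshots and checks the UI changed)
    """
    for _ in range(retries):
        state = observe()
        act(state)
        if verify():
            return True
        time.sleep(delay)  # let the UI settle before retrying
    return False
```

Is this the right shape, or do people verify via the accessibility tree instead of re-screenshotting?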

Thanks!