
What does your perception look like? Are you using raw screenshots or GUI snapshots? In some earlier experiments I found that vision is very difficult for these kinds of tasks, and that snapshots are incomplete.



Perception is just 1-2 screenshots. A number of recent VLMs have a lot more pretraining data on GUI interactions, which helps.
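
Roughly, that loop can be sketched like this (just an illustration, not our actual setup: the endpoint URL, model name, and prompt are placeholders, assuming a VLM served behind an OpenAI-compatible API):

    # Perception as a single screenshot: grab the screen, base64-encode it,
    # and pass it as the image input to a VLM behind an OpenAI-compatible API.
    import base64
    import io

    from openai import OpenAI
    from PIL import ImageGrab

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    def screenshot_b64() -> str:
        buf = io.BytesIO()
        ImageGrab.grab().save(buf, format="PNG")  # full-screen capture
        return base64.b64encode(buf.getvalue()).decode()

    resp = client.chat.completions.create(
        model="some-gui-vlm",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64()}"}},
                {"type": "text",
                 "text": "Describe the next UI action the agent should take."},
            ],
        }],
    )
    print(resp.choices[0].message.content)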


Such as? Are they able to recognize arbitrary GUI elements from various desktop programs, web browsers, etc?


Qwen2.5-VL seems to be the best right now in our tests.

UI-TARS by ByteDance also has a good amount of GUI pretraining.

Molmo is also very good at predicting coordinates.
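
For a concrete sense of what a grounding query looks like, here's a rough sketch against Qwen2.5-VL via Hugging Face transformers (the screenshot path and prompt wording are made-up examples; UI-TARS and Molmo each have their own prompt/output conventions):

    # Sketch: ask Qwen2.5-VL for the pixel coordinates of a GUI element
    # in a screenshot. The screenshot path and prompt are example values.
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info

    model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///tmp/screenshot.png"},
            {"type": "text",
             "text": "Return the click coordinates of the 'Submit' button as (x, y)."},
        ],
    }]

    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    images, videos = process_vision_info(messages)
    inputs = processor(text=[text], images=images, videos=videos,
                       padding=True, return_tensors="pt").to(model.device)

    out = model.generate(**inputs, max_new_tokens=64)
    trimmed = out[:, inputs.input_ids.shape[1]:]
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])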



