iiJDSii 3 months ago | on: R1 Computer Use
What does your perception look like? Are you using raw screenshots or GUI snapshots? In some earlier experiments I found vision to be very difficult for these tasks, and snapshots to be incomplete.
mountainriver 3 months ago
Perception is just 1-2 screenshots. A number of recent VLM models have a lot more pretraining data on GUI interactions, which helps.
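For anyone curious what "1-2 screenshots as the observation" can look like in practice, here's a minimal sketch assuming Pillow is used for capture; the function and variable names are illustrative, not from R1's code:

    from PIL import ImageGrab  # pip install pillow

    def capture_observation(prev_frame=None):
        """Grab the current screen; return the 1-2 frames the VLM will see."""
        current = ImageGrab.grab()  # full-screen screenshot as a PIL image
        frames = [prev_frame, current] if prev_frame is not None else [current]
        return frames, current  # keep `current` around as the next step's prev_frame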
iiJDSii 3 months ago
Such as? Are they able to recognize arbitrary GUI elements from various desktop programs, web browsers, etc?
mountainriver 3 months ago
Qwen2.5-VL seems to be the best right now in our tests.
UI-TARS by ByteDance also has a good amount of pretraining.
Molmo is also very good at coordinates.
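If it helps anyone experimenting with this, here's a minimal sketch of asking Qwen2.5-VL for click coordinates from a screenshot, assuming the model is served behind an OpenAI-compatible endpoint (e.g. vLLM). The endpoint URL, model id, and prompt wording are assumptions, not how R1 actually does it:

    import base64
    from openai import OpenAI  # pip install openai

    # Assumed: a local OpenAI-compatible server hosting Qwen2.5-VL (e.g. vLLM)
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    with open("screenshot.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-7B-Instruct",  # assumed model id
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text",
                 "text": "Return the (x, y) pixel coordinates of the 'Submit' "
                         "button as JSON, e.g. {\"x\": 100, \"y\": 200}."},
            ],
        }],
    )
    print(resp.choices[0].message.content)  # grounding output to parse and click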