this is very cool, I've been playing around in the same space with a simple tracked robot and a 2dof gripper. you seem to be quite a bit ahead of me in functionality.
I'm using PaliGemma2 and MobileSAM for the vision part and Gemma for the thinking part. I'm hoping to stick with weights-available models as it's just a toy project.
for what it's worth this contraption cost under £200, but I'm using a desktop and a 3090 as the brains.
Like you, SAM + a VLM was the first thing we tried, and it already felt high-potential. It takes a lot of software work to put the right pieces together, but we think we've now ended up with something promising, scalable and extendable for a lot of people.
And on the price: same, our initial prototype was around $250 but I had to connect it to my computer. It's unclear to many in the field whether we'll be able to offload compute, with low enough latency, to a computer somewhere else in the house or even in the cloud. In the meantime at least, we decided to have onboard compute so that you can get started quickly. Even for you it would be useful, since we did the work of putting all the hardware and electronics together, and it's a pretty good computer onboard :)
Same for us back then! If I did it today though, I would love to try using an RPi 5; these look incredible. But honestly NVIDIA just released their new Jetson Orin Nano Super for $250, and I think at this point it's a no-brainer to use this instead of an RPi.
https://imgur.com/a/WAHUIjQ