> Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.
I'm not sure that's necessarily true for a lot of tasks.
A good way to measure this in your head is this:
"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"
When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.
It therefore follows that robots should be able to learn with just RGB images too! Counterexamples would be things like grabbing an egg without crushing it, perhaps. Though I suspect that could also be done with just vision.
> It therefore follows that robots should be able to learn with just RGB images too!
I don't see how that follows. Humans have trained by experimenting with actually manipulating things, not just by vision. It's not clear at all that someone who had gained intuition about the world exclusively by looking at it would have any success with mechanical arms.
1. First create a model that can evaluate how well a task is going; the YT approach can be used here.
2. Then build a real-world robot and train it by letting it do tasks, using the first model to supervise it; here the robot can learn to rely on extra senses such as touch/pressure (see the sketch below).
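To make that two-step setup concrete, here is a minimal sketch, assuming a vision-only reward model supervising a policy that also receives touch/pressure. Every class name, shape, and number below is made up for illustration; the stubs only mark where the learned pieces would go.

```python
import numpy as np

class VideoRewardModel:
    """Step 1: scores how well a task is going from RGB frames alone
    (e.g. learned from YouTube-style video). Stubbed with a random score."""
    def score(self, frames: np.ndarray) -> float:
        return float(np.random.rand())

class RobotEnv:
    """Step 2 environment: observations carry camera frames *and* extra
    senses (touch/pressure) that the reward model never sees."""
    def reset(self) -> dict:
        return {"frames": np.zeros((8, 64, 64, 3)), "touch": np.zeros(6)}
    def step(self, action: np.ndarray) -> dict:
        return {"frames": np.random.rand(8, 64, 64, 3),
                "touch": np.random.rand(6)}

class Policy:
    """The policy conditions on vision + touch; only its reward is vision-based."""
    def act(self, obs: dict) -> np.ndarray:
        return np.random.uniform(-1, 1, size=7)   # e.g. a 7-DoF arm command
    def update(self, obs: dict, action: np.ndarray, reward: float) -> None:
        pass                                      # RL update would go here

reward_model, env, policy = VideoRewardModel(), RobotEnv(), Policy()
obs = env.reset()
for _ in range(1_000):
    action = policy.act(obs)                      # touch/pressure informs the action
    obs = env.step(action)
    reward = reward_model.score(obs["frames"])    # supervision stays vision-only
    policy.update(obs, action, reward)
```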
You're agreeing with the parent, btw. You've introduced a lot more than just vision: you introduced interventional experimentation. That's a lot more than just observation.
By "intervention" I mean interacting with the environment. Purpose a hypothesis, test, modify, test. You can frame RL this way though RL usually generates hypotheses that are far too naïve.
Yes, you need to let the robot play (interact with the environment) to learn the vision-versus-touch correlations, but you can do so in an unsupervised way (as long as you choose the environment wisely).
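One way that unsupervised play could surface those vision/touch correlations is plain contrastive alignment of the two sensor streams logged at the same timestep. A rough PyTorch sketch, with random tensors standing in for real play data and made-up encoder sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Paired samples recorded while the robot "plays": at each timestep we log
# a camera frame and the touch/pressure reading. No labels are needed; the
# pairing itself is the supervision (CLIP-style contrastive alignment).
frames = torch.randn(256, 3, 64, 64)   # stand-in for logged RGB frames
touch  = torch.randn(256, 6)           # stand-in for 6-axis touch/pressure

vision_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
touch_enc  = nn.Sequential(nn.Linear(6, 128))
opt = torch.optim.Adam([*vision_enc.parameters(), *touch_enc.parameters()], lr=1e-3)

for epoch in range(10):
    z_v = F.normalize(vision_enc(frames), dim=-1)
    z_t = F.normalize(touch_enc(touch), dim=-1)
    logits = z_v @ z_t.T / 0.07                  # similarity matrix
    targets = torch.arange(len(frames))          # matching pairs on the diagonal
    loss = (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
    opt.zero_grad(); loss.backward(); opt.step()
```

The only "label" is the pairing itself: frame t goes with touch reading t, so matched pairs sit on the diagonal of the similarity matrix.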
I think you vastly underestimate how difficult the task you are proposing would be without depth or pressure indication, even for a super intelligence like humans.
Simple concept: pick up a glass and pour its contents into a vertical hole roughly the size of your mouth. Think of all the failure modes that can be triggered in this trivial example you perform multiple times a day; doing the same from a single camera feed with no other indicators would take you hours to master, and you already are a super-intelligent being.
A routine gesture I've done every day for almost all my life: getting a glass off the shelf and into my left hand. It seems like a no-brainer: I open the cabinet with my left hand, take the glass with my right hand, toss the glass from my right hand to the left while closing the cabinet with my shoulder, put the glass under the faucet with my left hand, and open the faucet with the right.
I have done this 3-second gesture, and variations of it, my whole life basically, and never noticed I was throwing the glass from one hand to the other without any visual feedback.
And you're used to the weight of the glass, which you instantly recognize when you pick it up. If it was a different weight than you were expecting, you'd probably slow down and be more deliberate.
If you were to just do the exact same robotic "throw" action with a glass of unexpected weight, you'd maybe not throw hard enough and miss, or throw too hard and possibly break it.
The point is how much non-vision sensing, versus pure vision, helps humans be humans. Don't you think LLMs already proved that generalizability doesn't come from multi-modality but from scaling a single modality itself? And JEPA is for sure designed to do a better job at that than an LLM. So I have no doubt that raw scaling plus an RL boost will deliver highly predictable and specific robotic movements.
This is not a proven statement; in fact, it's pretty clear that it isn't true. LLMs have some generalization, but not enough for what you're inferring. The best way to see this is to carefully talk to an LLM about anything you have a lot of domain expertise in. Be careful not to give it answers (information leakage can sneak in subtly) and specifically look for those small, subtle details (that's why it needs to be a topic you have expertise in). "The smell" will be right, but the information won't be.
Also, LLMs these days aren't trained on just language
Except this is absolutely the most common thing humans do, and my argument is not that it will spill water all over, but rather that it will shatter numerous glasses, knock them over, etc., all before it has even picked up the glass.
The same process will be repeated many times while trying to move the glass to its "face", and then when any variable changes (plastic vs. glass, size, shape, location) all bets are off, purely because there just plainly isn't enough information.
Humans did not accumulate that intuition just using images. In the example you gave, you subconsciously augment the image information with a lifetime of interacting with the world using all the other senses.
Yes, without extra information, manipulating everyday objects is probably as intuitive to robots as manipulating quantum scale molecules is for humans.
> because you as a human have really good intuition about the world.
This is the line that causes your logic to fail.
You introduced knowledge not obtained through observation. In fact, the knowledge you introduced is the whole chimichanga! It is an easy mistake to make, so don't feel embarrassed.
The claim is that one can learn a world model[0] through vision. The parent countered by saying "vision is not enough." Then you countered by saying "vision is enough if you already have a world model."
[0] I'll be more precise here. You can learn *A* world model, but it isn't the one we really care about, and "a world" doesn't require being a self-consistent world. We could say the same thing about "a physics", but let's be real: when we say "physics" we know which one is being discussed...
Counterpoint: think about all the tasks you could do with your hands and arms while your eyes are closed. I think it's really a lot of stuff, considering blind people can do the vast majority of things sighted people can do, and I suspect anything you could do with your eyes closed would be extremely difficult to do with a camera feed as the literal only sensory input.
> When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.
And where does this intuition come from? It was built by also feeling other sensations in addition to vision. You learned how gravity pulls things down when you were a kid, how hot/cold feels, how hard/soft feels, how things smell. Your mental model of the world is substantially informed by non-visual cues.
> It therefore follows that robots should be able to learn with just RGB images too!
That does not follow at all! It's not how you learned either.
Neither have you learned to think by consuming the entirety of all text produced on the internet. LLMs therefore don't think; they are just pretty good at faking the appearance of thinking.
>"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"
There are an infinite number of scenes that can be matched to one 2D picture. And what is a scene, really? The last time I checked, raw RGB was not a good input representation in computer vision, which instead relied on CNNs building up increasing levels of gradient features into a compositional scene. None of that is particularly translatable to how an LM works with text.
Video games have shown that we can control characters pretty darn well in virtual worlds whose physics we have never experienced. We just look at a 2D monitor and, using a joystick/keyboard, we manage to figure it out.
A game has very limited physics: the buttons you press are pre-tuned to perform certain actions, and you aren't dealing with continuous, nearly infinite possibilities across large ranges of motion, pressure, speed, etc. Think about how difficult the game QWOP is because you mostly just have visual feedback.
I beg to disagree. I was introduced to the brand-new (to me) physics of flying airplanes by MS Flight Simulator. None of the rules I knew from real life applied (gravity matters only sometimes, height can be traded for speed, etc.). Yet I learned how to fly.
And when I took real classes in a real Cessna, this experience was transferable (aka the flying model I had in my brain was very similar to the one I experienced with my full body in the cockpit).
Yeah, but we already have a conception of what physics should be prior to that, and it helps us enormously. It's not like game designers are coming up with stuff that intentionally breaks our naïve physics.
I mean, they do, but we often have generalized (to some degree) world models. So when they change gravity, flip things upside down, or make even more egregious changes, we can adapt, because we have counterfactual models. But yeah, they could change things so much that you'd really have to relearn, and that could be very, very difficult if not impossible. (I wonder if anyone has created a playable game with physics that's impossible for humans to learn, at least without "pen and paper". I think you could do this by putting the game in higher dimensions.)
I love uv, not just for local development: it also makes it WAY easier to manage the Python environments you set up for running Python workers / user code in the cloud.
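For example, a rough sketch of provisioning a per-worker environment programmatically; the paths, package list, and `provision_worker_env` helper are hypothetical, and you'd want real error handling around the `uv` calls:

```python
import subprocess
from pathlib import Path

def provision_worker_env(worker_dir: Path, packages: list[str]) -> Path:
    """Create an isolated venv for one cloud worker and install its deps via uv."""
    venv = worker_dir / ".venv"
    subprocess.run(["uv", "venv", str(venv)], check=True)
    python = venv / "bin" / "python"
    subprocess.run(
        ["uv", "pip", "install", "--python", str(python), *packages],
        check=True,
    )
    return python

# e.g. a pinned environment for running user code (paths/versions are made up)
py = provision_worker_env(Path("/tmp/worker-01"), ["numpy==2.1.0", "requests"])
subprocess.run([str(py), "-c", "import numpy; print(numpy.__version__)"], check=True)
```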
I was just telling someone the story of how he invented bitmapping for overlapping windows in the first Mac GUI in like two weeks, largely because he misremembered it as already being a feature in the Xerox PARC demo and was convinced it was already possible.
I highly recommend starting with the SO-ARM101 and the LeRobot tutorial. They're super cheap, it's insanely quick to get started, and you can even buy pre-made kits at https://partabot.com . It's the "Hello World" of robotics now, imo.
Don't bother with a Jetson Nano, you don't need that to get started, and by the time you need that you'll know a lot already. You can just drive the robot from your laptop!
Getting to the point of training your own fine-tuned VLA model is a super quick and easy process. You can see examples of other people completing the tutorial and uploading their training/evaluation datasets here (shameless plug for my thing): https://app.destroyrobots.com
I wouldn't bother much with ROS at first tbh. It'll bog you down, and startups are moving toward using other approaches that are more developer friendly, like Rust-based embedded.
You can go far with a robot connected to USB though!
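To give a feel for how little is needed, here is a toy sketch of "drive a robot over USB from your laptop" with pyserial. The port name, baud rate, and packet layout are placeholders, not the actual SO-ARM/Feetech servo protocol (LeRobot wraps that for you); the point is just that the arm shows up as a serial device.

```python
import time
import serial  # pyserial

PORT, BAUD = "/dev/ttyACM0", 1_000_000  # placeholders; check your own setup

def goto(bus: serial.Serial, servo_id: int, position: int) -> None:
    """Send one made-up 'move to position' packet to a single bus servo."""
    packet = bytes([0xFF, 0xFF, servo_id, (position >> 8) & 0xFF, position & 0xFF])
    bus.write(packet)

with serial.Serial(PORT, BAUD, timeout=0.1) as bus:
    for pos in (1000, 2000, 3000, 2000):   # wave one joint back and forth
        goto(bus, servo_id=1, position=pos)
        time.sleep(0.5)
```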
Wow, this is incredible work! Blown away by how well the audio/video matches up, and the dialogue sounds better than, or on par with, dedicated voice models.
> The article ignores Firefox switched the contract to Yahoo as the default search provider from 2014-2017
I worked at Mozilla when this deal was struck. The deal with Yahoo did require Yahoo be the default for Firefox, I'm not sure what you mean by "absence of any requirement"?
Mozilla broke that contract with Yahoo (there was a clause allowing them to do so without repercussion and keep the money, if they deemed it better for the users, wild contract) less than 3 years later because users hated Yahoo so much, and went back to Google.
Google is dominant because it just _is_ the best search engine.
> One of the biggest differences between Google selling Chrome and any old chromium fork is precisely that the "other" browsers no longer have to try to compete with Google's own browser to get users to monetize.
Isn't that literally anti-competitive? The DoJ is saying Google search is dominant partially because of Chrome pushing users to Google.
You're saying Chrome is dominant because users like it too much, and other browsers can't compete? Tough, that's the users' choice, though.
> I worked at Mozilla when this deal was struck. The deal with Yahoo did require Yahoo be the default for Firefox, I'm not sure what you mean by "absence of any requirement"?
Absence of any requirement to not choose Google (as the article argues would be the only possibility the choice would ever be made), not absence of some requirement in the specific contract itself.
> Mozilla broke that contract with Yahoo (there was a clause allowing them to do so without repercussion and keep the money, if they deemed it better for the users, wild contract) less than 3 years later because users hated Yahoo so much, and went back to Google.
Yes, 2014-2017, as was originally stated. This has two implications: first, Firefox was able to monetize its (much smaller) user base for $375,000,000/year for 3 years at the time using a non-Google search deal. Second, such deals still didn't make sense when you could get the same or more from Google and users would prefer it. The latter isn't really proof that blocking such a deal with Google would fail; it's just a restatement of the problems with Google's dominance in the space. Barring blocking, there is still clear leverage to get money out of the deal from Google without allowing Google to cross-influence markets.
A potential third implication is that Google Search is (probably) still users' favorite today and, if allowed to bid, Google would have to bid fairly for this deal with a third party rather than operating the browser under fully in-house control.
> Isn't that literally anti-competitive? The DoJ is saying Google search is dominant partially because of Chrome pushing users to Google.
Companies found to be abusing a monopoly in an area generally have legal remedies applied to them which would normally be considered unfair, in order to restore balance to the market they were found to abuse. Maybe you don't like that, and I can't argue with that, but the argument that actions to break up an existing monopoly are unfair is not the same argument as why a company's abuse of a monopoly is unfair, so disallowing one does not inherently negate the other.
> You're saying Chrome is dominant because users like it too much, and other browsers can't compete? Tough, that's the users' choice, though.
I don't recall saying that, no. Google Search and Chrome are dominant because of exclusive agreements and many acts of attacking competition across markets, leveraging popularity in one area to unfairly stifle competition in another. Being the most liked/popular is orthogonal to these problems. A popular service is not inherently anti-competitive, but Google didn't keep ending up in court solely because it's popular.
The article is a neat read! The design of the blog itself is even more interesting. I don't love the right-aligned way it starts, but I love the inline activations of the left popup! So cool
Thanks! It has some cons, like worse scannability. But I think it's really cool that you can have something open next to your paragraph, especially when you need to consult the popup quite often. A table with a bunch of data would also be quite nice with this approach, I feel.
I've been wanting to implement a design like this for blogs for 5 or 10 years. Great work on the inline detail on mobile; genuinely better than whatever I would have made.
Did you consider pushing the word(s) directly following the activation button below the detail pane, rather than doing it based on the line break?