
https://aistudio.google.com/live is by far the coolest thing here. You can just go there and share your screen or camera and have a running live voice conversation with Gemini about anything you're looking at. As much as you want, for free.

I just tried having it teach me how to use Blender. It seems like it could actually be super helpful for beginners, as it has decent knowledge of the toolbars and keyboard shortcuts and can give you advice based on what it sees you doing on your screen. It also watched me play Indiana Jones and the Great Circle, and it successfully identified some of the characters and told me some information about them.

You can enable "Grounding" in the sidebar to let it use Google Search even in voice mode. The video streaming and integrated search make it far more useful than ChatGPT Advanced Voice mode is currently.



Your comment got me so hopeful that I showed it the bug I'm currently working on. I even prepared everything first: my GitHub issue, the relevant code, and the terminal with the failing tests. I pasted in the full contents of the file and explained carefully what I wanted to achieve. As I was doing this, it repeated back to me everything I said, saying things like "if I understand correctly you're showing me a file called foo dot see see pee" and "I see you have a github issue open called extraneous spaces in frobnicator issue number sixty six" and "I see you have shared some extensive code", and after some more of this "validation"-speak it started reading out the full contents of the file: "backquote backquote backquote forward slash star .. import ess tee dee colon colon .." and so on.

Not quite up to excited junior-level programmer standards yet. But maybe good for other things, who knows.


You just rediscovered that LLMs become much stupider when using images as input. I think that has already been shown for GPT-4 as well.


Even when using the web search tool, GPT-4 becomes stupider.

Do they use a dumber model for tool use/vision?


I'm guessing that it's just a much harder problem. Images often contain more information, but it is far less structured and refined than language.

The transformation process that occurs when people speak and write is incredibly rich and complex. Images, by comparison, are essentially just the outputs of cameras or screen captures; there isn't an "intelligent" transformation process occurring.


I think images also have much higher information density than words, or at least they can. There is a reason a picture is worth 1000 words.


This is news to me. Any good examples of this outside of the above?


Vision language models are blind (192 comments) https://news.ycombinator.com/item?id=40926734


On the other hand, if I take pictures of circuits, boards, electronic components, etc., GPT-4o is pretty reliably able to explain to me the pinouts, the board layouts, and the reference material in the datasheets, and to provide pretty reasonable advice about how to use a part (i.e., where to put resistors and why, what pins to use for the component on an ESP32, etc.). As a beginner in electronics, this is fabulously helpful. Its ability to pass vision tests seems like a pretty dumb utility metric when most people judge utility by how useful things are.


> foo dot see see pee

Well there's your problem!


ccp is the Chinese version of c++. Or maybe they meant ссср, the Soviet version.


The ccp extension is just another c++ flavour /i


Not sure this is an AI limitation. I think you'd be better off here with the Gemini Code Assist plugin in VS Code rather than the live screen share. It sounds like the AI is being given unstructured information compared to an actual code base.


THIS is the thing I'm excited for with AI.

I'm someone who becomes about 5x more productive when I have a person watching or just checking in on me (even if they're just hovering there).

Having an AI basically be that "parent" that kicks me into gear would be so helpful. 90% of the time my problems are because I need someone to help keep the gears turning for me, and there isn't always someone available. This has the potential to be a person who's always available.


Just as an FYI: I recently learned (here on HN) that this is called Body Doubling[0]. There are some services around (at least one run by someone who hangs around here) that can do this too.

[0] https://en.m.wikipedia.org/wiki/Body_doubling


Also, there are co-working spaces in VRChat, which works wonders for me.

I went to the Glass Office co-working space to study for exams this summer and it worked out really well. I also met some nice people there.

A standalone Quest 3 is enough to get you started.


Do they support WFH setups?


The parent might be referring to us: https://workmode.net/. Most of our clients work from home. Do you have a specific concern about body doubling and working from home?


That’s interesting, I never considered something like that.

But at that low price, surely you have a bunch of customers being watched by each employee, and then talking to only one at a time — isn’t it distracting to see your “double” chatting away with the sound off?


No, nobody has ever complained about it (and yes, we did ask). When we first started, we were really concerned about it, so we tried to move as little as possible, avoid hand gestures, and so on. However, it turned out to be a non-issue.

Fun fact: I’d estimate that 50% of users don’t even look at their Productivity Partner while they work. WorkMode runs in another tab, and users rarely switch back to it. They don’t need to see us - they just need to know we’re watching. I’m in that group.


Some unsolicited feedback (please feel free to ignore):

When I click on "Pricing" in the nav bar, it scrolls down, and the first thing that catches my eye is "$2100 / month". This time I happened to notice that this figure is the benefit you're projecting, and that the actual price is $2.50/hour. On previous visits to your website from your HN comments, I'd always thought $2100/month was what you were going to charge me and closed the tab.

I've been frustrated myself that people don't read what's right there on the page when they come to my startup app's landing page. Turns out I do the same. Hope this helps you improve the layout, font sizes, and such "information hierarchy" so the correct information is conveyed at a glance.

IMHO $2.50/hour is great value, and stands on its own. I know how much my time is worth, so perhaps the page doesn't really have to shout that to convince me.

Again, please feel free to ignore this as it is quite possible that it is just me with the attention span of a goldfish with CTE while clicking around new websites.


Thank you! I hadn’t thought of it that way, but what you wrote makes total sense and explains the engagement issues we’re seeing with the calculator and the pricing section.

> Again, please feel free to ignore this as it is quite possible that it is just me with the attention span of a goldfish with CTE while clicking around new websites.

Most of our clients have issues with attention span, so your feedback is gold :-) Again, thank you!


You're welcome! BTW, this is how it looked: https://i.imgur.com/qg8gNJF.png

I understand that if the window were taller, I'd have seen the actual price cards. I think it's just that when you click "Pricing", you expect the next obvious number you see to be the price.


Clever service! I assume your employees watch several people at once. Is it engaging enough work for them?


Yes, they monitor several people simultaneously. Most clients ask us to check in on their progress every 15–30 minutes, and these interactions can last anywhere from a few seconds to three minutes, depending on the client and the challenges they're facing. It might be boring when working with a single person, but it gets more challenging as more people connect.

Also, we do more than just body doubling. Some clients need to follow a morning ritual before starting their work (think meditation, a quick house cleanup, etc.). Sometimes, we perform sanity checks on their to-do lists (people often create tasks that are too vague or vastly underestimate the time needed to complete them). We ask them to apply the 2-minute rule, and so on. It all depends on the client's needs.


Interesting! I see how this could work for inattentive procrastinators. By "inattentive procrastinators", I mean people who are easily distracted and forget that they need to work on their tasks. Once reminded, they return to their tasks without much fuss.

However, I doubt it would work for hedonistic procrastinators. When body doubling, hedonistic procrastinators rely on social pressure to be productive. Using AI likely won't work unless the person perceives the AI as a human.


You don't necessarily need to believe the AI is a human for it to tickle the ingrained social instincts you're looking for. For example, I'm quite aware that AIs are just tools, and yet I still feel a strong need to be "polite" in my requests to ChatGPT: "Please do ...." or "Can you...?" and even "Thanks, that worked! Now can you..." etc.


I do the same, but I think it's because we were taught to be polite and to conduct conversations in a certain way.

Do you put effort into being polite when ChatGPT makes a mistake and you correct it? Do you try to soften the blow to avoid hurting its "feelings"? Do you feel bad if you respond impolitely? I don't.


You only do that politeness as a novice.

My questions to copilot.ms.com today are more like the following, and it still works like a charm...

"I have cpp code: <enter><code snippet><enter> and i get error <piece of compilation output>. Is this wrong smart ponitor?"

[elaborate answer with nice examples]

"Works. <Next question>"


I don't feel this at all. I treat ChatGPT like an investment banking intern.


So why not fire 3 of your colleagues and have another one whose new job is watching over/checking in on you? By your own account, productivity would be about the same. Save your company some money; it will be appreciated!

On an unrelated note, I believe people need to start quantifying their outrageous AI productivity claims or shut up.


I'm intrigued to know whether that actually ends up working. I am something like that myself, but I don't know whether it is an effect of getting feedback or of having a person behind the feedback.


There's definitely an ideal setup needed for it to work. I'm also not quite sure what part of the other person being present causes me to focus better (i.e., whether it's the presence itself vs. the good ideas and feedback).

I'm leaning toward saying that the main issue for me is that I need to keep my focus on activities with active engagement rather than passive engagement, like taking notes versus just reading a passage.


Your "parent" kicked you into gear because you have an emotional bond with them. A stranger might cause your guard to go up if you don't respect them as having wisdom. The same may go for an AI.


I used the term "parent" here because it was the descriptor I thought people would understand best.

For me personally, I was awful at working when my parents were hovering over me.

I used to work with a professor on a project, and we'd spend significant amounts of time working on Zoom calls (this was during COVID). The professor wouldn't even be helping me the entire time, but as soon as I was blocked, I'd start talking, the ideas would bounce back and forth, and I'd find a solution significantly quicker.


Shameless plug: I'm working on something like this: https://myaipal.kit.com/prerelease


So I watched the demo video on your site, and honestly I'm not sure how this is really all that much better than what can already be done with ChatGPT.

The key is, I don't want to have to initiate the contact. Hand holding the AI myself defeats the purpose. The ideal AI assistant is one that behaves as if it's a person that's sitting next to me.

Imagine you're a junior that gets on a teams call to get help via pair programming with your boss. For anything more than just a quick fix, pair programming on calls tends to turn into the junior working on something, hitting a roadblock, and the boss stepping in to provide input.

Here's the really important part that I've realized: very rarely will the input that the boss provides be something that is leaps and bounds outside of the ability of the junior. A lot of it will just be asking questions or talking the problem through until it turns the gears enough for the junior to continue on their own. THAT right there. That's the gear turning AI agent I'm looking for.

If someone could develop a tool that "knows" the right time to jump in and talk with you, then I think we'd see huge jumps in productivity for people.


At least you can theoretically stop sharing with this one. Microsoft was essentially trying to do this, but for everything on your PC, with zero transparency.

Here's Google doing essentially the same thing, even more invasively in that it's explicitly shipping your activity to the cloud, and yet the response is so different from the "we're sticking this on your machine and you can't turn it off" version Microsoft was attempting to land. This is what Microsoft should have done.


This is great! I viscerally dislike the "we're going to do art for you so you don't have to... even if you want to..." side of AI, but learning to use the tools to get the satisfaction of making it yourself is not easy! After 2 decades of working with 2D art and code separately, learning 3D (if you include things like the complex and counterintuitive data flow of simulations in Houdini and the like) was as difficult as, or more difficult than, learning to code. Beyond that, taking classes is f'ing expensive, and more of that money goes to educational institutions than to the teachers themselves. Obviously, getting beyond the basics in areas that require experienced critique is going to need human understanding, but for the base technical stuff, this is fantastic.


This comment is better than the entire ad Google just showed. Who still points a camera at a building and asks "what is this building?"


I do that in Manhattan. I also do it for yonder mountains.


Sounds interesting, but voice input isn't working for me there. I guess I'm too niche with my Mac and Firefox setup.


Actually, plenty of tech people use a Mac and Firefox.


Irony detectors malfunctioning perhaps?


'irony' meant 'something made of metal' last time I checked


Right, and Macs are made out of aluminium


What is Firefox made out of then?


The amount of rust indicates iron. So Firefox is very irony.


Fire and foxes, presumably.


Wood.


Aluminum. It's American.


This isn't entirely surprising, as Google has been artificially breaking things on Firefox for years now (Google Maps and YouTube at least). Maybe try spoofing Chrome's user agent; an example is sketched below.
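If you want to try that, Firefox supports a global user-agent override via a string preference you create in about:config. The Chrome UA value here is just an illustrative example of mine; copy a current one from an actual Chrome install:

    general.useragent.override = Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36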


Console is throwing an error: "Connecting AudioNodes from AudioContexts with different sample-rate is currently not supported."

Quick research suggests this is part of Firefox's anti-fingerprinting functionality.
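If the anti-fingerprinting theory is right, the setting to test (an assumption on my part; I haven't verified it against this exact error) would be this boolean in about:config, which among other things pins AudioContext sample rates to a fixed value:

    privacy.resistFingerprinting = false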


I tried this, shared a terminal, asked it to talk about what it saw, and it guessed that it was Google Chrome with some web UI stuff. Immediately closed the window and bailed.


Which terminal? Was it Chromium-based?


Nope. Just KiTTY on Windows.


Getting-started documentation on the Multimodal Live API: https://ai.google.dev/api/multimodal-live
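For anyone who'd rather poke at it from code than from AI Studio, here's a minimal text-only sketch using the google-genai Python SDK. The model name, config key, and method names are my reading of the linked docs, not gospel; check them against the current reference before relying on them.

    # pip install google-genai
    # Minimal text-in/text-out sketch of the Multimodal Live API.
    # Assumptions: the "gemini-2.0-flash-exp" model name and the
    # response_modalities config key match the docs linked above.
    import asyncio
    from google import genai

    client = genai.Client(api_key="YOUR_API_KEY")

    async def main():
        config = {"response_modalities": ["TEXT"]}
        async with client.aio.live.connect(
            model="gemini-2.0-flash-exp", config=config
        ) as session:
            # Send one user turn, then stream the reply as it arrives.
            await session.send(input="In one paragraph, what can this API do?",
                               end_of_turn=True)
            async for response in session.receive():
                if response.text:
                    print(response.text, end="")

    asyncio.run(main())

Audio and screen sharing run over the same bidirectional session; you stream media chunks in instead of a single text turn.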


I don't know what's not working, but I get: "Has a large language model. I don't have the capability to see your screen or any other visual input. My interactions are purely based on the text that you provide."


This'll be so fantastic once local models can do it, because nobody in their right mind would stream their voice and everything they do on their machine to Google, right? Right?

Oh, who am I kidding, people upload literally everything to Drive lmao.



