Multi-modal audio is great. I talk to ChatGPT when I'm cooking or walking the dog.
For images I use it for things like drafting initial alt text, extracting tables from screenshots, and translating photos of signs in languages I don't speak - and then really fun stuff like "invent a recipe to recreate this plate of food", "my CSS renders like this, what should I change?", or "How do you think I turn on this oven?" (in an Airbnb).
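If you've never tried the image stuff programmatically, it's only a few lines against the API. A minimal sketch with the OpenAI Python SDK - model name, file name, and prompt here are placeholder assumptions, and any vision-capable model works the same way:

    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Encode a local screenshot as a data URL the API can accept
    with open("screenshot.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the table in this screenshot as Markdown."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)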
I've recently started using the screen-sharing feature Gemini provides at https://aistudio.google.com/live when I'm reading academic papers and want help understanding the math. I can ask "What does this symbol with the squiggle above it mean?" out loud and Gemini will explain it for me - works really well.
Just last night I was digging around in my basement, pulling apart my furnace, showing pics of the inside of it, having GPT explain how it works and what I needed to do to fix it.
If there are no reputable sources to point to, then where exactly is GPT deriving its answer from? And how can we be assured GPT is correct about the furnace in question?
I mean... I fed it photos of the unit and every diagram and instruction panel on the thing. I was confident in the information it was giving me about what parts did what, where to look, and what to look for. You have to evaluate its output, certainly.
Getting it to help me fix a mower now. It's surfacing some good YouTube vids.
I use it like that all the time. There's so much information in the world that assumes you already have a certain level of understanding - that you can decipher the jargon it uses and fill in the blanks when it doesn't provide enough detail.
I don't have 100% of the "common sense" knowledge about every field, but good LLMs probably have ~80% of that "common sense" baked in, which makes them better at interpreting incomplete information than I am.
A couple of examples: a post on some investment forum mentions DCA, or a cooking recipe tells me to "boil the pasta until done". The model can fill in those gaps for me - that DCA means dollar-cost averaging, or roughly what "done" looks like for that pasta.
I absolutely buy that feeding in a few photos of dusty half-complete manual pages found near my water heater would provide enough context for it to answer questions usefully.
Oh right, yeah, I've done things like this (phone calls to ChatGPT) or the openwebui Whisper -> LLM -> TTS setup. I thought there might be something more than this by now.
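For what it's worth, the Whisper -> LLM -> TTS loop is only three calls if you go straight at the OpenAI API - a minimal sketch, assuming the hosted whisper-1 / tts-1 models and placeholder file and model names (openwebui wires up the same pipeline, optionally with local models):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # 1. Speech -> text with Whisper
    with open("question.wav", "rb") as audio:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

    # 2. Text -> answer with a chat model
    answer = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": transcript.text}],
    )
    reply_text = answer.choices[0].message.content

    # 3. Answer -> speech with a TTS voice
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
    speech.write_to_file("reply.mp3")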