meerab's comments | Hacker News

Multimodal (specifically video) testing of Gemini 2.5


The Gemini 2.5 model is truly impressive, especially its multimodal capability. Its native understanding of audio and video content is groundbreaking.

I spent some time experimenting with Gemini 2.5, and its reasoning abilities blew me away. Here are a few standout use cases that showcase its potential:

1. Counting Occurrences in a Video

In one experiment, I tested Gemini 2.5 with a video of an assassination attempt on then-candidate Donald Trump. Could the model accurately count the number of shots fired? This task might sound trivial, but earlier AI models often struggled with simple counting tasks (like identifying the number of "R"s in the word "strawberry").

Gemini 2.5 nailed it! It correctly identified each gunshot sound, output the timestamp at which it occurred, and counted eight shots, providing both visual and audio analysis to back up its answer. This demonstrates not only its ability to process multimodal inputs but also its capacity for precise reasoning, a major leap forward for AI systems.
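
For the curious, here is a minimal sketch of the kind of call behind these experiments, using Google's google-genai Python SDK. The file name and prompt are illustrative placeholders, and swapping in a different prompt covers the other use cases below.

  # Minimal sketch, assuming the google-genai SDK (pip install google-genai);
  # the file name and prompt are illustrative placeholders.
  import time
  from google import genai

  client = genai.Client(api_key="YOUR_API_KEY")

  # Upload the clip; the Files API takes a moment to process video.
  video = client.files.upload(file="rally_clip.mp4")
  while video.state.name == "PROCESSING":
      time.sleep(5)
      video = client.files.get(name=video.name)

  # Ask the model to reason over both the audio and visual tracks.
  response = client.models.generate_content(
      model="gemini-2.5-pro",
      contents=[
          video,
          "How many shots are fired in this clip? Give a timestamp "
          "for each, citing the audio or visual evidence.",
      ],
  )
  print(response.text)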

2. Identifying Background Music and Movie Name

Have you ever heard a song playing in the background of a video and wished you could identify it? Gemini 2.5 can do just that! Acting like an advanced version of Shazam, it analyzes the audio track embedded in a video and identifies the background music. I'm also not a big fan of people posting movie shorts without specifying the title. Gemini 2.5 solves that problem too; no more hunting for the movie name!

3. OCR Text Recognition

Gemini 2.5 excels at optical character recognition (OCR), extracting text from images or videos with precision. I asked the model to convert one of Khan Academy's handwritten visuals into a table, and the text was copied precisely from the video into a neat little table!

4. Listening to Foreign News Media

The model can translate speech from one language to another with good quality. I tested a recent official statement from Thai officials about the earthquake felt in Bangkok, as well as the latest news from a Marathi news channel. In both cases, the model correctly translated the content and output a news synopsis in the language of my choice.

5. Cricket Fans?

Sports fans and analysts alike will appreciate this use case! I tested Gemini 2.5 on an ICC T20 World Cup cricket match video to see how well it could analyze gameplay. The results were incredible: the model accurately calculated scores, identified the number of fours and sixes, and even pinpointed key moments, all while providing timestamps for each event.

6. Webinar: Generating Slides from Video

Now this blew my mind. Video webinars are generated from a slide deck plus a person talking over the slides. Can we reverse the process? Given the video, can we ask the AI to output the slide deck? Gemini 2.5 produced 41 slides for a Stanford webinar!

Bonus: Humor Test

Finally, I put Gemini 2.5 through a humor test using a PG-13 joke from one of my favorite YouTube channels, Mike and Joelle. I wanted to see if the model could understand adult humor and infer punchlines.

At first, the model hesitated to spell out the punchline (perhaps trying to stay appropriate?), but eventually it got there. And yes, it understood the joke perfectly!

https://videotobe.com/blog/googles-gemini-25


For the past six months, I've been running VideoToBe.com, a simple transcription service on a single machine hosted in my house. My DIY setup is small yet functional: I use OpenAI's Whisper model to convert audio to text and send the transcript via email. You can try my service at VideoToBe.com.

First, I had to upgrade my desktop to one with a GPU. I had been exploring LLMs and generative AI, but my old hardware wasn't enough; the GPU was the most expensive component of the upgrade. I opted for an NVIDIA RTX 3090, 62 GB of RAM, and an AMD Ryzen 9.

My Kitchen Table Tech Stack

- A GPU computer running on my home network

- A front-end that uploads files to a storage bucket and adds the task to a queue

- A cron job that downloads these files

- OpenAI's Whisper model to create the transcripts

- A simple email system to deliver results

The service runs on residential AT&T Fiber. Since traffic is still low, my home connection can handle it. Requests are added to the queue and processed one by one by the server, which can run the Whisper large model. Transcripts are delivered via email, so I have some wiggle room on latency; so far, I have been able to deliver most transcripts within 30 minutes.
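
In code, the worker half of this stack amounts to something like the sketch below. The queue, bucket download, and mail settings are hypothetical stand-ins for whatever you run at home; only the Whisper calls are the real library API.

  # Cron-driven worker sketch (pip install openai-whisper). The job
  # source and SMTP details are hypothetical stand-ins.
  import smtplib
  from email.message import EmailMessage

  import whisper

  model = whisper.load_model("large")  # fits on a 24 GB RTX 3090

  def process_job(audio_path, recipient):
      # Transcribe the file already downloaded from the storage bucket
      result = model.transcribe(audio_path)

      # Deliver the transcript by email
      msg = EmailMessage()
      msg["Subject"] = "Your transcript is ready"
      msg["From"] = "meera@videotobe.com"
      msg["To"] = recipient
      msg.set_content(result["text"])
      with smtplib.SMTP("localhost") as smtp:
          smtp.send_message(msg)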

What Did I Learn?

- Scaling is overrated: My small setup has worked fine for six months. One day I may need to scale, but that day is not today. Over-engineering kills more projects.

- Users are nice: I used to think users were demanding and unreasonable. But when you set clear expectations and deliver, people are kind. Some have even given me marketing tips and encouragement!

- Monthly subscription fatigue: Most people hate monthly subscriptions!

Where Am I Right Now, and What's Next?

People actually use my service! We recently crossed 10,000 transcripts. Many users keep coming back and send kind emails. Their encouragement keeps me going!

I'm now building a more complete product. My original goal was to build a system with video search, insights, and chat with videos. So far, I have worked only on the audio part; adding visual context is my next step.

Have a big collection of audio or video? Get in touch! I'd love to help you build your video-insights media library. Have feedback? Feel free to reach me at meera@videotobe.com.


Her answer: the poster can be female.


A Candid Review of DeepLearning.ai's Short Courses
I approached DeepLearning.ai's courses with high expectations, drawn by Andrew Ng's stellar background as a Stanford Professor and now Amazon Board member. His contributions to AI education through Coursera's pioneering ML course were groundbreaking.

However, most short courses on DeepLearning.ai function more as product showcases than educational resources. For instance:

* "Building Agentic RAG with LlamaIndex” → LlamaIndex walkthrough * "Serverless Agentic Workflows" → Amazon Bedrock tutorial * "Building Multimodal Search" → Weaviate implementation guide * “Evaluating and Debugging Generative AI Models” → How to use Weights & Biases

Yes, these courses are free. But they seem designed to funnel developers toward specific commercial APIs rather than build foundational knowledge. Know what you're signing up for: hands-on product experience, not deep technical education.

Thoughts?


"The development process felt natural with Next.js" - the author

What part of this React syntax do you find natural? Familiar, yes. Natural, no.

  useEffect(() => {
    const timer = setInterval(() => {
      setCount((prevCount) => prevCount + 1);
    }, 1000);

    return () => clearInterval(timer);
  }, []);


HTMX doesn't have an elegant solution for timers and intervals, as far as I know. You would still need to awkwardly wire that up in JS with HTMX.


It does: `hx-trigger` supports repeated polling on an interval (e.g. `hx-trigger="every 2s"`) as well as delays. The terminology is different and the ergonomics are different, but the underlying concept is the same.


`hx-trigger` is used for server-side requests. `setInterval` can fire HTTP requests on an interval, but it can also do a lot more, and you have full control over it. `hx-trigger` means programming the request in a proprietary mini-language embedded in strings; looking at it, I have no idea how it tears down, cancels, or responds to other events or triggers on the page. The reason JSX exists is that a custom template language ALWAYS ends up reinventing JavaScript in a worse, more limited form inside template strings.


What do you mean by timers and intervals?


You're being very generous with this code sample. It can get far more complex if you want a custom hook with variable timeout. Dan even had to write a lengthy article on the topic. https://overreacted.io/making-setinterval-declarative-with-r...


What would look natural for this use case? Can you write the code?


Well done! The project is simple and well documented. The Deno deployment and microservice API approach is innovative. And thank you for making it open source!


Thanks for the comment. I'm glad that this work helps other people as well!

I also plan on improving it over time with other features.


I have been experimenting with making content libraries searchable. I found that it's relatively easy to build one using a RAG framework (LlamaIndex or LangChain) and a vector database.
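
As a rough illustration, this is the shape of that pipeline with LlamaIndex and its default in-memory vector store; the directory name and query are made-up examples, and an embedding model/API key is assumed to be configured.

  # Rough sketch with LlamaIndex (pip install llama-index); assumes an
  # embedding model/API key is already configured.
  from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

  # Load transcripts (plain-text files) from a made-up directory
  documents = SimpleDirectoryReader("transcripts/").load_data()

  # Embed and index them in the default in-memory vector store
  index = VectorStoreIndex.from_documents(documents)

  # Ask natural-language questions over the library
  query_engine = index.as_query_engine()
  print(query_engine.query("Which videos discuss earthquake safety?"))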

-Meera@VideoToBe.com


Great tool!

Founder at VideoToBe.com here. I built a similar service, and it worked for a while. The moment it started to get traffic, it got blocked by YouTube. Your service may also get blocked once you start to scale. Your planned document-centric version is more promising: it opens the door to various use cases and isn't limited to YouTube.

I pivoted to transcription and AI summarization for user-uploaded content.


Thanks! I'm already dealing with some issues there. I think the solution is to have a series of fallback approaches: as one method stops working, find new ones, while keeping enough in reserve that you never have a service interruption.

