The Gemini 2.5 model is truly impressive, especially its multimodal capability. Its ability to understand audio and video content is genuinely groundbreaking.
I spent some time experimenting with Gemini 2.5, and its reasoning abilities blew me away. Here are a few standout use cases that showcase its potential:
1. Counting Occurrences in a Video
In one experiment, I tested Gemini 2.5 with a video of an assassination attempt on then-candidate Donald Trump. Could the model accurately count the number of shots fired? This task might sound trivial, but earlier AI models often struggled with simple counting tasks (like identifying the number of "R"s in the word "strawberry").
Gemini 2.5 nailed it! It correctly identified each sound, output the timestamp where each one appeared, and counted eight shots, backing up its answer with both visual and audio analysis. This demonstrates not only its ability to process multimodal inputs but also its capacity for precise reasoning, a major leap forward for AI systems.
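If you want to try something similar yourself, here is a minimal sketch of sending a video plus a counting prompt through Google's Python SDK. The model name, file name, and prompt are placeholders rather than the exact ones I used, and the SDK surface may vary by version.

```python
# Minimal sketch: ask Gemini to count and timestamp events in an uploaded video.
# Assumes the google-generativeai package and a GEMINI_API_KEY environment variable;
# the model name, file name, and prompt are placeholders.
import os
import time

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Upload the video and wait for the Files API to finish processing it.
video = genai.upload_file(path="rally_clip.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-2.5-pro")
response = model.generate_content([
    video,
    "Count the gunshots in this clip and list a timestamp for each one.",
])
print(response.text)
```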
2. Identifying Background Music and Movie Name
Have you ever heard a song playing in the background of a video and wished you could identify it? Gemini 2.5 can do just that! Acting like an advanced version of Shazam, it analyzes the audio track embedded in a video and identifies the background music. I'm also not a fan of people posting movie shorts without naming the film; Gemini 2.5 solves that problem too - no more searching for the movie name!
3. OCR Text Recognition
Gemini 2.5 excels at Optical Character Recognition (OCR), making it capable of extracting text from images or videos with precision. I asked the model to convert one of Khan Academy's handwritten visuals into a table format, and the text was copied precisely from the video into a neat little table!
4. Listen to Foreign News Media
The model can translate between languages and produce a good translation. I tested a recent official statement from Thai authorities about an earthquake that affected Bangkok, as well as the latest news from a Marathi news channel. In both cases, the model correctly translated the content and output a news synopsis in the language of my choice.
5. Cricket Fans?
Sports fans and analysts alike will appreciate this use case! I tested Gemini 2.5 on an ICC T20 World Cup cricket match video to see how well it could analyze gameplay data. The results were incredible: the model accurately calculated scores, identified the number of fours and sixes, and even pinpointed key moments—all while providing timestamps for each event.
6. Webinar - Generate Slides from Video
Now this blew my mind. Video webinars are typically produced from a slide deck plus a person talking over the slides. Can we reverse the process? Given only the video, can we ask AI to output the slide deck? Gemini 2.5 generated 41 slides for a Stanford webinar!
Bonus: Humor Test
Finally, I put Gemini 2.5 through a humor test using a PG-13 joke from one of my favorite YouTube channels, Mike and Joelle. I wanted to see if the model could understand adult humor and infer punchlines.
At first, the model hesitated to spell out the punchline (perhaps trying to stay appropriate?), but eventually, it got there—and yes, it understood the joke perfectly!
For the past 6 months, I've been running VideoToBe.com, a simple transcription service powered by a single machine hosted in my house. My DIY setup is small but functional: I use OpenAI's Whisper model to convert audio to text and send the transcript via email. You can try the service at VideoToBe.com.
First, I had to upgrade my desktop to one with a GPU. I had been exploring LLMs and generative AI, but my old hardware was not up to the task; the GPU was the most expensive component of the upgrade. I opted for an NVIDIA RTX 3090 GPU, 62 GB of RAM, and an AMD Ryzen 9.
My Kitchen Table Tech Stack
A GPU computer running in my home network
A front-end that uploads files to a Storage Bucket and adds the task to a queue.
A cron job that downloads these files
OpenAI's Whisper model to create the transcripts
A simple email system to deliver results
The service runs on my home internet, over residential AT&T Fiber. Since traffic is still low, the home connection can handle it. Requests are added to the queue and processed one at a time by the server, which is capable of running the Whisper large model. Transcripts are delivered via email, so I have some wiggle room in terms of performance; so far, I have been able to deliver most transcript requests within 30 minutes. A rough sketch of the worker loop is shown below.
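For the curious, here is what the cron-driven worker roughly looks like. It is illustrative only: the queue file format, file paths, sender address, and SMTP setup are placeholders, and the real service has more error handling.

```python
# Illustrative worker sketch: pull one queued job, transcribe it with Whisper,
# and email the transcript. The queue format, paths, and addresses are
# hypothetical stand-ins for the real setup.
import json
import smtplib
from email.message import EmailMessage
from pathlib import Path

import whisper  # pip install openai-whisper

QUEUE_FILE = Path("queue.jsonl")  # one JSON job per line: {"audio": ..., "email": ...}
MODEL = whisper.load_model("large")

def next_job():
    """Pop the first pending job from the queue file, if any."""
    lines = QUEUE_FILE.read_text().splitlines() if QUEUE_FILE.exists() else []
    if not lines:
        return None
    QUEUE_FILE.write_text("\n".join(lines[1:]))
    return json.loads(lines[0])

def send_transcript(to_addr: str, transcript: str) -> None:
    """Email the finished transcript to the requester."""
    msg = EmailMessage()
    msg["Subject"] = "Your transcript is ready"
    msg["From"] = "transcripts@example.com"  # placeholder sender
    msg["To"] = to_addr
    msg.set_content(transcript)
    with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
        smtp.send_message(msg)

if __name__ == "__main__":
    job = next_job()
    if job:
        result = MODEL.transcribe(job["audio"])
        send_transcript(job["email"], result["text"])
```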
What Did I Learn?
- Scaling is overrated: My small setup has worked fine for 6 months. One day I may need to scale, but that day is not today. Over-engineering kills more projects than it saves.
- Users are nice: I used to think users were demanding and unreasonable. But when you set clear expectations and deliver, people are kind. Some have even given me marketing tips and encouragement!
- Monthly subscription fatigue: Most people hate monthly subscriptions!
Where Am I Right Now and What Next?
People actually use my service! We recently crossed 10,000 transcripts. Many users keep coming back and send kind emails. Their encouragement keeps me going!
I'm now building a more complete product. My original goal was to build a system with video search, insights, and chat with videos. So far I have worked only on the audio side; adding visual context is my next step.
Have a big collection of audio or video? I'd love to help you build your video insights media library. Have feedback? Feel free to get in touch at meera@videotobe.com.
A Candid Review of DeepLearning.ai's Short Courses
I approached DeepLearning.ai's courses with high expectations, drawn by Andrew Ng's stellar background as a Stanford Professor and now Amazon Board member. His contributions to AI education through Coursera's pioneering ML course were groundbreaking.
However, most short courses on DeepLearning.ai function more as product showcases than educational resources. For instance:
* "Building Agentic RAG with LlamaIndex” → LlamaIndex walkthrough
* "Serverless Agentic Workflows" → Amazon Bedrock tutorial
* "Building Multimodal Search" → Weaviate implementation guide
* “Evaluating and Debugging Generative AI Models” → How to use Weights & Biases
Yes, these courses are free. But they seem designed to funnel developers toward specific commercial APIs rather than build foundational knowledge.
Know what you're signing up for: hands-on product experience, not deep technical education.
#AI #MachineLearning #TechEducation #DeepLearning
It does: `hx-trigger` supports repeated polling on an interval, as well as delays. The terminology is different and the ergonomics are different, but the underlying concept is the same.
`hx-trigger` is used for server-side requests. `setInterval` can be used to fire HTTP requests on an interval, but it can also be used for a lot more, and you have full control over it. `hx-trigger` involves programming the request with a proprietary mini-language embedded in strings, and just by looking at that I have no idea how it tears down, cancels, or responds to other events or triggers on the page. The reason JSX exists is that a custom template language ALWAYS ends up reinventing JavaScript in a worse, more limited form as strings.
Well done! The project is simple and well documented. The Deno deployment and microservice API approach is innovative. And thank you for making it open source!
I have been experimenting with making content libraries searchable. I found that it is relatively easy to build one using a RAG solution (LlamaIndex or LangChain) and a vector database.
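Roughly, the LlamaIndex version looks like the sketch below. It assumes a recent llama-index release and an OpenAI key in the environment; the directory name and the query are placeholders.

```python
# Minimal sketch of making a content library searchable with LlamaIndex.
# Assumes a recent llama-index release and OPENAI_API_KEY set in the environment;
# the "transcripts" directory and the query string are placeholders.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("transcripts").load_data()  # load text files
index = VectorStoreIndex.from_documents(documents)            # embed into a vector index
query_engine = index.as_query_engine()

print(query_engine.query("Which videos mention fine-tuning?"))
```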
Founder at VideoToBe.com here. I built a similar service, and it worked for a while. The moment it started to get traffic, it was blocked by YouTube. Your service may also get blocked once you start to scale.
Your next, document-centric iteration is more promising. It opens the door to various use cases and isn't limited to YouTube.
I pivoted to transcriptions and AI summarization for user-uploaded content.
Thanks! I'm already dealing with some issues with that. I think the solution is to have a series of fallback approaches: as one method stops working, look for new ones, but keep enough in reserve that you never have a service interruption.