I imagine you can do it by AI-transcribing the podcast while preserving timestamp metadata for each word. Use an LLM to identify undesirable segments (ask it to output JSON or something) and then cut them out of the audio with ffmpeg.
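A rough sketch of the last step, assuming the LLM replies with a JSON list of ad spans in seconds. The JSON shape, the file names, and the helper names are made up for illustration; the ffmpeg side uses the `aselect` and `asetpts` audio filters to drop the spans and close the gaps.

```python
import json

def keep_intervals(ad_segments, duration):
    """Return the (start, end) spans to keep, given the ad spans to drop."""
    keep, cursor = [], 0.0
    for start, end in sorted(ad_segments):
        start, end = max(start, 0.0), min(end, duration)
        if start > cursor:
            keep.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        keep.append((cursor, duration))
    return keep

def ffmpeg_command(src, dst, ad_segments):
    """Build an ffmpeg argument list that drops the ad spans from the audio."""
    if not ad_segments:
        return ["ffmpeg", "-i", src, dst]
    drop = "+".join(f"between(t,{s},{e})" for s, e in sorted(ad_segments))
    # aselect keeps samples where the expression is non-zero;
    # asetpts rewrites timestamps so the output has no gaps.
    audio_filter = f"aselect='not({drop})',asetpts=N/SR/TB"
    return ["ffmpeg", "-i", src, "-af", audio_filter, dst]

# Hypothetical LLM reply (this JSON shape is an assumption).
llm_reply = '[{"start": 0.0, "end": 30.5}, {"start": 600.0, "end": 660.0}]'
ads = [(seg["start"], seg["end"]) for seg in json.loads(llm_reply)]

print(keep_intervals(ads, duration=1800.0))
print(ffmpeg_command("episode.mp3", "episode_noads.mp3", ads))
```

Pass the resulting list to `subprocess.run(cmd, check=True)` to actually do the cut. The filter re-encodes the audio, so it is slower than a stream copy but cuts are not tied to frame boundaries.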
Then you would need to set up a server that would do all this and serve as a 'mirror' to your podcasts without the ads.
If you've gone through that much effort, you might as well turn it into a subscription service. It would be resource intensive, but some people would gladly pay through the nose to rid their podcasts of ads.
I'd definitely like to make it easier to use and spread it more widely, but I can't directly distribute the edited (copyrighted) podcast files. I might share transcript markers instead, i.e. the text right before and after each ad segment, which is like a slightly more complicated version of what SponsorBlock does.
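To make the marker idea concrete, here is a minimal sketch of how a client could turn shared markers back into timestamps against its own transcript. It assumes a word-level transcript as a list of dicts with `text`, `start`, and `end` keys; that layout, the marker phrases, and the function names are all hypothetical.

```python
def normalize(token):
    """Lowercase a token and strip surrounding punctuation."""
    return token.lower().strip(".,!?\"'")

def find_phrase(words, phrase, start=0):
    """Return the index of the first word of `phrase` in `words`, or -1."""
    target = [normalize(t) for t in phrase.split()]
    tokens = [normalize(w["text"]) for w in words]
    for i in range(start, len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return i
    return -1

def ad_span(words, before, after):
    """Map a (before, after) marker pair to an ad span in seconds."""
    i = find_phrase(words, before)
    if i < 0:
        return None
    first_ad_word = i + len(before.split())
    j = find_phrase(words, after, first_ad_word)
    if j < 0:
        return None
    # The ad runs from the end of the "before" marker to the start of the "after" marker.
    return (words[first_ad_word - 1]["end"], words[j]["start"])

# Toy transcript: every word lasts 0.5 s.
text = "We'll be right back. This episode is sponsored by Acme. Welcome back to the show."
words = [{"text": t, "start": k * 0.5, "end": k * 0.5 + 0.5}
         for k, t in enumerate(text.split())]

print(ad_span(words, before="be right back", after="welcome back"))
```

Exact matching like this breaks as soon as two transcriptions disagree on a word, so a real version would probably need fuzzy matching.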
What's your prompt for Gemini like? Does it include examples of ads? I assume you're using Flash for cost?
I also have a setup like this: I transcribe with Whisper, send the transcript to OpenAI 4o-mini to detect ads, then clip those segments with pydub. My prompt must be lacking, though, because the success rate on detecting ads is maybe 60%.
My Gemini Flash 2.0 prompt:
"Below is the transcript of a podcast preceded by a line number. Reply with the line numbers that are likely to be from advertisements, promotions, commercials, sponsorships, or ending credits."
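For anyone wiring this up, a sketch of the plumbing around that prompt: numbering the sentences and pulling line numbers back out of the reply. The `N: sentence` layout and the tolerant parsing of replies such as `2-3, 7` are assumptions, not a description of the actual setup, and the model call itself is left out.

```python
import re

PROMPT = (
    "Below is the transcript of a podcast preceded by a line number. "
    "Reply with the line numbers that are likely to be from advertisements, "
    "promotions, commercials, sponsorships, or ending credits."
)

def build_prompt(sentences):
    """Prefix each sentence with its 1-based line number and prepend the instructions."""
    numbered = "\n".join(f"{i}: {s}" for i, s in enumerate(sentences, start=1))
    return f"{PROMPT}\n\n{numbered}"

def parse_line_numbers(reply, n_lines):
    """Extract valid line numbers from a free-form reply, expanding ranges like 3-5."""
    found = set()
    for lo, hi in re.findall(r"(\d+)(?:\s*-\s*(\d+))?", reply):
        lo = int(lo)
        hi = int(hi) if hi else lo
        # Clamp to the transcript so a stray number cannot blow up the range.
        found.update(range(max(lo, 1), min(hi, n_lines) + 1))
    return sorted(found)

sentences = [
    "Welcome to the show.",
    "This episode is sponsored by Acme.",
    "Use code POD for a discount.",
    "Now, on to the interview.",
]
print(build_prompt(sentences))
print(parse_line_numbers("2-3, 7", n_lines=len(sentences)))
```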
I think it's better than 60%, but I should definitely set up some evals.
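A bare-bones eval could be as simple as hand-labeling the ad lines for a few episodes and comparing them to what the model flags. The sketch below computes precision and recall per episode; the labeled data here is invented purely to show the shape.

```python
def precision_recall(predicted, actual):
    """Compare flagged line numbers against hand-labeled ad line numbers."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # With nothing predicted (or nothing to find) the ratio is undefined; treat it as perfect.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall

# Invented example: the model flagged lines 2, 3, 9; the real ad lines are 2 to 5.
precision, recall = precision_recall(predicted=[2, 3, 9], actual=[2, 3, 4, 5])
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Recall is the number to watch if the complaint is missed ads; precision tells you how much real content is being cut by mistake.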
I split the text by sentence. I was considering having the LLM group it into paragraphs first (that might conceptually chunk commercial sentences together), but what I've got has been good enough for me.
I wanted to switch to Flash 2.5, but it looks like they increased the price a lot.
I think I could do a fair bit of ad identification just with text heuristics: "This podcast is sponsored/supported by...", etc.
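Something like the following regex pass is what I have in mind. Only the "sponsored by" and "supported by" phrases come from the examples above; the other patterns are guesses at common ad wording and would need tuning against real transcripts.

```python
import re

# Phrases beyond "sponsored by" / "supported by" are assumptions, not a vetted list.
AD_PATTERNS = re.compile(
    r"\b(sponsored by|supported by|brought to you by|promo code|use code)\b",
    re.IGNORECASE,
)

def flag_ad_sentences(sentences):
    """Return 1-based line numbers of sentences that match an ad phrase."""
    return [i for i, s in enumerate(sentences, start=1) if AD_PATTERNS.search(s)]

sentences = [
    "Welcome to the show.",
    "This podcast is sponsored by Acme.",
    "Use code POD for 10% off.",
    "Now, back to the interview.",
]
print(flag_ad_sentences(sentences))
```

This only catches sentences that contain a trigger phrase, so it would work best as a cheap first pass that seeds the segments the LLM then extends.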