One small additional requirement: although I studied linguistics and took a year-long course in English phonology, speech-to-text still struggles with my accent.
The approach I'm playing with atm is inspired by some advice from Simon Willis, here on HN:
record audio → transcribe using whisper → clean up and format using a GPT prompt
So far the results have been pretty good: the original meaning is preserved but the text is much easier to read (and the missing/"misheard" words are often corrected).
What I'm experimenting at the moment:
- picking the right model size, tweaking the prompts
However, this is conceptually interesting. It might be fun to speak the first draft of my next piece and transcribe the result with Whisper.