That's an interesting thought. The semantic tokens we get from Whisper serve a similar purpose – you can convert existing speech to different voices, though I have not tried it with accents yet.
There is still a lot to explore in this space – we certainly don't have all the answers yet!