If you're generally interested in TTS, check out Bondsynth AI. It's a product I've been working on for long-form text-to-speech (think ebook to audiobook or website to audio). It's still in beta, so it's free, but I'm looking for feedback.
I have yet to make the transition into a paid role, but I quit my backend job to found a startup developing realistic text-to-speech for long-form content.
My approach has been to start at a high level, with a specific goal in mind, and to progressively go deeper and deeper. The specific goal part has been really helpful IMO. It prevents aimless shuffling about and gives you a good metric for whether you're making progress. When I started, I was basically just focusing on producing training data and treating the models, which were open-source on GitHub, as a black box. At this point I've made a lot of modifications to the actual model code itself and I'm learning a ton. There are of course a bunch of adjacent skills that are similar to traditional backend skills, but slightly different. Autoscaling is one example: there aren't many autoscaling solutions for GPU VMs yet. There are some startups working in this space, but IMO it's good to have a rock-solid hosting solution that you don't have to worry about too much.
I'm working on making natural-sounding text-to-speech widely available, specifically targeting normal people, not companies/developers. It's going to start with me picking the content, but I'd like to make it possible to have pretty much any text someone finds converted into a voice that is nice to listen to.
Have you checked out Descript Overdub? It actually uses your own voice (cloned) for the AI, and is pretty amazing. It's used by a lot of YouTubers: https://www.youtube.com/watch?v=sr-k1LWYsIg
Demo at: https://www.youtube.com/watch?v=OmQup3kst5s
Signup at: https://signup.bondsynth.ai