An algorithm that understands enough about a text to infer the correct emotional inflection to give the speech may edge into strong-AI territory, but I would have guessed it would be easier to build a neural network that, given a text spoken in one voice, could transform it into another voice with the correct stress, intonation, etc. Perhaps even that's a harder task than I assumed, although speech generation seems to receive a lot less academic and industrial attention than speech recognition and understanding.
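For concreteness, something along these lines is what I had in mind: a sequence-to-sequence model that maps one speaker's spectrogram frames to another's, so the source speaker's stress and intonation carry over. This is a toy sketch in PyTorch, not a real voice-conversion system; the layer sizes, the L1 loss, and the assumption of frame-aligned parallel recordings are all illustrative.

    import torch
    import torch.nn as nn

    class VoiceConverter(nn.Module):
        def __init__(self, n_mels=80, hidden=256):
            super().__init__()
            # Bidirectional encoder reads the source speaker's mel-spectrogram.
            self.encoder = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
            # Decoder predicts the target speaker's frames, one per input frame.
            self.decoder = nn.LSTM(2 * hidden, hidden, batch_first=True)
            self.proj = nn.Linear(hidden, n_mels)

        def forward(self, src_mels):          # (batch, frames, n_mels)
            enc, _ = self.encoder(src_mels)
            dec, _ = self.decoder(enc)
            return self.proj(dec)             # same length as the input

    # Toy training step on made-up "parallel" utterances.
    model = VoiceConverter()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    src = torch.randn(4, 200, 80)   # source speaker, 200 frames
    tgt = torch.randn(4, 200, 80)   # target speaker, pretend time-aligned
    loss = nn.functional.l1_loss(model(src), tgt)
    loss.backward()
    opt.step()

Getting hold of (or aligning) parallel data, and making the output sound natural rather than muffled, is of course where the real difficulty lives.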
Yeah, it's a bit more difficult of a task than you've assumed.
Speech synthesis receives a lot of attention, but it's hard, so you rarely hear any news about it. People are throwing DNNs at it at the moment, but nothing earth-shattering has come of it (yet). I have a couple of 'naturalness' filters that use DNNs, and about 30% of the time they drop all of their tones and I end up with an angry whisper as output. I don't work late too often.
For people interested in how hard it is, I recently read this [1] NYT article comparing the synthetic voices that IBM's experts tested for Watson in the Jeopardy competition.