I suspect that excerpt sounds so bad because the music has several instruments playing at once. One doesn't generally design a vocoder to deal with more than one voice. As that excerpt plays, you can hear that the most prominent instruments (e.g. the bass at several moments) sound pleasing, albeit speech-like.
It would probably be different from the original music, but pleasant, if one processed each track separately.