Actually, this won't work at all for music because it makes fundamental assumptions that the signal is speech. For normal conversations, it should work, though for now the models are not yet as robust as I'd like (in case of noise and reverberation). That's next on the list of things to improve.
Here we go! This is the first minute or so of Penny Lane by The Beatles converted down to a 10KB .bin and then back to a .wav: http://no.gd/pennylane.wav .. unsurprisingly the vocals remain recognizable, but the music barely at all.
Pretty much! It shows off how the codec works to a great extent though as it seems to be misinterpreting parts of the music to be the pitch of the speech, so Paul's voice sounds weird at the start of most lines but okay throughout the lines.
I've also run a BBC news report through the program with better results although it demonstrates that any background noise at all can throw things off significantly: https://twitter.com/peterc/status/1111736029558517760 .. so at this low bitrate, it really is only good for plain speech without any other noise.
Well, in the case of music, what happens is that due to the low bit-rate there are many different signals that can produce the same features. The LPCNet model is trained to reproduce whatever is the most likely to be a single person speaking. The more advanced the model, the more speech-like the music is likely to turn
When it comes to noisy speech, it should be possible to improve things by actually training on noisy speech (the current model is trained only on clean speech). Stay tuned :-)
Can you try it with Tom's Diner by Suzanne Vega? It's sung without any instruments, and an early version of MP3 reportedly was a disaster on that song.
At some point she sings "loose" instead of "nice" in the compressed version, and a bit later it also sounds like "lulk" instead of "milk". So it's a bit lossy even with respect to the lyrics!
Curious, it definitely works, but the domain is "weird" enough that certain firewalls or proxies may have trouble, perhaps. I've put it at https://gofile.io/?c=F5gle3 as an alternative.
I suspect the reason that excerpt sounds so bad is because the music has several instruments playing at once. One doesn't generally design a vocoder to deal with more than one voice. As that except plays, you can hear that the most prominent instruments (eg: the bass at several moments) sound pleasing, albeit speech-like.
It would probably different from the original music, but pleasant, if one processed each track separately.
Right. This form of compression assumes a primary single pitch, plus variations from that tone. You can hear it locking into different components of the song and losing almost everything else.
Heavy compression of voice is vulnerable to background noise.
I miss the classic telco 8K samples per second, 8 bits. We used to think that was crappy audio.
I tried it with music and the results were spooky. Very ethereal and ghostly. It was only with some classical music though, I might have to do a pop song next and share the results!