When you sing into Riffusion, it just puts time constraints on the lyrics. Example:
"I've never seen a diamond in the flesh" from Lorde, sung into Riffusion:
"I've:0.72-0.96 never:0.96-1.20 seen:1.20-1.48 a:1.48-1.64 diamond:1.64-2.32 in:2.32-2.76 the:2.76-3.02 flesh:3.02-3.86"
The timestamps are start and end times in seconds. The Google model takes the whole melody and phrasing as input as well.
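For what it's worth, here is a minimal Python sketch of reading that annotation format. It assumes tokens are whitespace-separated and shaped like word:start-end, as in the example above; the names `TimedWord` and `parse_timed_lyrics` are made up for illustration and are not part of any Riffusion API.

```python
import re
from typing import List, NamedTuple


class TimedWord(NamedTuple):
    word: str
    start: float  # seconds
    end: float    # seconds


# Assumed token shape: <word>:<start>-<end>, e.g. "diamond:1.64-2.32"
TOKEN = re.compile(
    r"^(?P<word>.+):(?P<start>\d+(?:\.\d+)?)-(?P<end>\d+(?:\.\d+)?)$"
)


def parse_timed_lyrics(line: str) -> List[TimedWord]:
    """Split a line of word:start-end tokens into TimedWord records."""
    words = []
    for token in line.split():
        match = TOKEN.match(token)
        if match is None:
            raise ValueError(f"unexpected token: {token!r}")
        words.append(
            TimedWord(match["word"], float(match["start"]), float(match["end"]))
        )
    return words


line = (
    "I've:0.72-0.96 never:0.96-1.20 seen:1.20-1.48 a:1.48-1.64 "
    "diamond:1.64-2.32 in:2.32-2.76 the:2.76-3.02 flesh:3.02-3.86"
)
words = parse_timed_lyrics(line)
for w in words:
    print(f"{w.start:5.2f} to {w.end:5.2f}  {w.word}")
```

Note that this only recovers the words and their time spans; nothing about pitch or phrasing is in the string, which is the point being made above.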
It seems most (or all?) AI music projects aim at producing audio?
Why not output MIDI instead, and let the artist manipulate that?
I for one would gladly buy a VST instrument that would generate a constant stream of MIDI ideas, chordings, voicings, variations, from a few seconds of singing or whistling or clapping.
It seems the technology is here to build it, yet (AFAIK) it doesn't exist. Why?
It's because the ultimate goal with these gen AI tools is to replace, not support, professionals. Your idea makes sense, is eminently doable, and would be awesome. The problem is, it still requires a human with expertise. Gen AI companies want to remove the human with expertise from the equation precisely so they can "empower" (take money from) the much wider population of humans without expertise. These humans, since they lack expertise, will actually need gen AI to produce anything at all. A professional could use it but could ultimately make do without it--they would not have a strong dependency on gen AI. It's essentially the same idea as addicts being lucrative prospects for drug dealers--once you establish dependency, you've got them.
Let's be frank, gen AI projects are ultimately about companies wanting to make money by putting small-scale professionals out of work for good. In other words, they want to capture a significant portion of all the capital that currently flows across small companies and projects (even if small-scale operations continue, a large percentage of their production will require funneling money to gen AI subscriptions instead of skilled human workers).
I tried riffusion and sang 10 seconds of "the house of the rising sun" [0]
It produced... this...
https://www.riffusion.com/riffs/d0464fb4-a53a-4ad8-a765-0a64...
which, by some measure, is impressive (perfect recognition of lyrics, for one), but is also... a musical abomination?
So IDK. Let's see where this goes.
[0] https://www.youtube.com/watch?v=4-43lLKaqBQ