I really feel like the popularity of diffusion has made it far too shallow.
Why diffuse an entire track? We should be building these models to create music the same way that humans do, by diffusing samples, then having the model build the song using samples in a proper sequencer, diffuse vocals etc.
The problem with Suno etc. is that, as others have mentioned, you can't iterate or adjust anything. Saying "make the drums a little punchier and faster paced right after the chorus" is a really tough query to process if you've diffused the whole track rather than built it up.
Same thing with LLM story writing: the writing needs a good foundation. First generate information about the world and its history, then generate a story that takes that into account, vs a simple "write me a story about x".
I completely agree on the editing aspect. However, if you want to generate five stem tracks, then all five tracks must have the full bandwidth of your autoencoder. Accordingly, each inference or training step would take much more compute for the same result. That's why we'd prefer to do it all together and split after.
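The compute argument can be made concrete with back-of-envelope arithmetic. The sketch below uses made-up latent sizes (the channel count, frame rate, and step count are all illustrative, not taken from any real model); the point is only that if each stem occupies a full-bandwidth latent sequence, cost scales linearly with the number of stems:

```python
# Hypothetical latent-diffusion sizes; every constant here is an assumption
# for illustration, not a real model's configuration.
LATENT_CHANNELS = 64    # latent dim per frame (assumed)
FRAMES_PER_SEC = 50     # latent frame rate (assumed)
TRACK_SECONDS = 180     # a three-minute track
DIFFUSION_STEPS = 50    # denoising steps per generation (assumed)

def diffusion_cost(n_tracks: int) -> int:
    """Latent elements processed across all denoising steps.

    If each track needs the autoencoder's full bandwidth, the cost
    grows linearly with the number of tracks being diffused.
    """
    latent_size = LATENT_CHANNELS * FRAMES_PER_SEC * TRACK_SECONDS
    return n_tracks * latent_size * DIFFUSION_STEPS

mix_cost = diffusion_cost(1)    # diffuse the full mix once, split stems after
stem_cost = diffusion_cost(5)   # diffuse five independent full-bandwidth stems

print(stem_cost // mix_cost)    # → 5: five stems cost 5x the joint pass
```

Under these assumptions, generating stems separately is a straight 5x multiplier on both training and inference, which is the trade-off the comment is pointing at: editability per stem vs. paying full bandwidth per stem.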