I think the problem here is the same one the other current music generation services have. Iteration is so important to creativity, and right now you can't really iterate properly. To get the right song you just spray and pray, generating until something sufficient turns up or you give up. I know you hint at this being a future direction of development, but in my opinion it's a key feature for taking these services beyond toys.

I think it's better to think of the process of finding the right song as a search through the space of all possible songs. The current approach amounts to "pick a random point in a general area". Once we find something that is roughly correct, we need a way to tweak the aspects that aren't quite right, shrinking the search space and letting us take smaller and smaller steps in defined directions.
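
To make that concrete, here's a toy sketch of the two phases. Everything in it is a stand-in (a made-up latent space, and a scoring function playing the part of my ear), not any real model:

    import numpy as np

    rng = np.random.default_rng(0)
    DIM = 128
    target = rng.normal(size=DIM)        # stand-in for "the song in my head"

    def score(latent):
        # Placeholder: in reality this is the user's ear judging the decoded song.
        return -np.linalg.norm(latent - target)

    # Phase 1: spray and pray, i.e. random points scattered over a general area.
    candidates = rng.normal(size=(8, DIM))
    best = max(candidates, key=score)

    # Phase 2: local search, taking smaller and smaller steps in defined directions.
    step = 0.5
    for _ in range(20):
        proposal = best + rng.normal(scale=step, size=DIM)
        if score(proposal) > score(best):
            best = proposal
        step *= 0.9                      # shrink the search radius each round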



Yep, I came to similar conclusions w/ text-to-audio models - in terms of creative work, the ability to iterate is really lacking in the current interfaces. We've stopped working on text-to-audio models and are instead targeting a lower level of abstraction by directly exposing an Ableton environment to LLM agents.

We just published a blog today discussing this - https://montyanderson.net/writing/synthesis


Our variations feature, coming very soon, is exactly this! Rhythm Control is an early version of it.


I'll keep an eye out for that! The variations feature in Suno is a good example of what not to do here, as it effectively just makes another random iteration using existing settings.

I think the other missing pieces are upscaling and stem splitting. While tools for splitting stems exist, my testing found that they didn't work well in practice (at least on Suno music), likely due to a combination of encoder-specific artifacts and the overall low sound quality. Existing upscaling approaches faced similar issues.

My naive guess is that these are things that will benefit from being closely intertwined with the generation process. E.g., when splitting stems, you could use the diffusion model(s) to help jointly converge the individual stems into reasonable standalone tracks.
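
One very rough way to picture that (a toy sketch with a stubbed per-stem denoiser, loosely in the spirit of diffusion-based source separation work, definitely not any shipping implementation): sample all the stems jointly and, after every denoising step, project them so they still sum to the original mix.

    import numpy as np

    rng = np.random.default_rng(0)

    def denoise_step(stems, t):
        # Stand-in for one reverse-diffusion step of a (hypothetical) per-stem
        # model. A real model would pull each stem toward a plausible standalone
        # track; this stub just shrinks the values so the loop runs end to end.
        return stems * 0.99

    mixture = rng.normal(size=4096)      # the finished track you want to split
    n_stems = 4                          # e.g. vocals / drums / bass / other

    stems = rng.normal(size=(n_stems, mixture.size))
    for t in range(200, 0, -1):
        stems = denoise_step(stems, t)
        residual = mixture - stems.sum(axis=0)
        stems += residual / n_stems      # keep sum(stems) == mixture at every step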

I'm excited about the potential of these tools. I've definitely found use cases in small independent game projects where paying for musicians is far out of budget and the style of music is not one I can execute on my own. But I'm not willing to sacrifice quality of results to do so.


Our variations feature will be nothing like Suno's (which just generates another song using the same prompt/lyrics). Since we use a diffusion model, we can actually restart the generation process from an early timestep (e.g., with a similar seed or even parts of the existing song) to get exactly what you're looking for.
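
For the curious, this is the standard partial-noising trick (img2img-style): noise the existing song part-way up the schedule, then denoise it back down under the new prompt. A stubbed-out sketch of the general idea, not our actual pipeline:

    import numpy as np

    rng = np.random.default_rng(0)

    def denoise_step(x, t, prompt):
        # Stand-in for one reverse step of a (hypothetical) latent music model;
        # a real model would predict and remove noise conditioned on the prompt.
        return x * 0.999

    original_latent = rng.normal(size=1024)   # the song the user already likes

    T = 1000
    t_restart = 400               # earlier restart keeps more of the global structure
    betas = np.linspace(1e-4, 0.02, T)
    alpha_bar = np.cumprod(1.0 - betas)

    # Forward-noise the existing song up to t_restart (standard DDPM forward process).
    noise = rng.normal(size=original_latent.shape)
    x = (np.sqrt(alpha_bar[t_restart]) * original_latent
         + np.sqrt(1.0 - alpha_bar[t_restart]) * noise)

    # Denoise back down under the new prompt; the lower t_restart, the closer
    # the variation stays to the original.
    for t in range(t_restart, 0, -1):
        x = denoise_step(x, t, prompt="same song, bigger chorus")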


> Our variations feature will be nothing like Suno's (which just generates another song using the same prompt/lyrics).

That's their "Remix" feature which just got renamed "Reuse prompt" or something.

Their extend feature generates a new song starting from an arbitrary timestamp, with a new prompt. It doesn't always work for drastic style changes and it can be a bit repetitive with some songs but it doesn't completely reroll the entire song.


I uploaded a bit of a song I recorded once (one I wrote, unpublished), and I'm trying to get it to riff on it, generate something close to it, etc.


More strength does what? More or less similar?


More strength = force rhythm more. If you crank it to max it will probably result in just a drum line, so I prefer 3-4.


Same with text models, for me. If I can't edit my query and the AI response to retry and keep the context in check, then I have trouble finding a use for it in creative work. I need to be able to directly influence the entire loop and, most importantly, keep the context for the next token prediction clean and short.


Letting you edit the response is quite easy to do, technically speaking. It's just not done in the default UI of most AI chatbots, unfortunately, so you'll need to look for alternative UIs.
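
The reason it's easy: chat-completion style APIs are stateless, so the client sends the whole message list every turn and is free to rewrite or trim earlier assistant replies before continuing. For example, with the OpenAI Python SDK (other providers look much the same):

    from openai import OpenAI

    client = OpenAI()
    history = [{"role": "user", "content": "Write a verse about rain."}]

    reply = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    history.append({"role": "assistant", "content": reply.choices[0].message.content})

    # Edit the model's own reply before continuing; the next call only ever sees
    # the edited, trimmed history, which keeps the context clean and short.
    history[-1]["content"] = "It drums on tin roofs, soft and slow."
    history.append({"role": "user", "content": "Good. Now a chorus in the same meter."})
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=history)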


I've noticed that the output tends to suffer when you pass in longer lyrics, too. Lots of my experiments start off fairly strong, but then it's like the model starts to forget, and the lyrics lose any rhythmic structure or just become incoherent.

At some point it's just not efficient to try to get the desired output purely through a prompt, and it would be helpful to download the output in a format you can plug into your DAW to tweak.


But that’s not a problem when listening to Spotify? Why can’t we treat these music generation engines the same way we treat music streaming services?


Idk what you're referring to specifically, but music discovery is terrible across Spotify, Apple Music, Google Music, Tidal, etc. I don't expect these services to read your mind, but they also don't ask for many parameters to help with the search. Definitely a huge opportunity here for innovative new services.


TikTok can tolerate a lot more active skipping than Spotify can before they annoy their users. We’d love to solve this. How would you? Maybe we could let users write why they didn’t like the song in natural language since we understand that now.
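
One cheap way to prototype that (a sketch, not what we run in production): embed the free-text complaint and down-rank candidate tracks whose descriptions sit close to it, using an off-the-shelf text embedder.

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    complaint = "too much autotune and the chorus repeated forever"
    candidates = [
        "heavily autotuned pop with a looping hook",
        "acoustic folk ballad, live vocals",
        "lo-fi instrumental hip hop",
    ]

    c_vec = model.encode(complaint, normalize_embeddings=True)
    d_vecs = model.encode(candidates, normalize_embeddings=True)

    # Ascending similarity to the complaint: the least "offending" tracks first.
    for desc, sim in sorted(zip(candidates, d_vecs @ c_vec), key=lambda p: p[1]):
        print(f"{sim:+.2f}  {desc}")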


Basically you need something like ComfyUI for music.

Variation in small details is fine, but you need control over larger scale structure.
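
Even something as blunt as pinning the large-scale structure down as data would go a long way. A purely hypothetical sketch of what a node-graph-ish spec could look like, with per-section seeds so you can re-roll one part without touching the rest:

    structure = [
        {"section": "intro",  "bars": 8,  "prompt": "sparse piano, 90 bpm", "seed": 11},
        {"section": "verse",  "bars": 16, "prompt": "add brushed drums",    "seed": 12},
        {"section": "chorus", "bars": 16, "prompt": "full band, big hook",  "seed": 13},
        {"section": "outro",  "bars": 8,  "prompt": "strip back to piano",  "seed": 11},
    ]

    def render_section(spec):
        # Stand-in for a model call; returns placeholder audio so the script runs.
        return b""

    song = b"".join(render_section(s) for s in structure)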



