> The question that I think reveals the kind of potential ML has for music is: how good is written language at describing music?
If you took the DAW project file for a song that was written and produced entirely digitally using virtual instruments, and compiled every parameter the user had to set to get their output.wav into a .csv file, you might be surprised by (1) how few parameters were used, (2) how often those parameters are left at their defaults, and (3) how many of those parameters would appear in any other project file.
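As a toy illustration of what that .csv might contain, here's a minimal sketch in Python. The plugin and parameter names are invented for the example, and real project files obviously differ by DAW.

```python
# Hypothetical dump of "every parameter the user touched" for one small project.
# Plugin and parameter names are made up for illustration only.
import csv
import io

rows = [
    # plugin,         parameter,    default,  value
    ("supersaw",      "cutoff",     0.50,     0.72),
    ("supersaw",      "resonance",  0.20,     0.20),   # untouched default
    ("plate_reverb",  "mix",        0.30,     0.18),
    ("limiter",       "ceiling_db", -0.30,    -0.30),  # untouched default
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["plugin", "parameter", "default", "value"])
writer.writerows(rows)
print(buf.getvalue())
```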
When you break it down, you really only have six layers to parse, all of which are dynamic but live in a relatively small and consistent sandbox, at least compared to image generation.
1. composition layer - the MIDI, or notes, of the song.
2. arrangement layer - the selection of instruments used in the song and the division of the song's MIDI among those instruments.
3. instrument layer - the parameters of each instrument, such as a synth patch or a virtual piano's room setting.
4. post-processing layer - the effects placed on the output of each instrument, such as reverb, compression, delay, etc.
5. mixing layer - the volume of each instrument + post-processing channel
6. mastering layer - processing on the master track
All of these things are more or less standardized. Developers always add their own flair (read: custom parameters) to their plugins, but those can be broken down into combinations of each layer's fundamental parameters. All of these parameters + the MIDI of a song would come to a few KB.
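To make the layer breakdown concrete, here's a rough sketch of the six layers as plain Python dataclasses, plus a back-of-the-envelope size check. Every class, field, and value here is a hypothetical illustration, not any DAW's actual format.

```python
# A minimal sketch of the six-layer breakdown; all names/fields are hypothetical.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Note:
    pitch: int            # MIDI note number, 0-127
    start: float          # position in beats
    length: float         # duration in beats
    velocity: int = 100

@dataclass
class Track:
    instrument: str                                                      # arrangement layer
    notes: list[Note] = field(default_factory=list)                      # composition layer
    instrument_params: dict[str, float] = field(default_factory=dict)    # instrument layer
    effects: list[dict] = field(default_factory=list)                    # post-processing layer
    gain_db: float = 0.0                                                 # mixing layer

@dataclass
class Project:
    tempo: float
    tracks: list[Track] = field(default_factory=list)
    master_effects: list[dict] = field(default_factory=list)             # mastering layer

# Rough size check: a small project serializes to a handful of kilobytes of JSON.
project = Project(
    tempo=120.0,
    tracks=[Track(
        instrument="supersaw_lead",
        notes=[Note(pitch=60 + i, start=float(i), length=1.0) for i in range(8)],
        instrument_params={"cutoff": 0.7, "resonance": 0.2},
        effects=[{"type": "reverb", "mix": 0.25}],
        gain_db=-6.0,
    )],
    master_effects=[{"type": "limiter", "ceiling_db": -0.3}],
)
print(len(json.dumps(asdict(project)).encode()), "bytes")
```

Even with a full arrangement's worth of notes, a structure like this stays in the low kilobytes, which is tiny next to the audio it describes.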
I feel like an LLM trained on these parameter sets, one that interacts with the software used to manipulate these layers, could produce amazing tools and open the door to writing high-quality songs for everyone, just as other AI products have opened so many similar doors.
The DALL-E app for music, in my mind, probably won't be a text description -> .wav output. Instead, it would generate the elements of each layer, with options that can be auditioned in real time using whatever VSTs were used in training. When you ask ChatGPT to write a complex Python script, it starts with an outline of all the methods as placeholders and then takes you step by step until you're done; then you troubleshoot it or flesh it out. The best part of a generative music tool like this is that it leaves the user with only one job: deciding whether something sounds good.
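The workflow I'm imagining looks roughly like this sketch: generate a few candidates per layer, render previews, let the user pick, move on. Here `generate_options` and `render_preview` are hypothetical stand-ins for the model and the VST host, not real APIs.

```python
# Hedged sketch of a "generate options per layer, audition, pick" loop.
LAYERS = ["composition", "arrangement", "instrument",
          "post_processing", "mixing", "mastering"]

def generate_options(layer: str, state: dict, n: int = 4) -> list[dict]:
    """Ask a model for n candidate settings for one layer (placeholder)."""
    return [{"layer": layer, "candidate": i} for i in range(n)]

def render_preview(state: dict, candidate: dict) -> bytes:
    """Render a short audio preview with the candidate applied (placeholder)."""
    return b""

def choose(previews: list[bytes]) -> int:
    """In a real tool the user listens and picks; here we just take the first."""
    return 0

state: dict = {}
for layer in LAYERS:
    candidates = generate_options(layer, state)
    previews = [render_preview(state, c) for c in candidates]
    state[layer] = candidates[choose(previews)]
```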
As a mostly musically illiterate producer myself, I've produced hundreds of songs and a few albums without ever really learning how to do anything other than manipulate the parameters. When I started learning to produce music I was 15 years old and knew nothing about music production. But what I was really good at was using computers and software, so I learned to play the DAW, the plugins, and the sample packs. The only layer I couldn't learn through learning software was the composition, the writing of the MIDI. Fortunately, the MIDI of a song becomes very easy to brute force over time, so I learned to brute force MIDI. Once I became efficient with my workflow, producing music became a task of "make this idea sound good." And without ever really feeling like I was a musician or composer, this became an enormous passion and outlet for me that I did every day for a decade.
I was able to do this because, at its core, all the mechanical parts of a song are simple machines, and a song's quality comes from the way those machines are used together. As an outsider, this feels like a workflow that would be very machine-learning friendly. But I could be wrong!