> building LLMs is very costly, and that will probably remain so for quite some time.
Building LLMs is dropping in cost quickly. Back in mid-2023, training a 7B model had already dropped to around $30K [1], and it's even cheaper now.
> I think it is equally likely that LLMs will perform worse in the future, because of copyright reasons and increased privacy awareness, leading to less diverse data sets to train on.
I'll bet a lot of money this won't happen.
Firstly, copyright isn't settled on this. Secondly, people understand a lot more now about how to use less, higher-quality data and how to use synthetic data (e.g. the MS Phi series, the Persona dataset, and of course the upcoming OpenAI Strawberry and Orion models, which use synthetic data heavily). Thirdly, knowledge of how to use multi-modal data in your LLM is much more widely spread, which means that video and code can both be used to improve LLM performance.
[1] https://arize.com/resource/mosaicml/