What if even though it's a small portion of the training data, their content has an outsized influence on the output being generated? A random NYT article about Donald Trump and a random Wikipedia article about some obscure nematode might be around the same share of training data but if 10,000x more users are asking about DJT than the nematode, what is fair? Obviously they'll need to pay royalties on the usage! /s