That would have the same licensing problems that they have, though: that alpaca_data.json file was created using GPT-3. But creating a "clean" training set of 52,000 examples doesn't feel impossible to me for the right group.
You're only bound by OpenAI's terms of use if you agreed to them. If a third party obtained the data without entering into an agreement with OpenAI (e.g. by just downloading it from that repo), they are under no obligation to refrain from using it to compete with OpenAI. It is fair use by the same argument OpenAI itself uses to train its own models on publicly available data.