Copyright infringement is not avoided by changing some text so it isn’t an exact clone of the source.
Determining whether a work violates a copyright requires holistic consideration of the similarity of the work to the copyrighted material, the purpose of the work, and the work’s impact on the copyright holder.
There is no algorithm for this; cases are decided by people.
There are algorithms that could detect obvious violations of copyright, such as the one you suggest, which looks for exact matches to copyrighted material. However, there are many potential outputs, or patterns of output, that would constitute copyright infringement and yet would not be caught by such a trivial test.
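To make that concrete, here is a minimal sketch of the kind of exact-match test being described. The function name, the 50-character threshold, and the sample text are all illustrative assumptions on my part, not any real detection system; the point is only that a light paraphrase sails straight past it.

```python
# Minimal sketch of the "trivial test" described above: flag an output only if
# it reproduces a long verbatim run of the protected text. The function name
# and the 50-character threshold are illustrative assumptions, not a real API.

def contains_verbatim_copy(output: str, protected: str, length: int = 50) -> bool:
    """Return True if `output` contains any `length`-character run of `protected` verbatim."""
    snippets = {protected[i:i + length] for i in range(len(protected) - length + 1)}
    return any(output[i:i + length] in snippets
               for i in range(len(output) - length + 1))

if __name__ == "__main__":
    source = "It was the best of times, it was the worst of times, " * 3
    # Exact regurgitation is caught...
    print(contains_verbatim_copy(source[:80], source))       # True
    # ...but a light paraphrase evades the check entirely.
    paraphrase = source.replace("best", "finest").replace("worst", "bleakest")
    print(contains_verbatim_copy(paraphrase[:80], source))    # False
```

A model that swaps a few words, reorders sentences, or closely tracks the structure of a work can still infringe, and nothing like the check above would notice.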
I certainly don't think it's impossible, but I think it is a hard problem that won't be solved in the immediate future. Creators of the data used for training are right to seek to limit the wide availability of LLMs that regurgitate information they worked hard to obtain.
I think it will be a bit easier than you believe. The reason it hasn't been done yet is that there hasn't been a compelling economic incentive to do so.