It seems easy enough to run a more rigorous test. Just find a large set of novel text, then write a program that segments it by sentence, uppercases it, and removes spaces and punctuation.
Then run it through the GPT-4 API and compare the output to the original.
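A rough sketch of what that program could look like, in Python. Everything here is my own assumption rather than a tested setup: the sentence splitter is a naive regex, the function names (`split_sentences`, `scramble`, `similarity`, `evaluate`) are made up for illustration, and the model call is left as a `restore` function you would supply yourself (for example, a wrapper around the GPT-4 API), since I don't want to guess at the client details.

```python
import re
import string
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def split_sentences(text: str) -> List[str]:
    # Naive splitter (an assumption): break on whitespace that follows ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def scramble(sentence: str) -> str:
    # Uppercase the sentence, then drop whitespace and ASCII punctuation.
    drop = set(string.whitespace) | set(string.punctuation)
    return "".join(ch for ch in sentence.upper() if ch not in drop)


def similarity(original: str, restored: str) -> float:
    # Character-level similarity in [0, 1]; 1.0 means an exact match.
    return SequenceMatcher(None, original, restored).ratio()


def evaluate(
    text: str, restore: Callable[[str], str]
) -> List[Tuple[str, str, float]]:
    # `restore` is a hypothetical caller-supplied function that sends the
    # scrambled sentence to the model and returns its reconstruction.
    results = []
    for sentence in split_sentences(text):
        restored = restore(scramble(sentence))
        results.append((sentence, restored, similarity(sentence, restored)))
    return results
```

To use it, pass `evaluate` the raw text and a `restore` function that calls the model, then look at the distribution of scores rather than a handful of examples. Exact-match rate and average similarity would both be worth reporting, since a restoration can differ only in punctuation or casing.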