GPT-3 is a very different model from GPT-3.5. My understanding is that they were comparing LLaMA's performance to benchmark scores published for the original GPT-3, which came out in 2020 and had not yet had instruction tuning, so was significantly harder to use.
GPT-3.5 refers to the instruction-tuned modern GPT models, such as text-davinci-002 and text-davinci-003.
GPT-3.5 Turbo is the ChatGPT model: it's cheaper (1/10th the price), faster, and has a bunch of extra RLHF training to make it work well as a safe and usable chatbot.
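The practical difference shows up in how you call them. Here's a minimal sketch, assuming the pre-1.0 openai Python SDK: text-davinci-003 goes through the completions endpoint with a raw prompt, while gpt-3.5-turbo uses the chat completions endpoint with role-tagged messages (the API key is a placeholder).

    import openai  # pre-1.0 openai SDK assumed

    openai.api_key = "sk-..."  # placeholder

    # text-davinci-003 (GPT-3.5): plain completions endpoint, raw prompt in, text out
    completion = openai.Completion.create(
        model="text-davinci-003",
        prompt="Explain instruction tuning in one sentence.",
        max_tokens=64,
    )
    print(completion["choices"][0]["text"])

    # gpt-3.5-turbo (the ChatGPT model): chat completions endpoint,
    # takes a list of role-tagged messages instead of a raw prompt
    chat = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Explain instruction tuning in one sentence."}],
    )
    print(chat["choices"][0]["message"]["content"])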
Or... could it be that the Chinchilla study has deficiencies in how it measures model capabilities? Either that or your explanation. Frankly, I don't think the 13B model is better than GPT-3 (text-davinci-001, which I think is not RLHF-trained, though it may be better than the base model).