To me the ability to reason is the biggest difference you can currently observe between 7B and 70B models.

People love to test this with "brain teasers". You could argue that LLMs merely pattern-match questions like "what's heavier? 1kg of feathers or 2kg of steel", but there are enough examples of puzzles that were novel at the time that I feel confident saying good LLMs can absolutely reason (at small scales, but it's getting better).