If the workload touches only one or a few columns at a time, this isn't necessarily true. Prosumer CPU L3 caches are already up to 128MB and will be up to 1GB before long. If one or a few full columns fit entirely in cache but the whole dataset doesn't, a columnar layout may still be faster, even with random access within the column.
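To put rough numbers on that, here is a back-of-the-envelope sketch. The table shape (4 million rows, 16 columns, 8-byte values) and the 64-byte cache line are assumptions picked for illustration, not measurements. Only the 128MB L3 figure comes from the comment above, and it is treated as 128 MiB here.

```python
# Back-of-the-envelope working-set arithmetic. All sizes except the L3 figure
# are hypothetical, chosen only to illustrate the argument.
L3_BYTES = 128 * 1024**2   # 128 MiB L3 cache
CACHE_LINE = 64            # assumed cache line size in bytes
ROWS = 4_000_000           # assumed row count
COLUMNS = 16               # assumed number of columns
VALUE_BYTES = 8            # assumed fixed-width 8-byte values


def column_bytes(n_columns: int = 1) -> int:
    """Footprint of n_columns full columns stored contiguously (columnar layout)."""
    return ROWS * VALUE_BYTES * n_columns


def table_bytes() -> int:
    """Footprint of the whole table, regardless of layout."""
    return ROWS * COLUMNS * VALUE_BYTES


def row_layout_touched_bytes() -> int:
    """Cache-line bytes touched when reading ONE field from every row in a
    row-oriented layout. Assumes each row is at least one cache line wide and
    aligned so the field never straddles two lines."""
    assert COLUMNS * VALUE_BYTES >= CACHE_LINE
    return ROWS * CACHE_LINE


def fits_in_l3(nbytes: int) -> bool:
    return nbytes <= L3_BYTES


if __name__ == "__main__":
    for label, size in [
        ("one column (columnar)", column_bytes(1)),
        ("three columns (columnar)", column_bytes(3)),
        ("one field per row (row layout)", row_layout_touched_bytes()),
        ("whole table", table_bytes()),
    ]:
        print(f"{label}: {size / 1e6:.0f} MB, fits in L3: {fits_in_l3(size)}")
```

Under these assumptions, one to three columns sit comfortably inside the cache, so random access within them stays cache-resident, while the row layout drags in a full cache line per row and overflows it. Real behavior also depends on cache sharing between cores, associativity, and prefetching, none of which this models.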