Alignment and other optimisations that increase size tend to show their greatest benefit in microbenchmarks; once the working set no longer fits in cache, the extra cache misses start to erode that performance gain.
That's highly dependent on access patterns and domain space.
When I was optimizing render sorting on mobile hardware, we knew we would never see 2k+ draw calls, and cache-aware tuning gave us huge end-user-facing benefits.
I've long had the idea for trying to write a Valgrind tool to help with this by analyzing struct usage. Something to profile how hot and cold the various fields of my structs are, and also to correlate which fields in a struct are frequently accessed together (i.e., within N cycles of each other). A tool for the profile part of "profile before optimizing" to go with the optimizations you mentioned.
I'm not sure how feasible this is. But if someone else wants to steal this idea and implement it for me, be my guest. :-)
The problem with this kind of instrumentation is that it is very expensive to collect, which affects the data collected in a way that may skew it from true runtime performance. Maybe that is still good enough! (It also feels difficult to implement.)
He shows the complete opposite: how to separate data so that fields won't land on the same cache line. That is of course nonsense for single-threaded access, but beneficial for concurrent access, where it avoids false sharing.