I now understand what you were trying to say. I still think it is wrong to frame it as "memorization vs. generalization". In my opinion, the question is "why does it happen to generalize while it is in the process of memorizing?" As alluded to in another comment, my belief is that the dimensionality of neural networks, combined with certain optimization schemes, favors networks whose outputs aren't overly sensitive to changes in the data. That is basically the definition of generalization. It also explains why double descent occurs: first the network memorizes as best it can, since that is "easy"; then the optimization scheme starts to push it towards parameters that generalize better.
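
To make the double-descent point concrete, here is a minimal toy sketch (my own illustration, not anything from the original discussion): ridgeless regression on fixed random features, where the minimum-norm least-squares solution stands in for the "prefer parameters that aren't overly sensitive" bias I'm describing. In this classical setting, test error typically rises as the number of features approaches the number of training points and falls again past the interpolation threshold.

```python
# Toy double-descent sketch: minimum-norm regression on random features.
# All names and constants here are my own choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20
w_true = rng.normal(size=d)              # fixed linear "teacher"

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.5 * rng.normal(size=n)   # noisy targets
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

def random_features(X, W):
    return np.tanh(X @ W)                # fixed random nonlinear features

for n_feat in [10, 50, 90, 100, 110, 200, 500, 2000]:
    W = rng.normal(size=(d, n_feat)) / np.sqrt(d)
    Phi_tr, Phi_te = random_features(X_tr, W), random_features(X_te, W)
    # pinv gives the minimum-norm least-squares fit, even when the model
    # has enough capacity to interpolate (memorize) the training set.
    beta = np.linalg.pinv(Phi_tr) @ y_tr
    test_mse = np.mean((Phi_te @ beta - y_te) ** 2)
    print(f"{n_feat:5d} features  test MSE = {test_mse:.3f}")
```

Past the interpolation point, the pseudoinverse picks the smallest-norm solution among all that fit the training data exactly, which is the toy analogue of the implicit bias towards less sensitive, better-generalizing parameters that I'm gesturing at above.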