Here is my rank conjecture. Part of the advantage of working in frequency space, as I suggest, is that the magnitude of the output is invariant under certain transformations: the transformation shows up only as a phase factor of the form e^(-i omega t0) multiplying each frequency component, where t0 is the size of the shift. In the 2D image case, the invariance is under translations along either axis. In the 1D time series case, it is under translations along the time axis. That's just a fancy way to say you get the same frequencies out no matter what time you pressed record. So I conjecture that the sorts of changes to the sound waveform which don't get in the way of our comprehension are exactly the invariances of some transform, similar to Fourier, which happens as a low-level processing step shortly after the ear. The extent to which a vowel tone differs from the native-speaker average manifests as (and is encoded in) a phase shift on the output of the transform, but the output magnitudes stay invariant and thus recognizable.
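To make the translation invariance concrete, here is a minimal NumPy sketch of the ordinary Fourier case only; it says nothing about whatever transform the ear might actually use. The test signal, the helper name `spectrum_after_shift`, and the use of a circular shift (`np.roll`) to stand in for "pressing record later" are all my own assumptions for illustration. With a circular shift the identity is exact; for a real recording with a moved window it would hold only approximately.

```python
import numpy as np

def spectrum_after_shift(signal, shift):
    """Hypothetical helper: DFT of `signal` and of the same signal
    circularly delayed by `shift` samples."""
    original = np.fft.fft(signal)
    shifted = np.fft.fft(np.roll(signal, shift))
    return original, shifted

# Assumed toy signal: two sinusoids sampled at n points.
n = 256
t = np.arange(n)
signal = np.sin(2 * np.pi * 5 * t / n) + 0.5 * np.sin(2 * np.pi * 12 * t / n + 0.3)

shift = 37  # arbitrary delay, in samples
original, shifted = spectrum_after_shift(signal, shift)

# The magnitudes are unchanged by the shift.
magnitudes_match = np.allclose(np.abs(original), np.abs(shifted))

# The shift appears only as a frequency-dependent phase factor
# exp(-2*pi*i*k*shift/n) on bin k (the discrete form of e^(-i omega t0)).
k = np.arange(n)
predicted = original * np.exp(-2j * np.pi * k * shift / n)
phase_matches = np.allclose(shifted, predicted)

print(magnitudes_match, phase_matches)
```

Both checks should come out true: the delay is fully recoverable from the phases, while the magnitude spectrum carries no trace of it. The conjecture above is that comprehension-preserving changes to speech behave the same way under some other, as yet unidentified, transform.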