The above is a common anthropocentric take that has been repeatedly disproven by the last decade of deep learning research: http://www.incompleteideas.net/IncIdeas/BitterLesson.html

Feeding a model inputs in the frequency domain isn't required for it to understand the frequency content of audio.

A large enough system with sufficient training data would definitely be able to come up with a Fourier transform (or something resembling one), if encoding it helped the loss go down.
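
To make that concrete, here's a toy sketch (my own illustration, not from any particular paper): a single linear layer with no Fourier machinery built in converges to the DFT matrix when predicting the spectrum of its input lowers the loss.

    # Toy illustration (assumptions mine): gradient descent on a plain
    # linear layer "discovers" the discrete Fourier transform when the
    # training target is the spectrum of the input signal.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 16                                    # signal length
    W = rng.normal(0, 0.01, (2 * n, n))       # weights: real & imaginary rows stacked
    lr = 0.05

    for step in range(5000):
        x = rng.normal(size=(64, n))          # batch of random signals
        y = np.fft.fft(x, axis=1)             # supervision from the true FFT
        target = np.concatenate([y.real, y.imag], axis=1)
        pred = x @ W.T
        W -= lr * 2 * (pred - target).T @ x / len(x)

    F = np.fft.fft(np.eye(n))                 # the actual DFT matrix
    learned = W[:n] + 1j * W[n:]
    print(np.abs(learned - F).max())          # ~0: the layer has recovered the DFT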

> In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.

Today’s transformer-based diffusion models learn representations straight from raw pixel patches, without even the concept of convolutions baked into the architecture.
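
For example (rough sketch, patch size and dimensions are arbitrary): a patch-based transformer pipeline just chops the raw pixel grid into patches and projects each one with an ordinary linear layer, so no convolution appears anywhere in the input path.

    # Rough sketch (my numbers): feeding raw pixels to a transformer is
    # just "flatten patches, apply a linear map" - no convolutions anywhere.
    import numpy as np

    def patchify(img, patch=8):
        # (H, W, C) raw pixels -> (num_patches, patch*patch*C) flat vectors
        h, w, c = img.shape
        img = img.reshape(h // patch, patch, w // patch, patch, c)
        return img.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

    rng = np.random.default_rng(0)
    img = rng.random((64, 64, 3))             # a raw 64x64 RGB image
    tokens = patchify(img)                    # (64, 192) patch vectors
    proj = rng.normal(size=(192, 384))        # learned embedding (random here)
    x = tokens @ proj                         # (64, 384): ready for a transformer
    print(x.shape)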

Ditto for language - as long as the architecture 1) can model long-range dependencies and 2) scales reasonably, it's irrelevant whether you pass in tokens, individual characters, or raw bytes. Character-based models perform just as well as (or better than) token/word-level models at a given parameter count and training corpus size - the main reason they aren't common (yet) is memory, not anything fundamental (rough sketch of the arithmetic below).
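
Back-of-the-envelope version of the memory point (my numbers, with whitespace splitting as a stand-in for a real tokenizer): the same text is several times longer as bytes than as tokens, and attention memory grows roughly quadratically with sequence length.

    # Rough illustration (whitespace split stands in for a real tokenizer):
    # byte-level input is ~5x longer than word/token-level input here, so a
    # quadratic attention layer pays roughly 24x more memory for the same text.
    text = "The quick brown fox jumps over the lazy dog."

    byte_len = len(text.encode("utf-8"))      # 44 positions at the byte level
    word_len = len(text.split())              # 9 positions at the "word" level

    print(byte_len, word_len)
    print(round((byte_len / word_len) ** 2))  # ~24x more attention memory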

For further reading, I’d recommend literature on transformer circuits for learning arithmetic without axioms: https://www.lesswrong.com/posts/CJsxd8ofLjGFxkmAP/explaining...
