As a computer vision guy I'm sad JEPA didn't end up more effective. It makes perfect sense conceptually and would have transferred easily to video, but other self-supervised methods just seem to beat it!
Needs a (2023) tag. But the release of ARC2 and image outputs from 4o definitely got me thinking about the JEPA family too.
I don't know if it's the right approach (and I'm sure JEPA has plenty of performance issues), but it seems valuable to have a fully latent representation, ideally shared across all modalities, so that rendering the concept "an apple a day keeps the doctor away" as image, audio, or text is a choice of decoder rather than something fixed by dedicated token ranges before the model's actual generation process even begins.
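Roughly, something like this toy sketch (hypothetical PyTorch, every module name and dimension made up): the concept is one latent vector, and which modality it "becomes" is decided only by which decoder head you call.

```python
# Hypothetical sketch, not JEPA itself: one shared latent space,
# with the output modality chosen only at decode time.
import torch
import torch.nn as nn

LATENT_DIM = 512

class SharedLatentModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Toy encoder: maps input features into the shared latent space.
        self.encoder = nn.Sequential(
            nn.Linear(768, LATENT_DIM), nn.GELU(),
            nn.Linear(LATENT_DIM, LATENT_DIM),
        )
        # Modality-specific decoders; the concept lives in the latent,
        # and turning it into text/image/audio is just a choice of head.
        self.decoders = nn.ModuleDict({
            "text":  nn.Linear(LATENT_DIM, 32_000),       # toy vocab logits
            "image": nn.Linear(LATENT_DIM, 3 * 64 * 64),  # toy pixel output
            "audio": nn.Linear(LATENT_DIM, 16_000),       # toy waveform chunk
        })

    def forward(self, x, modality: str):
        z = self.encoder(x)                 # concept as a latent vector
        return self.decoders[modality](z)   # decoder choice happens here, not earlier

model = SharedLatentModel()
concept = torch.randn(1, 768)  # stand-in for "an apple a day keeps the doctor away"
text_out = model(concept, "text")
image_out = model(concept, "image").reshape(1, 3, 64, 64)
```

The contrast is with pipelines where the output modality is baked in up front via dedicated token ranges, before any of the actual generation happens in latent space.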