To make a camera-based solution practical, it seems like one would need a specialized camera with modest resolution but a very high frame rate. Without a high enough frame rate, you can't capture high frequencies. Even if speech could be discerned from a spectrum that rolls off at 300 Hz, well under the peak energy of typical human speech, you would still need many hundreds of frames per second.
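For a rough sense of the numbers, here is a minimal sketch of the Nyquist arithmetic behind that claim. It assumes one sample per frame (a global-shutter reading of the camera); the function names are made up for illustration.

```python
def min_frame_rate(max_freq_hz: float) -> float:
    """Frame rate (fps) that must be exceeded to sample max_freq_hz without aliasing."""
    return 2.0 * max_freq_hz


def max_recoverable_freq(fps: float) -> float:
    """Highest frequency (Hz) representable with one sample per frame (the Nyquist limit)."""
    return fps / 2.0


print(min_frame_rate(300.0))       # 600.0
print(max_recoverable_freq(60.0))  # 30.0
```

So a 300 Hz rolloff already calls for more than 600 fps, and an ordinary 60 fps camera sampled once per frame tops out at 30 Hz.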
If you watch the video at the end, you'll see that they actually exploit the fact that CMOS pixels are read out sequentially, row by row, and use the data from each row to extract surprisingly decent high-frequency data from even 60 fps video!
The article says they were able to use the rolling shutter on standard cameras to extract data at a much higher frequency than the nominal frame rate would allow.
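To see why the rolling shutter helps, here is a back-of-the-envelope sketch that treats each sensor row as its own time sample. This is not the article's actual method, just the arithmetic. The row count and the `readout_fraction` parameter (the share of each frame period spent reading rows out) are hypothetical values chosen for illustration.

```python
def row_sample_rate(fps: float, rows: int, readout_fraction: float = 1.0) -> float:
    """Rows read per second while the sensor is actively reading out.

    readout_fraction is the assumed share of the frame period spent on readout;
    1.0 means rows are spread evenly over the whole frame period.
    """
    if not 0.0 < readout_fraction <= 1.0:
        raise ValueError("readout_fraction must be in (0, 1]")
    return fps * rows / readout_fraction


# Hypothetical 60 fps camera with 1080 rows, readout spanning the full frame period.
print(row_sample_rate(60.0, 1080))       # 64800.0
# Same camera, but readout squeezed into half of each frame period.
print(row_sample_rate(60.0, 1080, 0.5))  # 129600.0
```

Treat these as upper bounds on the sampling rate, not on usable audio bandwidth. Each row is a noisy measurement, and if readout takes only part of the frame period there are gaps between frames with no samples at all, so what you can actually recover is likely far less.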