While it's an impressive number of images today, I believe it will be an underwhelming amount compared to what models are trained on in the future.
This is an incomplete analogy, but from the time a baby is born, that baby will have seen roughly 1,892,160,000 frames of data per eye, or 3,784,320,000 frames across both eyes, in its first year (assuming something like 60 frames per second). That baby still knows practically nothing about the world.
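For what it's worth, here's the arithmetic behind those numbers; the 60 fps figure is just an assumption for the sake of the analogy (and a debatable one, see below):

    SECONDS_PER_YEAR = 60 * 60 * 24 * 365   # 31,536,000 seconds
    ASSUMED_FPS = 60                         # assumed "frame rate" of the eye; debatable

    frames_per_eye = ASSUMED_FPS * SECONDS_PER_YEAR   # 1,892,160,000
    frames_both_eyes = 2 * frames_per_eye              # 3,784,320,000
    print(frames_per_eye, frames_both_eyes)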
You are correct. DeepMind released a paper earlier this year showing that data, not model size, is the primary constraint holding back these models (i.e. a model with 5 billion parameters is not much better than one with 1 billion, but more data can make both much better) [0].
I will copy paste the main findings from the article here:
- Data, not size, is the currently active constraint on language modeling performance. Current returns to additional data are immense, and current returns to additional model size are miniscule; indeed, most recent landmark models are wastefully big.
- If we can leverage enough data, there is no reason to train ~500B param models, much less 1T or larger models.
- If we have to train models at these large sizes, it will mean we have encountered a barrier to exploitation of data scaling, which would be a great loss relative to what would otherwise be possible.
- The literature is extremely unclear on how much text data is actually available for training. We may be "running out" of general-domain data, but the literature is too vague to know one way or the other.
- The entire available quantity of data in highly specialized domains like code is woefully tiny, compared to the gains that would be possible if much more such data were available.
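To get a feel for the first point, you can plug numbers into the parametric loss fit from DeepMind's Chinchilla paper (Hoffmann et al. 2022), which the linked post builds on. The constants below are the published fit; treat this as a rough illustration, not a statement about any specific model:

    # Chinchilla-style parametric loss: L(N, D) = E + A / N**alpha + B / D**beta
    # N = parameters, D = training tokens. Constants are the fit reported by
    # Hoffmann et al. (2022); purely illustrative.
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(n_params, n_tokens):
        return E + A / n_params**alpha + B / n_tokens**beta

    # Starting from a ~70B-param model trained on ~300B tokens (Gopher-ish):
    print(loss(70e9, 300e9))     # baseline,            ~2.02
    print(loss(350e9, 300e9))    # 5x the parameters,   ~1.99
    print(loss(70e9, 1.5e12))    # 5x the tokens,       ~1.93

At that scale, the data term dominates: multiplying the tokens helps more than multiplying the parameters, which is the post's point.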
I'm not an ML engineer (anymore) so I don't know the particulars, but I'd say that while the amount of data matters, it's still better to have high-quality data than not to have it.
The assumption that human eyes can be measured in FPS is, in itself, very questionable. And if it were indeed the case, then it would surely be far in excess of 60 fps…
Well, inhibitory alpha waves cycle across the visual field 10 times a second. People with faster alpha waves can detect two flashes that people with slower alpha waves see as one flash.
The assumption that human eyes can be measured in FPS is, in itself, very questionable.
In the strictest sense, yes. But it seems quite reasonable to think that there is something like an "FPS equivalent" for the human eye. I mean, it's not magic, and physics comes into play at some level. There's a shortest unit of time / amount of change that the eye can resolve. From that you could work out something that is analogous to a frame-rate.
And if it were indeed the case, then it would surely be far in excess of 60 fps
Not necessarily. Quite a few people believe that the human eye "FPS equivalent" is somewhere between 30-60 FPS. That's by no means universally accepted, and since it's just an analogy to begin with, the whole thing is admittedly a little bit dodgy. But by the same token, it's not immediately obvious that the human "FPS equivalent" would be "far in excess of 60 FPS" either.
It would be nice to have a dataset of a couple "raising" a video recorder for a year as if it were a baby: a continuous stream of data.
The project I'm working on right now is to build a sort of "body" for a (non ambulatory, totally non anthropomorphic) "baby AI" that senses the world using cameras, microphones, accelerometer/magnetometer/gyroscope sensor, temperature sensors, gps, etc. The idea is exactly to carry it around with me and "raise" it for long periods of time (a year? Sure, absolutely, in principle. But see below) and explore some ideas about how learning works in that regime.
The biggest (well, one of the biggest) challenge(s) is going to be data storage. Once I start storing audio and video the storage space required is going to ramp up quickly, and since I'm paying for this out of my own pocket I'm going to be limited in terms of how much data I can keep around. Will I be able to keep a whole year? Don't know yet.
There's also some legal and ethical stuff to work out, around times when I take the thing out in public and am therefore recording audio and video of other people.
Glad to hear you are working on such a project. There will definitely be a lot of privacy concerns in any such project, so it may be difficult to open-source the data to the broad public.
But it could still be useful to research institutes that follow privacy guidelines.
It might be best to do a short stint of 1 week to test the feasibility. That should give you a good estimate on future projections of how much data it will consume after a month, 3 months, and a year.
I imagine any intelligent system could work with reduced data quality/lossy data at least on the audio.
As long as it's consistent in the type/amount of compression. So instead of WAV/FLAC/RAW, you could encode it to something like Opus at 100 kbps, which would give you about 394.2 gigabytes of data for a single year of audio.
As for video... it would definitely require a lot of tricks to store on a hobbyist level.
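For concreteness, here's the back-of-the-envelope storage math; the Opus number matches the 394.2 GB figure above, while the video bitrate is just a placeholder guess, not a recommendation:

    SECONDS_PER_YEAR = 60 * 60 * 24 * 365            # 31,536,000 s

    def gb_per_year(bitrate_kbps):
        """Storage for one year of continuous recording at a given bitrate."""
        return bitrate_kbps * 1000 / 8 * SECONDS_PER_YEAR / 1e9   # decimal GB

    print(gb_per_year(100))     # Opus audio at 100 kbps     -> ~394.2 GB/year
    print(gb_per_year(2000))    # video at an assumed 2 Mbps -> ~7,884 GB (~7.9 TB)/year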
Yep. Your reply here encapsulates a lot of what I've been thinking about for the past few weeks. I'd love to open-source at least some of the data I collect, but the privacy/ethics issues have to be considered. And as far as that goes, there are legal/ethical issues around simply collecting the data, even if I don't share it, that come into play when other people are involved.
It might be best to do a short stint of 1 week to test the feasibility. That should give you a good estimate on future projections of how much data it will consume after a month, 3 months, and a year.
Yep. That's basically the approach I took with "phase 1", where the only data being ingested was GPS/accelerometer data. I just let it run for a couple of weeks and then extrapolated what the storage requirements would be in the future. Obviously audio and video are going to change the equation a lot, but I'm planning to apply the same principle.
I imagine any intelligent system could work with reduced data quality/lossy data at least on the audio.
Yep, that's another area I've been thinking a lot about. The "instinct" is to capture everything at the highest possible resolution / sampling rate / etc. and store it in a totally lossless format. But that is also the most expensive scenario, and if it's not strictly required, then why do it? We know human hearing at least can work with relatively crappy audio; look at the POTS phone system with its 8 kHz sampling rate, for example. Does that analogy hold for video? Good question.
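If telephone-grade audio does turn out to be enough, one option would be to downsample before storing. A minimal sketch with SciPy, assuming 48 kHz mono WAV input (the file names and rates here are made up for illustration):

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import resample_poly

    # Hypothetical 48 kHz mono capture from the device's microphone.
    rate_in, audio = wavfile.read("capture_48k.wav")
    assert rate_in == 48000

    # Downsample 48 kHz -> 8 kHz (POTS-like sampling rate): a factor of 6.
    audio_8k = resample_poly(audio.astype(np.float32), up=1, down=6)

    wavfile.write("capture_8k.wav", 8000, audio_8k.astype(np.int16))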
As long as it's consistent in the type/amount of compression. So instead of WAV/FLAC/RAW, you could encode it to something like Opus at 100 kbps, which would give you about 394.2 gigabytes of data for a single year of audio.
Agreed.
As for video... it would definitely require a lot of tricks to store on a hobbyist level.
Definitely. One thing that may help with costs in the short-term is that I'm very explicitly not (for now anyway) using a cloud storage service. Data ingestion is to a server I own and physically have in my home. I can get away with this because while the aggregate total amount of data may wind up fairly big over longer periods of time, the rate at which I need to ingest data isn't all that high (there's only one of these devices sending to the server). And I can just keep adding 5TB or 10TB drives as needed. When one fills up, I can unplug it, replace it with another, label and store it, and move on. The big risks here are that I don't really have any redundancy in that scenario, especially if my home burns down or something. But in that case I have bigger problems to worry about anyway!
There are other downsides to this approach, like dealing with the case of needing to access the entire year's worth of data "at once" for analysis or training, but I'm not sure that need will ever even arise.
The upside is that babies get to interact with the environment they're training on. Image models can't move the camera a few cm to the right if they're interested in the perspective of a particular scene.
Not absolutely nothing: the neural net is initialized with some weights encoding basic things (breathing, sucking, crying, etc.). A newborn horse walks and follows its mother within the first 5-10 minutes.
On the surface, that sounds like a reasonable position to take. ("Cowley proposes an alternative: that language acquisition involves culturally determined language skills, apprehended by a biologically determined faculty that responds to them. In other words, he proposes that each extreme is right in what it affirms, but wrong in what it denies. Both cultural diversity of language, and a learning instinct, can be affirmed; neither need be denied.")
GPT's ability to fool intelligent people into thinking that it is "intelligent" itself seems like a powerful argument that language, more than anything else, is what makes humans capable of higher thought. Language is all GPT has. (Well, that and a huge-ass cultural database.)
Intelligence is one of those areas in which, once you fake it well enough, you've effectively made it. Another 10x will be enough to tie the game against an average human player.
There's a really easy, yet unconscionably horrible experiment we could perform to test the assumption that we're preprogrammed with any sort of knowledge.
Take a baby and stick it in a room. Let it grow up with absolutely no stimulation whatsoever. It is given food and that's about it. What do you think it can demonstrate knowledge of by the time it reaches 5? 10? 15?
All behavior is learned behavior. People talk about sucking and breathing and walking horses and what not, but babies do have to learn how to latch and how to feed. Now, they can work it out themselves. But quick acquisition of a skill does not mean the skill already existed.
Not to mention it's a far cry from sucking to language. Or knowing what a person is. Or who a person is.
This is an incomplete analogy, but from the time a baby is born, that baby will have seen roughly 1,892,160,000 frames of data per eye, or 3,784,320,000 frames across both eyes, in its first year (assuming something like 60 frames per second). That baby still knows practically nothing about the world.