Yep. Your reply here encapsulates a lot of what I've been thinking about for the past few weeks. I'd love to open-source at least some of the data I collect, but the privacy/ethics issues have to be considered. And beyond sharing, there are legal/ethical issues around simply collecting the data in the first place, which come into play whenever other people are involved.
It might be best to do a short stint of 1 week to test the feasibility. That should give you a good basis for projecting how much data it will consume after a month, 3 months, and a year.
Yep. That's basically the approach I took with "phase 1" where the only data being ingested was gps / accelerometer data. I just let it run for a couple of weeks and then extrapolated out what the storage requirements would be for the future. Obviously audio and video are going to change the equation a lot, but the same principle is what I am planning to employ.
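The "run a trial, then extrapolate" approach is simple enough to sketch. This is a hypothetical example with made-up numbers (the 14-day / 2.1 GB figures are placeholders, not measurements from the project), just linear extrapolation of a trial's storage usage:

```python
# Hypothetical sketch: extrapolate long-term storage needs from a short
# trial run. The trial figures below are illustrative, not real data.

def project_storage(trial_bytes: int, trial_days: float) -> dict:
    """Linearly extrapolate a trial's storage usage to longer horizons."""
    bytes_per_day = trial_bytes / trial_days
    return {
        "1 month": bytes_per_day * 30,
        "3 months": bytes_per_day * 91,
        "1 year": bytes_per_day * 365,
    }

# e.g. a 14-day trial of GPS + accelerometer logs totalling 2.1 GB
projection = project_storage(trial_bytes=2_100_000_000, trial_days=14)
for horizon, size in projection.items():
    print(f"{horizon}: {size / 1e9:.1f} GB")
```

The obvious caveat is that the extrapolation is only as good as the trial being representative; adding audio/video changes the per-day rate entirely, so each new sensor really needs its own trial period.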
I imagine any intelligent system could work with reduced data quality/lossy data at least on the audio.
Yep, that's another area I've been thinking a lot about. The "instinct" is to capture everything at the highest possible resolution / sampling rate / etc. and store it in a totally lossless format. But that is also the most expensive scenario, and if it's not strictly required, then why do it? We know human hearing at least can work with relatively crappy audio. Look at the POTS phone system and its 8 kHz sampling rate, for example. Does that analogy hold for video? Good question.
As long as it's consistent in the type/amount of compression. So instead of WAV/FLAC/RAW, you could encode it to something like Opus at 100 kbps, and that would give you about 394.2 GB of data for a single year of audio.
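The 394.2 GB figure falls straight out of the bitrate math: bits per second, times seconds per year, divided by 8. A quick sketch (the 700 kbps "FLAC-ish" comparison rate is just an illustrative assumption for lossless audio):

```python
# Continuous-audio storage math: constant bitrate -> bytes per year.

def yearly_audio_gb(bitrate_kbps: float) -> float:
    """Decimal GB of audio captured per year at a constant bitrate."""
    seconds_per_year = 365 * 24 * 3600          # 31,536,000 s
    bytes_per_year = bitrate_kbps * 1000 / 8 * seconds_per_year
    return bytes_per_year / 1e9

print(f"Opus @ 100 kbps:     {yearly_audio_gb(100):.1f} GB/year")  # 394.2
print(f"lossless-ish @ 700 kbps: {yearly_audio_gb(700):.1f} GB/year")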
Agreed.
As for video... it would definitely require a lot of tricks to store on a hobbyist level.
Definitely. One thing that may help with costs in the short-term is that I'm very explicitly not (for now anyway) using a cloud storage service. Data ingestion is to a server I own and physically have in my home. I can get away with this because while the aggregate total amount of data may wind up fairly big over longer periods of time, the rate at which I need to ingest data isn't all that high (there's only one of these devices sending to the server). And I can just keep adding 5TB or 10TB drives as needed. When one fills up, I can unplug it, replace it with another, label and store it, and move on. The big risks here are that I don't really have any redundancy in that scenario, especially if my home burns down or something. But in that case I have bigger problems to worry about anyway!
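The swap-out-drives plan can be sanity-checked with the same kind of back-of-envelope arithmetic. The per-stream bitrates below are illustrative assumptions (100 kbps Opus audio plus a hypothetical 2 Mbps compressed video stream), not the project's actual numbers:

```python
# Rough sketch: how many swap-out drives a year of continuous capture
# fills, given a combined ingest bitrate. Bitrates are assumptions.

def drives_per_year(total_kbps: float, drive_tb: float = 5.0) -> float:
    """Number of drives of drive_tb decimal terabytes filled per year."""
    bytes_per_year = total_kbps * 1000 / 8 * 365 * 24 * 3600
    return bytes_per_year / (drive_tb * 1e12)

# e.g. 100 kbps audio + 2000 kbps video on 5 TB drives
print(f"{drives_per_year(100 + 2000):.2f} drives of 5 TB per year")
```

At bitrates in that ballpark, a handful of commodity drives per year is plausible, which supports the point that a single-device ingest rate is low enough to stay off cloud storage for now.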
There are other downsides to this approach, like dealing with the case of needing to access the entire year's worth of data "at once" for analysis or training, but I'm not sure that need will ever even arise.