Free, yes, but they come with an energy cost attached (backup and redundant systems, etc.). I wonder how much energy it takes to store it like this compared to a local hard drive and a simple PC or Raspi.
Only if you don't mind them being recompressed and the metadata being deleted. Of course, if you don't mind that, Google Photos would surely be a better choice, since it's specifically designed for this purpose, has unlimited storage for lossy-compressed photos, and leaves photos under a certain size untouched.
I’ve been wondering how Creative Commons applies in ‘big data’-ish use cases. Can a dataset distributed under CC BY-SA be analyzed, possibly used as part of the training input for an ML model? What if a product is built on top of a model that learned from a CC-licensed dataset? Products are rarely distributed under CC; how far do ShareAlike & Attribution reach, by letter and by spirit?
Should there be (or does there already exist) a type of license for data, different from the ones typically used for software source code (MIT, GPL) and the ones typically used for creative work (CC), that encourages innovation but gives something back to the dataset creator or maintainer?
Those are reasonable questions. At work, we release lots of data under OGL (Open Government Licence), which is CC-compatible.
For my personal stuff, if you'd like a different licence, I'm happy for you to pay me for a more restrictive one. But if you build an ML model using my open data, I expect that model to be released under a similar licence.
Didn’t know about OGL; it does look suitable for this purpose.
To (partially) answer myself: contrary to what I implied, CC BY does cover this base if (for example) the creator of the dataset accepts a note in the product’s “About” documentation as sufficient attribution.
It looks like a great dataset for associating power generation with pictures of the sky.
Perhaps it could help decide the best location to place the solar panels? One big picture of the sky and you would get a power-generation estimate for each location based solely on the image.
Perhaps taking several large pictures over the year would help decide the best location on average. Or the location with the best worst-case scenario. Hmmm
I think a machine learning algorithm wouldn't care about that: with a large enough training data set, it would learn to account for those factors and be able to accurately predict energy output from the image alone.
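To make that concrete, here's a minimal sketch of what such a model could look like: a small convolutional regressor that maps one sky image straight to a single power-output number. Everything here (the input size, the layer widths, the choice of PyTorch) is an assumption for illustration, not anything taken from the dataset itself.

```python
# Minimal sketch (assumptions: 224x224 RGB input, target is one
# power-output value, e.g. in kW; PyTorch chosen only for illustration).
import torch
import torch.nn as nn

class SkyToPowerRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),      # collapse to one 64-dim vector per image
        )
        self.head = nn.Linear(64, 1)      # regress a single scalar: predicted output

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = SkyToPowerRegressor()
dummy_batch = torch.randn(8, 3, 224, 224)   # stand-in for 8 sky photos
print(model(dummy_batch).shape)             # torch.Size([8, 1])
```

The adaptive pooling keeps it independent of the exact input resolution; whether a model this simple could actually separate cloudiness from camera behaviour is a different question.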
Regardless of how big the dataset is, the image recognition algorithm is bound to get confused by the large differences in colour that exposure and sensitivity result in. It will likely look for the overall gray-to-blue gradient and estimate results from that; on the gray end of things alone, the camera settings make a very, very big difference. You can't really tell the algorithm to ignore these and only determine the level of 'cloudiness.'
Another issue with this dataset is the overlay changing over time in text content, font, and colour. The algorithm might overfit and conclude that, say, the presence of a yellow font means higher output, simply because output happened to be higher during the period that font was used. You could strip away the text, but then you're introducing potential errors into the dataset yourself.
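Stripping the overlay is mechanically simple; the risky part is deciding what to put in its place. A hypothetical sketch (the crop box coordinates below are invented; the real overlay position and size would have to be measured from the actual frames):

```python
# Hypothetical sketch: blank out the status-text overlay before training.
# The box coordinates are made up; measure the real overlay region from
# the actual images before using anything like this.
from PIL import Image, ImageDraw

def strip_overlay(path, box=(0, 0, 640, 40)):
    img = Image.open(path).convert("RGB")
    ImageDraw.Draw(img).rectangle(box, fill=(127, 127, 127))  # neutral gray patch
    return img
```

Filling with flat gray (or cropping the strip out entirely) removes the font/colour cue, but the patch itself is still a hand-introduced artefact present in every image.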
If each of the 1.2 million tweets includes a ~150 KB image, that’s 180 GB of images hosted on Twitter for free.
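Back-of-the-envelope check, using decimal units and assuming Twitter really keeps roughly 150 KB per image rather than recompressing further:

```python
tweets = 1_200_000
image_kb = 150                              # assumed average size per image
total_gb = tweets * image_kb / 1_000_000    # KB -> GB, decimal units
print(f"{total_gb:.0f} GB")                 # 180 GB
```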