"FAQ: The download buttons do not work".
"The website is fully supported by Chrome now".
The current state of the internet has reached a very low point. One would expect more from berkeley.edu.
EDIT: I find the downvotes preposterous. Are we really supposed to accept that a proprietary browser is now required simply to download a file from an educational institution?
This appears to be a site the team put together as a public front for the dataset, not an official Berkeley page. It certainly doesn't meet the standards required of a public institution, like ADA compliance, either.
Broken downloads, yes. Corrupted downloads, no. Given that files on CDNs are still usually served without HTTPS, there aren't many checksums between the two ends of the pipe to protect against on-the-wire corruption. That doesn't matter much for video streaming à la Netflix; it matters a lot for a structured dataset.
BitTorrent and related protocols handle this automatically by breaking the file into large (megabyte-range) chunks, and then putting the cryptographic hashes of all the chunks in the manifest. As long as you've received the manifest, you can protect against both passive corruption and active MITMing in the same way you resume broken downloads: by just discarding chunks that failed to complete to a state of "has all the bytes and hashes correctly", and trying those chunks again.
(Sadly, HTTP doesn't support a digest response header that applies to each chunk of a "Transfer-Encoding: chunked" response stream, or it could vaguely compete with this. The Content-MD5 header could have done this, but it was removed precisely because implementations were in conflict on whether it was for this, or for hashing the document as a whole.)
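To make that concrete, here is a minimal sketch of manifest-based chunk verification, assuming a made-up manifest format (a list of SHA-256 hex digests) and a fixed 1 MiB chunk size; real .torrent files use bencoding and SHA-1 piece hashes:

    import hashlib

    CHUNK_SIZE = 1 << 20  # 1 MiB; illustrative, not BitTorrent's actual piece size

    def build_manifest(path):
        """Hash every fixed-size chunk; the list of digests is the manifest."""
        hashes = []
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                hashes.append(hashlib.sha256(chunk).hexdigest())
        return hashes

    def bad_chunks(path, manifest):
        """Return indices of chunks that are missing, short, or fail their
        hash: exactly the chunks a resuming downloader would re-request."""
        bad = []
        with open(path, "rb") as f:
            for i, expected in enumerate(manifest):
                chunk = f.read(CHUNK_SIZE)
                if hashlib.sha256(chunk).hexdigest() != expected:
                    bad.append(i)
        return bad

A downloader that trusts the manifest (fetched over HTTPS, or out of band) only re-requests the indices returned by bad_chunks, which covers both resuming broken downloads and recovering from on-the-wire corruption in one mechanism.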
In case you don't want to register and are curious about some metadata:
Videos: 100K video clips. Size: 1.8TB
Info: the GPS/IMU information recorded along with the videos. Size: 3.9GB
Images: two subfolders, 1) 100K labeled key frame images extracted from the videos at the 10th second, and 2) 10K key frames for full-frame semantic segmentation. Size: 6.5GB
Labels: annotations of road objects, lanes, and drivable areas in JSON format (details at the GitHub repo). Size: 147MB
Drivable Maps: segmentation maps of drivable areas. Size: 661MB
Segmentation: full-frame semantic segmentation maps; the corresponding images are in the same folder. Size: 1.2GB
Permission to use, copy, modify, and distribute this software and its documentation for educational, research, and not-for-profit purposes, without fee and without a signed licensing agreement; and permission to use, copy, modify and distribute this software for commercial purposes (such rights not subject to transfer) to BDD member and its affiliates, is hereby granted, provided that the above copyright notice, this paragraph and the following two paragraphs appear in all copies, modifications, and distributions. Contact The Office of Technology Licensing, UC Berkeley, 2150 Shattuck Avenue, Suite 510, Berkeley, CA 94720-1620, (510) 643-7201, otl@berkeley.edu, http://ipira.berkeley.edu/industry-info for commercial licensing opportunities.
IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF REGENTS HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED "AS IS". REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
If I'm reading that right, it's not an open source license. It has a field-of-use restriction, because only Berkeley DeepDrive members can use it for commercial purposes.
EDIT: The title of this HN topic is wrong. It's not what's in the source and it needs to be changed. (I'm relieved that it's just a submitter summarizing incorrectly and that Berkeley DeepDrive was not responsible for this mistake.)
So you're saying neither (say) VirtualBox nor other GPL software qualifies as open-source? They seem to fail the very first sentence of criterion #1, since they place restrictions on when/how you can redistribute the source/software in aggregation with other sources/software.
The GPL absolutely falls under the open source definition. You can ship GPL programs and code alongside programs that may have different licenses. What you can't do is combine GPL code with code that has incompatible licenses.
Copyleft imposes some requirements on redistribution. It does not impose restrictions on usage at all.
> Copyleft imposes some requirements on redistribution. It does not impose restrictions on usage at all.
I wasn't saying copyleft imposes restrictions on usage.
The first "open-source" criterion says the following (and note that, like you said, this is a restriction on redistribution and not usage):
> The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources.
We both agree GPL places a restriction on redistribution (namely: that it must be with source code). However, criterion #1 says very clearly that the license can't place restrictions on software redistribution when it's aggregated with software from different sources.
This is a pretty clear contradiction to me. The fact that you cannot redistribute GPL software without source (whether bundled with other software or otherwise) is a restriction on whether/how you can redistribute GPL software, hence it goes against the "shall not restrict" requirement. And there's no exception carved out for "restrictions that require source code to be included". So I don't see how we get to ignore this and cherry-pick what restrictions actually fall under "restrictions"...
GPL does not say that. What it does say is that you must provide the source upon request.
> The fact that you cannot redistribute GPL software without source (whether bundled with other software or otherwise) is a restriction on whether/how you can redistribute GPL software, hence it goes against the "shall not restrict" requirement. And there's no exception carved out for "restrictions that require source code to be included".
So all of what you said there is simply incorrect, because like I said, you absolutely can distribute GPL software without including the source code alongside it. And that is what is done by everyone 99% of the time.
You only need to provide the source upon request to the people that ask you for it.
I encourage you to take the time to read the GPL FAQ. Even though GPL is not my preferred license I think it is important to have a good understanding of it. https://www.gnu.org/licenses/gpl-faq.en.html
> So all of what you said there is simply incorrect, because like I said, you absolutely can distribute GPL software without including the source code alongside it. And that is what is done by everyone 99% of the time. You only need to provide the source upon request to the people that ask you for it.
No, it makes no difference at all. You cannot redistribute the software unless you are willing and able to redistribute the source code as well. That is very clearly a restriction on your redistribution of the software. The fact that we happen to be talking about the software's own source code makes no difference as to whether it's a restriction or not. It'd be a restriction whether we're talking about "source code", or "$100,000", or anything else. The simple fact that you have to be willing and able to provide {something} before you can redistribute the software is obviously a restriction on your redistribution of the software.
"Why Open Source misses the point of Free Software" by Richard Stallman
> When we call software “free,” we mean that it respects the users' essential freedoms: the freedom to run it, to study and change it, and to redistribute copies with or without changes. This is a matter of freedom, not price, so think of “free speech,” not “free beer.”
There's a lot of interesting history between the Open Source and Free Software movements, and this is a valuable view of one side of the story. I'd recommend it for people who are interested, though I hope they would also seek out the other side.
However, we're going off topic here. The self-driving download is neither Open Source nor Free Software.
The license seems a bit odd. It refers to software, but I can't see any software in any of the downloads - just data. Clearly the license is intended to cover the contents of the downloads, but the wording seems wrong then.
I'm not a lawyer, so maybe someone with more expertise could chime in...
I am curious: does training the AI on other driving datasets help? What I mean is, not just sedan datasets, but trucks, buses, maybe two-wheelers? Would this help the model generalize more and make better predictions of how other vehicles behave, or would it just add noise?
It helps a lot in my experience. In simulation I tried this with imitation learning, training on a hood camera, a camera at the height of a semi-truck hood, and another camera offset 1.5m to the left, with steering and throttle as labels. I also added random noise to the position (less than a meter), rotation (less than a degree), fov (less than a degree), capture height (< 1%), and capture width (< 1%). The result was a 3x higher average score on a driving benchmark, where the score was meters driven minus seconds taken, second-meters of lane deviation, and seconds where acceleration surpassed 0.5g (to measure comfort). The dataset, training code, and sim are at deepdrive.io - a different entity with the same name :)
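Roughly, the perturbation looks like this (a sketch only; the CameraRig fields and jitter ranges just mirror the numbers above, and this is not the actual deepdrive.io training code):

    import random
    from dataclasses import dataclass, replace

    @dataclass
    class CameraRig:
        offset_x: float   # lateral offset in meters (e.g. -1.5 for the left camera)
        height_m: float   # mount height in meters (sedan hood vs. semi-truck hood)
        yaw_deg: float    # rotation in degrees
        fov_deg: float    # field of view in degrees
        width_px: int     # capture width in pixels
        height_px: int    # capture height in pixels

    def jitter(rig: CameraRig) -> CameraRig:
        """Randomly perturb a rig within the ranges described above."""
        return replace(
            rig,
            offset_x=rig.offset_x + random.uniform(-1.0, 1.0),          # < 1 m position noise
            yaw_deg=rig.yaw_deg + random.uniform(-1.0, 1.0),            # < 1 degree
            fov_deg=rig.fov_deg + random.uniform(-1.0, 1.0),            # < 1 degree
            width_px=int(rig.width_px * random.uniform(0.99, 1.01)),    # < 1%
            height_px=int(rig.height_px * random.uniform(0.99, 1.01)),  # < 1%
        )

Sampling a fresh jittered rig per episode is what keeps the policy from overfitting to one exact camera placement.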
This made me wonder: how will driverless cars interpret roads that have very faded, non-existent, or unusual lane markings? I imagine rural roads, parks, construction zones, and weather like snow or dust can really obscure things.
First, thanks for sharing this data.
Second - why on earth would anyone create a 1.8TB zip file of 100k videos? The video encoder likely already compressed every possible bit out of these videos; zip is not going to make them smaller. It does, however, make it mandatory for everyone to download the full 1.8TB file even to get a single video out of this archive. Makes me wonder what else is happening here (like the Chrome-only download link hosted on another domain, the non-HTTPS login, and that escalator to nowhere...)
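That said, because a zip's central directory sits at the end of the archive, a server that honors HTTP Range requests would in principle let you pull out a single clip without the full 1.8TB download. A rough sketch (the URL and member name are hypothetical, and a real version would buffer and cache reads instead of issuing one request per read):

    import io
    import zipfile
    import requests

    class HttpRangeFile(io.RawIOBase):
        """Minimal seekable file-like object backed by HTTP Range requests.
        Assumes the server reports Content-Length and honors Range headers."""
        def __init__(self, url):
            self.url = url
            head = requests.head(url, allow_redirects=True)
            self.length = int(head.headers["Content-Length"])
            self.pos = 0

        def seekable(self):
            return True

        def readable(self):
            return True

        def tell(self):
            return self.pos

        def seek(self, offset, whence=io.SEEK_SET):
            if whence == io.SEEK_SET:
                self.pos = offset
            elif whence == io.SEEK_CUR:
                self.pos += offset
            else:  # io.SEEK_END
                self.pos = self.length + offset
            return self.pos

        def read(self, size=-1):
            if size is None or size < 0:
                size = self.length - self.pos
            if size <= 0 or self.pos >= self.length:
                return b""
            end = min(self.pos + size, self.length) - 1
            resp = requests.get(self.url, headers={"Range": f"bytes={self.pos}-{end}"})
            self.pos += len(resp.content)
            return resp.content

    # Hypothetical usage: listing members touches only the central directory
    # (a few small range requests), not the whole 1.8TB.
    # zf = zipfile.ZipFile(HttpRangeFile("https://example.org/bdd100k_videos.zip"))
    # print(zf.namelist()[:5])
    # zf.extract("videos/train/00001.mov")

Whether this CDN actually accepts Range headers is another question; if it doesn't, you really are stuck downloading the whole archive.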
Tangent - does anyone know if Waymo uses user-generated driving data from Navigation mode on Google Maps for their self-driving work? Or whether that would even be feasible or useful?
GPS + IMU data from phones might be useful but neither is particularly accurate compared to video data. Maybe the IMU data would be useful to make the car feel more "human"
I'm confused... are the datasets free to use (for both for-profit and nonprofit purposes), while the pretrained models (mentioned in the paper) are under the UC license?
Presumably lots of successful driving video is at least a prerequisite to validating an automatic driving system, even if you're not using it for training?
Before I register to download the data, is there a smaller dataset to play with on the portal? I've been itching to do something fun after taking my SDC course from Udacity, but 1.8TB is way more than I can handle right now. Can someone upload a portion of this (<10GB)?
I already emailed the creator a couple weeks back to request / offer a torrent, but haven't heard anything back.
The problem here is that both of your suggestions involve a 2-step process:
1. Download the file
2. Create a torrent from it, or upload it to IPFS
Since step 1 is already a 2TB download, getting to either version of step 2 is untenable. I agree with one of the other posters in this thread: the default for something like this should be a torrent, since you get both distribution and checksumming for free.
It would also be nice if it wasn't a 2TB zip file, which then has to be unzipped onto another 2TB of storage for practical use.
Subject to licensing, we intend to make the dataset available (along with loads of other big datasets for ML) using a BitTorrent-like program called Dela for the Hops Hadoop platform. Maybe in 3 weeks or so it will all be released, with this dataset included.
Dela integrates with HDFS/S3/GCS backends, supports NAT traversal, and uses delay-based congestion control over UDP - good for high-bandwidth/high-latency networks. See http://www.hops.io and our paper - https://ieeexplore.ieee.org/document/7980225/
"FAQ: The download buttons do not work".
"The website is fully supported by Chrome now"
The current state of the internet, has reached a very low point. One would expect more from berkley.edu.
EDIT: I find the downvotes preposterous. Are we somehow supposed to expect requiring a proprietary browser to simply download a file from an educational institution now?