I'm the tech lead on the team that runs http://police.uk, the UK Police crime mapping website, and our data is available to download and through an API at http://data.police.uk. The dataset isn't nearly as big as some of the ones in this list (~40MM rows), but it's nice that the Home Office are open with their data.
People have built some cool apps with it, which are showcased[1] on the site. There's even a Pebble watch app[2], which is due to be added to that list shortly.
We also have an open-source Python client[3] for the API, which I'm planning to post here once the documentation is finished.
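For anyone who wants to try the API before reaching for the client, here is a minimal sketch using only the Python standard library. It does not use the client library mentioned above. The `crimes-street/all-crime` path and the `lat`, `lng` and `date` parameters are assumptions to be checked against the documentation at data.police.uk, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL and endpoint path; verify against the docs at data.police.uk.
API_BASE = "https://data.police.uk/api"


def street_crime_url(lat, lng, month):
    """Build the (assumed) street-level crime URL for a point and a YYYY-MM month."""
    query = urlencode({"lat": lat, "lng": lng, "date": month})
    return f"{API_BASE}/crimes-street/all-crime?{query}"


def fetch_street_crimes(lat, lng, month, timeout=30):
    """Fetch and decode the JSON response. This performs a network request."""
    with urlopen(street_crime_url(lat, lng, month), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_street_crimes(52.629729, -1.131592, "2013-01")` should return the decoded JSON if the endpoint behaves as assumed; the coordinates are arbitrary example values.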
The open data index ranks countries by how much data they make publicly available based on 10 areas (transport, budget, spending, etc). It isn't a full global picture and it doesn't cover all datasets in every country. However, it's still useful as a comparison tool and a way of seeing what data is (and isn't) available.
They have rankings for 70 countries. The top 10 (from October 2013) are:
1. UK
2. US
3. Denmark
4. Norway
5. Netherlands
6. Finland
7. Sweden
8. New Zealand
9. Australia
10. Canada
Tangentially related, my master's thesis is on applying predictive algorithms to web traffic for scaling purposes, and I cannot believe that there isn't more server trace data available. The best I've found is some data from the mid-90s and Wikipedia in 2007.
So if any of you wonderful people feel so inclined as to donate some requests/sec metrics, I would be deeply appreciative.
What's interesting to me is how much this has been criticized and yet it is still in place. This is pretty much the first point that comes up on HN every time the topic is Quora, so there is simply no way they are unaware of the complaints.
I worked at Quora, and I wish Adam would speak about this publicly. His reasons for this are very solid and he articulates them in a very convincing way. Maybe now that Quora has joined YC, he may speak more about it publicly since it's such a point of contention for many YC News readers.
It's a discussion of "good places to find large datasets open to the public". There are 132 or so answers, so I wouldn't even try to copy them all here (copyright issues aside anyway), but here's the current contents of the answer wiki. Note that some of these got partially truncated by the c&p from Quora, so if you want the full URL, you'll have to log in or get somebody else to fix it. I'm too lazy for all that right now. :-)
Here are many of the links mentioned so far:
Cross-disciplinary data repositories, data collections and data search engines:
http://www.icpsr.umich.edu/icpsrweb/CPES/ - Collaborative Psychiatric Epidemiology Surveys (a collection of three national surveys focused on each of the major ethnic groups to study psychiatric illnesses and health services use)
I don't blame you, but it worked for me without signing into anything and I deleted my Facebook account several years ago. I do have an actual Quora account, which of course it would be a nuisance to have to sign into every time as well, but for whatever reason, it seems to either keep me signed in or sign me in automatically; maybe try to see if you can get it to do this? I'm using Chrome if that matters.
The problem with the data they provide is that it is not relational and is missing huge numbers of movies. That leaves you with no way to distinguish between movies other than by title, which gives you mismatches when movies have the same name in the same year.
Yes, IMDb is a pretty poor source of data. From my research, a combination of the well-organized Freebase film database[1] (~19mm facts) with details filled in from IMDb is a better approach. However, processing data from Freebase is not trivial and requires a decent amount of time investment to grok.
Correlate and infer from the data. If you can't distinguish between similarly named films, you still have the director and cast and the years when they were active, along with country, production studio and countless other attributes. A single data source is never effective anyway.
Along with data sources you will also need a production process and streamlined workflow. It's a holistic, iterative exercise.
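The idea of correlating on extra attributes can be sketched roughly as follows. This is an illustration only, not anyone's actual pipeline: the record layout (`title`, `year`, `director` keys) and the function names are hypothetical, and a real workflow would likely use more attributes (cast, country, studio) and fuzzier matching.

```python
from collections import defaultdict


def normalise(text):
    """Lower-case and drop punctuation so trivially different strings compare equal."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()


def link_films(primary, secondary):
    """Match films on (title, year), then use the director to break ties.

    Returns a list of (primary_record, secondary_record_or_None) pairs.
    """
    # Index the secondary source by normalised title and year.
    by_title_year = defaultdict(list)
    for rec in secondary:
        by_title_year[(normalise(rec["title"]), rec["year"])].append(rec)

    pairs = []
    for rec in primary:
        candidates = by_title_year.get((normalise(rec["title"]), rec["year"]), [])
        if len(candidates) > 1:
            # Same title and year: narrow down using a second attribute.
            candidates = [c for c in candidates
                          if normalise(c["director"]) == normalise(rec["director"])]
        # Only accept an unambiguous match; otherwise leave the film unlinked.
        pairs.append((rec, candidates[0] if len(candidates) == 1 else None))
    return pairs
```

Leaving ambiguous films unlinked (rather than guessing) keeps the mismatches visible, so they can be resolved in a later pass with more attributes.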
I have a Ruby project that parses IMDb data into a database, and which can also supplement itself with data from Rotten Tomatoes and Freebase. It's fairly generic and could be extended. Email me if you want to chat.
Amazon hosts public datasets at
https://aws.amazon.com/publicdatasets/
Good if you want to quickly spin up an instance, copy data over from S3 and process it.
"Planet.osm is the OpenStreetMap data in one file: all the nodes, ways and relations that make up our map. A new version is released every week. It's a big file (XML variant over 400GB uncompressed, 34GB compressed)."
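A file that size won't fit in memory, so it has to be parsed as a stream. Here is a minimal sketch that counts the nodes, ways and relations mentioned in the quote using the standard library's `iterparse`. The function name is hypothetical, and it assumes the usual OSM XML layout in which `node`, `way` and `relation` are direct children of the root element.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_osm_elements(source):
    """Count node/way/relation elements in OSM XML without loading the whole file.

    `source` may be a filename or a binary file object.
    """
    counts = Counter()
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag in ("node", "way", "relation"):
            counts[elem.tag] += 1
            # Drop finished children from the root so memory use stays flat.
            root.clear()
    return counts
```

For the compressed variant, passing a file object from `bz2.open(path, "rb")` should work in the same way (assuming a bzip2-compressed download), at the cost of decompression time.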
[1]: http://www.police.uk/apps/
[2]: https://git.bengcooper.co.uk/bengcooper/pebble-crimewatch/wi...
[3]: https://github.com/rkhleics/police-api-client-python/