Large Public Datasets

robgolding · on May 11, 2014

I'm the tech lead on the team that runs http://police.uk, the UK Police crime mapping website, and our data is available to download and through an API at http://data.police.uk. The dataset isn't nearly as big as some of the ones in this list (~40MM rows), but it's nice that the Home Office are open with their data.

People have built some cool apps with it, which are showcased[1] on the site. There's even a Pebble watch app[2], which is due to be added to that list shortly.

We also have an open-source Python client[3] for the API, which I'm planning to post here once the documentation is finished.

[1]: http://www.police.uk/apps/

[2]: https://git.bengcooper.co.uk/bengcooper/pebble-crimewatch/wi...

[3]: https://github.com/rkhleics/police-api-client-python/

chestnut-tree · on May 11, 2014

The Open Data Index ranks countries by how much data they make publicly available across 10 areas (transport, budget, spending, etc.). It isn't a full global picture and it doesn't cover all datasets in every country. However, it's still useful as a comparison tool and a way of seeing what data is (and isn't) available.

They have rankings for 70 countries. The top 10 (from October 2013) are:

  1. UK
  2. US
  3. Denmark
  4. Norway
  5. Netherlands
  6. Finland
  7. Sweden
  8. New Zealand
  9. Australia
  10. Canada

https://index.okfn.org/country

saganus · on May 11, 2014

Non-blurred:

https://www.quora.com/Where-can-I-find-large-datasets-open-t...

Someone else shared this tip before. Note the "?share=1"

dang · on May 11, 2014

We added it to the url.

andrewguenther · on May 11, 2014

Tangentially related: my master's thesis is on applying predictive algorithms to web traffic for scaling purposes, and I cannot believe that there isn't more server trace data available. The best I've found is some data from the mid-90s and Wikipedia in 2007.

So if any of you wonderful people feel so inclined as to donate some requests/sec metrics, I would be deeply appreciative.

eksith · on May 11, 2014

I'm not signing in to Facebook to read that.

asher_ · on May 11, 2014

Add ?share=1 to the end of any Quora url to avoid signing in.
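If you want to script this, here is a minimal sketch using only Python's standard library. The helper name `with_share_param` and the example URL are made up for illustration. The helper only edits the query string; whether Quora honours the parameter is up to Quora.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_share_param(url):
    """Return url with share=1 in its query string, keeping other parameters."""
    parts = urlsplit(url)
    # Drop any existing "share" parameter so it is not duplicated.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "share"
    ]
    query.append(("share", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical URL, used only to show the call.
print(with_share_param("https://www.quora.com/Some-Question"))
```

Going through `urlsplit`/`urlunsplit` rather than plain string concatenation means a URL that already has a query string gets `&share=1` instead of a second `?`.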

camus2 · on May 11, 2014

This trick is getting old. Quora UX is just rotten.

nilkn · on May 11, 2014

What's interesting to me is how much this has been criticized and yet it is still in place. This is pretty much the first point that comes up on HN every time the topic is Quora, so there is simply no way they are unaware of the complaints.

codezero · on May 11, 2014

I worked at Quora, and I wish Adam would speak about this publicly. His reasons for this are very solid and he articulates them in a very convincing way. Now that Quora has joined YC, maybe he will speak more about it publicly, since it's such a point of contention for many YC News readers.

mindcrime · on May 11, 2014

It's a discussion of "good places to find large datasets open to the public". There are 132 or so answers, so I wouldn't even try to copy them all here (copyright issues aside anyway), but here are the current contents of the answer wiki. Note that some of these got partially truncated by the c&p from Quora, so if you want the full URL, you'll have to log in or get somebody else to fix it. I'm too lazy for all that right now. :-)

Here are many of the links mentioned so far:

Cross-disciplinary data repositories, data collections and data search engines:

http://usgovxml.com

http://aws.amazon.com/datasets

http://databib.org

http://datacite.org

http://figshare.com

http://linkeddata.org

http://reddit.com/r/datasets

http://thedatahub.org alias http://ckan.net

http://quandl.com

Social Network Analysis Interactive Dataset Library (Social Network Datasets)

Datasets for Data Mining

http://enigma.io

Single datasets and data repositories

http://archive.ics.uci.edu/ml/

http://crawdad.org/

http://data.austintexas.gov

http://data.cityofchicago.org

http://data.govloop.com

http://data.gov.uk/

http://data.medicare.gov

http://data.seattle.gov

http://data.sfgov.org

http://data.sunlightlabs.com

https://datamarket.azure.com/

http://developer.yahoo.com/geo/g...

http://econ.worldbank.org/datasets

http://en.wikipedia.org/wiki/Wik...

http://factfinder.census.gov/ser...

http://ftp.ncbi.nih.gov/

http://gettingpastgo.socrata.com

http://googleresearch.blogspot.com

http://books.google.com/ngrams/

http://medihal.archives-ouvertes.fr

http://public.resource.org/

http://rechercheisidore.fr

http://snap.stanford.edu/data/in...

http://timetric.com/public-data/

https://wist.echo.nasa.gov/~wist...

http://www2.jpl.nasa.gov/srtm

http://www.archives.gov/research...

http://www.bls.gov/

http://www.crunchbase.com/

http://www.dartmouthatlas.org/

http://www.data.gov/

http://www.datakc.org

http://dbpedia.org

http://www.delicious.com/jbaldwi...

http://www.faa.gov/data_research/

http://www.factual.com/

http://research.stlouisfed.org/f...

http://www.freebase.com/

http://www.google.com/publicdata...

http://www.guardian.co.uk/news/d...

http://www.infochimps.com

http://www.kaggle.com/

http://build.kiva.org/

http://www.nationalarchives.gov....

http://www.nyc.gov/html/datamine...

http://www.ordnancesurvey.co.uk/...

http://www.philwhln.com/how-to-g...

http://www.imdb.com/interfaces

http://imat-relpred.yandex.ru/en...

http://www.dados.gov.pt/pt/catal...

http://knoema.com

http://daten.berlin.de/

http://www.qunb.com

http://databib.org/

http://datacite.org/

http://data.reegle.info/

http://data.wien.gv.at/

http://data.gov.bc.ca

https://pslcdatashop.web.cmu.edu/ (interaction data in learning environments)

http://www.icpsr.umich.edu/icpsrweb/CPES/ (Collaborative Psychiatric Epidemiology Surveys: a collection of three national surveys focused on each of the major ethnic groups to study psychiatric illnesses and health services use)

eksith · on May 11, 2014

Thanks! That's actually a very helpful list. It's ironic that it's posted on a large closed dataset like Quora ;)

zwegner · on May 11, 2014

Some of the links got truncated by Quora, so here's the same list (modulo intervening wiki edits) with full URLs:

  http://usgovxml.com
  http://aws.amazon.com/datasets
  http://databib.org
  http://datacite.org
  http://figshare.com
  http://linkeddata.org
  http://reddit.com/r/datasets
  http://thedatahub.org
  http://ckan.net
  http://quandl.com
  http://www.growmeme.com/overview
  http://www.kdnuggets.com/datasets/index.html
  http://enigma.io
  http://archive.ics.uci.edu/ml/
  http://crawdad.org/
  http://data.austintexas.gov
  http://data.cityofchicago.org
  http://data.govloop.com
  http://data.gov.uk/
  http://data.medicare.gov
  http://data.seattle.gov
  http://data.sfgov.org
  http://data.sunlightlabs.com
  https://datamarket.azure.com/
  http://developer.yahoo.com/geo/geoplanet/data/
  http://econ.worldbank.org/datasets
  http://en.wikipedia.org/wiki/Wikipedia:Database_download
  http://factfinder.census.gov/servlet/DatasetMainPageServlet
  http://ftp.ncbi.nih.gov/
  http://gettingpastgo.socrata.com
  http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
  http://books.google.com/ngrams/
  http://medihal.archives-ouvertes.fr
  http://public.resource.org/
  http://rechercheisidore.fr
  http://snap.stanford.edu/data/index.html
  http://timetric.com/public-data/
  https://wist.echo.nasa.gov/~wist/api/imswelcome/
  http://www2.jpl.nasa.gov/srtm
  http://www.archives.gov/research/alic/tools/online-databases.html
  http://www.bls.gov/
  http://www.crunchbase.com/
  http://www.dartmouthatlas.org/
  http://www.data.gov/
  http://www.datakc.org
  http://dbpedia.org
  http://www.delicious.com/jbaldwinconnect/DataSets
  http://www.faa.gov/data_research/
  http://www.factual.com/
  http://research.stlouisfed.org/fred2/
  http://www.freebase.com/
  http://www.google.com/publicdata/directory
  http://www.guardian.co.uk/news/datablog
  http://www.infochimps.com
  http://www.kaggle.com/
  http://build.kiva.org/
  http://www.nationalarchives.gov.uk/doc/open-government-licence/open-government-licence.htm
  http://www.nyc.gov/html/datamine/html/home/home.shtml
  http://www.ordnancesurvey.co.uk/oswebsite/opendata/
  http://www.philwhln.com/how-to-get-experience-working-with-large-datasets
  http://www.imdb.com/interfaces
  http://imat-relpred.yandex.ru/en/datasets
  http://www.dados.gov.pt/pt/catalogodados/catalogodados.aspx
  http://knoema.com
  http://daten.berlin.de/
  http://www.qunb.com
  http://databib.org/
  http://datacite.org/
  http://data.reegle.info/
  http://data.wien.gv.at/
  http://data.gov.bc.ca
  https://pslcdatashop.web.cmu.edu/
  http://www.icpsr.umich.edu/icpsrweb/CPES/

rwallace · on May 11, 2014

I don't blame you, but it worked for me without signing into anything and I deleted my Facebook account several years ago. I do have an actual Quora account, which of course it would be a nuisance to have to sign into every time as well, but for whatever reason, it seems to either keep me signed in or sign me in automatically; maybe try to see if you can get it to do this? I'm using Chrome if that matters.

icpmacdo · on May 11, 2014

Does anyone have a dataset of every movie released between 1950 and 2013? I need cast, release year and title.

siddboots · on May 11, 2014

http://www.imdb.com/interfaces

deanc · on May 11, 2014

The problem with the data they provide is that it is not relational and is missing a huge number of movies. It leaves you with no way to distinguish between movies other than by title, which gives you mismatches when two movies with the same name come out in the same year.

mcphilip · on May 11, 2014

Yes, IMDb is a pretty poor source of data. From my research, a combination of the well-organized Freebase film database[1] (~19mm facts) with details filled in from IMDb is a better approach. However, processing data from Freebase is not trivial and requires a decent amount of time investment to grok.

[1]: https://www.freebase.com/film

mahmud · on May 11, 2014

Correlate and infer from the data. If you can't distinguish between similarly named films, you still have the director and cast and the years when they were active, along with country, production studio and countless other attributes. A single data source is never effective anyway.

Along with data sources you will also need a production process and streamlined workflow. It's a holistic, iterative exercise.
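A rough sketch of that kind of correlation in Python: match on normalized title and year first, then break ties using cast overlap and director. The record layout, field names and sample values below are all invented for illustration and do not reflect the actual IMDb or Freebase schemas.

```python
def normalize(title):
    """Lowercase and keep only letters and digits, so formatting differences don't matter."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def best_match(record, candidates):
    """Return the candidate with the same title and year that shares the most attributes."""
    same_title_year = [
        c for c in candidates
        if normalize(c["title"]) == normalize(record["title"])
        and c["year"] == record["year"]
    ]

    def score(candidate):
        # One point per shared cast member, plus one for a matching director.
        overlap = len(set(record.get("cast", [])) & set(candidate.get("cast", [])))
        director = record.get("director")
        same_director = director is not None and director == candidate.get("director")
        return overlap + (1 if same_director else 0)

    # default=None covers the case where nothing shares the title and year.
    return max(same_title_year, key=score, default=None)


# Invented records: two films with the same title in the same year.
source_a = {"title": "Example Film", "year": 1999,
            "director": "Director X", "cast": ["Actor A", "Actor B"]}
source_b = [
    {"id": "b1", "title": "Example Film", "year": 1999,
     "director": "Director Y", "cast": ["Actor C"]},
    {"id": "b2", "title": "example film!", "year": 1999,
     "director": "Director X", "cast": ["Actor B", "Actor D"]},
]

match = best_match(source_a, source_b)
print(match["id"] if match else "no match")
```

A real pipeline would probably want fuzzier title matching and a minimum score below which a match is rejected rather than accepted by default.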

atombender · on May 11, 2014

I have a Ruby project that parses IMDb data into a database, and which can also supplement itself with data from Rotten Tomatoes and Freebase. It's fairly generic and could be extended. Email me if you want to chat.

codezero · on May 11, 2014

A lot of IMDb's data is freely available, though it may not go up to 2013. Sorry, I don't have the link readily available.

warrenmar · on May 11, 2014

Amazon hosts public datasets at https://aws.amazon.com/publicdatasets/ (good if you want to quickly spin up an instance, copy data over from S3 and process it).

veb · on May 11, 2014

The New Zealand Government makes a lot of its datasets accessible. You can also request data:

https://data.govt.nz

ceolol · on May 11, 2014

Any good, free datasets of business and POI addresses worldwide? Preferably with geocoding...

pella · on May 11, 2014

Check the OpenStreetMap database (ODbL license):

http://stackoverflow.com/questions/1875255/open-source-poi-d...

"Planet.osm is the OpenStreetMap data in one file: all the nodes, ways and relations that make up our map. A new version is released every week. It's a big file (XML variant over 400GB uncompressed, 34GB compressed)."

http://planet.openstreetmap.org/
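A file that size can't be loaded into memory, so it has to be parsed as a stream. Below is a minimal sketch with Python's standard library that pulls out nodes carrying a given tag. The tiny sample document is invented and only mimics the node/tag layout of OSM XML. The function name `iter_pois` and the choice of `amenity` as the tag key are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Invented sample in the shape of OSM XML (nodes with k/v tags).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6">
  <node id="1" lat="51.5000" lon="-0.1000">
    <tag k="amenity" v="cafe"/>
    <tag k="name" v="Example Cafe"/>
  </node>
  <node id="2" lat="51.6000" lon="-0.2000"/>
  <way id="3"><nd ref="1"/><nd ref="2"/></way>
</osm>
"""


def iter_pois(source, key="amenity"):
    """Yield (id, lat, lon, tags) for every node that carries the given tag key."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end":
            continue
        if elem.tag == "node":
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            if key in tags:
                yield elem.get("id"), float(elem.get("lat")), float(elem.get("lon")), tags
        if elem.tag in ("node", "way", "relation"):
            # Discard finished top-level elements so memory use stays flat.
            root.clear()


pois = list(iter_pois(io.BytesIO(SAMPLE)))
print(pois)
```

For a real planet dump you would pass a file object for the compressed file (for example one opened with `bz2.open`) instead of the in-memory sample, and expect the run to take a long time.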

valevk · on May 11, 2014

Thanks for this! Currently writing my master's thesis, and I'm in desperate need of such data sets. Especially free data sets.