Large Public Datasets (quora.com)
164 points by amazedsaint on May 11, 2014 | 26 comments


I'm the tech lead on the team that runs http://police.uk, the UK Police crime mapping website, and our data is available to download and through an API at http://data.police.uk. The dataset isn't nearly as big as some of the ones in this list (~40MM rows), but it's nice that the Home Office are open with their data.

People have built some cool apps with it, which are showcased[1] on the site. There's even a Pebble watch app[2], which is due to be added to that list shortly.

We also have an open-source Python client[3] for the API, which I'm planning to post here once the documentation is finished.

[1]: http://www.police.uk/apps/

[2]: https://git.bengcooper.co.uk/bengcooper/pebble-crimewatch/wi...

[3]: https://github.com/rkhleics/police-api-client-python/
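For anyone who wants to poke at the API without a client library, here is a minimal sketch using only the Python standard library. It does not use the open-source client mentioned above. The base URL, the `crimes-street/all-crime` path, the `lat`/`lng`/`date` parameters and the `category` field are my assumptions about the public API, so check them against the documentation at http://data.police.uk before relying on them.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL and endpoint path of the public API.
API_BASE = "https://data.police.uk/api"


def street_crimes_url(lat, lng, month):
    """Build the URL for street-level crimes near a point in a given month (YYYY-MM)."""
    query = urlencode({"lat": lat, "lng": lng, "date": month})
    return f"{API_BASE}/crimes-street/all-crime?{query}"


def fetch_crimes(lat, lng, month):
    """Download and decode the JSON list of crimes (performs a network request)."""
    with urlopen(street_crimes_url(lat, lng, month)) as response:
        return json.load(response)


def count_by_category(crimes):
    """Tally crimes by their 'category' field (assumed to be present on each record)."""
    return Counter(crime["category"] for crime in crimes)
```

Usage would be something like `count_by_category(fetch_crimes(52.63, -1.13, "2014-03"))`, where the coordinates and month are placeholders.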


The Open Data Index ranks countries by how much data they make publicly available across 10 areas (transport, budget, spending, etc.). It isn't a full global picture and it doesn't cover all datasets in every country. However, it's still useful as a comparison tool and a way of seeing what data is (and isn't) available.

They have rankings for 70 countries. The top 10 (from October 2013) are:

  1. UK
  2. US
  3. Denmark
  4. Norway
  5. Netherlands
  6. Finland
  7. Sweden
  8. New Zealand
  9. Australia
  10. Canada
https://index.okfn.org/country


Non-blurred:

https://www.quora.com/Where-can-I-find-large-datasets-open-t...

Someone else shared this tip before. Note the "?share=1"


We added it to the URL.


Tangentially related: my master's thesis applies predictive algorithms to web traffic for scaling purposes, and I cannot believe that there isn't more server trace data available. The best I've found is some data from the mid-90s and Wikipedia in 2007.

So if any of you wonderful people feel so inclined as to donate some requests/sec metrics, I would be deeply appreciative.


I'm not signing in to Facebook to read that.


Add ?share=1 to the end of any Quora URL to avoid signing in.
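If you want to automate the trick, a small sketch with the Python standard library could look like this. The function name and the example URL in the usage note are made up for illustration, and whether Quora still honors the parameter is not something this code can tell you.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_share_param(url):
    """Return the URL with share=1 in its query string, keeping other parameters."""
    parts = urlsplit(url)
    # Drop any existing 'share' parameter so the result is not duplicated.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "share"]
    query.append(("share", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, `add_share_param("https://www.quora.com/some-question")` gives back the same address with `?share=1` appended, and calling it twice does not add the parameter a second time.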


This trick is getting old. Quora's UX is just rotten.


What's interesting to me is how much this has been criticized and yet it is still in place. This is pretty much the first point that comes up on HN every time the topic is Quora, so there is simply no way they are unaware of the complaints.


I worked at Quora, and I wish Adam would speak about this publicly. His reasons for this are very solid and he articulates them in a very convincing way. Maybe now that Quora has joined YC, he may speak more about it publicly since it's such a point of contention for many YC News readers.


It's a discussion of "good places to find large datasets open to the public". There are 132 or so answers, so I wouldn't even try to copy them all here (copyright issues aside anyway), but here are the current contents of the answer wiki. Note that some of these got partially truncated by the c&p from Quora, so if you want the full URL, you'll have to log in or get somebody else to fix it. I'm too lazy for all that right now. :-)

Here are many of the links mentioned so far:

Cross-disciplinary data repositories, data collections and data search engines:

http://usgovxml.com

http://aws.amazon.com/datasets

http://databib.org

http://datacite.org

http://figshare.com

http://linkeddata.org

http://reddit.com/r/datasets

http://thedatahub.org alias http://ckan.net

http://quandl.com

Social Network Analysis Interactive Dataset Library (Social Network Datasets)

Datasets for Data Mining

http://enigma.io

Single datasets and data repositories

http://archive.ics.uci.edu/ml/

http://crawdad.org/

http://data.austintexas.gov

http://data.cityofchicago.org

http://data.govloop.com

http://data.gov.uk/

http://data.medicare.gov

http://data.seattle.gov

http://data.sfgov.org

http://data.sunlightlabs.com

https://datamarket.azure.com/

http://developer.yahoo.com/geo/g...

http://econ.worldbank.org/datasets

http://en.wikipedia.org/wiki/Wik...

http://factfinder.census.gov/ser...

http://ftp.ncbi.nih.gov/

http://gettingpastgo.socrata.com

http://googleresearch.blogspot.com

http://books.google.com/ngrams/

http://medihal.archives-ouvertes.fr

http://public.resource.org/

http://rechercheisidore.fr

http://snap.stanford.edu/data/in...

http://timetric.com/public-data/

https://wist.echo.nasa.gov/~wist...

http://www2.jpl.nasa.gov/srtm

http://www.archives.gov/research...

http://www.bls.gov/

http://www.crunchbase.com/

http://www.dartmouthatlas.org/

http://www.data.gov/

http://www.datakc.org

http://dbpedia.org

http://www.delicious.com/jbaldwi...

http://www.faa.gov/data_research/

http://www.factual.com/

http://research.stlouisfed.org/f...

http://www.freebase.com/

http://www.google.com/publicdata...

http://www.guardian.co.uk/news/d...

http://www.infochimps.com

http://www.kaggle.com/

http://build.kiva.org/

http://www.nationalarchives.gov....

http://www.nyc.gov/html/datamine...

http://www.ordnancesurvey.co.uk/...

http://www.philwhln.com/how-to-g...

http://www.imdb.com/interfaces

http://imat-relpred.yandex.ru/en...

http://www.dados.gov.pt/pt/catal...

http://knoema.com

http://daten.berlin.de/

http://www.qunb.com

http://databib.org/

http://datacite.org/

http://data.reegle.info/

http://data.wien.gv.at/

http://data.gov.bc.ca

https://pslcdatashop.web.cmu.edu/ (interaction data in learning environments)

http://www.icpsr.umich.edu/icpsrweb/CPES/ - Collaborative Psychiatric Epidemiology Surveys (a collection of three national surveys focused on each of the major ethnic groups to study psychiatric illnesses and health services use)


Thanks! That's actually a very helpful list. It's ironic that it's posted on a large closed dataset like Quora ;)


Some of the links got truncated by Quora, so here's the same list (modulo intervening wiki edits) with full URLs:

  http://usgovxml.com
  http://aws.amazon.com/datasets
  http://databib.org
  http://datacite.org
  http://figshare.com
  http://linkeddata.org
  http://reddit.com/r/datasets
  http://thedatahub.org
  http://ckan.net
  http://quandl.com
  http://www.growmeme.com/overview
  http://www.kdnuggets.com/datasets/index.html
  http://enigma.io
  http://archive.ics.uci.edu/ml/
  http://crawdad.org/
  http://data.austintexas.gov
  http://data.cityofchicago.org
  http://data.govloop.com
  http://data.gov.uk/
  http://data.medicare.gov
  http://data.seattle.gov
  http://data.sfgov.org
  http://data.sunlightlabs.com
  https://datamarket.azure.com/
  http://developer.yahoo.com/geo/geoplanet/data/
  http://econ.worldbank.org/datasets
  http://en.wikipedia.org/wiki/Wikipedia:Database_download
  http://factfinder.census.gov/servlet/DatasetMainPageServlet
  http://ftp.ncbi.nih.gov/
  http://gettingpastgo.socrata.com
  http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
  http://books.google.com/ngrams/
  http://medihal.archives-ouvertes.fr
  http://public.resource.org/
  http://rechercheisidore.fr
  http://snap.stanford.edu/data/index.html
  http://timetric.com/public-data/
  https://wist.echo.nasa.gov/~wist/api/imswelcome/
  http://www2.jpl.nasa.gov/srtm
  http://www.archives.gov/research/alic/tools/online-databases.html
  http://www.bls.gov/
  http://www.crunchbase.com/
  http://www.dartmouthatlas.org/
  http://www.data.gov/
  http://www.datakc.org
  http://dbpedia.org
  http://www.delicious.com/jbaldwinconnect/DataSets
  http://www.faa.gov/data_research/
  http://www.factual.com/
  http://research.stlouisfed.org/fred2/
  http://www.freebase.com/
  http://www.google.com/publicdata/directory
  http://www.guardian.co.uk/news/datablog
  http://www.infochimps.com
  http://www.kaggle.com/
  http://build.kiva.org/
  http://www.nationalarchives.gov.uk/doc/open-government-licence/open-government-licence.htm
  http://www.nyc.gov/html/datamine/html/home/home.shtml
  http://www.ordnancesurvey.co.uk/oswebsite/opendata/
  http://www.philwhln.com/how-to-get-experience-working-with-large-datasets
  http://www.imdb.com/interfaces
  http://imat-relpred.yandex.ru/en/datasets
  http://www.dados.gov.pt/pt/catalogodados/catalogodados.aspx
  http://knoema.com
  http://daten.berlin.de/
  http://www.qunb.com
  http://databib.org/
  http://datacite.org/
  http://data.reegle.info/
  http://data.wien.gv.at/
  http://data.gov.bc.ca
  https://pslcdatashop.web.cmu.edu/
  http://www.icpsr.umich.edu/icpsrweb/CPES/


I don't blame you, but it worked for me without signing into anything, and I deleted my Facebook account several years ago. I do have an actual Quora account, which would of course be a nuisance to sign into every time as well, but for whatever reason it seems to either keep me signed in or sign me in automatically. Maybe see if you can get it to do the same? I'm using Chrome, if that matters.


Does anyone have a dataset of every movie released between 1950 and 2013? I need cast, release year and title.



The problem with the data they provide is that it is not relational and is missing a huge number of movies. It also leaves you with no way to distinguish between movies other than by title, which gives you mismatches when two movies have the same name in the same year.


Yes, IMDb is a pretty poor source of data. From my research, a combination of the well-organized Freebase film database[1] (~19mm facts) with details filled in from IMDb is a better approach. However, processing data from Freebase is not trivial and requires a decent time investment to grok.

[1]: https://www.freebase.com/film


Correlate and infer from the data. If you can't distinguish between similarly named films by title, you still have the director, the cast and the years when they were active, along with country, production studio and countless other attributes. A single data source is never effective anyway.

Along with data sources you will also need a production process and streamlined workflow. It's a holistic, iterative exercise.
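As a rough sketch of what that correlation can look like in practice: key each film on several attributes instead of the title alone, then merge records from different sources on that key. The field names (`title`, `year`, `director`, `cast`) and the sample records are invented for illustration and do not match any real dump format.

```python
def film_key(record):
    """Composite key: normalized title, release year and normalized director."""
    return (
        record["title"].strip().lower(),
        record["year"],
        record["director"].strip().lower(),
    )


def merge_sources(*sources):
    """Merge film records from several sources, combining cast lists per key."""
    merged = {}
    for source in sources:
        for record in source:
            entry = merged.setdefault(film_key(record), {"cast": set()})
            entry["cast"].update(record.get("cast", []))
    return merged


# Hypothetical records: two different films share a title and a year.
source_a = [
    {"title": "Example Film", "year": 2001, "director": "A. Director", "cast": ["Actor One"]},
    {"title": "Example Film", "year": 2001, "director": "B. Director", "cast": ["Actor Two"]},
]
source_b = [
    {"title": "example film ", "year": 2001, "director": "a. director", "cast": ["Actor Three"]},
]

films = merge_sources(source_a, source_b)
```

Here the two same-titled films stay separate because their directors differ, while the record from the second source is folded into the first film. Real data would need fuzzier matching (name variants, missing directors), which is where the iterative part comes in.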


I have a Ruby project that parses IMDb data into a database, and which can also supplement itself with data from Rotten Tomatoes and Freebase. It's fairly generic and could be extended. Email me if you want to chat.


A lot of IMDb's data is freely available, though it may not go up to 2013. Sorry, I don't have the link readily available.


Amazon hosts public datasets at https://aws.amazon.com/publicdatasets/. Good if you want to quickly spin up an instance, copy data over from S3 and process it.


The New Zealand Government makes a lot of its datasets accessible. You can also request data:

https://data.govt.nz


Are there any good free datasets of business and POI addresses worldwide? Preferably with geocoding...


Check the OpenStreetMap database (ODbL license):

http://stackoverflow.com/questions/1875255/open-source-poi-d...

"Planet.osm is the OpenStreetMap data in one file: all the nodes, ways and relations that make up our map. A new version is released every week. It's a big file (XML variant over 400GB uncompressed, 34GB compressed)."

http://planet.openstreetmap.org/
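Since the XML is far too big to load in one go, you would read it as a stream. Below is a minimal sketch with Python's standard library that pulls out nodes carrying a given tag. It assumes the usual OSM XML layout (`node` elements with `lat`/`lon` attributes and `tag` children with `k`/`v` attributes); the sample document is made up, and for the full planet file a dedicated OSM tool would likely be far faster than this.

```python
import io
import xml.etree.ElementTree as ET


def iter_pois(source, key="amenity"):
    """Yield (name, lat, lon, value) for each node that has the given tag key."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root <osm> element
    for event, elem in context:
        if event != "end" or elem.tag not in ("node", "way", "relation"):
            continue
        if elem.tag == "node":
            tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
            if key in tags:
                yield (tags.get("name"), float(elem.get("lat")),
                       float(elem.get("lon")), tags[key])
        # Drop finished top-level elements so memory use stays flat.
        root.clear()


# Tiny made-up document in the assumed OSM XML layout.
sample = b"""<osm version="0.6">
  <node id="1" lat="51.5" lon="-0.1">
    <tag k="amenity" v="cafe"/>
    <tag k="name" v="Example Cafe"/>
  </node>
  <node id="2" lat="51.6" lon="-0.2"/>
</osm>"""

pois = list(iter_pois(io.BytesIO(sample)))
```

Pass an open file (or a decompressing file object) instead of the `BytesIO` to run it on a real extract.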


Thanks for this! I'm currently writing my master's thesis, and I'm in desperate need of such data sets. Especially free data sets.



