Also, the article does state what Hadley's take on the question is:
"As Wickham defines data science as “the process by which data becomes understanding, knowledge, and insight”, he advocates using data science tools where value is gained from iteration, surprise, reproducibility, and scalability. In particular, he argues that being a data scientist and being a programmer are not mutually exclusive and that using a programming language helps data scientists towards understanding the real signal within their data."
As long as you obey robots.txt there is nothing wrong with crawling. Your code on GitHub doesn't give any indication of what sites you collect data from, so there is no indication that you are scraping rather than crawling in an acceptable manner. Though it wouldn't hurt to label your work as crawler scripts instead of scraping scripts ;)
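Checking robots.txt before fetching is straightforward with Python's standard library; here's a minimal sketch (the rules and URLs below are illustrative, not from any real site):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules -- in practice you'd fetch them
# from https://example.com/robots.txt before crawling that host.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Check each URL against the rules before requesting it.
print(parser.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
```

A polite crawler would also honor any Crawl-delay directive and identify itself with a descriptive User-agent string.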
New, somewhat stealthy, for-profit startup focused on social good.
We have a very talented team so far: a full-stack web dev, a data architect, two junior software engineers, a CTO, a CEO (me), two marketing people, and a business operations person.
We are looking to add a designer and a devops engineer.
We work out of the top floor of my house for now - it is a comfortable space. We have funding. The team members we have so far are wonderful to work with, everyone gets along well, and we all feel like what we are building is work that matters.
Please email me if you want to hear more about the team, the stack, and the product.
Great question and great blog post! I am looking forward to reading the Homay King stuff that uses Queer Theory and will probably reread Computing Machinery and Intelligence more thoroughly.
I played around with Prismatic before but it just didn’t grab me and I found I didn’t use it that much.
This new version is a whole different animal. Not only is it much prettier (great design) but they seem to have seriously improved their relevance algorithms. I would be very interested to hear from their data team why the relevance is so much better now - anyone from Prismatic monitoring these comments?
Glad you're liking it! I'm one of the backend engineers who designed the system. We devoted a good chunk of time this year to improving our relevance algorithms, and I actually gave a talk on exactly this at Strangeloop in September. The video is at http://www.infoq.com/presentations/machine-learning.
Great presentation! I'm fascinated by this world, and Prismatic definitely appears to be at the forefront of achieving relevance. What I've been kind of baffled by are businesses like Taboola, Gravity, and Outbrain. I understand their businesses are centered around placement of "related" or "personalized" articles of boobs, bikinis, and booze just so you click and they get paid, but do you think behind the scenes they have relevance and personalization technology on par with Prismatic? They have a ton of money and people, so you'd think they could recommend genuinely relevant content, but publicly it never appears so.
Ideally beyond the top sites, these subsets would be available as verticals, so that people can focus on specialized search engines.
While it's nice to have generalist search engines, it would be even better to be able to unbundle the generalist search engines completely. Verticals such as the following would be nice:
1) Everything Linux and Unix
2) Everything open-source
3) Only news & current events
4) Popular culture globally and by country
5) Politics globally and by country
6) Everything software engineering
7) Everything hardware engineering
8) Everything maker community
9) Everything financial markets
10) Everything medicine / health (sans obvious quackery)
11) etc.
Maybe make a tool that allows the community to create the subset creation recipes that perform the parsing out of data of a certain type and that the community forks and improves over time.
The ship has sailed on creating a generalist search engine, but specialist search engines are total greenfield.
You don't usually download this data - you process it on AWS to your requirements.
Seriously - they give you an easy way to create these subsets yourself[1]. That is a much better solution than them trying to anticipate the exact needs of every potential client.
I guess what I was suggesting is "given enough eyeballs, all spam and poor quality content is shallow"
There is definitely a benefit in using the community to identify valuable subsets and then individually putting your energy towards building discovery/search products around that subset.
Would love to have even smaller subsets (like 5GB) that students can casually play around with to practice and learn tools and algos :) (if it's not too much trouble!)
The text extracts and metadata files are generated from individual WARC files, so it is pretty easy to get the corresponding sets of files. For the above it would be:
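For anyone curious what's inside those files: a WARC is just a sequence of records, each with a version line, header fields, a blank line, and a payload. Here's a toy sketch that parses one uncompressed record using only the standard library (the record contents are made up for illustration; real tooling handles the full format, including gzipped members):

```python
# A hand-built toy WARC record -- not from an actual Common Crawl file.
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"Hello, crawl!"
)

def parse_warc_record(raw):
    """Split one WARC record into (version, headers dict, payload bytes)."""
    head, _, payload = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = dict(line.split(": ", 1) for line in lines[1:])
    # Content-Length tells us exactly how many payload bytes belong
    # to this record (records are concatenated back to back in a file).
    length = int(headers["Content-Length"])
    return version, headers, payload[:length]

version, headers, body = parse_warc_record(record)
print(headers["WARC-Target-URI"], body)
```

Since text extracts and metadata are derived per WARC file, mapping between the three sets is just a matter of matching file names.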
Is there any way to get incrementals? It would be extremely valuable to get the pages that were added/changed/deleted each day. Some kind of daily feed of a more limited size.
Totally true. Smaller versions aren't just helpful for casual/student use; they also help in code development and debugging. Otherwise, algorithm development gets impeded by scaling issues.
Limited resources are the only reason. We are working on a subset crawl of ~3 million pages that will be published weekly starting two weeks from now. But doing the full crawl takes a lot of time, effort and money.
Is that really worth it though? I can crawl 3 million pages in less than 24 hours without any real effort on my part. Or are you going to provide 3 million of the most useful pages? Depth or breadth first crawl?
We do think it is worth it to avoid duplicative efforts.
Suppose you crawl 3 million pages and you pay for the compute and storage costs. Then the next person who wants crawl data goes through the same effort and pays the same costs. Doesn't it make much more sense to have a common pool of open data that everyone can use? Even if the effort and costs are low, they are not zero.
For the smaller frequent crawl, we are working with Mozilla, and we will do the top pages (top according to Alexa).
Fair point and makes sense. If you publish the rank along with the data itself that would be very useful. Perhaps having a few sets of data? 3 million top pages, 3 million deep pages etc...
Personally I would like to see around 20-100 million pages or whatever is about 500-1000GB. That's enough data to work with on a local machine and serve up some meaningful results assuming you want to build a search engine or just do some deep analysis of the web.
Isn't there also the additional factor that web servers sometimes allow only the major search engines to crawl? If so, should something like this gain popularity and more apps start using it, you'd hope more web servers would allow the common crawler to crawl their websites, which they might not if everyone were doing it individually... thinking aloud...
To be honest, a simple crawler is a very simple thing to write. If someone has issues getting that going, I think they are going to have issues with the data volume anyway. LisaG answered why the 3-million-page data set, though, and I agree with the reasoning.
Internet Archive (currently) doesn't want to put their data on any cloud service. We believe it is crucial that people can easily access and analyze the data so we put it on various cloud platforms.
We are talking with a few organizations about getting data donations that we could put in our corpus and make available to everyone, but nothing is settled enough that I can publicly comment on those potential partnerships yet.