Also, the article does state what Hadley's take on the question is:
"As Wickham defines data science as “the process by which data becomes understanding, knowledge, and insight”, he advocates using data science tools where value is gained from iteration, surprise, reproducibility, and scalability. In particular, he argues that being a data scientist and being a programmer are not mutually exclusive and that using a programming language helps data scientists towards understanding the real signal within their data."
As long as you obey robots.txt there is nothing wrong with crawling. Your code on GitHub doesn't give any indication of what sites you collect data from, so there is no indication that you are scraping rather than crawling in an acceptable manner. Though it wouldn't hurt to label your work as crawler scripts instead of scraping scripts ;)
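Checking robots.txt before fetching is straightforward with Python's standard library; here's a minimal sketch (the rules and URLs below are illustrative, not from any real site):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules -- in practice you'd fetch them
# from https://example.com/robots.txt before crawling that host.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Check each URL against the rules before requesting it.
print(parser.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
```

A polite crawler would also honor any Crawl-delay directive and identify itself with a descriptive User-agent string.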
New, somewhat stealthy, for-profit startup focused on social good.
We have a very talented team so far: a full-stack web dev, a data architect, two junior software engineers, a CTO, a CEO (me), two marketing people, and a business operations person.
We are looking to add a designer and a devops engineer.
We work out of the top floor of my house for now - it is a comfortable space. We have funding. The team members we have so far are wonderful to work with, everyone gets along well, and we all feel like what we are building is work that matters.
Please email me if you want to hear more about the team, the stack, and the product.
Great question and great blog post! I am looking forward to reading the Homay King stuff that uses Queer Theory and will probably reread Computing Machinery and Intelligence more thoroughly.
I played around with Prismatic before but it just didn’t grab me and I found I didn’t use it that much.
This new version is a whole different animal. Not only is it much prettier (great design) but they seem to have seriously improved their relevance algorithms. I would be very interested to hear from their data team why the relevance is so much better now - anyone from Prismatic monitoring these comments?
Glad you're liking it! I'm one of the backend engineers who designed the system. We devoted a good chunk of time this year to improving our relevance algorithms, and I actually gave a talk on exactly this at Strangeloop in September. The video is at http://www.infoq.com/presentations/machine-learning.
Great presentation! I'm fascinated by this world, and Prismatic definitely appears to be at the forefront of achieving relevance. What I've been kind of baffled by are businesses like Taboola, Gravity, and Outbrain. I understand their businesses are centered around placement of "related" or "personalized" articles of boobs, bikinis, and booze just so you click and they get paid, but do you think behind the scenes they have relevance and personalization technology on par with Prismatic? They have a ton of money and people, so you'd think they could recommend genuinely relevant content, but publicly it never appears so.
Ideally beyond the top sites, these subsets would be available as verticals, so that people can focus on specialized search engines.
While it's nice to have generalist search engines, it would be even better to be able to unbundle the generalist search engines completely. Verticals such as the following would be nice:
1) Everything Linux and Unix
2) Everything open-source
3) Only news & current events
4) Popular culture globally and by country
5) Politics globally and by country
6) Everything software engineering
7) Everything hardware engineering
8) Everything maker community
9) Everything financial markets
10) Everything medicine / health (sans obvious quackery)
11) etc.
Maybe make a tool that allows the community to create the subset creation recipes that perform the parsing out of data of a certain type and that the community forks and improves over time.
The ship has sailed on creating a generalist search engine, but specialist search engines are total greenfield.
You don't usually download this data - you process it on AWS to your requirements.
Seriously - they give you an easy way to create these subsets yourself[1]. That is a much better solution than them trying to anticipate the exact needs of every potential client.
I guess what I was suggesting is "given enough eyeballs, all spam and poor quality content is shallow"
There is definitely a benefit in using the community to identify valuable subsets and then individually putting your energy towards building discovery/search products around that subset.
Would love to have even smaller subsets (like 5GB) that students can casually play around with to practice and learn tools and algos :) (if it's not too much trouble!)
The text extracts and metadata files are generated from individual WARC files, so it is pretty easy to get the corresponding sets of files. For the above it would be:
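For anyone curious what's inside those files: a WARC is just a sequence of records, each with a version line, header fields, a blank line, and a payload. Here's a toy sketch that parses one uncompressed record using only the standard library (the record contents are made up for illustration; real tooling handles the full format, including gzipped members):

```python
# A hand-built toy WARC record -- not from an actual Common Crawl file.
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"Hello, crawl!"
)

def parse_warc_record(raw):
    """Split one WARC record into (version, headers dict, payload bytes)."""
    head, _, payload = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = dict(line.split(": ", 1) for line in lines[1:])
    # Content-Length tells us exactly how many payload bytes belong
    # to this record (records are concatenated back to back in a file).
    length = int(headers["Content-Length"])
    return version, headers, payload[:length]

version, headers, body = parse_warc_record(record)
print(headers["WARC-Target-URI"], body)
```

Since text extracts and metadata are derived per WARC file, mapping between the three sets is just a matter of matching file names.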
Is there any way to get incrementals? It would be extremely valuable to get the pages that were added/changed/deleted each day. Some kind of daily feed of a more limited size.
Totally true. Smaller versions aren't just helpful for casual/student use; they also help in code development and debugging. Otherwise, algorithm development gets impeded by scaling issues.
Limited resources are the only reason. We are working on a subset crawl of ~3 million pages that will be published weekly starting two weeks from now. But doing the full crawl takes a lot of time, effort and money.
Is that really worth it though? I can crawl 3 million pages in less than 24 hours without any real effort on my part. Or are you going to provide 3 million of the most useful pages? Depth or breadth first crawl?
We do think it is worth it to avoid duplicative efforts.
Suppose you crawl 3 million pages and you pay for the compute and storage costs. Then the next person who wants crawl data goes through the same effort and pays the same costs. Doesn't it make much more sense to have a common pool of open data that everyone can use? Even if the effort and costs are low, they are not zero.
For the smaller frequent crawl, we are working with Mozilla, and we will do the top pages (top according to Alexa).
Fair point and makes sense. If you publish the rank along with the data itself that would be very useful. Perhaps having a few sets of data? 3 million top pages, 3 million deep pages etc...
Personally I would like to see around 20-100 million pages or whatever is about 500-1000GB. That's enough data to work with on a local machine and serve up some meaningful results assuming you want to build a search engine or just do some deep analysis of the web.
Isn't there also the additional factor that web servers sometimes allow only the major search engines to crawl? If so, should something like this gain popularity and more apps start using it, you'd hope more web servers would allow the common crawler to crawl their websites, which they might not if everyone were doing it individually... thinking aloud...
To be honest, a simple crawler is a very simple thing to write. If someone has issues getting that going, I think they are going to have issues with the data volume anyway. LisaG answered why the 3-million-page data set, though, and I agree with the reasoning.
Internet Archive (currently) doesn't want to put their data on any cloud service. We believe it is crucial that people can easily access and analyze the data so we put it on various cloud platforms.
We are talking with a few organizations about getting data donations that we could put in our corpus and make available to everyone, but nothing is settled enough that I can publicly comment on those potential partnerships yet.