Tokenising the English text of the 30TB Common Crawl

muyuu · on Dec 12, 2011

Having the full Common Crawl uncompressed on disk with some indexing and RAID-1 is becoming quite feasible (easily under $10k). For maybe 10 years now, the text part of the web has been growing a lot more slowly than the price of storage has been falling. At this rate anyone will be able to keep a full crawl of the web on a mid-range desktop computer in just a few years. Here's hoping for good weather in Thailand. :-)

How much do you need right now, 200TB tops, including index files, uncompressed data and replication? With some clever compression and data structures you can probably cut that to half or less. Financially speaking, this is already in local club territory.

rjurney · on Dec 12, 2011

Very good to see an example of working with the common crawl. I would encourage YC applicants to think about what kind of opportunities it presents.

monatron · on Dec 12, 2011

Very interesting. Surprising to see Lithuanian as the second most popular language. Oddly enough, I was having a conversation with someone this weekend and they mentioned that Lithuania had the highest fiber-to-the-home penetration in Europe and one of the fastest average internet connections in the world. Does anyone know why this is?

malkung · on Dec 12, 2011

I once tried to identify the language of (supposedly) English-language Facebook posts. The second most frequent language was Estonian. Looking at the actual posts classified as Estonian, it turned out they contained text like "Soooooo nice". Because the language identifier looks at character sequences, and Estonian has unusually frequent double vowels, such posts were classified as Estonian. Dutch and Norwegian were the other two languages that English was mistaken for - apparently some character sequences there have frequency distributions similar to those of English.
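
To make the failure mode concrete, here is a minimal sketch of a character-bigram language identifier. It is not the identifier used on the Facebook posts (that tool is not named); the training sentences, the `vocab_size` smoothing constant, and all function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Overlapping character n-grams of the lowercased text, space-padded."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_profile(samples, n=2):
    """Count n-grams over a list of training sentences."""
    counts = Counter()
    for sample in samples:
        counts.update(char_ngrams(sample, n))
    return counts


def log_score(text, profile, n=2, vocab_size=1000):
    """Add-one smoothed log-likelihood of text under an n-gram profile.

    vocab_size is an assumed constant, not a measured vocabulary size.
    """
    total = sum(profile.values())
    return sum(
        math.log((profile[gram] + 1) / (total + vocab_size))
        for gram in char_ngrams(text, n)
    )


def identify(text, profiles, n=2):
    """Return the language whose profile gives the text the highest score."""
    return max(profiles, key=lambda lang: log_score(text, profiles[lang], n))


# Toy training data (hypothetical); a real identifier trains on far more text.
profiles = {
    "en": train_profile([
        "the weather is nice today",
        "so nice to see you",
        "this is a very warm day",
    ]),
    "et": train_profile([
        "noor poiss tuleb kooli",
        "suur puu kasvab soo serval",
        "maa ja kuu",
        "hea kool",
    ]),
}

print(identify("so nice", profiles))
print(identify("Soooooo nice", profiles))
```

With these toy samples the plain post scores higher under the English profile, while the elongated one scores higher under the Estonian profile: each extra "o" adds another "oo" bigram, which the English samples never contain, and those eventually outweigh the evidence from "nice".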

mat_kelcey · on Dec 13, 2011

I think the Lithuania bit is a false alarm. The pipeline needs a lot more work and investigation of results. The main thing I was focusing on first was scaling it out.

ars · on Dec 12, 2011

A bit premature to post this - no results yet, he's still working on it.

mat_kelcey · on Dec 13, 2011