There was an article a few weeks ago on the front page of HN about an interview ... | Hacker News

Hacker News new | past | comments | ask | show | jobs | submit

login

csense on July 21, 2013 | parent | context | favorite | on: Processing large files, line by line

There was an article a few weeks ago on the front page of HN about an interview question for data scientists that was essentially the "exact-split" problem mentioned in the end of the article. The article (or maybe it was the comment thread) showed an algorithm to randomly split a file of size m+n into disjoint sublists of size m and n, using a single pass through the data and O(n) memory.

This blog post's algorithm accomplishes the same task with two passes and O(m+n) space. It seems odd that an article explicitly about encouraging readers to think like the authors of UNIX and make simple reusable utilities that process stream data would use a two-pass algorithm when a fairly simple one-pass algorithm is available.

ZygmuntZ on July 21, 2013 [–]

You need to know m and n, or a number of lines, beforehand for a one pass version, don't you?

csense on July 21, 2013 | [–]

Thinking about it some more, I think the previously mentioned problem had only m unknown, n was known ahead of time.

Consider applying for YC's Spring batch! Applications are open till Feb 11.
Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact