Hacker News new | past | comments | ask | show | jobs | submit login

There was an article a few weeks ago on the front page of HN about an interview question for data scientists that was essentially the "exact-split" problem mentioned in the end of the article. The article (or maybe it was the comment thread) showed an algorithm to randomly split a file of size m+n into disjoint sublists of size m and n, using a single pass through the data and O(n) memory.

This blog post's algorithm accomplishes the same task with two passes and O(m+n) space. It seems odd that an article explicitly about encouraging readers to think like the authors of UNIX and make simple reusable utilities that process stream data would use a two-pass algorithm when a fairly simple one-pass algorithm is available.




You need to know m and n, or a number of lines, beforehand for a one pass version, don't you?


Thinking about it some more, I think the previously mentioned problem had only m unknown, n was known ahead of time.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: