
As a novice, is there a benefit to using a custom Node.js script as the downloader? When I downloaded the 40 million Hacker News API items I used `curl --parallel`.

What I would like to figure out is the easiest way to go from the API straight into a Parquet file.



I think your curl approach would work just as well, if not better. My instinct was to reach for Node.js out of familiarity, but curl is fast, and since the item IDs are sequential, something like `parallel curl ::: $(seq 0 $max_id)` would be pretty simple. I did end up needing more logic, though, so Node.js ultimately came in handy.
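For reference, here is a minimal sketch of what the Node.js batching approach could look like, assuming Node 18+'s built-in `fetch` and the public Firebase endpoints (`/v0/maxitem.json`, `/v0/item/{id}.json`). The batch size and NDJSON output are illustrative choices, not necessarily what my actual script did:

```ts
// Sketch: fetch HN items in parallel batches using Node 18+'s built-in fetch.
import { appendFileSync } from "node:fs";

const API = "https://hacker-news.firebaseio.com/v0";
const BATCH_SIZE = 100; // arbitrary; tune for your connection

async function fetchItem(id: number): Promise<unknown> {
  const res = await fetch(`${API}/item/${id}.json`);
  if (!res.ok) throw new Error(`HTTP ${res.status} for item ${id}`);
  return res.json();
}

async function main() {
  const maxId: number = await (await fetch(`${API}/maxitem.json`)).json();
  for (let start = 0; start <= maxId; start += BATCH_SIZE) {
    const ids = Array.from(
      { length: Math.min(BATCH_SIZE, maxId - start + 1) },
      (_, i) => start + i
    );
    // Fetch one batch concurrently; failed or deleted IDs are simply skipped.
    const results = await Promise.allSettled(ids.map(fetchItem));
    for (const r of results) {
      if (r.status === "fulfilled" && r.value !== null) {
        appendFileSync("items.ndjson", JSON.stringify(r.value) + "\n");
      }
    }
  }
}

main();
```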

As for writing an Arrow (or Parquet) file directly, I'm not sure, unfortunately. I imagine there are some difficulties because the format is columnar, so the writer probably wants a batch of rows at a time rather than one item at a time.
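To illustrate the batching idea, here is a small sketch using the `apache-arrow` npm package (my assumption, not something from the original pipeline). It accumulates items into column arrays and serializes one batch to an Arrow IPC file; an actual Parquet file would need a separate writer library, but the batch-of-rows pattern is the same:

```ts
// Sketch: batch items into columns and write one Arrow IPC file.
import { writeFileSync } from "node:fs";
import { tableFromArrays, tableToIPC } from "apache-arrow";

interface Item {
  id: number;
  type?: string;
  by?: string;
  time?: number;
  text?: string;
}

// Convert a batch of row objects into column arrays, then into Arrow bytes.
function itemsToArrowBytes(items: Item[]): Uint8Array {
  const table = tableFromArrays({
    id: Int32Array.from(items.map((i) => i.id)),
    type: items.map((i) => i.type ?? ""),
    by: items.map((i) => i.by ?? ""),
    time: Float64Array.from(items.map((i) => i.time ?? 0)),
    text: items.map((i) => i.text ?? ""),
  });
  // Serialize the whole batch to the Arrow IPC file format in one go.
  return tableToIPC(table, "file");
}

// Usage with a tiny in-memory batch (illustrative data only).
const batch: Item[] = [
  { id: 1, type: "story", by: "pg", time: 1160418111, text: "" },
];
writeFileSync("items.arrow", itemsToArrowBytes(batch));
```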



