When you have to deal with (genetic data) files of a few GB on a daily basis, I don't think Python, R, or databases are a good choice for basic data exploration.

  -rwxr-x--- 1 29594528008 out_chr1comb.dose
  -rwxr-x--- 1 27924241334 out_chr2comb.dose
  -rwxr-x--- 1 25684164559 out_chr3comb.dose
  -rwxr-x--- 1 24665680612 out_chr4comb.dose
  -rwxr-x--- 1 21493584686 out_chr5comb.dose
  -rwxr-x--- 1 23626967979 out_chr6comb.dose
  -rwxr-x--- 1 20856136599 out_chr7comb.dose
  -rwxr-x--- 1 18398180426 out_chr8comb.dose
  -rwxr-x--- 1 15864714472 out_chr9comb.dose


As someone who deals with large datasets in a database + Python daily, I'm not quite sure what you mean. You'll have to explain to me what "not a good idea" means, or what counts as "basic data exploring".


Consider that I get 10 files of about 3 GB each every week, which I am supposed to filter on a certain column using a reference index and forward to my colleague. Before filtering I also want to check what the files look like: column names, first few records, etc.

I can use something like the following to explore a few rows and a few columns:

  awk '{print $1,$3,$5}' file | head -10

And then I can use something like sed with the reference index to filter the file. Since I plan to repeat this with different files, databases would be time-consuming (even if I automated loading every file and querying it). Due to the file size, options like R or Python would be slower than Unix commands. I can also save the set of commands as a script and share/run it whenever I need it.
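As a rough sketch of what I mean (here with awk rather than sed, since awk handles a lookup table more naturally; the reference index file name and output name are made up, and I'm assuming the ID to match on is in the first column of both files):

  # keep only rows whose first column appears in the reference index
  # (reference_ids.txt is a placeholder name for that index)
  awk 'NR==FNR {keep[$1]; next} $1 in keep' reference_ids.txt out_chr1comb.dose > chr1_filtered.dose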

If there is a better way I would be happy to learn.


I think the gain you're seeing there is because it's quicker for you to do quick, dirty ad hoc work with the shell than it is to write custom Python for each file. Which totally makes sense: the work's ad hoc, so use an ad hoc tool. Python being slow and grep being a marvel of optimization doesn't really matter here, compared to the dev time you're saving.


I have been doing Python for the last few years, but went back to Perl for this sort of thing recently. You can start with a one-liner, and if it gets complicated, just turn it into a proper script, alongside the Unix commands mentioned. It's just faster when you don't know what you are dealing with yet.
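For instance (purely illustrative, not the workflow described above), a throwaway one-liner to pull a couple of columns out of a whitespace-delimited file, which can later grow into a proper script:

  # print columns 1 and 3 of a whitespace-delimited file (file name is just an example)
  perl -lane 'print "$F[0]\t$F[2]"' somefile.txt | head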


For this kind of thing, it's easiest to bulk-load them into SQLite and do your exploration and early analysis in SQL.
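Roughly along these lines (the file, table, and database names here are placeholders):

  # a minimal sketch: bulk-load a CSV into SQLite from the shell, then explore in SQL
  printf '.mode csv\n.import data.csv data\n' | sqlite3 exploration.db
  sqlite3 exploration.db 'SELECT COUNT(*) FROM data;'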


Thank you for the input; how about this?

  uniq -u movies.csv > temp.csv
  temp.csv > movie.csv
  rm temp.csv


  long_running_process > filename.tmp && mv filename.tmp filename

The rename is atomic; anyone opening "filename" will get either the old version, or the new version. (Although it breaks one of my other favorite idioms for monitoring log files, "tail -f filename", because the old inode will never be updated.)


> Although it breaks one of my other favorite idioms for monitoring log files, "tail -f filename", because the old inode will never be updated

You should look into the '-F' option of tail; it follows the filename, and not the inode.
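e.g., with the filename from the parent's idiom:

  tail -F filename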


You could directly write to uniqMovie.csv in your example. I would do it like below, but ONLY once I am certain it is exactly what I want. Usually I just make one clearly named result file per operation without touching the original.

  uniq -u movies.csv > /tmp/temp.csv && mv /tmp/temp.csv movies.csv


  $ temp.csv > movie.csv
  temp.csv: command not found


He forgot his cat.
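i.e., presumably the second line was meant to be something like:

  cat temp.csv > movie.csv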

