I use awk and sed (with tr/sort/uniq doing some heavy lifting) for most of my data analysis work. They're a great way to play around with data and get a feel for it before formalizing the analysis in a different language.
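For example, something like this to eyeball the value distribution of one column (the file name and column number here are just placeholders):

    # count distinct values in column 3 of a CSV and show the 20 most common
    awk -F, '{print $3}' data.csv | sort | uniq -c | sort -nr | head -20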
For an interview, I wrote this guy to do a distributed-systems top-ten word count problem. Combined with parallel, it turned out to be much faster than anything else I wrote. It's easier to read when split into a bash script :) [0].
time /usr/bin/parallel -Mx -j+0 -S192.168.1.3,: --block 3.5M --pipe --fifo --bg "/usr/bin/numactl -l /usr/bin/mawk -vRS='[^a-zA-Z0-9]+' '{a[tolower(\$1)]+=1} END { for(k in a) { print a[k],k} }'" < ~/A* | /usr/bin/mawk '{a[$2]+=$1} END {for(k in a) {if (a[k] > 1000) print a[k],k}}' | sort -nr | head -10
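Split into a script, the same idea looks roughly like this (a sketch, not the linked file; the remote host, block size, and the >1000 cutoff just mirror the one-liner above):

    #!/usr/bin/env bash
    # "map" step: treat any run of non-alphanumerics as the record separator,
    # lowercase each word, and emit partial "count word" pairs per chunk
    map_prog='{ a[tolower($1)] += 1 } END { for (k in a) print a[k], k }'

    # "reduce" step: sum the partial counts and keep words seen more than 1000 times
    reduce_prog='{ a[$2] += $1 } END { for (k in a) if (a[k] > 1000) print a[k], k }'

    # fan 3.5M blocks of the input out over local cores plus one remote host,
    # then merge the partial counts and print the top ten
    parallel -j+0 -S 192.168.1.3,: --block 3.5M --pipe --fifo \
        "numactl -l mawk -v RS='[^a-zA-Z0-9]+' '$map_prog'" < ~/A* \
      | mawk "$reduce_prog" \
      | sort -nr \
      | head -10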
[0] https://github.com/red-bin/wc_fun/blob/master/wordcount.sh