Hacker News

If you're going to those lengths then you might as well write it in Go or Python and get the performance boost on larger datasets.

However, the point often missed when people see awk piped into other CLI tools is that the code is intentionally optimised for one-time writing rather than for maximising awk's usefulness.

It's the same with the GP's comment too. In that example the awk code was longer than the awk|sed equivalent.

Don't get me wrong, I am a big fan of awk myself. But sometimes people get so hung up on better usage of awk that they lose sight of the point behind the command line example.



For the code here, which I used on an actual dataset (~500 NYT Headlines 1965--1974), the efficiency gains of a Go / Python rewrite are ... slight.

  real    0m0.176s
  user    0m0.060s
  sys     0m0.060s
On an early-2015 Android tablet running Termux.

I've thrown multimillion-row datasets at awk (usually gawk, occasionally mawk, nawk on OS X, and, hell, busybox on occasion) without any practical performance issues. I'm virtually always writing for one-off or project-based analysis, not live web-scale realtime processing. A second or even ten won't be missed.

I'm also aware that building a pipeline out of grep / cut / sed / tr / sort / uniq / awk is often conceptually nearer at hand. It almost always mirrors how I start exploring some dataset.

But a quick translation to straight awk gives cleaner code, more power, easier conversion to a script, and access to a small library of awk-based utilities I've accumulated.
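As a minimal sketch of that kind of translation (the word-count task, sample data, and file path here are my own invented illustration, not from the thread): the exploratory pipeline and its straight-awk equivalent produce the same tally.

```shell
# Hypothetical sample data, invented for illustration
printf 'alpha beta\nbeta beta alpha\n' > /tmp/words.txt

# Exploratory pipeline: one word per line, then count and rank
tr ' ' '\n' < /tmp/words.txt | sort | uniq -c | sort -rn

# Straight-awk translation: the whole tally in a single program,
# which is easy to grow into a standalone script later
awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
     END { for (w in count) print count[w], w }' /tmp/words.txt | sort -rn
```

Both rank "beta" first with a count of 3; the awk version keeps the counting logic in one place instead of spread across four processes.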

All with far-more-than-adequate performance.

We've spent far more time discussing this than coding, let alone running, it.


> For the code here, which I used on an actual dataset (~500 NYT Headlines 1965--1974), the efficiency gains of a Go / Python rewrite are ... slight.

500 records isn't a large dataset. Not even close. Large would be on the order of millions to billions of rows. And yes, I have had to munge datasets that large on many occasions.

> But a quick translation to straight awk gives cleaner code, more power, easier conversion to a script, and access to a small library of awk-based utilities I've accumulated.

- cleaner: only if you find awk readable. Plenty of people don't. Plenty of people find Go or Python more readable.

- more powerful: again, depends. You wouldn't have multithreading built in the way pipelines do. And Python and Go are undoubtedly more powerful than awk. I'm not knocking awk here, just being pragmatic.

- easier conversion to a script: at which point you might as well skip awk entirely and jump straight to Go or Python (or any other programming language)

- and access to a small library of awk-based utilities I've accumulated: that only benefits you. If you're having to sell the benefits of awk to someone then odds are they don't have that small library already to hand ;)

> We've spent far more time discussing this than coding, let alone running, it.

Some of the larger datasets I've had to process have definitely taken longer to munge than my reply here has taken to type :)

Disclaimer: I honestly don't have a problem with awk; I used to use it heavily 20 years ago. But these days its value is diminishing, and a lot of the awk evangelists seem to miss the point of why awk isn't well represented in blog posts any more. It's both more verbose than pipelining to coreutils and less powerful than a programming language -- it occupies a weird middle ground that doesn't provide much value to most people aside from those who are already invested in the awk language. Some might see that as a loss, but personally I see it as demonstrating the strength of all the other tools we have at our disposal these days.


I don't have your scale problems. You might care not to impose them where they don't exist.


Don't be so ridiculous. You're the one imposing arbitrary problems by saying "everything should be written in awk" and then writing an example that was an order of magnitude longer, in both character count and execution time, than the examples you were suggesting were wrong.

All I'm doing is citing a few reasons why someone might prefer a terser pipeline. I hadn't realised this wasn't supposed to be an objective conversation, and since I have no interest in engaging in pointless language fanboyism, I'm just going to leave you to it.


"everything should be written in awk" is a rather dubious translation of:

> a quick translation to straight awk gives cleaner code, more power, easier conversion to a script, and access to a small library of awk-based utilities I've accumulated.

https://news.ycombinator.com/item?id=24822697

You're not discussing what I've actually written.

Good day.


I addressed those points directly, which you ignored when you accused me of arguing use cases specific only to me. Suffice it to say, your complaint was hypocritical given that your previous statement was very specific to your own use case (as I also pointed out in my aforementioned, but ignored, comment).



