Hacker News
The State of the Awk (lwn.net)
207 points by ftclausen on Nov 18, 2020 | 58 comments



The Awk Programming Language has to be my all-time favorite programming book: https://archive.org/details/pdfy-MgN0H1joIoDVoIC7 Just like the language, it's still very relevant today.


Completely agree. I was absolutely blown away by how much I loved this book. I sat down to read it and ended up losing 4 hours to it!

The book inspired me to build a little course on Awk to (hopefully) share the love: https://github.com/FreedomBen/awk-hack-the-planet.

I gave that course as a talk at Linux Fest Northwest in 2019 and then again in 2020 due to popularity. When Covid hit I recorded videos and put them on YouTube. There are links on the GitHub page above if you want to watch them.

I wasn't intending to self-plug, but I guess it does feel relevant :-)


A good plug is always welcome, especially when you have donated your time to create something useful and put it on GitHub for everyone else to benefit from.

Thanks!


Plug away! Thanks for the weekend reading!


Indeed -- such a beautiful and concise book. After the first one or two chapters you know all of the AWK language, and then the rest of the book is making the knowledge practical with interesting examples (that's my memory of the book, anyway).

I was on a Go spree when I read this book, and after the first few chapters, I thought, "it'd be fun to write a version of AWK in Go". I ended up with https://github.com/benhoyt/goawk

And OP, thanks for posting. I'm the author of the LWN article. Nice to see it re-posted here some six months later. :-)


Slightly off-topic, but Brian Kernighan has a great video about the origins of grep. If you know anything similar for awk, I'd enjoy watching it!

https://www.youtube.com/watch?v=NTfOnGZUZDk


There are some SNOBOL books on archive.org I had a look at not long ago, including a detailed manual/tutorial - fascinating, it takes you back to before 1970. It seems to have been widely used from the mid-1960s on, predating the "Unix Epoch". It's pretty cool, and Awk resembles it in many ways. The manual starts with many pages telling you how to load all the tapes for the program in the correct order!

"It was one of a number of text-string-oriented languages developed during the 1950s and 1960s; others included COMIT and TRAC.

SNOBOL4 stands apart from most programming languages of its era by having patterns as a first-class data type (i.e. a data type whose values can be manipulated in all ways permitted to any other data type in the programming language) and by providing operators for pattern concatenation and alternation. SNOBOL4 patterns are a type of object and admit various manipulations, much like later object-oriented languages such as JavaScript whose patterns are known as regular expressions. In addition SNOBOL4 strings generated during execution can be treated as programs and either interpreted or compiled and executed (as in the eval function of other languages).

SNOBOL4 was quite widely taught in larger US universities in the late 1960s and early 1970s and was widely used in the 1970s and 1980s as a text manipulation language in the humanities. ...

SNOBOL4 supports a number of built-in data types, such as integers and limited precision real numbers, strings, patterns, arrays, and tables (associative arrays), and also allows the programmer to define additional data types and new functions. SNOBOL4's programmer-defined data type facility was advanced at the time—it is similar to the records of the earlier COBOL and the later Pascal programming languages.

All SNOBOL command lines are of the form

    label subject pattern = object : transfer
Each of the five elements is optional. In general, the subject is matched against the pattern. If the object is present, any matched portion is replaced by the object via rules for replacement. The transfer can be an absolute branch or a conditional branch dependent upon the success or failure of the subject evaluation, the pattern evaluation, the pattern match, the object evaluation or the final assignment. It can also be a transfer to code created and compiled by the program itself during a run.

A SNOBOL pattern can be very simple or extremely complex. A simple pattern is just a text string (e.g. "ABCD"), but a complex pattern may be a large structure describing, for example, the complete grammar of a computer language. It is possible to implement a language interpreter in SNOBOL almost directly from a Backus–Naur form expression of it, with few changes. Creating a macro assembler and an interpreter for a completely theoretical piece of hardware could take as little as a few hundred lines, with a new instruction being added with a single line."

https://en.wikipedia.org/wiki/SNOBOL


SNOBOL was the first to use so-called "associative arrays". In SNOBOL, we call them "tables".


Is it still good if I don't plan to use Awk? (or maybe it will persuade me to use it:))


Yes. It is a model of clear concise technical writing and of programming. It is inspiring.


> Several developers in the last few years have written about using AWK in their "big data" toolkit

AWK can be great for simple statistics. I wrote the "num command" tool entirely in gawk, to handle typical math for averages, modes, deviations, kurtosis, etc. It's handy for use on systems where a user can't easily install R/Julia/NumPy/etc.

https://github.com/numcommand/num
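For a flavor of the kind of statistics awk handles comfortably, here is a minimal sketch (illustrative only, not code from num itself) that computes a mean and sample standard deviation from a column of numbers:

```shell
# Illustrative only -- not code from num itself: mean and sample
# standard deviation of a column of numbers, in plain awk.
printf '10\n20\n30\n40\n' | awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
        mean = sum / n
        var  = (sumsq - n * mean * mean) / (n - 1)   # sample variance
        printf "mean=%g sd=%g\n", mean, sqrt(var)
    }'
# prints: mean=25 sd=12.9099
```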


In our organization, we use it for comparing two files in a custom data format and generating an HTML file in the backend of an internal app.

It used an old bash script(!) to generate the file. I was able to learn a little awk and write a 100-line awk script to replicate it. Apart from the efficiency of the script, which was the main aim, it was impressive how readable and understandable the whole thing became. Some subtle bugs became apparent.

It is not a do-all language. Yet, in its small niche, it is very succinct at expressing text-processing ideas. I know it could be done easily in any other language.

The bash script ported to awk: https://github.com/berry-thawson/diff2html/blob/master/diff2...

https://github.com/berry-thawson/diff2html

Original bash counterpart: https://web.archive.org/web/20180612205114/https://github.co...

Edit: Changed link and added new links


> https://github.com/numcommand/num

Nice, is it just me/my browser or is http://www.numcommand.com/ down? I was hoping to see if it’s packaged for my distro.


Looks like the domain is no longer there.


Looks like the domain is squatted now


> I wrote the "num command" tool entirely in gawk

That's pretty cool! It's really packed with features.


I remember writing a hash join in awk at my first job because I got tired of waiting hours for Python to finish. It was probably two orders of magnitude faster in awk.


How did that work? I'd expect some difference but orders of magnitude sounds like there was a substantial difference in the amount of work being performed.


Python is fairly slow and I'm pretty sure it was even slower back then. Parsing tens of millions of lines of text, splitting them, converting to numbers, looking things up in a giant dictionary, etc. adds up. I remember inlining functions by hand to increase performance in Python back then. AWK is optimized for parsing text and has the performance of a compiled language.
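The hash join mentioned upthread maps onto a classic awk idiom: build an associative array from the smaller file while NR == FNR, then probe it while streaming the larger file. A minimal sketch (the file names and data are made up, not from the original job):

```shell
# The classic awk hash-join idiom (file names are made up for this sketch):
# build an associative array from the smaller file, then stream the big one.
printf 'a 1\nb 2\n' > small.txt
printf 'a x\nc y\nb z\n' > big.txt
awk 'NR == FNR { dim[$1] = $2; next }   # first file: build the hash table
     $1 in dim { print $0, dim[$1] }    # second file: probe and join
    ' small.txt big.txt
# prints:
# a x 1
# b z 2
```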


Based on my experience with Python, that Python code used global variables (which is idiomatic for scripts), making it about 10x slower than normal Python.

The other 10x? That also matches my experience with other people’s python.

Someone probably tried to implement an abstraction, or something foolish like that.


I'm curious what you mean when you say global variables are idiomatic in Python scripts. I work on Python scripts on a daily basis and very rarely use globals except for the occasional "const" value.

If globals ever do cause a noticeable slowdown due to dynamic lookup, this can easily be fixed by pointing a local var at the global at the start of a function.
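A minimal sketch of that local-alias trick (the names here are illustrative): global lookups go through the module dict on every access, while locals are slot-indexed, so binding the global to a local once per call can shave overhead off hot loops.

```python
# Names here are illustrative. Global lookups go through the module dict
# on every access, while locals are slot-indexed, so binding a global to
# a local once per call can shave overhead off hot loops.
TABLE = {i: i * 2 for i in range(1000)}

def lookup_global(keys):
    total = 0
    for k in keys:
        total += TABLE[k]             # global dict lookup each iteration
    return total

def lookup_local(keys, table=TABLE):  # bind the global to a local once
    total = 0
    for k in keys:
        total += table[k]             # fast local lookup
    return total

keys = list(range(1000)) * 100
assert lookup_global(keys) == lookup_local(keys)  # same result, less overhead
```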


Globals are idiomatic in all scripting languages except python.

You don’t define a function called main, then manually invoke it at the bottom of your bash, perl, etc... script. In python, you do, partially because there is a severe performance penalty for not doing that.


Ah, I see. I read "which is idiomatic for scripts" as applying to Python scripts and not scripts generally.


Defining a main function and then calling it from the bottom of your script is a mainstay of many people who program in scripting languages. It's certainly how I did things in Tcl. Really, it depends on the type of thing you're writing in the language, moreso than the language itself.


I don't totally understand it, but I was impressed by Ward Cunningham's use of awk to calculate who owes who after a vacation, http://c2.com/doc/expense/


It's really a nice example of what can be achieved when one starts with a specific problem and solves it by generalizing it into something like an ad hoc language for that kind of report. Bonus points for designing it so that it fits exactly the concepts awk already provides by design.

The author explains the idea: “The program keeps a running sum that it clears when it sees a blank line (NF==0). Other calculations are performed in place. The first occurrence of a variable name defines it as that sum. Subsequent occurrences become the stored value.”

The regexp that defines a variable name is simply /^[A-Z]+[A-Z0-9]*$/, and the numbers to sum and the variables are always the first word on the line.
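A rough sketch of that mechanism (this is not Cunningham's actual program, just an illustration of the rules quoted above): blank lines clear the running sum, plain numbers accumulate, and an uppercase first word either captures the sum on first use or recalls the stored value on later uses.

```shell
# Not Cunningham's actual program -- a sketch of the quoted rules only.
printf '10\nLUNCH\n\n5\nLUNCH\n' | awk '
    NF == 0 { sum = 0; next }                 # blank line clears the sum
    $1 ~ /^[A-Z]+[A-Z0-9]*$/ {
        if ($1 in val) sum += val[$1]         # later use: add stored value
        else           val[$1] = sum          # first use: define as the sum
        print $1, "=", val[$1], "(running:", sum ")"
        next
    }
    { sum += $1 }                             # plain number: accumulate
'
# prints:
# LUNCH = 10 (running: 10)
# LUNCH = 10 (running: 15)
```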


It might be neat to introduce relational algebra with awk. At least when I was a teenager this would have made it obvious why people like RDBMSs even though awk is definitely better in some situations.


That’s interesting.

Replacing awk, grep, cut and sort with direct implementations of relational algebra has been on my TODO list for about 10 years.

(V2 would build relational calculus on that)



That sounds like a great idea.


I don't see how: awk is almost completely, but not entirely, unlike sql. The thing I would foremost like is an easier way to handle multiple file input; then you could join the result any way you want.


Learning the basics of AWK (I can't admit to being a super-user) was probably the biggest boost to my productivity in NLP.


Yep, I used it for my thesis. I thought all the scripting for shaping the training data (varying the number of parameters, prefixes, suffixes, transformations to case forms (UpperCase -> ulul), and similar) would take me most of my time, but I ended up writing everything in a day.



I use awk all the time to slice and dice text files. Also useful for delimited files is q or similar.

http://harelba.github.io/q/


I have written so much Awk over the years, it's not even funny.

The one awk script I will never forget, ~30 years ago, was written in 20 minutes when 4 days of C code became untenable.

It was a "hmm.. wonder if I can just push the parsing to AWK" moment that became an epiphany. We didn't have to tell anyone we'd replaced their custom C database loader with a shell script, at least not in those days .. our Friday nights were precious, then.


My choice in the awk family is mawk, because it is still the fastest one of them all.


AWK never ceases to show its usefulness.

At work a colleague of mine was looking for a nice way to get a summary of configured/created resources from a kubectl apply with dry run.

I stepped in and provided an awk one-liner that counts and collects the configured and created resources, suppressing the various unchanged ones...

Suddenly we were able to clearly see what our kubectl apply were going to touch :)
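Not the colleague's actual one-liner, but the general shape of such a summary, assuming dry-run output lines of the form "kind/name verb" (the sample input here is made up):

```shell
# Not the actual one-liner -- just the general shape, assuming dry-run
# output lines of the form "kind/name verb" (the sample input is made up).
printf 'deployment.apps/web configured\nservice/web unchanged\nconfigmap/app created\n' |
awk '$2 != "unchanged" { print; n++ }
     END { print n, "resources will change" }'
# prints:
# deployment.apps/web configured
# configmap/app created
# 2 resources will change
```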


If you are like me two years ago and think that AWK is obscure, you might want to follow my twitter account where I try to teach AWK: https://twitter.com/mawkic

These days I use AWK on a regular basis for small everyday tasks.


Not so much a programming language as a work of art.


https://rosettacode.org/wiki/Category:AWK has a comprehensive list of AWK programs.


Is there a good mnemonic for awk, sed, and grep? I sometimes get sent a command to run when I'm searching for text or file names but I've never gotten a hang of how all three cover their domain.


I love awk. I never used it beyond as a sort of SQL-ish method for querying semi-structured data (and adding some of that missing structure) but it's great just for that sort of usage alone.


I adore awk. I think it's my most esoteric common Linux tool.


Awk is still the best language to perform simple processing of text files. It is very fast and easy to use.


Perl is slower, but has a ton of shortcuts (inspired by awk) to make common text processing tasks even easier. If you already know awk it may not be of much benefit. But CPAN makes it easy to scale up to a more robust program when needed.

I wrote a quick and dirty HCL parser in about an hour using functional programming in Perl, and I hadn't really touched Perl in years. It's still pretty nifty.


Perl one-liners can be incredibly powerful, but I keep using awk because it's a much smaller and simpler language to keep in my head. My trusty old Camel book, 2nd ed, is what, 500+ pages?

For more complicated stuff I usually resort to python or even a compiled language.


Just yesterday evening I showed awk to my boss.


Parsing and working with tabular data without tab autocomplete on column names sounds like something from the stone age.


I’m not sure what you hope to gain from typing “$1<tab><tab>”.


It doesn’t make sense when we have tools that give you autocomplete on column names these days. You don’t have to hunt down the column you need and ensure it’s in the same place every time.


Awk is ancient. Mock it if you like, but it is your loss.


Awk is great but I don’t get why every single example is using tabular data.

Its strength comes from attacking non-tabular data.


Really? Like JSON or unstructured text or what?


Neither my shell scripts, my awk scripts, nor my Python scripts need autocompletion once I'm done writing them.


How do you work with tabular data?


Stdout -> fread from R -> data.table -> wherever it needs to go.

I churn through hundreds of gigs a day.


You mean like Pandas?



