I gave that course as a talk at Linux Fest Northwest in 2019 and then again in 2020 due to popularity. When Covid hit, I recorded videos and put them on YouTube. There are links on the GitHub page above if you want to watch them.
I wasn't intending to self-plug, but I guess it does feel relevant :-)
A good plug is always welcome, especially when you have donated your time to create something useful and put it on GitHub for everyone else to benefit from.
Indeed -- such a beautiful and concise book. After the first one or two chapters you know all of the AWK language, and then the rest of the book is making the knowledge practical with interesting examples (that's my memory of the book, anyway).
I was on a Go spree when I read this book, and after the first few chapters, I thought, "it'd be fun to write a version of AWK in Go". I ended up with https://github.com/benhoyt/goawk
And OP, thanks for posting. I'm the author of the LWN article. Nice to see it re-posted here some six months later. :-)
There are some SNOBOL books on archive.org that I had a look at not long ago, including a detailed manual/tutorial - fascinating, it takes you back to before 1970. It seems to have been widely used from the mid-60s on, predating the "Unix Epoch". It's pretty cool, and Awk resembles it in many ways. The manual starts with many pages telling you how to load all the tapes for the program in the correct order!
"It was one of a number of text-string-oriented languages developed during the 1950s and 1960s; others included COMIT and TRAC.
SNOBOL4 stands apart from most programming languages of its era by having patterns as a first-class data type (i.e. a data type whose values can be manipulated in all ways permitted to any other data type in the programming language) and by providing operators for pattern concatenation and alternation. SNOBOL4 patterns are a type of object and admit various manipulations, much like later object-oriented languages such as JavaScript whose patterns are known as regular expressions. In addition SNOBOL4 strings generated during execution can be treated as programs and either interpreted or compiled and executed (as in the eval function of other languages).
SNOBOL4 was quite widely taught in larger US universities in the late 1960s and early 1970s and was widely used in the 1970s and 1980s as a text manipulation language in the humanities. ...
SNOBOL4 supports a number of built-in data types, such as integers and limited precision real numbers, strings, patterns, arrays, and tables (associative arrays), and also allows the programmer to define additional data types and new functions. SNOBOL4's programmer-defined data type facility was advanced at the time—it is similar to the records of the earlier COBOL and the later Pascal programming languages.
All SNOBOL command lines are of the form
label subject pattern = object : transfer
Each of the five elements is optional. In general, the subject is matched against the pattern. If the object is present, any matched portion is replaced by the object via rules for replacement. The transfer can be an absolute branch or a conditional branch dependent upon the success or failure of the subject evaluation, the pattern evaluation, the pattern match, the object evaluation or the final assignment. It can also be a transfer to code created and compiled by the program itself during a run.
A SNOBOL pattern can be very simple or extremely complex. A simple pattern is just a text string (e.g. "ABCD"), but a complex pattern may be a large structure describing, for example, the complete grammar of a computer language. It is possible to implement a language interpreter in SNOBOL almost directly from a Backus–Naur form expression of it, with few changes. Creating a macro assembler and an interpreter for a completely theoretical piece of hardware could take as little as a few hundred lines, with a new instruction being added with a single line."
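For comparison, a very loose awk analogue of the subject/pattern/object idea might look something like this (just an illustration of the resemblance, not real SNOBOL semantics; the strings are made up):

    # Match a pattern inside a subject string and replace the matched
    # portion with an object string; branch on success or failure.
    BEGIN {
        subject = "THE QUICK BROWN FOX"
        if (sub(/QUICK/, "SLOW", subject))   # pattern matched, replacement done
            print subject                    # -> THE SLOW BROWN FOX
        else
            print "no match"                 # analogous to a failure transfer
    }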
> Several developers in the last few years have written about using AWK in their "big data" toolkit
AWK can be great for simple statistics. I wrote the "num command" tool entirely in gawk, to handle typical math for averages, modes, deviations, kurtosis, etc. It's handy for use on systems where a user can't easily install R/Julia/NumPy/etc.
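As a toy illustration (not the actual num tool), the running-total style that makes this kind of statistics so compact in awk looks roughly like this, computing the count, mean, and sample standard deviation of a single column of numbers:

    # stats.awk -- one pass over the input, accumulating sums
    { n++; sum += $1; sumsq += $1 * $1 }
    END {
        if (n > 0) {
            mean = sum / n
            printf "n=%d mean=%g\n", n, mean
            if (n > 1)
                printf "stddev=%g\n", sqrt((sumsq - n * mean * mean) / (n - 1))
        }
    }

Run it as awk -f stats.awk data.txt.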
In our organization, we use it for comparing two files in a custom data format and generating an HTML file in the backend of an internal app.
It previously used an old bash script (!) to generate the file. I learned a little awk and was able to write a 100-line awk script to replicate it. Apart from the efficiency of the script, which was the main aim, it was impressive how readable and understandable the whole thing became. Some subtle bugs became apparent.
It is not a do-all language. Yet, within its small niche, it is very succinct at expressing text-processing ideas. I know it could have been done easily in any other language.
I remember writing a hash join in awk at my first job because I got tired of waiting hours for python to finish. It was probably two orders of magnitude faster in awk.
How did that work? I'd expect some difference but orders of magnitude sounds like there was a substantial difference in the amount of work being performed.
Python is fairly slow and I'm pretty sure it was even slower back then. Parsing tens of millions of lines of text, splitting them, converting to numbers, looking things up in a giant dictionary, etc. adds up. I remember inlining functions by hand to increase performance in Python back then. AWK is optimized for parsing text and has the performance of a compiled language.
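For what it's worth, the usual awk hash-join idiom is only a couple of lines; a rough sketch (not the original script), assuming the join key is the first whitespace-separated field in both files:

    # join.awk -- hash join: load the first file into an array,
    # then stream the second file and probe the array.
    FNR == NR { row[$1] = $0; next }    # first file: build the lookup table
    $1 in row { print $0, row[$1] }     # second file: emit matching pairs

Invoked as awk -f join.awk small.txt big.txt, with the smaller file first so the in-memory table stays small.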
I'm curious what you mean when you say global variables are idiomatic in Python scripts. I work on Python scripts on a daily basis and very rarely use globals except for the occasional "const" value.
If globals ever do cause a noticeable slowdown due to dynamic lookup, this can easily be fixed by pointing a local var at the global at the start of a function.
Globals are idiomatic in all scripting languages except python.
You don’t define a function called main, then manually invoke it at the bottom of your bash, perl, etc... script. In python, you do, partially because there is a severe performance penalty for not doing that.
Defining a main function and then calling it from the bottom of your script is a mainstay of many people who program in scripting languages. It's certainly how I did things in Tcl. Really, it depends on the type of thing you're writing in the language, more so than the language itself.
I don't totally understand it, but I was impressed by Ward Cunningham's use of awk to calculate who owes whom after a vacation: http://c2.com/doc/expense/
It's really a nice example of what can be achieved when one starts with a specific problem and solves it by generalizing it into something like an ad hoc language for that kind of report. Bonus points for designing it so that it fits exactly the concepts that awk already provides by design.
The author explains the idea: "The program keeps a running sum that it clears when it sees a blank line (NF==0). Other calculations are performed in place. The first occurrence of a variable name defines it as that sum. Subsequent occurrences become the stored value."
The regexp that defines what counts as a variable name is simply /^[A-Z]+[A-Z0-9]*$/, and the numbers to sum and the variable names are always the first word on the line.
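If it helps, a minimal sketch of that running-sum idea (not Ward's actual program, and ignoring all the report formatting) could look like:

    # Numbers in the first field accumulate into a running sum; a blank
    # line clears it.  The first occurrence of an ALL-CAPS name captures
    # the current sum; later occurrences add the stored value back in.
    NF == 0                  { sum = 0; next }
    $1 ~ /^-?[0-9.]+$/       { sum += $1; next }
    $1 ~ /^[A-Z]+[A-Z0-9]*$/ {
        if ($1 in value) sum += value[$1]   # subsequent occurrence: use stored value
        else             value[$1] = sum    # first occurrence: define it as the sum
    }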
It might be neat to introduce relational algebra with awk. At least when I was a teenager this would have made it obvious why people like RDBMSs even though awk is definitely better in some situations.
I don't see how: awk is almost completely, but not entirely, unlike SQL. The thing I would most like is an easier way to handle multiple-file input; then you could join the results any way you want.
Yep, I used it for my thesis. I thought all the scripting for shaping the training data (varying the number of parameters, prefixes, suffixes, transformations to case forms (UpperCase -> ulul), and similar) would take me most of my time, but I ended up writing everything in a day.
I have written so much Awk over the years, it's not even funny.
The one awk script I will never forget, ~30 years ago, was written in 20 minutes when 4 days of C code became untenable.
It was a "hmm.. wonder if I can just push the parsing to AWK" moments that became an epiphany. We didn't have to tell anyone we'd replaced their custom C database loader with a shell script, at least not in those days .. our Friday nights were precious, then.
If you are like me two years ago and think that AWK is obscure, you might want to follow my Twitter account, where I try to teach AWK: https://twitter.com/mawkic
These days I use AWK on a regular basis for small everyday tasks.
Is there a good mnemonic for awk, sed, and grep? I sometimes get sent a command to run when I'm searching for text or file names, but I've never gotten the hang of how the three divide up their domains.
I love awk. I never used it beyond as a sort of SQL-ish method for querying semi-structured data (and adding some of that missing structure) but it's great just for that sort of usage alone.
Perl is slower, but has a ton of shortcuts (inspired by awk) to make common text processing tasks even easier. If you already know awk it may not be of much benefit. But CPAN makes it easy to scale up to a more robust program when needed.
I wrote a quick and dirty HCL parser in about an hour using functional programming in Perl, and I hadn't really touched Perl in years. It's still pretty nifty.
Perl one-liners can be incredibly powerful, but I keep using awk because it's a much smaller and simpler language to keep in my head. My trusty old Camel book, 2nd ed, is what, 500+ pages?
For more complicated stuff I usually resort to python or even a compiled language.
It doesn’t make sense when we have tools that give you autocomplete on column names these days. You don’t have to hunt down the column you need and ensure it’s in the same place every time.