A Regular Expression Matcher (2007)

dang · on Feb 13, 2020

A thread from 2016: https://news.ycombinator.com/item?id=12199836

2013: https://news.ycombinator.com/item?id=5672875

2011: https://news.ycombinator.com/item?id=2723366

Maybe others?

p.s. these links are just for curious readers; reposts are ok after a year or so—see https://news.ycombinator.com/newsfaq.html.

adriantam · on Feb 13, 2020

This is chapter 1 of "Beautiful Code" (http://shop.oreilly.com/product/9780596510046.do)

drivers99 · on Feb 13, 2020

I thought I had seen it before. Verified you can actually see the whole chapter in a nicer (book vs web) layout in Amazon's "Look inside" feature since it's at the beginning of the book.

https://www.amazon.com/Beautiful-Code-Leading-Programmers-Pr...

glangdale · on Feb 13, 2020

These posts are great for history, but regular expression implementation has moved on considerably from early Thompson implementations whether backtracking or NFAs. There is a considerable body of literature about regex implementation, including many quite convincing implementations and alternate formulations, some of which weren't even done by people at, or from, Bell Labs!

We seem to have something of an Eternal September of regex implementation knowledge (abetted by Russ Cox's amazingly selective bibliography in his posts introducing RE2).

hstaab · on Feb 13, 2020

I saw this in the repo of single file language implementations that was posted yesterday. Glad to see it on the front page now.

Here’s the repo if anyone is interested in checking out the others:

https://github.com/marcpaq/b1fipl

jstimpfle · on Feb 13, 2020

Skimming over it, the implementation looks inefficient:

    int matchstar(int c, char *regexp, char *text)
    {
        do {    /* a * matches zero or more instances */
            if (matchhere(regexp, text))
                return 1;
        } while (*text != '\0' && (*text++ == c || c == '.'));
        return 0;
    }

That can be used to match short strings, but not to grep through a filesystem. A good regex matcher has runtime O(N*M) where N is the length of the regex (typically very short) and M is the length of the scanned text.

ufo · on Feb 13, 2020

A surprising number of regex tools used in the wild employ a backtracking algorithm like this one. Some popular features such as lookaheada, lookbehinds, and backreferences can't always be matched using a finite automaton.

user2994cb · on Feb 14, 2020

An FSA matcher will be O(M) - one state transition per input character.

jstimpfle · on Feb 14, 2020

If you're looking at a specific regex (~FSA), then N is constant... If you have a function that runs an arbitrary regex, then N will vary.

raverbashing · on Feb 13, 2020

Seeing code like this I can understand why the pioneers though C code could be pretty.

But now all I see there is something more akin to a demonstration of ice carving with chainsaws than something that should be used in a production system.

leoh · on Feb 13, 2020

This would be a very good technical interview question — i.e. "implement this simple regex specification."

elteto · on Feb 13, 2020

We software types must be masochists. It took Rob Pike an hour in the comfort of his private office to write this piece of code. How is this _at all_ like a typical software interview?

dmos62 · on Feb 13, 2020

You're right. At the same time, could it be interesting to see how someone approaches (not solves) an intimidating problem?

reificator · on Feb 13, 2020

Also known as hazing.

dekhn · on Feb 13, 2020

There's a common question, "convert an RE into an NFA" (https://en.wikipedia.org/wiki/Thompson%27s_construction)

abecedarius · on Feb 13, 2020

Have you seen that used for interviewing? Yes, some people can do it in a 1-hour interview, but wouldn't you rather ask something that can't be done from memory?

dekhn · on Feb 13, 2020

I know engineers who have asked it, yes. It really seems to be a test for "who was paying attention to that one lecture in CS205 near the end of the semester"

abecedarius · on Feb 13, 2020

Wow. That's quite a tricky test, too -- if you study Thompson's original paper you can see plenty of pitfalls. E.g. his main code would loop forever on a star of a star, and he solves this in an appendix with a preprocessor to rewrite regexes of that kind, only it looked like that rewriting would blow up the size of the regex exponentially in the length of the stack of stars. No hire!

dekhn · on Feb 13, 2020

it doesn't any worse than what Huffman did for his CS classes at UCSC.

glangdale · on Feb 13, 2020

I have seen this one go by, which, from the perspective of someone who has been doing regex implementation for 14 years, more or less, is fairly amusing. It seems unfair, as someone aware of potential pitfalls of backtracking would presumably feel obligated to build a automata-based solution; someone aware also of the potential pitfalls of DFAs (failure to determinize quickly on some nasty patterns) might also feel obligated to build a Thompson or Glushkov NFA solution as well.

It's quite possible you might turn someone competent into a gibbering wreck (they are asking me to implement RE2 or Hyperscan in an hour!) :-)

adambyrtek · on Feb 13, 2020

I used to use this exact problem years ago when interviewing candidates at Google. At that time I thought it was a great question, but now I'm not that sure anymore if this approach to interviewing is the best way to assess candidates.

leoh · on Feb 13, 2020

Super interesting. Could you say more? Perhaps it depends a bit on the role?