Regular Expressions – Mastering Lookahead and Lookbehind (rexegg.com)
252 points by filipmandaric on March 16, 2018 | 83 comments



Whenever I mention regular expressions to other developers, I usually hear self-critical commentary like "oh, I'm terrible at regex", and rarely from anyone who loves them. I think they're great, though, if you take the time to understand them. They're something like a Swiss Army knife for programming.


I think everyone who doesn't know regex should make learning regex a priority. (However, I find that lookahead and lookbehind in particular do not tend to come in handy very often. So maybe just make a mental note that this exists and then look it up when you need it.) Just learn the basics and maybe take a very quick look at the theory, finite automata (maybe the name puts people off, but it's just a couple of circles connected to other circles with a bunch of characters written on the connecting lines. I'm pretty sure you could explain it all in a few sentences and a few examples). You'll get an intuitive feeling for what you can and can't do with regular expressions.

You don't even have to be able to code to make use of regular expressions. You can use regular expressions when searching and replacing in editors (even slightly barebones editors like gedit or kate). You can transform input data from almost any format into any other format using nothing but your editor and a series of replace statements. (No computations though.)
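For example, a single find/replace pair in an editor's regex mode can restructure data; a made-up example that flips "Last, First" name lists (some editors write the replacement references as $1 and $2 instead of \1 and \2):

    Find:     (\w+), (\w+)
    Replace:  \2 \1

which turns "Doe, Jane" into "Jane Doe".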

I think they should teach regex in high school. Many people working in non-IT office jobs could benefit from knowing regex, and I think it's really quick to learn this. (Now if only Excel's/Word's search/replace supported regex...)


Which version of RegEx? I've "learned" RegEx two or three times and then switched language/platform and had everything I previously learned no longer work reliably.

You might think I am just talking about Microsoft's quirky implementation, but even in the Linux sphere it isn't consistent; see:

http://www.greenend.org.uk/rjk/tech/regexp.html

You take a complex format string which was designed to use the fewest characters instead of being designed with clarity in mind, then you have every major application and library diverge on basic support and spec for features, and then you have all of them hack on support for Unicode in their own unique way.

Regular expressions likely won't ever die, but I for one would happily switch to an alternative with better readability, Unicode support from day one, and fewer niche features to keep things uniform. I'm tired of re-learning regex only to have everything I've learned either be forgotten or stop working the second I switch apps.


That's an overstatement of the differences between various regex engines. They all follow the basic standards, with [] being character classes, () being submatches, * being "0 or more", + being "1 or more", etc.

The two main differences between various engines are which characters are "literal" and which characters are "magic" (Vim's engine is particularly annoying here), and how to write the "convenience character classes" (like what the shorthand for the "alphanumeric character" class is). But these are minor issues; once you've learned how to write a regex, they are trivial to look up.

Knowledge of regular expressions transfers from one engine to another just fine.


I generally include either \v or \V in my vim regex, at which point I no longer have to think about which characters are magic. I suppose this means that I agree that vim's default is annoying here, but imho vim more than makes up for that by making magic configurable.


> They all follow the basic standards, with [] being character classes, () being submatches

You've already described a feature which has different syntax in one of the primary regex dialects I use (Emacs).


Those are the syntactic differences, but there are also semantic ones.

Most notably, the choice operator can either be ordered as in PEGs (if the first branch matches, the others aren't tried) or pick the branch that produces the longest match, as POSIX engines do.
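A quick way to see the difference, using Python's re as a stand-in for the Perl-style ordered behaviour (a POSIX leftmost-longest engine such as awk or egrep would report "ab" for the same input):

    # Ordered alternation: the first branch that matches wins,
    # even though the second branch would match more text.
    import re
    print(re.match(r'a|ab', 'ab').group(0))  # prints "a"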


For the most part, it's just a matter of knowing if you're using POSIX Basic Regular Expressions (BRE), POSIX Extended Regular Expressions (ERE), or Perl regular expressions.

Learn those, or at least the main differences between them, and the vast majority of the regular expression engines in software you use will become more recognizable.
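As a rough illustration of those three flavors (GNU implementations assumed), here is the same "two or more 'ab's" pattern in each:

    BRE:   \(ab\)\{2,\}     groups and counted repeats need backslashes
    ERE:   (ab){2,}         the same metacharacters, unescaped
    Perl:  (ab){2,}         ERE-style syntax plus shorthands such as \d, \w, \s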


You wouldn't be programming in regex anyway, and for small things a Google search for the platform's quirks is usually faster than writing a parser, isn't it?

I'd agree with you if everything weren't so easy to look up.


But don't forget to point at the limitations. For example, you can't use regexps to match an arbitrary but equal number of nested opening and closing parentheses.


I presume you're familiar with the infamous "can you parse HTML with regex"?

https://stackoverflow.com/questions/1732348/regex-match-open...


Another potential problem with regexps is that the underlying finite state machine can grow exponentially in the size of the expression; the classic example is a pattern like (a|b)*a(a|b){n}, which forces a DFA to remember the last n+1 characters and therefore needs on the order of 2^n states.


regex engines like PCRE can:

    ^(\((?1)?\))$
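As a rough sketch of how that recursive pattern behaves (using the third-party Python regex module, which supports PCRE-style (?1) recursion; the built-in re does not):

    import regex  # third-party: pip install regex

    balanced = regex.compile(r'^(\((?1)?\))$')  # group 1: "(" + optional recursion + ")"
    print(bool(balanced.match('((()))')))  # True
    print(bool(balanced.match('(()')))     # False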


If it can then it's not "regular expressions."


But when most of the commonly used "regular expression" libraries aren't actually regular, if you really mean solving something with only true regular expressions, you should probably say so explicitly. The term's been corrupted enough that using it by itself to rule out things like backreferences isn't clear communication.


That's a shame, because regular grammars have a very important property: they're processed with a Finite State Automaton. This makes them blazing fast and quite memory efficient. (Heck, even with a non-deterministic one they're fast.)


Which is a very computer-science-y approach, and totally uninteresting at that, because for any practical purpose or distinction the (?PARNO) syntax is still embedded in the same regexp engine.

If you didn't know, a regexp engine with backreferences (to capture groups) is already stronger than what formal language theory calls "regular expressions".
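A concrete illustration (Python's re; the language of strings of the form a^n b a^n is not regular, yet one backreference recognizes it):

    import re

    # Matches exactly the strings a^n b a^n (n >= 1), which no true
    # regular expression can do.
    pat = re.compile(r'^(a+)b\1$')
    print(bool(pat.match('aaabaaa')))  # True
    print(bool(pat.match('aaabaa')))   # False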


I appreciate the sentiment the same as you, but it is not quick to learn, because you'd either stop short or basically end up learning, well ... patterns: some for different use cases, and then categorizing the many different patterns that achieve the same thing, while the typical pupil already has problems with simple arithmetic, calculus, etc. So the question would be why the patterns aren't abstracted behind a nice composable GUI [1]. Not to mention the confusion around the various ever-so-slightly differing implementations.

Also, you don't want to spoon-feed students; they'll never learn to fish. You would indeed have to go as far as implementing a regex engine or implementing, I don't know, a certain finite automaton in regex. I'm kidding, but all you could realistically achieve would be the usage of a catalogue like commandlinefu or Stack Exchange, unless the whole thing fits into a broader CS syllabus/curriculum.

[1] Provocative statement: sed and awk break the "do one thing and do it well" idea of Unix.


Regex is great for very simple text processing, and I tend to do a lot of that.

The most important parts to get comfortable with are:

* Capture groups and alternation

* Character sets

* Anchors (start and end of line)

* Common escape sequences (digit, word, whitespace)

* Repeat (any, one or more, n-m)

* Common flags (global, multi-line, case-insensitive)

If you need more than that, it's time to start evaluating other tools IMO. A lookahead here and there is okay, but I'd avoid them if possible.
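A small sketch exercising each of those bullet points (the log line and field names here are made up):

    import re

    line = "2018-03-16 ERROR disk full on /dev/sda1"
    pattern = re.compile(
        r"^(?P<date>\d{4}-\d{2}-\d{2})\s+"  # start anchor, digit escapes, repeats
        r"(ERROR|WARN|INFO)\s+"             # capture group with alternation
        r"(?P<msg>[\w /]+)$",               # character set, end anchor
        re.IGNORECASE | re.MULTILINE,       # common flags
    )
    m = pattern.search(line)
    print(m.group("date"), m.group(2), m.group("msg"))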

The best way I found to learn regex was to take a set of inputs that I wanted to match (and a set that I didn't) and play around on https://regex101.com/ until I got a pattern that did what I wanted. You'll very quickly start to learn the above bulletpoints, and before long you'll be able to write patterns without any reference.

If you find that your regexes are getting too large or unwieldy or difficult to understand (despite knowing the above bulletpoints) then you probably need a parser or some other more suitable tool.


They're definitely useful, and I can cobble them together to get lots of otherwise tedious and complex parsing tasks done, but when I come back to them a week later I have no idea what the hell the pile of wingding vomit I wrote was supposed to do.

I find myself writing simpler ones and tying them together with app code just for sanity's sake.


Some regex implementations allow for comments in the string; if yours does not, you can probably make it work with concatenation, like:

  String pattern = "^https?://" // match the protocol at the beginning
                 + "([a-zA-Z.-]+)" // match the machine name
                 + ...
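For engines that do have a comment mode, the free-spacing flag gets you the same effect without string concatenation; a minimal sketch using Python's re.VERBOSE (PCRE's /x is the analogous flag):

    import re

    url_pattern = re.compile(r"""
        ^https?           # match the protocol at the beginning
        ://
        ([a-zA-Z.-]+)     # match the machine name
        (/\S*)?           # optional path
    """, re.VERBOSE)

    print(url_pattern.match("https://example.com/docs").group(1))  # example.com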
Honestly, I use regular expressions because, even in such a format expanded with comments, I haven't seen anything more readable once you get used to regex operators. I guess the closest would be the alternative format in CL-PPCRE. For instance:

  CL-USER> (cl-ppcre:parse-string "\\b\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\b")
  (:SEQUENCE :WORD-BOUNDARY (:GREEDY-REPETITION 1 3 :DIGIT-CLASS) #\.
   (:GREEDY-REPETITION 1 3 :DIGIT-CLASS) #\.
   (:GREEDY-REPETITION 1 3 :DIGIT-CLASS) #\.
   (:GREEDY-REPETITION 1 3 :DIGIT-CLASS) :WORD-BOUNDARY)
But then, any such form can become a mouthful:

  CL-USER> (cl-ppcre:parse-string "((\\b[0-9]+)?\\.)?\\b[0-9]+([eE][-+]?[0-9]+)?\\b")
  (:SEQUENCE
   (:GREEDY-REPETITION 0 1
    (:REGISTER
     (:SEQUENCE
      (:GREEDY-REPETITION 0 1
       (:REGISTER
        (:SEQUENCE :WORD-BOUNDARY
         (:GREEDY-REPETITION 1 NIL (:CHAR-CLASS (:RANGE #\0 #\9))))))
      #\.)))
   :WORD-BOUNDARY
   (:GREEDY-REPETITION 1 NIL (:CHAR-CLASS (:RANGE #\0 #\9)))
   (:GREEDY-REPETITION 0 1
    (:REGISTER
     (:SEQUENCE (:CHAR-CLASS #\e #\E)
      (:GREEDY-REPETITION 0 1 (:CHAR-CLASS #\- #\+))
      (:GREEDY-REPETITION 1 NIL (:CHAR-CLASS (:RANGE #\0 #\9))))))
   :WORD-BOUNDARY)


If you're using PCRE you should also make use of named patterns. They make the expression easier to understand, as you can reuse parts of it (a little like functions), and the matched groups can then be used in your language by name instead of by position, decoupling the usage from the regexp so it is more robust.

http://www.rexegg.com/regex-capture.html#namedgroups
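A minimal sketch of named groups in Python's syntax, (?P<name>...); PCRE and .NET accept (?<name>...) for the same feature:

    import re

    date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
    m = date.search("released on 2018-03-16")
    print(m.group("year"), m.group("month"), m.group("day"))  # 2018 03 16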


That would be nice to have when working in Javascript, making code a lot more readable and easy to update.

Alas, we don't get such luxuries as named groups or static typing... Woe is me.


I wonder if Perl 6 regexes and grammars might show a way forward for more readable pattern matching.


An example would be helpful so here is one https://github.com/moritz/json/blob/master/lib/JSON/Tiny/Gra...


> I find myself writing simpler ones and tying them together with app code just for sanity's sake.

Or just use PEG, parser combinators, or other more readable parsing abstractions


I love regex. Not just for doing pattern matching in code, but for searching and data transformation in editors and tools.

But yes, lots of people do seem to do the "I suck at regex" thing, even when I watch them do crazy long-winded transformations by hand that I then do within seconds. That still doesn't seem to be enough motivation for them to learn regexes properly.


I absolutely unabashedly love Regex.

It's like solving a puzzle that ends up eliminating work (by doing said work) extremely concisely. What better kind of puzzle is there?


> I think they're great though, if you take the time to understand them.

Regular expressions are useful for sure. What is terrible is that every language, shell and platform has a different "style and implementation". On Windows, cmd, PowerShell, C#, SQL Server, etc. all have their own styles. They're similar enough and at the same time different enough to drive you insane. Throw in Linux with its shells, vi(m), Perl, etc., all using their own variants.

But the biggest problem is that regex is prone to the "set it and forget it" issue. It's something we use once in a while and then forget. Was it brackets or parentheses or braces for defining character ranges? Does . or + signify "one or more"? And have you ever tried deciphering someone's undocumented multiline regex? Fun times.


When I was learning my way around regexes, I found a cheat sheet helped. I made my own - you can download it free from here:

https://www.cheatography.com/davechild/cheat-sheets/regular-...


A programmer saying they are terrible at regex is like a mathematician saying they are terrible at algebra.


Regex is useful when you do lots of string processing, like in webdev. Outside of that, I've found uses to be very limited - certainly not worth the upfront time investment. (I mean, sure, one can cobble together something that mostly works with a regex testing tool, but you need to either take a college automata course or work through the Friedl book in detail to get a basic level of proficiency.)


I’ve found otherwise. No single work day passes without me inspecting/converting/refactoring calls and complex expressions via regex. Tools define the way you think and create.


Right - and regex makes you think about function and variable names as strings, instead of at the higher level of abstraction that an IDE with proper refactoring support lets you think at. Regular expressions are not the right tool for that sort of work in 2018.

Look, I used to write web applications in the 1990s with vim on computers with video cards that didn't have X drivers for them. I'm well versed in regular expressions, having used maybe a dozen flavors of them over the last 20 years. Being snooty about how useful regular expressions should be (in your opinion) to the work of every other programmer out there isn't going to change the experiences of those others. I maintain that for web development there are quite often uses, but for scientific software (which is what I do), embedded, non-web CRUD/LoB, and many other applications - it's just not what it used to be.

EDIT: turns out I was mixing up the tone of your comment with that of bmn__ down below, so I replied more belligerently than your comment warranted - no offense, I'm just going to leave it up regardless.


No offense taken. But I'm not sure if this is objective or just another point of view. I do not think in terms of strings; with regex I actually describe what syntax my writing has and use that for conversion. Obviously, I cannot manage randomly mixed styles or an entire grammar that way (though some flexibility exists). But for a homogeneous style it is pretty simple -- imo simpler than getting used to yet another IDE with its cans and cannots.

Since I may use 2-3 [non-web] languages at the same time, viable IDE options may go down to zero. I like how regex and other vim-specific features empower my typing enough to not use what constrains me in my toolset.


I use a regular expression maybe a couple days out of the year; it seems they don't come up very often in real-time biomedical algorithms engineering. I'm sure most embedded programmers feel the same way.


All programmers' text editors include search/replace with regex, as well as external filter commands (many of the commonly used ones are regex-enabled).

If the biomed or embedded programmer deliberately does not make use of that functionality, he is inefficient.


What if needing to search and replace in the first place is inefficient? Also, what if that engineer is female? Is he still inefficient?

I'm finding these comments hilarious. No true programmer!


You are so right! How could you possibly go out of practice in real-time biomedical algorithms engineering when you might use them one or two times out of the year?! I must have missed the memo where all those strings are encoded in a patient's hemodynamic signal.

There are few people more frustrating to algorithms engineers and embedded engineers than people like you who think you are the only type of programmers out there or that the programming you do is somehow superior (despite largely relying on math you learned at a decent secondary school).


I see many people treat regex as an in-app library and argue about that. But in fact it is a text processing tool. It is not important whether you're a biomed dev with an honorable degree or a web dev from 'secondary school'. What is important is how you manage your code and constant data, which is mostly text that doesn't care who you are. How do you manage your code? What do you do when you realize that you need to extract a module? Documentation fixes? Managing any text involves non-trivial search and replace or specific tools, right? Either you know one of these, or you leave your mess as is. If the former, you have to know something anyway. If the latter, well... not much honor.


I agree. If I ever need to use regex, I will dive in, but I don't mainly write software to process text. I mainly write control software for various things like subsea oil wells, oil rigs, gas turbines, etc. It is enough for me to know it exists. I try to spread myself thin and broad, and that gives me control over which areas to dive deep into, depending on the task at hand.


Well, maybe I think that the ability to use the search and replace function in a text editor is a foundational skill. Maybe you're a better programmer? I know programmers who don't even need to write code, but I wouldn't think they were good programmers.


If I'm writing code that uses regexes, it helps me to write at least one test case along with a helper function that makes using the regex easier. E.g., I did the following in Scala recently. Shown is just one of many regexes I used to read SQL Server stored procedures and turn them into functions that would write the Scala code to use them.

  import scala.util.matching.Regex
  import scala.util.matching.Regex.Match

  // Matches parameter declarations like "@COUNTRY_CODE char(2)".
  val inputIdTypWidthPat = new Regex("""(?si)@(\w+)\s+(\w+)\((\d+)\)""", "id", "typ", "width")

  // Test case: a sample input plus the groups the regex is expected to capture.
  val inputIdTypeWidthCheck = RxInputMatchGroups(inputIdTypWidthPat,
    List(InputMatchGroups("""@COUNTRY_CODE char(2),""",
      List(MatchGroups("""@COUNTRY_CODE char(2)""",
        List("COUNTRY_CODE", "char", "2"))))))

  // Helper: returns ((id, type, width), end offset of the match), or (None, 0).
  def getIdTypWidth(s: String): (Option[(String, String, Int)], Int) = {
    val om: Option[Match] = inputIdTypWidthPat.findFirstMatchIn(s)
    if (om.isDefined) {
      if (om.get.groupCount == 3) {
        (Some((om.get.group(1), om.get.group(2), om.get.group(3).toInt)), om.get.end)
      } else (None, 0)
    } else (None, 0)
  }


I’m not terrible at it, but while useful in general, I rarely use regular expressions in the core programs I develop.

I think the automata theory behind them is more important to know than proficiency with specific regular expression implementations.


Mobile programmers don't need regex nearly as much as server-side programmers do.


JS programmers don't need to know how a C pointer works, but they're not doing themselves any favors by being ignorant of it. It's very basic background knowledge.


I meant to say they do not need to go much beyond basics like using wildcards and simple inclusion and exclusion syntax. That's probably all you need more than 50% of the time on that side of the world.


Yeah, that's true. If you understand the quantifiers and matching groups, you basically understand regex. That's all I'm referring to. I'm not saying everyone has to be an expert on implementing them.


I love them and I think I'm great with them. (Though I'm probably not as great at them as a bunch of you reading this comment)


Regexes are awesome as a Swiss Army knife.

Though in my experience, outside some edge cases where the format never changes (like matching a domain name in a URL), a regex is hell to maintain when you come back 3 years later.

There is also always the fun of people trying (and failing) to use regex on email addresses.


.+@.+ seems to be the only one without too many false-negatives


Well, the best solution is to check if .+@(.+) matches and then try to look up what the capture group returned via your DNS resolver. If it has an MX record (or a CNAME to something with an MX), then deliver the mail there.

If you can't resolve the domain part, return an error.
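A rough sketch of that approach, assuming the third-party dnspython package for the MX lookup (the function name here is illustrative, and real delivery code would want narrower error handling):

    import re
    import dns.resolver  # third-party: pip install dnspython

    def plausible_email(address: str) -> bool:
        # Syntactic check: anything, an @, then capture the domain part.
        m = re.fullmatch(r'.+@(.+)', address)
        if not m:
            return False
        try:
            # If the domain has an MX record (possibly via CNAME), accept it.
            return bool(dns.resolver.resolve(m.group(1), 'MX'))
        except Exception:
            return False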


> It is that at the end of a lookahead or a lookbehind, the regex engine hasn't moved on the string. You can chain three more lookaheads after the first, and the regex engine still won't move.

Omg, thank you, that is the insight I needed and now I completely get it. Internet +1 for the day.
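If it helps to see that insight run, here is a small sketch (Python, with made-up password rules): each lookahead is checked from the very same starting position, and only the final \w{8,} actually consumes characters.

    import re

    # at least one digit, one uppercase, one lowercase, 8+ word characters
    pat = re.compile(r'\A(?=.*\d)(?=.*[A-Z])(?=.*[a-z])\w{8,}\Z')
    print(bool(pat.match("Secret123")))  # True
    print(bool(pat.match("secret123")))  # False (no uppercase)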


A use-case for lookarounds that I often use is:

    grep -Po '(?<=...)pattern'
Which also cuts out and prints the relevant part of the line. This saves a trip through cut, awk or perl. (-P is PCRE and -o is print only matched characters, which the lookarounds aren't a part of.)


I often use `\K` instead, which also helps when the lookbehind is variable-length.

    $ echo 'foo=5, Bar=3; x1=83, y=120' | grep -oP '\b[a-z]+=\K\d+'
    5
    120
further reading: https://stackoverflow.com/questions/11640447/variable-length...


I like to add the lookahead:

  grep -oP '(?<=...)pattern(?=...)'
The caveat, however, is that the lookbehind pattern has to be of fixed length (lookaheads can be variable-length).
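For what it's worth, here is that restriction in runnable form (Python's re shown; PCRE behaves the same way, and \K or the third-party regex module are the usual escape hatches):

    import re

    re.compile(r'(?<=foo=)\d+')         # fine: the lookbehind is fixed-width
    try:
        re.compile(r'(?<=[a-z]+=)\d+')  # variable-width lookbehind
    except re.error as err:
        print(err)                      # look-behind requires fixed-width pattern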


Wow. I do this all the time, and the additional step of piping to sed never ceased to annoy me.

You've made my day.


> our pattern becomes:

> \A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d).*

And then these guys wonder why people hate regexes? The "now you have 2 problems" quote fits perfectly in this case.


The regexp syntax was devised for write-only programming at the terminal (at a time when a terminal was a physical object, not a window in your GUI).

The regular formalism is pretty neat though. There are alternative syntaxes (e.g. multiline regexps in Python) that are better suited for complex matchers.


Can you put line returns and indents in a regex?

If you escape them it should work, I guess.


While useful to some, I think advanced REs are like mixing Perl in or playing code golf with production code. They tend to make code harder to read.

My preference in such cases is for multiple separate REs, or longer ones (which can at least be split up in the surrounding code), with each part named or heavily commented. Of course it's always worthwhile to consider non-RE solutions if the problem can be broken down enough.

EDIT: Fixed typo


I agree, with regard to production code. However, I find that I use regular expressions constantly, many times a day, to grep through my code locally looking for particular things. It comes in very handy to know the advanced tools when you are looking for something unusual. Grep may be my most important programming tool next to vim.


Fair enough, but I really think the benefits of advanced regular expressions are underappreciated in non-production and even non-application contexts. Laypeople (and occasionally even developers) are impressed when you show them how to search through a document or file system using a really complicated pattern, where it would have taken several iterations of data manipulation to achieve the same result without using advanced regular expressions.


It'd help a lot if the grammar was actually readable. Combinations like .* don't visually "read" like a single unit, and then to make everything worse you often need a crazy amount of backslashes.

I'm not sure how you could fix that without introducing completely new characters or color-coding parts of the expression though.


The backslashes for escaping are absolutely awful. This is one of the worst things about Java.

It's much better in languages with regex literals like Ruby and JavaScript.


It's especially nicer in Ruby (which got it from Perl) where you can use whatever delimiters you like for regexes, with /abc/, %r"abc", %r{abc}, %r#abc# and so on all being equivalent, so you can just about always pick something that won't clash with the characters in your pattern (You can even use spaces as the delimiters, which looks terrible).


I agree. Usually, I end up leaning on PEGs instead:

https://nim-lang.org/docs/pegs.html


That's pretty bad:

    import pegs
    echo "xzxy" =~ peg"""
    B <- A 'x' 'y' / C
    A <- '' / 'x' 'z'
    C <- C 'w' / 'v'
    """
Stack overflow

Nim needs to let go of its toy parsing algorithm.


I came to know about this wonderful site when I saw this article - https://www.rexegg.com/regex-best-trick.html

example:

    $ # all words except those starting with 'c' or 'C'
    $ echo 'Car Bat cod12 Map foo_bar' | grep -ioP '\bc\w+(*SKIP)(*F)|\w+'
    Bat
    Map
    foo_bar
for more details: https://www.rexegg.com/backtracking-control-verbs.html#skipf...


Which is a fancy way to say `grep -iv '^c'`. EDIT: Oh, I missed that the input was a single line.

I personally feel that control verbs are bad additions to the regexp, even though I do know that they are not a big addition to the regexp engine itself (e.g. they extend naturally from possessive quantifiers like `a++` or atomic groups `(?>foo)`). Most uses of such verbs can be expressed with combined parsers and simpler regexps, in a much simpler and more maintainable way.


sorry, it is not the same as `grep -iv '^c'`

the `-o` option outputs only the matching portion; the regex is meant to extract all words other than those starting with 'c' or 'C'

here's hopefully a better example

    $ # do something with words not surrounded by quotes
    $ echo 'I like "mango" and "guava"' | perl -pe 's/"[^"]+"(*SKIP)(*F)|\w+/\U$&/g'
    I LIKE "mango" AND "guava"


Oh, you are right. I missed that all the words are on the same line. That said, even the original article mentions that this mostly just promotes what would be a capture group into the entire match; I am generally in a position to avoid all uses of control verbs, especially when it only costs one or maybe two lines of additional code that I can fully control and comprehend.


for those who find regex not very readable: https://github.com/VerbalExpressions

    // Create an example of how to test for correctly formed URLs
    var tester = VerEx()
        .startOfLine()
        .then('http')
        .maybe('s')
        .then('://')
        .maybe('www.')
        .anythingBut(' ')
        .endOfLine();


https://github.com/pygy/compose-regexp.js is another option (800 bytes minified and gzipped):

    const {sequence, suffix} = composeRegexp;
    const maybe = suffix("?");
    const oneOrMore = suffix("+");

    const urlMatcher = sequence(
      /^/,
      "http"
      maybe("s"),
      "://",
      maybe("www."),
      oneOrMore(/[^ ]/),
      /$/
    );


Plugging this website as I've found it very useful to learn simple regex with / get over my "oh God I don't know Regex": https://regexone.com/


Is there a regular expression to match regular expressions?


No. One way to convince yourself of this is that regexp capture groups must properly nest: /())(()/ is invalid, for example. Regexps famously cannot match balanced parentheses.


I used this article about a week ago to help me with a web scrape. Good stuff!


That is truly bizarre... I went looking for info on lookaheads just today, and found myself on that very site, and now it's on HN. It's just that ol' HN magic I guess.


You’ve got a problem you think regex can solve, now you’ve got 2 problems.


I wrote a list of json keys that should be taken from a message and :‘<,’>s/\(\S\+\)\s{0,}/t.\1 = message.\1;\r/g

Hey, did you commit already? Still typing?


It looks like your quote characters might be messed up there? Anyway, to parse json on the command line one should just use jq.


Single quotes were modified by the HN engine, right. It is not a command line (are you thinking of sed?); it is the middle of some source code in my editor. I have a JSON-parsed message, and t is the target object.

  a b foo c

  t.a = !!message.a;
  t.b = String(message.b);
  t.foo = message.foo;
  t.c = tonum(message.c);



