Perl One-Liners Explained, Part VII: Handy Regular Expressions

zby · on Nov 10, 2011

These regexps might be useful as first approximations - but they are not as robust as the article suggests they are (for example, numbers can contain a decimal point or a decimal comma depending on your language; parsing email addresses is a subject for an essay, and I've heard the correct Perl regexp for that takes about a page of text). This article is harmful - because those who can benefit from this kind of basic regexp example are also those who will not understand the limitations.
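The decimal point versus decimal comma issue can be sketched roughly as follows. The patterns and the names `looks_like_number_naive` and `looks_like_number_locale` are hypothetical, made up for illustration; they are not taken from the article, and the second one is still only an approximation.

```perl
use strict;
use warnings;

# Hypothetical "looks like a number" check: optional sign, digits,
# optional fractional part introduced by a decimal point only.
sub looks_like_number_naive {
    my ($s) = @_;
    return $s =~ /^[+-]?\d+(?:\.\d+)?$/ ? 1 : 0;
}

# Same idea, but also accepting a decimal comma. Still an approximation:
# it ignores thousands separators, exponents, and so on.
sub looks_like_number_locale {
    my ($s) = @_;
    return $s =~ /^[+-]?\d+(?:[.,]\d+)?$/ ? 1 : 0;
}

# Print how each check treats a few sample strings.
for my $s ('42', '3.14', '3,14', '1e5') {
    printf "%-5s naive=%d locale=%d\n", $s,
        looks_like_number_naive($s), looks_like_number_locale($s);
}
```

The naive check rejects `3,14`, which is a perfectly ordinary number in many locales, and both checks reject `1e5`.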

jsrn · on Nov 10, 2011

> because those who can benefit from this kind of basic regexp example are also those who will not understand the limitations.

I share your sentiment. The article is certainly useful for learning regexps. But IMHO it should also point to the correct way of doing things - often, the correct way is using a module and thus the resulting code is not much longer than the code in the article. For email address validation:

    use Email::Valid;
    # address() returns the address on success and undef on failure
    print(Email::Valid->address('john@example.com') ? 'valid' : 'invalid', "\n");

as a oneliner:

    perl -MEmail::Valid -E"say(Email::Valid->address('john@example.com') ? 'valid' : 'invalid');"

Other than installing the Email::Valid module with a

    cpanm Email::Valid

it is not much longer than the example in the article.

https://metacpan.org/module/Email::Valid

pkrumins · on Nov 10, 2011

Good point. I updated the article with a note about Email::Valid.

DanielShir · on Nov 10, 2011

I guess you guys just read the code and not the text:

`Notice that I say "looks like". It doesn't guarantee it is an email address.`

And yeah, you can find the full regular expression in the back of one of O'Reilly's Perl books (the regular expression handbook I believe).

It's nice to see perl code from time to time, even if it's just one line :)

jsrn · on Nov 10, 2011

Point taken - what I primarily wanted to show is that the correct way is not much longer than the "looks like" solution. Yes, the Regexp is long, but it is nicely encapsulated in the Email::Valid module.

zby · on Nov 10, 2011

Yeah - I've read only the first sentence. I think many of those that will find that article from google and even use that code will also not pay much attention to that weak disclaimer. Also these were not the only problems with his code - see the comments at that page (in particular: http://www.catonmat.net/c/35784).

But the more important point is that an article that sounds so authoritative should present much higher quality.

dcosson · on Nov 10, 2011

But I doubt someone new to regexps would understand what "looks like" means in that context. For instance they might think "ok, so something like 'abc@efg.xyz' matches, even though it's not a real email address." They might not think to consider that a full sentence like "hey, I'll see you tomorrow @ 2. Can't wait!" also matches.

That said, perl one-liners are certainly useful so thanks to the OP for putting this together. I just think it would add a lot of value to include examples of where one is likely to go wrong.
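The two cases above can be checked with a small sketch. The pattern here is an assumed stand-in for a "looks like an email" check (something, an `@`, something, a dot, something); it may differ from the article's exact regexp, and `looks_like_email` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Assumed "looks like an email" pattern, not a validator:
# one or more characters, an @, more characters, a literal dot, more characters.
sub looks_like_email {
    my ($s) = @_;
    return $s =~ /.+\@.+\..+/ ? 1 : 0;
}

my @samples = (
    'abc@efg.xyz',
    q{hey, I'll see you tomorrow @ 2. Can't wait!},
    'no at sign here',
);

# Report which samples the loose pattern accepts.
for my $s (@samples) {
    printf "%-45s %s\n", $s, looks_like_email($s) ? 'matches' : 'no match';
}
```

Both the plausible address and the full sentence match, because `.+` happily consumes spaces and punctuation on either side of the `@`.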

JoachimSchipper · on Nov 10, 2011

From http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html:

Mail::RFC822::Address is a Perl module to validate email addresses according to the RFC 822 grammar. (...) Implementing validation with regular expressions somewhat pushes the limits of what it is sensible to do with regular expressions, although Perl copes well:

    (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
    )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
    \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
    ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[
    \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0

    ** SNIP: 76 lines **

    ?:\r\n)?[ \t])*))*)?;\s*)

tikhonj · on Nov 10, 2011

I think this collection clearly shows the simplicity and utility of short, potentially imperfect regular expressions. Reading and writing expressions like this--even if you never use them in your code--is a skill that almost every programmer would benefit from.

JadeNB · on Nov 10, 2011

In #117, the sentence ends too soon:

    The second regex matches <em>hello</em> because.

pkrumins · on Nov 10, 2011

Fixed now!

vph · on Nov 10, 2011

This is really the essence of Perl and also its "problem".

If you need a book to explain one-liners, it will be a challenge to comprehend a program with 20 such one-liners.

JadeNB · on Nov 10, 2011

I approach these articles as a way to show off the expressive abilities of a language, not to provide code snippets for cut-and-paste.

As others have mentioned (and as the articles themselves repeatedly mention), most of the time, if you have a common problem, you should find and use a CPAN module already built to solve it.

I have never used a code snippet from a one-liner compilation in one of my programs, but I have learned a lot of new constructs by reading them.

EDIT: Or, what tikhonj (http://news.ycombinator.com/item?id=3219843) said.

fennecfoxen · on Nov 10, 2011

Man, I like perl, and learning how to operate these sorts of regular expressions is useful, but a lot of the "one-liners" are infested with magic variables and the sort of freakish syntactical constructs which give the language a bad name.

Good Perl is basically 72% of Ruby. (Less syntactic sugar. Less-structured reflection. And mildly crufty sigils - not that you'll notice those after your second week, though.)