
It should be noted that this is not possible with regular expressions in the traditional sense (i.e. regular expressions only matching regular languages): http://math.stackexchange.com/a/181233/21405

Because of back-references, PCRE regular expressions can match non-regular languages as well.
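For a concrete illustration (a quick sketch using Python's re module, which behaves like PCRE in this respect), a single back-reference suffices to match { ww : w nonempty }, a textbook non-regular language:

    import re

    # ^(.+)\1$ matches exactly the strings of the form ww for nonempty w,
    # a language no finite automaton can recognize.
    doubled = re.compile(r'^(.+)\1$')
    assert doubled.match('abcabc')       # 'abc' + 'abc'
    assert not doubled.match('abcabd')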




For some great exposition on what "regular" actually means, check out http://nikic.github.io/2012/06/15/The-true-power-of-regular-...

"Thus the question arises: Can regular expressions match only regular grammars, or can they also match more? The answer to this is both yes and no"...


"regular" has a fairly simple mathematical definition: the set of languages that can be matched with a finite state automaton.

You can think of this as the set of languages that can be matched by an algorithm whose memory use is bounded (i.e., O(1)), no matter what string is being matched.
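As a sketch of what that means (a hypothetical example, not from the original comment): a recognizer for the regular language "strings over {a, b} with an even number of a's" needs only a single bit of state, no matter how long the input is:

    def even_as(chars):
        """O(1)-memory recognizer: one bit of state, input read sequentially."""
        parity = 0
        for c in chars:          # characters arrive one by one
            if c == 'a':
                parity ^= 1      # flip the parity bit; nothing else is kept
        return parity == 0

    assert even_as('abba')
    assert not even_as('abb')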

The following pseudocode is not bounded in memory -- can you guess why?

    bool is_prime(Number n)
    {
        // Number is an arbitrary-precision integer type.
        for (Number i = 2; i < n; ++i) {
            if (n % i == 0) {
                return false;
            }
        }
        return true;
    }


The value i can take up an a priori unbounded amount of memory.

It should be noted that so can the value n; but in defining regular languages this way, we allow our O(1)-memory algorithms to be fed the characters of an a priori unbounded input string one by one, sequentially.


Yeah, the O(1) refers to the size of the EXTRA memory you need, aside from the input. Defining it any other way would be strange. So, for instance, binary search over a sorted list needs O(log n) memory (for the same reason as in the prime algorithm: an index into the array takes O(log n) bits), even though the input is O(n).


Yup. Note that the method of specifically being fed the input sequentially, one by one, means that, for example, "Strings of odd length whose middle character is 'X'" does not comprise a regular language, even though one might naively reason this to be trivial to detect with O(1) memory ("Just go look at the middle character!").


FWIW, (i < n) can be replaced by (i <= floor(sqrt(n))) to avoid unnecessary iterations.


Yeah, obviously, there are lots of ways to improve that code. First of all, it's much more efficient to do (i * i <= n) instead of (i <= sqrt(n)), or at the very least you can cache sqrt(n) so you don't recalculate it every step.

You can also do a separate test for 2, then start at 3 and use i += 2 instead of i++; that cuts the test time in half. Once you've done that, you can observe that all primes except 2 and 3 are of the form 6k +/- 1, which makes your code three times faster than the naive version (sketched below).
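Putting those tweaks together (a rough sketch; Python integers keep the arbitrary-precision spirit of the original pseudocode):

    def is_prime(n):
        """Trial division with the 6k +/- 1 wheel: after handling 2 and 3,
        every remaining candidate divisor is of the form 6k - 1 or 6k + 1."""
        if n < 2:
            return False
        if n in (2, 3):
            return True
        if n % 2 == 0 or n % 3 == 0:
            return False
        i = 5
        while i * i <= n:        # i * i <= n avoids recomputing sqrt(n)
            if n % i == 0 or n % (i + 2) == 0:
                return False
            i += 6               # only test 6k - 1 and 6k + 1
        return True

    assert [p for p in range(50) if is_prime(p)] == \
        [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]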

At this point, you've basically got a simple wheel primality testing algorithm going, so why not go all the way?

I don't think the parent poster was making the point that his was the most efficient algorithm for testing primes. I think his point was that numbers take O(log n) space in CS terms, not O(1), which is not obvious at first.


One of my pet peeves is how PCRE destroyed the definition of "regular" in regular expressions. It has basically made a huge number of programmers illiterate as to what is truly regular in the formal sense.


That wasn't PCRE. That was Perl. PCRE just copied the features that people liked in Perl, and made them available elsewhere.

The same features have also been implemented independently many times. For example Ruby, Java and Python all have independent implementations of regular expressions with a lot of the same features.


When were back-references introduced in sed?


http://sed.sourceforge.net/sedfaq3.html

> Grouping and backreferences: All versions of sed support grouping and backreferences on the LHS and backreferences only on the RHS.


I saw that. I wasn't sufficiently confident as to whether it meant "all historical versions" or just "all current versions."

If the former, then it predated perl by quite a bit, of course.


Perl started with Henry Spencer's regular expression library, which was written in the mid-1980s. That got used in a lot of places, and may have been used in sed. It includes backreferences. I don't know what predates that.

Perl used that as a starting place and began adding extensions such as the difference between .* and .*?.

So using "non-regular thing" as a regular expression predates Perl. But the dialect we standardized on with the features people now expect is due to Perl's influence.


Ah I get you.

I've tried finding some source for early versions but all the links I've found are dead. It's an interesting question!


But why should people care about what is regular in the formal sense? Rather, regular in this context would mean it can be recognized with a restricted type of algorithm, which resembles the formalism.


http://davidvgalbraith.com/how-i-fixed-atom/

http://stackstatus.net/post/147710624694/outage-postmortem-j...

https://swtch.com/~rsc/regexp/regexp3.html

The difference isn't academic. It has huge practical implications, from software that is just annoyingly slow to DoS attacks.

People need to understand that. And stop using non-regular regexes (or better yet: regexes in general) in production code. It increases world-suck.


It destroys mechanical sympathy. Many people can no longer understand why some of their regex calls are so slow or why some are so much quicker.


PCRE doesn't implement that "restricted type of algorithm", and I think that's what the parent is suggesting is wrong; not so much the difference between a formal understanding and an informal intuition of the same subject (I would argue that the latter would be sufficient, but few even have that).

Understanding REs (a little more) formally, one can exploit their properties for e.g. performance and/or concision. For an example of what that understanding gives you, I'd recommend checking out the Ragel state machine compiler.


They were named "regular" in the formal sense. Thus if they are not regular, it is a misnomer. It would have been just as useful to call them "Perl Matching Expressions", but without all the additional confusion about what a regular expression can do.


Is there a standard for which additional features one can add on top of regular expressions in the original limited sense and still be considered a regex?


"regular language" is a mathematical term. You can add any features you want to your matcher, but some features allow you to match languages that are not "regular".

However, most programmers are not mathematicians or computer scientists, and so are effectively using lay terminology which is similar to but not precisely the same as the mathematical terminology. Most programmers only care that their regex library allows them to concisely match strings, not that it allows them to match regular languages.

Thus there are two standards you could follow: the precise mathematical definition of "regular", or the lay definition.


I don't buy the layman argument. Programmers might not be mathematicians or computer scientists, but they should be professionals and held to the same standards as professionals in other fields. For example, a mechanical engineer doesn't use a "layman's" definition of tensile strength when choosing materials for a design.


No, but he probably does use layman's terms when describing molecular lattices and electron shells. If he ever does.

The point here is that users of (colloquial) regular expressions, however professional they are or aren't, are just engaging with the surface of a deeper set of mathematical principles that require greater nuances of meaning to discuss than their applications do.


I basically agree, but I would say that a slightly deeper understanding of regular expressions is a really useful thing for basically all professional programmers to have. If for no other reason than to recognize the "pathological" cases that backtracking implementations (i.e. almost all modern regex implementations) would need until the heat death of the universe to evaluate. A sketch of one such case follows.
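Here's a minimal sketch of such a pathological case (Python's re module is a backtracking engine; the sizes are kept small so it terminates, but each extra character roughly doubles the running time):

    import re
    import time

    # Nested quantifiers: the engine tries exponentially many ways to split
    # the a's between the inner and outer '+' before the match finally fails.
    pathological = re.compile(r'(a+)+$')

    for n in (18, 20, 22):
        s = 'a' * n + '!'        # the trailing '!' forces the failure
        t0 = time.perf_counter()
        pathological.match(s)
        print(n, time.perf_counter() - t0)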


> Programmers might not be mathematicians or computer scientists, but they should be professionals

What? No. Many programmers out there are just working on fun projects in their spare time. I'm sure there exist some aerospace engineers doing that, but it's not many. I don't see why your average bedroom open source programmer should be treated like an aerospace engineer.


Programmers have more freedom from the constraints of reality than mechanical or civil engineers.


What the fuck are you talking about? Those aren't even remotely the same thing at all.


Right. My point (which may not be a particularly great point, but it was what I was trying to convey) is that the lay definition doesn't have any particular tight borders; presumably different regex libraries are capable of matching different things, use different syntax, etc.

(And this is part of why I find it confusing terminology and would prefer "regular expression" only to mean specifications of regular languages in the original formal sense. But, of course, I'm not the king of language, and it's natural that people would speak the way they speak, given the circumstances.)


The lay definition isn't a definition, per se; it's a misappropriation of the term. Since the lay user only cares about text matching, the term for what they need is "text matching syntax" or some such thing.

And surely you don't think those implementing the regular expression engine are lay people. They should know better.


> there are two standards you could follow: the precise mathematical definition of "regular", or the lay definition

Considering this writeup leads with a giant headline saying "Regular Expressions - The Theory"... I think the complaints about how it's not actually a regular expression are even more valid than usual.


I always understood it as being whatever can be straightforwardly tacked onto a typical DFA implementation. I thought that's how people came up with it: whichever extras were the easiest to implement without mucking anything else up, so in a way they "came for free". (It's possible I misunderstood, I don't know for sure.)


Well, sure, but "whatever can be straightforwardly tacked on" is highly subjective, no?


No, it is not.

You can invent a variety of notations to describe a DFA. But you can't change the formal definition of a DFA, or what kinds of things a DFA can match without making it not a DFA.


I'll say to you what I said above, but even more so: Er... I'm not sure what you're saying here, so I'll clarify what I was saying. What I mean is, we all agree on what DFAs are, and so we all agree on what regular languages in the original, Kleene sense are. But once we start admitting non-regular languages into our notion of "regular expression" because they're "straightforwardly tacked on", it's unclear what the limits are supposed to be.

I'm not actually a fan of using "regular expressions" in a sense broader than that of Kleene's regular, DFA-accepted languages, to be clear. That is exactly what I've been questioning.


You can actually quite easily add features to regular expressions that are typically excluded, while still remaining actually regular. Examples would be an `and` or a `not` operator.
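That works because regular languages are closed under complement and intersection. A sketch (all names illustrative): `not` is complementation, which just swaps accepting and non-accepting states; `and` can be run as the product construction, where the pair of current states of two DFAs is itself the state of a single product DFA:

    def accepts_and(d1, d2, s):
        """'and' of two DFAs, each given as (start, accepting, transitions).
        The pair (p, q) is a state of the product DFA, so the intersection
        of two regular languages is still regular."""
        (s1, acc1, t1), (s2, acc2, t2) = d1, d2
        p, q = s1, s2
        for c in s:
            p, q = t1[p, c], t2[q, c]   # both machines step in lockstep
        return p in acc1 and q in acc2

    even_a = (0, {0}, {(0, 'a'): 1, (0, 'b'): 0,    # even number of a's
                       (1, 'a'): 0, (1, 'b'): 1})
    ends_b = (0, {1}, {(0, 'a'): 0, (0, 'b'): 1,    # ends with 'b'
                       (1, 'a'): 0, (1, 'b'): 1})

    assert accepts_and(even_a, ends_b, 'aab')
    assert not accepts_and(even_a, ends_b, 'ab')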


What I am saying is that you can come up with a lot of extensions of the basic regular expression language. But there is no subjectivity at all in whether the extensions can be implemented by a DFA. Either you can, or you can't.


Not really. A graph is either a valid DFA or it is not. At that point, it's a matter of tooling in terms of "straightforward"ness.


Er... I'm not sure what you're saying here, in that it doesn't seem to be engaging with what I was saying, or trying to say, so I'll clarify what I was saying. What I mean is, we all agree on what DFAs are, and so we all agree on what regular languages in the original, Kleene sense are. But once we start admitting non-regular languages into our notion of "regular expression" because they're "straightforwardly tacked on", it's unclear what the limits are supposed to be.

I'm not actually a fan of using "regular expressions" in a sense broader than that of Kleene's regular, DFA-accepted languages, to be clear. That is exactly what I've been questioning.


I think the intent of the parent comments was to say that it's fine to consider a regex a standard "regular expression" in the mathematical sense if it still translates to a DFA and matches a regular language. For instance, mathematical regular expressions typically only include '*' (0 or more), not '+' (1 or more). However, you can still translate a regex including '+' to a DFA. Likewise arbitrary repetition mechanisms like {m,n}, case-insensitivity, and other features that add convenience without actually matching non-regular languages.

On the other hand, backreferences can allow matching non-regular languages.
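A quick sanity check of that kind of equivalence (illustrative only; rewrites like these are why such conveniences still compile to a plain DFA):

    import re

    # 'a+' is sugar for 'aa*', and 'a{2,4}' is sugar for 'aaa?a?'; neither
    # adds power beyond concatenation, alternation, and the Kleene star.
    for s in ('', 'a', 'aa', 'aaa', 'aaaa', 'aaaaa'):
        assert bool(re.fullmatch(r'a+', s)) == bool(re.fullmatch(r'aa*', s))
        assert bool(re.fullmatch(r'a{2,4}', s)) == \
            bool(re.fullmatch(r'aaa?a?', s))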


Ah, yeah, I see now that that is indeed probably what they meant; thanks! In that case, yes, I'm sympathetic to that, using regex to just mean "Anything that ultimately describes a regular language".


I don't think so, not too much anyway. I think the "software engineering" aspects of it would place reasonably strong constraints on what is desirable and what is feasible (for a DFA). That's probably how the (informal?) consensus emerged.


If you can compile any string in the language to a DFA then it's equivalent in power to regular expressions.


Maybe we should stop talking about regular expressions and talk about pattern matching instead...


Sure, but on the pragmatic hand, "regular" isn't super useful unless you're designing a language yourself. If you're parsing something, not understanding "regular" is going to hurt a lot less than attempting to use PCRE to do something it can't.


Regular expressions have implications for space and time complexity. PCRE tosses that away.


Indeed. To support backreferences, "regex" libraries are forced to use algorithms that can be very slow in the worst case.

The sad thing is that the libraries use the same algorithms even if the expression doesn't contain backreferences. A while ago, Stack Overflow had a brief outage because of regular expression performance, although the expression that caused it didn't even use backreferences or other non-regular features:

http://stackstatus.net/post/147710624694/outage-postmortem-j...
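The failure mode is easy to reproduce in any backtracking engine (an illustrative sketch, not the exact pattern from the outage): even with no back-references and no nested quantifiers, looking for trailing whitespace is quadratic, because the engine retries from every whitespace position.

    import re
    import time

    trailing_ws = re.compile(r'\s+$')

    for n in (2000, 4000, 8000):
        s = ' ' * n + 'x'        # plenty of whitespace, none of it trailing
        t0 = time.perf_counter()
        trailing_ws.search(s)    # from each position: run up to 'x', fail, retry
        print(n, time.perf_counter() - t0)   # grows ~4x per doubling of n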

In contrast, Google Code Search - when it still existed - supported regular expression searches over the world's public codebases. One key ingredient making this possible was using only proper regular expressions:

https://swtch.com/~rsc/regexp/regexp4.html


Regular is actually incredibly useful because it allows you to confidently parse or validate (large amounts of) untrusted data without having to worry about DoS attacks.

I'd argue that this is useful, even if you use it as part of a parser that doesn't just parse regular languages, as you'll have a slightly easier time reasoning about the time and space complexity of the construction.


That's true, but it is actually possible to construct true regular expressions that check if a binary string of digits is divisible by an arbitrary integer. Obviously not the quickest or most intuitive way to do it, though.

There's an interesting book that covers some unusual aspects of automata and regular expressions: https://www.amazon.com/Finite-Automata-Regular-Expressions-S...


Sure, for a fixed arbitrary integer M. (Supposing the input is fed in in big-endian order, you just keep track of the remainder modulo M as each new bit comes in, updating it by turning r into (r * 2 + new bit) mod M. Since this only requires finite state, it is regular).
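That construction is short enough to sketch directly (illustrative code; big-endian bits and a fixed modulus M, exactly as described above):

    def divisible_by(m, bits):
        """Finite-state divisibility check for a big-endian binary string:
        the only state kept is the current remainder modulo m."""
        r = 0
        for b in bits:                  # bits arrive one at a time
            r = (r * 2 + int(b)) % m    # r always stays in {0, ..., m-1}
        return r == 0

    assert divisible_by(3, '10101')        # 21 is divisible by 3
    assert not divisible_by(3, '10110')    # 22 is not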

But that won't get you a primality checker. You can't get a primality checker. Primes don't comprise a regular language, neither in unary nor in any nontrivial base.


What do you mean by "nontrivial base"?


Oh, just as opposed to unary. Writing numbers in the ordinary way in any ordinary base, which is to say, bases greater than 1; base 2, base 3, base ten, whatever.


It's worth pointing out that we are talking about representing numbers in unary, though. Or even more directly, plain ol' string length since the RE under discussion doesn't even care what symbols you are using, so "11111" = "abcde" under this RE.


Back-references may not be part of the mathematical definition of regular language, but they are definitely part of the traditional (UNIX) definition of regular expressions.

The regexp in the article is perfectly compatible with UNIX grep:

    $ s=.; for i in `seq 50`; do echo $s | grep -qE '^.?$|^(..+)\1+$' || echo -n $i\ ; s=$s.; done; echo
    2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 
(Tested on Mac OS X)

In this case, bashing PCRE is completely off-topic.


Nobody is bashing PCRE.

> but they are definitely part of the traditional (UNIX) definition of regular expressions.

This is not exactly true. In 1968 or earlier, Ken Thompson wrote the first grep implementation using NFA simulation, an algorithm now known as Thompson's construction. [1][2] There were no back-references in that grep.

In 1975, Al Aho wrote egrep, which used a DFA instead of an NFA [3] (both NFAs and DFAs accept regular languages, but in some cases the DFA will have exponentially more states than an NFA accepting the same language [4]). This added features such as alternation and grouping, which were not supported by grep, but it did not support back-references.

Current GNU grep has a -E switch which accepts extended regular expressions as described in the POSIX standard; back-references are supported there as an extension, and PCRE is available in grep with the -P switch.

So no, back-references are not part of the traditional UNIX definition of regular expressions.

[1] https://en.wikipedia.org/wiki/Thompson%27s_construction

[2] http://dl.acm.org/citation.cfm?doid=363347.363387

[3] http://dl.acm.org/citation.cfm?id=55333

[4] http://cs.stackexchange.com/questions/3381/nfa-with-exponent...



