
Technically not a regular expression, as (intuitively) the backreference has to “count” the number of 1’s that were matched. Regular expressions are equivalent to finite state machines, which cannot count arbitrarily high (any state machine that can count arbitrarily high needs an infinite number of states, since there are infinite natural numbers).
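
For concreteness, the expression in question is presumably the classic unary backreference trick (an assumption on my part; this is the standard example, sketched in Python):

    import re

    # The classic unary "prime test" (assumed to be the regex under discussion):
    # a string of n ones fails to match iff n is prime.
    #   ^1?$        catches n = 0 and n = 1 (not prime)
    #   ^(11+?)\1+$ catches n = k*m with k >= 2, m >= 2 (composite), because
    #               the backreference \1 must repeat the captured block of k ones.
    NOT_PRIME = re.compile(r"^1?$|^(11+?)\1+$")

    def is_prime(n: int) -> bool:
        return NOT_PRIME.match("1" * n) is None

    print([n for n in range(30) if is_prime(n)])
    # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]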

The 2022 discussion has a more formal proof using the pumping lemma.



Fun fact about the pumping lemma: my best friend's grandfather worked on it (Eli Shamir). When we were taught it during my BSc, the TA, who knew he was the grandson of one of the inventors, asked during the lecture where he was, if he could say something about his grandfather, and it turns out he played hockey from the class to work on something else, embarrassing!

The pumping lemma is a great milestone in this field, so simple, yet so powerful.


Did you mean "played hooky" instead of "played hockey" by chance? :)


> The TA [...] asked during the lecture where he was, if he could say something about his grandfather, it turns out he played [hooky] from the class to work on something else, embarrassing!

If the "working on something else" means "working on some hard research problem", this is not embarrassing, but stylish. :-)


> he played hockey from the class to work on something else

Sorry I can't parse that


Pretty sure they meant "playing hooky" :)


I think hockey is supposed to be hooky.


As in, the grandson wasn't attending when the TA asked him the question. I see.


Agreed. I read the post because the result was so shocking.

TIL that the computer-sciencey definition of RE isn't the only one in use.


When programmers say regular expression they usually mean Perl Compatible Regular Expressions (PCRE). Or I guess PCRE-compatible expressions, since there are other implementations of the same syntax out there. If you go back far enough in time they were once regular expressions in the computer science sense, but then features got added, and now they are incredibly powerful but can take basically infinite memory and time.

https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expres...
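
To illustrate the time part (with a hypothetical pathological pattern, not one from the article), here is the classic exponential-backtracking demo in Python:

    import re
    import time

    # (a+)+ can partition a run of a's in exponentially many ways; a
    # backtracking engine tries them all before giving up on the missing 'b'.
    pattern = re.compile(r"^(a+)+b$")

    for n in (16, 18, 20, 22):
        start = time.perf_counter()
        pattern.match("a" * n)  # no trailing 'b', so the match must fail
        print(n, f"{time.perf_counter() - start:.3f}s")  # ~4x slower per step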


Although general PCRE syntax has clearly won, that’s distinct from the infinite-memory-and-time matter. Engines that don’t use backtracking and can search in linear time (O(regex-size × search-text-size)) have been making a serious comeback in recent years, at only the cost of not supporting lookaround assertions or backreferences. For example, the re2 library, or Rust’s regex crate, both of which are very popular.


Russ Cox has an excellent write-up on it.

"Regular Expression Matching Can Be Simple And Fast" (2007)

https://swtch.com/~rsc/regexp/regexp1.html
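
The core idea is easy to sketch. Here is a toy Python version with a hand-built NFA for (a|b)*abb (my own encoding, not code from the article): instead of backtracking, you track the set of all states the NFA could be in, so each input character costs at most O(number of states), linear in the text overall:

    # Hand-built NFA for (a|b)*abb. delta maps (state, symbol) to the set
    # of possible next states; state 3 is accepting.
    delta = {
        (0, "a"): {0, 1}, (0, "b"): {0},
        (1, "b"): {2},
        (2, "b"): {3},
    }
    ACCEPT = {3}

    def matches(text: str) -> bool:
        states = {0}                       # every state the NFA could be in
        for ch in text:                    # one pass, no backtracking
            states = {t for s in states for t in delta.get((s, ch), ())}
        return bool(states & ACCEPT)

    print(matches("abababb"), matches("abb"), matches("abba"))
    # True True False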


Alas, that approach only works well for basic regular expressions that support Kleene star, concatenation and union.

Regular languages are also closed under intersection, taking prefixes, etc. (see https://en.wikipedia.org/wiki/Regular_language#Closure_prope...), and if you want to expose those operators in your regular expressions, Russ Cox's technique doesn't really work.


I suspect it should be possible to transform two intersected (K*,C,U) expressions into a new (K*,C,U) expression by walking over the two, dropping incompatible terms, reducing sets, and redistributing and expanding sub-expressions to combine the two, though I do not have time to take a serious look at this at the moment.

I also suspect that this might result in a combinatorial explosion of fused terms, but that's irrelevant as we're only worried about whether intersection can be implemented in Russ Cox's method, not whether a regular expression might require several earths or so worth of RAM to hold :)

Nothing in the wiki article mentions taking a prefix, but that just sounds like `(prefix|)`?


You can still explicitly build the finite automaton for your closure-extended regular language. That one might be big, but matching is still linear in the length of the text you are searching through.

As you point out, Russ Cox's method cannot deal with complement and intersection. You can hack it up by putting a compiler in front, but then you might as well compile to finite automata directly. Or you use derivatives: https://en.wikipedia.org/wiki/Brzozowski_derivative or https://www.ccs.neu.edu/home/turon/re-deriv.pdf (or https://well-typed.com/blog/2020/06/fix-ing-regular-expressi... for a Haskell flavour).
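
For what it's worth, a derivative-based matcher is small enough to sketch here. This is a toy Python version (my own, loosely following the rules in the re-deriv paper), with an And node to show how intersection falls out of the derivative rules essentially for free:

    from dataclasses import dataclass

    # Toy Brzozowski-derivative matcher (Python 3.10+). Regexes are small
    # ASTs; And is the intersection operator that linear-time engines like
    # re2 don't expose, yet the derivative rules handle it in one line.
    class Re: pass

    @dataclass(frozen=True)
    class Empty(Re): pass            # matches nothing
    @dataclass(frozen=True)
    class Eps(Re): pass              # matches only the empty string
    @dataclass(frozen=True)
    class Chr(Re): c: str            # matches one literal character
    @dataclass(frozen=True)
    class Cat(Re): l: Re; r: Re      # concatenation
    @dataclass(frozen=True)
    class Alt(Re): l: Re; r: Re      # union
    @dataclass(frozen=True)
    class And(Re): l: Re; r: Re      # intersection
    @dataclass(frozen=True)
    class Star(Re): r: Re            # Kleene star

    def nullable(r: Re) -> bool:
        """Does r match the empty string?"""
        match r:
            case Empty() | Chr(): return False
            case Eps() | Star(): return True
            case Cat(a, b) | And(a, b): return nullable(a) and nullable(b)
            case Alt(a, b): return nullable(a) or nullable(b)

    def deriv(r: Re, c: str) -> Re:
        """deriv(r, c) matches w exactly when r matches c + w."""
        match r:
            case Empty() | Eps(): return Empty()
            case Chr(d): return Eps() if d == c else Empty()
            case Cat(a, b):
                d = Cat(deriv(a, c), b)
                return Alt(d, deriv(b, c)) if nullable(a) else d
            case Alt(a, b): return Alt(deriv(a, c), deriv(b, c))
            case And(a, b): return And(deriv(a, c), deriv(b, c))
            case Star(a): return Cat(deriv(a, c), r)

    def matches(r: Re, s: str) -> bool:
        for c in s:
            r = deriv(r, c)
        return nullable(r)

    # Example: strings over {a,b} that end in 'a' AND have even length.
    sigma = Alt(Chr("a"), Chr("b"))
    both = And(Cat(Star(sigma), Chr("a")), Star(Cat(sigma, sigma)))
    print(matches(both, "ba"), matches(both, "a"), matches(both, "ab"))
    # True False False

(A real implementation would also simplify terms as it goes, or the expressions grow without bound.)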

> Nothing in the wiki article mentions taking a prefix, but that just sounds like `(prefix|)`?

They implicitly mention prefixes, when they talk about 'the trio operations: string homomorphism, inverse string homomorphism, and intersection with regular languages.' If I remember right, the prefix operation is:

Take a language L, restrict it to the strings that have the right prefix: `L' := L & (prefix.*)` then chop off that prefix from every member of L'.

See https://en.wikipedia.org/wiki/Quotient_of_a_formal_language which this is a special case of. The Brzozowski derivative article also explains it in more detail.


> and now they are incredibly powerful but can take basically infinite memory and time.

Which is why the "RE" in the article is excruciatingly slow, given that it needs to perform insane amounts of backtracking. In contrast, "real" regular expression checkers run in linear time.

Also, nitpicking, but they take unbounded time. It's still finite.


They said “basically,” so no need to nitpick. And I’m not sure about unbounded; surely it is bounded by a suitable exponential function?


> They said “basically,” so no need to nitpick.

The inflationary use of "infinite", especially in tech crowds that should know better, deserves pushback.

There are models of computation that actually model infinite computation. A well-studied one is Büchi automata, basically a generalization of finite state automata to infinite sequences.

> And I’m not sure about unbounded; surely it is bounded by a suitable exponential function?

Nope. You can nest them, and for any function you give, you can produce a RE that takes longer than that function to evaluate. (That's literally what "unbounded" means: there is no bound. That doesn't mean it's "infinite". All those computations terminate, which is literally what "finite" means in this context.)


I think with "unbounded but finite" he meant potential infinity rather than actual infinity.


Funnily enough, the original POSIX regular expressions (BRE) were strictly less powerful than regular expressions (in the CS sense), lacking a general alternation operator (though one was provided in ERE).


[flagged]


So what is it that they mean, in your opinion?



And that article says:

> Perl regexes have become a de facto standard, having a rich and powerful set of atomic expressions.


Come now, every time a colleague says they parsed a file with a regex in javascript, do you turn to them and tell them "well, you didn't actually use a _regular_ expression..."


Only every time they claim a major mathematical breakthrough using regular expressions.


Nobody claimed that.


It's true, the article doesn't make that claim. We think of primes as being mysterious and out of reach, and we're conditioned to think "mathematical / computational breakthrough" when we read the words "prime number" in a headline, but this is only because we've found all of them that aren't huge enough to be mysterious. Computing prime numbers is trivial as long as you don't mind them being the same ones everybody else already found.


Not sure what you mean by "trivial". A simple prime-finding algorithm is easy to write down, but it will be very inefficient, especially for large primes (the very ones you use in cryptography).

Coming up with an efficient prime-finding algorithm like Miller-Rabin* is far from trivial.

* Technically it's just a prime-checking algorithm, but you can just generate random numbers until you've verified one of them is prime.
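
For reference, the standard textbook sketch of exactly that in Python (probabilistic Miller-Rabin plus the generate-and-check loop; nothing here is specific to the thread):

    import random

    def is_probable_prime(n: int, rounds: int = 20) -> bool:
        """Miller-Rabin primality test (probabilistic textbook version)."""
        if n < 2:
            return False
        for p in (2, 3, 5, 7, 11, 13):
            if n % p == 0:
                return n == p
        # Write n - 1 as d * 2^s with d odd.
        d, s = n - 1, 0
        while d % 2 == 0:
            d //= 2
            s += 1
        for _ in range(rounds):
            a = random.randrange(2, n - 1)
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(s - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False  # a witnesses that n is composite
        return True

    def random_prime(bits: int) -> int:
        """Generate random candidates until one passes, as described above."""
        while True:
            n = random.getrandbits(bits) | (1 << (bits - 1)) | 1  # odd, full width
            if is_probable_prime(n):
                return n

    print(random_prime(64))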


I do find it useful to know that regular expressions in JS (and many other languages, really) are not actually regular, and why - it allows one to decide whether a given problem can be solved at all using them.


> Regular expressions are equivalent to finite state machines, which cannot count arbitrarily high (any state machine that can count arbitrarily high needs an infinite number of states, since there are infinite natural numbers).

The machine I'm using to type this comment is also a finite state machine, as due to having non-infinite memory, it also cannot count arbitrarily high.

Informally, regular expressions like in Perl are a kind of string rewriting. And many string rewriting rules, if they can be iteratively applied, are Turing-complete (with the usual caveat about infinite storage and time). The usual definition of a regular expression excludes that kind of iterative application, though. It's just a matching pattern; rewriting is technically outside of the concept of a regular expression.


> The machine I'm using to type this comment is also a finite state machine, as due to having non-infinite memory, it also cannot count arbitrarily high.

Right. And if you really want to push that argument further, then there are numbers for which your machine can't decide whether they are prime or not. Most of them, in fact: for all but a finite number of primes and non-primes, your machine can't tell.

For the theory of computability, this is not a useful model. Neither is it for complexity theory. Statements like "my machine can sort in constant time" are arguably true but not useful: they only hold up to a certain collection size, which in a sense makes them both nonsensical and useless. Even concepts like the Chomsky hierarchy, with context-free languages as one of its prominent components, make little sense as long as your computational model is finite state.

Instead, we model them as unbounded, to be able to differentiate between classes of computation problems as well as reason about complexity. The Turing Machine is one of the simple and elegant models, but there are others for other purposes.

I wish that basic education about the theory of computation would stress more why theory treats things like this, to avoid this kind of confusion. (And then people would need to take those classes as well, and not just claim they can write React apps just fine without a degree.)


> And then people would need to take those classes as well, and not just claim they can write React apps just fine without a degree.

Weird tangent. They can write React apps without a degree, can't they? What's wrong with that?

The amount of complexity reasoning you need for something like that is around "this part seems slow, maybe there's a better algorithm somewhere" or "that's a lot of nested loops, maybe I can do with less". Knowing more might help in some pretty rare cases but doesn't seem like a requirement. On the other hand, knowing the complexity of all algorithms and blindly assuming "good complexity = good performance" while never checking is a recipe for disaster.


> On the other hand, knowing the complexity of all algorithms and blindly assuming "good complexity = good performance" while never checking is a recipe for disaster.

Exactly. That's the difference between an actual education and having looked up some Big-O notation on Wikipedia.


> The machine I'm using to type this comment is also a finite state machine, as due to having non-infinite memory, it also cannot count arbitrarily high

Come on, can we drop this useless trivia that floats around? No, computers are Turing machines for all practical purposes, and no, the memory is not the tape, unless your computer is sandboxed from everything. As soon as it has some side-effect channels, its “tape” can be as large as needed. For example, it can use the internet to store petabytes of state. But if we want to be more pedantic, it can control and observe physical things, making the whole universe its state space.

Which is still finite, so yeah, even more pedantically you are right, since mathematical infinity doesn't physically exist. But the actual distinction of a Turing machine is “lazy”: if you don't hit the limit, you may as well reason as if it were a Turing machine.


> The machine I'm using to type this comment is also a finite state machine, as due to having non-infinite memory, it also cannot count arbitrarily high.

I'm not sure how that's relevant. Even given infinite space, there still wouldn't exist a Regular Expression (the CS kind) you could write down in finite time that would match only primes.

At least IIRC; it's been 20+ years since I last took a Discrete Math course :)


> At least IIRC; it's been 20+ years since I last took a Discrete Math course :)

Your memory is correct, and there is a very simple intuition for this. Regular languages are those that can be matched by a machine with bounded memory. So, let's say that bound is "n" bits. Then your machine will not be able to read in unary strings larger than 2^n.


> Even given infinite space, there still wouldn't exist a Regular Expression (the CS kind) you could write down in finite time that would match only primes.

The point is, neither would there exist a non-(Regular Expression) method that could do such, if you're running it on a finite state machine (your computer).


But that argument is not interesting. "We can't count arbitrarily high because the universe is finite." Yeah, duh. (Observable universe, for the pedants.)

It's more interesting to ask: suppose you did have a magical stick of RAM that had infinite storage, and suppose you augmented your computer to store and query that memory. Then -- yes -- your computer could determine whether any number, however large, is prime. An FSM could not.


If you have an infinite stick of RAM, regular expressions wouldn't be FSMs either: they could have infinite state space. They would terminate in finite time, exponential in the size of the prime, but nevertheless would still halt.


A finite state machine can’t have an infinite state space. Also, you can write a finite Turing program that does exactly that, but you can’t write a finite description of an “infinite” state space, which is the whole point.

Ad absurdum, everything would be computable (recognizable) by FSMs, simply by listing all of the infinitely many possibilities. listOfAllPrimes.concat(“|”), here you are...


The point being made is that “regular expressions”, in the original CS meaning, do not have stack space or anything analogous. “Regular expressions” in modern programming practice are often different and more powerful, of course.


No, you could create arbitrarily large FSMs, but there are no infinite finite state machines.


There is: the property of being a Turing machine would only be thrown out when you actually hit the limit. Besides computers being able to modify state not only in their memory but also in, e.g., outside memory, I can also look at my computer as a lazy-taped Turing machine that chugs along writing only to its memory, and when it is about to run off the end, it pauses and waits for me to add additional memory.

This is just useless pedantry that diminishes the very important categorization of Chomsky.


The important point here was "given infinite space".


Personally, I use "regex" for the informal string-matching/rewriting engines found in most languages and "regular expression" for the formal definition.


By which definition, of course, this submission title is a lie.


I don't think the GP is the original poster, so them claiming how they "personally" differentiate between the two does not make the submission title a lie.

We don't know how the original poster differentiates.


A finite self-referential algorithm that checks all numbers for being prime does exist. Running it on a finite computer will eventually hit the bounds of the computer, but that is a limitation outside of the algorithm.

A finite regular expression that checks all numbers for being prime cannot exist already in principle.


Technically, a computer is also just a large finite state machine. Which suggests computability theory is unable to accurately account for the intuitive difference between computers and simpler automatons. I suspect it has something to do with the computational time/space complexity of the algorithms they are able to execute.


If you want something with an actual regular expression, have a look at https://codegolf.stackexchange.com/questions/3503/hard-code-... which is a code golf for regular expressions to test divisibility by 7.


Very cool! The top answer has a 12,731-character regular expression that matches base-10 numbers divisible by 7. (There is an answer with only 105 characters, but that uses .NET "regular expressions", which have some pretty gnarly extensions and are definitely not regular languages.)

The regex itself is generated from a relatively simple DFA based on mod 7 arithmetic on digits using a program (JFLAP apparently, not familiar).

Just from looking at the start of the regex you can see some cool things. For instance, TIL that all numbers of the form 1555....55554 and 85555....5554 are multiples of 7. Fun exercise to prove it.
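
The underlying automaton is simple enough to sketch directly (my own sketch of the idea, not JFLAP output): one state per residue mod 7, where reading digit d sends residue r to (10*r + d) mod 7. The 12,731-character regex is that seven-state DFA flattened into a single expression.

    def divisible_by_7(s: str) -> bool:
        state = 0                             # current value mod 7
        for d in s:
            state = (10 * state + int(d)) % 7
        return state == 0                     # residue 0 is the accepting state

    print(divisible_by_7("154"), divisible_by_7("8554"), divisible_by_7("100"))
    # True True False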


I'm the guy who wrote the original implementation that the question mentions. I have no clue who 'Charles' who asked the question is, or how he found my toy implementation.


Could the AKS primality test, which showed that PRIMES is in P, be implemented as an honest regular expression? Presumably not, because then we'd know that PRIMES is not just in P but is a regular language, which would be too good to be true.

https://en.wikipedia.org/wiki/AKS_primality_test


Via the pumping lemma, you can easily prove that testing whether a number is a prime number cannot be implemented as a regular expression.
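
For anyone curious, a sketch of that standard argument:

    Suppose L = { 1^p : p prime } were regular, with pumping length n.
    Pick a prime p >= n + 2 and write 1^p = x y z with |xy| <= n and
    |y| = k >= 1, so that x y^i z is in L for every i >= 0. Pump to
    i = p - k:

        |x y^{p-k} z| = p + (p - k - 1) * k = (k + 1)(p - k).

    Since k >= 1 and p >= k + 2, both factors are at least 2, so this
    length is composite and the pumped string is not in L. Contradiction.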


Via the context-free pumping lemma, you can even prove that prime numbers don't form a context-free language either.


And via Parikh's theorem, with a one-symbol alphabet, regular and context-free languages are the same.



