
> 3 years ago I wrote a blog post about broken regular expressions in Ruby, ^$ meaning new lines \n.

Are they actually broken? One of the first quirks I learned writing Ruby is that you use \A and \z instead of ^ and $.
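A minimal sketch of the quirk in question (the payload string here is just an illustrative example):

```ruby
# In Ruby, ^ and $ anchor to line boundaries, so an attacker-controlled
# string with an embedded newline can slip past a ^...$ check.
input = "safe\nrm -rf /"

puts(input =~ /^safe$/ ? "matches" : "no match")    # ^ and $ match the first line
puts(input =~ /\Asafe\z/ ? "matches" : "no match")  # \A and \z anchor the whole string
```

Only the second pattern rejects the multi-line input.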

I know better than to blame security vulnerabilities on "bad programmers" rather than on usability problems with a language (or plain old PoLA violations), but changing the meaning of these anchors would likely be a tough migration.



"Broken" is a subjective thing, but quirks that break extremely well-established patterns (like how people validate the beginning and end of a line in regex) are where a lot of security issues tend to crop up.

Considering that, I would venture to say that this is indeed broken, since it's a convention I've only heard about in Ruby. It's a special snowflake in a component that developers use almost exclusively for validation (regex), with a pretty huge gotcha, especially because it appears to function as intended.


Ruby inherited this behavior from Emacs (matz's favorite text editor), among other things.


While I tend to agree with you, how many people have tried to downplay C's deficiencies with "good programmers following good practices don't introduce buffer overflows into their code"? Where has that gotten us?

I'm not saying that I have a solution, but I can't help but see the parallels.


Why is everybody talking about ruby's regex implementation? ^, \A, $, \Z, \z are standard across many languages, based on PCRE syntax. This isn't a ruby problem; this is simply developers who are not experts with regular expressions not fully understanding how to write patterns with the tool they are using. There are no problems with regex implementations, only problems with regex use.

Anybody interested should have a read through http://pcre.org/pcre.txt. The syntax presented here is used in perl, php, ruby, python, and many others.

Also, nobody ever uses \A without the /m flag. You use ^; it has the same meaning unless you specifically add the /m flag to allow ^ to match at the beginning of any line rather than only at the front of the string. This distinction will only bite developers who just add flags like /msig to every regex, because again they don't understand exactly what every flag actually does.


> ^, \A, $, \Z, \z are standard across many languages, based on PCRE syntax.

Uhm... no? ^ and $ have always meant beginning and end of string, not line, unless you turn on a flag. You can check PCRE's documentation if you don't believe me. And then, there's Ruby:

  irb(main):005:0> "foo\nbar".match /o$/
  => #<MatchData "o">


Ruby has half-multiline mode by default, which causes the issue. Namely, /m only changes the behaviour of "." so that it matches \n; ^ and $ work the same either way.

(And incidentally, that would make changing it easier: you could just require that users specify a flag every time and deprecate the flagless form.)
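A quick sketch of this "half multiline" behaviour (assuming Ruby's documented flag semantics):

```ruby
s = "foo\nbar"

# $ matches at the end of a line even without any flag:
p s.match?(/o$/)        # true

# /m affects only ".", letting it match \n:
p s.match?(/foo.bar/)   # false - "." does not match \n by default
p s.match?(/foo.bar/m)  # true  - with /m, "." matches \n
```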


Eww, what the hell. I don't use Ruby, and if I did I might very well, as a PCRE half-expert, fall into this trap based purely on the assumption that Ruby was using PCRE. I just looked at their Regexp class, and it matches PCRE in most regards. The fact that /m makes . match the newline \n is horrible: every PCRE-based implementation uses /s for that, whereas /m only affects ^ and $.

It still falls on the developer to understand the exact flavor of regex available in their language. And yet Ruby is doing a disservice to anybody coming to the language with existing PCRE knowledge by having syntax that is almost an exact match for the PCRE used in many languages... only for them to find out someday that it's not. Harsh.


Regular expressions are dangerously complex for validation code; it's too easy to overlook something or miss some context (like multiline mode; see also riffraff's comment).

So the problem is probably not developer knowledge (note that you got it wrong, too!), but rather that regexps are too hard to get right.


I use ^$ vs \A\z as an interview question for senior engineers. Sadly, it filters a lot of them out.


Sounds like a pretty successful bidirectional filter to me. (As in: you apparently don't want engineers who don't know this detail, and I wouldn't want to work at a company where that kind of trivia is a litmus test.)


It's not trivia. If someone doesn't know the difference, they're going to allow bad data into our database. Large webapps with poor model validations are a security and maintenance nightmare.
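A sketch of the kind of validation bug being described (the pattern names and payload are hypothetical, for illustration only):

```ruby
# A format check written with line anchors vs. string anchors.
USERNAME_BROKEN = /^[a-z]+$/    # only requires that SOME line conforms
USERNAME_FIXED  = /\A[a-z]+\z/  # requires the WHOLE string to conform

payload = "alice\n<script>alert(1)</script>"

p payload.match?(USERNAME_BROKEN)  # true  - the first line passes, junk gets in
p payload.match?(USERNAME_FIXED)   # false - the embedded newline is rejected
```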


Actually, I am reminded of an error that happened which was similar to this. After I left a past company, an engineer flubbed a validation which allowed a subtle bug to go undetected for 10 days which cost the company $500,000.

$50,000/day is an expensive lesson!


Was it abused? Was money stolen?


Hey Homakov, I'm a big fan of yours. :)

No, money wasn't being stolen, but the validation error meant that clients' money was being spent and not being tracked. The company had to eat the costs.



