
Butt at what cost!!!


The author seems to be missing the point, in my opinion. While it is certainly true that one can often solve simple, seemingly innocent sub-problems within more general languages, the transition from "I see I can solve this simple problem with regexes!" to "Then I can probably solve this other, almost identical problem as well!", at which point the problem explodes right in your face, is subtle (almost imperceptible to a novice). It would be more robust to go for the right tool (i.e. an (X)HTML parser), and a good learning example besides. On a side note: regular expressions cannot, by definition, parse recursive languages. A regular expression matcher that does is not a regular expression parser but an ugly duckling in the family of context-free grammar matchers. People should learn when and how to use those.
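To make the side note concrete, here is a minimal Python sketch (the snippet and tag names are my own illustration, not from the thread) of how a regex trips over a recursive structure like nested tags. Tracking nesting depth needs a counter, which is exactly what a finite automaton, and hence a true regular expression, does not have.

```python
import re

# Illustrative input: a div nested inside another div.
html = "<div>outer <div>inner</div> tail</div>"

# Non-greedy matching stops at the FIRST closing tag it sees,
# so the nested structure is mangled.
m = re.search(r"<div>(.*?)</div>", html)
print(m.group(1))  # -> "outer <div>inner", not the full outer contents
```

A greedy `.*` fails in the mirror-image way on multiple sibling divs; neither variant can pair each open tag with its own close tag.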


But the original SO question does not imply that they want to solve a more complex problem. The SO asker explicitly asked for opinions, so that’s what they got. However, I absolutely think it is the right choice to choose simpler tools to solve simpler problems, as long as you are aware of the implications.


The article goes as far as to say that a parser is not the right tool.

> Not only can the task be solved with a regular expression - regular expressions are basically the only practical way to solve the problem. Which is why none of the clever answers actually suggest another way to solve the problem.

So no, the author is not missing the point at all.


The point is that a parser could very well use regexes under the hood to perform the tokenization. Because it is the right tool for the job. A language without regex-support might use something like lex to compile a lexer. Of course you can write a character-by-character lexer by hand, but this is just equivalent to what a regex would generate.
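To sketch what "a parser could use regexes under the hood" means, here is a toy Python tokenizer (the token names and grammar are hypothetical, just for illustration). This is essentially what a lex-generated lexer does: compile a pile of regular expressions into one matcher and loop over the input.

```python
import re

# Hypothetical token spec for a tiny expression language.
TOKENS = [
    ("NUMBER", r"\d+"),
    ("NAME",   r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),       # whitespace, discarded below
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKENS))

def tokenize(text):
    # Yield (kind, lexeme) pairs; the named group that matched tells
    # us which token rule fired.
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("x = 40 + 2")))
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '40'), ('OP', '+'), ('NUMBER', '2')]
```

A recursive-descent parser would then consume this token stream; the regexes handle the regular part of the language, the parser handles the recursive part.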

So saying "this is not possible, use a parser instead" completely misunderstands the relationship between lexing and parsing. I wonder how these people think a parser works.


I mean that bit is clearly wrong. An XML/HTML parser is a perfectly practical way to solve the problem.

However, I completely agree that they didn't miss the point. A regex to do this might be fine for hacky things that don't need to be robust (e.g. searching for stuff, measuring stats, one-off scripts, etc.).


Regular expressions can be as robust as you need them to be, just like any other kind of code. They are a DSL to create lexers, and they are exactly as robust (or hacky) as if you wrote the same lexer by hand.


C code can be as robust as you need it to be. So why bother with formal verification, safe C coding standards, Rust, etc?

The answer is that it can be robust, but the effort required to do that is so large that in practice it usually isn't.


Are you arguing that the effort required to make a regex robust and correct is larger than the effort required to make some hand-rolled character-by-character based lexer robust and correct?

Because that sounds counter-intuitive to me. A regex is a higher level DSL for lexing.
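As a concrete version of "a regex is a higher-level DSL for lexing", here is a small Python sketch (my own illustration) of the same identifier rule written both ways. They recognize the same language; the hand-rolled loop is just the regex's automaton spelled out by hand, with more surface area for off-by-one bugs.

```python
import re

# Rule: a C-style identifier, [A-Za-z_] followed by word characters.
IDENT = re.compile(r"[A-Za-z_]\w*")

def ident_by_hand(s, i=0):
    # The same rule, hand-rolled character by character.
    if i >= len(s) or not (s[i].isalpha() or s[i] == "_"):
        return None
    j = i + 1
    while j < len(s) and (s[j].isalnum() or s[j] == "_"):
        j += 1
    return s[i:j]

print(IDENT.match("foo_42 bar").group())  # foo_42
print(ident_by_hand("foo_42 bar"))        # foo_42
```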


That's exactly what I'm arguing. Especially because it's very unlikely that you'd write an XML/HTML parser yourself instead of using somebody else's well-tested library.


OK, but these are two separate questions.

Of course you should use an existing library if it solves the exact problem you have. Don't waste time reinventing the wheel unless you are doing it for educational purposes. Whether such a library uses regexes under the hood is irrelevant as long as it works and is well tested.

But I would certainly like to hear an argument for why you think a regex is less robust than a similar manual character-by-character matcher.


The regex is surely faster for the specific case. I can't say I've seen an XHTML parser offhand that lets me stop parsing after just the start tag. Perhaps a lazy parser could start to compete, but I'm just guessing.


Aren't most XML parsers SAX- or StAX-based? The only time I ran into a library that offered only a full DOM without the underlying event-based parser was whatever browsers consider the JavaScript standard library.
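For what it's worth, the Python standard library is an example of this: `xml.etree.ElementTree.iterparse` gives a pull-style (StAX-like) event stream, so you can bail out as soon as you've seen the start tag you care about. A minimal sketch with made-up input:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document; we only care about the first <entry> start tag.
doc = b"<feed><entry id='1'/><entry id='2'/><rest>unparsed</rest></feed>"

for event, elem in ET.iterparse(io.BytesIO(doc), events=("start",)):
    if elem.tag == "entry":
        print(elem.get("id"))  # attributes are available at the start event
        break                  # stop; the rest of the document is never walked
```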


You're totally right! Many good stock parsers already stream things (more or less).

Still, I'm just making a comment about the overhead... I would hazard a guess that you're going to have a hard time beating a regex with an HTML parser for speed, assuming what you want can be done with both.

This is all irrelevant, because as the OP mentions, the SO question at hand cannot be solved with standards-compliant parsers, because self-closing tags will not be distinguishable.


I believe you could build such a parser out of Parsec, although I am not sure that is exactly what you are going for.


I think it is more of a message than a solution

