Somewhat off topic, but is there a "regex alternation optimizer" out there? And would something like that be worth it?
I've looked through some textmate-style syntax highlighting packages (used in sublime and in github's atom and probably others), and most of them need big (or somewhat big) sets of alternations for a bunch of keywords, and more often than not they are just set up as a list of full keywords with no thought to order or size.
Combining them into something like the below should theoretically be faster while also taking up less space (which is important in web libraries), and I feel like it wouldn't even be all that difficult.
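For illustration, here's a minimal sketch of the kind of prefix-sharing I mean (the helper names are made up, and a real tool would also need to handle word boundaries, ordering guarantees, etc.):

```javascript
// Build a trie from a keyword list, then emit an alternation
// with common prefixes factored out.
function buildTrie(words) {
  const root = {};
  for (const word of words) {
    let node = root;
    for (const ch of word) node = node[ch] = node[ch] || {};
    node[''] = true; // mark the end of a complete word
  }
  return root;
}

function trieToRegex(node) {
  const children = Object.keys(node).filter(k => k !== '');
  if (children.length === 0) return '';
  const alts = children.map(ch =>
    // escape regex metacharacters, then recurse into the subtree
    ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&') + trieToRegex(node[ch])
  );
  const optional = '' in node; // a word ends here but others continue
  if (alts.length === 1 && !optional) return alts[0];
  return '(?:' + alts.join('|') + ')' + (optional ? '?' : '');
}

const pattern = trieToRegex(buildTrie(['for', 'foreach', 'function']));
console.log(pattern); // "f(?:or(?:each)?|unction)"
```

So instead of `for|foreach|function` you'd get `f(?:or(?:each)?|unction)`, which shares the common prefixes and can't stop short on `for` when the input is `foreach`.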
Hmm I've never heard of tries before but that looks pretty close to what I'm looking for.
As for the size aspect, it might not make much of a difference in the gzipped file size, but there are other benefits to smaller raw source as well (though honestly I have no idea whether it would be worth anything; I'm assuming not).
Wow, this is an interesting idea for a library, thanks @hk__2 for sharing!
TL;DR: frak transforms collections of strings into regular expressions for matching those strings. It is available as a command line utility and for the browser as a JavaScript library.
Emacs users have used regexp-opt in their own syntax highlighting (aka font-lock) for a while now, at least. If you wanted to port it, it'd depend on whether you're okay with the license, presumably, though I don't think writing a simple version from scratch would be that hard. (Do not let your eyes jump to the source block in the manual and then assume that it's not what you asked: that example is specifically described as the inefficient equivalent.) http://www.gnu.org/software/emacs/manual/html_node/elisp/Reg...
While I don't know the internals of different regex engines, an optimization that sounds this simple will probably be done by a good regex compiler/engine anyway. A simple |-separated list of keywords also seems a lot less error-prone.
That's where a tool like this could come in handy.
A general purpose regex optimizer can't make assumptions like "you don't care about the ordering of sub-groups", but a tool like this can.
If you are just looking for the fastest way to match any one of, say, 40 keywords, this could make a fast regex that minimizes backtracking.
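To illustrate why ordering matters in a hand-written alternation: with JavaScript's leftmost-first alternation semantics, a short keyword listed before a longer one that shares its prefix will win even when the longer one is present.

```javascript
// Naive ordering: "in" is tried first and matches, so "int" never gets a chance.
console.log('int'.match(/in|int/)[0]);    // "in"

// Longest-first ordering fixes it...
console.log('int'.match(/int|in/)[0]);    // "int"

// ...as does factoring the shared prefix, which is what an optimizer would emit.
console.log('int'.match(/in(?:t)?/)[0]);  // "int"
```

A tool that's allowed to assume you only care about "does any keyword match here" can reorder and factor freely; a general-purpose optimizer can't.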
Like I said, I have no idea if it would make any real difference, but if it did, incorporating something like that into a build process for syntax highlighting packages could improve performance for a bunch of people.
2. It modifies the DOM. That's undesirable in most cases and heavy: each DOM element takes 0.2k..1.0k of memory, and DOM handling in browsers is O(N) (N = number of DOM elements).
Just in case, in Sciter (http://sciter.com) I've added an option to style character runs without DOM modification: Selection.applyMark(runStart, runEnd, name), plus a special ::mark(name) pseudo-element in CSS to style those runs.
Pretty neat that the styling is included in the 2.2k, but it would be more practical if it would apply classes to allow me to style it myself. Glowing text is pretty opinionated.
This is cool, but I prefer code highlighting without requiring visitors to run JavaScript. It seems unnecessary when tools like Pygments [1] do the highlighting once and output html/css.
The content might be dynamic, not static. For example, an SO clone might allow users to type a comment including code in Markdown and then need to preview it.
Surely highlighting it on the server is the waste in this case? The library is only 2174 bytes (not even gzipped).
Let's say you used the smallest markup possible for each highlighted token. Something like <i class="x">[token here]</i>. You'd only be able to serve up at most 120 server-highlighted tokens before this JS library becomes a smaller payload, and that's without even defining the CSS styling or considering tokens longer than 1 character.
120 isn't even enough tokens to highlight the tiny code snippet at the top of this demo page. The same tiny snippet highlighted with Pygments comes out to 5819 bytes of markup alone (no styling) – already more than 2.5 times the size of this whole library. Plus you can highlight any number and size of code snippets while just serving the library once... which one is wasteful again? :)
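A back-of-envelope version of that break-even calculation, assuming the `<i class="x">...</i>` wrapper above and 1-character tokens:

```javascript
// Markup overhead per server-highlighted token (sketch; assumes the
// library's stated 2174-byte size and the minimal wrapper from above).
const wrapper = '<i class="x"></i>'; // 17 bytes of markup per token
const perToken = wrapper.length + 1; // + 1 byte for the token itself
const breakEven = Math.floor(2174 / perToken);
console.log(breakEven); // 120 tokens
```

Past that point, every additional highlighted token makes the server-rendered page heavier than just shipping the library once.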
It's not the most high-quality syntax highlighting, but it provides decent highlighting with minimal setup and works well with user generated content where the programming language isn't known.
It's not going to revolutionise syntax highlighting, but I think it has plenty of use cases.
Is it really the regex causing the lag? I haven't jumped into the code yet, but I would expect (or at least hope) the highlighter runs once and the lag is due to whatever DOM/styles it's generating.
It's because there's a wrapper div with "overflow: auto" around the entire page. This means you're scrolling inside the div (unaccelerated, need to repaint) instead of scrolling the entire document (accelerated, no need to repaint, can just move around pre-rendered tiles). Removing the "overflow: auto" makes scrolling smooth.
I don't think the intent is to flag up invalid syntax/usage, but to make sensible guesses about what strings should be highlighted, to aid readability.
There are few cases where having wchar_t in a code snippet would not indicate some sort of keyword/type annotation.
* [1] Disclaimer: in my experience, and admittedly not using JavaScript. I have recently confirmed that minimal hand-implemented state machines generally beat regexps in Go.
What we are actually looking at here is the fact that some words get highlighted in most programming languages, even though the language itself is unknown.
Usually the user has to manually select which programming language is being used and THEN the words get highlighted.
It would be nice to have some sensible default colors for different things (non-keyword variables, numbers...). I don't think it would blow up the library size, but it would make a huge difference in the readability of the code.