Somewhat off topic, but is there a "regex alternation optimizer" out there? And would something like that be worth it?
I've looked through some textmate-style syntax highlighting packages (used in sublime and in github's atom and probably others), and most of them need big (or somewhat big) sets of alternations for a bunch of keywords, and more often than not they are just set up as a list of full keywords with no thought to order or size.
Combining them into something like the below should theoretically be faster while also taking up less space (which is important in web libraries), and I feel like it wouldn't even be all that difficult.
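For illustration, here's a minimal sketch of the kind of prefix-sharing I mean (the helper names are made up, and a real tool would also need to handle word boundaries, ordering guarantees, etc.):

```javascript
// Build a trie from a keyword list, then emit an alternation
// with common prefixes factored out.
function buildTrie(words) {
  const root = {};
  for (const word of words) {
    let node = root;
    for (const ch of word) node = node[ch] = node[ch] || {};
    node[''] = true; // mark the end of a complete word
  }
  return root;
}

function trieToRegex(node) {
  const children = Object.keys(node).filter(k => k !== '');
  if (children.length === 0) return '';
  const alts = children.map(ch =>
    // escape regex metacharacters, then recurse into the subtree
    ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&') + trieToRegex(node[ch])
  );
  const optional = '' in node; // a word ends here but others continue
  if (alts.length === 1 && !optional) return alts[0];
  return '(?:' + alts.join('|') + ')' + (optional ? '?' : '');
}

const pattern = trieToRegex(buildTrie(['for', 'foreach', 'function']));
console.log(pattern); // "f(?:or(?:each)?|unction)"
```

So instead of `for|foreach|function` you'd get `f(?:or(?:each)?|unction)`, which shares the common prefixes and can't stop short on `for` when the input is `foreach`.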
Hmm I've never heard of tries before but that looks pretty close to what I'm looking for.
As for the size aspect, it might not make much of a difference in the gzipped file size, but there are other benefits to smaller raw source as well (though honestly I have no idea whether it would be worth anything; I'm assuming not).
Wow, this is an interesting idea for a library, thanks @hk__2 for sharing!
TL;DR: frak transforms collections of strings into regular expressions for matching those strings. It is available as a command line utility and for the browser as a JavaScript library.
Emacs users have used regexp-opt in their own syntax highlighting (aka font-lock) for a while now, at least. If you wanted to port it, it'd depend on whether you're okay with the license, presumably, though I don't think writing a simple version from scratch would be that hard. (Do not let your eyes jump to the source block in the manual and then assume that it's not what you asked: that example is specifically described as the inefficient equivalent.) http://www.gnu.org/software/emacs/manual/html_node/elisp/Reg...
While I don't know the internals of different regex engines, an optimization that sounds this simple will probably be done by a good regex compiler/engine anyway. A simple |-separated list of keywords also seems a lot less error-prone.
That's where a tool like this could come in handy.
A general purpose regex optimizer can't make assumptions like "you don't care about the ordering of sub-groups", but a tool like this can.
If you are just looking for the fastest way to match any one of, say, 40 keywords, this could make a fast regex that minimizes backtracking.
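To illustrate why ordering matters in a hand-written alternation: with JavaScript's leftmost-first alternation semantics, a short keyword listed before a longer one that shares its prefix will win even when the longer one is present.

```javascript
// Naive ordering: "in" is tried first and matches, so "int" never gets a chance.
console.log('int'.match(/in|int/)[0]);    // "in"

// Longest-first ordering fixes it...
console.log('int'.match(/int|in/)[0]);    // "int"

// ...as does factoring the shared prefix, which is what an optimizer would emit.
console.log('int'.match(/in(?:t)?/)[0]);  // "int"
```

A tool that's allowed to assume you only care about "does any keyword match here" can reorder and factor freely; a general-purpose optimizer can't.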
Like I said, I have no idea if it would make any real difference, but if it did, incorporating something like that into a build process for syntax highlighting packages could improve performance for a bunch of people.
2. It modifies the DOM. That's undesirable in most cases and heavy: each DOM element takes 0.2k..1.0k of memory, and DOM handling in browsers is O(N) (N = number of DOM elements).
Just in case, in Sciter (http://sciter.com) I've added an option to style character runs without DOM modification: Selection.applyMark(runStart, runEnd, name), plus a special ::mark(name) pseudo-element in CSS to style those runs.
Pretty neat that the styling is included in the 2.2k, but it would be more practical if it would apply classes to allow me to style it myself. Glowing text is pretty opinionated.
This is cool, but I prefer code highlighting without requiring visitors to run JavaScript. It seems unnecessary when tools like Pygments [1] do the highlighting once and output html/css.
The content might be dynamic, not static. For example, an SO clone might allow users to type a comment including code in Markdown and then need to preview it.
Surely highlighting it on the server is the waste in this case? The library is only 2174 bytes (not even gzipped).
Let's say you used the smallest markup possible for each highlighted token. Something like <i class="x">[token here]</i>. You'd only be able to serve up at most 120 server-highlighted tokens before this JS library becomes a smaller payload, and that's without even defining the CSS styling or considering tokens longer than 1 character.
120 isn't even enough tokens to highlight the tiny code snippet at the top of this demo page. The same tiny snippet highlighted with Pygments comes out to 5819 bytes of markup alone (no styling) – already more than 2.5 times the size of this whole library. Plus you can highlight any number and size of code snippets while just serving the library once... which one is wasteful again? :)
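A back-of-envelope version of that break-even calculation, assuming the `<i class="x">...</i>` wrapper above and 1-character tokens:

```javascript
// Markup overhead per server-highlighted token (sketch; assumes the
// library's stated 2174-byte size and the minimal wrapper from above).
const wrapper = '<i class="x"></i>'; // 17 bytes of markup per token
const perToken = wrapper.length + 1; // + 1 byte for the token itself
const breakEven = Math.floor(2174 / perToken);
console.log(breakEven); // 120 tokens
```

Past that point, every additional highlighted token makes the server-rendered page heavier than just shipping the library once.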
It's not the most high-quality syntax highlighting, but it provides decent highlighting with minimal setup and works well with user generated content where the programming language isn't known.
It's not going to revolutionise syntax highlighting, but I think it has plenty of use cases.
Is it really the regex causing the lag? I haven't jumped into the code yet, but I would expect (or at least hope) the highlighter runs once and the lag is due to whatever DOM/styles it's generating.
It's because there's a wrapper div with "overflow: auto" around the entire page. This means you're scrolling inside the div (unaccelerated, need to repaint) instead of scrolling the entire document (accelerated, no need to repaint, can just move around pre-rendered tiles). Removing the "overflow: auto" makes scrolling smooth.
I don't think the intent is to flag up invalid syntax/usage, but to make sensible guesses about what strings should be highlighted, to aid readability.
There are few cases where having wchar_t in a code snippet would not indicate some sort of keyword/type annotation.
* [1] Disclaimer: in my experience, and admittedly not using JavaScript. I have recently confirmed that minimal hand-implemented state machines generally beat regexps in Go.
What we are actually looking at here is the fact that some words get highlighted in most programming languages, even though the language itself is unknown.
Usually the user has to manually select which programming language is being used and THEN the words get highlighted.
It would be nice to have some sensible default colors for different things (non-keyword variables, numbers...). I don't think it would blow up the library size, but it would make a huge difference in the readability of the code.