Microlight.js, a code highlighting library (asvd.github.io)
214 points by xpostman on May 30, 2016 | 62 comments


Somewhat off topic, but is there a "regex alternation optimizer" out there? And would something like that be worth it?

I've looked through some textmate-style syntax highlighting packages (used in sublime and in github's atom and probably others), and most of them need big (or somewhat big) sets of alternations for a bunch of keywords, and more often than not they are just set up as a list of full keywords with no thought to order or size.

Combining them into something like the below should theoretically be faster while also taking up less space (which is important in web libraries), and I feel like it wouldn't even be all that difficult.

    de(bugger|cimal|clare|f(ault|er)?|init|l(egate|ete)?)
Is there something out there which can do this, and would it even be worth it or is this something best left to the JIT/optimizer of the regex engine?


I'm not aware of any tools out of the box, but you could trivially build your example regex from a Trie which is also easy to construct.

https://en.wikipedia.org/wiki/Trie#Algorithms
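
A minimal sketch of that idea in Python, not taken from any existing tool: insert the keywords into a nested-dict trie, then serialize each node as an alternation of its children, with a trailing `?` where a shorter keyword ends partway down. The helper names `build_trie` and `trie_to_regex` are made up for illustration, and the keyword list is my guess at the words behind the example regex above.

```python
import re

def build_trie(words):
    """Build a nested-dict trie; the key "" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root

def trie_to_regex(node):
    """Serialize a trie node into a regex fragment with shared prefixes factored out."""
    is_end = "" in node
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if not branches:
        return ""
    if len(branches) == 1 and not is_end:
        return branches[0]  # a single continuation needs no group
    body = "(?:" + "|".join(branches) + ")"
    # If a word also ends here, everything after this point is optional.
    return body + "?" if is_end else body

# Assumed keyword list, reconstructed from the example regex.
keywords = ["debugger", "decimal", "declare", "def", "default", "defer",
            "deinit", "del", "delegate", "delete"]
pattern = trie_to_regex(build_trie(keywords))
print(pattern)
# de(?:bugger|c(?:imal|lare)|f(?:ault|er)?|init|l(?:e(?:gate|te))?)
```

This only factors shared prefixes, so shared suffixes such as the `e` in `l(egate|ete)` are not merged.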

I'm not so sure it would take up much less space though, if you take gzip compression into account. See for example here:

https://github.com/google/closure-compiler/wiki/FAQ#closure-...


Something like that would probably be better off with Aho-Corasick, which is similar to a trie but ends up with more compact FSMs.
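
For anyone who hasn't seen it, here is a rough textbook-style sketch of Aho-Corasick in Python. It is not taken from any highlighter mentioned in this thread, and the names `build_automaton` and `find_all` are made up for illustration. The automaton is a trie plus failure links, so every keyword is found in one pass over the text with no backtracking.

```python
from collections import deque

def build_automaton(words):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> words that end when this state is reached
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: failure links of shallow states are ready
    # before deeper states need them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_index, word) for every keyword occurrence in text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits

automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", automaton))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

Note that this reports overlapping matches, which a highlighter would still have to resolve (for example by preferring the longest keyword at each position).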


Hmm I've never heard of tries before but that looks pretty close to what I'm looking for.

As for the size aspect, it might not make much of a difference in the gzipped filesize but there are other benefits to smaller raw source as well (but honestly I don't have any idea if it would be at all worth anything, I'm assuming no).


Frak [1] does something like that and was written with syntax highlighters in mind.

[1]: https://github.com/noprompt/frak

edit: typo


Wow, this is an interesting idea for a library, thanks @hk__2 for sharing!

TL;DR: frak transforms collections of strings into regular expressions for matching those strings. It is available as a command line utility and for the browser as a JavaScript library.


I think emacs has the equivalent built in: `(rx (or "foo" "bar" "baz" "quux"))` gives you: `;; => "\\(?:ba[rz]\\|foo\\|quux\\)"`


Wow that's almost exactly what I had in mind.

I might take a look at this later and see if incorporating this or something like it has any kind of meaningful impact.


Emacs users have used regexp-opt in their own syntax highlighting (aka font-lock) for a while now, at least. If you wanted to port it, it'd depend on whether you're okay with the license, presumably, though I don't think writing a simple version from scratch would be that hard. (Do not let your eyes jump to the source block in the manual and then assume that it's not what you asked: that example is specifically described as the inefficient equivalent.) http://www.gnu.org/software/emacs/manual/html_node/elisp/Reg...


While I don't know the internals of different regex engines, anything that sounds so simple will probably be done by a good regex compiler/engine anyway. The simple | separated list of keywords seems a lot less error prone.

A 2009 article on the V8 blog about a regex optimization they did: http://blog.chromium.org/2009/02/irregexp-google-chromes-new...


That's where a tool like this could come in handy.

A general purpose regex optimizer can't make assumptions like you don't care about the ordering of sub-groups, but a tool like this can.

If you are just looking for the fastest way to match any 1 of these 40 keywords, this could make a fast regex that minimizes backtracking.
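
To illustrate why ordering matters, here is a small check using Python's `re` module, a backtracking engine where the first alternative that matches wins. I'm using it as a stand-in; I haven't checked this against any particular JavaScript engine.

```python
import re

text = "default"

# Ordered alternation: the first alternative that matches wins.
short_first = re.match(r"def|default", text).group()
long_first = re.match(r"default|def", text).group()

# With the shared prefix factored out, the greedy optional suffix is tried first.
factored = re.match(r"def(?:ault)?", text).group()

print(short_first, long_first, factored)
# def default default
```

So an optimizer that has to preserve the behaviour of `def|default` cannot rewrite it as `def(?:ault)?`, while a keyword-list tool that only cares about "does some keyword match here" can.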

Like I said, I have no idea if it would make any real difference, but if it did, incorporating something like that into a build process for syntax highlighting packages could improve performance for a bunch of people.


You are looking for Spintax ! That's how bots do comment-spam.

Example : "{|||Wow! |Hey! |Looks {nice|cool}! }{{I like|I love|Like|Love} {this|your}|{Very nice|Nice|Cool}} {photo|foto|picture|pic|image|img}{.|!|!!| :)| ;)| :D| <3}"


That’s not what OP asked for because it’s not optimized. You can e.g. optimize "I (love|like)" as "I l(ov|ik)e".
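
A quick sketch of that kind of factoring in Python, pulling out the common prefix and the common suffix of a word list. The function name `factor` is made up, and it does nothing clever for words that are prefixes of each other.

```python
import os
import re

def factor(words):
    """Factor the shared prefix and shared suffix out of a list of words."""
    prefix = os.path.commonprefix(words)
    rest = [w[len(prefix):] for w in words]
    # The common suffix is the common prefix of the reversed remainders.
    suffix = os.path.commonprefix([r[::-1] for r in rest])[::-1]
    middles = [r[:len(r) - len(suffix)] for r in rest]
    alternation = "(?:" + "|".join(re.escape(m) for m in middles) + ")"
    return re.escape(prefix) + alternation + re.escape(suffix)

pattern = factor(["love", "like"])
print(pattern)
# l(?:ov|ik)e
```

The result accepts exactly the same strings as `love|like`; it just shares the `l` and the `e`.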


Allan Odgaard, creator of TextMate, made one in 2005: https://github.com/textmate/bundle-development.tmbundle/blob...

It’s a command in the “bundle development” bundle.


A friend of mine just wrote an article on this:

http://engineering.khanacademy.org/posts/shortest-regex.htm


Yup! @satyr, best? known for their Coco language, has a tool that does just that: http://satyr.github.io/retrie/


There are two main problems with such a solution (and with any other existing JS-based code colorizer):

1. Use of regular expressions. It is hard to get an efficient solution that way; see the famous answer about parsing HTML with regex: http://stackoverflow.com/questions/1732348/regex-match-open-...

2. It modifies the DOM. That is undesirable in most cases and heavy: each DOM element takes 0.2k..1.0k of memory. Also, DOM handling in browsers has O(N) complexity (N being the number of DOM elements).

Just in case: in Sciter (http://sciter.com) I've added an option to style character runs without DOM modification: Selection.applyMark(runStart, runEnd, name) and a special ::mark(name) pseudo-element in CSS to style those runs.

Illustrations: http://sciter.com/tokenizer-mark-syntax-colorizer/


Pretty neat that the styling is included in the 2.2k, but it would be more practical if it would apply classes to allow me to style it myself. Glowing text is pretty opinionated.


This is cool, but I prefer code highlighting without requiring visitors to run JavaScript. It seems unnecessary when tools like Pygments [1] do the highlighting once and output html/css.

[1]: http://pygments.org/


The content might be dynamic not static. For example, an SO clone might allow users to type a comment including code in mark down and then need to preview it.


I was more of the opinion that syntax highlighting was purely presentational. Does that mean that it is more sensible to do it on the client?


Yup. Highlight it one time and serve it up vs. highlight every time (client side).

I don't like waste for the sake of waste.


Surely highlighting it on the server is the waste in this case? The library is only 2174 bytes (not even gzipped).

Let's say you used the smallest markup possible for each highlighted token. Something like <i class="x">[token here]</i>. You'd only be able to serve up at most 120 server-highlighted tokens before this JS library becomes a smaller payload, and that's without even defining the CSS styling or considering tokens longer than 1 character.

120 isn't even enough tokens to highlight the tiny code snippet at the top of this demo page. The same tiny snippet highlighted with Pygments comes out to 5819 bytes of markup alone (no styling) – already more than 2.5 times the size of this whole library. Plus you can highlight any number and size of code snippets while just serving the library once... which one is wasteful again? :)
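
The arithmetic can be checked in a few lines of Python. One assumption on my part: the 120 figure only comes out if each token is counted as the wrapper markup plus a one-character token, so that is how I've modelled it here.

```python
library_bytes = 2174            # library size quoted above
wrapper = '<i class="x"></i>'   # smallest per-token markup assumed above
per_token = len(wrapper) + 1    # assumption: wrapper plus a one-character token

break_even_tokens = library_bytes // per_token
print(per_token, break_even_tokens)
# 18 120

pygments_bytes = 5819           # Pygments markup size quoted above
ratio = pygments_bytes / library_bytes
print(round(ratio, 2))
# 2.68
```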


It's not the most high-quality syntax highlighting, but it provides decent highlighting with minimal setup and works well with user generated content where the programming language isn't known.

It's not going to revolutionise syntax highlighting, but I think it has plenty of use cases.


And it made the site super sluggish. Guess that huge regex needs a dedicated cpu.

Almost unreadable on a fairly quick phone...


Is it really the regex causing the lag? I haven't jumped into the code yet, but I would expect (or at least hope) the highlighter runs once and the lag is due to whatever DOM/styles it's generating.


Site works with no lag on my low end android. Someone else commented on issues on iPhone6 so possibly a Safari-specific bottleneck being hit.


Hello, Firefox on Android user here on a tablet from circa 2012. Extreme lag, unlike most websites.

There are definitely unacceptable performance issues with the syntax highlighter javascript.

It looks great, it's a cool project, but it is not suitable for use anywhere.


Looks like the `overflow:auto` on the page container div, not actually anything to do with the library.


Disclaimer: This is not my library, this is what I've gleaned from reading the code. I could always be wrong!

It actually does use RegExps, but only to locate the words for a language's primitive types (e.g. char, wchar_t, etc.).

Most of the library is actually written as a hand-implemented state machine rather than relying on regular expressions.
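
To give a feel for what "hand-implemented state machine" means, here is a toy sketch in Python. It is not microlight's actual logic: the states (`code`, `string`, `comment`), the function name `tokenize`, and the C-like syntax rules are all assumptions for illustration. It walks the source one character at a time and emits a run whenever the state changes.

```python
def tokenize(source):
    """Split source into (kind, text) runs using an explicit state variable."""
    tokens = []
    state = "code"
    start = 0
    i = 0
    while i < len(source):
        ch = source[i]
        next_state = state
        end = i  # where the current run ends if the state changes
        if state == "code":
            if ch == '"':
                next_state = "string"
            elif source.startswith("//", i):
                next_state = "comment"
        elif state == "string":
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == '"':
                # The closing quote still belongs to the string run.
                next_state, end = "code", i + 1
        elif state == "comment":
            if ch == "\n":
                next_state = "code"
        if next_state != state:
            if end > start:
                tokens.append((state, source[start:end]))
            state, start = next_state, end
        i += 1
    if start < len(source):
        tokens.append((state, source[start:]))
    return tokens

print(tokenize('x = "a//b"; // note\ny'))
# [('code', 'x = '), ('string', '"a//b"'), ('code', '; '), ('comment', '// note'), ('code', '\ny')]
```

Note how the `//` inside the string is not treated as a comment, which is exactly the kind of context a single flat regex has trouble with.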


For some reason scrolling that page on my iPhone 6s is insanely laggy. Is it due to a plugin?


It's because there's a wrapper div with "overflow: auto" around the entire page. This means you're scrolling inside the div (unaccelerated, need to repaint) instead of scrolling the entire document (accelerated, no need to repaint, can just move around pre-rendered tiles). Removing the "overflow: auto" makes scrolling smooth.


Also slow on a 5s


Also laggy in Safari on OS X.


I'm going to stick to http://prismjs.com; that is fast.


You should remove the semicolon. It's breaking the link.


This library is actually very HTML specific - look at the code here: https://github.com/asvd/microlight/blob/master/microlight.js... - it assumes what the language looks like.


Wow, almost 1k of the 2.2k of source is a single regular expression.

That's eerily beautiful.


Better hope your browser of choice has a regex JIT.

And GPU compositing, 'cause damn.


What could possibly go wrong? :)


Somewhat related: I wrote a JavaScript syntax highlighter for js1k a few years back. It's 1008 bytes including a quine - highlighting itself:

http://js1k.com/2010-first/demo/194


How well does it work for non C-like languages? Particularly, ones like Haskell, Erlang, SQL?

I'd check myself, but I'm away from a PC for a while.


it works... somehow :-)

http://asvd.github.io/microlight/haskel.png

Well, since the lib is general-purpose, it's built on compromises. But I am open to suggestions for updating the logic for particular cases.


Does it also work for languages where you can define // to be an operator?


Except for SQL, of course. SQL syntax is nonsense; highlighting would not make it any more readable.


It looks like it highlights random keywords like wchar_t.

Hardly for "any programming language".


Isn't that an appropriate keyword to highlight, just like `char`?


Not if you use a language where it isn't a keyword.


I don't think the intent is to flag up invalid syntax/usage, but to make sensible guesses about what strings should be highlighted, to aid readability.

There are few cases where having wchar_t in any code snippet would not indicate some sort of keyword/type annotation


I have used universal syntax highlighting and when it highlights stuff that shouldn't be highlighted, it is very annoying and confusing.


I think for 2.2k (look at the minified code), it's quite nice. I liked it.


Well done. It covers an important niche and it's a finished product so props to the author :)


    library size is extremely compact
    2.2k, seriously, can you imagine!
I wonder if I can do better than this, since it seems mostly a matter of a few regular expressions and then DOM manipulation?

EDIT

Having reviewed the source code [0]: the state-machine approach, when properly implemented, can beat* [1] an equivalent RE performance-wise.

It may still be interesting to see if the code size could be substantially reduced this way.

[0] https://github.com/asvd/microlight/blob/master/microlight.js

* [1] Disclaimer: this is from my own experience, and admittedly not with JavaScript. I recently confirmed that minimal hand-implemented state machines generally beat regexps in Go.


The glow effect is really nice. Appears to be simply good use of the text-shadow property.


It's a separate project of mine, actually

https://asvd.github.io/intence/


I'd say it's beautiful for some situations, but not for syntax highlighting.


The blurry / highlighted effect is simply CSS.

What we are actually looking at here is the fact that some words are highlighted in any programming language despite the fact that the language itself is unknown.

Usually the user has to manually select which programming language is being used and THEN the words get highlighted.


It would be nice to have some nice default colors for different things (non-keyword variables, numbers...), I don't think it would blow up the library size, but it would make a huge deal in readability of the code


The example looks buggy in Chrome, even more buggy in Canary: http://i.imgur.com/6BAfGD0.png


Does it interpret CSS's background-color as a single word? How about balance-amount, where the hyphen is used as a minus operator?


It would be interesting to read a blog about why just these keywords should be highlighted and why highlighting is a good idea.


Side question: What programming topics should I learn to create a syntax highlighter?



