
Fun fact: this is very close but slightly inaccurate. I used to think this is how it worked before scrutinizing a rule in the HTML tree-building specification.

The tag leads the parser to interpret everything following it as character data, but doesn’t impact rendering. In these cases, if there are active formatting elements that would normally be reconstructed, they will be reconstructed after the PLAINTEXT tag as well. It’s quite unexpected.

  <a href="https://news.ycombinator.com"><b><i><u><s><plaintext>hi
In this example “hi” will render with every one of the preceding formats applied.

https://software.hixie.ch/utilities/js/live-dom-viewer/?%3Ca...

After I discovered this the note in the spec was updated to make it clearer.

  https://html.spec.whatwg.org/multipage/parsing.html#:~:text=A start tag whose tag name is "plaintext"


I’ve become quite a fan of writing in SGML personally, because much of what you note is spot-on. Some of the points seem a bit of a stretch though.

Any type-checking inside of SGML is more akin to unused-variable checking. When you say that macros/entities may contain parameters, I think you are referring to recursive entity expansion, which does let you parameterize macros (but only once, and not dynamically within the text). For instance, you can set a `&currentYear` entity and refer to that in `copyright "&currentYear/&currentDay`, but that can only happen in the DTD at the start of the document. It’s not the case that you could, for instance, create an entity to generate a GitHub repo link and use it like `&repoName = "diff-match-patch"; &githubLink`. This feature was used in limited form to conditionally include sections of markup since SGML contains an `IGNORE` “marked section”.

   <!ENTITY % private-render "IGNORE">
   ...
   <![%private-render[
   <side-note>
   I’m on the fence about including this bit.
   It’s not up to the editorial standards.
   </side-note>
   ]]>

SGML also fights hard against stream processing, even more so than XML (and XML pundits regret not deprecating certain SGML features like entities which obstruct stream processing). Because of things like this, it’s not possible to parse a document without having the entire thing from the start, and because of things like tag omission (which is part of its syntax “MINIMIZATION” features), it’s often not possible to parse a document without having _everything up to the end_.
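XML inherits general entities from SGML, so the same obstruction is easy to see with Python’s stdlib XML parser (a rough sketch; the entity name here is made up):

```python
import xml.etree.ElementTree as ET

# The entity declaration lives in the DTD at the start of the document;
# a parser that has seen it can expand &year; later in the stream.
whole = '<!DOCTYPE doc [<!ENTITY year "2025">]><doc>&year;</doc>'
assert ET.fromstring(whole).text == "2025"

# A chunk taken from the middle of that same document cannot be parsed
# in isolation: without the declarations from the start, &year; is an
# undefined entity and parsing fails outright.
try:
    ET.fromstring("<doc>&year;</doc>")
except ET.ParseError as error:
    print("chunk alone fails:", error)
```

The chunk is byte-for-byte identical to part of a well-formed document, yet it carries no meaning without everything that came before it.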

Would love to hear what you are referring to with “safe” third-party transclusion and also what features are available for removal or rejection of undesired script in user content.

Apart from these I find it a pleasure to use because SGML makes it easy for _humans_ to write structured content (contrast with XML which makes it easy for software to parse). SGML is incredibly hard to parse because in order to accommodate human factors _and actually get people to write structured content_ it leans heavily on computers and software doing the hard work of parsing.

It’s missing some nice features such as namespacing. That is, it’s not possible to have two elements of the same name in the same document with different attributes, content, or meanings. If you want to have a flight record and also a list of beers in a flight, they have to be differentiated otherwise they will fail to parse.

   <flight-list>
   <flight-record><flight-meta pnr=XYZ123 AAL number=123>
   </flight-list>

   <beer-list>
   <beer-flight>
   <beer Pilsner amount=3oz>Ultra Pils 2023
   <beer IPA>Dual IPA
   <beer Porter>Chocolate milk stout
   </beer-list>
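For contrast, this is roughly how XML namespaces keep the two kinds of “flight” apart; a sketch with Python’s stdlib, where the element names and `urn:` URIs are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Two vocabularies both define a <flight> element; the namespace
# URIs (hypothetical here) keep them distinct in one document.
doc = """
<trip xmlns:air="urn:example:travel" xmlns:bar="urn:example:beer">
  <air:flight number="123"/>
  <bar:flight><bar:beer style="Pilsner"/></bar:flight>
</trip>
"""
root = ET.fromstring(doc)

# Parsed names carry the namespace, so the two flights never collide.
tags = [child.tag for child in root]
print(tags)
# ['{urn:example:travel}flight', '{urn:example:beer}flight']
```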

DSSSL was supposed to provide the transforms into RSS, page views, and other styles or visualizations. With XML arose XSL/XSLT which seemed to gain much more traction than DSSSL ever did. My impression is that declarative transforms are best suited for simpler transforms, particularly those without complicated processing or rearranging of content. Since `osgmls` and the other few SGML parsers are happy to produce an equivalent XML document for the SGML input, it’s easy to transform an SGML document using XSL, and I do this in combination with a `Makefile` to create my own HTML pages (fair warning: HTML _is not XML_ and there are pitfalls in attempting to produce HTML from an XML tool like XSL).

For more complicated work I make quick transformers with WordPress’ HTML API to process the XML output (I know, XML also isn’t HTML, but it parses reliably for me since I don’t produce anything that an HTML parser couldn’t parse). Having an imperative-style processor feels more natural to me, and one written in a programming language that lets me use normal programming conveniences. I think getting the transformer right was never fully realized with the declarative languages, which are similar to Angular and other systems with complicated DSLs inside string attribute values.

I’d love to see the web pick up where SGML left off and get rid of some of the legacy concessions (SGML was written before UTF-8 and its flexibility with input encodings shows it — not in a good way either) as well as adopt some modern enhancements. I wrote about some of this on my personal blog, sorry for the plug.

https://fluffyandflakey.blog/2024/10/11/ugml-a-proposal-to-u...

Edit: formatting


Nice to meet a fellow SGML fan!

> When you say that macros/entities may contain parameters, I think you are referring to recursive entity expansion,

No, I'm referring to SGML data attributes (attributes declared on notations having concrete values defined on entities of the respective notation); cf. [1]. In sgmljs.net SGML, these can be used for SGML templating which is a way of using data entities declared as having the SGML notation (ie. stand-alone SGML files or streams) to replace elements in documents referencing those entities. Unlike general entities, this type of entity expansion is bound to an element name and is informed of the expected content model and other contextual type info at the replacement site, hence is type-safe. Data attributes supplied at the expansion site appear as "system-specific entities" in the processing context of the template entity. See [2] for details and examples.

Understanding and appreciating the construction of templating as a parametric macro expansion mechanism without additional syntax may require intimate knowledge of lesser known SGML features such as LPDs and data entities, and also some HyTime concepts.

> create an entity to generate a Github repo link

Templating can turn text data from a calling document into an entity in the called template sub-processing context so might help with your use case, and with the limitation to have to declare things in DTDs upfront in general.

> it’s not possible to parse a document without having the entire thing from the start, and because of things like tag omission (which is part of its syntax “MINIMIZATION” features), it’s often not possible to parse a document without having _everything up to the end_.

Why do you think so and why should this be required by tag inference specifically? In sgmljs.net SGML, for external general entities (unlike external parameter entities which are expanded at the point of declaration rather than usage), at no point does text data have to be materialised in its entirety. The parser front-end just switches input events from another external source during entity expansion and switches back afterwards, maintaining a stack of open entities.

Regarding namespaces, one of their creators (SGML demi-god James Clark himself) considers those a failure:

> the pain that is caused by XML Namespaces seems massively out of proportion to the benefits that they provide (cf. [3]).

In sgmljs.net SGML, you can handle XML namespace mappings using the special processing instructions defined by ISO/IEC 19757-9:2008. In effect, element and attributes having names "with colons" are remapped to names with canonical namespace parts (SGML names can allow colons as part of names), which seems like the sane way to deal with "namespaces".

I haven't checked your site, but most certainly will! Let's keep in touch; you might also be interested in sgmljs.net SGML and the SGML DTD for modern HTML at [4], to be updated for WHATWG HTML review draft January 2025 when/if it's published.

Edit:

> Would love to hear what you are referring to with “safe” third-party transclusion and also what features are available for removal or rejection of undesired script in user content.

In short, I was mainly referring to DTD techniques (content models, attribute defaults) here.

[1]: https://sgmljs.net/docs/sgmlrefman.html#data-entities

[2]: https://sgmljs.net/docs/templating.html

[3]: https://blog.jclark.com/2010/01/xml-namespaces.html

[4]: https://sgmljs.net/docs/html5.html


Took me a while to process your response; thanks for giving me so much to think about. And yes, it is indeed nice meeting other SGML enthusiasts!

Can’t say I’m familiar with the use of notations the way you are describing, and it doesn’t help that ISO 8879:1986 is devoid of any reference to the italicized or quoted terms you’re using. Mind sharing where in the spec you see these data attributes?

The SGML spec is hands-down the most complicated spec I’ve ever worked with, feeling like it’s akin to how older math books were written before good symbolic notation was developed — in descriptive paragraphs. The entire lack of any examples in the “Link type definition” section is one way it’s been hard to understand the pieces.

> some HyTime concepts

I’m familiar with sgmljs but have you incorporated HyTime into it? That is, is sgmljs more like SGML+other things? I thought it already was in the way it handles UTF-8.

> Why do you think so and why should this be required by tag inference specifically?

Perhaps it’d be clearer to add chunkable to my description of streaming. Either the parser gets the document from the start and stores all of the entities for processing each successive chunk, or it will mis-parse. The same is true not only for entities but also for other forms of MINIMIZATION and structure.

Consider encountering an opening tag. Without knowing the structure and where the parser was previously, it’s not clear if that tag needs to close open elements. So I’m contrasting this to things that are self-synchronizing or which allow processing chunks in isolation. As with many things, this is an evaluation of the relative complexity of streaming SGML — obviously it’s easier than HTML because HTML nodes at the top of a document can appear at the end of the HTML text; and obviously harder than UTF-8 or JSON-LD where it’s possible to find the next boundary without context and start parsing again.
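UTF-8’s self-synchronization, for comparison: a decoder dropped into the middle of a byte stream can skip at most three continuation bytes before landing on a character boundary, with no context from earlier bytes. A rough sketch (the `resync` helper is hypothetical, not from any library):

```python
def resync(stream: bytes) -> bytes:
    """Skip continuation bytes (0b10xxxxxx) until a lead byte
    appears, then the stream can be decoded from there."""
    i = 0
    while i < len(stream) and (stream[i] & 0b1100_0000) == 0b1000_0000:
        i += 1
    return stream[i:]

text = "naïve café".encode("utf-8")
# Slice mid-character: byte 3 falls inside the two-byte "ï".
chunk = text[3:]
print(resync(chunk).decode("utf-8"))  # "ve café"
```

An SGML or HTML parser handed an arbitrary mid-document slice has no equivalent move: it cannot know which elements are open or which entities are in scope.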

> Regarding namespaces, one of their creators (SGML demi-good James Clark himself) considers those a failure:

Definitely aware of the criticism, but I also find them quite useful and that they solve concrete problems I have and wish SGML supported. Even when using colons things can get out of hand to the point it’s not convenient to write by hand. This is where default namespacing can be such an aid, because I can switch into a different content domain and continue typing with relatively little syntax.

Of note, in the linked reference, James Clark isn’t arguing against the use of namespaces so much as against the mismatch between what’s written in the XML and the URL the namespace points to, and against the over-reliance on namespacing for meaning in a document.

What feels most valuable is being able to distinguish a flight of multiple segments and a flight of different beers, and to do so without typing this:

  <com.dmsnell.things.liquids.beer.flight>
    <com.dmsnell.things.liquids.beer.beer>

  <com.dmsnell.plans.travel.itineraries.flight>
    <com.dmsnell.plans.travel.flights.segment>
Put another way, I’m not concerned about the problem of universal inter-linking of my documents and their semantics across the web, but neither was SGML.

Cheers!


Get a copy of The SGML Handbook! It includes the ISO 8879 text with much needed commentary. The author/editor deliberately negotiated this to get some compensation for spending over ten years on SGML; it's meant to be read instead of the bare ISO 8879 text.

Regarding HyTime, the editor (Steven Newcomb or was it Elliot Kimber?) even apologized for its extremely hard-to-read text, but the gist of it is that notations were used as a general-purpose extension mechanism, with a couple of auxiliary conventions introduced to make attribute capture and mapping in link rules more useful. sgmljs templating uses these to capture attributes in source docs and supply the captured values as system-specific entities to templates (the point being that no new syntax is introduced that isn't already recognized and specced out in HyTime general facilities). The concept and concrete notations of Formal System Identifiers is also a HyTime general facility.


HyTime was Kimber’s work, and I found these reflections of his on “worse is better” to be quite refreshing, especially when contrasting the formality of SGML against the flexibility of HTML.

https://drmacros-xml-rants.blogspot.com/2010/08/worse-is-bet...

I’ll have to spend more time in the SGML Handbook. It’s available for borrowing at https://archive.org/details/sgmlhandbook0000gold. So far, I’ve been focused on the spec and trying to understand the intent behind the rules.

> a couple of auxiliary conventions introduced to make attribute capture and mapping in link rules more useful

Do you have any pointers on how to find these in the spec or in the handbook? Some keywords? What I’ve gathered is that notations point to some external non-SGML processor to handle the rendering and interpretation of content, such as images and image-viewing programs.

Cheers!


> Do you have any pointers on how to find these in the spec or in the handbook?

I just checked now, and apart from FSIDR I was mainly referring to DAFE [1], a HyTime 2nd ed. facility actually part of AFDR (archforms) allowing additional attribute capture and mappings in link rules towards a more complete transformation language, as SGML's link rules are a little basic when it comes to attribute handling. Note in a HyTime context, transformations (for lack of a better word) are specified by AFDR notations/data attributes in not only a highly idiosyncratic way but also redundantly, when SGML LINK is perfectly capable of expressing those classes of transformations, and more. sgmljs only implements LPD-based transformations (yet with more power such as inheritance of DTDs into type checking of template processing contexts and pipelining) and only the DAFE part of AFDR but not other parts.

[1]: http://mailman.ic.ac.uk/pipermail/xml-dev/1998-July/004834.h... (see also ISO/IEC 10744 - A 5.3 Data Attributes for Elements (DAFE))


DOM\HTMLDocument is a huge boon. Niels Dossche incorporated lexbor into PHP to do so, and it maintains the same interface as DOMDocument once it’s instantiated.

In case people aren’t aware, DOMDocument is dangerous. You can’t parse HTML with an XML parser; so everyone currently using DOMDocument for HTML would benefit by replacing that with DOM\HTMLDocument immediately, eliminating both security and corruption issues.

Have you tried using an HTML parser?


Fun fact: HTML void elements came first.

XML brought in "empty elements" by adopting SGML's "null end-tag" (NET), part of the SHORTTAG minimization (though it changes the syntax slightly to make it work).

In SGML's reference concrete syntax, the Null End-Tag delimiter is `/`, and lets one omit the end tag.

    <p>This is <em/very interesting/.</p>
Here the EM surrounds "very interesting." In XML the "Null End-Tag Start" becomes "/" and the "Null End-Tag End" is ">".

So in XML a `<br />` is syntax sugar over `<br></br>` and identical to it (an element with _no_ child nodes), but in HTML `<br>` is an element which _cannot contain child nodes_ and the `/` was never part of it, required, or better.
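The XML half of that claim is easy to check with Python’s stdlib XML parser: both spellings produce identical elements.

```python
import xml.etree.ElementTree as ET

# In XML, the self-closing form and the explicit open/close pair
# parse to the same thing: same tag, no children, no text.
a = ET.fromstring("<br/>")
b = ET.fromstring("<br></br>")
assert a.tag == b.tag == "br"
assert list(a) == list(b) == []
assert a.text == b.text  # both None
```

No conforming XML processor can even tell you which form the document used.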

As called out in another comment, the trailing `/` on HTML void elements is dangerous because it can lead to attribute value corruption when following an unquoted attribute value. This can happen not only when the HTML is written, but when processed by naive tools which slice and concatenate HTML. It's not _invalid_ HTML, but it's practically a benign syntax error.
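A sketch of that corruption with Python’s stdlib tokenizer, whose unquoted-attribute handling matches the HTML rule here: in an unquoted value the slash is just another character, so it fuses into the value.

```python
from html.parser import HTMLParser

class AttrDump(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(tag, dict(attrs))

# The author meant src="diagram.png" plus a "self-closing" slash,
# but the slash joins the unquoted value and breaks the URL.
AttrDump().feed('<img src=diagram.png/>')
# img {'src': 'diagram.png/'}
```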


For better or worse XHTML, also known as the XML serialization of HTML, cannot represent all valid HTML documents. HTML and XML are different languages with vastly different rules, and it's fairly moot now to consider replacing them.

Many of the "problems" with HTML are still handled adequately simply by using a spec-compliant parser instead of regular expressions, string functions, or attempting to parse HTML with XML parsers like PHP's `DOMDocument`.

Every major browser engine and every spec-compliant parser interprets any given HTML document in the same prescribed deterministic way. HTML parsers aren't "loose" or "forgiving" - they simply have fully-defined behavior in the presence of errors.

This turned out to be a good thing because people tend to prefer being able to read _most_ of a document when _some_ errors are present. The "draconian error handling" made software easier to write, but largely dealt with errors by pretending they can't exist.


> Is not valid HTML, it's merely valid grammar syntax for a loose parser.

It's an incredible journey writing a spec-compliant HTML parser. One of the things that stands out from the very first steps is that the "loose parser" is kind of a myth.

Parsing HTML is fully-specified. The syntax is full of surprises with their own legacy, but every spec-compliant parser will produce the same result from the same input. HTML is, in a sense, a shorthand notation for a DOM tree - it is not the tree itself.

The term "invalid HTML" also winds up fairly meaningless, as HTML error are mainly there as warnings for HTML validators, but are unnecessary for general parsing and rendering.

And these are things we can't easily say about XML parsers. There are certain errors from which XML processors are allowed to recover, but which ones those are depends on which parser is run.
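The contrast shows up even in Python's stdlib: feed both parsers the same broken input and the XML side halts at the first well-formedness error while the HTML side recovers the content in a fully-defined way.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

broken = "<p>an <b>unclosed tag"

# XML: draconian -- parsing stops dead at the error.
try:
    ET.fromstring(broken)
except ET.ParseError as error:
    print("XML parser gave up:", error)

# HTML: the error has prescribed handling; the text still comes through.
class TextDump(HTMLParser):
    def handle_data(self, data):
        print("recovered:", data)

parser = TextDump()
parser.feed(broken)
parser.close()
```

(Python's html.parser is a tokenizer rather than a full spec tree builder, but the recovery-versus-halt contrast holds.)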

---

> I do like adding restrictions on confusing patterns with no known legitimate use cases or better alternatives.

HTML was based loosely on SGML, a language designed to encode structure in a way that humans could easily type. Particular care was taken in SGML to allow syntax "minimizations" (omitted tags, for example), so that the effort of encoding the required structure wouldn't drive humans away. It was noted in the spec that if people had to type every single tag they would likely give up. They did.

But SGML also had well-specified content models in the DTD, formalizing features like optional tags, short tags, tags derived from content templates, default attribute values. Any compliant SGML parser could reconstruct the missing syntax by parsing against that DTD.

HTML missed out on this and effectively the DTD was externalized in the browsers. The effort was made to produce a proper SGML DTD for HTML, but it was too late. Perhaps if there had been a widely-available SGML spec and parsers at the time HTML was created the story would be different.

Needless to say, these patterns are the result of formal systems taking human factors into their designs. XML came later as a way to make software parsers easier, largely abandoning the human factors and use-case of people writing SGML/HTML/XML in a text editor.

SGML is still rather fun to write and many of these minimization features are far more ergonomic than they might seem at first. If you have a parser that properly understands them, they are basically just convenient macros for writing XML.


Yes, thanks for pointing out that "valid" should not be thrown out too easily. And it happens that I made a mistake and the snippet is actually valid, a pattern shared with a small set of other exceptions, exactly as you point out!

Thanks for pointing out key aspects of the story, I had a loose knowledge about it.


My teammate built the WordPress Playground

https://playground.wordpress.net

The motivation was initially to build interactive documentation where you could play with the code examples and see how they change things, but it's being used to demonstrate plugins on the WordPress directory, build staging sites, add no-risk places to experiment.

Surprisingly has been taking over some of the native development tools because it provides a sandboxed and isolated environment that needs no other dependencies and leaves no junk on your computer once it's done. It's trivial to change the PHP version being run (making testing on different versions of PHP easy) by adding `--php 5.6`, for example, which is something a bit gnarly to do otherwise unless running PHP inside Docker.

Another surprise is that once loaded, it runs very fast, faster than native PHP and MySQL (the Playground can connect to MySQL but runs SQLite by default). I attribute this to having everything in memory and bypassing the filesystem.

It's been the kind of project that only gets better the more it's used. It turns out that having a no-setup, no-trace, isolated, risk-free app playground is an incredible tool for more purposes than we first realize. Someone even put the Playground inside an iOS app and it runs smoother than the native WordPress app.

As for a good solution? Just a lot of work I guess but WASM is critical to it. Any device or platform that supports JavaScript can run the sandbox/playground app: VSCode, `node` itself, a browser, a mobile device…


The Playground stores more than the SQLite database in OPFS. This includes things like installed plugins and file uploads and other artifacts brought in during boot. If there's enough interest and time I could imagine support being added to separate these, but for now it's easiest to have a single load and store mechanism for everything.

There's a third option supported only in Chrome at the moment which loads a directory from your computer's filesystem, bypassing even OPFS.

> using OPFS directly from SQLite doesn't work?

not entirely following your question. any way you could reword and explain what you were hoping to accomplish?


> Is there a reason why using OPFS directly from SQLite doesn't work?

I'm guessing this means using SQLite WASM's built-in OPFS integration as described in these articles:

- sqlite3 WebAssembly documentation - Persistent Storage Options: OPFS - https://sqlite.org/wasm/doc/trunk/persistence.md#opfs

- SQLite Wasm in the browser backed by the Origin Private File System - https://developer.chrome.com/blog/sqlite-wasm-in-the-browser...

Within the Playground, SQLite interacts with the database file in MEMFS only, and the Playground coordinates the syncing from MEMFS to OPFS.

https://github.com/WordPress/wordpress-playground/tree/trunk...

The reason for this, I believe, is that the primary use case is/was to have the entire file system in memory, including SQLite's database file. This was the original implementation, and is still the default behavior. Persistence was later added as an optional feature.

The good news is that browser support for OPFS seems to be getting better. From the SQLite docs:

  As of March 2023 the following browsers are known to have the necessary APIs:

  - Chromium-derived browsers released since approximately mid-2022
  - Firefox v111 (March 2023) and later


I'm not entirely sure on this, but I believe that what we're seeing is the WASM runtime allocating memory for its potential needs, but not totally using it all.

The entire filesystem, all the executables, all uploaded files, and all database data lives in memory in the virtual filesystem. I think we could trim down the default allocation but it might actually be meaningless anyway; that GB won't steal real RAM unless it fills up inside the environment.

Someone can probably check me on this, but if my memory serves me right this is what's going on.


the Playground can connect to a MySQL instance when running in a Node environment. it's not supported in the browser because there's no existing mechanism to relay the necessary socket connection there.

