You are correct. But like the other comment mentioned, the cool part about this is that it automatically learns the mapping from an image to a sequence of tokens. To handle arbitrary HTML, we just need to extend it a little and convert all possible HTML input into tokens. I think the takeaway is that this might be a feasible approach toward automatic UI code generation.
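To make "convert HTML input into tokens" concrete, here is a minimal sketch using Python's standard `html.parser`. The token format (`<tag>`, `</tag>`, `TEXT`) and the names `TagTokenizer` and `tokenize` are my own illustration, not what the project does.

```python
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Collects a flat token sequence from HTML markup (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are deliberately dropped to keep the vocabulary small.
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse any non-whitespace text into a single placeholder token.
        if data.strip():
            self.tokens.append("TEXT")


def tokenize(html):
    """Return the token sequence for an HTML string."""
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


print(tokenize("<div><button>OK</button><p>Hello</p></div>"))
```

Note that this throws away attributes and the actual text, which is exactly the information a real page needs, so the vocabulary would have to grow a lot to cover arbitrary HTML.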
Automatically learning the mapping from an image to a sequence of tokens is a very fundamental task for CNNs and not particularly new.
I don't think it's clear, or likely, that this can extend to all possible HTML input tokens. As you add more tokens, it becomes more difficult for the network to choose among them accurately. Additionally, as the token set becomes more fine-grained, the size of the output space will grow exponentially, and the network will likely struggle both to learn from the training examples and to output valid structure.
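To put rough numbers on the output-space point, here is a back-of-the-envelope sketch. It assumes a fixed vocabulary of V tokens and outputs of exactly L tokens; the specific values are illustrative, not taken from the project. The count is V^L, so it is exponential in the sequence length and polynomial in the vocabulary size.

```python
def output_space_size(vocab_size: int, seq_len: int) -> int:
    """Number of distinct token sequences of exactly seq_len tokens."""
    return vocab_size ** seq_len


# Illustrative (assumed) vocabulary sizes and sequence lengths.
for vocab_size in (20, 100, 1000):
    for seq_len in (10, 50):
        size = output_space_size(vocab_size, seq_len)
        # len(str(size)) - 1 is the order of magnitude (floor of log10).
        magnitude = len(str(size)) - 1
        print(f"V={vocab_size:>4}  L={seq_len:>2}  ->  about 10^{magnitude} sequences")
```

Only a tiny fraction of those sequences are well-formed markup, which is why I expect valid structure to be hard to learn from examples alone.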
I think you can compare this to approaches that receive an image as input and provide a caption of the image as output. They work surprisingly well in simple cases but are nowhere near fully functional or actually capable of understanding all inputs.
I agree that this might be a feasible approach toward automatic UI code generation eventually, but this is several significant levels of complication away from that.
Agree with you that this is by no means a complete solution and there is a long way to go to make it actually usable.
I think one big problem with image captioning could be the lack of high-quality training data, while in this case we can generate lots of good training data. Whether we will be able to generate enough good data and have enough compute power to train on it is something we need to find out.
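As a rough sketch of what I mean by generating training data: sample random token sequences from a small grammar and render each one to HTML. Everything here (the `LEAF_HTML` vocabulary, the `row_open`/`row_close` tokens, the function names) is a made-up toy, not the project's actual setup.

```python
import random

# Hypothetical toy vocabulary: each leaf token maps to an HTML snippet.
LEAF_HTML = {
    "button": "<button>Button</button>",
    "input": '<input type="text">',
    "text": "<p>Text</p>",
}


def sample_tokens(rng, max_depth=2):
    """Sample a random token sequence: leaves, optionally nested in rows."""
    tokens = []
    for _ in range(rng.randint(1, 3)):
        if max_depth > 0 and rng.random() < 0.4:
            # Nest a sub-sequence inside a balanced row_open/row_close pair.
            tokens += ["row_open"] + sample_tokens(rng, max_depth - 1) + ["row_close"]
        else:
            tokens.append(rng.choice(sorted(LEAF_HTML)))
    return tokens


def render_html(tokens):
    """Turn a token sequence into an HTML string."""
    parts = []
    for tok in tokens:
        if tok == "row_open":
            parts.append('<div class="row">')
        elif tok == "row_close":
            parts.append("</div>")
        else:
            parts.append(LEAF_HTML[tok])
    return "".join(parts)


rng = random.Random(0)  # fixed seed so the samples are reproducible
for _ in range(3):
    tokens = sample_tokens(rng)
    print(tokens, "->", render_html(tokens))
```

Turning each HTML string into a screenshot would still need a browser or rendering engine, which is left out here. The open question is whether samples like these cover enough of what real pages look like.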
Playing Go was considered too complex for computers a couple of years ago, but programs now beat the top human players. So I am hoping we can get a breakthrough on this sooner than we think.
Not wanting to dampen your high hopes, but the rules of Go seem a lot simpler than the rules and grammar of the current hypertext markup "language", particularly when taking the "browser dialects" into account, which are crucial for professional pages.