You are correct. But like the other comment mentioned, the cool part about this ...

chrisfosterelli · on Jan 11, 2018

Automatically learning the mapping from an image to a sequence of tokens is a very fundamental task for CNNs and not particularly new.

I don't think it's clear, or likely, that this can extend to all possible HTML input tokens. As you add more tokens, it becomes more difficult for the network to choose among them accurately. Additionally, as the token set becomes more fine-grained, the output sequences get longer and the output space grows exponentially with their length, so the network will likely struggle both to learn from the training examples and to output valid structure.
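
To put rough numbers on the output-space point: with a vocabulary of V tokens and outputs of up to L tokens, there are V + V^2 + ... + V^L possible sequences. A minimal sketch of that count follows; the vocabulary sizes and lengths are made-up values for illustration, not figures from the project.

```python
def count_sequences(vocab_size: int, max_len: int) -> int:
    """Number of distinct token sequences of length 1..max_len."""
    return sum(vocab_size ** n for n in range(1, max_len + 1))


# Hypothetical (vocabulary size, max length) pairs, chosen only for illustration.
for vocab_size, max_len in [(20, 50), (100, 50), (100, 200)]:
    total = count_sequences(vocab_size, max_len)
    # len(str(total)) - 1 is the order of magnitude (power of ten) of the count.
    print(f"V={vocab_size}, L={max_len}: about 10^{len(str(total)) - 1} sequences")
```

Only a small fraction of those sequences would be well-formed markup, which is one way to see why emitting valid structure gets harder as the vocabulary and the sequence length grow.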

I think you can compare this to approaches that receive an image as input and produce a caption of the image as output. They work surprisingly well in simple cases but are nowhere near fully functional or actually capable of understanding all inputs.

I agree that this might be a feasible approach toward automatic UI code generation eventually, but this is several significant levels of complication away from that.

houqp · on Jan 11, 2018

Agree with you that this is by no means a complete solution and there is a long way to go to make it actually usable.

I think one big problem with image captioning could be the lack of high-quality training data, whereas in this case we can generate lots of good training data. Whether we will be able to generate enough good data, and have enough compute power to train on it, is something we need to find out.
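
As a rough idea of what "generating training data" could look like: randomly build a layout, then emit both the token sequence (the label) and the matching HTML (the thing you would render and screenshot). This is only a sketch under my own assumptions. The token names and HTML templates are hypothetical, not taken from the project, and the screenshot step (for example with a headless browser) is left out.

```python
import random

# Hypothetical mini-vocabulary: token name -> HTML snippet (illustrative only).
LEAF_TOKENS = {
    "button": "<button>Click</button>",
    "text": "<p>Lorem ipsum</p>",
    "input": '<input type="text">',
}
# Hypothetical container tokens: token name -> HTML tag that wraps the children.
CONTAINER_TOKENS = {"row": "div", "header": "header"}


def random_layout(rng, depth=0, max_depth=2):
    """Return (tokens, html) for one randomly generated layout subtree."""
    # Emit a leaf at the depth limit, or with probability 0.5 otherwise.
    if depth >= max_depth or rng.random() < 0.5:
        name = rng.choice(sorted(LEAF_TOKENS))
        return [name], LEAF_TOKENS[name]
    name = rng.choice(sorted(CONTAINER_TOKENS))
    tag = CONTAINER_TOKENS[name]
    tokens, parts = [name, "{"], []
    for _ in range(rng.randint(1, 3)):
        child_tokens, child_html = random_layout(rng, depth + 1, max_depth)
        tokens += child_tokens
        parts.append(child_html)
    tokens.append("}")
    return tokens, f"<{tag}>{''.join(parts)}</{tag}>"


def make_dataset(n, seed=0):
    """Generate n (tokens, html) pairs, reproducible for a given seed."""
    rng = random.Random(seed)
    return [random_layout(rng) for _ in range(n)]


for tokens, html in make_dataset(3):
    print(" ".join(tokens), "->", html)
```

Because every pair is produced from the same random tree, the labels are correct by construction. The open question is whether layouts generated this way are varied enough to resemble real pages.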

Playing Go was considered a problem too complex to solve a couple of years ago, but it's now a solved problem. So I am hoping we can get a breakthrough on this sooner than we think.

fnl · on Jan 11, 2018

I don't want to dampen your high hopes, but the rules of Go seem a lot simpler than the rules and grammar of the current hypertext markup "language", particularly once you take the "browser dialects" into account, which are crucial for professional pages....