It's being researched with great interest. Building models from text and images,...

It's being researched with great interest. Building models from text and images, describing internal structure and relations between objects, building rich prior knowledge about the world in order to do inference and guide behavior.

I see lots of papers that go in this direction, of creating a rich, semantic, predictive representation of images, video and text and then using it as the basis for reinforcement learning. Learning to understand the world and to act based on that understanding.