The two tasks are surprisingly interchangeable. I once worked on a project where we used a statistical MT approach to "translate" between image features and captions--and I don't think we were the only ones trying such things.
In a pleasing bit of symmetry, the attentional network used here looks like it was initially developed for image captioning.