Thirty years ago, I was doing an object-recognition PhD. The field has obviously moved on a lot since then, but even back then, hierarchical and comparative classification was a thing.
I used to have the Bayesian maths showing the information content of relationships, but over decades of moving (continents, even) it's been lost. I still have the code, because I burnt CDs, but the results of hours spent writing TeX to produce horrendous-looking equations have long since disappeared...
The basics of it were to segment and classify using different techniques, and to model the relationships between adjacent regions of the classification. Once you could calculate the information content of one conformation, you could compare it with others.
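A minimal sketch of that comparison step, in modern Python with made-up numbers (the pair probabilities, names, and the 0.01 floor are all illustrative; the original maths was properly Bayesian and is long gone):

    import math

    # Illustrative adjacency probabilities; a real system would estimate
    # these from training data.
    PAIR_PROB = {
        frozenset(["sky", "tree"]): 0.20,
        frozenset(["tree", "grass"]): 0.15,
        frozenset(["sky", "grass"]): 0.05,
    }

    def information_content(labels, adjacency):
        """Sum -log2 P(label pair) over every pair of adjacent regions.

        labels:    {region_id: class label}
        adjacency: iterable of (region_id, region_id) edges
        """
        total = 0.0
        for a, b in adjacency:
            p = PAIR_PROB.get(frozenset([labels[a], labels[b]]), 0.01)
            total += -math.log2(p)  # rarer adjacencies carry more information
        return total

    # Score two candidate conformations over the same segmentation:
    edges = [(0, 1), (1, 2)]
    print(information_content({0: "sky", 1: "tree", 2: "grass"}, edges))
    print(information_content({0: "sky", 1: "grass", 2: "grass"}, edges))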
One of the breakthroughs came when I started modeling the relationships between the properties of neighboring regions of the image as part of the property-state of any given region. The idea came from the center/surround nature of the eye's processing; my reasoning was that if it worked there, it would probably help the neural nets I was using... It boosted the accuracy of the results by (from memory) ~30% over and above what would be expected from the increase in general information load being presented to the inference engines. This led to a finer grain of classification, so we could model the relationships (and derive information content from connectedness). It would, I think, cope pretty well with your hypothetical scenario.
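Very roughly, the augmentation might look like the sketch below; the particular centre/surround statistics (mean of neighbours, centre-minus-surround difference) are my own illustration of the idea, not the original property-state:

    import numpy as np

    def augment_with_surround(features, neighbours):
        """Append 'surround' statistics to each region's 'centre' features
        before they're handed to a classifier.

        features:   {region_id: np.ndarray} per-region feature vectors
        neighbours: {region_id: [region_id, ...]} adjacency lists
        """
        augmented = {}
        for rid, centre in features.items():
            nbrs = neighbours.get(rid, [])
            surround = (np.mean([features[n] for n in nbrs], axis=0)
                        if nbrs else np.zeros_like(centre))
            # The relationship itself (centre minus surround) becomes part
            # of the region's property-state, mirroring centre/surround
            # processing in the retina.
            augmented[rid] = np.concatenate([centre, surround, centre - surround])
        return augmented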
At the time I was using a blackboard[1] for what I called 'fusion', where I had multiple inference engines running under a firing-condition model. As new information came in from the lower levels, they'd post it to the blackboard, and other (differing) systems (KNN, RBF, MLP, ...) would act (mainly) on the results of processing done at a lower tier and post their own conclusions back to the blackboard. Lather, rinse, repeat. Some engines were skip-level, so raw data remained available at the higher levels too.
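A toy version of that blackboard loop, assuming the shape described above (the Blackboard class, levels, and firing conditions here are hypothetical, not the original system):

    # Engines declare a firing condition over what's on the board,
    # and post their conclusions back for higher tiers to consume.
    class Blackboard:
        def __init__(self):
            self.entries = []          # (level, tag, payload) tuples
            self.engines = []          # (fires_when, act) pairs

        def register(self, fires_when, act):
            self.engines.append((fires_when, act))

        def post(self, level, tag, payload):
            self.entries.append((level, tag, payload))

        def run(self, max_cycles=10):
            for _ in range(max_cycles):
                fired = False
                for fires_when, act in self.engines:
                    matches = [e for e in self.entries if fires_when(e)]
                    for entry in matches:
                        result = act(entry)
                        if result is not None and result not in self.entries:
                            self.entries.append(result)
                            fired = True
                if not fired:          # quiescent: nothing new was posted
                    break

    bb = Blackboard()
    # A skip-level engine can fire on raw (level 0) data directly...
    bb.register(lambda e: e[0] == 0 and e[1] == "raw",
                lambda e: (1, "segment", f"segmented({e[2]})"))
    # ...while a higher-tier engine acts only on lower-tier conclusions.
    bb.register(lambda e: e[0] == 1 and e[1] == "segment",
                lambda e: (2, "label", f"classified({e[2]})"))
    bb.post(0, "raw", "frame_0")
    bb.run()
    print(bb.entries)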
That was the space component. We also had time-component inferencing going on. The information vectors were fed into time-delay neural networks, as well as more classical averaging code. Again, a blackboard system was at work, and again we had lower and higher tiers of inference engine. This time we had relaxation labelling, Kalman filters, TDNNs and optic flow (in feature-space). These were also engaged in prediction modeling: as objects of interest were occluded, there was an expectation of where they were, and even when not occluded, the prediction of what was supposed to be where fed into a feedback loop for the next pass.
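For the prediction side, a minimal constant-velocity Kalman filter gives the flavour: when the object is occluded we skip the measurement update and coast on the prediction. All parameters and names here are illustrative:

    import numpy as np

    class Track:
        def __init__(self, x, y, dt=1.0, q=1e-2, r=1.0):
            self.x = np.array([x, y, 0.0, 0.0])       # state: position + velocity
            self.P = np.eye(4)                        # state covariance
            self.F = np.eye(4)                        # constant-velocity dynamics
            self.F[0, 2] = self.F[1, 3] = dt
            self.H = np.eye(2, 4)                     # we only observe position
            self.Q = q * np.eye(4)                    # process noise
            self.R = r * np.eye(2)                    # measurement noise

        def predict(self):
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q
            return self.x[:2]                         # expected position

        def update(self, z):                          # call only when visible
            y = np.asarray(z) - self.H @ self.x
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ y
            self.P = (np.eye(4) - K @ self.H) @ self.P

    t = Track(0.0, 0.0)
    for frame, z in enumerate([(1, 0), (2, 0), None, None, (5, 0)]):  # None = occluded
        pred = t.predict()
        if z is not None:
            t.update(z)
        print(frame, pred)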
All this was running on a 30MHz DECstation 3100 - until we got an upgrade to SGI Indys (the original Macs, given that OS X is Unix underneath...). I recall moving to Logica (signal processing group) after my PhD, and it took a week or so to hook up a camera (an IndyCam; I'd asked for the same machine I was used to) to point out of my window and start categorizing everything it could see. We had peacocks in the grounds (Logica's office was in Cobham, which meant my commute was always against the traffic, which was awesome), and they were always a challenge because of how different they could look depending on the sun at the time. Trees, bushes, cars, people, different weather conditions - it was pretty good at all of them because of its adaptive/constructive nature, and it got to the point where we'd save off whatever it didn't manage to classify (or classified at low confidence) to be folded back into the model. By constructive, I mean the ability to infer that region X is mislabelled as 'tree' because the surrounding/adjacent regions are labelled 'peacock' and there are no other connected 'tree' regions...

The system was rolled out, as a demo of the visual programming environment we were using at the time, to anyone coming by the office... It never got taken any further, of course... Logica's senior management were never that savvy about potential, IMHO :)
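A sketch of that constructive relabelling, assuming a unanimous-surround rule (the rule, names, and example are mine, not the original implementation):

    from collections import Counter

    def constructive_relabel(labels, neighbours):
        """labels: {region_id: label}; neighbours: {region_id: [region_id, ...]}

        Reads only the original labels during the pass (synchronous update),
        in the style of relaxation labelling."""
        relabelled = dict(labels)
        for rid, label in labels.items():
            nbr_labels = [labels[n] for n in neighbours.get(rid, [])]
            if not nbr_labels or label in nbr_labels:
                continue  # connected to at least one like-labelled region
            majority, count = Counter(nbr_labels).most_common(1)[0]
            if count == len(nbr_labels):      # unanimous surround
                relabelled[rid] = majority    # e.g. lone 'tree' inside 'peacock'
        return relabelled

    print(constructive_relabel(
        {0: "peacock", 1: "tree", 2: "peacock", 3: "peacock"},
        {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]},
    ))
    # -> region 1 becomes 'peacock'; the rest are unchanged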
My old immediate boss from Logica (and mentor) is now the Director of Innovation at the Centre for Vision, Speech and Signal Processing at the University of Surrey in the UK. He would, I think, disagree with you on the categorization side of your argument. It's been a focus of his work for decades, and I played only a small part in it - quickly realizing that there was more money to be made elsewhere :)
[1] https://en.wikipedia.org/wiki/Blackboard_system