Self-Supervised Tracking via Video Colorization (googleblog.com)
129 points by ot on June 27, 2018 | 16 comments


Very clever, and "obvious" only in hindsight: Training a deep convnet to colorize all frames in a grayscale video clip from a single color frame taken from the same clip induces the neural net to learn to track all objects in the video, with robustness to occlusions, change of viewing angles, etc. Labels are not required; only a color frame from each clip. Most impressively, the embeddings learned by the convnet (i.e., the representations learned by the next-to-last layer) are linearly separable by object. Very nice!
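To make the mechanism concrete, here is a rough sketch of the "copy colors by pointing" idea as I understand it (my own illustration with made-up shapes and names, not the authors' code): embed both grayscale frames per pixel, let each target pixel attend to reference pixels by embedding similarity, and use the attention weights to copy the reference colors - or, at test time, a label/mask map, which is where the tracking comes from.

    import torch
    import torch.nn.functional as F

    def colorize_by_pointing(embed, ref_gray, ref_color, tgt_gray, temperature=1.0):
        # embed: any convnet mapping a grayscale frame (1, H, W) to per-pixel
        # features (C, H, W); all shapes here are illustrative, not the paper's.
        f_ref = embed(ref_gray).flatten(1)   # (C, N) reference embeddings
        f_tgt = embed(tgt_gray).flatten(1)   # (C, N) target embeddings

        # Each target pixel softly "points" at similar reference pixels.
        attn = F.softmax(f_tgt.t() @ f_ref / temperature, dim=1)   # (N, N)

        # Copy a soft mixture of reference colors into the target frame.
        # Substituting a label/mask map for ref_color at test time propagates
        # the annotation instead of colors, which gives the tracking.
        colors = ref_color.flatten(1)        # (3, N) colors (or labels)
        pred = colors @ attn.t()             # (3, N)
        return pred.view(-1, *tgt_gray.shape[-2:])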


Simple to explain, yes, but I feel like this isn't really "obvious" even in hindsight. This whole thing is very clever.

The only complaint I have is that it's not better than supervised object tracking, so I wonder if this idea is too late?

To draw a parallel to image classification, at one point in time neural nets were trained with a bunch of unsupervised pre-training using reconstruction loss, but that technique has basically fallen by the wayside as we've gotten larger datasets and found a pile of tricks for training them from scratch.
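For anyone who missed that era: the recipe was roughly to train an autoencoder on unlabeled data with a reconstruction loss, then reuse the encoder as the initialization for the supervised classifier. A toy sketch with made-up layer sizes, just to illustrate the two phases:

    import torch.nn as nn
    import torch.nn.functional as F

    encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU())
    decoder = nn.Linear(128, 784)
    head = nn.Linear(128, 10)

    def pretrain_loss(x):
        # Unsupervised phase: reconstruct the input, no labels needed.
        return F.mse_loss(decoder(encoder(x)), x.flatten(1))

    def finetune_loss(x, y):
        # Supervised phase: reuse the pre-trained encoder for classification.
        return F.cross_entropy(head(encoder(x)), y)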


Labeling object locations in all frames of a large number of video clips is significantly more expensive than labeling a comparably large number of images.


Sure, but like ImageNet, these datasets already exist. So unless these models are quite brittle with respect to the objects being tracked, this is likely not going to be an issue.


> Sure, but like ImageNet, these datasets already exist.

I'm pretty sure that making it cheap to use new datasets is very valuable in the long run.


Now I want to know if using two input videos to mimic our two eyes makes this even more robust, or even gives depth perception for free?


I saw something similar before (from gifs.com's sticker editor, of all things, lol) where you annotate the segmentation of the first frame of the video and it propagates that segmentation to the rest of the frames:

https://medium.com/gifs-ai/interactive-segmentation-with-con...


I'm very skeptical that there is any merit to the 'tracking' over other techniques, or that the colorization is better than this 14-year-old paper:

http://webee.technion.ac.il/people/anat.levin/papers/coloriz...

The results in their videos look very poor.


Tracking is not the method they're using for colorizing; it's the other way around. The paper you linked has no tracking.


I realize that.

They are for some reason saying they can track things with their colorization, when their colorization is extremely unimpressive, as is the tracking that results from using it.

There is no reason colorization needs to happen to do the tracking anyway. The tracking is unimpressive and now indirect.

This isn't some sort of epiphany they've discovered; they're just reinventing video image segmentation, poorly.

Here are half a dozen examples from a 30-second Google search:

https://www.youtube.com/watch?v=juDvLrFQF0U

https://www.youtube.com/watch?v=JYgyDdLf7GQ

https://static.googleusercontent.com/media/research.google.c...

https://perso.liris.cnrs.fr/nicolas.bonneel/InteractiveMulti...

http://files.is.tue.mpg.de/black/papers/TsaiCVPR2016.pdf

https://graphics.ethz.ch/~perazzif/bvs/files/bvs.pdf

The only reason this is news is that it's Google and the researchers seem to think they've discovered something. Techniques like this, with much better results, have been shown at SIGGRAPH for decades.


Sorry, but I think you're fundamentally misunderstanding the idea of the paper. Colorization is not the point - it's an auxiliary task that lets the algorithm discover how to do a form of tracking.

As the paper itself states, the tracking results are not the absolute state of the art, but they are in the same ballpark and, more importantly, learned without supervision - just by watching video. That makes it easier to train on whatever dataset you have lying around, and, just as importantly, it's a clever, simple idea that can be improved on and adapted to different tasks.

(Disclaimer: Authors are acquaintances of mine.)


Again, I understand what they are doing very well. They noticed that colorization tends to track things in video.

Of course colorization tracks objects; it wouldn't work if it didn't.

This is essentially automatic video image segmentation, which itself is heavily derived from and related to natural image matting.

Natural image matting could even be seen as a combination of clustering and somehow solving (or minimizing) the error in the matting equation described by Porter-Duff compositing algebra.
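For reference, the matting equation being alluded to is just the standard compositing model (generic notation here, not the linked paper's): each observed pixel color C_i is modeled as a blend of a foreground color F_i and a background color B_i,

    C_i = \alpha_i F_i + (1 - \alpha_i) B_i, \qquad \alpha_i \in [0, 1]

and matting amounts to estimating \alpha_i (and often F_i and B_i) from C_i alone, typically by minimizing this blend error under a smoothness or clustering prior.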

So, automatic video segmentation can be seen as clustering over 3 dimensions of pixels - x, y and time, with some loose expectations of coherency over time.

There are many ways to achieve this, which should be obvious if you watch some of the videos or glance at some of the papers I've linked.

One simple way is to iterate a bilateral filter over the volume of pixels, which gradually clusters them together. One of the linked papers shows this technique.
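To make that concrete, here is a deliberately brute-force sketch of my own (not code from any of the linked papers): treat the clip as one (t, y, x) volume, seed labels on the annotated first frame, and repeatedly take bilateral averages over space-time position and color so that coherent regions pull toward the same label.

    import numpy as np

    def propagate_labels(video, seed_labels, iters=5, sigma_xyt=3.0, sigma_color=0.1):
        # video: (T, H, W, 3) floats in [0, 1]; seed_labels: (T, H, W) soft labels,
        # nonzero only on the annotated first frame. Brute force, O(N^2) per pass;
        # for illustration on tiny clips only.
        T, H, W, _ = video.shape
        tt, yy, xx = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
        coords = np.stack([tt, yy, xx], -1).reshape(-1, 3).astype(float)
        colors = video.reshape(-1, 3)
        labels = seed_labels.reshape(-1).astype(float)

        for _ in range(iters):
            out = np.empty_like(labels)
            for i in range(labels.size):
                d_sp = ((coords - coords[i]) ** 2).sum(1)      # space-time distance
                d_co = ((colors - colors[i]) ** 2).sum(1)      # color distance
                w = np.exp(-d_sp / (2 * sigma_xyt ** 2) - d_co / (2 * sigma_color ** 2))
                out[i] = w @ labels / w.sum()                  # bilateral average
            labels = out
        return labels.reshape(T, H, W)

Real implementations obviously approximate this (e.g. with bilateral grids or other accelerations) to make it tractable, but the clustering behavior is the same.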

Everything I linked gives much better results. None of it requires 'deep learning', and the idea that colorization follows objects is so trivial that it's nonsense to make a paper out of it. This is more a case of visibility and of most people not knowing the research that has already been done. That's understandable for people here, but the authors of this paper should have known better.


I tend to disagree on this. A method that improves on a previous best-in-class benchmark by ~30 percent is certainly impressive. While the supervised state-of-the-art benchmarks for visual tracking are significantly higher, we shouldn't ignore unsupervised (or essentially unsupervised, as in this case) methods simply because they are currently weaker. While they may remain weaker, there's immense value in increasing our unsupervised benchmarks, as another commenter pointed out.


Right, I totally agree. What's more, I think this same result could actually have been sold without the pompous deep learning bullshit and received quite differently. If they did not claim to have invented the wheel, but rather modestly noted their observation (which in a limited way is actually quite cool - and that's coming from someone who is, at least on this forum, a known deep learning skeptic), it would make a much better impression.

The same is actually true for many DL papers: they'd be cool if they weren't oversold.


What is the "pompous deep learning bullshit"? They don't really dwell on DL itself; they just describe a reasonable architecture for the task and how they train it. While it's possible to conceive of the same concept being realized with other machine learning methods (e.g., Random Forests or linear embeddings), it's hard to imagine it would work as well.

As for overselling, I'd say that it's somewhat standard writing style in CS academia to oversell[1] but this is hardly an example of that.

[1] It'd be unusual, but admittedly refreshing for a paper to say "this is just an incremental tweak on existing methods" or something along those lines.


No, there is something specific to machine learning and AI in general. SIGGRAPH papers, for example, are different: no one there claims to solve more than they actually solve. DL is soaked with hype and self-congratulatory BS. The best way to spot it is to check the citations: typically they solve an already-solved problem, skip any pre-deep-learning literature on it entirely (or, if they do cite it, only to dump BS on it), and then just cite a few of their own more or less relevant papers. I'm aware I'm overgeneralizing here, and not every paper is like that, but I've seen enough to detect a trend.

It is as if defending or advertising "deep learning" were the purpose of the paper. It is not. The purpose of a paper is to show a solution to a problem. Much of the DL literature (again, not all) is a "solution in desperate search of a problem" rather than the opposite.

I think many of these papers (including this one) would make great blog posts, but just aren't quite enough in terms of scientific content for a full-blown paper. A curiosity, a nice gimmick, but nothing more. Not really a solution to a problem, not really any idea with non-trivial universality.



