YOLO: Real-Time Object Detection (pjreddie.com)
320 points by headalgorithm on April 2, 2019 | hide | past | favorite | 61 comments



Interesting this is trending now. We have actually just recently released an improved version of YoloV3 (called G-Darknet) https://github.com/generalized-iou/g-darknet, using GIoU as a loss, which is described here: https://giou.stanford.edu

Also notable in G-Darknet are some tools useful for training (called darkboard), see https://github.com/generalized-iou/g-darknet/tree/master/dar...


Heads up: the boxes are drawn in the wrong places using Firefox 66 on Ubuntu 18.04. https://imgur.com/a/4d51spv

A bit confusing as the drawn boxes don't match the text. Works with Chromium though.


Thx, I'll check that


Interesting idea!

Though I have a question: in order to calculate C you need a way to attribute proposal and ground truth. It's trivial in case when there's only one instance of each class in the image.

But how does it work, when you're working with a set of same-class object? For example detecting each car in traffic.


Good question, we use the same method as coco, described in https://arxiv.org/pdf/1405.0312.pdf and implemented in the coco evaluation scripts here: https://github.com/cocodataset/cocoapi/blob/master/PythonAPI... -- basically the best matching proposal, ground truth pair
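For concreteness, here is a rough sketch of the GIoU computation and a greedy best-match pairing. The helper names are mine, and the greedy loop is a simplification of the actual COCO matching protocol linked above:

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    # Intersection area
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    # Smallest enclosing box C of the two boxes
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)

    # GIoU = IoU - |C \ (A u B)| / |C|
    return inter / union - (c_area - union) / c_area

def match_greedy(proposals, ground_truths):
    """Pair each ground truth with its best remaining proposal."""
    pairs, used = [], set()
    for gt in ground_truths:
        scored = [(giou(p, gt), i) for i, p in enumerate(proposals)
                  if i not in used]
        if scored:
            score, best = max(scored)
            used.add(best)
            pairs.append((best, score))
    return pairs
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes (it goes negative as the boxes drift apart), which is what makes it usable as a loss.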


Surprised to see this here since YOLO has been out for a while now. Shameless plug: I wrote an article on how to use transfer learning on your custom dataset with the pretrained weights [1]. One of the downsides of YOLO is that it uses its own deep learning library, darknet. I find the TensorFlow port Darkflow easier to use, but I haven't seen a v3 port yet.

[1] https://www.powu3.com/ml/yolo/


There is a PyTorch port from Ultralytics (https://github.com/ultralytics/yolov3). Nobody seems to have figured out how to match the training performance of darknet though, which is entirely uncommented C. The source is all there, but the loss function changed between v2 and v3, and it's not documented in the paper. I think it's been fixed in that PyTorch port now though. The only frustrating thing is that every commit in the repo is called "update"...

Alternatively... you can train in darknet and then run inference in another framework of choice.

Also shameless plug: I wrote an annotation tool which is designed to output darknet formatted labels: https://github.com/jveitchmichaelis/deeplabel
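For anyone unfamiliar with it: darknet expects one `.txt` label file per image, each line being a class id followed by a box center and size normalized to [0, 1]. A minimal conversion sketch from pixel coordinates (the `to_darknet` name is mine):

```python
def to_darknet(box, img_w, img_h):
    """Convert a pixel-space (x_min, y_min, x_max, y_max) box into the
    normalized (x_center, y_center, width, height) format darknet uses."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2 / img_w,
            (y_min + y_max) / 2 / img_h,
            (x_max - x_min) / img_w,
            (y_max - y_min) / img_h)

# Each label line is then: f"{class_id} {xc} {yc} {w} {h}"
```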


Yeah, I don't remember where I read it, but it took them a couple of weeks to train the model from scratch. I tried training my own weights from scratch and it was practically impossible on a Tesla K80. But it's pleasantly surprising how good the transfer learning results are on a custom dataset: you can get some "state of the art" results after training for a couple of hours. It's really impressive that he came up with YOLO and wrote his own deep learning library from scratch.

Thank you for the links! I'm going to check both out. I want to see if the PyTorch port works with the new deployment feature from 1.0.



The YOLOv3 paper is pretty delightful: https://arxiv.org/pdf/1804.02767.pdf


It's honest, instead of the inflated BS you usually have to spin around your half-working experiments.


For another example of such honesty, see the paper “HonestNN: An Honest Neural Network ‘Accelerator’”, from the SIGBOVIK 2019 proceedings: http://sigbovik.org/2019/proceedings.pdf#page=107. I love that paper.


You should read his site, his IDGAF attitude is pretty funny. The FAQ section is the most entertaining.



Like executives who shun computers as a symbol of their power and status... having a resume that emphasizes My Little Pony suggests his dance card is full (i.e. he has his pick of top job opportunities).


My new theory is that he is Bill Wurtz


His EULA does. It's named 'license.f*ck' and is included in the GitHub repo.


That is the Do What the F*ck You Want to license: http://www.wtfpl.net/


> Reviewer #2 AKA Dan Grossman (lol blinding who does that)

lmao


That's hilarious, and definitely makes me want to learn more about what the authors are talking about. What's a good place to refer to the various acronyms this paper uses, for those who aren't familiar with the field?


YOLO, no! https://i.imgur.com/R1RZ2N0.png

Jokes aside, we need better temporal consistency, especially when we start arming AI. citizen -> citizen -> citizen -> armed insurgent


The problem there isn't temporal consistency (although I agree that often sucks), it is over-reliance on context. The invisible sheep problem: http://aiweirdness.com/post/171451900302/do-neural-nets-drea...


With that particular example, a citizen could very well also be an armed insurgent. Whether that citizen/insurgent is an ally or neutral or enemy is the distinction worth solving (even if it's significantly harder for an AI).

Of course, that matters far less when Skynet decides that every human is a hostile armed insurgent...


I am assuming that would be solved by having the AI also take in inputs of where your troops and allies are located. Perhaps with something like the Blue Force Tracker [0].

One of the first priorities of an operation is knowing not where your enemy is, but where you are.

0: https://www.viasat.com/products/blue-force-tracking-2


Having worked with YOLO, I really recommend this intro: https://blog.paperspace.com/how-to-implement-a-yolo-object-d.... And in general, YOLO is performant and at the same time, it has a simpler architecture than the Fast(er) R-CNN family.

And due to its head, it is WAY more readable in PyTorch than in TensorFlow; to the point that I use it as an example in my Keras vs. PyTorch comparison, https://deepsense.ai/keras-or-pytorch/ (it was there at some point).


It still seems to use only a single frame, without past/present context. E.g. a dog is sometimes recognized as a teddy bear for a split second.

Are there any "continuous" models for that? It sounds like simple Bayesian post-processing would do a great deal (e.g. encoding the probability of dogs mutating into teddy bears as very low).


Yeah, it's easy to fix those with a filter on the predictions. You could use a Bayesian approach or just smooth using a majority vote over a rolling window of, say, 3 frames...
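A minimal sketch of that rolling majority vote, assuming we only smooth the top-1 class label per frame (`smooth_labels` is a made-up helper name):

```python
from collections import Counter, deque

def smooth_labels(frame_labels, window=3):
    """Replace each frame's label with the majority vote over a
    rolling window of the last `window` frames."""
    history = deque(maxlen=window)
    smoothed = []
    for label in frame_labels:
        history.append(label)
        # most_common(1) returns the single most frequent label in the window
        smoothed.append(Counter(history).most_common(1)[0][0])
    return smoothed
```

A Bayesian filter over the full confidence vectors would preserve more information, but even this crude vote suppresses single-frame glitches like the dog/teddy-bear flicker mentioned upthread.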


YOLO stands for "You Only Look Once" so I don't think this will ever become "continuous"


AFAIK, the 'Look Once' part refers to other systems that re-ran a section of the frame at a time through an object detector, resulting in a lot of reprocessing.

You could still look only once, but have that look include multiple sequential frames. Or do something like an LSTM of frames.


Good point. I hadn't considered this.


For each frame, it returns a list of candidate detections with confidence values if I remember correctly. Should be pretty straightforward to make it smooth using that.


That may depend on the weights you're loading with the model.


https://pjreddie.com/media/files/papers/YOLOv3.pdf

Sounds too good to be true. Also reads like that. :) A gem from this paper:

But maybe a better question is: “What are we going to do with these detectors now that we have them?” A lot of the people doing this research are at Google and Facebook. I guess at least we know the technology is in good hands and definitely won’t be used to harvest your personal information and sell it to.... wait, you’re saying that’s exactly what it will be used for?? Oh. Well the other people heavily funding vision research are the military and they’ve never done anything horrible like killing lots of people with new technology oh wait..... [1]

[1] The author is funded by the Office of Naval Research and Google.


I don’t really understand how this is any different from Overfeat or SSD...


YOLO is a combination of a backbone encoder and a detection head. The backbone (Darknet) is unique to YOLO; that would be the main difference from SSD, if I'm not mistaken.


You might want to check the username of who you're replying to.


YOLO is a very good and approachable object detection technique. I recently re-read the paper for the original YOLO [1] from 2015 and loved the apparent simplicity of this technique.

As a shameless plug, I wrote an intuitive guide to understanding SSD (Single Shot Detector), another popular object detection technique: https://towardsdatascience.com/understanding-ssd-multibox-re...

[1] https://arxiv.org/abs/1506.02640


It seems that the commercialized version of this technology is here: https://www.xnor.ai/technology/.

> Xnor's founding team developed YOLO, a leading open source object detection model used in real world applications. We use a proprietary, high performance, binarized version of YOLO in our models for enterprise customers.

Too good to be true? Seems that they're running YOLO on conventional multi-core CPUs. On ARM even.


This guy gave a talk at my university a few weeks ago. He did some live demonstrations and I was really impressed. With a video camera he did live detection in the room and was classifying dozens of objects. Like the screen was filled with identification boxes. He also did a demo where he used his cell phone. Not as many classifications, but still about a dozen.

Everyone was pretty impressed. I'm always impressed when I see live demos go (almost) flawlessly.


It's hilarious that the main video detects a dromedary as three cows at 3:26.


If I recall correctly, Andrew Ng covers this in his CNN course[0] and implementing it is one of the exercises.

[0] https://www.coursera.org/learn/convolutional-neural-networks


What's the best route to deploy a Python YOLO system as a desktop app? E.g. a .zip file you extract, install, then run - everything is included (TensorFlow/Keras libs, ...), no need for the user to set up an environment with conda yadda yadda.


At the risk of incurring HN's wrath: Docker is an option. Another is to use C/C++ instead of Python and statically link it. Either way, if you want to use the GPU you'll have a world of pain with NVidia stuff.


Check out PyInstaller.


Check out cx_Freeze.


Can we also get the orientation of each detected object?


With some changes - yes. I did this in my experimental project: https://github.com/indutny/resistenz/blob/master/python/mode...

The idea is to add an extra 2 params to the output of each classifier cell. Then do L2 normalization on them ( https://github.com/indutny/resistenz/blob/master/python/mode... ) and treat them as a cosine/sine pair.

The loss in this case would be the Euclidean distance between the actual and predicted pairs, which is equal to "2 * (1 - cos(x-y))".
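That identity is easy to check. A sketch of the encoding and loss on known angles (helper names are mine; in the real model the two raw network outputs are L2-normalized onto the unit circle rather than derived from a known angle):

```python
import math

def encode_angle(theta):
    """Represent an angle as a point on the unit circle."""
    return (math.cos(theta), math.sin(theta))

def angle_loss(pred_theta, true_theta):
    """Squared Euclidean distance between the two unit vectors.
    Expanding the squares gives 2 - 2*cos(pred - true), i.e. the
    2 * (1 - cos(x - y)) form mentioned above."""
    (pc, ps), (tc, ts) = encode_angle(pred_theta), encode_angle(true_theta)
    return (pc - tc) ** 2 + (ps - ts) ** 2
```

The cosine/sine pair avoids the wrap-around discontinuity you would get from regressing the angle directly: 359 degrees and 1 degree map to nearby points on the circle, so the loss stays small.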


Darknet is a framework for neural networks; YOLO is an algorithm focused on object detection. I think it could be relatively easy to add orientation to the detection.


Does your training dataset supply that info about each object?


Capsule networks would be better suited for that


I cannot get YOLO to detect at 30fps, even on gpu machines. This was true when I tried keras yolo as well as following the instructions for c compilation on this page.


A 720p webcam with CUDA gets about 90 fps for me.


Awesome stuff!

I understand the benefits (as mentioned); it would be interesting to know what disadvantages this has compared to classifier-type detection methods.


Great project, but pretty old now.


YOLOv3 is about a year old and is still state of the art for all meaningful purposes. It's fast and works well. You might get "better" results with a Faster R-CNN variant, but it's slow and the difference will likely be imperceptible. Using mAP@50, as pjreddie points out, isn't a great metric for object detection.


Interestingly, in our production systems YOLO's object detection was much faster and just as accurate.


recommendations for similarly easy but better/more "modern" alternatives?


See Faster-RCNN, R-FCN, SSD, etc

I've ignored Mask R-CNN because it's significantly more time-consuming to label your data.

The main candidates are all found in Facebook’s Detectron package, but they didn't feel it necessary to document anything in any significant level of detail: https://github.com/facebookresearch/Detectron

You can see also: https://paperswithcode.com/sota/object-detection-coco


those are some awesome humped horses and cows. The police brutality scene was cool also.


Isn't this from 1 year ago?


Comedy option: every detected object is labeled with the word "noumena"



