YOLO: Real-Time Object Detection (pjreddie.com)
320 points by headalgorithm on April 2, 2019 | hide | past | favorite | 61 comments



Interesting this is trending now. We have actually just recently released an improved version of YoloV3 (called G-Darknet) https://github.com/generalized-iou/g-darknet, using GIoU as a loss, which is described here: https://giou.stanford.edu

Also notable in G-Darknet are some tools useful for training (called darkboard), see https://github.com/generalized-iou/g-darknet/tree/master/dar...


Heads up: the boxes are drawn in the wrong places using Firefox 66 on Ubuntu 18.04. https://imgur.com/a/4d51spv

A bit confusing as the drawn boxes don't match the text. Works with Chromium though.


Thx, I'll check that


Interesting idea!

Though I have a question: in order to calculate C you need a way to attribute proposal and ground truth. It's trivial in case when there's only one instance of each class in the image.

But how does it work, when you're working with a set of same-class object? For example detecting each car in traffic.


Good question, we use the same method as coco, described in https://arxiv.org/pdf/1405.0312.pdf and implemented in the coco evaluation scripts here: https://github.com/cocodataset/cocoapi/blob/master/PythonAPI... -- basically the best matching proposal, ground truth pair
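For concreteness, here is a rough sketch of the GIoU computation and a greedy best-match pairing. The helper names are mine, and the greedy loop is a simplification of the actual COCO matching protocol linked above:

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    # Intersection area
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    # Smallest enclosing box C of the two boxes
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)

    # GIoU = IoU - |C \ (A u B)| / |C|
    return inter / union - (c_area - union) / c_area

def match_greedy(proposals, ground_truths):
    """Pair each ground truth with its best remaining proposal."""
    pairs, used = [], set()
    for gt in ground_truths:
        scored = [(giou(p, gt), i) for i, p in enumerate(proposals)
                  if i not in used]
        if scored:
            score, best = max(scored)
            used.add(best)
            pairs.append((best, score))
    return pairs
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes (it goes negative as the boxes drift apart), which is what makes it usable as a loss.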


Surprised to see this here since YOLO has been out for a while now. Shameless plug: I wrote an article on how to use transfer learning on your custom dataset with the pretrained weights [1]. One of the downsides of YOLO is that it uses its own deep learning library, darknet. I find the TensorFlow port Darkflow easier to use, but I haven't seen a v3 port yet.

[1] https://www.powu3.com/ml/yolo/


There is a PyTorch port from Ultralytics (https://github.com/ultralytics/yolov3). Nobody seems to have figured out how to match the training performance of darknet though, which is entirely uncommented C. The source is all there, but the loss function changed between v2 and v3, and it's not documented in the paper. I think it's been fixed in that PyTorch port now though. The only frustrating thing is that every commit in the repo is called "update"...

Alternatively... you can train in darknet and then run inference in another framework of choice.

Also shameless plug: I wrote an annotation tool which is designed to output darknet formatted labels: https://github.com/jveitchmichaelis/deeplabel
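For anyone unfamiliar with it: darknet expects one `.txt` label file per image, each line being a class id followed by a box center and size normalized to [0, 1]. A minimal conversion sketch from pixel coordinates (the `to_darknet` name is mine):

```python
def to_darknet(box, img_w, img_h):
    """Convert a pixel-space (x_min, y_min, x_max, y_max) box into the
    normalized (x_center, y_center, width, height) format darknet uses."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2 / img_w,
            (y_min + y_max) / 2 / img_h,
            (x_max - x_min) / img_w,
            (y_max - y_min) / img_h)

# Each label line is then: f"{class_id} {xc} {yc} {w} {h}"
```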


Yeah, I don't remember where I read it, but it took them a couple of weeks to train the model from scratch. I tried training my own weights from scratch and it was practically impossible on a Tesla K80. But it's pleasantly surprising how good the transfer learning results are on a custom dataset: you can get some "state of the art" results after training for a couple of hours. It's really impressive that he came up with YOLO and wrote his own deep learning library from scratch.

Thank you for the links! I'm going to check both out. I want to see if the PyTorch port works with the new deployment feature from 1.0.



The YOLOv3 paper is pretty delightful: https://arxiv.org/pdf/1804.02767.pdf


It's honest, instead of the inflated BS you usually have to spin around your half-working experiments.


For another example of such honesty, see the paper “HonestNN: An Honest Neural Network ‘Accelerator’”, from the SIGBOVIK 2019 proceedings: http://sigbovik.org/2019/proceedings.pdf#page=107. I love that paper.


You should read his site, his IDGAF attitude is pretty funny. The FAQ section is the most entertaining.



Like executives who shun computers as a symbol of their power and status... having a resume that emphasizes My Little Pony suggests his dance card is full (i.e. he has his pick of top job opportunities).


My new theory is that he is Bill Wurtz


His EULA does. It's named 'license.f*ck' and is included in the GitHub repo.


That is the Do What the F*ck You Want to license: http://www.wtfpl.net/


> Reviewer #2 AKA Dan Grossman (lol blinding who does that)

lmao


That's hilarious, and definitely makes me want to learn more about what the authors are talking about. What's a good place to refer to the various acronyms this paper uses, for those who aren't familiar with the field?


YOLO, no! https://i.imgur.com/R1RZ2N0.png

Jokes aside, we need better temporal consistency, especially when we start arming AI. citizen -> citizen -> citizen -> armed insurgent


The problem there isn't temporal consistency (although I agree that often sucks), it is over-reliance on context. The invisible sheep problem: http://aiweirdness.com/post/171451900302/do-neural-nets-drea...


With that particular example, a citizen could very well also be an armed insurgent. Whether that citizen/insurgent is an ally or neutral or enemy is the distinction worth solving (even if it's significantly harder for an AI).

Of course, that matters far less when Skynet decides that every human is a hostile armed insurgent...


I am assuming that would be solved by having the AI also take in inputs of where your troops and allies are located. Perhaps with something like the Blue Force Tracker [0].

One of the first priorities of an operation is knowing not where your enemy is, but where you are.

0: https://www.viasat.com/products/blue-force-tracking-2


Having worked with YOLO, I really recommend this intro: https://blog.paperspace.com/how-to-implement-a-yolo-object-d.... And in general, YOLO is performant and at the same time, it has a simpler architecture than the Fast(er) R-CNN family.

And due to its head, it is WAY more readable in PyTorch than in TensorFlow; to the point that I use it as an example in my Keras vs. PyTorch comparison, https://deepsense.ai/keras-or-pytorch/ (it was there at some point).


It still seems to use only a single frame, without past/present context. E.g. a dog is sometimes recognized as a teddy bear for a split second.

Are there any "continuous" models for that? It sounds like simple Bayesian post-processing would do a great deal (e.g. encoding the probability of dogs mutating into teddy bears as very low).


Yeah, it's easy to fix those with a filter on the predictions. You could use a Bayesian approach or just smooth using a majority vote over a rolling window of, say, 3 frames...
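A minimal sketch of that rolling majority vote, assuming we only smooth the top-1 class label per frame (`smooth_labels` is a made-up helper name):

```python
from collections import Counter, deque

def smooth_labels(frame_labels, window=3):
    """Replace each frame's label with the majority vote over a
    rolling window of the last `window` frames."""
    history = deque(maxlen=window)
    smoothed = []
    for label in frame_labels:
        history.append(label)
        # most_common(1) returns the single most frequent label in the window
        smoothed.append(Counter(history).most_common(1)[0][0])
    return smoothed
```

A Bayesian filter over the full confidence vectors would preserve more information, but even this crude vote suppresses single-frame glitches like the dog/teddy-bear flicker mentioned upthread.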


YOLO stands for "You Only Look Once" so I don't think this will ever become "continuous"


AFAIK, the 'Look Once' part refers to other systems that re-ran a section of the frame at a time through an object detector, resulting in a lot of reprocessing.

You could still look only once, but have that look include multiple sequential frames. Or do something like an LSTM of frames.


Good point. I hadn't considered this.


For each frame, it returns a list of candidate detections with confidence values if I remember correctly. Should be pretty straightforward to make it smooth using that.


That may depend on the weights you're loading with the model.


https://pjreddie.com/media/files/papers/YOLOv3.pdf

Sounds too good to be true. Also reads like that. :) A gem from this paper:

But maybe a better question is: “What are we going to do with these detectors now that we have them?” A lot of the people doing this research are at Google and Facebook. I guess at least we know the technology is in good hands and definitely won’t be used to harvest your personal information and sell it to.... wait, you’re saying that’s exactly what it will be used for?? Oh. Well the other people heavily funding vision research are the military and they’ve never done anything horrible like killing lots of people with new technology oh wait..... [1]

[1] The author is funded by the Office of Naval Research and Google.


I don’t really understand how this is any different from Overfeat or SSD...


YOLO is a combination of a backbone encoder and a detection head. The backbone (Darknet) is unique to YOLO; that would be the main difference from SSD, if I'm not mistaken.


You might want to check the username of who you're replying to.


YOLO is a very good and approachable object detection technique. I recently re-read the paper for the original YOLO [1] from 2015 and loved the apparent simplicity of this technique.

As a shameless plug, I wrote an intuitive guide to understanding SSD (Single Shot Detector), another popular object detection technique: https://towardsdatascience.com/understanding-ssd-multibox-re...

[1] https://arxiv.org/abs/1506.02640


It seems that the commercialized version of this technology is here: https://www.xnor.ai/technology/.

> Xnor's founding team developed YOLO, a leading open source object detection model used in real world applications. We use a proprietary, high performance, binarized version of YOLO in our models for enterprise customers.

Too good to be true? Seems that they're running YOLO on conventional multi-core CPUs. On ARM even.


This guy gave a talk at my university a few weeks ago. He did some live demonstrations and I was really impressed. With a video camera he did live detection in the room and was classifying dozens of objects. Like the screen was filled with identification boxes. He also did a demo where he used his cell phone. Not as many classifications, but still about a dozen.

Everyone was pretty impressed. I'm always impressed when I see live demos go (almost) flawlessly.


It's hilarious that the main video detects a dromedary as three cows at 3:26.


If I recall correctly, Andrew Ng covers this in his CNN course[0] and implementing it is one of the exercises.

[0] https://www.coursera.org/learn/convolutional-neural-networks


What's the best route to deploy a Python YOLO system as a desktop app? E.g. a .zip file you extract, install, then run - everything is included (TensorFlow/Keras libs, ...), no need for the user to set up an environment with conda yadda yadda.


At the risk of incurring HN's wrath: Docker is an option. Another is to use C/C++ instead of Python and statically link it. Either way, if you want to use the GPU you'll have a world of pain with NVidia stuff.


Check out PyInstaller.


Check out cx_Freeze.


Can we also get the orientation of each detected object?


With some changes - yes. I did this in my experimental project: https://github.com/indutny/resistenz/blob/master/python/mode...

The idea is to add an extra 2 params to the output of each classifier cell. Then do L2 normalization on them ( https://github.com/indutny/resistenz/blob/master/python/mode... ) and treat them as a cosine/sine pair.

The loss in this case would be the Euclidean distance between the actual and predicted pairs, which is equal to "2 * (1 - cos(x-y))".
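That identity is easy to check. A sketch of the encoding and loss on known angles (helper names are mine; in the real model the two raw network outputs are L2-normalized onto the unit circle rather than derived from a known angle):

```python
import math

def encode_angle(theta):
    """Represent an angle as a point on the unit circle."""
    return (math.cos(theta), math.sin(theta))

def angle_loss(pred_theta, true_theta):
    """Squared Euclidean distance between the two unit vectors.
    Expanding the squares gives 2 - 2*cos(pred - true), i.e. the
    2 * (1 - cos(x - y)) form mentioned above."""
    (pc, ps), (tc, ts) = encode_angle(pred_theta), encode_angle(true_theta)
    return (pc - tc) ** 2 + (ps - ts) ** 2
```

The cosine/sine pair avoids the wrap-around discontinuity you would get from regressing the angle directly: 359 degrees and 1 degree map to nearby points on the circle, so the loss stays small.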


Darknet is a framework for neural networks; YOLO is an algorithm focused on object detection. I think it could be relatively easy to add orientation to the detection.


Does your training dataset supply that info about each object?


Capsule networks would be better suited for that


I cannot get YOLO to detect at 30fps, even on gpu machines. This was true when I tried keras yolo as well as following the instructions for c compilation on this page.


A 720p webcam with CUDA gets about 90 fps for me.


Awesome stuff!

I understand the benefits (as mentioned); it would be interesting to know what disadvantages this has compared to classifier-type detection methods.


Great project, but pretty old now.


YOLOv3 is about a year old and is still state of the art for all meaningful purposes. It's fast and works well. You might get "better" results with a Faster R-CNN variant, but it's slow and the difference will likely be imperceptible. Using mAP@50, as pjreddie points out, isn't a great metric for object detection.


Interestingly, in our production systems YOLO's object detection was much faster and just as accurate.


recommendations for similarly easy but better/more "modern" alternatives?


See Faster-RCNN, R-FCN, SSD, etc

I've ignored Mask R-CNN because it's significantly more time-consuming to label your data.

The main candidates are all found in Facebook’s Detectron package, but they didn't feel it necessary to document anything in any significant level of detail: https://github.com/facebookresearch/Detectron

You can see also: https://paperswithcode.com/sota/object-detection-coco


those are some awesome humped horses and cows. The police brutality scene was cool also.


Isn't this from 1 year ago?


Comedy option: every detected object is labeled with the word "noumena"



