Hacker News | lukas's comments

I don't think it's true that MobileNetV2 isn't suited to training on GPUs - according to this https://azure.microsoft.com/en-us/blog/gpus-vs-cpus-for-depl... MobileNetV2 gets bigger gains from a GPU over several CPUs than ResNet does. You could argue the batch size doesn't fully use the V100, but these comparisons are tricky and this looks like fairly normal training to me.

It's pretty surprising to me that an M1 performs anywhere near a V100 on model training and I guess the most striking thing is the energy efficiency of the M1.


MV2 is memory-limited; the depthwise, grouped, and 1x1 convs have long kernel-launch overhead on a GPU. Shattered kernels are fine for a CPU, but not for a GPU.
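To make that concrete, here's a rough timing sketch (toy shapes, my own illustration, not from the linked post): it times a dense 3x3 conv against the depthwise 3x3 + pointwise 1x1 pair that MV2 uses. The separable pair does roughly an order of magnitude less arithmetic, so if it isn't close to that much faster on your GPU, the gap is launch/bandwidth overhead rather than math:

    import time
    import torch
    import torch.nn as nn

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    x = torch.randn(32, 128, 56, 56, device=device)

    dense = nn.Conv2d(128, 128, kernel_size=3, padding=1).to(device)
    separable = nn.Sequential(
        nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=128),  # depthwise
        nn.Conv2d(128, 128, kernel_size=1),                         # pointwise 1x1
    ).to(device)

    @torch.no_grad()
    def bench(module, iters=100):
        for _ in range(10):              # warm-up so lazy init doesn't skew timings
            module(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            module(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

    print(f"dense 3x3 conv:        {bench(dense) * 1e3:.3f} ms/iter")
    print(f"depthwise + 1x1 convs: {bench(separable) * 1e3:.3f} ms/iter")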

Though per your note on the scaling, those are really interesting empirical results. I'll have to look into that - thanks for passing it along.


Do you really train more than one model at the same time on a single GPU? In my experience that's pretty unusual.

I completely agree with your conclusion here.


Depends on model size, but if the model is small enough that I actually do training on a PCIe board, I do. I partition an A100 into 8 and train 8 models at a time, or just use MPS on a V100 board. The bigger A100 boards can fit multiple copies of models that fit in a single V100. (I've sketched that kind of setup below.)

Also, I tend to do this initially, when I'm exploring the hyperparameter space, for which I tend to use more, smaller models.

I find that using big models initially is just a waste of time. You want to try many things as quickly as possible.
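Here's the kind of thing I mean - a minimal sketch assuming PyTorch and one visible board (with MIG you'd point each worker at its own instance via CUDA_VISIBLE_DEVICES; with MPS you just start the daemon and launch the processes). The model and data are stand-ins:

    import torch
    import torch.multiprocessing as mp
    import torch.nn as nn

    def train_one(run_id: int, steps: int = 200):
        torch.manual_seed(run_id)
        device = torch.device("cuda")    # all workers share the board (or get a slice each)
        model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1)).to(device)
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(steps):
            x = torch.randn(256, 64, device=device)   # toy data standing in for a real loader
            y = torch.randn(256, 1, device=device)
            loss = nn.functional.mse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        print(f"run {run_id}: final loss {loss.item():.4f}")

    if __name__ == "__main__":
        mp.set_start_method("spawn")     # required for CUDA in child processes
        procs = [mp.Process(target=train_one, args=(i,)) for i in range(8)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()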


I found that training multiple models on the same GPU hits other bottlenecks (mainly memory capacity/bandwidth) fast. I tend to train one model per GPU and just scale the number of machines. Also, if nothing else, we tend to push model sizes to fill GPU memory.
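For what it's worth, this is roughly how I check whether one run even leaves headroom for a second - a sketch with a made-up model and batch, on a recent PyTorch:

    import torch
    import torch.nn as nn

    device = torch.device("cuda")
    torch.cuda.reset_peak_memory_stats(device)

    # stand-in model and batch; substitute one real training step here
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
    opt = torch.optim.Adam(model.parameters())
    x = torch.randn(512, 1024, device=device)
    y = torch.randint(0, 10, (512,), device=device)

    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()

    peak = torch.cuda.max_memory_allocated(device) / 1024**3
    free, total = torch.cuda.mem_get_info(device)
    print(f"peak per run: {peak:.2f} GiB, free: {free / 1024**3:.2f} of {total / 1024**3:.2f} GiB")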


Memory became less of an issue for me with the V100, and isn't really an issue with the A100, at least when quickly iterating on newer models, while their sizes are still relatively small.


For sure! I'm emailing you now.


I totally agree with this, and I built wandb (wandb.com) to solve this problem. We try to do it in as lightweight a way as possible - for example, we can do Keras tracking with a single line (https://www.wandb.com/articles/visualize-keras-models-with-o...) and PyTorch with just a couple of lines (https://www.wandb.com/articles/monitor-your-pytorch-models-w...). Would love any feedback on it.
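For anyone curious what "a single line" looks like in practice, here's a minimal Keras sketch (toy data and a made-up project name; assumes you've installed wandb and run `wandb login`):

    import numpy as np
    import tensorflow as tf
    import wandb
    from wandb.keras import WandbCallback

    wandb.init(project="my-demo-project")   # project name is just an example

    # toy data standing in for your real dataset
    x_train = np.random.rand(256, 10).astype("float32")
    y_train = np.random.rand(256, 1).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

    # the single extra line: pass WandbCallback to log metrics as you train
    model.fit(x_train, y_train, epochs=5, callbacks=[WandbCallback()])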


Hey Lukas! Love your work on wandb and very keen to find ways to integrate/collaborate :)


This stuff looks super cool. I'm a fan of PyTorch and can't help but add a shameless plug for the ML experiment management tools I built - https://www.wandb.com/blog/monitor-your-pytorch-models-with-...
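If it helps, the PyTorch side is basically this - a minimal sketch with a toy model and a made-up project name (assumes wandb is installed and you're logged in):

    import torch
    import torch.nn as nn
    import wandb

    wandb.init(project="pytorch-demo")      # example project name

    model = nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    wandb.watch(model)                      # log gradients and parameters

    for step in range(100):
        x = torch.randn(32, 10)             # toy batch standing in for real data
        y = torch.randn(32, 1)
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        wandb.log({"loss": loss.item()})    # log metrics each step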


Could you say a little more? I think I understand the "scaler" - it's how I learned the scales and how I practice - but I'm curious what the pentanizer is suggesting you do. Pick a root and an interval and find it all over the fretboard?


This may help: https://www.reddit.com/r/Guitar_Theory/comments/8c1s9i/penta...

There is a YouTube video link included which explains the concept.


That’s pretty awesome - I hadn’t seen it before. Thanks for the link!


Weights and Biases | Engineer and Designer | SF | Full-time

We're three experienced technical founders building machine learning tools. We're well funded with some traction but we're still a very small team.

We're looking for:

Full Stack Engineer - https://www.wandb.com/job-full-stack-engineer

Product Design Lead - https://www.wandb.com/job-product-design


I've been teaching classes on machine learning for engineers (shameless self-promotion: https://www.eventbrite.com/e/technical-introduction-to-ai-ma...).

One of the coolest parts of teaching these classes is how awesome the people who show up are. Engineers who want to learn new things mid-career are exactly the kind of people I want to work with and hang out with. I think there's a real opportunity for more classes like this.


I really appreciate the author's critical analysis of this correlation presented as "fact" by Radiolab and I love how Hacker News and other blogs take these types of scientific findings and dig in for the truth. I think the PNAS paper refutes the original conclusion pretty thoroughly - I wish the Nautilus author would just explain that.

I don't think we should dismiss effects just because they seem really large (as the Nautilus author claims) but I do think that it's incredibly irresponsible of Sapolsky and Radiolab to be uncritically citing a study that looks like it was debunked in 2011.

I also think it's strange that the author cites the SJDM paper, which is much, much less convincing, as refuting the original experiment. It looks to me like that paper just shows that by simulating a non-random order of parole requests you can create data that looks like the original experiment.

I love that Hacker News posts these things and people go through and analyze the papers. No one outside of the specialized field could possibly have time to analyze all of these papers but they clearly have implications that matter for everyone. I wish that popular science shows would do a more thorough analysis of these results on their own.


This is clearly not a true story. I think someone should flag it or put a warning in the title.


I think this comments section is enough. It's more fun to go into the story without knowing that it's fake.


NO it's not fun. I just sent it to all my colleagues...


You are not alone: so did I.


No, someone shouldn't. Since, as you say, it's already clear.

