Hacker News

Why test distributed computing technologies on a single core? Am I missing something?



The experiments were conducted on a 30-core Intel Xeon machine with 132 GB of memory and 2 hyperthreads per core.


But the "--master local[1]" setting they're using for Spark will run it on a single thread.

And in the article they state: "The algorithm took around 500 seconds to train on the NETFLIX dataset on a SINGLE processor, which is good for data as large as 1 billion ratings."

(emphasis mine)


Being the one who conducted these experiments, I can confirm that the number of threads was varied (the graph shows performance scaling). I am sorry for the confusion caused; this was a typo and should have been "--master local[N]".


edit: local[1] has been updated to local[N], thank you for the update!

Ok, thanks, I didn't know that's what "local[1]" did. So the more relevant comparison would be with --master local[30]?
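For reference, Spark's local master string sets the number of worker threads when running without a cluster. A hypothetical invocation (the script name als_benchmark.py is made up; only the --master flag is the point here):

```shell
# Run a Spark job locally with 30 worker threads (one per physical core).
spark-submit --master "local[30]" als_benchmark.py

# local[*] uses as many threads as there are logical cores on the machine.
spark-submit --master "local[*]" als_benchmark.py
```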

"The algorithm took around 500 seconds to train on the NETFLIX dataset on a SINGLE processor, which is good for data as large as 1 billion ratings." - This is from the sequential portion of the test; the parallel portion is the next section.
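For context, ALS (alternating least squares) factorizes a sparse ratings matrix by alternately solving a ridge regression for the user factors and the item factors. A minimal NumPy sketch on a tiny dense example (this is an illustration, not the Spark or Julia implementation benchmarked in the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny ratings matrix (users x items); 0 marks a missing rating.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)
mask = R > 0          # observed entries
k = 2                 # latent factor dimension
lam = 0.1             # ridge regularization

U = rng.standard_normal((R.shape[0], k))
V = rng.standard_normal((R.shape[1], k))

for _ in range(20):
    # Fix V, solve a regularized least-squares problem per user...
    for u in range(R.shape[0]):
        Vu = V[mask[u]]
        U[u] = np.linalg.solve(Vu.T @ Vu + lam * np.eye(k),
                               Vu.T @ R[u, mask[u]])
    # ...then the symmetric step per item with U fixed.
    for i in range(R.shape[1]):
        Ui = U[mask[:, i]]
        V[i] = np.linalg.solve(Ui.T @ Ui + lam * np.eye(k),
                               Ui.T @ R[mask[:, i], i])

err = np.sqrt(np.mean((R - U @ V.T)[mask] ** 2))
print(f"RMSE on observed entries: {err:.3f}")
```

Each per-user (and per-item) solve is independent, which is why the algorithm parallelizes well across threads or a cluster.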


That was a typo in the blog post. If you look at the graph, Spark gets faster with more cores, as does Julia. The typo is now fixed.


Thanks for the update. The typo had me misinterpreting things. Now it makes more sense.

Assuming you're part of the team? Keep up the good work.


We wanted to do this on a true distributed setup. However, all the largest datasets we could find on which everyone has run ALS fit on a single machine (even one with less RAM than this).


Probably meant "on a single machine".



