I am pretty sure that is a typo in the blog post, since Spark's performance improves as more cores are used, as does Julia's, and they both show similar scaling characteristics.
The performance plot would probably be more readable if the y-axis were on a log scale.
But the "--master local[1]" setting they're using for Spark will run it on a single thread.
And in the article they state: "The algorithm took around 500 seconds to train on the NETFLIX dataset on a SINGLE processor, which is good for data as large as 1 billion ratings."
Being the one who conducted these experiments, I can confirm that the number of threads was varied across runs (the graph shows performance scaling). I am sorry for the confusion caused; this was a typo and should have been "--master local[N]".
Edit: local[1] has been updated to local[N]. Thank you for the update!
Ok thanks, I didn't know that's what "local[1]" did. So would the more relevant comparison be with --master local[30]?
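For reference, a rough sketch of what that invocation would look like (the script name is hypothetical; the local[N] master URL is standard Spark and runs everything in one JVM with N worker threads, while local[*] uses one thread per core):

```sh
# Only the --master flag matters here; als_benchmark.py is a made-up name.
spark-submit --master local[30] als_benchmark.py   # 30 worker threads
spark-submit --master local[*]  als_benchmark.py   # one thread per available core
```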
"The algorithm took around 500 seconds to train on the NETFLIX dataset on a SINGLE processor, which is good for data as large as 1 billion ratings."
This is from the sequential portion of the test; the parallel portion is covered in the next section.
We wanted to do this on a true distributed setup. However, all the largest datasets we could find on which people commonly run ALS just fit on a single machine (even one with less RAM than this one).
The Spark comparison, given the April 2016 posting of the article, was likely done with Spark 1.6. Spark 2.0, released in July, added significant performance improvements (https://docs.cloud.databricks.com/docs/latest/sample_applica...), so the performance gap may look different nowadays.
Quite possible, and it would be interesting to see how this stacks up today. I was just glad to see that Julia's parallel computing could, out of the box, give results comparable to Spark's, with the ALS algorithm written entirely in Julia without crazily optimized code.
What's going on with multithreading support? I was trying a project a while back to build a pure-Julia MapReduce-like engine on a distributed file system, but it was hard to get off the ground due to poor multithreading support.
For the uninitiated, Julia has two types of concurrency built in: Tasks, which are coroutines on the same thread, and Clusters, which are "separate machines".
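A minimal sketch of both, using current (1.x) names; the 0.5-era spellings differed slightly:

```julia
# Tasks: cooperative coroutines multiplexed on one thread.
t = @async begin
    sleep(1)          # yields to other tasks instead of blocking
    "task done"
end
println(fetch(t))

# Clusters: separate worker processes, potentially on other machines.
using Distributed
addprocs(2)                       # spawn two local worker processes
println(pmap(x -> x^2, 1:10))     # the map runs on the workers
```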
The multi-threading in Julia is really new and limited. The plan is first to get the whole codebase to be thread-safe and provide some simple parallelism models and then figure out what a good composable multi-threading model could be.
For now, since the GC effectively runs on only one thread, you get good speedups from multi-threading if you avoid allocation, and thus GC, in the parallel code sections. In some cases this is possible, but in many cases it is unnatural. Of course, all of this is under heavy development.
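As a rough illustration of that allocation-free pattern (assuming Julia is started with several threads, e.g. via the JULIA_NUM_THREADS environment variable):

```julia
# In-place a*x + y: nothing is allocated inside the hot loop,
# so the single-threaded GC never needs to run.
function saxpy!(y, a, x)
    Threads.@threads for i in eachindex(x, y)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return y
end

x = rand(10^7); y = rand(10^7)
saxpy!(y, 2.0, x)
```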
To build a Julia mapreduce engine on a distributed filesystem, Julia's multi-processing should be pretty good, though. That has been our experience with the simple problems we attempted using packages like Elly.jl.
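For instance, a rough sketch of the multi-processing route using only the standard Distributed library (this is not Elly.jl's actual API; the file names and the word_count function are made up):

```julia
using Distributed
addprocs(4)                                    # local workers; could be remote machines

# Define the map function on every worker.
@everywhere word_count(path) = length(split(read(path, String)))

files = ["part-00001.txt", "part-00002.txt"]   # hypothetical input shards
total = reduce(+, pmap(word_count, files))     # map on the workers, reduce on the driver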
In particular, note that if you were doing this experiment more than a month ago, there was no threading support (except on master); there has been support for distributed computing since the first release. The new 0.5 release has support for multithreading, but it's still, as Viral said, experimental.
6 months is a long time ago in Julia land. Since then, the standard Julia install is a minor version higher, which makes -O3 optimization standard; there is dot syntax for automatically fused broadcasts (simple loop fusion for MATLAB-style vectorization); anonymous functions are orders of magnitude faster; Base has been slimmed down; and many new organizations like JuliaMath and JuliaDiffEq have formed with new packages that enhance performance for these kinds of workloads. In Julia you can program really fast, so the language (written in Julia) and the package ecosystem evolve fast as well. Code from 6 months ago is recognizably different (at least right now).
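For example, the dot syntax fuses every elementwise operation on the right-hand side into a single loop, and .= writes the result in place:

```julia
x = rand(1000)
y = similar(x)
y .= 2 .* sin.(x) .+ 1   # one fused loop, zero intermediate arrays
```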
Regarding minutiae such as these, maybe. But I've been following Julia on and off for close to 5 years, and it's mostly crickets even between "major" versions (e.g. 0.2 to 0.3 to 0.4, etc.).
That said, things are a little better/faster/picking up this year.
This means running on a single core, which is not really a fair comparison with the multithreaded or multiprocess Julia version.