Because a couple decades ago Java convinced Enterprise Land that they can't hire millions of C++ jockeys and expect them to work effectively in huge projects that plan to evolve into the next decades' (aka: the present's) legacy mudball. Instead, they decided it would be easier to hire millions of Java jockeys and have them build enormous kiln-fired mudballs using the same architectural strategy as the Egyptian pyramids. They convinced academia to raise an entire generation of Java jockeys, hired them all right out of school, and set them immediately to piling up enormous mud bricks forever.
So, now they have a few million Java jockeys churning away and a few million person-decades of work put into their mud piles. When starting any new project, there isn't much question about how to build it: More Mud!
As an embedded developer, where every cycle counts, I've arrived at the same question as the poster above: why bother with such languages? If a switch can process packets at line rate by using ASICs, why not see similar development in the world of big data?
You assume that the JVM is slow, yes? That's not always the case. Interestingly, there are cases where JVM applications run just as fast as, if not faster than, native code. This blows my mind as a C++ programmer myself.
Once compiled to native code, which it will be for big data because the same classes are reused over and over, I would assume it would be in the same ballpark as C/C++ code.
There's still a pretty big speed penalty for Java because the object model encourages a lot of pointer-chasing, which will blow your data locality. In C++, it's common for contained structs to be flat in memory, so accessing a data member in them is just an offset from a base address. In Java, all Object types are really pointers, which you need to dereference to get the contained object. HotSpot can't really optimize this beyond putting really frequently used objects in registers.
A lot of big-data work involves pulling out struct fields from a deeply nested composite record, and then performing some manipulation on them.
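For a concrete illustration of the layout difference (the Point class, the array size, and the loops are made up for this sketch, not taken from any particular codebase): summing a field over an array of Java objects dereferences one heap pointer per element, while a flat primitive array is read sequentially from contiguous memory.

    import java.util.Random;

    public class LocalityDemo {
        static final int N = 1_000_000;

        // In Java, points[i] is a reference; each Point object lives elsewhere on the heap.
        static class Point {
            double x, y;
            Point(double x, double y) { this.x = x; this.y = y; }
        }

        public static void main(String[] args) {
            Random rnd = new Random(42);
            Point[] points = new Point[N];
            double[] xs = new double[N];          // flat, contiguous primitive array
            for (int i = 0; i < N; i++) {
                double v = rnd.nextDouble();
                points[i] = new Point(v, -v);     // one heap allocation per element
                xs[i] = v;
            }

            // Object version: every iteration chases the points[i] reference before it can read .x.
            double sumObjects = 0;
            for (Point p : points) sumObjects += p.x;

            // Primitive version: sequential reads of a contiguous double[], friendly to cache and prefetcher.
            double sumFlat = 0;
            for (double x : xs) sumFlat += x;

            System.out.println(sumObjects + " " + sumFlat);
        }
    }

The C++ equivalent with a vector of plain structs gets the flat layout automatically; in Java you have to restructure your data (structure-of-arrays, primitives only) to get it.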
Memory indirection is the biggest issue indeed. However, I'd also add that Java has a terrible performance model as a language. Unless you stick to primitives only, the abstraction costs start to add up (beyond pointer chasing). It shoves the entire optimization burden onto the JVM, which by the time it runs has lost a bunch of semantic and type information in some cases. There are also codegen deficiencies in the current HotSpot C2 compiler (i.e., the generated code is subpar compared to roughly equivalent gcc output).
I think this trend may stop soon. There are already OSS big data projects written in more performant languages (e.g. C++) coming around (e.g. ScyllaDB, Cloudera's Kudu).
Rust is exciting, no doubt, and I have high hopes for its adoption, but I've personally not seen/heard of any visible OSS big data style projects using it. I see Frank McSherry's stuff has been mentioned, but I think that's still his pet project (hopefully not putting words in his mouth).
But really, I was using C++ as an example of something more fit for these types of projects than Java; it doesn't have to be C++ specifically, of course.
Rust has Frank McSherry (formerly working on Naiad for Microsoft Research) and his work on timely dataflow and differential dataflow: https://github.com/frankmcsherry/blog
Most JVM-based query engines use bytecode generation, and once the JIT compiler decides that a code block is hot enough and generates native code for the generated bytecode, the output is identical to C and C++.
The author actually indicates that every CPU cycle is important for a code block that will be executed for each row. So once you optimize the hot code blocks, you're good to go.
Data access patterns are much more important than hot-code optimization. Sadly, Java offers few options on this front (until maybe Java 9, when value types might become a thing).
Modern CPUs have DRAM fetch times in the hundreds of cycles. Any cache-friendly algorithm is going to run circles around something that plays pointer pinball instead.
This is why bytecode generation is used by query engines. The generated code isn't meant to create ArrayLists or HashMaps; generally, these engines work with buffers instead of objects to avoid the issues you mentioned and to reduce garbage collection pressure.
Let's say we want to compile the predicate expression "bigintColumn > 4 and varcharColumn = 'str'". A generic interpreter would suffer from the issues mentioned above, but if you generate bytecode for the Java source "return longPrimitive > 4 && readAndCompare(buffer, 3, "str".getBytes(UTF8))", then you don't create even a single Java object, and the output is usually identical to C and C++.
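As a rough sketch of what that generated code boils down to (the row layout, field offsets, and helper below are assumptions for illustration, not any particular engine's format): the hot path reads primitives and raw bytes straight out of a reusable buffer, with no per-row object allocation.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class GeneratedPredicateSketch {
        private static final byte[] STR = "str".getBytes(StandardCharsets.UTF_8);

        // Roughly what codegen would emit for "bigintColumn > 4 and varcharColumn = 'str'",
        // assuming the bigint sits at offset 0 and a length-prefixed UTF-8 varchar at offset 8.
        static boolean evaluate(ByteBuffer row) {
            long bigintColumn = row.getLong(0);
            return bigintColumn > 4 && bytesEqual(row, 8, STR);
        }

        // Compare a length-prefixed field in the row buffer against a constant, byte by byte.
        static boolean bytesEqual(ByteBuffer row, int offset, byte[] expected) {
            int len = row.getInt(offset);
            if (len != expected.length) return false;
            for (int i = 0; i < len; i++) {
                if (row.get(offset + 4 + i) != expected[i]) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            ByteBuffer row = ByteBuffer.allocate(64);
            row.putLong(0, 5L);                         // bigintColumn = 5
            row.putInt(8, STR.length);                  // varchar length prefix
            for (int i = 0; i < STR.length; i++) row.put(12 + i, STR[i]);
            System.out.println(evaluate(row));          // true: 5 > 4 and the field equals "str"
        }
    }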
Wouldn't you still pay the bounds-checking penalty on those buffers, though? Also, anything that uses floating point will probably be trashed by int->float conversion (unless JVM bytecode has a load-to-float-from-address instruction, although I freely admit I know less about bytecode than plain Java).
Either way, the average Java dev isn't going to be writing bytecode, so I feel like C/C++ still has the advantage in performance-critical cases.
If you use ByteBuffer, then yes, the application may suffer from unnecessary checks. However, the performance of ByteBuffer is usually not good enough anyway; that's why people use off-heap buffers via sun.misc.Unsafe, whose native calls allocate memory outside the heap.
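For illustration, a minimal off-heap sketch using sun.misc.Unsafe (the reflection hack to obtain the instance is the usual workaround; this is unsupported, JDK-internal API, so treat the details as assumptions about your particular JDK):

    import sun.misc.Unsafe;
    import java.lang.reflect.Field;

    public class OffHeapSketch {
        public static void main(String[] args) throws Exception {
            // Unsafe has no public constructor; grab the singleton via reflection.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            long address = unsafe.allocateMemory(16);   // raw off-heap memory: no bounds checks, no GC
            try {
                unsafe.putLong(address, 42L);           // write a long at the raw address
                long value = unsafe.getLong(address);   // read it back, no index or bounds checking
                System.out.println(value);
            } finally {
                unsafe.freeMemory(address);             // manual lifetime management, malloc/free style
            }
        }
    }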
Also, bytecode has dedicated instructions for all the primitive types; otherwise there would be no point in having primitive types in the Java language at all, since it all gets compiled down to bytecode instructions anyway.
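As a small example of that (the method is made up; the listing is roughly what javap -c prints for it): arithmetic on double primitives compiles to the dedicated double-typed instructions, with no boxing or object allocation anywhere.

    public class PrimitiveBytecode {
        // Pure-primitive arithmetic: stays on the operand stack, never touches the heap.
        static double scaleAndShift(double a, double b) {
            return a * b + 1.0;
        }

        public static void main(String[] args) {
            System.out.println(scaleAndShift(3.0, 4.0));  // 13.0
        }
    }

    // Roughly what `javap -c PrimitiveBytecode` shows for scaleAndShift:
    //   dload_0   dload_2   dmul   dconst_1   dadd   dreturn
    // i.e. double-specific instructions end to end, much like what a C compiler emits for plain doubles.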
There are solutions to all the issues raised, but they take much more work to implement in Java than in C++. However, once you solve this specific problem (I admit it's not a small one), there are lots of benefits to using Java over C++.