Most JVM-based query engines use bytecode generation, and once the JIT compiler decides that a code block is hot enough and compiles the generated bytecode to native code, the output is usually identical to C and C++.
The author actually indicates that every CPU cycle is important for a code block that is executed for each row. So once you optimize the hot code blocks, you're good to go.
Data access patterns are much more important than hot code optimization. Sadly, Java offers few options on this front (until maybe Java 9, when value types might become a thing).
Modern CPUs have DRAM fetch times in the hundreds of cycles. Any cache-friendly algorithm is going to run circles around something that plays pointer pinball instead.
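To make the "pointer pinball" point concrete, here is a minimal sketch of the same sum over two layouts. The class and method names (`LayoutSketch`, `sumFlat`, `sumLinked`) are made up for illustration, and the example makes no timing claim: how far apart the nodes actually end up depends on the allocator and the garbage collector.

```java
final class LayoutSketch {
    /** One heap object per value; each step has to follow a reference. */
    static final class Node {
        final long value;
        final Node next;

        Node(long value, Node next) {
            this.value = value;
            this.next = next;
        }
    }

    /** Builds a linked chain holding 1..n in ascending order. */
    static Node linkedRange(int n) {
        Node head = null;
        for (long v = n; v >= 1; v--) {
            head = new Node(v, head);
        }
        return head;
    }

    /** Builds a flat primitive array holding 1..n. */
    static long[] flatRange(int n) {
        long[] values = new long[n];
        for (int i = 0; i < n; i++) {
            values[i] = i + 1;
        }
        return values;
    }

    /** Pointer chasing: the next address is only known after the current load. */
    static long sumLinked(Node head) {
        long sum = 0;
        for (Node node = head; node != null; node = node.next) {
            sum += node.value;
        }
        return sum;
    }

    /** Sequential scan over contiguous memory, which hardware prefetchers handle well. */
    static long sumFlat(long[] values) {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println(sumFlat(flatRange(n)));
        System.out.println(sumLinked(linkedRange(n)));
    }
}
```

Both methods compute the same result; the difference is only in how memory is touched, which is the part Java gives you little control over once you work with objects rather than primitive arrays or buffers.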
This is why bytecode generation is used by query engines. It isn't meant to be used for creating ArrayList or HashMap instances. Generally, these engines work with buffers instead of objects to avoid the issues you mentioned and garbage collection pressure.
Let's say we want to compile the predicate expression "bigintColumn > 4 and varcharColumn = 'str'". A generic interpreter would suffer from the issues you mentioned, but if you generate bytecode for the Java source "return longPrimitive > 4 && readAndCompare(buffer, 3, "str".getBytes(UTF8))", then you won't create even a single Java object, and the output is usually identical to C and C++.
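As a rough illustration of that predicate, here is what the specialized code might look like if written by hand instead of emitted as bytecode. The row layout (an 8-byte bigint followed by a 4-byte length and the UTF-8 bytes), the class name `PredicateSketch`, the `encodeRow` helper, and the exact `readAndCompare` signature are all assumptions for this sketch, not how any particular engine does it. The constant is encoded once up front so the per-row path allocates nothing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class PredicateSketch {
    // Constant from the query text, encoded once rather than once per row.
    private static final byte[] STR = "str".getBytes(StandardCharsets.UTF_8);

    // Hypothetical row layout: [8-byte bigint][4-byte length][UTF-8 bytes].
    static final int BIGINT_OFFSET = 0;
    static final int VARCHAR_OFFSET = Long.BYTES;

    /** Hand-written equivalent of: bigintColumn > 4 and varcharColumn = 'str'. */
    static boolean matches(ByteBuffer row) {
        long bigintColumn = row.getLong(BIGINT_OFFSET);
        return bigintColumn > 4 && readAndCompare(row, VARCHAR_OFFSET, STR);
    }

    /** Compares the length-prefixed bytes at offset with expected, without allocating. */
    static boolean readAndCompare(ByteBuffer buffer, int offset, byte[] expected) {
        int length = buffer.getInt(offset);
        if (length != expected.length) {
            return false;
        }
        int start = offset + Integer.BYTES;
        for (int i = 0; i < length; i++) {
            if (buffer.get(start + i) != expected[i]) {
                return false;
            }
        }
        return true;
    }

    /** Helper for trying the sketch out: encodes one row in the assumed layout. */
    static ByteBuffer encodeRow(long bigint, String varchar) {
        byte[] bytes = varchar.getBytes(StandardCharsets.UTF_8);
        ByteBuffer row = ByteBuffer.allocate(Long.BYTES + Integer.BYTES + bytes.length);
        row.putLong(bigint).putInt(bytes.length).put(bytes);
        return row;
    }

    public static void main(String[] args) {
        System.out.println(matches(encodeRow(5, "str")));
        System.out.println(matches(encodeRow(4, "str")));
    }
}
```

The point of specializing is that the comparison operator, the column offsets, and the constant are all fixed at generation time, so the JIT sees straight-line primitive code rather than a tree of interpreter nodes.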
Wouldn't you still pay the bounds-checking penalty on those buffers, though? Also, anything that uses floating point will probably be trashed by int->float conversion (unless JVM bytecode has a load-to-float-from-address instruction, although I freely admit that I know less about bytecode than plain Java).
Either way, the average Java dev isn't going to be writing bytecode, so I feel like C/C++ still has the advantage in performance cases.
If you use ByteBuffer, then yes, the application may suffer from unnecessary checks. However, the performance of ByteBuffer is usually not good enough anyway. That's why people use off-heap buffers via sun.misc.Unsafe, which allocates memory outside the Java heap through a native call.
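For reference, a minimal sketch of the off-heap approach using sun.misc.Unsafe follows. The class name `OffHeapSketch` and the `sumOffHeap` method are made up for illustration. It assumes a JDK where the unsupported `sun.misc.Unsafe` class is still reachable via reflection; recent JDK releases deprecate its memory-access methods and may print warnings, so treat this as a picture of the technique rather than a recommendation.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

final class OffHeapSketch {
    private static final Unsafe UNSAFE = loadUnsafe();

    /** Fetches the singleton through reflection, since the public factory rejects application code. */
    private static Unsafe loadUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not available", e);
        }
    }

    /** Writes 1..count as longs into off-heap memory, sums them back, and frees the block. */
    static long sumOffHeap(int count) {
        long address = UNSAFE.allocateMemory((long) count * Long.BYTES);
        try {
            for (int i = 0; i < count; i++) {
                // No bounds check here: a wrong offset corrupts memory or crashes the JVM.
                UNSAFE.putLong(address + (long) i * Long.BYTES, (long) i + 1);
            }
            long sum = 0;
            for (int i = 0; i < count; i++) {
                sum += UNSAFE.getLong(address + (long) i * Long.BYTES);
            }
            return sum;
        } finally {
            // Off-heap memory is invisible to the garbage collector and must be freed by hand.
            UNSAFE.freeMemory(address);
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOffHeap(1000));
    }
}
```

The absence of bounds checks and GC involvement is exactly what makes this fast and exactly what makes it easy to get wrong, which is part of the "too much work" trade-off.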
Also, bytecode has instructions for all primitive types. Otherwise there wouldn't be any point in having these primitive types in the Java language, since Java code is itself compiled to bytecode instructions.
There are solutions for all the issues mentioned, but they need too much work to implement in Java compared to C++. However, once you solve this specific problem (I admit that it's not a small one), there are lots of benefits to using Java over C++.