This is really great work, and it's awesome that they're developing this as open source. Combined with the results from the HyPer folks, it sure is starting to look like using LLVM to specialize code on the fly is a good idea for any data processing engine.
Looking more closely at the benchmarking results has me scratching my head, though: their reported 16x performance benefit from codegen on TPC-H Q1 seemingly shrinks to about 2x once you compare against the [REDACTED] database. What's happening?
My guess is that Impala is still somewhat inefficient in a few places that need work (which is OK; this isn't a criticism). I bet that [REDACTED] is quite efficient, having been in development for at least 2x longer than Impala, maybe closer to 10x. In which case, getting within 2x is fantastic!
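To see why the two figures aren't contradictory, here's a back-of-envelope sketch with entirely made-up runtimes (none of these numbers come from the published benchmarks):

    # Hypothetical Q1 runtimes, in seconds, purely for illustration.
    mature_engine = 1.0                       # assume the mature engine runs Q1 in 1s
    impala_codegen = 2.0                      # Impala with codegen lands within 2x of it
    impala_no_codegen = impala_codegen * 16   # apply the reported 16x codegen speedup backwards

    # The interpreted baseline would then be ~32x behind the mature engine.
    print(impala_no_codegen / mature_engine)  # 32.0

In other words, most of the 16x speedup goes toward closing a gap the mature engine never had in the first place, which is exactly what you'd expect if that engine's non-codegen paths are already well tuned.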