Thanks. Unless something is escaping me, that's an optimizer bug. I'm pretty sure the ABI allows you to do whatever you want with the sret pointer, including passing it to another function to chain the return for free.
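To make the chained return concrete, here's a rough sketch (hypothetical names, not the code from the article) of what forwarding the sret pointer buys you:

typedef struct big { long data[16]; } big;   /* large enough to be returned via a hidden sret pointer on x86-64 */

big make_big(void);   /* assumed external function; also returns via sret */

big wrapper(void)
{
    /* A compiler that forwards the sret pointer can hand wrapper's own
       destination straight to make_big, so this return needs no memcpy. */
    return make_big();
}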
I'd wager the optimisations you're talking about happen just fine if blah is a static function, where the compiler can assume nothing from the outside will call this function, so calling conventions can be broken at will.
Seeing as blah isn't a static function, I think the C calling convention that g++ follows ends up dictating a memcpy in this case.
Note that I haven't actually tried this, so no guarantees.
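A quick way to try that theory (untested sketch; the struct layout and signature are guesses, not the article's exact code) is to give the function internal linkage and compare the assembly with and without the static:

typedef struct foo { int x; int pad[15]; } foo;   /* guessed layout, padded so it won't fit in return registers */

/* Stand-in for the article's blah; the signature is a guess. With 'static' it has
   internal linkage, so the compiler is free to ignore the platform calling convention. */
static foo blah(foo a)
{
    a.x++;
    return a;
}

foo call_blah(foo a)
{
    /* At -O2 the static version typically just gets inlined here; removing the
       'static' shows what the externally visible code has to look like. */
    return blah(a);
}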
I don't claim to have any answers, but I found all of this interesting and surprising. I wondered about a couple of things: what happens if do_stuff_to_foo is actually defined (and what happens inside that function)? And is there a difference between value semantics and pointer/reference semantics?
These questions were my takeaway from one of Chandler Carruth's C++ compiler optimization talks; I think it was this one:
https://youtu.be/eR34r7HOU14
My takeaways were that the optimizer gets a huge chunk of its performance from inlining, and that with value semantics it can "cheat like crazy".
So I defined two different variants of do_stuff_to_foo:
Original Pointer/Reference Semantics:
void do_stuff_to_foo(foo* a)
{
    a->x++;
}
Value Semantics:
foo do_stuff_to_foo(foo a)
{
    a.x++;
    return a;
}
In both cases, the compiler emits effectively the same output for C and C++ (I only tested clang); the main difference was name mangling. I've omitted the output for brevity.
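For anyone who wants to reproduce it, here's a self-contained version (the struct definition is my guess at foo, and I renamed the two variants so they can live in one file) that compiles as both C and C++:

typedef struct foo { int x; } foo;   /* guessed definition; the real foo upthread may be larger */

/* Pointer/reference semantics: mutate through the pointer, return nothing. */
void do_stuff_to_foo_ptr(foo* a)
{
    a->x++;
}

/* Value semantics: take a copy, mutate it, return it by value. */
foo do_stuff_to_foo_val(foo a)
{
    a.x++;
    return a;
}

Compiling with clang -O2 -S and clang++ -O2 -S and diffing the output should show the near-identical code (modulo name mangling) described above.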
Having a module boundary in a place where function call overhead is going to be significant is a code smell. C++ and Rust programmers just don't notice because the entire standard library reeks of it.
I guess you could say that RVO is more robust since it's implemented in the frontend and does not rely on finding the optimization after a fair amount of lowering.
Edit: I don't have time to debug this, but at least for LLVM I think this optimization should trigger in the optimizer's memcpy elimination pass ( https://github.com/llvm-mirror/llvm/blob/0818e789cb58fbf6b5e... ).
However I don't see why clang could not simply apply the RVO logic to C code as well.
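For what it's worth, the C-level shape of the case being discussed looks roughly like this (hypothetical names and layout, not the article's actual code); the question is whether the frontend constructs the temporary directly in the caller-provided return slot, the way C++ RVO does, or leaves behind a local plus a copy for the memcpy elimination pass to clean up:

typedef struct foo { int x; int pad[15]; } foo;   /* guessed layout, big enough to be returned via a hidden sret pointer */

foo make_foo(void);   /* assumed external function */

foo get_foo(void)     /* hypothetical stand-in, not a function from the article */
{
    foo tmp = make_foo();
    tmp.x++;
    /* Without the optimization this return copies tmp into the caller's sret
       slot; with it, tmp is built in that slot to begin with and the copy disappears. */
    return tmp;
}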