> Note also that C doesn't have return-value-optimization, hence all your struct...

BeeOnRope · on June 3, 2018

That was exactly my thought as well, but the examples seem to show otherwise (at least on gcc and clang[1]).

The compilers are using basically the same underlying optimizer and back-end with different front-ends, and since in C there are no "user-defined constructors" and no destructors, one would expect that you don't need any special RVO rule in C: the compiler can simply observe that a local object is returned and construct it in-place as necessary.

Thinking about this example, this may not be the case: distinct objects have to have distinct addresses, right? So in C you might not be able to make this optimization since the do_stuff_to_foo method (a black box to the compiler) could save its argument, and the caller of blah() could see that the argument it passed has the same address as the local f object in blah, a violation of "distinct objects, distinct addresses".

C++ has the RVO escape hatch for this: it is expected that some objects that appear distinct in the source may not actually be distinct if they fit the RVO (or NRVO) pattern - but C does not. So perhaps gcc and clang are doing the right thing here.
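To make the concern concrete, here is a minimal sketch of a program that could observe the aliasing. The layout of foo (an int plus padding, 1036 bytes to match the memcpy size seen elsewhere in the thread), the leak helper and the body of do_stuff_to_foo are my assumptions, not the code from the GP post.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: the thread never shows foo, only a 1036-byte memcpy. */
typedef struct foo {
    int  x;
    char pad[1032];
} foo;

static const foo *leaked = NULL;   /* address the caller let escape */
static int saw_same_address = 0;   /* set if two live objects shared an address */

/* Hypothetical helper: lets the caller's object escape. */
void leak(const foo *p) { leaked = p; }

/* Stand-in for the black-box function: compares its argument with the
   leaked address while both objects are still alive. */
void do_stuff_to_foo(foo *f) {
    if (f == leaked)
        saw_same_address = 1;
    f->x++;
}

foo blah(void) {
    foo f = {0};
    do_stuff_to_foo(&f);
    return f;
}

/* Returns 1 if blah's local f was observed at the address of the caller's x. */
int caller_observes_alias(void) {
    foo x = {0};
    saw_same_address = 0;
    leak(&x);
    x = blah();
    assert(x.x == 1);
    leaked = NULL;   /* do not keep a pointer to a dead object around */
    return saw_same_address;
}
```

If blah were allowed to build f directly in the caller's x, do_stuff_to_foo would see two live objects at one address. As far as I can tell, a conforming C compiler has to make caller_observes_alias() return 0.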

---

[1] All numbered versions of clang up to 6.0 seem to behave the way indicated in the GP post, but trunk in godbolt, which shows version as 7.0.0 (trunk 333657) compiles C efficiently like C++.

obl · on June 3, 2018

very good point on the "addresses compare == iff same object" rule.

In that case though, I think clang is right to optimize the callee (but it does introduce a problem in the caller):

the only place you could do the equality check and observe the rule being broken is before the callee returns since the lifetime of its variable is bound to the call.

It seems that clang will not let the return pointer alias a local in the caller except when the call is the initialization of said local.

So if the caller goes :

foo x; leak(&x); x = returns_foo();

the memory will be temporary stack (and then memcpy), thus upholding the rule. (and it seems to me that this inefficiency is really required to respect the standard if we actually leak the pointer)

in the case :

foo x = returns_foo();

clang will pass the actual address of x down but that's before the object exists (and its address cannot be known yet) so the rule is still fine.

I stand corrected though, this does mean that RVO would be useful for C as well, as a way to relax the aliasing rule.

edit: nevermind that, in the first case it's perfectly legal to read/write foo through the pointer downstream so you cannot make the optimization anyway.
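A rough sketch of why the temporary is needed in the first case: the callee can read the old value of x through the leaked pointer after it has already started writing its result. The field y, the leak helper and the body of returns_foo are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: y is added only to carry the observed value back. */
typedef struct foo {
    int  x;
    int  y;
    char pad[1028];
} foo;

static foo *leaked = NULL;

/* Hypothetical helper: lets the caller's object escape. */
void leak(foo *p) { leaked = p; }

foo returns_foo(void) {
    foo r = {0};
    r.x = 100;          /* write the result first ... */
    r.y = leaked->x;    /* ... then read the caller's x through the leaked pointer */
    return r;
}

/* Returns the value of x.x that the callee saw during the call. */
int old_value_seen_by_callee(void) {
    foo x = {0};
    x.x = 7;
    leak(&x);
    x = returns_foo();
    leaked = NULL;
    assert(x.x == 100);
    return x.y;
}
```

If r were constructed directly in x, the read of leaked->x would see 100 instead of 7, so the copy through a temporary is what keeps the program's meaning intact here.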

BeeOnRope · on June 3, 2018

Yes, the same thought occurred to me (that perhaps clang is careful in the caller in the case the address escapes), but I seemed to find cases where clang optimizes the caller also, so that two distinct objects receive the same pointer and both pointers escape.

Here's an example:

https://godbolt.org/g/yxFzqT

This happens on clang versions back to 3.6.

Note that if you change the caller to:

    Foo f;
    f = callee(&f);

the code changes and distinct objects are passed. I'm not sure if the first form (all in the definition) has a relevant difference per the standard that lets clang do this.

josephg · on June 3, 2018

That seems to match the logic clang-trunk is using:

If the assignment is in the initializer then it's using the simple C++ RVO-style call to blah():

    void zot() {
        foo f = blah();
    }

    zot:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 1040
        lea     rdi, [rbp - 1040]
        call    blah
        add     rsp, 1040
        pop     rbp
        ret

But if the variable has the opportunity to leak then it adds a memcpy in the caller:

    void zot() {
        foo f;
        leak(&f);
        f = blah();
    }

    zot:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 2080
        lea     rdi, [rbp - 1040]
        call    leak
        lea     rdi, [rbp - 2080]
        call    blah
        mov     eax, 1036
        mov     edx, eax
        lea     rdi, [rbp - 1040]
        lea     rcx, [rbp - 2080]
        mov     rsi, rcx
        call    memcpy
        add     rsp, 2080
        pop     rbp
        ret

Although weirdly the memcpy is removed when optimizations are turned on. This may be a bug.

jcelerier · on June 2, 2018

> There is no need for RVO in C.

I updated my comment with some assembly.

obl · on June 2, 2018

Thanks. Unless something is escaping me, that's an optimizer bug. I'm pretty sure the ABI allows you to do whatever you want with the sret pointer, including passing it to another function to chain return for free.

I guess you could say that RVO is more robust since it's implemented in the frontend and does not rely on finding the optimization after a fair amount of lowering.

Edit: I don't have time to debug this but at least for llvm, I think this optimization should trigger in the optimizer's memcpy elimination pass ( https://github.com/llvm-mirror/llvm/blob/0818e789cb58fbf6b5e... ).

However I don't see why clang could not simply apply the RVO logic to C code as well.

tomsmeding · on June 2, 2018

I'd wager the optimisations you're talking about happen just fine if blah is a static function, where the compiler can assume nothing from the outside will call this function, so calling conventions can be broken at will.

Seeing as blah isn't a static function, I think the calling convention for C that g++ uses somehow dictates that a memcpy is to be used in this case.

Note that I haven't actually tried this, so no guarantees.

dottrap · on June 3, 2018

I don't claim to have any answers, but I found all of this interesting and surprising. I wondered about a couple of things. What happens if the do_stuff_to_foo is actually defined (and what happens in that actual function)? And is there a difference between value semantics and pointer/reference semantics?

These questions were my takeaway from one of Chandler Carruth's C++ compiler optimization talks; I think it was this one: https://youtu.be/eR34r7HOU14

My takeaways were that the optimizer gets a huge chunk of its performance by inlining. And with value semantics, the optimizer can "cheat like crazy".

So I defined two different variants for do_stuff_to_foo

Original Pointer/Reference Semantics:

    void do_stuff_to_foo(foo* a)
    {
        a->x++;
    }

Value Semantics:

    foo do_stuff_to_foo(foo a)
    {
        a.x++;
        return a;
    }

In both cases, the compiler emits effectively the same output for C and C++ (I only tested clang.) (The main difference was name mangling. I omit stuff for brevity.)

Pointer/Reference Semantics:

    Lcfi2:
        .cfi_def_cfa_register %rbp
        incl	(%rdi)
        popq	%rbp
        retq
        .cfi_endproc
                                        ## -- End function
        .globl	_blah                   ## -- Begin function blah
        .p2align	4, 0x90
    _blah:                                  ## @blah
        .cfi_startproc
    ## BB#0:
        pushq	%rbp
    Lcfi3:
        .cfi_def_cfa_offset 16
    Lcfi4:
        .cfi_offset %rbp, -16
        movq	%rsp, %rbp
    Lcfi5:
        .cfi_def_cfa_register %rbp
        movq	%rdi, %rax
        popq	%rbp
        retq
        .cfi_endproc

Value Semantics:

    Lcfi3:
        .cfi_offset %rbx, -24
        movq	%rdi, %rbx
        incl	16(%rbp)
        leaq	16(%rbp), %rsi
        movl	$1036, %edx             ## imm = 0x40C
        callq	_memcpy
        movq	%rbx, %rax
        addq	$8, %rsp
        popq	%rbx
        popq	%rbp
        retq
        .cfi_endproc

What I find interesting here is that in both C and C++, the memcpy now appears. And in the C case, there is still only one memcpy, not two.

So as I said, I don't have any answers and really don't know what the takeaway is. But RVO no longer seems to be a factor in these variants.
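For anyone who wants to reproduce this, here is a rough sketch of how the two variants might be wired into blah in a single file. The struct layout, the _ptr/_val suffixes (needed only so both variants can coexist) and the way blah_val calls the value version are assumptions for illustration, not the exact code I compiled.

```c
#include <assert.h>

/* Assumed layout: 1036 bytes, matching the memcpy size in the output above. */
typedef struct foo {
    int  x;
    char pad[1032];
} foo;

/* Pointer/reference semantics: mutate the caller's object in place. */
void do_stuff_to_foo_ptr(foo *a) {
    a->x++;
}

/* Value semantics: take a copy, modify it, hand it back. */
foo do_stuff_to_foo_val(foo a) {
    a.x++;
    return a;
}

foo blah_ptr(void) {
    foo f = {0};
    do_stuff_to_foo_ptr(&f);
    return f;
}

foo blah_val(void) {
    foo f = {0};
    f = do_stuff_to_foo_val(f);
    return f;
}

/* Both variants should produce the same observable result. */
void check_variants(void) {
    assert(blah_ptr().x == 1);
    assert(blah_val().x == 1);
}
```

Compiling this as both C and C++ and diffing the assembly should show whether the memcpy counts match on a given compiler version.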

adrianratnapala · on June 3, 2018

Having a module boundary in a place where function call overhead is going to be significant is a code smell. C++ and Rust programmers just don't notice because the entire standard library reeks of it.