Thanks. Unless something is escaping me, that's an optimizer bug. I'm pretty sure the ABI allows you to do whatever you want with the sret pointer, including passing it to another function to chain the return for free.
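To make the chained return concrete, here's a rough sketch (hypothetical names, not the code from the article) of what forwarding the sret pointer buys you:

typedef struct big { long data[16]; } big;   /* large enough to be returned via a hidden sret pointer on x86-64 */

big make_big(void);   /* assumed external function; also returns via sret */

big wrapper(void)
{
    /* A compiler that forwards the sret pointer can hand wrapper's own
       destination straight to make_big, so this return needs no memcpy. */
    return make_big();
}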
I'd wager the optimisations you're talking about happen just fine if blah is a static function, where the compiler can assume nothing from the outside will call this function, so calling conventions can be broken at will.
Seeing as blah isn't a static function, I think the C calling convention that g++ follows ends up dictating a memcpy in this case.
Note that I haven't actually tried this, so no guarantees.
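A quick way to try that theory (untested sketch; the struct layout and signature are guesses, not the article's exact code) is to give the function internal linkage and compare the assembly with and without the static:

typedef struct foo { int x; int pad[15]; } foo;   /* guessed layout, padded so it won't fit in return registers */

/* Stand-in for the article's blah; the signature is a guess. With 'static' it has
   internal linkage, so the compiler is free to ignore the platform calling convention. */
static foo blah(foo a)
{
    a.x++;
    return a;
}

foo call_blah(foo a)
{
    /* At -O2 the static version typically just gets inlined here; removing the
       'static' shows what the externally visible code has to look like. */
    return blah(a);
}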
I don't claim to have any answers, but I found all of this interesting and surprising. I wondered about a couple of things: what happens if do_stuff_to_foo is actually defined (and what happens inside that function)? And is there a difference between value semantics and pointer/reference semantics?
These questions were my takeaway from one of Chandler Carruth's C++ compiler optimization talks; I think it was this one:
https://youtu.be/eR34r7HOU14
My takeaways were that the optimizer gets a huge chunk of its performance from inlining, and that with value semantics it can "cheat like crazy".
So I defined two different variants of do_stuff_to_foo:
Original Pointer/Reference Semantics:
void do_stuff_to_foo(foo* a)
{
    a->x++;
}
Value Semantics:
foo do_stuff_to_foo(foo a)
{
    a.x++;
    return a;
}
In both cases, the compiler emits effectively the same output for C and C++ (I only tested clang); the main difference was name mangling. I've omitted the output for brevity.
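For anyone who wants to reproduce it, here's a self-contained version (the struct definition is my guess at foo, and I renamed the two variants so they can live in one file) that compiles as both C and C++:

typedef struct foo { int x; } foo;   /* guessed definition; the real foo upthread may be larger */

/* Pointer/reference semantics: mutate through the pointer, return nothing. */
void do_stuff_to_foo_ptr(foo* a)
{
    a->x++;
}

/* Value semantics: take a copy, mutate it, return it by value. */
foo do_stuff_to_foo_val(foo a)
{
    a.x++;
    return a;
}

Compiling with clang -O2 -S and clang++ -O2 -S and diffing the output should show the near-identical code (modulo name mangling) described above.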
Having a module boundary in a place where function call overhead is going to be significant is a code smell. C++ and Rust programmers just don't notice because the entire standard library reeks of it.
I guess you could say that RVO is more robust since it's implemented in the frontend and does not rely on finding the optimization after a fair amount of lowering.
Edit: I don't have time to debug this, but at least for LLVM I think this optimization should trigger in the optimizer's memcpy elimination pass ( https://github.com/llvm-mirror/llvm/blob/0818e789cb58fbf6b5e... ).
However I don't see why clang could not simply apply the RVO logic to C code as well.
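For what it's worth, the C-level shape of the case being discussed looks roughly like this (hypothetical names and layout, not the article's actual code); the question is whether the frontend constructs the temporary directly in the caller-provided return slot, the way C++ RVO does, or leaves behind a local plus a copy for the memcpy elimination pass to clean up:

typedef struct foo { int x; int pad[15]; } foo;   /* guessed layout, big enough to be returned via a hidden sret pointer */

foo make_foo(void);   /* assumed external function */

foo get_foo(void)     /* hypothetical stand-in, not a function from the article */
{
    foo tmp = make_foo();
    tmp.x++;
    /* Without the optimization this return copies tmp into the caller's sret
       slot; with it, tmp is built in that slot to begin with and the copy disappears. */
    return tmp;
}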