Taras Tsugrii
Software Bits

We still know our code better than compilers.

Or a case of an unnecessary CPU lock instruction

Taras Tsugrii
·Apr 20, 2021·


Compilers brought a huge productivity and performance boost thanks to their ability to translate high-level abstractions into highly optimized low-level instructions. In fact, they are so good at optimizing our code that we expect them to understand it even better than we do, and for the most part that is a reasonable expectation. At the same time, it's important to remember that we can still reason about our code better than a compiler can, so when we expect a particular optimization, we should always verify it by checking the generated assembly. Take the following code snippet as an example:

#include <atomic>
using namespace std;

int trivial_inc() {
    atomic<int> num;
    return num.fetch_add(1);
}

Even though num is an atomic int, it is a local variable that never escapes trivial_inc, so no other thread can observe it. Assuming it starts at 0 (the default constructor guarantees this only since C++20) and remembering that fetch_add returns the value held before the addition, we'd expect the compiler to turn this code into something like

int trivial_inc() {
    int num = 0;
    return num++;
}

which can be further simplified to

int trivial_inc() {
    return 0;
}

But here is what clang with -O3 produces:

trivial_inc():                       # @trivial_inc()
        mov     eax, 1
        lock            xadd    dword ptr [rsp - 8], eax

While it's able to remove most traces of atomic<int>, notice that it still updates the value with an unnecessary lock xadd instruction.
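Until compilers perform this optimization for us, the fix has to be made by hand. Below is a minimal sketch of that idea, not a definitive recipe: the names trivial_inc_atomic, trivial_inc_plain and check_equivalence are made up for illustration, and the atomic is explicitly initialized to 0 (an assumption added here) so the result is well defined under every C++ standard. Since fetch_add returns the value held before the addition, the plain version uses post-increment to match it.

```cpp
#include <atomic>
#include <cassert>

// The shape from the article: a function-local atomic that never escapes.
// Explicit {0} is an assumption added here so the value is defined pre-C++20.
int trivial_inc_atomic() {
    std::atomic<int> num{0};
    return num.fetch_add(1);  // yields the value held before the increment
}

// Hand-written replacement (hypothetical name): no other thread can see num,
// so a plain int is enough and no atomic read-modify-write is required.
int trivial_inc_plain() {
    int num = 0;
    return num++;  // post-increment also yields the previous value
}

// Hypothetical helper: checks that both versions return the same result.
void check_equivalence() {
    assert(trivial_inc_atomic() == trivial_inc_plain());
}
```

Both functions return the same value, so the rewrite is safe whenever the variable is provably confined to a single thread. Whether the plain version actually compiles down to a constant is still worth confirming in the generated assembly.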

Surely Rust, with all its superior zero-cost abstractions, will not repeat the same mistake, or will it? Let's check:

use std::sync::atomic::{AtomicUsize, Ordering};

fn trivial_atomic() -> usize {
    let count: AtomicUsize = AtomicUsize::new(0);
    count.fetch_add(1, Ordering::SeqCst)
}

Oh no, we can see the same thing:

    movl    $1, %eax
    lock        xaddq    %rax, 24(%rsp)

Oh well, it's nice to know that we should expect more wins from our compilers in the future. To wrap up, I couldn't resist checking what Go would do in this case:

package main

import "sync/atomic"

func trivial_inc() uint64 {
    var num uint64 = 0
    atomic.AddUint64(&num, 1)
    return num
}

And oh no:

v12 00003 (+6) LEAQ type.uint64(SB), AX
v14 00004 (6) MOVQ AX, (SP)
v7   00005 (6) PCDATA $1, $0
v7   00006 (6) CALL runtime.newobject(SB)
v9   00007 (6) MOVQ 8(SP), AX
v18 00008 (+7) MOVL $1, CX
v10 00009 (7) LOCK
v10 00010 (7) XADDQ CX, (AX)
v15 00011 (+8) MOVQ (AX), AX
v17 00012 (8) MOVQ AX, "".~r0(SP)
b1 00013 (8) RET
     00014 (?) END

So in addition to using an unnecessary LOCK, Go's compiler also allocates num on the heap (CALL runtime.newobject(SB)) :(

Well, nothing is perfect and I still admire compilers, but for things that matter, we should always look under the hood to make sure there are no surprises.
