Bit-testing on ARM64.

Or embracing modern CPU architecture.

In the previous article we looked at how different compilers handle

int isHtmlWhitespace(int ch) {
    return ch == 0x0009 || ch == 0x000A ||
            ch == 0x000C || ch == 0x000D ||
            ch == 0x0020;
}

on the x86 architecture. But as ARM64's domination of the mobile space starts to expand to the server market with solutions like AWS Graviton and to the desktop market with Apple's M1, this time we'll take a look at the AArch64 assembly generated by GCC:

isHtmlWhitespace(int):
        mov     x1, 13824 // x1 = 0b11011000000000
        cmp     w0, 33 // wzr = w0 - 33 and update flags
        movk    x1, 0x1, lsl 32 // x1 |= 1 << 32
        lsr     x0, x1, x0 // x0 = (x1 >> x0)
        and     w0, w0, 1 // discard all but the rightmost bit in w0
        csel    w0, w0, wzr, cc // w0 = (last cmp set unsigned lower flag) ? w0 : 0
        ret

I have annotated all assembly instructions with pseudo-code comments, but even the original code turns out to be fairly concise and readable:

  • cmp w0, 33 makes the unsigned lower condition (cc) true if ch is smaller than 33, since the largest ch that can possibly match is 32 (0x20);
  • mov x1, 13824 and movk x1, 0x1, lsl 32 build a bitset mask in x1 with a 1 at each bit position that corresponds to a matching value of ch (bits 9, 10, 12, 13 and 32);
  • lsr x0, x1, x0 and and w0, w0, 1 select the bit at position ch;
  • finally, csel w0, w0, wzr, cc sets w0 to the bit selected above if ch < 33 and to 0 otherwise; a C equivalent of the whole trick is sketched below.

In addition to being very concise, the generated assembly is also branchless, which is pipeline-friendly: there are no branches to mispredict and therefore no reason to flush the pipeline.
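
To make the bitset trick concrete, here is a C sketch of what GCC's output computes (my reconstruction with a hypothetical function name, not anything the compiler emits):

#include <stdint.h>

/* Bits 9, 10, 12, 13 and 32 - one per matching value of ch.
   The low 14 bits equal 13824 (the mov), bit 32 is the movk. */
static const uint64_t kWhitespaceMask =
        (1ULL << 0x09) | (1ULL << 0x0A) |
        (1ULL << 0x0C) | (1ULL << 0x0D) |
        (1ULL << 0x20);

int isHtmlWhitespaceBitset(int ch) {
    /* The range check mirrors the cmp/csel pair: without it, ch >= 64
       would wrap around the 64-bit shift and pick up a stale bit. */
    return (unsigned)ch < 33 && ((kWhitespaceMask >> ch) & 1);
}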

The assembly generated by Clang does contain branches, just like its x86 version from the previous article:

isHtmlWhitespace(int):                  // @isHtmlWhitespace(int)
        sub     w8, w0, #9                      // =9
        cmp     w8, #5                          // =5
        b.hs    .LBB2_2
        mov     w9, #27
        lsr     w8, w9, w8
        tbnz    w8, #0, .LBB2_3
.LBB2_2:
        cmp     w0, #32                         // =32
        cset    w0, eq
        ret
.LBB2_3:
        mov     w0, #1
        ret

  • sub w8, w0, #9 -> w8 = w0 - 9 shifts the range of interesting ch values from [9, 32] down to [0, 23];
  • cmp w8, #5 and b.hs .LBB2_2 check whether w8 falls outside [0, 4], i.e. whether ch can no longer be 0x0009, 0x000A, 0x000C or 0x000D; if so, the only remaining potential match, 32, is checked with cmp w0, #32 and cset w0, eq;
  • mov w9, #27 creates a bitset mask 0b11011 in w9 that has a 1 at each position corresponding to the w8 values we are interested in - 0, 1, 3 and 4;
  • lsr w8, w9, w8 does w8 = w9 >> w8, moving the bit that corresponds to the transformed value of ch into the rightmost position of w8;
  • and finally tbnz w8, #0, .LBB2_3 branches to .LBB2_3 if w8's rightmost bit is set, setting the return value w0 to true (#1); see the C sketch below.
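
The same shape as a C sketch (again my reconstruction, with a hypothetical name):

int isHtmlWhitespaceBranchy(int ch) {
    unsigned rebased = (unsigned)ch - 9;       /* sub w8, w0, #9 */
    if (rebased < 5 && ((27u >> rebased) & 1)) /* cmp, b.hs, mov, lsr, tbnz */
        return 1;                              /* .LBB2_3 */
    return ch == 32;                           /* cmp w0, #32; cset w0, eq */
}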

Without benchmarks on the target hardware it's hard to say which version is faster, but since Clang's version contains branches and more instructions, it's likely that the GCC version would perform better.
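
If you want to measure it on your own hardware, a micro-benchmark along these lines would do: compile the same file once with GCC and once with Clang and compare the timings. The PRNG, the iteration count and the output format here are arbitrary choices of mine, and after inlining both compilers may generate code that differs from the standalone listings above.

#include <stdio.h>
#include <time.h>

int isHtmlWhitespace(int ch) {
    return ch == 0x0009 || ch == 0x000A ||
            ch == 0x000C || ch == 0x000D ||
            ch == 0x0020;
}

int main(void) {
    struct timespec start, end;
    unsigned x = 12345; /* xorshift PRNG state, so the input varies */
    long sink = 0;      /* accumulated and printed so the loop survives optimization */

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < 100000000L; i++) {
        x ^= x << 13; x ^= x >> 17; x ^= x << 5;
        sink += isHtmlWhitespace((int)(x & 0xFF));
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%.3fs (sink=%ld)\n", elapsed, sink);
    return 0;
}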

Since iOS and M1 applications are usually compiled with Xcode, which uses Clang, it would be interesting to benchmark performance-sensitive applications compiled with GCC to see if it makes a noticeable difference.