In the previous article we looked at how different compilers handle

    int isHtmlWhitespace(int ch) {
        return ch == 0x0009 || ch == 0x000A ||
               ch == 0x000C || ch == 0x000D ||
               ch == 0x0020;
    }

on the x86 CPU architecture. But as ARM64's domination of the mobile space starts to extend to the server market with solutions like AWS Graviton, and to the desktop market with Apple's M1, this time we'll take a look at the AArch64 assembly generated by GCC:

    isHtmlWhitespace(int):
        mov   x1, 13824          ; x1 = 0b11011000000000
        cmp   w0, 33             ; wzr = w0 - 33 and update flags
        movk  x1, 0x1, lsl 32    ; x1 |= 1 << 32
        lsr   x0, x1, x0         ; x0 = (x1 >> x0)
        and   w0, w0, 1          ; discard all but the rightmost bit in w0
        csel  w0, w0, wzr, cc    ; w0 = (last cmp set unsigned lower flag) ? w0 : 0
        ret

I have annotated every assembly instruction with a pseudo-code comment, but even the original code turned out to be fairly concise and readable:

- `cmp w0, 33` sets the flags so that the unsigned lower condition (`cc`) holds if `ch` is smaller than `33`, since the largest `ch` that can possibly match is `32` (`0x20`). The comparison is unsigned, so negative values of `ch` fail it as well;
- `mov x1, 13824` and `movk x1, 0x1, lsl 32` create a bitset mask in `x1` where `1` is set at the bit positions that correspond to the matching values of `ch`: bits 9, 10, 12, 13 and 32, counting from zero;
- `lsr x0, x1, x0` and `and w0, w0, 1` select the bit at position `ch`, which is the `ch + 1`th bit when counting from one;
- finally, `csel w0, w0, wzr, cc` sets `w0` to the bit selected above if `ch < 33`, and to `0` otherwise.

In addition to being very concise, the generated assembly is also branchless, which makes it pipeline friendly: there are no branches to mispredict and therefore no reason to flush the pipeline.
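Translated back into C, the GCC output corresponds roughly to the sketch below. This is a hand-written rendering rather than anything the compiler emits: the name `isHtmlWhitespaceBitset` is made up for illustration, and the explicit `& 63` is there because `lsr` with a register operand only uses the low six bits of the shift amount, whereas an out-of-range shift in C would be undefined.

```c
#include <stdint.h>

/* Hypothetical name; a hand-written C rendering of GCC's AArch64 output. */
int isHtmlWhitespaceBitset(int ch) {
    /* Bits 9, 10, 12, 13 (mov x1, 13824) plus bit 32 (movk x1, 0x1, lsl 32). */
    const uint64_t mask = UINT64_C(13824) | (UINT64_C(1) << 32);
    /* lsr + and: move bit number `ch` to position 0 and isolate it. */
    const int bit = (int)((mask >> ((unsigned)ch & 63u)) & 1u);
    /* cmp + csel: keep the bit only if ch < 33 in an unsigned comparison. */
    return (unsigned)ch < 33u ? bit : 0;
}
```

The unsigned comparison does double duty here: it rejects values above `32` and negative values in a single test, which is why one `cmp` is enough.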
The assembly generated by Clang does contain branches, just like its x86 version from the previous article:

    isHtmlWhitespace(int):                  // @isHtmlWhitespace(int)
            sub     w8, w0, #9              // =9
            cmp     w8, #5                  // =5
            b.hs    .LBB2_2
            mov     w9, #27
            lsr     w8, w9, w8
            tbnz    w8, #0, .LBB2_3
    .LBB2_2:
            cmp     w0, #32                 // =32
            cset    w0, eq
            ret
    .LBB2_3:
            mov     w0, #1
            ret

- `sub w8, w0, #9` -> `w8 = w0 - 9` to shift the range of possible `ch` values from `[9, 32]` to `[0, 23]`;
- `cmp w8, #5` and `b.hs .LBB2_2` check whether `w8` can be one of `0x0009`, `0x000A`, `0x000C`, `0x000D` in the new `[0, 23]` range, that is, whether `w8` is below `5` as an unsigned value. If that's not the case, the only potential match left is `32`, which is checked using `cmp w0, #32` and `cset w0, eq`;
- `mov w9, #27` creates a bitset mask `0b11011` in `w9` that has `1` at each position corresponding to the `w8` values we are interested in: `0`, `1`, `3` and `4`;
- `lsr w8, w9, w8` does `w8 = w9 >> w8`, ensuring that the mask bit for the transformed value of `ch` ends up at the rightmost position of `w8`;
- and finally, `tbnz w8, #0, .LBB2_3` branches to `.LBB2_3` if the rightmost bit of `w8` is set, where the return value `w0` is set to `true` (`#1`). Otherwise execution falls through to `.LBB2_2`, which returns `0` for the one remaining value, `11`.
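The same control flow can be written back in C, as a rough sketch of what the Clang output does. The function name `isHtmlWhitespaceRange` is hypothetical, and the comments map each statement to the instructions above as I read them.

```c
/* Hypothetical name; a hand-written C rendering of Clang's AArch64 output. */
int isHtmlWhitespaceRange(int ch) {
    /* sub w8, w0, #9: unsigned wrap-around sends values below 9 far above 5. */
    const unsigned shifted = (unsigned)ch - 9u;
    /* cmp + b.hs, then mov + lsr + tbnz: test bit `shifted` of 0b11011. */
    if (shifted < 5u && ((27u >> shifted) & 1u)) {
        return 1; /* .LBB2_3 */
    }
    /* .LBB2_2: the only remaining candidate is the space character. */
    return ch == 32;
}
```

Compared with the GCC variant, the mask here fits in five bits because the range was shifted down by 9 first, at the cost of handling `32` on a separate path.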
Without benchmarks on the target hardware it's hard to say which version is faster, but since Clang's version contains branches and more instructions, it's likely that GCC's version would perform better.
Since iOS and M1 applications are usually compiled with Xcode, which uses Clang, it would be interesting to benchmark performance-sensitive applications with GCC to see if it makes a noticeable difference.
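For anyone who wants to try such a comparison, a minimal timing harness might look like the sketch below, built once with each compiler. The helper names `countHtmlWhitespace` and `timeHtmlWhitespace` are made up for this example, and the approach has known limits: `clock()` is coarse, and an optimizing compiler may inline, vectorize or hoist the loop, so the generated code should be inspected before trusting the numbers.

```c
#include <stddef.h>
#include <time.h>

int isHtmlWhitespace(int ch) {
    return ch == 0x0009 || ch == 0x000A ||
           ch == 0x000C || ch == 0x000D ||
           ch == 0x0020;
}

/* Hypothetical helper: counts whitespace bytes so the results are actually used. */
size_t countHtmlWhitespace(const unsigned char *buf, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        count += (size_t)isHtmlWhitespace(buf[i]);
    }
    return count;
}

/* Hypothetical helper: CPU seconds spent on `rounds` passes over the buffer.
   The accumulated count is written to *total so the work cannot be discarded. */
double timeHtmlWhitespace(const unsigned char *buf, size_t len,
                          int rounds, size_t *total) {
    size_t sum = 0;
    const clock_t start = clock();
    for (int r = 0; r < rounds; r++) {
        sum += countHtmlWhitespace(buf, len);
    }
    const clock_t end = clock();
    *total = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Feeding it a large buffer of real HTML, rather than synthetic data, would exercise the branch predictor in a way closer to production use.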