In the previous post we ended up with Go code for both the array-of-structs and struct-of-arrays layouts:
const userMetadataSize = 8

type userMetadata [userMetadataSize]byte

type User struct {
    metadata userMetadata
    age      uint8
}

type Users []User

type ParallelUsers struct {
    metadata []userMetadata
    ages     []uint8
}
func (users Users) TotalAge() uint64 {
    var total uint64 = 0
    for _, user := range users {
        total += uint64(user.age)
    }
    return total
}
func (users ParallelUsers) TotalAge() uint64 {
    var total uint64 = 0
    for _, age := range users.ages {
        total += uint64(age)
    }
    return total
}
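The two layouts hold the same data, just transposed. As a minimal sketch (my addition, with a hypothetical ToParallel name; it was not part of the original code), converting from one layout to the other looks like this:
// ToParallel transposes an array of structs into a struct of
// parallel arrays holding the same data.
func (users Users) ToParallel() ParallelUsers {
    p := ParallelUsers{
        metadata: make([]userMetadata, 0, len(users)),
        ages:     make([]uint8, 0, len(users)),
    }
    for _, u := range users {
        p.metadata = append(p.metadata, u.metadata)
        p.ages = append(p.ages, u.age)
    }
    return p
}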
In theory TotalAge should be faster for ParallelUsers thanks to more efficient cache usage and better prefetching. Conveniently, Go comes with built-in benchmarking support, so let's write some tests:
package main

import (
    "fmt"
    "math/rand"
    "testing"
)

func generateAgesSlice(n int) []uint8 {
    s := make([]uint8, 0, n)
    for i := 0; i < n; i++ {
        s = append(s, uint8(rand.Uint32()%256))
    }
    return s
}
func BenchmarkUserAgesTotal(b *testing.B) {
    for _, sliceSize := range [...]int{100, 1000, 10000} {
        sz := sliceSize
        b.Run(fmt.Sprintf("total of %d user ages", sz), func(b *testing.B) {
            ages := generateAgesSlice(sz)
            users := make([]User, 0, sz)
            for i := 0; i < sz; i++ {
                users = append(users, User{age: ages[i]})
            }
            // Exclude setup cost from the measurement.
            b.ResetTimer()
            for n := 0; n < b.N; n++ {
                Users(users).TotalAge()
            }
        })
    }
}
func BenchmarkPUserAgesTotal(b *testing.B) {
    for _, sliceSize := range [...]int{100, 1000, 10000} {
        sz := sliceSize
        b.Run(fmt.Sprintf("total of %d parallel user ages", sz), func(b *testing.B) {
            ages := generateAgesSlice(sz)
            users := ParallelUsers{ages: ages}
            // Exclude setup cost from the measurement.
            b.ResetTimer()
            for n := 0; n < b.N; n++ {
                users.TotalAge()
            }
        })
    }
}
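Assuming the benchmark code above is saved in a _test.go file (for example main_test.go; the exact file name is my assumption), the benchmarks can be run with Go's standard tooling:
go test -bench=.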
To establish a baseline and isolate the impact on cache efficiency, let's first set userMetadataSize to 0. Unsurprisingly, the two versions produce very close results:
goos: darwin
goarch: amd64
pkg: example.com/testing
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkUserAgesTotal/total_of_100_user_ages-16 24067093 48.07 ns/op
BenchmarkUserAgesTotal/total_of_1000_user_ages-16 3039406 401.6 ns/op
BenchmarkUserAgesTotal/total_of_10000_user_ages-16 286630 3817 ns/op
BenchmarkPUserAgesTotal/total_of_100_parallel_user_ages-16 25159890 46.80 ns/op
BenchmarkPUserAgesTotal/total_of_1000_parallel_user_ages-16 3080486 405.8 ns/op
BenchmarkPUserAgesTotal/total_of_10000_parallel_user_ages-16 311725 3875 ns/op
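As an aside, when comparing runs like these it helps to save each run's output and diff the files with benchstat from golang.org/x/perf, which reports deltas along with their statistical significance (this tool is not used in this post, and the file names below are placeholders):
go test -bench=. -count=10 > metadata0.txt
# change userMetadataSize, then:
go test -bench=. -count=10 > metadata4.txt
benchstat metadata0.txt metadata4.txt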
Increasing it to just 4 bytes results in a ~25% drop in performance for the array-of-structs version:
BenchmarkUserAgesTotal/total_of_100_user_ages-16 18262779 62.41 ns/op
BenchmarkUserAgesTotal/total_of_1000_user_ages-16 2155155 563.4 ns/op
BenchmarkUserAgesTotal/total_of_10000_user_ages-16 208611 5576 ns/op
BenchmarkPUserAgesTotal/total_of_100_parallel_user_ages-16 22664948 46.06 ns/op
BenchmarkPUserAgesTotal/total_of_1000_parallel_user_ages-16 2982039 404.4 ns/op
BenchmarkPUserAgesTotal/total_of_10000_parallel_user_ages-16 275511 4006 ns/op
Interestingly, increasing it to 8 does not change the numbers:
BenchmarkUserAgesTotal/total_of_100_user_ages-16 18419624 63.52 ns/op
BenchmarkUserAgesTotal/total_of_1000_user_ages-16 2171192 547.5 ns/op
BenchmarkUserAgesTotal/total_of_10000_user_ages-16 203426 5555 ns/op
BenchmarkPUserAgesTotal/total_of_100_parallel_user_ages-16 24264589 44.73 ns/op
BenchmarkPUserAgesTotal/total_of_1000_parallel_user_ages-16 3030728 394.2 ns/op
BenchmarkPUserAgesTotal/total_of_10000_parallel_user_ages-16 279097 3935 ns/op
which demonstrates the effect of User struct alignment. Once we exceed the alignment padding by increasing the metadata size to 16, we start losing more performance:
BenchmarkUserAgesTotal/total_of_100_user_ages-16 15560678 73.25 ns/op
BenchmarkUserAgesTotal/total_of_1000_user_ages-16 1772716 680.8 ns/op
BenchmarkUserAgesTotal/total_of_10000_user_ages-16 125956 9366 ns/op
BenchmarkPUserAgesTotal/total_of_100_parallel_user_ages-16 24811290 44.93 ns/op
BenchmarkPUserAgesTotal/total_of_1000_parallel_user_ages-16 3168794 384.9 ns/op
BenchmarkPUserAgesTotal/total_of_10000_parallel_user_ages-16 312606 3906 ns/op
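If you want to check the layout on your own platform rather than take the padding argument on faith, a quick sketch (my addition, assuming it lives in the same package as the User definition) prints the size and alignment of the struct:
package main

import (
    "fmt"
    "unsafe"
)

func main() {
    var u User
    // Sizeof includes any padding the compiler inserts;
    // Alignof reports the struct's required alignment.
    fmt.Println("size:", unsafe.Sizeof(u), "align:", unsafe.Alignof(u))
}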
Setting the metadata size to 64 makes each User larger than a typical 64-byte cache line, so every age read pulls in a separate line; the result is extreme cache inefficiency and roughly 5x worse performance:
BenchmarkUserAgesTotal/total_of_100_user_ages-16 5372842 229.0 ns/op
BenchmarkUserAgesTotal/total_of_1000_user_ages-16 514650 2535 ns/op
BenchmarkUserAgesTotal/total_of_10000_user_ages-16 40646 27880 ns/op
BenchmarkPUserAgesTotal/total_of_100_parallel_user_ages-16 24846136 46.26 ns/op
BenchmarkPUserAgesTotal/total_of_1000_parallel_user_ages-16 3052599 410.1 ns/op
BenchmarkPUserAgesTotal/total_of_10000_parallel_user_ages-16 276385 3979 ns/op
Since there are multiple layers of caches, the performance drop does not stop there: a further increase of the metadata size to 256 results in a whopping ~16x degradation:
BenchmarkUserAgesTotal/total_of_100_user_ages-16 1760745 677.2 ns/op
BenchmarkUserAgesTotal/total_of_1000_user_ages-16 136510 7997 ns/op
BenchmarkUserAgesTotal/total_of_10000_user_ages-16 13818 88085 ns/op
BenchmarkPUserAgesTotal/total_of_100_parallel_user_ages-16 26034332 46.46 ns/op
BenchmarkPUserAgesTotal/total_of_1000_parallel_user_ages-16 3055123 398.5 ns/op
BenchmarkPUserAgesTotal/total_of_10000_parallel_user_ages-16 291650 3976 ns/op
So we have a clear winner here - parallel arrays deliver on their promise. Unfortunately, Go's compiler does not auto-vectorize loops with SIMD instructions, so we cannot measure the effect of vectorization here. In one of the next posts we'll use Rust to see what kind of wins we can expect from parallel arrays backed by SIMD instructions.
To be continued...