It is not clear to me why Node.js is so amazingly slow on a Raspberry Pi (article 1, article 2).
Is it because of the small caches (16kb L1 + 128kb L2)? Is Node.js emitting poor code on ARM? Well, I decided to investigate the cache issue. The 128kb cache of the Raspberry Pi is supposed to be primarily used by the GPU; is it actually effective at all?
A suitable test algorithm
To understand what I test, and for the fun of it, I wanted to implement a suitable test program. I imagine a good program for cache testing would:
- be neither too slow nor too fast, so measuring execution time is practical and meaningful
- have working data sets in sizes 10kb-10Mb
- allow the same problem to be solved with different working set sizes, in such a way that the theoretical execution time is the same and any difference in measured time is due to cache only
- be reasonably simple to implement and understand, while not so trivial that the optimizer just gets rid of the problem entirely
Finally, I think it is fun if the program does something slightly meaningful.
I found that Bubblesort (and later Selectionsort) made a good problem, when combined with a little twist. Original Bubblesort:
Array to sort: G A F C B D H E ( N=8 )
Sorted array: A B C D E F G H
Theoretical cost: N²/2 = 64/2 = 32
Actual cost: 7+6+5+4+3+2+1 = 28 (compares and conditional swaps)
I invented the following cache-optimized Bubble-Twist-Sort:
Array to sort: G A F C B D H E
Sort halves using Bubblesort: A C F G B D E H
Now, the twist: ( G>B : swap )
A C F B G D E H ( D>F : swap )
A C D B G F E H ( C<E : done )
Sort halves using Bubblesort: A B C D E F G H
Theoretical cost = 16/2 + 16/2 (first two Bubblesorts)
                 +  4/2        (expected number of twist-swaps)
                 + 16/2 + 16/2 (second two Bubblesorts)
                 = 34
Actual cost: 4*(3+2+1) + 2 = 26
Anyway, for larger arrays the actual costs get very close to the theoretical ones. The idea here is that I can run a Bubblesort on 1000 elements (intensively using 1000 units of memory for ~500000 operations). But instead of doing that, I can replace it with 4 runs on 500 elements (4 × ~125000 operations, plus ~250 twist operations). So I am solving the same problem, using the same algorithm, but optimizing for smaller cache sizes.
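To make this concrete, here is a minimal sketch of what such a twist-sort could look like in C. This is my illustration, not the actual test program: the name twistsort_uint32_t is made up, it assumes an even len, and it calls the bubbelsort_uint32_t routine listed further down in this post.

/* Sketch of the Bubble-Twist-Sort idea (hypothetical name, even len assumed). */
void bubbelsort_uint32_t(uint32_t* array, size_t len); /* listed below */

void twistsort_uint32_t(uint32_t* array, size_t len) {
    size_t half = len/2;
    size_t i;
    uint32_t tmp;
    /* sort each half separately, so the working set is halved */
    bubbelsort_uint32_t(array, half);
    bubbelsort_uint32_t(array + half, half);
    /* the twist: compare the largest elements of the lower half with the
       smallest elements of the upper half, swapping until a pair is in order */
    for ( i=0 ; i<half ; i++ ) {
        if ( array[half-1-i] > array[half+i] ) {
            tmp = array[half-1-i];
            array[half-1-i] = array[half+i];
            array[half+i] = tmp;
        } else {
            break;
        }
    }
    /* every element is now in its correct half; sort the halves again */
    bubbelsort_uint32_t(array, half);
    bubbelsort_uint32_t(array + half, half);
}

After the twist, each half contains exactly the elements that belong there, so the two final half-sorts produce a fully sorted array.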
Enough of Bubblesort… you are probably either lost in details or disgusted with this horribly stupid idea of optimizing and not optimizing Bubblesort at the same time.
I also made a Selectionsort option. And for a given data size I allowed it to sort either bytes or 32-bit words (which is 16 times faster for the same data size, since there are 4 times fewer elements and the cost is quadratic: 4² = 16).
The test machines
I gathered 11 different test machines, with different cache sizes and instruction sets (cache sizes in kb unless otherwise noted):
      QNAP    wdr3600 ac20i   Rpi     Rpi 2   wdr4900 G4      Celeron Xeon    Athlon  i5
Year                                                                  ~2007   ~2010   ~2013
============================================================================================
L1    32      32      32      16      ?       32      64      32      32      128     32
L2                            128     ?       256     256     512     6M      1024    256
L3                                                    1024                            6M
Mhz   500     560     580     700     900     800     866     900     2800    3000    3100
CPU   ARMv5   Mips74K Mips24K ARMv6   ARMv7   PPC     PPC     x86     x64     x64     x64
OS    Debian  OpenWrt OpenWrt OpenWrt OpenWrt OpenWrt Debian  Ubuntu  MacOSX  Ubuntu  Windows
Note that for the multi-core machines (Xeon, Athlon, i5) the L2/L3 caches may or may not be shared between cores, so the numbers above are a little ambiguous. The sizes given are for the data cache where it is separate from the instruction cache.
The benchmarks
I ran Bubblesort for sizes from 1000000 bytes down to 1000000/512 (= 1953) bytes. For Selectionsort I just ran three sizes. For Bubblesort I also ran 2000000 and 4000000, but those times are divided by 4 and 16 respectively to be comparable. All times are in seconds.
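For reference, a timing run can be as simple as the following sketch. This is my reconstruction, not the original benchmark program; the command-line handling and the fixed seed are assumptions, and it times the 32-bit-word Bubblesort routine listed further down.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void bubbelsort_uint32_t(uint32_t* array, size_t len); /* listed below */

int main(int argc, char* argv[]) {
    size_t len = ( argc > 1 ) ? strtoul(argv[1], NULL, 10) : 1000000;
    uint32_t* data = malloc(len * sizeof(uint32_t));
    size_t i;
    struct timespec t0, t1;

    if ( !data ) return 1;
    srand(0); /* fixed seed, so every machine sorts the same data */
    for ( i=0 ; i<len ; i++ ) data[i] = rand();

    clock_gettime(CLOCK_MONOTONIC, &t0);
    bubbelsort_uint32_t(data, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* elapsed wall-clock time in seconds */
    printf("%f\n", (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec)/1e9);
    free(data);
    return 0;
}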
Bubblesort
            QNAP wdr3600   ac20i     rpi    rpi2 wdr4900      G4 Celeron    Xeon  Athlon      i5
================================================================================================
 4000000    1248    1332     997    1120     396     833             507     120     104      93
 2000000    1248    1332     994    1118     386     791     553     506     114     102      93
 1000000    1274    1330    1009    1110     367     757     492     504     113      96      93
  500000    1258    1194     959    1049     352     628     389     353      72      74      63
  250000    1219    1116     931     911     351     445     309     276      53      61      48
  125000    1174    1043     902     701     349     397     287     237      44      56      41
   62500     941     853     791     573     349     373     278     218      38      52      37
   31250     700     462     520     474     342     317     260     208      36      48      36
   15625     697     456     507     368     340     315     258     204      35      49      35
    7812     696     454     495     364     340     315     256     202      34      49      35
    3906     696     455     496     364     340     315     257     203      34      47      35
    1953     698     456     496     365     342     320     257     204      35      45      35
Selectionsort
            QNAP wdr3600   ac20i     rpi    rpi2 wdr4900      G4 Celeron    Xeon  Athlon      i5
================================================================================================
 1000000    1317     996     877    1056     446     468     296     255      30      45      19
   31250     875     354     539     559     420     206     147     245      28      40      21
    1953     874     362     520     457     422     209     149     250      30      41      23
Theoretically, all timings for a single machine should be equal. Much of the difference can be explained by cache sizes, but obviously there are more things happening here.
Findings
Mostly the data makes sense. The caches create plateaus, and the L1 size can almost be predicted from the data. I would have expected even bigger differences between best and worst cases; as it is, the range is 180%-340%. The most surprising thing is the Selectionsort results: they are sometimes a lot faster (G4, i5) and sometimes significantly slower! This is strange, and I have no explanation.
I believe the superior i5 performance on Selectionsort 1000000 is due to its cache and branch prediction.
I note that the QNAP and the Archer C20i both have DDR2 memory, while the RPi has SDRAM. This seems to make a difference when working sets get bigger.
I have also made other benchmarks where the WDR4900 was faster than the G4 – not this time.
The Raspberry Pi
What did I learn about the Raspberry Pi? Well, memory is slow and branch prediction seems bad. It is typically 10-15 times slower than the modern CPUs (Xeon, Athlon, i5). But for large Selectionsort problems the difference is up to 40x. This starts getting close to the crap speed of Node.js. It is not hard to imagine that Node.js benefits heavily from good branch prediction and large caches – both things the RPi lacks.
What about the 128kb cache? Does it work? Well, compared to the L1-only machines, the performance of the RPi perhaps degrades slightly more slowly. Not impressed.
Bubblesort vs Selectionsort
It really puzzles me that Bubblesort ever beats Selectionsort:
#include <stddef.h>
#include <stdint.h>

/* Classic Bubblesort: sweep the array repeatedly, swapping adjacent
   elements that are out of order. Note the conditional swap (a write
   to memory) inside the inner loop. */
void bubbelsort_uint32_t(uint32_t* array, size_t len) {
    size_t i, j, jm1;
    uint32_t tmp;
    for ( i=len ; i>1 ; i-- ) {
        for ( j=1 ; j<i ; j++ ) {
            jm1 = j-1;
            if ( array[jm1] > array[j] ) {
                tmp = array[jm1];
                array[jm1] = array[j];
                array[j] = tmp;
            }
        }
    }
}
/* Selectionsort: find the smallest remaining element in the inner loop,
   then swap it into place just once per pass. */
void selectionsort_uint32_t(uint32_t* array, size_t len) {
    size_t i, j, best;
    uint32_t tmp;
    for ( i=1 ; i<len ; i++ ) {
        best = i-1;
        for ( j=i ; j<len ; j++ ) {
            if ( array[best] > array[j] ) {
                best = j;
            }
        }
        tmp = array[i-1];
        array[i-1] = array[best];
        array[best] = tmp;
    }
}
Essentially, the difference is that the swap takes place outside the inner loop (once per pass) instead of all the time. Selectionsort should also be able to benefit from easier branch prediction and much fewer writes to memory. Perhaps inspecting the compiled assembly code would reveal something odd going on.
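For the curious, gcc can dump the generated assembly with a standard invocation, for example (assuming the two functions live in a file sort.c; -S stops after code generation and leaves the assembly in sort.s):

gcc -O2 -S -o sort.s sort.c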
Power of 2 aligned data sets
I avoided using a data size that is an exact power of two: 1000×1000 rather than 1024×1024. I did this because caches are supposed to work better that way. Perhaps I will make some 1024×1024 runs some day.