Tag Archives: Performance

Excessive SYS CPU with NodeJS 20 on Linux

I am running a system: a collection of about 20 Node.js processes on a single machine. The processes do some disk I/O and communicate with each other over HTTP. Much of the code is almost 10 years old, and the system first ran on Node 0.12. I can run the system on many different machines, and I have automated tests as well.

The problem demonstrated for idle system using top

I will now illustrate the problem of excessive SYS CPU load under Node 20.10.0 compared to Node 18 on an idle system, using top.

TEST (production-identical cloud VPS, Debian 11.8)

Here the system running on Node 18 has been idling for a little while.

top - 12:44:46 up 3 days, 23:21,  4 users,  load average: 0.02, 0.44, 0.35
Tasks: 109 total,   1 running, 108 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.5 us,  0.7 sy,  0.0 ni, 96.4 id,  0.1 wa,  0.0 hi,  0.3 si,  0.1 st
MiB Mem :   3910.9 total,    948.2 free,   1484.8 used,   1478.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   2166.2 avail Mem

Upgrading to Node.js 20.10.0 and letting the system idle a while gives:

top - 12:54:20 up 3 days, 23:30,  2 users,  load average: 0.79, 1.74, 1.16
Tasks: 108 total,   3 running, 105 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.3 us, 20.4 sy,  0.0 ni, 76.0 id,  0.0 wa,  0.0 hi,  1.3 si,  0.0 st
MiB Mem :   3910.9 total,    809.8 free,   1316.8 used,   1784.3 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   2347.7 avail Mem

As you can see, the SYS CPU load is massive under Node 20 (20.4 sy versus 0.7 sy).

RPI v2, Raspbian 12.1

Here the system running on Node 18 has been idling on an RPi2 for more than 15 minutes.

top - 12:38:36 up 42 min,  2 users,  load average: 0.13, 0.11, 0.63
Tasks: 133 total,   2 running, 131 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.2 us,  1.2 sy,  0.0 ni, 95.6 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :    971.9 total,    436.0 free,    324.3 used,    263.3 buff/cache    
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used.    647.6 avail Mem

This is a very underpowered machine, but it is doing OK.

Upgrading to Node.js 20.10.0 and letting the machine idle gives:

top - 12:55:09 up 59 min,  2 users,  load average: 0.56, 1.38, 1.32
Tasks: 139 total,   1 running, 138 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.3 us, 12.6 sy,  0.0 ni, 82.7 id,  0.3 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :    971.9 total,    429.5 free,    327.9 used,    266.5 buff/cache    
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used.    644.0 avail Mem

Again, quite a massive increase in SYS CPU load (12.6 sy versus 1.2 sy).

The problem demonstrated using integration tests and “time”

On the same TEST system as above, I run my integration tests on Node 18:

$ node --version
v18.13.0
$ time ./tools/local.sh integrationtest ALL -v | tail -n 1
Bad:0 Void:0 Skipped:8 Good:1543 (1551)

real 0m27.277s
user 0m17.751s
sys 0m4.251s

Changing to Node 20.10.0 instead gives:

$ node --version
v20.10.0
$ time ./tools/local.sh integrationtest ALL -v | tail -n 1
Bad:0 Void:0 Skipped:8 Good:1542 (1551)

real	0m56.958s
user	0m12.875s
sys	0m36.931s

As you can see, SYS CPU load increased dramatically.

Affected Node versions

There is never a problem with Node.js 18 or lower.

Current Node.js 20.10.0 shows the problem (on some hosts).

My tests (on one particular host) indicate that the excessive SYS CPU load was introduced with Node.js 20.3.1. The problem is still there with Node 21.

There is an interesting open GitHub issue.

Affected hosts

I can reproduce the problem on some computers with some configurations. Successful reproduction means that Node 18 runs fine and Node 20.10.0 runs with excessive SYS CPU load.

Hosts where problem is reproduced (Node 20 runs with excessive SYS CPU load)

  1. Raspberry Pi 2, Raspbian 12.1
  2. Intel NUC i5 4250U, Debian 12.1
  3. Cloud VPS, Glesys.com, System container VPS, x64, Debian 11.8

Hosts where problem is not reproduced (Node 20 runs just fine)

  1. Apple M1 Pro, macOS
  2. Dell XPS, 8th gen i7, Windows 11
  3. Raspberry Pi 2, Raspbian 11.8
  4. QNAP Container Station LXD, Celeron J1900, Debian 11.8
  5. QNAP Container Station LXD, Celeron J1900, Debian 12.4

Comments to this

On the RPi, upgrading from Raspbian 11.8 to 12.1 activated the problem.
On the QNAP LXD, neither Debian 11.8 nor 12.4 shows the problem.

Thus we have Debian 11.8 hosts that exhibit both behaviours, and we have Debian 12 hosts that exhibit both behaviours.

Conclusion

This problem seems quite serious.

It affects recent versions of Debian in combination with Node 20+.

I have seen no problems on macOS or Windows.

I have not tested any Linux distribution other than Debian (and Raspbian).

Solution

It seems this is a kernel bug related to io_uring, at least according to the Node.js/libuv people. That is consistent with my findings above about affected machines.

There is a workaround for Node.js:

UV_USE_IO_URING=0

It appears to be intentionally undocumented, which I interpret to mean that it will be removed from Node.js once no common kernels are affected.
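To apply the workaround, set the variable in the environment of the Node.js process before it starts, for example (app.js is just a placeholder for your own entry point):

$ UV_USE_IO_URING=0 node app.js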

I will stay away from Node.js 20, at least in production, for a year and see how this develops.

Improving performance with mitigations=off

I became aware that Spectre and Meltdown kernel mitigations could be turned off in Linux. I decided to give it a try.

DISCLAIMER & WARNING
You are making your system vulnerable to known types of attacks for marginal performance gains. I do not suggest or recommend it.

I am not explaining what the vulnerabilities are, and in what cases it would make sense to leave them open. My CPU is (selected lines from lscpu):

$ lscpu
  Vendor ID:             GenuineIntel
  Model name:            Intel(R) Core(TM) i7-8809G CPU @ 3.10GHz

Vulnerabilities:         
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushe
                         s, SMT vulnerable
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Mitigation; IBRS
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer
                          sanitization
  Spectre v2:            Mitigation; IBRS, IBPB conditional, RSB filling, PBRSB-
                         eIBRS Not affected
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Not affected

Deactivate Mitigations

This computer is running Fedora 37, booting with EFI and Grub2. I used to know LILO; updating Grub was very easy once I knew how to do it. Obviously a restart is required.

# == To disable mitigations, making system vulnerable ==
# grubby --update-kernel=ALL --args="mitigations=off"
# grubby --info=ALL
# grub2-mkconfig -o /boot/grub2/grub.cfg

# == To enable mitigations, making system safe ==
# grubby --update-kernel=ALL --remove-args="mitigations=off"
# grubby --info=ALL
# grub2-mkconfig -o /boot/grub2/grub.cfg

After turning mitigations off, this is the pretty output from lscpu:

Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX vulnerable
Vulnerability Mds:               Vulnerable; SMT vulnerable
Vulnerability Meltdown:          Vulnerable
Vulnerability Mmio stale data:   Vulnerable
Vulnerability Retbleed:          Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Vulnerability Srbds:             Vulnerable
Vulnerability Tsx async abort:   Not affected

Benchmark

Rather than using some synthetic test I decided to use my most common heavy workload: a “precommit” script that I run in my software project before committing code to git. When all goes well, it looks like this:

$ /usr/bin/time ./tools/precommit.sh
PRE COMMIT (5x: integrationtests, tests, htmllint, eslint, pkgjson)...
         pkgjson ---- Bad:0 Void:0 Skipped:0 Good:111 (111)
            html ---- Bad:0 Void:0 Skipped:0 Good:257 (257)
              es ---- Bad:0 Void:0 Skipped:0 Good:2 (2)
 integrationtest ---- Bad:0 Void:0 Skipped:7 Good:1487 (1494)
            test ---- Bad:0 Void:0 Skipped:0 Good:2622 (2622)
41.08user 4.37system 0:22.63elapsed 200%CPU (0avgtext+0avgdata 196640maxresident)k

This is a mix of different types of load. Mostly it is Node.js running JavaScript code. In the beginning the five test categories run in parallel, but as the first checks complete, fewer things are running. The tests also contain some I/O, some waiting for I/O, some actual requests to services on the internet, and things like that. So there is a lower limit to how fast it can run, regardless of CPU performance.

I ran the benchmark several times to make sure everything was cached; below are approximate averages (this benchmark is not entirely stable from run to run):

                                     Elapsed time  User time  System time  % CPU
i7-8809G                             25s           41s        4.3s         190%
i7-8809G mitigations=off             22s           38s        3.5s         190%
i7-8809G hyperthreading off (BIOS)   29s           32s        3.9s         120%
i5-4250U                             34s           89s        9s           290%
i5-4250U mitigations=off             33s           89s        8s           290%
Apple M1 Pro (10 Cores)              23s           18s        3.6s         n/a

Turning mitigations off gives about a 10% improvement in elapsed time for this real-world workload. That is something (I have seen others report more like a 1% difference in gaming).

I found it interesting that my M1 Pro had about the same elapsed time, despite having more cores (10 vs 4/8) and much lower total user time (roughly half). I draw the conclusion that not many cores are used in parallel. That made it interesting to turn off hyperthreading (on a safe configuration), but that turned out to be quite bad for performance.

I also tested on an older NUC, finding basically no improvement at all with mitigations=off.

I will leave my computers safe.

Stable Diffusion CPU-only

I spent much time trying to install Stable Diffusion on an Intel NUC Hades Canyon with Core i7 (8th Generation) and an AMD RX Vega (4GB), with no success. 4GB is tricky. AMD is trickier.

I gave up on my NUC and installed on my laptop with Windows, GeForce GTX 1650. That worked, and a typical image (512×512 and 20 samples) takes about 3 minutes to generate.

For practical reasons I wanted to run Stable Diffusion on my Linux NUC anyway, so I decided to give a CPU-only version of Stable Diffusion a try (stable-diffusion-cpuonly). It was a pretty easy install, and to my surprise generation is basically as fast as on my GeForce GTX 1650. I have 16GB of RAM, and that works fine for 512×512. I think 8GB would be too little, and as usual, resolutions lower than 512×512 generate very bad output for me.

So when you read that “Stable Diffusion requires an Nvidia GPU with at least 4GB of RAM”, know that for simple hobby purposes any computer with 16GB of RAM will be fine.

Simple Loops in JavaScript

I let SonarQube inspect my JavaScript code and it had opinions about my loops. I learnt about the for-of-loop. Let us see what we have.

The four loop constructions below are, for most practical purposes, the same.

  // good old for-loop
  for ( let i=0 ; i<array.length ; i++ ) {
    const x = array[i];
    ...
  }

  // for-in-loop
  for ( const i in array ) {
    const x = array[i];
    ...
  }

  // for-of-loop
  for ( const x of array ) {
    ...
  }

  // forEach
  array.forEach((x) => {
     ...
  });

Well, if they are all practically the same, why bother? Why not pick one for all cases? Well, in the details they are different when it comes to:

  • simplicity to write / verboseness
  • performance
  • flexibility and explicitness

Let's discuss the loops.

The good old for-loop

The good old for-loop requires you to write the name of the array twice, and you need to explicitly increment the loop variable and compare it to the length of the array. This is very easy, but it is possible to make silly mistakes.

In many/most cases it is unnecessarily explicit and verbose. However, as soon as you want to do things like:

  • skip the first, or any other, element
  • access several items in each iteration (most commonly adjacent items: 0-1, 1-2, 2-3, 3-4, 4-5)
  • break / continue
  • modify the array – even the length of it – during the loop
  • handle sparse arrays, where it is obvious what you get (undefined)

these things become very natural with the good old loop. Doing them with the others will make the code appear a bit contrived, or the result may not be so obviously correct.
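For example, here is a minimal sketch (my own example) of comparing adjacent items, which is natural with the good old for-loop but contrived with the others:

  // compare each pair of adjacent items (0-1, 1-2, 2-3, ...)
  for ( let i=1 ; i<array.length ; i++ ) {
    if ( array[i-1] > array[i] ) {
      console.log('not sorted at index ' + i);
    }
  }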

There is also something very explicit about the order. It may be true (or not?) that every implementation of JavaScript will always execute the other three loops in order, but you need to know that to be absolutely sure when reading the code. Not so with the good old for-loop. If order is a critical part of the algorithm, you may want to be explicit about it.

This is also the fastest loop.

The for-in-loop

for-in enumerates properties and loops over them. Do not use it for arrays:

  • for-in makes more sense for an Object, so a reader of the code may assume your array is actually an object
  • are you 100% sure your array has no other enumerable properties, ever?
  • performance – this is by far the slowest loop
  • it is quite verbose

The for-of-loop

The for-of-loop is a bit “newer” and may not work in old browsers or JavaScript engines. That can be a reason to avoid it, and even more a reason why you do not see it in code you read.

I would argue this is the most practical, clean and simple loop, that should be used in most cases.

It is slightly slower than the good old for-loop, but faster than the other alternatives.

Array.forEach

I have been ranting about functional style code elsewhere. forEach is kind of an antipattern, because it is a functional construction that does nothing without a side-effect. A functional way to do something non-functional.

The callback function gets not ONE argument (as shown above) but actually three arguments: the value, the index and the array itself. If you pass some standard function into forEach, that can give you very strange results if the standard function happens to accept more than one argument and you did not know or think about it.
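The classic illustration of this pitfall uses map(), which passes the same three arguments to its callback:

  // parseInt accepts an optional second argument (the radix), so it
  // silently receives the index from map():
  // parseInt('10',0), parseInt('10',1), parseInt('10',2)
  ['10','10','10'].map(parseInt); // [10, NaN, 2], not [10, 10, 10]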

You get both index and array, so you can do horrible things like:

  array.forEach((current,i,array) => {
    const last = array[i-1];
    ..
  });

I have seen worse. Don’t do it. Functional programming is about being clear about your intentions. Use a good old for loop, or write your own higher-order-loop-function if you do the above thing often.

According to the documentation, forEach loops in order. JavaScript is single-threaded. But other languages may parallelize things like forEach, so I think the right way to think about forEach is that order should not matter. The best uses for forEach are things like:

  ['gif','jpg','png'].forEach(registerImageFormat);
  players.forEach(updatePosition);

forEach is slower than the good old for-loop and for-of.

Sparse Arrays

I made an experiment with a sparse (and worse) array:

  const array = ['first'];
  array[2] = 'last';
  array.x = 'off-side';
 
  let r = 'for';
  for ( let i=0 ; i<array.length ; i++ ) {
    r += ':' + array[i];
  }
  console.log(r);
 
  r = 'for-in';
  for ( const i in array ) {
    r += ':' + array[i];
  }
  console.log(r);

  r = 'for-of';
  for ( const x of array ) {
    r += ':' + x;
  }
  console.log(r);
 
  r = 'forEach';
  array.forEach((x) => {
    r += ':' + x;
  });
  console.log(r);

The output of this program is:

  for:first:undefined:last
  for-in:first:last:off-side
  for-of:first:undefined:last
  forEach:first:last

If this surprises you, think about how you code and what loops you use.

Performance

For a rather simple loop body, here are some benchmarks:

~160 M loops                         MacBook Air 2014   RPi v2 900MHz   i7-8809G       i7-8809G
                                     node 14.16.0       node 14.15.3    node 12.18.3   node 14.16.0
for ( i=0 ; i<array.length ; i++ )        280ms             3300ms          200ms          180ms
for ( i of array )                        440ms             6500ms          470ms          340ms
for ( i in array )                       6100ms            74000ms         5900ms         4100ms
array.forEach                             560ms            10400ms          480ms          470ms

On one hand, a few milliseconds for a million loops may not mean anything. On the other hand, that could be a few milliseconds of extra latency or UI refresh delay.

JavaScript: Fast Numeric String Testing

Sometimes I have strings that (should) contain numbers (like ‘31415’) but I want or need to test them before I use them. If this happens in a loop I could start asking myself questions about performance. And if it is a long loop on a Node.js server, the performance may actually matter.

For the purpose of this post I have worked with positives (1,2,3,…), and I have written code that finds the largest valid positive in an array. Let's say there are a few obvious options:

// Parse it and test it
const nv = +nc;
pos = Number.isInteger(nv) && 0 < nv;

// A regular expression
pos = /^[1-9][0-9]*$/.test(nc);

// A custom function
const strIsPositive = (x) => {
  if ( 'string' !== typeof x || '' === x ) return false;
  const min = 48; // char code of '0'
  const max = 57; // char code of '9'
  let   cc  = x.charCodeAt(0);
  if ( cc <= min || max < cc ) return false; // first char must be 1-9
  for ( let i=1 ; i<x.length ; i++ ) {
    cc = x.charCodeAt(i);
    if ( cc < min || max < cc ) return false; // remaining chars must be 0-9
  }
  return true;
}
pos = strIsPositive(nc);

Well, I wrote some benchmark code and ran it in Node.js, and there are some quite predictable findings.

There is no huge difference between the alternatives above, but there are differences (about 1ms per 10000 validations, on a 4th generation i5).

There is no silver bullet; the optimal solution depends on your data:

If all you want is validation, it is wasteful to convert (+nc). A regular expression is faster, but you can easily beat a regular expression with a simple loop.

If most numbers are valid, converting to number (+nc) makes more sense. It is expensive to parse invalid values to NaN.

If you are going to use the number, converting to number (+nc) makes sense (if you convert only once).

The fastest solution, both for valid and invalid numbers, is to never convert to number (but use the custom function above to validate) and find the max using string compare.

if ( strIsPositive(nc) &&
     ( ( max.length < nc.length ) || ( max.length === nc.length && max < nc ) ) )
  max = nc;

This is obviously not generally good advice.

Other numeric formats

My findings above are for strings containing positives. I have tested both code that only validates, and code that uses the value by comparing it.

You may not have positives but:

  • Naturals, including 0, which creates a nastier regular expression but an easier loop (see the sketch after this list)
  • Integers, including negative values, which creates even nastier regular expressions
  • Ranged integers, like [-256,255], which probably means you want to parse (+nc) right away
  • Decimal values
  • Non-standard formats (with , instead of . for the decimal point, or with delimiters like spaces to improve readability)
  • Hex, scientific formats, whatever
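As a sketch of the first point (my reconstruction, not benchmarked here): for naturals the regular expression must accept a single 0 but still reject leading zeros, while the loop no longer needs to special-case the first character:

// Naturals: '0' is valid, but '007' is not
pos = /^(0|[1-9][0-9]*)$/.test(nc);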

In the end readability is usually more important than performance.

Stress testing with Raspberry Pi

I have a system – a micro service architecture platform – built on Node.js. It can run on a single computer or distributed. It is quite a small system, but it is quite critical that it works correctly.

Under what circumstances would the system fail to work correctly? How much load can it handle? How does it behave under too heavy load?

Stress testing is difficult, and expensive. Ideally you have plenty of test clients simulating realistic usage. It can be done, but often not easily. A simple and cheap option is to run the system on fewer resources.

My system used to run perfectly on a Raspberry Pi. The tests work fine. I have also kept the integration tests working (although there have been timing issues). However, the other day I tried to restore production data to the Raspberry Pi, and it failed to run properly. The problems were:

  • High latency and timeouts
  • Heavy swapping
  • Escalating retries, making the situation worse

The last point is particularly interesting. Error handling is designed for stability and recovery, but it risks increasing the total load, making the system even more unstable.

I did make the system work on an RPi again, and in doing so I learnt about real problems, and fixed them. It is an interesting exercise in finding problems in a system that does not work properly, and it is a practical way to “measure first, optimize second”.

Does your system work, with a reasonable amount of production data, on a Raspberry Pi?

Performance, Node.js & Sorting

In this post I will present two findings that I find strange:

  1. The performance of Node.js (V8?) has clearly gotten consistently worse with newer Node.js versions.
  2. The standard library sort (Array.prototype.sort()) is surprisingly slow, often slower than a simple textbook mergesort.

My findings in this article are based on running a simple program mergesort.js on different computers and different node versions.

You may also want to read this article about sorting in Node.js. It applies to V8 version 7.0, which should be used in Node.js V11.

The sorting algorithms

There are three sorting algorithms compared.

  1. Array.prototype.sort()
  2. mergesort(), a textbook mergesort
  3. mergesort_opt(), a mergesort that I put some effort into making faster

Note that mergesort is considered stable and not as fast as quicksort. As far as I understand from the above article, Node.js used to use quicksort (up to V10), and from V11 uses something better called Timsort.

My mergesort implementations (2) (3) are plain standard JavaScript. Nothing fancy whatsoever (I will post benchmarks using Node.js v0.12 below).
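To show what “textbook” means here, a minimal sketch of a top-down mergesort (not the exact mergesort.js code referenced above):

function mergesort(arr, cmp) {
  if ( arr.length < 2 ) return arr;
  const mid   = arr.length >> 1;
  const left  = mergesort(arr.slice(0, mid), cmp);
  const right = mergesort(arr.slice(mid), cmp);
  const out   = [];
  let i = 0, j = 0;
  while ( i < left.length && j < right.length ) {
    // <= keeps the sort stable: equal elements keep their original order
    out.push(cmp(left[i], right[j]) <= 0 ? left[i++] : right[j++]);
  }
  while ( i < left.length )  out.push(left[i++]);
  while ( j < right.length ) out.push(right[j++]);
  return out;
}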

The data to be sorted

There are three types of data to be sorted.

  1. Numbers (Math.random()), compared with a-b
  2. Strings (random numbers converted to strings), compared with the default compare function for sort(), and for my mergesorts with simple a<b / a>b compares giving -1, 1 or 0
  3. Objects, containing two random numbers a=[0-9], b=[0-999999], compared with (a.a-b.a) || (a.b-b.b). In about one case in ten the value of b will matter; otherwise looking at the value of a is enough.

Unless otherwise written the sorted set is 100 000 elements.
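As a sketch, the three compare functions look roughly like this (my reconstruction from the descriptions above):

const cmpNumbers = (a,b) => a - b;
const cmpStrings = (a,b) => a < b ? -1 : ( a > b ? 1 : 0 );
const cmpObjects = (a,b) => (a.a - b.a) || (a.b - b.b);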

On Benchmarks

Well, just a standard benchmark disclaimer: I do my best to measure and report objectively. There may be other platforms, CPUs, configurations, use cases, datatypes, or array sizes that give different results. The code is available for you to run.

I have run all tests several times and reported the best value. If anything, that should benefit the standard library (quick)sort, which can suffer from bad luck.

Comparing algorithms

Let's start with the algorithms. This is Node V10 on different machines.

(ms)       ===== Numbers =====   ===== Strings =====   ===== Objects =====
           sort() merge  m-opt   sort() merge  m-opt   sort() merge  m-opt
NUC i7         82    82     61      110    81     54       95    66     50
NUC i5        113   105    100      191   130     89      149    97     72
NUC Clrn      296   209    190      335   250    196      287   189    157
RPi v3       1886  1463   1205     2218  1711   1096     1802  1370    903
RPi v2        968  1330   1073     1781  1379    904     1218  1154    703

The RPi-v2-sort()-Numbers value stands out. It's not a typo. But apart from that, I think the pattern is quite clear: regardless of datatype and on different processors, the standard sort() simply cannot match a textbook mergesort implemented in JavaScript.

Comparing Node Versions

Let's compare different node versions. This is on a NUC with an Intel i5 CPU (4th gen), running a 64-bit version of Ubuntu.

(ms)       ===== Numbers =====   ===== Strings =====   ===== Objects =====
           sort() merge  m-opt   sort() merge  m-opt   sort() merge  m-opt
v11.13.0       84   107     96      143   117     90      140    97     71
v10.15.3      109   106     99      181   132     89      147    97     71
v8.9.1         85   103     96      160   133     86      122    99     70
v6.12.0        68    76     88      126    92     82       68    83     63
v4.8.6         51    66     89      133    93     83       45    77     62
v0.12.9        58    65     78      114    92     87       55    71     60

Not only is sort() getting slower; running “any” JavaScript is also slower. I have noticed this before. Can someone explain why this makes sense?

Comparing different array sizes

With the same NUC, Node V10, I try a few different array sizes:

(ms)       ===== Numbers =====   ===== Strings =====   ===== Objects =====
           sort() merge  m-opt   sort() merge  m-opt   sort() merge  m-opt
10 000         10     9     11        8    12      6        4     7      4
15 000          8    15      7       13    14     11        6    22      7
25 000         15    35     12       40    27     15       11    25     18
50 000         35    56     34       66    57     37       51    52     30
100 000       115   107     97      192   138     88      164   101     72
500 000       601   714    658     1015   712    670      698   589    558

Admittedly, the smaller arrays show less difference, but it is also hard to measure small values with precision. So this is from the RPi v3 and smaller arrays:

(ms)       ===== Numbers =====   ===== Strings =====   ===== Objects =====
           sort() merge  m-opt   sort() merge  m-opt   sort() merge  m-opt
5 000          34    57     30       46    59     33       29    52     26
10 000         75   129     64      100   130     74       63   104     58
20 000        162   318    151      401   290    166      142   241    132
40 000        378   579    337      863   623    391      344   538    316

Again, quite consistently, this looks remarkably bad for the standard library sort.

Testing throughput (Version 2)

I decided to measure throughput rather than time to sort (mergesort2.js). I thought perhaps the figures above are misleading when it comes to the cost of garbage collecting. So the new question is: how many shorter arrays (n=5000) can be sorted in 10s?

(count)    ===== Numbers =====   ===== Strings =====   ===== Objects =====
           sort() merge  m-opt   sort() merge  m-opt   sort() merge  m-opt
v11.13.0     3192  2538   4744     1996  1473   2167     3791  2566   4822
v10.15.3     4733  2225   4835     1914  1524   2235     4911  2571   4811
RPi v3        282   176    300      144   126    187      309   186    330

What do we make of this? Well, the collapse in performance for the new V8 Torque implementation in Node v11 is remarkable. Otherwise I notice that for Objects and Node v10, my optimized algorithm has no advantage.

I think my algorithms are heavier on the garbage collector (than the standard library sort()), and this is why they perform relatively worse over 10s in a row.

If it is so, I would still prefer to pay that price. When my code waits for sort() to finish, there is a user waiting for a GUI update or for an API reply. I would rather see a faster sort; when the update/reply is complete, there is usually plenty of idle time in which the garbage collector can run.

Optimizing Mergesort?

I had some ideas for optimizing mergesort that I tried out.

Special handling of short arrays: clearly, if you want to sort 2 elements, the entire mergesort function is heavier than a simple function that sorts two elements. The article about V8 sort indicates that V8 uses insertion sort for arrays up to length 10 (I find this very strange). So I implemented special functions for 2-3 elements. This gave nothing: same performance as calling the entire mergesort.
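A sketch of the kind of special case I tried (a hypothetical reconstruction, not my exact code):

// sort two elements in place, given a compare function
function sort2(arr, cmp) {
  if ( 0 < cmp(arr[0], arr[1]) ) {
    const tmp = arr[0];
    arr[0] = arr[1];
    arr[1] = tmp;
  }
  return arr;
}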

Less stress on the garbage collector: since my mergesort creates memory nodes that are discarded when sorting is complete, I thought I could keep those nodes for the next sort, to ease the load on the garbage collector. Very bad idea: performance dropped significantly.

Performance of cmp-function vs sort

The relevant sort functions are all K·(n log n) with different K. It is the K that I am measuring and discussing here; the differences are, after all, quite marginal. There is clearly another constant cost: the cost of the compare function. That seems to matter more than anything else. And in all the cases above, “string” is just a single string of 10 characters. If you have a more expensive compare function, the choice of sort() will matter even less.

Nevertheless, V8 is a single threaded environment and ultimately cycles wasted in sort() will result in overall worse performance. Milliseconds count.

Conclusions

Array.prototype.sort() is a critical component of the standard library. In many applications sorting may be the most expensive thing that takes place. I find it strange that it does not perform better than a simple mergesort implementation. I do not suggest you use my code, or start looking for better sort() implementations out there right away. But I think this is something for JavaScript programmers to keep in mind. However, the compare function probably matters more in most cases.

I find it strange that Node v11, with Timsort and V8 Torque, is not more of an improvement (admittedly, I didn't test that one very much).

And finally I find it strange that Node.js performance seems to deteriorate with every major release.

Am I doing anything seriously wrong?

JavaScript Double Linked List

JavaScript has two very powerful and flexible built-in data structures: [] and {}. You can program rather advanced JavaScript for years without needing anything else.

Nevertheless, I had a conversation about the possible advantages of using a linked list (instead of an array). Linked lists are not very popular; Stroustrup himself has suggested they should be avoided. But what if you mostly do push(), pop(), shift() and unshift(), and never access an item by its index? Higher order functions such as map(), reduce(), filter() and sort(), as well as iterators, should be just fine.

I decided to implement a Double Linked List in JavaScript, making it (mostly) compatible with Array, and to do some tests on it. The code, both of the DoubleLinkedList itself and of the unit tests/benchmarks, is available.

Disclaimer

This is a purely theoretical, academic and nerdy experiment. The DoubleLinkedList offers no advantages over the built-in Array, except for possible performance advantages in edge cases. The disadvantages compared to Array are:

  • Lower performance in some cases
  • Less features / limited API
  • Less tested and proven
  • An extra dependency, and possibly longer application loading time

So, my official recommendation is that you read this post and perhaps look at the code for learning purposes. But I really doubt you will use my code in production (although you are welcome to).

Benchmarks

Benchmarks are tricky. In this case there are three kinds of benchmarks:

  1. Benchmarks using array[i] to get the item at an index. This is horrible for the linked list. I wrote no such benchmarks.
  2. Benchmarks testing map(), reduce() and filter(), which I wrote, but which consistently show no relevant or interesting differences between the built-in Array and my DoubleLinkedList (my code is essentially as fast as the standard library Array code, which on one hand is impressive, and on the other hand is a reason not to use it).
  3. Benchmarks where my DoubleLinkedList does fine, mostly that heavily depends on push(), pop(), shift() and unshift().

The only thing I present below is (3). I have nothing for (1), and (2) shows nothing interesting.

The machines are, in order: a Hades Canyon i7 NUC, an old i5 NUC, a newer Celeron NUC, an Acer Chromebook R13 (with an ARMv8 CPU), a Raspberry Pi v3, and a Raspberry Pi v2. The Chromebook runs ChromeOS, the i7 runs Windows, and the rest run Linux.

My benchmarks use Math.random() to create test data. That was not very smart of me, because the variation between test runs is significant. The numbers below (milliseconds) are the median value of running each test 41 times; you can see for yourself that the values are quite consistent.

The tested algorithms

The push(), pop(), shift(), unshift() tests use the array/list as a queue and push 250k “messages” through it, keeping the queue at roughly 10k messages.
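A sketch of what such a queue test can look like (a reconstruction, not the exact benchmark code):

// push 250k messages through a queue, keeping roughly 10k in flight
const queue = [];                  // or a new DoubleLinkedList()
let sent = 0, received = 0;
while ( received < 250000 ) {
  while ( sent < 250000 && queue.length < 10000 ) {
    queue.push({ id: sent++ });    // enqueue at the tail
  }
  queue.shift();                   // dequeue from the head
  received++;
}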

The mergesort() test is a mergesort built on top of the data structures, using push()/shift().

The sort() test is the standard Array.sort(), versus a mergesort implementation for DoubleLinkedList (it has less overhead than mergesort(), since it does not create new objects for every push()).

Benchmark result

                   Node 8   ================== Node 10 ==================
(ms)                NUCi7    NUCi7   NUCi5   NUC-C     R13   RPiV3   RPiV2
unshift/pop 250k
  Array               679      649    1420    1890    5216   11121    8582
  D.L.L.                8       13      10      20      40     128     165
push/shift 250k
  Array                37       31      31      49     143     388     317
  D.L.L.               10       12      10      19      44     115     179
mergesort 50k
  Array               247      190     300     466    1122    3509    3509
  D.L.L.               81       88     121     244     526    1195    1054
sort 50k
  Array                53       55      59     143     416    1093     916
  D.L.L.               35       32      42      84     209     543     463

What do we make of this?

  • For array, push/shift is clearly faster than unshift/pop!
  • It is possible to implement a faster sort() than Array.sort() of the standard library. However, this may have little to do with my linked list (I may get an even better result if I base my implementation on Array).
  • I have seen this before with other Node.js code but not published it: the RPiV2 (ARMv7 @900MHz) is faster than the RPiV3 (ARMv8 @1200MHz).
  • I would have expected my 8th generation i7 NUC (NUC8i7HVK) to clearly outperform my older 4th generation i5 NUC (D54250WYK), but there is not much difference.

More performance findings

One thing I thought could give good performance was a case like this:

x2 = x1.map(...).filter(...).reduce(...)

where every function creates a new Array just to be destroyed very soon. I implemented mapMutate and filterMutate for my DoubleLinkedList, that reuse existing List-nodes. However, this gave very little. The cost of the temporary Arrays above seems to be practically insignificant.

However, for my DoubleLinkedList:

dll_1 = DoubleLinkedList.from( some 10000 elements )
dll_1.sort()
dll_2 = DoubleLinkedList.from( some 10000 elements )

// now:
dll_1.map(...).filter(...).reduce(...) // slower
dll_2.map(...).filter(...).reduce(...) // faster

So reusing the list nodes, which I thought would be a cost saver, turns out to produce cache misses instead: after sort(), the nodes of dll_1 are no longer traversed in allocation order.

Using the Library

If you feel like using the code, you are most welcome. The tests run with Node.js: first unit tests (quick), then benchmarks (slower). As I wrote earlier, there is some Math.random() in the tests, and on rare occasions statistically unlikely events occur, making the tests fail (I will not make this mistake again).

The code itself is just for Node.js. There are no dependencies and it will require minimal work to adapt it to any browser environment of your choice.

The code starts with a long comment specifying what is implemented. Basically, you use it just as Array, with the exceptions/limitations listed. There are many limitations, but most reasonable uses should be fairly well covered.
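A hypothetical usage sketch (assuming the API mirrors Array, as described; the require path is a placeholder):

const DoubleLinkedList = require('./DoubleLinkedList');

const list = DoubleLinkedList.from([3, 1, 2]);
list.push(4);
list.sort((a, b) => a - b); // assuming sort() takes a compare function, like Array's
console.log(list.shift()); // 1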

Conclusion

It seems to make no sense to replace Array with a linked list in JavaScript (Stroustrup was right). If you are using Array as a queue, be aware that push/shift is much faster than unshift/pop. It would surprise me if push/pop were not much faster than unshift/shift for a stack.

Nevertheless, if you have a (large) queue/list/stack and all you do is push, pop, shift, unshift, map, filter, reduce and sort, go ahead.

There is also a concatMutate in my DoubleLinkedList. That one is very cheap, and if you for some reason do array.concat(array1, array2, array3) very often, perhaps a linked list is your choice.

It should come as no surprise, but I was still surprised, that sort() (mergesort in my case) was so easy to implement on a linked list.

On RPiV2 vs RPiV3

I have written on several occasions before about how the 900MHz ARMv7 of the RPiV2 completely outperforms the 700MHz ARMv6 of the RPiV1. It is about 15 times faster, and it is not completely clear why the difference is so big (it is that big for JavaScript, not for typical C code).

The RPiV3 is not only 300MHz faster than the RPiV2, it also has a 64-bit ARMv8 CPU compared to the 32-bit ARMv7 CPU of the RPiV2. Yet the V3 delivers worse performance than the V2.

One reason could be that the RPi does not have that much RAM, and not that fast RAM either, and that the price of 64-bit is simply not worth it. For now, I have no other idea.

References

An article about sorting in V8: https://v8.dev/blog/array-sort. Very interesting read. But I tried Node version 11, which comes with V8 version 7, and the difference was… marginal at best.

Best way to write compare-functions

The workhorse of many (JavaScript) programs is sort(). When you want to sort objects (or numbers, actually) you need to supply a compare function. Compare functions are nice because they are very testable and reusable, but sorting is also a bit expensive (perhaps the most expensive thing your program does), so you want them fast.

For the rest of this article I will assume we are sorting some Order objects based on status, date and time (all strings).

The naive way to write this is:

function compareOrders1(a,b) {
  if ( a.status < b.status ) return -1;
  if ( a.status > b.status ) return 1;
  if ( a.date   < b.date   ) return -1;
  if ( a.date   > b.date   ) return 1;
  if ( a.time   < b.time   ) return -1;
  if ( a.time   > b.time   ) return 1;
  return 0;
}

There are some things about this that are just not appealing: it is too verbose, there is a risk of a typo, and it is not too easy to read.

Another option follows:

function cmpStrings(a,b) {
  if ( a < b ) return -1;
  if ( a > b ) return 1;
  return 0;
}

function compareOrders2(a,b) {
  return cmpStrings(a.status, b.status)
      || cmpStrings(a.date,   b.date)
      || cmpStrings(a.time,   b.time);
}

Note that the first function (cmpStrings) is highly reusable, so this is shorter code. However, there is still some repetition, so I tried:

function cmpProps(a,b,p) {
  return cmpStrings(a[p], b[p]);
}

function compareOrders3(a,b) {
  return cmpProps(a,b,'status')
      || cmpProps(a,b,'date')
      || cmpProps(a,b,'time');
}

There is something nice about not repeating status, date and time, but there is something not so appealing about quoting them as strings. If you want to go more functional you can do:

function compareOrders4(a,b) {
  function c(p) {
    return cmpStrings(a[p], b[p]);
  }
  return c('status') || c('date') || c('time');
}

To my taste, that is a bit too functional and obscure. Finally, since it comes to mind and some people may suggest it, you can concatenate strings, like:

function compareOrders5(a,b) {
  return cmpStrings(
    a.status + a.date + a.time,
    b.status + b.date + b.time
  );
}

Note that in case fields “overlap” and/or have different length, this could give unexpected results.
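A hypothetical illustration of the overlap problem:

// two different orders that #5 considers equal:
// { status: 'a',  date: 'bc', time: 'd' }  ->  'abcd'
// { status: 'ab', date: 'c',  time: 'd' }  ->  'abcd'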

Benchmarks

I tried the five different compare functions on two different machines, with slightly different parameters, and got this kind of result (i5: N=100000, ARM: N=25000).

In these tests I used few unique values of status and date to often hit the entire compare function.

(ms)    i5     i5    ARM
#1     293    354    507
#2     314    351    594
#3     447    506   1240
#4     509    541   1448
#5     866    958   2492

This is quite easy to understand. #2 does exactly what #1 does, and the function overhead is eliminated by the JIT. #3 is trickier for the JIT, since a string is used to read a property. That is true also for #4, which additionally requires a function to be generated. #5 needlessly builds two full concatenated strings, when often the first fields alone would have decided the comparison.

Conclusion & Recommendation

My conclusion is that #3 may be the best choice, despite being slightly slower. I find #2 clearly preferable to #1, and I think #4 and #5 should be avoided.

JavaScript: Sets, Objects and Arrays

JavaScript has a (well, fairly) new fancy Set data structure (which does not come with functions for union, intersection and the like, but whatever). A little while ago I tested binary search (also not in the standard library) and I was quite impressed with the performance.

When I code JavaScript I often hesitate between using an Array and an Object. And I have not started using Set much.

I decided to make some tests. Let's say we have pseudo-random natural numbers (like 10000 of them). We then want to check whether a number is among the 10000 numbers or not (whether it is a member of the set). A JavaScript Set does exactly that. A JavaScript Object just requires you to do set[314] = true and you are basically done (the key gets converted to a string, though). For an Array you just push(314), sort the array, and then use binary search to see if the value is there.

Obviously, if you often add or remove values, (re)sorting the Array will be annoying and costly. But quite often this is not the case.
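As a sketch, the three membership tests look like this (using lodash's sortedIndex for the binary search, as in the benchmark code at the end of this post):

const lodash = require('lodash');

const inObject = (obj, v) => !!obj[v];
const inSet    = (set, v) => set.has(v);
const inArray  = (arr, v) => arr[lodash.sortedIndex(arr, v)] === v; // arr must be sorted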

The test
My test consists of generating N=10000 random unique numbers (with distance 1 or 2 between them). I then insert them (in a kind of pseudo-random order) into an Array (and sort it), into an Object, and into a Set. I measure this time as the initiation time (for each data structure).

I repeat. So now I have 2xArrays, 2xObjects, 2xSets.

This way I can test both iterating and searching with all combinations of data structures (and check that the results are the same and thus correct).

Output of a single run: 100 iterations, N=10000, on a Linux Intel i5 and Node.js 8.9.1 looks like this:

                         ====== Search Structure ======
(ms)                        Array     Object      Set
     Initiate                1338        192      282
===== Iterate =====    
        Array                 800         39       93
       Object                 853        122      170
          Set                1147         82      131

By comparing columns you can compare the cost of searching (and initiating the structure before searching it). By comparing rows you can compare the cost of iterating over the different data structures (for example, iterating over Set while searching Array took 1147ms).

These results are quite consistent on this machine.

Findings
Some findings are very clear (I guess they are quite consistent across systems):

  • Putting values in an Array, sorting it, and then searching it is much slower and makes little sense compared to using an Object (or a Set)
  • Iterating an Array is a bit faster than iterating an Object or Set, so if you are never going to search, an Array is faster
  • The newer and more specialized Set offers little advantage over good old Objects

What is more unclear is why iterating over Objects is faster when searching Arrays, but iterating over Sets is faster when searching Objects or Sets. What I find is:

  • Sets seem to perform comparably to Objects on Raspberry Pi, ARMv7.
  • Sets seem to underperform more on Mac OS X

Obviously, all this is very unclear and can vary depending on CPU cache, Node version, OS and other factors.

Smaller and Larger sets
These findings hold quite well for smaller N=100 and larger N=1000000. The Array, despite being O(n log n), does not get much worse for N=1000000 than it already was for N=10000.

Conclusions and Recommendation
I think the conservative choice is to use an Array when order is important, or when you know you will not look up members by a unique id. If members have unique IDs and are not ordered, use an Object. I see no reason to use Set, especially if you target browsers (support in IE is still limited in early 2018).

The Code
Here follows the source code. Output is not quite as pretty as the table above.

var lodash = require('lodash');

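// Generate `size` unique numeric strings with increasing values, written in
// a scrambled order by stepping through the array with a prime stride.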
function randomarray(size) {
  var a = new Array(size);
  var x = 0;
  var i, r;
  var j = 0;
  var prime = 3;

  if ( 50   < size ) prime = 31;
  if ( 500  < size ) prime = 313;
  if ( 5000 < size ) prime = 3109;

  for ( i=0 ; i<size ; i++ ) {
    r = 1 + Math.floor(2 * Math.random());
    x += r;
    a[j] = '' + x;
    j += prime;
    if ( size <= j ) j-=size;
  }
  return a;
}

var times = {
  arr : {
    make : 0,
    arr  : 0,
    obj  : 0,
    set  : 0
  },
  obj : {
    make : 0,
    arr  : 0,
    obj  : 0,
    set  : 0
  },
  set : {
    make : 0,
    arr  : 0,
    obj  : 0,
    set  : 0
  }
}

function make_array(a) {
  times.arr.make -= Date.now();
  var i;
  var r = new Array(a.length);
  for ( i=a.length-1 ; 0<=i ; i-- ) {
    r[i] = a[i];
  }
  r.sort();
  times.arr.make += Date.now();
  return r;
}

function make_object(a) {
  times.obj.make -= Date.now();
  var i;
  var r = {};
  for ( i=a.length-1 ; 0<=i ; i-- ) {
    r[a[i]] = true;
  }
  times.obj.make += Date.now();
  return r;
}

function make_set(a) {
  times.set.make -= Date.now();
  var i;
  var r = new Set();
  for ( i=a.length-1 ; 0<=i ; i-- ) {
    r.add(a[i]);
  }
  times.set.make += Date.now();
  return r;
}

function make_triplet(n) {
  var r = randomarray(n);
  return {
    arr : make_array(r),
    obj : make_object(r),
    set : make_set(r)
  };
}

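// Run all 9 combinations of iterate-structure x search-structure and
// check that they all find the same number of matches.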
function match_triplets(t1,t2) {
  var i;
  var m = [];
  m.push(match_array_array(t1.arr , t2.arr));
  m.push(match_array_object(t1.arr , t2.obj));
  m.push(match_array_set(t1.arr , t2.set));
  m.push(match_object_array(t1.obj , t2.arr));
  m.push(match_object_object(t1.obj , t2.obj));
  m.push(match_object_set(t1.obj , t2.set));
  m.push(match_set_array(t1.set , t2.arr));
  m.push(match_set_object(t1.set , t2.obj));
  m.push(match_set_set(t1.set , t2.set));
  for ( i=1 ; i<m.length ; i++ ) {
    if ( m[0] !== m[i] ) {
      console.log('m[0]=' + m[0] + ' != m[' + i + ']=' + m[i]);
    }
  }
}

function match_array_array(a1,a2) {
  times.arr.arr -= Date.now();
  var r = 0;
  var i, v;
  for ( i=a1.length-1 ; 0<=i ; i-- ) {
    v = a1[i];
    if ( v === a2[lodash.sortedIndex(a2,v)] ) r++;
  }
  times.arr.arr += Date.now();
  return r;
}

function match_array_object(a1,o2) {
  times.arr.obj -= Date.now();
  var r = 0;
  var i;
  for ( i=a1.length-1 ; 0<=i ; i-- ) {
    if ( o2[a1[i]] ) r++;
  }
  times.arr.obj += Date.now();
  return r;
}

function match_array_set(a1,s2) {
  times.arr.set -= Date.now();
  var r = 0;
  var i;
  for ( i=a1.length-1 ; 0<=i ; i-- ) {
    if ( s2.has(a1[i]) ) r++;
  }
  times.arr.set += Date.now();
  return r;
}

function match_object_array(o1,a2) {
  times.obj.arr -= Date.now();
  var r = 0;
  var v;
  for ( v in o1 ) {
    if ( v === a2[lodash.sortedIndex(a2,v)] ) r++;
  }
  times.obj.arr += Date.now();
  return r;
}

function match_object_object(o1,o2) {
  times.obj.obj -= Date.now();
  var r = 0;
  var v;
  for ( v in o1 ) {
    if ( o2[v] ) r++;
  }
  times.obj.obj += Date.now();
  return r;
}

function match_object_set(o1,s2) {
  times.obj.set -= Date.now();
  var r = 0;
  var v;
  for ( v in o1 ) {
    if ( s2.has(v) ) r++;
  }
  times.obj.set += Date.now();
  return r;
}

function match_set_array(s1,a2) {
  times.set.arr -= Date.now();
  var r = 0;
  var v;
  var iter = s1[Symbol.iterator]();
  while ( ( v = iter.next().value ) ) {
    if ( v === a2[lodash.sortedIndex(a2,v)] ) r++;
  }
  times.set.arr += Date.now();
  return r;
}

function match_set_object(s1,o2) {
  times.set.obj -= Date.now();
  var r = 0;
  var v;
  var iter = s1[Symbol.iterator]();
  while ( ( v = iter.next().value ) ) {
    if ( o2[v] ) r++;
  }
  times.set.obj += Date.now();
  return r;
}

function match_set_set(s1,s2) {
  times.set.set -= Date.now();
  var r = 0;
  var v;
  var iter = s1[Symbol.iterator]();
  while ( ( v = iter.next().value ) ) {
    if ( s2.has(v) ) r++;
  }
  times.set.set += Date.now();
  return r;
}

function main() {
  var i;
  var t1;
  var t2;

  for ( i=0 ; i<100 ; i++ ) {
    t1 = make_triplet(10000);
    t2 = make_triplet(10000);
    match_triplets(t1,t2);
    match_triplets(t2,t1);
  }

  console.log('TIME=' + JSON.stringify(times,null,4));
}

main();