Excessive SYS CPU with NodeJS 20 on Linux

I am running a system, a collection of about 20 Node.js-processes on a single machine. Those processes do some I/O to disk and they communicate with each other using HTTP. Much of the code is almost 10 years old and this system first ran on Node 0.12. I can run the system on many different machines and I have automated tests as well.

The problem demonstrated for idle system using top

I will now illustrate the problem of excessive SYS CPU load under Node 20.10.0 compared to Node 18 on an idle system, using top.

TEST (production identical cloud VPS, Debian 11.8)

Here the system running on Node 18 has been idling for a little while.

top - 12:44:46 up 3 days, 23:21,  4 users,  load average: 0.02, 0.44, 0.35
Tasks: 109 total,   1 running, 108 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.5 us,  0.7 sy,  0.0 ni, 96.4 id,  0.1 wa,  0.0 hi,  0.3 si,  0.1 st
MiB Mem :   3910.9 total,    948.2 free,   1484.8 used,   1478.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   2166.2 avail Mem

Upgrading to Node.js 20.10.0 and letting the system idle a while gives:

top - 12:54:20 up 3 days, 23:30,  2 users,  load average: 0.79, 1.74, 1.16
Tasks: 108 total,   3 running, 105 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.3 us, 20.4 sy,  0.0 ni, 76.0 id,  0.0 wa,  0.0 hi,  1.3 si,  0.0 st
MiB Mem :   3910.9 total,    809.8 free,   1316.8 used,   1784.3 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   2347.7 avail Mem

As you can see, the SYS CPU load is massive under Node 20.

RPI v2, Raspbian 12.1

Here the system running on Node 18 has been idling on a RPi2 for more than 15 minutes.

top - 12:38:36 up 42 min,  2 users,  load average: 0.13, 0.11, 0.63
Tasks: 133 total,   2 running, 131 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.2 us,  1.2 sy,  0.0 ni, 95.6 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :    971.9 total,    436.0 free,    324.3 used,    263.3 buff/cache    
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used.    647.6 avail Mem

This is a very under powered machine, but it is ok.

Upgrading to Node.js 20.10.0 and letting the machine idle gives:

top - 12:55:09 up 59 min,  2 users,  load average: 0.56, 1.38, 1.32
Tasks: 139 total,   1 running, 138 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.3 us, 12.6 sy,  0.0 ni, 82.7 id,  0.3 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :    971.9 total,    429.5 free,    327.9 used,    266.5 buff/cache    
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used.    644.0 avail Mem

Again, a quite massive increase in SYS CPU load.

The problem demonstrated using integration tests and “time”

On the same TEST system as above, I run my integration tests on Node 18:

$ node --version
v18.13.0
$ time ./tools/local.sh integrationtest ALL -v | tail -n 1
Bad:0 Void:0 Skipped:8 Good:1543 (1551)

real 0m27.277s
user 0m17.751s
sys 0m4.251s

Changing to Node 20.10.0 instead gives:

$ node --version
v20.10.0
$ time ./tools/local.sh integrationtest ALL -v | tail -n 1
Bad:0 Void:0 Skipped:8 Good:1542 (1551)

real	0m56.958s
user	0m12.875s
sys	0m36.931s

As you can see, SYS CPU load increased dramatically.

Affected Node versions

There is never a problem with Node.js 18 or lower.

Current Node.js 20.10.0 shows the problem (on some hosts).

My tests (on one particular host) indicate that the excessive SYS CPU load was introduced with Node.js 20.3.1. The problem is still there with Node 21.

There is an interesting open Github issue.

Affected hosts

I can reproduce the problem on some computers with some configurations. Successful reproduction means that Node 18 runs fine and Node 20.10.0 runs with excessive SYS CPU load.

Hosts where problem is reproduced (Node 20 runs with excessive SYS CPU load)

  1. Raspberry Pi 2, Raspbian 12.1
  2. Intel NUC i5 4250U, Debian 12.1
  3. Cloud VPS, Glesys.com, System container VPS, x64, Debian 11.8

Host where problem is not reproduced (Node 20 runs just fine)

  1. Apple M1 Pro, macOS
  2. Dell XPS, 8th gen i7, Windows 11
  3. Raspberry Pi 2, Raspbian 11.8
  4. QNAP Container Station LXD, Celeron J1900, Debian 11.8
  5. QNAP Container Station LXD, Celeron J1900, Debian 12.4

Comments to this

On the RPi, upgrading from 11.8 to 12.1 activated the problem.
On QNAP LXD, both 11.8 and 12.4 do not show the problem.

Thus we have Debian 11.8 hosts that exhibit both behaviours, and we have Debian 12 hosts that exhibit both behaviours.

Conclusion

This problem seems quite serious.

It affects recent versions of Debian in combination with Node 20+.

I have seen no problems on macOS or Windows.

I have tested no other Linux distributions than Debian (Raspbian).

Solution

It it seems this is a kernel bug with io_uring, at least according to Node.js/Libuv people. That is consistent with my findings above about affected machines.

There is a workaround for Node.js:

UV_USE_IO_URING=0

It appears to be intentionally undocumented, which I interpret as it will be removed from Node.js when no common kernels are affected.

I will stay away from Node.js 20, at least in prodution, for a year and see how this develops.

Leave a Comment


NOTE - You can use these HTML tags and attributes:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

Time limit is exhausted. Please reload CAPTCHA.

This site uses Akismet to reduce spam. Learn how your comment data is processed.