Tag Archives: Node.js

Excessive SYS CPU with NodeJS 20 on Linux

I am running a system, a collection of about 20 Node.js processes on a single machine. The processes do some disk I/O and communicate with each other over HTTP. Much of the code is almost 10 years old; the system first ran on Node 0.12. I can run the system on many different machines, and I have automated tests as well.

The problem demonstrated for idle system using top

I will now illustrate the problem of excessive SYS CPU load under Node 20.10.0 compared to Node 18 on an idle system, using top.

TEST (production-identical cloud VPS, Debian 11.8)

Here the system running on Node 18 has been idling for a little while.

top - 12:44:46 up 3 days, 23:21,  4 users,  load average: 0.02, 0.44, 0.35
Tasks: 109 total,   1 running, 108 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.5 us,  0.7 sy,  0.0 ni, 96.4 id,  0.1 wa,  0.0 hi,  0.3 si,  0.1 st
MiB Mem :   3910.9 total,    948.2 free,   1484.8 used,   1478.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   2166.2 avail Mem

Upgrading to Node.js 20.10.0 and letting the system idle a while gives:

top - 12:54:20 up 3 days, 23:30,  2 users,  load average: 0.79, 1.74, 1.16
Tasks: 108 total,   3 running, 105 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.3 us, 20.4 sy,  0.0 ni, 76.0 id,  0.0 wa,  0.0 hi,  1.3 si,  0.0 st
MiB Mem :   3910.9 total,    809.8 free,   1316.8 used,   1784.3 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   2347.7 avail Mem

As you can see, the SYS CPU load is massive under Node 20.

RPI v2, Raspbian 12.1

Here the system running on Node 18 has been idling on a RPi2 for more than 15 minutes.

top - 12:38:36 up 42 min,  2 users,  load average: 0.13, 0.11, 0.63
Tasks: 133 total,   2 running, 131 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.2 us,  1.2 sy,  0.0 ni, 95.6 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :    971.9 total,    436.0 free,    324.3 used,    263.3 buff/cache    
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used.    647.6 avail Mem

This is a very underpowered machine, but it is OK.

Upgrading to Node.js 20.10.0 and letting the machine idle gives:

top - 12:55:09 up 59 min,  2 users,  load average: 0.56, 1.38, 1.32
Tasks: 139 total,   1 running, 138 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.3 us, 12.6 sy,  0.0 ni, 82.7 id,  0.3 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :    971.9 total,    429.5 free,    327.9 used,    266.5 buff/cache    
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used.    644.0 avail Mem

Again, a quite massive increase in SYS CPU load.

The problem demonstrated using integration tests and “time”

On the same TEST system as above, I run my integration tests on Node 18:

$ node --version
v18.13.0
$ time ./tools/local.sh integrationtest ALL -v | tail -n 1
Bad:0 Void:0 Skipped:8 Good:1543 (1551)

real 0m27.277s
user 0m17.751s
sys 0m4.251s

Changing to Node 20.10.0 instead gives:

$ node --version
v20.10.0
$ time ./tools/local.sh integrationtest ALL -v | tail -n 1
Bad:0 Void:0 Skipped:8 Good:1542 (1551)

real	0m56.958s
user	0m12.875s
sys	0m36.931s

As you can see, SYS CPU load increased dramatically.

Affected Node versions

There is never a problem with Node.js 18 or lower.

Current Node.js 20.10.0 shows the problem (on some hosts).

My tests (on one particular host) indicate that the excessive SYS CPU load was introduced with Node.js 20.3.1. The problem is still there with Node 21.

There is an interesting open GitHub issue.

Affected hosts

I can reproduce the problem on some computers with some configurations. Successful reproduction means that Node 18 runs fine and Node 20.10.0 runs with excessive SYS CPU load.

Hosts where the problem is reproduced (Node 20 runs with excessive SYS CPU load)

  1. Raspberry Pi 2, Raspbian 12.1
  2. Intel NUC i5 4250U, Debian 12.1
  3. Cloud VPS, Glesys.com, System container VPS, x64, Debian 11.8

Hosts where the problem is not reproduced (Node 20 runs just fine)

  1. Apple M1 Pro, macOS
  2. Dell XPS, 8th gen i7, Windows 11
  3. Raspberry Pi 2, Raspbian 11.8
  4. QNAP Container Station LXD, Celeron J1900, Debian 11.8
  5. QNAP Container Station LXD, Celeron J1900, Debian 12.4

Comments on this

On the RPi, upgrading from Raspbian 11.8 to 12.1 activated the problem.
On QNAP LXD, neither Debian 11.8 nor 12.4 shows the problem.

Thus we have Debian 11.8 hosts that exhibit both behaviours, and we have Debian 12 hosts that exhibit both behaviours.

Conclusion

This problem seems quite serious.

It affects recent versions of Debian in combination with Node 20+.

I have seen no problems on macOS or Windows.

I have not tested any Linux distribution other than Debian (and Raspbian).

Solution

It seems this is a kernel bug in io_uring, at least according to the Node.js/libuv people. That is consistent with my findings above about which machines are affected.

There is a workaround for Node.js:

UV_USE_IO_URING=0

It appears to be intentionally undocumented, which I interpret to mean it will be removed from Node.js once no common kernels are affected.
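For reference, this is how I apply the workaround when launching the processes; setting the variable once in the parent shell is enough, since child processes inherit it:

```shell
# Disable libuv's io_uring backend for this shell and all child processes.
export UV_USE_IO_URING=0

# Every node process started from here now falls back to the classic epoll path.
echo "UV_USE_IO_URING=$UV_USE_IO_URING"
```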

I will stay away from Node.js 20, at least in production, for a year and see how this develops.

Functional Programming is Slow – revisited

I have written before about Functional Programming from a rather negative standpoint (it sucks, it is slow). Those posts still have some readers, but they are a few years old, and I wanted to do some new benchmarks.

Please note:

  • This is written with JavaScript (and Node.js) in mind. I don’t know if these findings apply to other programming languages.
  • Performance is important, but it is far from everything. Apart from performance, there are both good and bad aspects of Functional Programming.

Basic Chaining

One of the most common ways to use functional programming (style) in JavaScript is chaining. It can look like this:

v = a.map(map_func).filter(filter_func).reduce(reduce_func)

In this case a is an array, and the three functions are applied in sequence to each element (except that reduce is not called for the elements that filter gets rid of). The return value of reduce (typically a single value) is stored in v.
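As a tiny runnable illustration of such a chain (the data and functions here are made up):

```javascript
// map squares each element, filter keeps the even squares, reduce sums them.
const a = [1, 2, 3, 4];
const map_func = (x) => x * x;            // 1, 4, 9, 16
const filter_func = (x) => x % 2 === 0;   // keeps 4 and 16
const reduce_func = (acc, x) => acc + x;  // 4 + 16

const v = a.map(map_func).filter(filter_func).reduce(reduce_func, 0);
console.log(v); // 20
```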

  • What is the cost of this?
  • What are the alternatives?

I decided to calculate the value of pi by

  1. evenly distribute points in the unit square from [0,0] to [1,1]
  2. for each point calculate the (squared) distance to the origin (a simple map)
  3. get rid of each point beyond distance 1.0 (a simple filter)
  4. count the number of remaining points (a simple reduce, although in this simple case it would be enough to check the length of the array)

The map, filter and reduce functions look like this:

const pi_map_f = (xy) => {
  return xy.x * xy.x + xy.y * xy.y;
};
const pi_filter_f = (xxyy) => {
  return xxyy <= 1.0;
};
const pi_reduce_f = (acc /* ,xxyy */) => {
  return 1 + acc;
};

In chained functional code this looks like:

const pi_higherorder = (pts) => {
  return 4.0
       * pts.map(pi_map_f)
            .filter(pi_filter_f)
            .reduce(pi_reduce_f,0)
       / pts.length;
};

I could use the same three functions in a regular loop:

const pi_funcs = (pts) => {
  let i,v;
  let inside = 0;
  for ( i=0 ; i<pts.length ; i++ ) {
    v = pi_map_f(pts[i]);
    if ( pi_filter_f(v) ) inside = pi_reduce_f(inside,v);
  }
  return 4.0 * inside / pts.length;
};

I could also write everything in a single loop and function:

const pi_iterate = (pts) => {
  let i,p;
  let inside = 0;
  for ( i=0 ; i<pts.length ; i++ ) {
    p = pts[i];
    if ( p.x * p.x + p.y*p.y <= 1.0 ) inside++;
  }
  return 4.0 * inside / pts.length;
};

What about performance? Here are some results from a Celeron J1900 CPU and Node.js 14.15.0:

 Points    Iterate (ms)    Funcs (ms)    Higher Order (ms)    Pi
 10k             8               8              10            3.1428
 40k             3               4              19            3.1419
 160k            3               3              47            3.141575
 360k            6               6             196            3.141711
 640k           11              11             404            3.141625
 1000k          17              17             559            3.141676
 1440k          25              25            1160            3.14160278

There are some obvious observations to make:

  • Adding more points does not necessarily give a better result (160k seems to be best, so far)
  • All tests are run in a single program, waiting 250ms between each test (to let GC and the optimizer run). Obviously it took until after the 10k run for the Node.js optimizer to get things fully optimized (40k is faster than 10k).
  • The cost of writing and calling named functions is zero: Iterate and Funcs are practically identical.
  • The cost of chaining (creating arrays that are used only once) is significant.

Obviously, whether this has any practical significance depends on how large the arrays you loop over are, and how often you loop over them. But let's assume 100k is a practical size for your program (that is, for example, 100 events per day for three years). We are then talking about wasting 20-30ms every time we run a common map-filter-reduce-style loop. Is that much?

  • If it happens server side or client side, in a way that affects user latency or UI refresh time, it is significant (especially since this loop is perhaps not the only thing you do)
  • If it happens server side, and often, this chaining choice will start eating up a significant part of your server-side CPU time

You may have a faster CPU or a smaller problem. But the key point here is that you waste a significant number of CPU cycles simply by choosing to write pi_higherorder rather than pi_funcs.

Different Node Versions

Here is the same thing, executed with different versions of node.

 Node version (1000k)    Iterate (ms)    Funcs (ms)    Higher Order (ms)
 8.17.0                       11             11             635
 10.23.0                      11             11             612
 12.19.0                      11             11             805
 14.15.0                      18             19             583
 15.1.0                       17             19             556

A few findings and comments on this:

  • Different Node versions show rather different performance
  • Although these results are stable on my machine, they may not be valid for a different CPU or a different problem size (for 1440k points, Node version 8 is the fastest).
  • As I have noted before, functional code gets faster, and iterative code slower, with newer versions of Node.

Conclusion

My conclusions are quite consistent with what I have found before.

  • Writing small, NAMED, testable, reusable, pure functions is good programming, and good functional programming. As you can see above, the overhead of calling a function in Node.js is practically zero.
  • Chaining – or other functional programming practices that are heavy on memory and garbage collection – is expensive
  • Higher order functions (map, filter, reduce, and so on) are great when
    1. you have a named, testable, reusable function
    2. you actually need the result, not just for using once and throwing away
  • Anonymous functions fed directly into higher order functions have no advantages whatsoever (read here)
  • Code using higher order functions is often harder to
    1. debug, because you can’t just put debug output in the middle of it
    2. refactor, because you can’t just insert code in the middle
    3. use for more complex algorithms, because you are stuck with the primitive higher order functions, and sometimes they don’t easily allow you to do what you need

Feature Wish

JavaScript is hardly an optimal programming language for functional programming. One thing I miss is truly pure functions (functions with no side effects – especially no mutation of input data).

I have often seen people change input data in a map.
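A made-up but typical example of what that looks like; the map callback quietly rewrites the very objects it was handed:

```javascript
// Anti-pattern: the callback passed to map() mutates its input as a side effect.
const pts = [{ x: 2, y: 3 }];

const dists = pts.map((p) => {
  p.x = p.x * p.x;   // overwrites the caller's point object
  p.y = p.y * p.y;
  return p.x + p.y;
});

console.log(dists[0]);  // 13
console.log(pts[0].x);  // 4 (the input array was silently changed)
```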

I believe (not being a JavaScript engine expert) that if Node.js knew that the functions passed to map, filter and reduce above were truly pure, it would allow for aggressive optimizations, and the Higher Order scenario could be made as fast as the others. As it is now, however, Node.js cannot get rid of the temporary arrays (created by map and filter), because of possible side effects (not present in my code).

I tried to write what Node.js could make of the code, if it knew it was pure:

const pi_allinone_f = (acc,xy) => {
  return acc + ( ( xy.x * xy.x + xy.y * xy.y <= 1.0 ) ? 1 : 0);
};

const pi_allinone = (pts) => {
  return 4.0
       * pts.reduce(pi_allinone_f,0)
       / pts.length;
};

However, this code is still 4-5 times slower than the regular loop.

All the code

Here is all the code, if you want to run it yourself.

const points = (n) => {
  const ret = [];
  const start = 0.5 / n;
  const step = 1.0 / n;
  let x, y;
  for ( x=start ; x<1.0 ; x+=step ) {
    for ( y=start ; y<1.0 ; y+=step ) {
      ret.push({ x:x, y:y });
    }
  }
  return ret;
};

const pi_map_f = (xy) => {
  return xy.x * xy.x + xy.y * xy.y;
};
const pi_filter_f = (xxyy) => {
  return xxyy <= 1.0;
};
const pi_reduce_f = (acc /* ,xxyy */) => {
  return 1 + acc;
};
const pi_allinone_f = (acc,xy) => {
  return acc + ( ( xy.x * xy.x + xy.y * xy.y <= 1.0 ) ? 1 : 0);
};

const pi_iterate = (pts) => {
  let i,p;
  let inside = 0;
  for ( i=0 ; i<pts.length ; i++ ) {
    p = pts[i];
    if ( p.x * p.x + p.y*p.y <= 1.0 ) inside++;
  }
  return 4.0 * inside / pts.length;
};

const pi_funcs = (pts) => {
  let i,v;
  let inside = 0;
  for ( i=0 ; i<pts.length ; i++ ) {
    v = pi_map_f(pts[i]);
    if ( pi_filter_f(v) ) inside = pi_reduce_f(inside,v);
  }
  return 4.0 * inside / pts.length;
};

const pi_allinone = (pts) => {
  return 4.0
       * pts.reduce(pi_allinone_f,0)
       / pts.length;
};

const pi_higherorder = (pts) => {
  return 4.0
       * pts.map(pi_map_f).filter(pi_filter_f).reduce(pi_reduce_f,0)
       / pts.length;
};

const pad = (s) => {
  let r = '' + s;
  while ( r.length < 14 ) r = ' ' + r;
  return r;
}

const funcs = {
  higherorder : pi_higherorder,
  allinone : pi_allinone,
  functions : pi_funcs,
  iterate : pi_iterate
};

const test = (pts,func) => {
  const start = Date.now();
  const pi = funcs[func](pts);
  const ms = Date.now() - start;
  console.log(pad(func) + pad(pts.length) + pad(ms) + 'ms ' + pi);
};

const test_r = (pts,fs,done) => {
  if ( 0 === fs.length ) return done();
  setTimeout(() => {
    test(pts,fs.shift());
    test_r(pts,fs,done);
  }, 1000);
};

const tests = (ns,done) => {
  if ( 0 === ns.length ) return done();
  const fs = Object.keys(funcs);
  const pts = points(ns.shift());
  test_r(pts,fs,() => {
    tests(ns,done);
  });
};

const main = (args) => {
  tests(args,() => {
    console.log('done');
  });
};

main([10,100,200,400,600,800,1000,1200]);

Webpack: the shortest tutorial

So, you have some JavaScript that requires other JavaScript using require, and you want to pack all the files into one. Install webpack:

$ npm install webpack webpack-cli

These are my files (a main file with two dependencies):

$ cat main.js 

var libAdd = require('./libAdd.js');
var libMult = require('./libMult.js');

console.log('1+2x2=' + libAdd.calc(1, libMult.calc(2,2)));


$ cat libAdd.js 

exports.calc = (a,b) => { return a + b; };


$ cat libMult.js 

exports.calc = (a,b) => { return a * b; };

To pack this

$ ./node_modules/webpack-cli/bin/cli.js --mode=none main.js
Hash: 639616969f77db2f336a
Version: webpack 4.26.0
Time: 180ms
Built at: 11/21/2018 7:22:44 PM
  Asset      Size  Chunks             Chunk Names
main.js  3.93 KiB       0  [emitted]  main
Entrypoint main = main.js
[0] ./main.js 141 bytes {0} [built]
[1] ./libAdd.js 45 bytes {0} [built]
[2] ./libMult.js 45 bytes {0} [built]

and I have my bundle in dist/main.js. This bundle works just like original main:

$ node main.js 
1+2x2=5
$ node dist/main.js 
1+2x2=5

That is all I need to know about Webpack!

Background
I like the old way of building web applications: including every script with a src tag. However, occasionally I want to use code I don’t write myself, and more and more often it comes in a format that I cannot easily include with a src tag. Webpack is a/the way to turn it into “just” a JavaScript file that I can do what I want with.

Node.js 6 on OpenWrt

I have managed to produce a working Node.js 6 binary for OpenWrt and RPi (brcm2708/brcm2709).

Binaries

15.05.1: brcm2708 6.9.5
15.05.1: brcm2709 6.9.5
15.05.1: mvebu 6.9.5 Please test (on WRT1x00AC router) and get back to me with feedback
15.05.1: x86 6.9.5 Please test and get back to me with feedback

Note: all the binaries work with equal performance on RPi v2 (brcm2709). For practical purposes the brcm2708 may be the only binary needed.

How to build 6.9.5 brcm2708/brcm2709
The procedure is:

  1. Set PATH and STAGING_DIR
  2. Set a few compiler flags and run configure with not so few options
  3. Fix nearbyint/nearbyintf
  4. Fix config.gypi
  5. make

1. I have a little script to set my toolchain variables.

# file:  env-15.05.1-brcm2709.sh
# usage: $ source ./env-15.05.1-brcm2709.sh

PATH=/path/to/staging_dir/bin:$PATH
export PATH

STAGING_DIR=/path/to/staging_dir
export STAGING_DIR

Your path should now contain arm-openwrt-linux-uclibcgnueabi-g++ and other binaries.

2. (brcm2709 / mvebu) I have another script to run configure:

#!/bin/sh -e

#Tools
export CSTOOLS="$STAGING_DIR"
export CSTOOLS_INC=${CSTOOLS}/include
export CSTOOLS_LIB=${CSTOOLS}/lib

export CC="arm-openwrt-linux-uclibcgnueabi-gcc"
export CXX="arm-openwrt-linux-uclibcgnueabi-g++"
export LD="arm-openwrt-linux-uclibcgnueabi-ld"

export CFLAGS="-isystem${CSTOOLS_INC} -mfloat-abi=softfp"
export CPPFLAGS="-isystem${CSTOOLS_INC} -mfloat-abi=softfp"

export PATH="${CSTOOLS}/bin:$PATH"

./configure --without-snapshot --dest-cpu=arm --dest-os=linux --without-npm --without-ssl --without-intl --without-inspector

bash --norc

Please note that this script was the first one that worked. It may not be the best; some things may not be needed. --without-intl and --without-inspector helped me avoid build errors. If you need those features you have more work to do.

2. (brcm2708)

#!/bin/sh -e

#Tools
export CSTOOLS="$STAGING_DIR"
export CSTOOLS_INC=${CSTOOLS}/include
export CSTOOLS_LIB=${CSTOOLS}/lib

export CC="arm-openwrt-linux-uclibcgnueabi-gcc"
export CXX="arm-openwrt-linux-uclibcgnueabi-g++"
export LD="arm-openwrt-linux-uclibcgnueabi-ld"

export CFLAGS="-isystem${CSTOOLS_INC} -march=armv6j -mfloat-abi=softfp"
export CPPFLAGS="-isystem${CSTOOLS_INC} -march=armv6j -mfloat-abi=softfp"

export PATH="${CSTOOLS}/bin:$PATH"

./configure --without-snapshot --dest-cpu=arm --dest-os=linux --without-npm --without-ssl --without-intl --without-inspector

bash --norc

3. Use “grep -nR nearbyint” to find and replace:

  nearbyint => round
  nearbyintf => roundf

This may not be a good idea! However, nearbyint(f) is not supported in OpenWrt, and with the above replacements Node.js builds and passes the octane benchmark – so it is not that broken. I suppose there is a correct way to replace nearbyint(f).
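Step 3 can be scripted with sed. Since 'nearbyintf' contains 'nearbyint' as a prefix, a single substitution handles both names. Demonstrated here on a scratch file; on the real tree you would feed it the files that grep finds (the deps/v8 path is an assumption):

```shell
# For the real source tree, something like:
#   grep -rl nearbyint deps/v8 | xargs sed -i 's/nearbyint/round/g'
printf 'a = nearbyint(x);\nb = nearbyintf(y);\n' > /tmp/nearbyint_demo.c
sed -i 's/nearbyint/round/g' /tmp/nearbyint_demo.c
cat /tmp/nearbyint_demo.c
# prints:
#   a = round(x);
#   b = roundf(y);
```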

4. Add to config.gypi:

{ 'target_defaults': {
    'cflags': [ '-D__STDC_LIMIT_MACROS' ,'-D__STDC_CONSTANT_MACROS'],
    'ldflags': [ '-Wl,-rpath,/path/to/staging_dir/lib/' ]},

These are just compilation error workarounds.

This works for me.

Dependencies
You need to install dependencies in OpenWrt:

# opkg update
# opkg install librt
# opkg install libstdcpp

Performance
My initial tests indicate that Node.js v6 is a little (~2%) slower than Node.js 4 on ARM v7 (RPi v2).

Other targets
mvebu: I will build a binary, but I need help to test
x86/x86_64: This shall be easy, but I see little need/use. Let me know if you want a binary.
mpc85xx: The chip is quite capable, but the PowerPC port of Node.js will most likely never support it.

Most MIPS architectures lack FPU and are truly unsuitable for running Node.js.

std::snprintf
It seems the OpenWrt C++ std library does not support std::snprintf. You can replace it with just snprintf and add #include <stdio.h> in the file:
deps/v8_inspector/third_party/v8_inspector/platform/inspector_protocol/String16_cpp.template
However, this is not needed when --without-inspector is applied.

Node.js 6.12.2
I have failed to build Node.js 6.12.2 on x86 due to some openssl error.

Node.js 7
I have failed to build Node.js 7 before, but perhaps I will give it another try now that Node.js 6 is working.

Older versions of Node.js
I have previously built and distributed Node.js 4 for OpenWrt.

Node.js 4 on OpenWrt

Update 2017-02-27: I have built Node.js 6 for OpenWRT.
Update 2017-02-20: I migrated the files from DropBox since their public shares will stop working.
Update 2017-02-20: Updated binaries for OpenWRT 15.05.1 and Node.js 4.7.3.

Node.js has merged with io.js, and after Node.js 0.12.7 came version 4.0.0.

Well, the good news is that V8 seems to be completely and officially supported on Raspberry Pi (ARMv6+VFPv2) again (it has been a little in and out).

I intend to build and benchmark Node.js for different possible (and impossible) OpenWRT targets, and share a few binaries.

Binaries

Target Binaries Comments
14.07: brcm2708 4.0.0
15.05: x86 4.1.0
15.05: brcm2708 4.1.0 also works for brcm2709 Raspberry Pi 2
15.05: brcm2709 4.1.2
15.05: mvebu 4.1.2 Not Tested! Please test, run octane-benchmark, and let me know!
15.05: ramips/mt7620 0.10.40
r47168: ramips/mt7620 4.1.2 requires kernel FPU emulation (get custom built r47168)
15.05.1: brcm2708 4.4.5, 4.7.3
15.05.1: brcm2709 4.4.5, 4.7.3
15.05.1: mvebu 4.7.3 Not Tested! Please test, run octane-benchmark, and let me know!

You need to install dependencies:

# opkg update
# opkg install librt
# opkg install libstdcpp

Benchmarks
Octane (1.0.0) Benchmark:

Target        System             CPU        Score      Time
brcm2708      Raspberry Pi v1    700Mhz       97.1     2496s
brcm2708      Raspberry Pi v2    900Mhz     1325        198s
brcm2709      Raspberry Pi v2    900Mhz     1298        198s
x86           Eee701             900Mhz     2559        118s
mt7620        Archer C20i        ( 64 MB RAM not enough )

Performance has been very consistent through different versions of OpenWRT and Node.js.

Building x86
With the 15.05 toolchain, this script configured Node.js 4.1.0

#!/bin/sh -e

export CSTOOLS="$STAGING_DIR"
export CSTOOLS_INC=${CSTOOLS}/include
export CSTOOLS_LIB=${CSTOOLS}/lib

export CC="i486-openwrt-linux-uclibc-gcc"
export CXX="i486-openwrt-linux-uclibc-g++"
export LD="i486-openwrt-linux-uclibc-ld"

export CFLAGS="-isystem${CSTOOLS_INC}"
export CPPFLAGS="-isystem${CSTOOLS_INC}"

export PATH="${CSTOOLS}/bin:$PATH"

./configure --without-snapshot --dest-cpu=x86 --dest-os=linux --without-npm

bash --norc

Then just run make, and wait.

Building brcm2708 (Raspberry Pi v1)
I configured
– Node.js 4.0.0 with 14.07 toolchain,
– Node.js 4.1.0 with 15.05 toolchain,
– Node.js 4.4.5 with 15.05.1 toolchain
with the following script:

#!/bin/sh -e

export CSTOOLS="$STAGING_DIR"
export CSTOOLS_INC=${CSTOOLS}/include
export CSTOOLS_LIB=${CSTOOLS}/lib

export CC="arm-openwrt-linux-uclibcgnueabi-gcc"
export CXX="arm-openwrt-linux-uclibcgnueabi-g++"
export LD="arm-openwrt-linux-uclibcgnueabi-ld"

export CFLAGS="-isystem${CSTOOLS_INC} -march=armv6j -mfloat-abi=softfp"
export CPPFLAGS="-isystem${CSTOOLS_INC} -march=armv6j -mfloat-abi=softfp"

export PATH="${CSTOOLS}/bin:$PATH"

./configure --without-snapshot --dest-cpu=arm --dest-os=linux --without-npm

bash --norc

Then just run make, and wait.

Building brcm2709 (Raspberry Pi v2)
I configured Node.js 4.1.2 with 15.05 toolchain and 4.4.5 with 15.05.1 toolchain with the following script:

#!/bin/sh -e

#Tools
export CSTOOLS="$STAGING_DIR"
export CSTOOLS_INC=${CSTOOLS}/include
export CSTOOLS_LIB=${CSTOOLS}/lib

export CC="arm-openwrt-linux-uclibcgnueabi-gcc"
export CXX="arm-openwrt-linux-uclibcgnueabi-g++"
export LD="arm-openwrt-linux-uclibcgnueabi-ld"

export CFLAGS="-isystem${CSTOOLS_INC} -mfloat-abi=softfp"
export CPPFLAGS="-isystem${CSTOOLS_INC} -mfloat-abi=softfp"

export PATH="${CSTOOLS}/bin:$PATH"

./configure --without-snapshot --dest-cpu=arm --dest-os=linux --without-npm

bash --norc

Building ramips/mt7620 (Archer C20i)
For Ramips mt7620, Node.js 0.10.40 runs on standard 15.05 and I have posted build instructions for 0.10.38/40 before.

For Node.js 4, you need kernel FPU emulation (which is normally disabled in OpenWRT). The following script configures Node.js 4 for trunk (r47168, to be DD).

#!/bin/sh -e

export CSTOOLS="$STAGING_DIR"
export CSTOOLS_INC=${CSTOOLS}/include
export CSTOOLS_LIB=${CSTOOLS}/lib

export CC="mipsel-openwrt-linux-musl-gcc"
export CXX="mipsel-openwrt-linux-musl-g++"
export LD="mipsel-openwrt-linux-musl-ld"

export CFLAGS="-isystem${CSTOOLS_INC}"
export CPPFLAGS="-isystem${CSTOOLS_INC}"

export PATH="${CSTOOLS}/bin:$PATH"

./configure --without-snapshot --dest-cpu=mipsel --dest-os=linux --without-npm --with-mips-float-abi=soft
bash --norc

Without FPU emulation you will get ‘Illegal Instruction’ and Node.js will not run.

ar71xx (TP-Link WDR3600)
Without a custom built FPU-emulator-enabled kernel, a WDR3600 gives:

root@wdr3600-1505-std:/tmp# ./node 
Illegal instruction

However, with FPU enabled:

root@wdr3600-1505-fpu:/tmp# ./node 
undefined:1



SyntaxError: Unexpected end of input
    at Object.parse (native)
    at Function.startup.processConfig (node.js:265:27)
    at startup (node.js:33:13)
    at node.js:963:3

Same result for 4.1.2 and 4.2.2. That is as far as I have got with ar71xx at the moment (20151115).

Building Node.js for OpenWrt (mipsel)

Update 2015-10-11: See separate post for Node version 4 for different OpenWrt targets. Information about v4 added below.

I managed to build (and run) Node.js for OpenWrt on my Archer C20i, which has a MIPS 24K little-endian CPU without an FPU (target=ramips/mt7620).

Node.js v0.10.40
First edit (set to false):

deps/v8/build/common.gypi

    54      # Similar to vfp but on MIPS.
    55      'v8_can_use_fpu_instructions%': 'false',
   
    63      # Similar to the ARM hard float ABI but on MIPS.
    64      'v8_use_mips_abi_hardfloat%': 'false',

For 15.05 I use this script to run configure:

#!/bin/sh -e

#Tools
export CSTOOLS="$STAGING_DIR"
export CSTOOLS_INC=${CSTOOLS}/include
export CSTOOLS_LIB=${CSTOOLS}/lib

export CC="mipsel-openwrt-linux-uclibc-gcc"
export CXX="mipsel-openwrt-linux-uclibc-g++"
export LD="mipsel-openwrt-linux-uclibc-ld"

export CFLAGS="-isystem${CSTOOLS_INC}"
export CPPFLAGS="-isystem${CSTOOLS_INC}"

export PATH="${CSTOOLS}/bin:$PATH"

./configure --without-snapshot --dest-cpu=mipsel --dest-os=linux --without-npm

bash --norc

Then just “make”. I have uploaded the compiled node binary to DropBox.

Compilation for (DD) trunk (with musl rather than uclibc) fails for v0.10.40.

Node.js v4.1.2
Node.js v4 does not run without a FPU. Normally Linux emulates an FPU if it is not present, but this feature is disabled in OpenWRT. I built and published r47168 with FPU emulation and Node v4.1.2.

Node.js 4.1.2 is configured like:

#!/bin/sh -e

#Tools
export CSTOOLS="$STAGING_DIR"
export CSTOOLS_INC=${CSTOOLS}/include
export CSTOOLS_LIB=${CSTOOLS}/lib

export CC="mipsel-openwrt-linux-musl-gcc"
export CXX="mipsel-openwrt-linux-musl-g++"
export LD="mipsel-openwrt-linux-musl-ld"

export CFLAGS="-isystem${CSTOOLS_INC}"
export CPPFLAGS="-isystem${CSTOOLS_INC}"

export PATH="${CSTOOLS}/bin:$PATH"

./configure --without-snapshot --dest-cpu=mipsel --dest-os=linux --without-npm --with-mips-float-abi=soft

bash --norc

Dependencies
In order to run the node binary on OpenWrt you need to install:

# opkg update
# opkg install librt
# opkg install libstdcpp

Performance
The 64MB of RAM in my Archer C20i is not sufficient to run the octane benchmark (even if the node binary and the benchmark are stored on a USB drive). However, I have a Mandelbrot benchmark that I can run. For the Archer C20i, timings are:

C/Soft Float                     48s
Lua                              82s
Node.js v0.10.40 (soft float)    65s
Node.js v4.1.2 (FPU emulation)  444s (63s user, 381s kernel)

Clearly, the OpenWrt developers have a good reason to leave FPU emulation out. However, for Node.js in the future, FPU emulation seems to be the only way. My Mandelbrot benchmark is of course ridiculously dependent on FPU performance. For more normal usage, perhaps the penalty is less significant.

Other MIPS?
The only other MIPS I have had the opportunity to try was my WDR3600, a Big Endian 74K. It does not work:

  • v0.10.38 does not build at all (big-endian MIPS seems unsupported)
  • v0.12.* builds, but it does not run (floating point exceptions), even though I managed to build for soft float.

I need to try rebuilding OpenWRT with FPU emulation for ar71xx, then perhaps Node.js v4 will work.

Node.js performance of Raspberry Pi 1 sucks

In several previous posts I have studied the performance of the Raspberry Pi (version 1) and Node.js to find out why the Raspberry Pi underperforms so badly when running Node.js.

The first two posts indicate that the Raspberry Pi underperforms about 10x compared to an x86/x64 machine, after compensating for clock frequency. The small cache size of the Raspberry Pi is often mentioned as a cause of its poor performance. In the third post I examine that; it is not that horribly bad: about 3x worse performance for big memory needs compared to in-cache situations. It appears the slow SDRAM of the RPi is more of a problem than the small cache itself.

The Benchmark Program
I wanted to relate the Node.js slowdown to some other scripted language. I decided Lua is nice. And I was lucky to find Mandelbrot implementations in several languages!

I modified the program(s) slightly, increasing the resolution from 80 to 160. I also made a version that did almost nothing (MAX_ITERATIONS=1) so I could measure and subtract the startup cost (which is significant for Node.js) from the actual benchmark values.

The Numbers
Below are the average of three runs (minus the average of three 1-iteration rounds), in ms. The timing values were very stable over several runs.

 (ms)                           C/Hard   C/Soft  Node.js     Lua
=================================================================
 QNAP TS-109 500MHz ARMv5                 17513    49376   39520
 TP-Link Archer C20i 560MHz MIPS          45087    65510   82450
 RPi 700MHz ARMv6 (Raspbian)       493             14660   12130
 RPi 700MHz ARMv6 (OpenWrt)        490    11040    15010   31720
 RPi2 900MHz ARMv7 (OpenWrt)       400     9130      770   29390
 Eee701 900MHz Celeron x86         295               500    7992
 3000MHz Athlon II X2 x64           56                59    1267

Notes on Hard/Soft floats:

  • Raspbian is armhf, allowing only hard floats (-mfloat-abi=hard)
  • OpenWrt is armel, allowing both hard floats (-mfloat-abi=softfp) and soft floats (-mfloat-abi=soft)
  • The QNAP has no FPU and generates a runtime error with hard floats
  • The other targets produce linkage errors with soft floats

The Node.js versions are slightly different, and so are the Lua versions. This makes no significant difference.

Findings
Calculating the Mandelbrot with the FPU is basically “free” (<0.5s). Everything else is waste and overhead.

The cost of soft float is about 10s on the RPi. The difference between Node.js on Raspbian and OpenWrt is quite small – either both use the FPU, or neither does.

Now, the interesting thing is to compare the RPi with the QNAP. For the C-program with the soft floats, the QNAP is about 1.5x slower than the RPi. This matches well with earlier benchmarks I have made (see 1st and 3rd link at top of post). If the RPi would have been using soft floats in Node.js, it would have completed in about 30 seconds (based on the QNAP 50 seconds). The only thing (I can come up with) that explains the (unusually) large difference between QNAP and RPi in this test, is that the RPi actually utilizes the FPU (both Raspbian and OpenWrt).

OpenWrt and FPU
The poor Lua performance in OpenWrt is probably due to two things:

  1. OpenWrt is compiled with -Os rather than -O2
  2. OpenWrt by default uses -mfloat-abi=soft rather than -mfloat-abi=softfp (which is essentially like hard).

It is important to notice that -mfloat-abi=softfp not only makes programs much faster, but also considerably smaller (10%), which would be valuable in OpenWrt.

Different Node.js versions and builds
I have been building Node.js many times for Raspberry Pi and OpenWrt. The above soft/softfp setting for building node does not affect performance much, but it does affect binary size. Node.js v0.10 is faster on Raspberry Pi than v0.12 (which needs some patching to build).

Lua
Apart from the un-optimized OpenWrt Lua build, Lua is consistently 20-25x slower than native code on RPi/x86/x64. So it is not the case that the small cache of the RPi, or some other limitation of its CPU, hurts interpreted languages more there than on x86/x64.

RPi ARMv6 VFPv2
While perhaps not the best FPU in the world, the VFPv2 floating point unit of the RPi ARMv6 delivers quite decent performance (slightly worse per clock cycle) compared to x86 and x64. The VFPv2 does not seem to be to blame for the poor performance of Node.js on ARM.

Conclusion and Key finding
While Node.js (V8) on x86/x64 is near native speed, on ARM it is rather near Lua speed: mostly just another interpreted language. This does not seem to be caused by any limitation or flaw in the (RPi) ARM CPU, but rather by the V8 implementation for x86/x64 being superior to that for ARM (ARMv6, at least).

JavaScript: switch options

Is the nicest solution also the fastest?

Here is a little thing I ran into that I found interesting enough to test. In JavaScript, you get a parameter (from a user, perhaps via a web service), and depending on the parameter value you call a particular function.

The first solution that comes to my mind is a switch:

function test_switch(code) {
  switch ( code ) {
  case 'Alfa':
    call_alfa();
    break;
  ...
  case 'Mike':
    call_mike();
    break;
  default:
    call_default();
  }
}

That is good if you know all the labels when you write the code. A more compact solution that allows you to dynamically add functions is to let the functions just be properties of an object:

var x1 = {
  Alfa:call_alfa,
  Bravo:call_bravo,
  Charlie:call_charlie,
...
  Mike:call_mike
};

function test_prop(code) {
  var f = x1[code];
  if ( f ) f();
  else call_default();
}

And as a variant – not really making sense in this simple example but anyway – you could loop over the properties (functions) until you find the right one:

function test_prop_loop(code) {
  var p;
  for ( p in x1 ) {
    if ( p === code ) {
      x1[p]();
      return;
    }
  }
  call_default();
}

And, since we are into loops, here is an array variant, although it does not make much sense in this simple example either:

var x2 = [
  { code:'Alfa'     ,func:call_alfa    },
  { code:'Bravo'    ,func:call_bravo   },
  { code:'Charlie'  ,func:call_charlie },
...
  { code:'Mike'     ,func:call_mike    }
];

function test_array_loop(code) {
  var i, o;
  for ( i=0 ; i<x2.length ; i++ ) {
    o = x2[i];
    if ( o.code === code ) {
      o.func();
      return;
    }
  }
  call_default();
}

Alfa, Bravo…, Mike and default
I created exactly 13 options, labeled Alfa, Bravo, …, Mike. All the test functions accept an invalid code and fall back to a default function.

The loops should clearly get worse with more options. However, it is not obvious what the cost of more options is in the switch case.

I will make three test runs: 5 options (Alfa to Echo), 13 options (Alfa to Mike) and 14 options (Alfa to November) where the last one ends up in default. For each run, each of the 5/13/14 options will be equally frequent.
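The benchmark driver is not shown in the post; a simplified sketch of how such a loop might be driven (the names, the reduced code set and the per-call bookkeeping are my own, chosen so that dispatch dominates the work) is:

```javascript
// Hypothetical driver: calls the dispatch function under test N times,
// cycling through the codes so each option is equally frequent.
var hits = 0;
function call_alfa()    { hits++; }
function call_bravo()   { hits++; }
function call_default() { hits++; }

var x1 = { Alfa: call_alfa, Bravo: call_bravo };

function test_prop(code) {
  var f = x1[code];
  if ( f ) f();
  else call_default();
}

var codes = ['Alfa', 'Bravo', 'November']; // 'November' falls through to default
var N = 1000000;

var t0 = Date.now();
for (var i = 0; i < N; i++) {
  test_prop(codes[i % codes.length]);
}
console.log('calls:', hits, 'time:', Date.now() - t0, 'ms');
```

The same loop can wrap any of the dispatch variants above for a like-for-like comparison.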

Benchmark Results
I am benchmarking using Node.js 0.12.2 on a Raspberry Pi 1. The startup time for Node.js is 2.35 seconds, and I have subtracted that from all benchmark times. I also ran the benchmarks on a MacBook Air with Node.js 0.10.35. All benchmarks were repeated three times and the median is reported. Iteration count: 1000000.

(ms)       ======== RPi ========     ==== MacBook Air ====
              5      13      14         5      13      14
============================================================
switch     1650    1890    1930        21      28      30
prop       2240    2330    2890        22      23      37
proploop   2740    3300    3490        31      37      38
loop       2740    4740    4750        23      34      36

Conclusions
Well, most notably (and again): the RPi ARMv6 is not fast at running Node.js!

Using the simple property construction makes sense from a performance perspective, although the good old switch is also fast. The loops have no advantages. Also, the penalty for the default case is quite heavy in the simple property case; if you know the “code” is valid, the property lookup scales very nicely.

It is however a little interesting that on the ARM the loop over the object properties is faster than the loop over the array. On the x64 it is the other way around.

Variants of Simple Property Case
The following are essentially equally fast:

function test_prop(code) {
  var f = x1[code];   
  if ( f ) f();       
  else call_x();                        
}   

function test_prop(code) {
  var f = x1[code];   
  if ( 'function' === typeof f ) f();
  else call_x();                        
}   

function test_prop(code) {
  x1[code]();                          
}   

So, it does not cost much to have a safety test and a default case (just in case), but it is expensive when the default path is actually taken. This one, however:

function test_prop(code) {
  try {
    x1[code]();
  } catch(e) {
    call_x();
  }
}

comes at a cost of 5 ms on the MacBook when the catch is never used. If the catch is used (1 out of 14 calls), the run takes a full second instead of 37 ms!
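The two styles can be compared directly; the following is a minimal sketch (function names, the reduced code set and the iteration count are my own, not the original benchmark) that exercises both the guarded lookup and the try/catch variant, including the miss that triggers the catch:

```javascript
// Compare guarded dispatch against try/catch dispatch.
// x1 deliberately lacks 'November' so the miss path runs once per cycle.
var x1 = {
  Alfa: function () { return 'a'; },
  Mike: function () { return 'm'; }
};
function call_x() { return 'default'; }

function test_prop_guarded(code) {
  var f = x1[code];
  if ( f ) return f();
  return call_x();
}

function test_prop_catch(code) {
  try {
    return x1[code]();          // throws TypeError when x1[code] is undefined
  } catch (e) {
    return call_x();
  }
}

var codes = ['Alfa', 'Mike', 'November'];
['guarded', 'catch'].forEach(function (name) {
  var fn = (name === 'guarded') ? test_prop_guarded : test_prop_catch;
  var t0 = Date.now();
  for (var i = 0; i < 300000; i++) fn(codes[i % 3]);
  console.log(name + ':', Date.now() - t0, 'ms');
});
```

With one miss in three, the exception path should dominate the catch variant's runtime, as the numbers above suggest.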

Node.js Benchmark on Raspberry Pi (v1)

I have experimented a bit with Node.js and Raspberry Pi lately, and I have found the performance… surprisingly bad. So I decided to run some standard tests: benchmark-octane (v9).

Octane is essentially run like:

$ npm install benchmark-octane
$ cd node_modules/benchmark-octane
$ node run.js

The distilled result of Octane is a total run time and a score. Here are a few results:

                         OS             Node.js                   Time    Score
QNAP TS-109 500MHz       Debian        v0.10.29 (Debian)         3350s      N/A
Raspberry Pi v1 700MHz   OpenWrt BB    v0.10.35 (self built)     2267s      140
Raspberry Pi v1 700MHz   Raspbian       v0.6.19 (Raspbian)       2083s      N/A
Raspberry Pi v1 700MHz   Raspbian       v0.12.2 (self built)     2176s      104
Eee701 Celeron 900MHz    Xubuntu       v0.10.25 (Ubuntu)          171s     1655
Athlon II X2@3GHz        Xubuntu       v0.10.25 (Ubuntu)           49s     9475
MacBook Air i5@1.4GHz    Mac OS X      v0.10.35 (pkgsrc)           47s    10896
HP 2560p i7@2.7GHz       Xubuntu       v0.10.25 (Ubuntu)           41s    15450

Score N/A means that one test failed and there was no final score.

When I first saw the RPi performance I thought I had done something wrong building Node.js myself (using a cross compiler) for RPi and OpenWrt. However, Node.js on Raspbian is basically no faster, and the RPi ARMv6 with an FPU is not much faster than the QNAP ARMv5 without one.

I think the Eee701 serves as a good baseline here. At first glance, possible reasons for the RPi underperforming relative to the Celeron are:

  • Smaller cache (16 KB of L1 cache, with the L2 cache reserved for the GPU, I have read) compared to the Celeron (512 KB)
  • A bad or not well utilised FPU (but the RPi at least has one)
  • Node.js (V8) being less optimized for ARM

I found that I have benchmarked those two CPUs against each other before. That time the Celeron was twice as fast as the RPi, and the FPU of the RPi performed decently. Blaming the small cache makes more sense to me than blaming the people who implemented ARM support in V8.

The conclusion is that Raspberry Pi (v1 at least) is extremely slow running Node.js. Other benchmarks indicate that RPi v2 is significantly faster.

Raspberry Pi (v1), OpenWrt (14.07) and Node.js (v0.10.35 & v0.12.2)

Since I gave up running NetBSD on my Raspberry Pi, I decided it was time to try OpenWrt. And, to my surprise, I also managed to cross compile Node.js!

Install OpenWrt on Raspberry Pi (v1@700MHz)
I installed OpenWrt Barrier Breaker (the currently stable release) using the standard instructions.

After you have put the image on an SD-card with dd, it is quite easy to resize the root partition:

  1. copy the second partition to an image file using dd
  2. use fdisk to delete the second partition and create a new, bigger one
  3. format the new partition with mkfs.ext4
  4. mount the image file using mount -o loop
  5. mount the new second partition
  6. copy all data from the image file to the second partition using cp -a

If you want to, you can edit /etc/config/network while you are working with the OpenWrt root partition anyway:

#config interface 'lan'
#	option ifname 'eth0'
#	option type 'bridge'
#	option proto 'static'
#	option ipaddr '192.168.1.1'
#	option netmask '255.255.255.0'
#	option ip6assign '60'
#	option gateway '?.?.?.?'
#	option dns '?.?.?.?'
config interface 'lan'
	option ifname 'eth0'
	option proto 'dhcp'
	option macaddr 'XX:XX:XX:XX:XX:XX'
	option hostname 'rpiopenwrt'

Probably you want to disable dnsmasq, odhcpd and firewall too:

.../etc/init.d/$ chmod -x dnsmasq firewall odhcpd

OR (depending on your idea of what is the right way)

.../etc/rc.d$ sudo rm S60dnsmasq S35odhcpd K85odhcpd S19firewall

Also, it is a good idea to edit config.txt (on the DOS partition):

gpu_mem=1

I don’t know if 1 is really a legal value, but it worked for me, and I had much more memory available than when gpu_mem was not set.

Node.js4 added 2015-10-03
For Node.js, check Node.js 4 builds.

Building Node.js v0.12.2
I downloaded and built Node.js v0.12.2 on a Xubuntu machine with an x64 CPU. On such a machine you can download the standard OpenWrt toolchain for Raspberry Pi.

I replaced configure and cpu.cc in the standard sources with the files from This Page (they are meant for v0.12.1 but they work equally well for v0.12.2).

I then found a gist that gave me a good start. I modified it and ended up with:

#!/bin/sh -e

export STAGING_DIR=...path to your toolchain...

#Tools
export CSTOOLS="$STAGING_DIR"
export CSTOOLS_INC=${CSTOOLS}/include
export CSTOOLS_LIB=${CSTOOLS}/lib
export ARM_TARGET_LIB=$CSTOOLS_LIB

export TARGET_ARCH="-march=armv6j"

#Define the cross compilers on your system
export AR="arm-openwrt-linux-uclibcgnueabi-ar"
export CC="arm-openwrt-linux-uclibcgnueabi-gcc"
export CXX="arm-openwrt-linux-uclibcgnueabi-g++"
export LINK="arm-openwrt-linux-uclibcgnueabi-g++"
export CPP="arm-openwrt-linux-uclibcgnueabi-gcc -E"
export LD="arm-openwrt-linux-uclibcgnueabi-ld"
export AS="arm-openwrt-linux-uclibcgnueabi-as"
export CCLD="arm-openwrt-linux-uclibcgnueabi-gcc ${TARGET_ARCH} ${TARGET_TUNE}"
export NM="arm-openwrt-linux-uclibcgnueabi-nm"
export STRIP="arm-openwrt-linux-uclibcgnueabi-strip"
export OBJCOPY="arm-openwrt-linux-uclibcgnueabi-objcopy"
export RANLIB="arm-openwrt-linux-uclibcgnueabi-ranlib"
export F77="arm-openwrt-linux-uclibcgnueabi-g77 ${TARGET_ARCH} ${TARGET_TUNE}"
unset LIBC

#Define flags
export CXXFLAGS="-march=armv6j"
export LDFLAGS="-L${CSTOOLS_LIB} -Wl,-rpath-link,${CSTOOLS_LIB} -Wl,-O1 -Wl,--hash-style=gnu"
export CFLAGS="-isystem${CSTOOLS_INC} -fexpensive-optimizations -frename-registers -fomit-frame-pointer -O2"
export CPPFLAGS="-isystem${CSTOOLS_INC}"
export CCFLAGS="-march=armv6j"

export PATH="${CSTOOLS}/bin:$PATH"

./configure --without-snapshot --dest-cpu=arm --dest-os=linux --without-npm

bash --norc

Run this script in the Node.js source directory. If everything goes well it configures the Node.js build and leaves you with a shell where you can simply run:

$ make

If compilation is fine, you find the node binary in the out/Release folder. Copy it to your OpenWrt Raspberry Pi.

Building Node.js v0.10.35
I first successfully built Node.js v0.10.35.

The (less refined) script for configuring that I used was:

#!/bin/sh -e

export STAGING_DIR=...path to your toolchain...

#Tools
export CSTOOLS="$STAGING_DIR"
export CSTOOLS_INC=${CSTOOLS}/include
export CSTOOLS_LIB=${CSTOOLS}/lib
export ARM_TARGET_LIB=$CSTOOLS_LIB
export GYP_DEFINES="armv7=0"

#Define our target device
export TARGET_ARCH="-march=armv6"
export TARGET_TUNE="-mfloat-abi=hard"

#Define the cross compilers on your system
export AR="arm-openwrt-linux-uclibcgnueabi-ar"
export CC="arm-openwrt-linux-uclibcgnueabi-gcc"
export CXX="arm-openwrt-linux-uclibcgnueabi-g++"
export LINK="arm-openwrt-linux-uclibcgnueabi-g++"
export CPP="arm-openwrt-linux-uclibcgnueabi-gcc -E"
export LD="arm-openwrt-linux-uclibcgnueabi-ld"
export AS="arm-openwrt-linux-uclibcgnueabi-as"
export CCLD="arm-openwrt-linux-uclibcgnueabi-gcc ${TARGET_ARCH} ${TARGET_TUNE}"
export NM="arm-openwrt-linux-uclibcgnueabi-nm"
export STRIP="arm-openwrt-linux-uclibcgnueabi-strip"
export OBJCOPY="arm-openwrt-linux-uclibcgnueabi-objcopy"
export RANLIB="arm-openwrt-linux-uclibcgnueabi-ranlib"
export F77="arm-openwrt-linux-uclibcgnueabi-g77 ${TARGET_ARCH} ${TARGET_TUNE}"
unset LIBC

#Define flags
export CXXFLAGS="-march=armv6"
export LDFLAGS="-L${CSTOOLS_LIB} -Wl,-rpath-link,${CSTOOLS_LIB} -Wl,-O1 -Wl,--hash-style=gnu"
export CFLAGS="-isystem${CSTOOLS_INC} -fexpensive-optimizations -frename-registers -fomit-frame-pointer -O2 -ggdb3"
export CPPFLAGS="-isystem${CSTOOLS_INC}"
export CCFLAGS="-march=armv6"

export PATH="${CSTOOLS}/bin:$PATH"

./configure --without-snapshot --dest-cpu=arm --dest-os=linux
bash --norc

Running node on the Raspberry Pi
Back on the Raspberry Pi you need to install a few packages:

# ldd ./node 
	libdl.so.0 => /lib/libdl.so.0 (0xb6f60000)
	librt.so.0 => not found
	libstdc++.so.6 => not found
	libm.so.0 => /lib/libm.so.0 (0xb6f48000)
	libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xb6f34000)
	libpthread.so.0 => not found
	libc.so.0 => /lib/libc.so.0 (0xb6edf000)
	ld-uClibc.so.0 => /lib/ld-uClibc.so.0 (0xb6f6c000)
# opkg update
# opkg install librt
# opkg install libstdcpp

That is all! Now you should be ready to run node. The node binary is about 13 MB (the v0.10.35 binary was 19 MB, perhaps because of -ggdb3), so it is not ideal to deploy it to other typical OpenWrt hardware.

Final comments
I ran a few small programs to test, and they were fine. I guess some more testing would be appropriate. The performance is very comparable to Node.js built and executed on Raspbian.

I think RaspberryPi+OpenWrt+Node.js is a very interesting and competitive combination for microservices!