Tag Archives: Lisp

Peculiar Compiler Optimizations

My teacher in a High Performance Computing class once told me not to confuse the compiler. This was in the late 90s, and SGI C and Fortran compilers were supposed to replace entire blocks of code with highly optimised implementations. As long as the compiler understood your intentions.

I have never discovered this, but yesterday perhaps! Read on.

I have been playing around with LISP, solving Project Euler challenges on Hackerrank. For problems 44 and 45 I decided to do Binary Search (which afterwards turned out not to be so smart, but that is another story) and took the implementation from Rosetta Code (the iterative one).

Binary search is about finding an element in a sorted array by starting in the middle, and jumping left or right, cutting the remaning array in half each time.

In my case I decided just a part of the array was worth searching so instead of searching the entire array [0..length] I wanted to search up to the Nth element [0..N]. Searching fewer elements should be faster, so I improved the binary search function to take an additional argument: hi. For SBCL (Steel Bank Common Lisp), this surprisingly had horrible effect on performance.

Benchmark Results
The results for different algorithms, different machines and different LISP implementations follow. The RPi V1 runs Arch Linux, Clisp comes with Arch, SBCL is downloaded from SBCL webpage. The RPi V2 runs Raspbian and SBCL that comes with the distribution. The Celeron runs Ubuntu that comes with SBCL. The MacBook Air runs OS X and SBCL is downloaded separately.

 (times in seconds)             Array Size Standard Optimized Recursive    C -O2
RPi V1  ARMv6 700MHz  Clisp 2.49      5000    640       633       720
                      SBCL  1.3.12    5000     15.6      27        34       0.95
RPi V2  ARMv7 900MHz  SBCL  1.2.4     5000      6.2      16        17       0.31
                                     20000    110       293       300       5.8
NUC Celeron J3455     Clisp 2.49     20000    765       762       720   
                      SBCL  1.3.3    20000      8.3      16.7      18.0     1.0
MacBook Air i5        SBCL  1.2.11   20000      4.0      11.5      12.3     0.75

A very slight “optimization” turns out to have very negative impact on performance for the quite fast (compiled) SBCL. I can’t imagine any other explanation than SBCL replaces the standard binary search with optimized code rather than executing my program. For Clisp the optimization actually works quite as would be expected and the recursive code is actually the fastest. On the Celeron, Clisp and SBCL behaves completely opposite.

Comparing to C
The other week I had the feeling (SBCL) LISP was fast and decided to compare LISP to C. This time I had the feeling that LISP was rather slow so I ported my test program to C (basically line by line). Well, I found that SBCL is actually pretty fast (especially on x86/x64), and C was faster only thanks to -O2 on some systems. -O2 actually made the C-program more than 5 times faster: perhaps also the C-compiler replace the entire binary search?

The Test Program
The code follows. The only difference between Standard and Optimized is the single line that is commented out (with ; in LISP) selecting which binary search to run (the length of the function name does not explain the performance difference).

The program creates an array of length N and populates it with values by the formula n(n+1)/2. This takes little time. It then checks for values 10,20,30… if the values are found in the array (using binary search). In this program the entire array is always searched, not taking advantage of the extra parameter (although the optimized version does not need to find the length of the array every time called).

(defun binary-search (value array)                       ; Standard 2 lines
    (let ((low 0) (high (1- (length array))))            ;
        (do () ((< high low) nil)
            (let ((middle (floor (+ low high) 2)))
                (cond ((> (aref array middle) value)
                       (setf high (1- middle)))
                      ((< (aref array middle) value)
                       (setf low (1+ middle)))
                      (t (return middle)))))))

(defun binary-search-optimized (value array hi)          ; Optimized 2 lines
    (let ((low 0) (high hi))                             ;
        (do () ((< high low) nil)
            (let ((middle (floor (+ low high) 2)))
                (cond ((> (aref array middle) value)
                       (setf high (1- middle)))
                      ((< (aref array middle) value)
                       (setf low (1+ middle)))
                      (t (return middle)))))))

(defun binary-search-r (value
                        &optional (low 0)
                        (high (1- (length array))))
  (if (< high low)
      (let ((middle (floor (+ low high) 2)))
        (cond ((> (aref array middle) value)
               (binary-search-r value array low (1- middle)))
              ((< (aref array middle) value)
               (binary-search-r value array (1+ middle) high))
              (t middle)))))

(defun formula (n)
    (/ (* n (+ n 1)) 2))

(defun init-array (n)
    (let ((arr (make-array n)))
        (loop for i from 0 to (1- n) do
            (setf (aref arr i) (formula (1- i))))

(defun solve (arr n max)
    (let ((ret 0))
        (loop for i from 10 to max by 10 do
            (if (binary-search i arr)                     ; Toggle code used
;           (if (binary-search-r i arr)                   ;
;           (if (binary-search-optimized i arr n)         ;
                (incf ret)
(defun main ()
    (let ((n (read)))
        (let ((arr (init-array n)))
            (format T "~D~%" (solve arr (1- n) (aref arr (1- n)))))))


Since I am a very novice LISP programmer I appreciate any feedback. The code above does not solve Project Euler 44 or 45, it is much simplified to test binary search. Initially I wrote code that relied on recursion rather than loops but I exceeded the stack size and ended up with loops (according to what I read, loops rather than recursion is the preferred style of Common Lisp).

Well... optimization is hard, and dont make any assumptions. As I have found many times before, what makes code faster on some platforms can make it slower on others. When it comes to optimizing SBCL and compiled LISP much experience is required, and dont forget to measure!

Playing with LISP and LISP vs C

Lisp is fun! Well, since I first knew about Lisp I was fascinated, but I have found it hard to learn Lisp and to play with it in a meaningful way. A few years ago I wrote about it here and here. As usual, the first steps of learning something new can be the hardest.

Occationally I use Hackerrank to find programming challanges to solve for fun. I particularly like the Project Euler competition. I find it particularly good for trying out new languages: you get a “meaningful” challenge, a simple environment prepared in your web browser, and automated test cases. So, this time I didn’t waste my time trying to find the right Lisp implementation for me, I just started hacking on Project Euler 38 and 39 on Hackerrank.

Problem 38 was quite simple, but 39 was more interesting. When I had solved it, I found my implementation was not at all fast enough, so I started experimenting locally (the Hackerrank environment is not optimal for tweaking, optimization and debugging).

Choosing a (Common) Lisp implementation
There are quite many Common Lisp implementations out there. The one Hackerrank uses is SBCL. That is clearly the Common Lisp implementation I would recommend (based on my little experience) if it is available for your platform.

I installed SBCL with apt-get in Ubuntu. I also downloaded binaries directly for my Mac OS X computer and my Raspberry Pi (v1) running Arch linux. Installation is a bit non-standard, but you can actually run it without installing (just execute run-sbcl.sh in downloaded folder).

I also tried clisp and ecl, none of these could deal with the memory usage (stack size) of my program. For clisp I found no way to manipulate stack sizes at all. For ecl I made some progress but I could not make it run my program.

SBCL is a Lisp compiler, and it produces fast and efficient code. I later compared it to C.

Project Euler 39
Project Euler 39 is basically about finding integer solutions to Pythagoras theorem. For a given, large, perimeter, how many right triangles are there? For example:

300000^2 + 400000^2 = 500000^2

This triangle has a perimeter of 300000+400000+500000=1200000. What other values for a and b so that

a + b = 700000
a^2 + b^2 = 500000^2

are there? The Hackerrank challenge requires you to work with perimeters up to 5000000. If you implement a solution, a few things to immediately note:

  • The squares wont fit in a 32bit integer. They will fit with no loss of precision in the 53 bits of a 64 bit double and they will also fit in a 64 bit integer. (This matters not for Common Lisp)
  • If you want to do recursion (and of course you want when you code Lisp) it will be millions of recursion steps, which will be a challenge to the stack size. (This also turned out not to matter for SBCL)

The Lisp implementation
It turned out that the SBCL compiler optimized the recursion is such a way that the memory usage was quite low. SBCL successfully runs my program on RPi/Arch, Intel/Ubuntu and Intel/OSX with quite reasonable memory usage.

Since this is about learing Lisp I wanted a 100% functional programming implementation. Only pure functions. A lot of my code is about generating, modifying and testing triangles. A triangle (a b c) can obviously be represented as a Lisp list (a b c) and this was my first implementation. Then if you want to read a, b or c from a list abc, or create the list from a, b and c, you can do:

  a: (car abc)
  b: (car (cdr abc))
  c: (car (cdr (cdr abc)))

abc: (list a b c)

I found this cumbersome. It became a lot of list-to-variables and variables-to-list overhead (I didnt care so much about performance, more about my code readability). I learnt that Lisp functions can return multiple values using value and that you can bind them with multiple-value-bind and use them as arguments to a function using multiple-value-call. This felt functional and pure enough, and it made my code 25% faster than the car/cdr pattern above.:

; a (stupid) function returning a triangle as three values
(defun get-345-triangle ()
  (values 3 4 5))

; a function calculating the perimeter of triangle (from a function)
(defun triangle-perimeter-1 (tri-func)
  (multiple-value-bind (a b c) (funcall tri-func)
    (+ a b c)))

; and in this case you dont need to bind, you can use + directly
(defun triangle-perimeter-2 (tri-func)
  (multiple-value-call #'+ (funcall tri-func)))

; now this works
(triangle-perimeter-1 #'get-345-triangle)
(triangle-perimeter-2 #'get-345-triangle)

Since I am a very inexperienced Lisp programmer I appreciate suggestions for improvement.

Performance of Lisp
My final Hackerrank submission of Lisp code executes in about 4.5 seconds on my Intel i5/Ubuntu. It takes about the same time on the Hackerrank web page, which is fast enough to pass all tests. On the Raspberry Pi v1 (ARMv6 @700 MHz) it takes more than 700 seconds. My intuition told me that 4.5 seconds was very good. This made me ask two questions. How would Lisp compare to C? And why is the ARM more than 100 times slower, how would that compare in C?

The C implementation
My ambition was to rewrite Lisp to C line by line. So my C-program has exactly the same functions which take almost exactly the same arguments. All calculations are identical and performed in exactly the same order. The C-program relies entirely on recursion instead of loops (just like the Lisp program). However…

Functions in C can not return multiple variables. While Lisp had values I decided to use a reference to a struct in C:

(defun get-a-triangle()
  (values x y z))

void get_a_triangle(struct triangle *t) {
  t->a = x;
  t->b = y;
  t->c = z;

If the C-triangle struct is a local variable on the callers stack the difference is quite small (from a practical point of view, from a theoretic strict functional programming perspective its a different story).

Numbers in Lisp have arbitrary precision integers and floats make no difference. So, when porting to C, I had to pick numeric types. For most purposes, int32_t was good enough. But, for the purpose of calculating Pythagoras theorem higher precision was needed (as I wrote above, the 53 bits of double, or 64 bits of int64_t are good). So I ended up with 5 versions of the C-program (to compare performance):

  1. All 64-bit integers
  2. 32-bit integers, 64-bit for “triangles”
  3. 32-bit integers, double for “triangles”
  4. 32-bit integers, 64-bit only for pythagoras calc
  5. 32-bit integers, double only for pythagoras calc

(In cases 2,3 the struct triangle has int64_t/doubles properties, and all manipulations and calculations on triangles use these datatypes. In cases 4,5 everything is int32_t, except the internals of a single function, which casts to higher precision before doing its calculations.)

The C-program requires a significant stack size. The stack size can be obtain and changed like (numbers in kb, all values given with ulimit -a):

$ ulimit -s

$ ulimit -s 100000

For my program, a stack size much higher than 8192 is needed (see below). It seems impossible to get large stack than 64Mb in Mac OS X, so my C program could never run there.

Benchmark findings
All C-programs are compiled with gcc -O2.

 CPU            MHZ      SBCL        64     32/64  32/double   32(64)  32(double)
Time (s)
 i5-4250U 1300-2600       4.5      1.52      1.52      1.60      1,54      1.58
 ARMv6          700      ~715        85        83        45        42        39
 ARMv7          900       357        23        21        13        12        10

Max Res (MB)
 i5-4250U                  41       103       103       103       103       103
 ARMv6                     50       220       210        79       110        76
 ARMv7                     57       180       160        87        97        62

This is not too easy to interpret! The ironic thing is that the fastest thing on the x64-cpu (64-bit integers everywhere) is the slowest on the ARMv6. However, the fastest option on the ARMv6 (32-bit everywhere, and when absolutely needed, use double) is almost the worst on the i5 CPU.

When it comes to the 64-bit i5, it basically does not matter what datatypes you use.

When it comes to the ARMv6, the most important thing is to not store the triangles as int64_t. The strange thing here is the stack sizes. Why does it double (compared to x64) when triangles are stored as int64_t? And the doubles, why do they reduce stack size so much (where are all these doubles actually stored)?

The time command gives max resident memory usage. If I set ulimit -s 128 the first two programs fail (with Segmentation fault 11), and the last three ones succeed, on the ARMv6.

I have found before that the performance of the ARMv6 suffers because of its slow memory and small cache. It is quite possible that the poor performance of the ARMv6 compared to the i5 is related to its slow memory, and the recursion (and stack memory) heavy algorithm.

Finally, SBCL in x64 has very good performance even compared to C (however, an iterative C-implementation, fitting completely in cache, would probably be faster). Note that I am a novice Lisp programmer and this is a math heavy program where the generic number type of Lisp will come at a cost. On the ARMv6, Lisp performance suffers much more.

Windows stack size limit
For Windows, stack size limit is set in the binary, not in the shell. With Cygwin/GCC use the flag -Wl,–stack,1000000 for one million bytes. Note that these are options passed on to the linker.

Future investigations
And I am curious about how much faster a minimal-memory-footprint loop-based C-program would perform.

The source code
Since this code solves a problem in Hackerrank I hesitate to publish it. If you want it for any other reason than just running it on Hackerrank let me know.

Building a Common Lisp program (ECL)

LISP is easy – you just need to start up the interpreter and start playing. But what if you are dependent on libraries, and you want to compile a binary? If you come from another background, like I did, it is quite confusing in the beginning.

Everything below applies to ECL – I think most things will apply to other Common Lisp implementations as well. I build a little command line utility that converts things to/from Base64 (using a library for that).

ASDF is a tool that handles dependencies between packages, and also controls your build process (like make). Every project is called a System. Yours to.

When you download lisp packages they typically come with an asd-file, and one or more lisp-files. Each package goes in its own directory, and ASDF needs to know about each package.

I did everything from scratch and installed ecl in /opt/ecl. I put the packages in /opt/ecl/packages (not standard at all).

Project files and building it
These are the files my project contains, and how to build.

kvaser@kvaser:~/lisp/simple-base64$ ls -l
-rwxr-xr-x 1 kvaser kvaser  337 Mar 13 10:18 build.lisp
-rw-r--r-- 1 kvaser kvaser  197 Mar 13 10:18 simple-base64.asd
-rwxr-xr-x 1 kvaser kvaser 1389 Mar 13 13:51 simple-base64.lisp
kvaser@kvaser:~/lisp/simple-base64$ ./build
  ... ...

The binaries (of my program, and all dependencies) end up ~/.cache/, so thats where you need to go to execute your program (or just make a symbolic link to the project directory).


(in-package :asdf)

(defsystem :simple-base64
  :name "simple-base64"
  :author "Zo0ok"
  :version "0"

  :components((:file "simple-base64"))

  :depends-on (:s-base64 :flexi-streams))

:components points to my lisp-file(s).
:depends-on lists other systems that I depend on (the base64-library itself, and a stream library that turned out to be useful.

Here is the source code to the program itself. It is very non-Lispy, remember, I am new to Lisp and I dont know how to program Lisp with style.

(defun print-usage-and-quit ()
  (format *error-output* "Usage:~%")
  (format *error-output* "  ./simple-base64 -e PLAINDATA~%")
  (format *error-output* "  ./simple-base64 -d BASE64DATA~%")
  (format *error-output* "  ./simple-base64 -e < plaindata.file~%")
  (format *error-output* "  ./simple-base64 -d < base64data.file~%")

;;; MAIN starts here

(let ( (mode-op-enc NIL)
       (mode-src-stdin NIL)
       (input-stream NIL)
       (output-stream NIL) )
    ( (= 2 (length si::*command-args*) )
      (setf mode-src-stdin T ) )
    ( (= 3 (length si::*command-args*) )
      (setf mode-src-stdin NIL ) )
    ( T
      (print-usage-and-quit) ) )
    ( (string= "-d" (second si::*command-args*) )
      (setf mode-op-enc NIL) )
    ( (string= "-e" (second si::*command-args*) )
      (setf mode-op-enc T) )
    ( T
      (print-usage-and-quit) ) )

    ( mode-src-stdin
      ( setf input-stream *standard-input* ))
    ( mode-op-enc
      ( setf input-stream (flexi-streams:make-in-memory-input-stream
                         (map 'vector #'char-code (third si::*command-args*)))))
    ( ( not mode-op-enc )
      ( setf input-stream (make-string-input-stream (third si::*command-args*))))

  (if mode-op-enc
    (s-base64:encode-base64 input-stream *standard-output*)
    (s-base64:decode-base64 input-stream *standard-output*) )

Notice that nowhere the systems I depend on are included, they are just used when needed.

Finally the build-script, a lisp program that uses asdf:

#!/opt/ecl/bin/ecl -shell

(require 'asdf)
(push (truename #P"/opt/ecl/packages/s-base64") asdf:*central-registry*)
(push (truename #P"/opt/ecl/packages/cl-trivial-gray-streams") asdf:*central-registry*)
(push (truename #P"/opt/ecl/packages/flexi-streams-1.0.7") asdf:*central-registry*)
(asdf:make-build :simple-base64 :type :program)

Note that the build-script is the place to put paths to systems I depend on. Also note that I have included cl-trivial-gray-streams, a system I dont use directly, but flexi-streams needs it so I need to tell where it is. Finally, this pushing paths to *central-registry* is supposed to be the "old way". But for now I was happy to find a way that works, and that I understand.

As usual, when something works it looks simple, but it is tricky to get all the details right in the first place. I believe this is a good starting point for a small lisp-project that depends on available libraries.

Trivial Gray Streams
The package Trivial Gray Streams caused problems. The standard package I downloaded did not work for ECL (complained it could not find the system). I ended up installing Trivial Gray Streams using Debian apt-get. It puts lisp packages in /usr/share/common-lisp, and that version worked.

This applies to ECL version 11.1.1 and Trivial Gray Streams from Debian 6.0. The version of Trivial Gray Streams that did not work was dated 2008-11-02.

Lisp on Debian/ARM

After reading Revenge of the Nerds I decided it was time to learn Lisp. Programming without some kind of real project is boring, so my plan is to write some web applications using jquery and Lisp (for the back end).

Since I have a Qnap TS-109 running 24×7 I thought it would make a good development machine and Lisp web server. It runs Debian 6.0, but running Lisp on it turned out to be a challenge.

Debian, Lisp and ASDF
Debian supports installing different implementations of (Common) Lisp. However, it seems to be tricky to find a version that installs a binary on Debian ARM.

Also, there is a package depency tool for lisp called ASDF. Lisp implementations should come with it.

The only Common Lisp that I managed to easily install (i.e. with apt-get) in Debain 6.0 ARM was GCL. But it is a version of GCL that is 5 years old, and it does not come with ASDF.

I spent much time trying to compile clisp, but in the end I ended up with:

  > ( / 6 3)
  > ( / 5 2)
  Segmentation Fault

Not so fun. Significant parts of clisp is written in assembly (both a good thing and a bad thing), and I was really not able to figure out if it was supposed to work on ARM EABI at all, or just on the old ARM ABI. So after much struggle I gave up clisp.

I managed to compile ECL from source. Not completely without hassle though. It comes with libffi, but I ended up with compilation errors (the processor does not support…). So, I downloaded libffi, compiled it myself and installed it in /opt/libffi. That was no problem, but I ended up making a symbolic link to include myself:

kvaser@kvaser:/opt/libffi$ ls -l
total 8
lrwxrwxrwx 1 root root   40 Mar 11 16:52 include -> /opt/libffi/lib/libffi-3.0.10rc9/include
drwxr-xr-x 4 root root 4096 Mar 11 16:37 lib
drwxr-xr-x 4 root root 4096 Mar 11 16:37 share

Now I configured ecl with:

CPPFLAGS=-I/opt/libffi/include LDFLAGS=-L/opt/libffi/lib ./configure --prefix=/opt/ecl --with-dffi=auto

That worked, and compiling went fine until ecl_min could not be executed, because it could not find libffi.so.6. I tried to fix that a while, but finally ended up making another symbolic link:

kvaser@kvaser:/usr/lib$ ls -l libffi.so.6
lrwxrwxrwx 1 root root 31 Mar 11 19:56 libffi.so.6 -> /opt/libffi/lib/libffi.so.6.0.0

After that, I ran make again to finish compilation. It went fine.

ECL, ASDF and cl-who
Now, where to put the Lisp http library cl-who? I copied the asd-file and the lisp-files to the ecl library folder and ran ecl as root:

kvaser@kvaser:~/lisp/cl-who-0.11.1$ sudo cp cl-who.asd /opt/ecl/lib/ecl-11.1.1/
kvaser@kvaser:~/lisp/cl-who-0.11.1$ sudo cp *.lisp /opt/ecl/lib/ecl-11.1.1/
kvaser@kvaser:~$ sudo /opt/ecl/bin/ecl
  ... ...
> (require 'asdf)

;;; Loading #P"/opt/ecl/lib/ecl-11.1.1/asdf.fas"
;;; Loading #P"/opt/ecl/lib/ecl-11.1.1/cmp.fas"
("ASDF" "CMP")

> (asdf:operate 'asdf:load-op :cl-who)    
  ... ...

Now, cl-who is compiled and installed, ready to use. Next time, it does not need to be compiled.

Hello LISP
I wrote a little Hello World program:

kvaser@kvaser:~/lisp$ cat hello.lisp 
(format T "Hello Lisp~%")
kvaser@kvaser:~/lisp$ /opt/ecl/bin/ecl -load hello.lisp 
;;; Loading "/home/kvaser/lisp/hello.lisp"
Hello Lisp

Quite good (except I already know the file was loaded and it disturbs my output, but whatever. How about compiling it?

kvaser@kvaser:~/lisp$ /opt/ecl/bin/ecl -compile hello.lisp 
;;; Loading #P"/opt/ecl/lib/ecl-11.1.1/cmp.fas"
;;; Compiling hello.lisp.
;;; OPTIMIZE levels: Safety=2, Space=0, Speed=3, Debug=0
;;; End of Pass 1.
;;; Note:
;;;   Invoking external command:
;;;   gcc -I. -I/opt/ecl/include/ -I/opt/libffi/include -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -g -O2 -fPIC -Dlinux -O2 -w -c hello.c -o hello.o 
;;; Note:
;;;   Invoking external command:
;;;   gcc -o hello.fas -L/opt/ecl/lib/ /home/kvaser/lisp/hello.o -Wl,--rpath,/opt/ecl/lib/ -shared -L/opt/libffi/lib -L/opt/libffi/lib -lecl -lgmp -lgc -lffi -ldl -lm 
;;; Finished compiling hello.lisp.
kvaser@kvaser:~/lisp$ ls
cl-who-0.11.1  cl-who.tar.gz  hello.fas  hello.lisp
kvaser@kvaser:~/lisp$ ./hello.fas 
Segmentation fault
kvaser@kvaser:~/lisp$ /opt/ecl/bin/ecl -load hello.fas 
;;; Loading "/home/kvaser/lisp/hello.fas"
Hello Lisp

Ok, how to make a standalone executable?

> (compile-file "hello.lisp" :system-p t)
  ... ...

> (c:build-program "hello" :lisp-files '("hello.o"))
  ... ...
> (quit)
kvaser@kvaser:~/lisp$ ls
hello  hello.fas  hello.lisp  hello.o
kvaser@kvaser:~/lisp$ time ./hello
Hello Lisp

real	0m3.084s
user	0m2.920s
sys	0m0.160s
kvaser@kvaser:~/lisp$ time /opt/ecl/bin/ecl -load hello.fas 
;;; Loading "/home/kvaser/lisp/hello.fas"
Hello Lisp

real	0m3.127s
user	0m3.060s
sys	0m0.080s
kvaser@kvaser:~/lisp$ time /opt/ecl/bin/ecl -load hello.lisp
;;; Loading "/home/kvaser/lisp/hello.lisp"
Hello Lisp

real	0m3.113s
user	0m2.960s
sys	0m0.160s

Clearly, some overhead is involved in invoking ECL. I compared to C:

kvaser@kvaser:~/lisp$ cat hello.c 

int main(int argc, char **argv) {
	printf("Hello C\n");
kvaser@kvaser:~/lisp$ gcc -o hello_c hello.c 
kvaser@kvaser:~/lisp$ time ./hello_c 
Hello C

real	0m0.012s
user	0m0.010s
sys	0m0.000s

So, I can not use this method for CGI programming right away – each call to the webserver will take at least 3 seconds.