Introduction

These benchmarks are the result of my scalable network programming research. My interest in this area is to see how scalable and fast network applications can be on standard PC hardware.

I have done most of my research on Linux 2.4, 2.5 and 2.6 kernels using a home-grown distribution I affectionately call "Leanux". I have experimented with several APIs and methods to try and get the most scalability and performance out of a web server. The ultimate goal, however, is to demonstrate scalability by surviving a Slashdotting.

Please note that most sites that get Slashdotted succumb not only because of bad software but also because their internet connectivity gets saturated. Besides choosing an ISP with low bandwidth costs, there is only so much I can do about that. I hope this ISP can handle the load.

During my research I experimented with several APIs to find a good abstraction for the different scalable event notification mechanisms. In the end, I settled on Dan Bernstein's IO library interface, which I slightly extended with abstractions around writev and sendfile. Once my implementation worked on Linux, I decided to port the API to the BSDs as well, to finally get some real benchmark data into the eternal flamewar about which IP stack scales best.

To that end, I installed FreeBSD, OpenBSD and NetBSD on my notebook, so all benchmarks would run on the same hardware.

Benchmarking BSD and Linux

About the hardware

The benchmark hardware is a Dell Inspiron 8000 with a 900 MHz Pentium 3 and 256 MB RAM. The network chip is a MiniPCI Intel eepro100 card, which is supported and well tuned on all operating systems.

Since my intention is to benchmark the software and not the hardware, I didn't care that it was only a single, slow CPU with slow memory and a slow IDE hard disk. Real server machines built for high scalability will probably use more powerful hardware than this.

Common settings

On all of the operating systems, I took the default settings. If I had to turn a knob to get PCI IDE DMA enabled, I did so. If the OS had power management support, I did not disable that. I enabled (and used) IPv6 support on all operating systems.

Linux 2.4

I benchmarked a stock Linux 2.4.22 kernel.

Linux 2.6

I benchmarked a stock Linux 2.6.0-test7 kernel.

OpenBSD 3.4

I benchmarked an OpenBSD 3.4-CURRENT. I installed -CURRENT directly to get patched-up openssl and openssh versions. The first OpenBSD installation was wasted because the stupid boot loader (after installing everything) did not want to boot from the partition, apparently because it lies beyond cylinder 1024. In my opinion, this is absolutely inexcusable for a modern operating system in the year 2003, and the OpenBSD people should be ashamed of themselves. Linux and the other BSDs had no problem booting from high places.

I ended up reinstalling OpenBSD in the swap partition from my Linux installation. In the end I had so little space that I had to copy the logs over to my desktop box after each benchmark, otherwise the filesystem would be full.

OpenBSD also caused a lot of grief on the IPv6 front. The OpenBSD guys intentionally broke their IPv6 stack so that it does not allow IPv4 connections to and from IPv6 sockets via the IPv4-mapped addresses that the IPv6 standard defines for exactly this purpose. I find this behaviour of pissing on internet standards despicable and unworthy of free operating systems.

OpenBSD had lots of performance and scalability issues. Particularly embarrassing is the fact that even NetBSD outperforms it on a few benchmarks. OpenBSD is a fork of NetBSD, so I expected it to be no worse than its ancestor in all key areas. I was wrong.

FreeBSD 5.1

I installed a FreeBSD 5.1-RELEASE on a free partition. However, that kernel turned out to be unstable and would panic or even freeze under load. So I reluctantly upgraded to 5.1-CURRENT, which fixed the problems and proved to be a very stable kernel.

I also upgraded openssl and openssh on this installation, and I amended my library to use the IPV6_V6ONLY sockopt to get proper IPv6 behaviour from FreeBSD.
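
For illustration, here is a minimal sketch of what that amounts to: create an IPv6 socket and clear IPV6_V6ONLY so that IPv4 clients can connect via v4-mapped addresses. This is not the actual libowfat code, just the idea, with error handling trimmed.

  /* Sketch only: a dual-stack listening socket. FreeBSD defaults
   * IPV6_V6ONLY to on, so we explicitly clear it to get the
   * standard IPv4-mapped address behaviour. */
  #include <sys/socket.h>
  #include <netinet/in.h>

  int make_dual_stack_socket(void) {
    int s, off = 0;
    s = socket(AF_INET6, SOCK_STREAM, 0);
    if (s == -1) return -1;
  #ifdef IPV6_V6ONLY
    setsockopt(s, IPPROTO_IPV6, IPV6_V6ONLY, &off, sizeof off);
  #endif
    return s;
  }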

Apart from the stability issues in 5.1-RELEASE, FreeBSD turned out to be very stable and the fastest and most scalable BSD.

NetBSD 1.6.1

I installed a NetBSD 1.6.1-RELEASE on a free partition. The kernel was very stable and felt much snappier than OpenBSD. OpenBSD was the first BSD I tried, and I had somehow expected NetBSD to be slower than it in every respect. The opposite was true: disk and file system performance in particular appear to be much better in NetBSD.

Unfortunately, the IPV6_V6ONLY sockopt did not work on NetBSD. One has to use a sysctl to get the proper IPv6 behaviour. That is unfortunate but bearable.

Since NetBSD did not start any network services by default, I did not feel obliged to upgrade openssl and openssh. And since the kernel was so stable, I did not feel I had to try the -CURRENT version either. In the meantime I have heard that NetBSD -CURRENT contains over two years' worth of improvements, so it is probably even better than NetBSD 1.6.1, which already surprised me with its good performance (although it is clearly outperformed by FreeBSD).

The socket benchmark

My first benchmark was calling socket ten thousand times. I normally use gettimeofday to take timings in benchmarks, but in this case the results were so small and close together that I switched to reading the CPU cycle counter, which has 900 times finer resolution on my notebook but makes the results less easily comparable across machines.

The task we are benchmarking here is the kernel allocating a socket data structure (which is easy) and selecting the lowest unused file descriptor (which is not so easy).
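
The core of the benchmark is just a loop like the following sketch (this is not the actual code shipped with gatling, and the descriptor limit has to be raised with ulimit -n for it to get anywhere near ten thousand sockets):

  /* Sketch of the socket benchmark: time each socket() call in CPU
   * cycles via rdtsc. The sockets are deliberately kept open so that
   * finding the lowest unused descriptor gets harder over time. */
  #include <sys/socket.h>
  #include <stdio.h>

  static unsigned long long rdtsc(void) {
    unsigned int lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
  }

  int main(void) {
    int i, s;
    unsigned long long before, after;
    for (i = 0; i < 10000; ++i) {
      before = rdtsc();
      s = socket(AF_INET, SOCK_STREAM, 0);
      after = rdtsc();
      if (s == -1) break;    /* probably ran into the descriptor limit */
      printf("%d %llu\n", i, after - before);
    }
    return 0;
  }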

On this benchmark you can see that all the operating systems scale quite well. While there are variations, there is no O(n) or worse implementation. I find the FreeBSD graph particularly interesting, because it indicates some sort of "cheating": it is suddenly so much faster than all the others (and than itself a few sockets earlier). It looks like the kernel starts pre-allocating file descriptors once a process has more than roughly 3500 descriptors open. I didn't look at the code, though, and after all there is nothing wrong with this kind of tweaking.

Conclusion: NetBSD outperforms all other operating systems in this benchmark. However, all contestants scale equally well; there are no clear losers. And the overall latency in this benchmark is so low that the results are interchangeable in practice.

The bind benchmark

The second benchmark is calling bind with port 0 on each of the sockets. This was actually done in the same benchmark program as the previous benchmark, and in the same run.

This benchmark is not important for scalable web servers, but it is important for proxy servers, FTP servers and load balancers. Oh, and for our http benchmark program, of course.

The task we are benchmarking here is the kernel selecting an unused TCP port. There are 65535 TCP ports. Traditionally, only ports above 1024 are selected here, but at least on Linux this range is configurable (/proc/sys/net/ipv4/ip_local_port_range, which is set to 32768 - 61000 by default). This task is actually easier than selecting the lowest unused file descriptor, but since it is not so important for web servers, some operating systems apparently decided not to optimize it.
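
As a minimal sketch (not the actual benchmark code), binding to port 0 and asking the kernel which port it picked looks like this:

  /* Sketch: let the kernel pick an unused local port by binding to
   * port 0, then read the chosen port back with getsockname(). */
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <arpa/inet.h>
  #include <string.h>

  int bind_ephemeral(int s) {
    struct sockaddr_in sa;
    socklen_t len = sizeof sa;
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_port = 0;                      /* 0 means: kernel chooses */
    sa.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(s, (struct sockaddr*)&sa, sizeof sa) == -1) return -1;
    if (getsockname(s, (struct sockaddr*)&sa, &len) == -1) return -1;
    return ntohs(sa.sin_port);            /* the port the kernel picked */
  }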

In this graph you cannot see FreeBSD and Linux 2.4, because they are overdrawn by the Linux 2.6 graph. All three graphs are in the same area, scaling equally well: O(1).

The dents in the graphs for OpenBSD and NetBSD are usually a sign of a data structure growing so big that it no longer fits into the L1 (or L2) cache. You can see that NetBSD and OpenBSD scale O(n) here, while Linux and FreeBSD scale O(1). Since OpenBSD has so many data points with dramatically higher latency, it is the clear loser in this benchmark.

The fork benchmark

This benchmark creates a pipe and then forks many child processes, each of which writes one byte into the pipe and then hangs around until it is killed. After creating each child, the benchmark waits until it can read that one byte from the pipe and measures the time from the fork until the byte has been read.

This way we measure how fast fork is implemented, and we measure the scheduler performance as well, because the scheduler's job becomes more difficult if there are more processes.
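A minimal sketch of such a benchmark (again, not the actual benchmark code from the gatling package) looks like this:

  /* Sketch of the fork benchmark: each child writes one byte into a
   * shared pipe and then idles; the parent times fork() until that
   * byte arrives, which also exercises the scheduler. */
  #include <sys/types.h>
  #include <sys/time.h>
  #include <unistd.h>
  #include <stdio.h>

  int main(void) {
    int fds[2];
    int i;
    char c;
    struct timeval start, stop;
    pid_t pid;
    if (pipe(fds) == -1) return 1;
    for (i = 0; i < 1000; ++i) {
      gettimeofday(&start, 0);
      pid = fork();
      if (pid == 0) {                /* child: signal the parent, then idle */
        write(fds[1], "x", 1);
        pause();
        _exit(0);
      }
      if (pid == -1) break;          /* ran into the process limit */
      read(fds[0], &c, 1);           /* wait for the child's byte */
      gettimeofday(&stop, 0);
      printf("%d %ld\n", i, (stop.tv_sec - start.tv_sec) * 1000000L
                          + (stop.tv_usec - start.tv_usec));
    }
    return 0;
  }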

The OpenBSD and FreeBSD graphs stop early because OpenBSD crashed when I forked more processes, and I couldn't find out how to increase FreeBSD's system limit on the number of processes (sysctl said the value was read-only).

Please note that the benchmark process was linked dynamically. The Linux 2.4 graph is very peculiar, because it looks like two graphs. All of the graphs in all of the benchmarks here are taken five times and I initially planned to show the average values with error margins. The Linux 2.4 data from this benchmark were the reason I didn't; now I just plot all the values for each of the five runs. You might guess that the first run of the benchmark on Linux 2.4 produced the O(n) graph and the others were O(1), maybe because some internal data structure was continuously enlarged during the first run, but that is wrong. Each of the five benchmark runs on Linux 2.4 alternated randomly between a point in the O(1) graph and one in the O(n) graph. I have no explanation for this.

The clear winner in the graph is Linux 2.6. OpenBSD does not scale at all, and even panics under high load. NetBSD scales O(n), which is respectable for the grandfather of all the BSDs, but it is not a winning performance. Linux 2.4 shows that there is work to be done; I give it the third place. FreeBSD looks like it would scale O(1) if I could create more processes with it, but as long as I can't confirm it, I can only give it the second place.

The static fork benchmark

fork has more work to do when the process is dynamically linked. So I reran the fork benchmark with a statically linked binary, but only on FreeBSD and Linux 2.6, the winners of the previous benchmark. To show you the difference, I plotted the new results next to the old results in this graph.

As you can see, linking statically almost halves the fork latency on both systems.

The mmap benchmark

It is important for databases and large web and proxy servers to map files into memory instead of having a buffer and reading the file contents into the buffer. If you map the file into memory directly, the operating system has more memory left for I/O buffering.

The Unix syscall for memory-mapping files is called mmap. The performance aspect we are benchmarking here is the efficiency of the data structures the kernel uses to manage the page tables. Memory is managed in units of "pages", with one page typically being 4k. Many architectures (including x86) can also use 4 MB pages for special occasions. On some SPARC CPUs the page size is 8k; on IA-64 it can be 4k to 16k.

The operating system needs to maintain two data structures for memory mapped files: one system wide "page table" for all the pages, because more than one process may do a shared mapping of the same page, plus one table for each process. The process specific table is what fork copies.

This benchmark takes a 200 MB file and mmaps every second 4k page of it into its address space. To make sure we measure the mmap data structures and not the hard disk, the benchmark starts by reading each of those pages once, so they are in the buffer cache. Then the benchmark measures the time it takes to mmap each page, and the time it takes to read the first byte of each page.

The point is that the operating system does not actually need to map a page into the address space when we mmap it. Only when we access the page does the process-local page table need to be updated. The MMU signals an exception to the operating system as soon as the process touches an mmapped page that the OS has not actually mapped yet.
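
Stripped of the timing and the warm-up read pass, the core of the benchmark looks roughly like this sketch (the file name and size here are assumptions, not the actual test setup):

  /* Sketch of the mmap benchmark: map every second 4k page of a big
   * file, then touch the first byte of each mapping. The touch is
   * what forces the kernel to actually install the page; the mmap
   * call and the first access would each be timed separately. */
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>

  int main(void) {
    int fd;
    struct stat st;
    long pagesize;
    off_t off;
    char* p;
    volatile char sink;
    fd = open("datafile", O_RDONLY);        /* assumed ~200 MB test file */
    if (fd == -1 || fstat(fd, &st) == -1) return 1;
    pagesize = sysconf(_SC_PAGESIZE);
    for (off = 0; off + pagesize <= st.st_size; off += 2 * pagesize) {
      p = mmap(0, pagesize, PROT_READ, MAP_SHARED, fd, off);
      if (p == MAP_FAILED) return 1;
      sink = p[0];                           /* fault the page in */
    }
    return 0;
  }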

As you can see, we have three clear losers in this benchmark: Linux 2.4, NetBSD and FreeBSD. The OpenBSD graph scales much better than these, but wait until you see the second part of this benchmark. The clear winner is Linux 2.6.

Here is a graph of Linux and FreeBSD latency for touching a page.

As you can see, Linux 2.4 appears to scale O(n), while Linux 2.6 is O(1). FreeBSD looks to be much faster than Linux 2.6, but you need to keep in mind that FreeBSD took an extraordinary time to do the actual mmap, so this good result does not save the day for FreeBSD.

So where are the others?

Here are the same results, but this time with NetBSD. As you can see, NetBSD is significantly slower than Linux and FreeBSD for this benchmark, but at least it does not get much slower when more pages are mmapped.

And now the final graph, with OpenBSD:

Whoa! Obviously, something is seriously broken in the OpenBSD memory management. OpenBSD is so incredibly slow that compared to this performance, NetBSD looks like Warp 9, and Linux is not even on the same chart.

Conclusion: Linux 2.6 is the clear winner, scaling O(1) in every respect. The clear loser is OpenBSD; I have never seen bad performance of this magnitude. Even Windows would probably outperform OpenBSD. NetBSD performance leaves a lot to be desired as well. This mmap graph is the only part of the whole benchmark suite where FreeBSD did not perform top notch. If the FreeBSD people fix this one dark spot, they will share the top space with Linux 2.6.

Fragmentation

I would like to show you one more graph, although it is not OS specific. The next graph shows the effect of file system fragmentation and of I/O scheduling.

The red graph shows one client downloading one big CD image over a dedicated Fast Ethernet connection. As expected, the Ethernet is completely saturated, and the throughput is a sustained 11 MB/sec. This was measured on Linux 2.6, but all the other operating systems (except OpenBSD) were also able to saturate the Fast Ethernet. OpenBSD had big performance drops in the process, adding to the previous embarrassment with mmap. I can really only warn against using OpenBSD for scalable network servers.

Anyway, the green graph shows a badly fragmented file (I downloaded another ISO image using an old version of BitTorrent). Although the IDE disk is slow, it is not that slow: it can read a sustained 25 MB/sec linearly. But on modern hard disks the throughput is only good as long as you don't have to seek around on the disk, which basically means that large files need to be unfragmented. For this fragmented file, the throughput drops to about 4 MB/sec (and that figure already includes the positive effects of the Linux 2.6 I/O scheduler; it's more like 1.5 MB/sec on Linux 2.4!).

Another way to get the disk head to seek around is to have two people download different large files at the same time. The blue graph shows this: at first it also gets 11 MB/sec throughput, but as soon as someone else downloads something, the head has to move around, killing the throughput for the poor guy. Please note that the second download was a rate limited 4 MB/sec download over the loopback interface, so what you see here was not Ethernet saturation, it was latency from the hard disk. By the way: the blue graph downloads the same file as the red graph.

The connect latency benchmark

We are ultimately interested in the performance of HTTP requests. There are two parts to that: the connect latency and the latency for answering the actual HTTP request. The connect latency is the time it takes for the server to notice the connection attempt and call accept.

This time is largely dominated by the event notification. Accepting a connection does not actually do much besides sending a TCP packet and allocating a file descriptor. The socket benchmark already showed that allocating a file descriptor is O(1) for each OS in the test. So it is reasonable to expect this benchmark to show that the operating systems with special event notification APIs scale O(1) (Linux 2.4: SIGIO, Linux 2.6: epoll, FreeBSD+OpenBSD: kqueue) and the rest to scale O(n) (NetBSD). My benchmark http server is called gatling; it makes use of SIGIO, epoll and kqueue if available, and falls back to poll if not.
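
gatling goes through the libowfat IO abstraction rather than calling these APIs directly, but purely as an illustration, here is roughly what the epoll path of an accept loop looks like on Linux 2.6 (names and structure here are mine, not gatling's):

  /* Illustration only: an epoll-based accept loop on Linux 2.6.
   * epoll_wait returns only the descriptors that are ready, so the
   * cost per event does not grow with the number of connections. */
  #include <sys/epoll.h>
  #include <sys/socket.h>

  void accept_loop(int listenfd) {
    struct epoll_event ev, events[64];
    int efd, i, n, conn;
    efd = epoll_create(64);          /* size is only a hint */
    ev.events = EPOLLIN;
    ev.data.fd = listenfd;
    epoll_ctl(efd, EPOLL_CTL_ADD, listenfd, &ev);
    for (;;) {
      n = epoll_wait(efd, events, 64, -1);
      for (i = 0; i < n; ++i) {
        if (events[i].data.fd == listenfd) {
          conn = accept(listenfd, 0, 0);
          if (conn == -1) continue;
          ev.events = EPOLLIN;
          ev.data.fd = conn;
          epoll_ctl(efd, EPOLL_CTL_ADD, conn, &ev);
        } else {
          /* handle request on events[i].data.fd */
        }
      }
    }
  }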

I omitted the graphs for Linux and FreeBSD because they were O(1), as expected. As you can see, it was OpenBSD that produced the O(n) graph, and NetBSD that produced the O(1) graph here. I am as surprised as you. Believe me, I double- and triple-checked that gatling used kqueue on OpenBSD and that I hadn't switched the results or graphs somehow.

The clear loser is, again, OpenBSD. Don't use OpenBSD for network servers. NetBSD appears to have found some clever hack to short-circuit poll if there are only events for one of the first descriptors in the array.

Measuring HTTP request latency

This final benchmark measures how long it takes for the http server to answer a request. This does not include the connect latency, which I showed you in the previous graph.

This graph shows that Linux 2.4 and Linux 2.6 both perform O(1) here. FreeBSD is a little slower for the first 4000 connections and becomes faster after that; I am at a loss to explain that. The FreeBSD guys appear to have found some really clever shortcuts. The Linux 2.4 graph is overdrawn by the Linux 2.6 graph here. The OpenBSD data points are all over the place in this graph; again, I would advise against using OpenBSD for network servers. NetBSD is missing from this graph, but here is a graph with NetBSD:

The clear loser of this benchmark is NetBSD, because they simply don't offer a better API than poll. As I wrote in the introduction, I only benchmarked the stable NetBSD 1.6.1 kernel here, and I assume they have included kqueue in their -CURRENT kernel. I will try to update my NetBSD installation and rerun the benchmarks on it.

Conclusion

Linux 2.6 scales O(1) in all benchmarks. Words fail me on how impressive this is. If you are using Linux 2.4 right now, switch to Linux 2.6 now!

FreeBSD 5.1 has very impressive performance and scalability. I foolishly assumed all BSDs to play in the same league performance-wise, because they all share a lot of code and can incorporate each other's code freely. I was wrong. FreeBSD has by far the best performance of the BSDs and it comes close to Linux 2.6. If you run another BSD on x86, you should switch to FreeBSD!

Linux 2.4 is not too bad, but it scales badly for mmap and fork.

NetBSD 1.6.1 was treated unfairly by me because I only tested the stable version, not the unstable source tree. I originally wanted to benchmark only stable versions, but deviated with OpenBSD and then with FreeBSD; I should have upgraded NetBSD, too. Nonetheless, NetBSD feels snappy and performs well overall, although judging from the old version I was using it needs work in the scalability department. Please note that NetBSD was the only BSD that never crashed or panicked on me, so it gets favourable treatment for that.

OpenBSD 3.4 was a real stinker in these tests. The installation routine sucks, the disk performance sucks, the kernel was unstable, and in the network scalability department it was even outperformed by its parent, NetBSD. OpenBSD also gets points deducted for the sabotage they did to their IPv6 stack. If you are using OpenBSD, you should move away now.

The Code

I used my experimental web server gatling to measure these numbers. All the benchmark programs are also part of the gatling package.

You can download gatling via anonymous cvs from:

  % cvs -d :pserver:cvs@cvs.fefe.de:/cvs -z9 co libowfat
  % cvs -d :pserver:cvs@cvs.fefe.de:/cvs -z9 co gatling

libowfat contains my implementation of the IO API; gatling is the web server itself. You need to build libowfat first. If you are using Linux, also check out

  % cvs -d :pserver:cvs@cvs.fefe.de:/cvs -z9 co dietlibc