
<title>Benchmarking BSD and Linux</title>

<h1>Introduction</h1>

<p>
These benchmarks are the result of my scalable network programming
research.  My interest in this area is to see how scalable and fast
network applications can be on standard PC hardware.
<p>
I have done most of my research on Linux 2.4, 2.5 and 2.6 kernels using
a home-grown distribution I affectionately call "Leanux".  I have
experimented with several APIs and methods to try and get the most
scalability and performance out of a web server.  The ultimate goal,
however, is to demonstrate scalability by surviving a <a
href=http://ssadler.phy.bnl.gov/adler/SDE/SlashDotEffect.html>Slashdotting</a>.
<p>
Please note that most of the sites succumb not only because of bad
software but also because of their internet connectivity being
saturated.  Besides choosing an ISP with small bandwidth costs, there is
only so much I can do about this.  I hope this ISP can handle the load.
<p>
During my research I tried several API designs for abstracting the
different scalable event notification mechanisms.  In
the end, I settled on <a href=http://cr.yp.to/lib/io.html>Dan
Bernstein's IO library interface</a>, which I slightly extended to
include an abstraction around <tt>writev</tt> and <tt>sendfile</tt>.
When my implementation worked on Linux, I decided to port the API to BSD
as well, to finally get some real benchmark data into the eternal
flamewar on which IP stack scales best.
<p>
To that end, I installed FreeBSD, OpenBSD and NetBSD on my notebook, so
all benchmarks would run on the same hardware.
<p>

<h1>Benchmarking BSD and Linux</h1>

<h2>About the hardware</h2>

<p>
The benchmark hardware is a Dell Inspiron 8000 with a 900 MHz Pentium 3
and 256 MB RAM.  The network chip is a MiniPCI Intel eepro100 card,
which is supported and well tuned on all operating systems.
<p>
Since my intention is to benchmark the software and not the hardware, I
didn't care that it was only a single, slow CPU with slow memory and a
slow IDE hard disk.  Real server machines built for high scalability
will probably use more powerful hardware than this.
<p>

<h2>Common settings</h2>

<p>
On all of the operating systems, I took the default settings.  If I had
to turn a knob to get PCI IDE DMA enabled, I did so.  If the OS had
power management support, I did not disable that.  I enabled (and used)
IPv6 support on all operating systems.
<p>

<h2>Linux 2.4</h2>

<p>
I benchmarked a stock Linux 2.4.22 kernel.
<p>

<h2>Linux 2.6</h2>

<p>
I benchmarked a stock Linux 2.6.0-test7 kernel.
<p>

<h2>OpenBSD 3.4</h2>

<p>
I benchmarked an OpenBSD 3.4-CURRENT.  I directly installed -CURRENT to
get patched up openssl and openssh versions.  The first OpenBSD
installation was wasted because the stupid boot loader (after installing
everything) did not want to boot from the partition, apparently because
it is beyond cylinder 1024.  In my opinion, this is absolutely
inexcusable for a modern operating system in the year 2003, and the
OpenBSD people should be ashamed of themselves.  Linux and the other
BSDs did not have a problem booting from high places.
<p>
I ended up reinstalling OpenBSD in the swap partition from my Linux
installation.  In the end I had so little space that I had to copy the
logs over to my desktop box after each benchmark, otherwise the
filesystem would be full.
<p>
OpenBSD also caused a lot of grief on the IPv6 front.  The OpenBSD guys
intentionally broke their IPv6 stack to not allow IPv4 connections to
and from IPv6 sockets using the IPv4 mapped addresses that the IPv6
standard defines for this purpose.  I find this behaviour of pissing
on internet standards despicable and unworthy of free operating systems.
<p>
OpenBSD had lots of performance and scalability issues.  Particularly
embarrassing is the fact that even NetBSD outperforms it on a few
benchmarks.  OpenBSD is a fork of NetBSD, so I expected it to be no
worse than its ancestor in all key areas.  I was wrong.
<p>

<h2>FreeBSD 5.1</h2>

<p>
I installed a FreeBSD 5.1-RELEASE on a free partition.  However, the
kernel turned out to be unstable and would panic or even freeze under
load.  So I reluctantly upgraded the kernel to 5.1-CURRENT, which fixed
the problems and proved to be a very stable kernel.
<p>
I also upgraded openssl and openssh on this installation.  And I amended
my library to use the IPV6_V6ONLY sockopt to get proper IPv6 behaviour
from FreeBSD.
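<p>
The socket option mentioned above can be sketched as follows.  This is
only an illustration and not the code from my library: the helper name
<tt>allow_ipv4_mapped</tt> is made up, and it assumes that "proper"
behaviour means an IPv6 socket that also accepts IPv4 connections
through IPv4 mapped addresses.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: clear IPV6_V6ONLY so that an AF_INET6 socket also
 * accepts IPv4 connections, which then show up as IPv4 mapped addresses
 * (::ffff:a.b.c.d).  Returns 0 on success, -1 on error, like setsockopt. */
static int allow_ipv4_mapped(int fd) {
  int off = 0;
  return setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &off, sizeof off);
}
```

The call has to happen before <tt>bind</tt>.  On a system that refuses
to clear the option, the call fails and the caller has to fall back to a
separate IPv4 socket.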
<p>
Apart from the stability issues in 5.1-RELEASE, FreeBSD turned out to be
very stable and the fastest and most scalable BSD.
<p>

<h2>NetBSD 1.6.1</h2>

<p>
I installed a NetBSD 1.6.1-RELEASE on a free partition.  The kernel was
very stable and had a much snappier feeling to it than OpenBSD, which
was the first BSD I tried out, and I somehow expected NetBSD to be
slower than OpenBSD in every respect.  The opposite was true: in
particular disk and file system performance appears to be much better in
NetBSD.
<p>
Unfortunately, the IPV6_V6ONLY sockopt did not work on NetBSD.  One has
to use a sysctl to get the proper IPv6 behaviour.  That is unfortunate
but bearable.
<p>
Since NetBSD did not start any network services by default, I did not
feel obliged to upgrade openssl and openssh.  Since the kernel was so
stable, I did not feel I had to try the -CURRENT version.  In the
meantime I heard that the -CURRENT version of NetBSD has over two years
worth of improvements in it, so it probably is even better than NetBSD
1.6.1, which already surprised me with its good performance (although it
is clearly outperformed by FreeBSD).
<p>

<h2>The socket benchmark</h2>

<p>
My first benchmark was calling <a
href=http://www.opengroup.org/onlinepubs/007904975/functions/socket.html><tt>socket</tt></a>
ten thousand times.  I normally use <a
href=http://www.opengroup.org/onlinepubs/007904975/functions/gettimeofday.html><tt>gettimeofday</tt></a>
for taking the time in benchmarks, but in this case the results were so
close and small that I switched to reading the CPU cycle counter, which
has 900 times finer resolution on my notebook, but makes the results not
so easily comparable.
<p>
The task we are benchmarking here is the kernel allocating a socket data
structure in kernel (which is easy to do) and selecting the lowest
unused file descriptor (which is not so easy).
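<p>
A minimal sketch of such a measurement loop is shown here.  It is not
the actual benchmark program: the names <tt>now_ns</tt> and
<tt>time_socket_calls</tt> are made up for illustration, and it uses
<tt>clock_gettime</tt> as a portable stand-in for the CPU cycle counter.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <sys/socket.h>
#include <time.h>

/* Nanoseconds from a monotonic clock; a portable stand-in for reading
 * the CPU cycle counter. */
static uint64_t now_ns(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Open up to n TCP sockets without closing any of them.  The latency of
 * call number i goes into ns[i], the descriptor into fds[i].  Returns how
 * many sockets were opened; that is less than n if the process runs into
 * its descriptor limit. */
static int time_socket_calls(int n, int *fds, uint64_t *ns) {
  int i;
  for (i = 0; i < n; ++i) {
    uint64_t start = now_ns();
    fds[i] = socket(AF_INET, SOCK_STREAM, 0);
    ns[i] = now_ns() - start;
    if (fds[i] < 0) break;
  }
  return i;
}
```

To get to ten thousand sockets, the per-process descriptor limit has to
be raised first (for example with <tt>ulimit -n</tt>).  Plotting
<tt>ns[i]</tt> over <tt>i</tt> gives a graph of the kind shown below.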
<p>
<img width=888 height=573 src=socket.png>
<p>
On this benchmark you can see that all the operating systems scale quite
well.  While there are variations, there is no O(n) or even worse
implementation.  I find the FreeBSD graph particularly interesting,
because it indicates some sort of "cheating" by suddenly being so much
faster than all the others (and itself a few sockets earlier).  This
looks like the kernel starts pre-allocating file descriptors when a
process has more than 3500 descriptors open or so.  I didn't look in the
code, though, and after all there is nothing wrong with this kind of
tweaking.
<p>
Conclusion: NetBSD outperforms all other operating systems in this
benchmark.  However, all contestants scale equally well; there are no
clear losers.  And the overall latency in this benchmark is so low that
the results are interchangeable in practice.
<p>

<h2>The bind benchmark</h2>

<p>
The second benchmark is calling <a
href=http://www.opengroup.org/onlinepubs/007904975/functions/bind.html><tt>bind</tt></a>
with port 0 on each of the sockets.  This was actually done in the same
benchmark program as the previous benchmark, and in the same run.
<p>
This benchmark is not important for scalable web servers, but it is
important for proxy servers, FTP servers and load balancers.  Oh, and
for our http benchmark program, of course.
<p>
The task we are benchmarking here is the kernel selecting an unused TCP
port.  There are 65535 TCP ports.  Traditionally, ports greater than 1024
are selected here, but at least on Linux this range is configurable
(<tt>/proc/sys/net/ipv4/ip_local_port_range</tt>, which is set to 32768
- 61000 by default).  This task is actually easier than selecting the
first unused file descriptor, but since it is not so important for web
servers, some operating systems decided not to optimize it.
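<p>
The operation being timed can be sketched like this.  The helper name
<tt>bind_any_port</tt> is made up, binding to <tt>INADDR_ANY</tt> is an
assumption, and the <tt>getsockname</tt> call is only there to show
which port the kernel picked; it is not part of the timed work.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: bind fd to port 0 so that the kernel selects an
 * unused TCP port, then ask which port it chose.  Returns the port in
 * host byte order, or -1 on error. */
static int bind_any_port(int fd) {
  struct sockaddr_in sa;
  socklen_t len = sizeof sa;
  memset(&sa, 0, sizeof sa);
  sa.sin_family = AF_INET;
  sa.sin_addr.s_addr = htonl(INADDR_ANY);
  sa.sin_port = htons(0);          /* 0 means: kernel, pick one for me */
  if (bind(fd, (struct sockaddr *)&sa, sizeof sa) < 0) return -1;
  if (getsockname(fd, (struct sockaddr *)&sa, &len) < 0) return -1;
  return ntohs(sa.sin_port);
}
```

On Linux with default settings, the returned port should fall inside
the <tt>ip_local_port_range</tt> mentioned above.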
<p>
<img width=640 height=480 src=bind4.png>
<p>
In this graph you cannot see FreeBSD and Linux 2.4, because they are
overdrawn by the Linux 2.6 graph.  All three graphs are in the same
area, scaling equally well: O(1).
<p>
The dents in the graphs for OpenBSD and NetBSD are normally a sign of a
data structure becoming so big that it does not fit into the L1 cache
any more (or L2 cache).  You can see that NetBSD and OpenBSD scale O(n)
here, while Linux and FreeBSD scale O(1).  Since OpenBSD has so many
data points with dramatically higher latency, it is the clear loser in
this benchmark.
<p>

<h2>The fork benchmark</h2>

<p>
This benchmark creates a pipe and then forks many child processes, which
each write one byte into the pipe and then hang around until they are
killed.  After creating each child, the benchmark waits until it can
read the one byte from the new child process from the pipe and takes the
time from the fork until having read the byte.
<p>
This way we measure how fast <a
href=http://www.opengroup.org/onlinepubs/007904975/functions/fork.html><tt>fork</tt></a>
is implemented, and we measure the scheduler performance as well,
because the scheduler's job becomes more difficult if there are more
processes.
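<p>
One step of this measurement could look roughly like the following
sketch.  The names <tt>fork_latency</tt> and <tt>stop_child</tt> are
made up for illustration, and <tt>clock_gettime</tt> is an assumed
choice of timer.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Nanoseconds from a monotonic clock. */
static uint64_t now_ns(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Fork one child that writes a byte into pipefd[1] and then sleeps until
 * it is killed.  Stores the time from fork() until the parent has read
 * that byte in *ns and the child's pid in *child.
 * Returns 0 on success, -1 on error. */
static int fork_latency(int pipefd[2], pid_t *child, uint64_t *ns) {
  char byte = 'x';
  uint64_t start = now_ns();
  *child = fork();
  if (*child < 0) return -1;
  if (*child == 0) {
    if (write(pipefd[1], &byte, 1) != 1) _exit(1);
    for (;;) pause();              /* hang around until killed */
  }
  if (read(pipefd[0], &byte, 1) != 1) return -1;
  *ns = now_ns() - start;
  return 0;
}

/* Kill a child created by fork_latency and collect its exit status. */
static void stop_child(pid_t child) {
  kill(child, SIGKILL);
  waitpid(child, NULL, 0);
}
```

The benchmark calls the first function in a loop on the same pipe and
only stops the children at the very end, so each new fork happens with
one more sleeping process in the system.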
<p>
<img width=640 height=480 src=fork.png>
<p>
The OpenBSD and FreeBSD graphs stop early because OpenBSD crashed when I
forked more processes, and I couldn't find out how to increase FreeBSD's
system limit on the number of processes (sysctl said the value was
read-only).
<p>
Please note that the benchmark process was linked dynamically.  The
Linux 2.4 graph is very peculiar, because it looks like two graphs.  All
of the benchmarks here were run five times, and I
initially planned to show the average values with error margins.  The
Linux 2.4 data from this benchmark were the reason I didn't; now I just
plot all the values for each of the five runs.  You might guess that the
first run of the benchmark on Linux 2.4 produced the O(n) graph and the
others were O(1), maybe because some internal data structure was
continuously enlarged during the first run, but that is wrong.  Each of
the five benchmark runs on Linux 2.4 alternated randomly between a point
in the O(1) graph and one in the O(n) graph.  I have no explanation for
this.
<p>
The clear winner in the graph is Linux 2.6.  OpenBSD does not scale at
all, and even panics under high load.  NetBSD scales O(n), which is
respectable for the grandfather of all the BSDs, but it is not a winning
performance.  Linux 2.4 shows that there is work to be done; I give it
the third place.  FreeBSD looks like it would scale O(1) if I could
create more processes with it, but as long as I can't confirm it, I can
only give it the second place.
<p>

<h2>The static fork benchmark</h2>

<p>
fork has more work to do when the processes are dynamically linked.  So
I reran the fork benchmark with a statically linked binary, but only on
FreeBSD and Linux 2.6, the winners of the previous benchmark.  To show
you the difference, I plotted the new results beside the old results
in this graph.
<p>
<img width=640 height=480 src=fork-vs-forks.png>
<p>
As you can see, linking statically almost halves the fork latency on
both systems.
<p>

<h2>The mmap benchmark</h2>

<p>
It is important for databases and large web and proxy servers to map
files into memory instead of having a buffer and reading the file
contents into the buffer.  If you map the file into memory directly, the
operating system has more memory left for I/O buffering.
<p>
The Unix syscall for memory mapping files is called <a
href=http://www.opengroup.org/onlinepubs/007904975/functions/mmap.html><tt>mmap</tt></a>.
The performance aspect we are benchmarking here is the efficiency of the
data structures the kernel uses to manage the page tables.  Memory is
managed in units of "pages", with one page typically being 4k.  Many
architectures (including x86) can do 4 MB pages as well for special
occasions.  On some SPARC CPUs the page size is 8k, on IA-64 it can be
4k to 16k.
<p>
The operating system needs to maintain two data structures for memory
mapped files: one system wide "page table" for all the pages, because
more than one process may do a shared mapping of the same page, plus one
table for each process.  The process specific table is what fork copies.
<p>
This benchmark takes a 200 MB file and mmaps every other 4k page of it
into its address space.  To make sure we measure the mmap data structure
and not the hard disk, the benchmark starts by reading each of those
pages once, so they are in the buffer cache.  Then this benchmark takes
the time it takes to mmap each page, and the time it takes to read the
first byte of each page.
<p>
The point is that the operating system does not actually need to map a
page into the address space when we mmap it.  The process-local page
table only needs to be updated when we first access that page.  The MMU will
signal an exception to the operating system as soon as the process
touches the mmapped page that the OS did not actually map yet.
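<p>
The two timings can be sketched as below.  This is a simplification and
not the benchmark program itself: the function names are made up, the
mmap and the first access are timed back to back for each page, and the
helper creates a small sparse temporary file instead of the 200 MB file,
so there is no warm-up read.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Nanoseconds from a monotonic clock. */
static uint64_t now_ns(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Hypothetical helper: create an unlinked, sparse temporary file that is
 * the given number of pages long.  Returns a descriptor or -1. */
static int make_sparse_file(size_t pages) {
  char path[] = "/tmp/mmapbenchXXXXXX";
  int fd = mkstemp(path);
  if (fd < 0) return -1;
  unlink(path);
  if (ftruncate(fd, (off_t)pages * sysconf(_SC_PAGESIZE)) < 0) {
    close(fd);
    return -1;
  }
  return fd;
}

/* Map pages 0, 2, 4, ... of fd, one mmap() call per page, and keep all
 * mappings.  map_ns[k] gets the time of the k-th mmap() call, touch_ns[k]
 * the time to read the first byte of that page.  Returns the number of
 * pages mapped; stops early if mmap() fails. */
static size_t map_every_other_page(int fd, size_t file_pages,
                                   uint64_t *map_ns, uint64_t *touch_ns) {
  size_t page = (size_t)sysconf(_SC_PAGESIZE);
  size_t n = 0;
  for (size_t i = 0; i < file_pages; i += 2, ++n) {
    uint64_t t0 = now_ns();
    void *m = mmap(NULL, page, PROT_READ, MAP_SHARED, fd,
                   (off_t)(i * page));
    uint64_t t1 = now_ns();
    if (m == MAP_FAILED) break;
    /* The first access is what makes the OS fill in the page table. */
    unsigned char first = *(volatile unsigned char *)m;
    uint64_t t2 = now_ns();
    (void)first;
    map_ns[n] = t1 - t0;
    touch_ns[n] = t2 - t1;
  }
  return n;
}
```

The first array corresponds to the mmap latency graph and the second
one to the graphs about touching a page.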
<p>
<img width=640 height=480 src=mmap.png>
<p>
As you can see, we have three clear losers in this benchmark: Linux 2.4,
NetBSD and FreeBSD.  The OpenBSD graph scales much better than these,
but wait until you see the second part of this benchmark.  The clear
winner is Linux 2.6.
<p>
Here is a graph of Linux and FreeBSD latency for touching a page.
<p>
<img width=640 height=480 src=mmap1.png>
<p>
As you can see, Linux 2.4 appears to scale O(n), while Linux 2.6 is O(1).
FreeBSD looks to be much faster than Linux 2.6, but you need to keep in
mind that FreeBSD took an extraordinary time to do the actual mmap, so
this good result does not save the day for FreeBSD.
<p>
So where are the others?
<p>
<img width=640 height=480 src=mmap2.png>
<p>
Here are the same results, but this time with NetBSD.  As you can see,
NetBSD is significantly slower than Linux and FreeBSD for this
benchmark, but at least it does not get much slower when more pages are
mmapped.
<p>
And now the final graph, with OpenBSD:
<p>
<img width=640 height=480 src=mmap3.png>
<p>
Whoa!  Obviously, something is seriously broken in the OpenBSD memory
management.  OpenBSD is so incredibly slow that compared to this
performance, NetBSD looks like Warp 9, and Linux is not even on the same
chart.
<p>
Conclusion: Linux 2.6 is the clear winner, scaling O(1) in every
respect.  The clear loser is OpenBSD; I have never seen bad performance
of this magnitude.  Even Windows would probably outperform OpenBSD.
NetBSD performance leaves a lot to be desired as well.  This mmap
graph is the only part of the whole benchmark suite where FreeBSD did
not perform top notch.  If the FreeBSD people fix this one dark spot,
they will share the top space with Linux 2.6.
<p>

<h2>Fragmentation</h2>

<p>
I would like to show you one more graph, although it is not OS specific.
The next graph shows the effect of file system fragmentation and of
I/O scheduling.
<p>
<img width=640 height=480 src=fragments.png>
<p>
The red graph shows one client downloading one big CD image over a
dedicated Fast Ethernet connection.  As expected, the Ethernet is
completely saturated, and the throughput is 11 MB/sec sustained.
This was measured on Linux 2.6, but all the other operating systems
(except OpenBSD) were also able to saturate the Fast Ethernet.  OpenBSD
had big performance drops in the process, adding to the previous
embarrassment with mmap.  I can really only warn against using OpenBSD for
scalable network servers.
<p>
Anyway, the green graph shows a badly fragmented file (I downloaded
another ISO image using an old version of BitTorrent).  Although the IDE
disk is slow, it is not that slow.  It can read a sustained 25 MB/sec
linearly.  But on modern hard disks the throughput is only good as long
as you don't have to seek around on disk, which basically means that
large files need to be non-fragmented.  For this fragmented file, the
throughput drops to about 4 MB/sec (and that already includes the
positive effects of the Linux 2.6 I/O scheduler; it is more like
1.5 MB/sec on Linux 2.4!).
<p>
Another way to get the disk head to seek around is to have two people
download different large files at the same time.  The blue graph shows
this: at first it also gets 11 MB/sec throughput, but as soon as someone
else downloads something, the head has to move around, killing the
throughput for the poor guy.  Please note that the second download was a
rate limited 4 MB/sec download over the loopback interface, so what you
see here was not Ethernet saturation, it was latency from the hard disk.
By the way: the blue graph downloads the same file as the red graph.
<p>

<h2>The connect latency benchmark</h2>

<p>
We are ultimately interested in the performance of HTTP requests.  There
are two parts to that: the <a
href=http://www.opengroup.org/onlinepubs/007904975/functions/connect.html><tt>connect</tt></a>
latency and the latency for answering the actual HTTP request.  The
connect latency is the time it takes for the server to notice the
connection attempt and call <a
href=http://www.opengroup.org/onlinepubs/007904975/functions/accept.html><tt>accept</tt></a>.
<p>
This time is largely dominated by the event notification.  Accepting a
connection does not actually do anything besides sending a TCP packet
and allocating a file descriptor.  The socket benchmark already showed
that allocating a file descriptor is O(1) for each OS in the test.  So
it is reasonable to expect this benchmark to show that the operating
systems with special event notification APIs scale O(1) (Linux 2.4:
SIGIO, Linux 2.6: epoll, FreeBSD+OpenBSD: kqueue) and the rest to scale
O(n) (NetBSD).  My benchmark http server is called gatling and it makes
use of SIGIO, epoll and kqueue if available, but falls back to <a
href=http://www.opengroup.org/onlinepubs/007904975/functions/poll.html><tt>poll</tt></a>
if not.
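<p>
To illustrate why the <tt>poll</tt> fallback is expected to scale O(n),
here is a rough sketch of accepting a connection with it.  This is not
gatling's code: the function names are made up, and putting the
listening socket into slot 0 of the array is an assumption.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP listener on 127.0.0.1 with a
 * kernel-chosen port.  Stores the bound address in *sa.
 * Returns the descriptor or -1 on error. */
static int listen_loopback(struct sockaddr_in *sa) {
  socklen_t len = sizeof *sa;
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;
  memset(sa, 0, sizeof *sa);
  sa->sin_family = AF_INET;
  sa->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  if (bind(fd, (struct sockaddr *)sa, sizeof *sa) < 0 ||
      listen(fd, 16) < 0 ||
      getsockname(fd, (struct sockaddr *)sa, &len) < 0) {
    close(fd);
    return -1;
  }
  return fd;
}

/* fds[0] must be the listening socket; fds[1..n-1] are the connections
 * that are already open.  The whole array of n entries is handed to
 * poll() on every call, which is where the expected O(n) cost comes from.
 * Returns the accepted descriptor, or -1 on timeout or error. */
static int accept_with_poll(struct pollfd *fds, nfds_t n, int timeout_ms) {
  if (poll(fds, n, timeout_ms) <= 0) return -1;
  if (!(fds[0].revents & POLLIN)) return -1;
  return accept(fds[0].fd, NULL, NULL);
}
```

With epoll or kqueue the kernel keeps the set of descriptors between
calls and only reports the ones that have events, so the cost of one
wakeup does not depend on how many connections are open.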
<p>
<img width=640 height=480 src=openbsd-vs-netbsd.png>
<p>
I omitted the graphs for Linux and FreeBSD because they were O(1), as
expected.  As you can see, it was OpenBSD that showed the O(n) graph,
and NetBSD that has the O(1) graph here.  I am as surprised as you.
Believe me, I double and triple checked that gatling used kqueue on
OpenBSD and that I hadn't switched the results or graphs somehow.
<p>
The clear loser is, again, OpenBSD.  Don't use OpenBSD for network
servers.  NetBSD appears to have found some clever hack to short-circuit
poll if there are only events for one of the first descriptors in the
array.
<p>

<h2>Measuring HTTP request latency</h2>

<p>
This final benchmark measures how long it takes for the http server to
answer a request.  This does not include the connect latency, which I
showed you in the previous graph.
<p>
<img width=640 height=480 src=httpreq-2.png>
<p>
This graph shows that Linux 2.4 and Linux 2.6 perform equally O(1) here.
FreeBSD is a little slower for the first 4000 connections and becomes
faster after that.  I am at a loss how to explain that.  The FreeBSD
guys appear to have found some really clever shortcuts.  The Linux 2.4
graph is overdrawn by the Linux 2.6 graph here.  OpenBSD data points are
all over the place in this graph; again, I would advise against using
OpenBSD for network servers.  NetBSD is missing on this graph, but here
is a graph with NetBSD:
<p>
<img width=640 height=480 src=httpreq-3.png>
<p>
The clear loser of this benchmark is NetBSD, because they simply don't
offer a better API than poll.  As I wrote above, I only
benchmarked the stable NetBSD 1.6.1 kernel here, and I assume they have
included kqueue in their -CURRENT kernel.  I will try to update my
NetBSD installation and rerun the benchmarks on it.
<p>

<h2>Conclusion</h2>

<p>
Linux 2.6 scales O(1) in all benchmarks.  Words fail me on how
impressive this is.  If you are using Linux 2.4 right now, switch to
Linux 2.6 now!
<p>
FreeBSD 5.1 has very impressive performance and scalability.  I
foolishly assumed all BSDs to play in the same league performance-wise,
because they all share a lot of code and can incorporate each other's
code freely.  I was wrong.  FreeBSD has by far the best performance of
the BSDs and it comes close to Linux 2.6.  If you run another BSD on
x86, you should switch to FreeBSD!
<p>
Linux 2.4 is not too bad, but it scales badly for mmap and fork.
<p>
NetBSD 1.6.1 was treated unfairly by me because I only tested the stable
version, not the unstable source tree.  I originally only wanted to
benchmark stable versions, but deviated with OpenBSD and then with
FreeBSD.  I should have upgraded NetBSD then, too.  Nonetheless, NetBSD
feels snappy, performs well overall, although it needs work in the
scalability department, judging from the old version I was using.
Please note that NetBSD was the only BSD that never crashed or panicked
on me, so it gets favourable treatment for that.
<p>
OpenBSD 3.4 was a real stinker in these tests.  The installation routine
sucks, the disk performance sucks, the kernel was unstable, and in the
network scalability department it was even outperformed by its father,
NetBSD.  OpenBSD also gets points deducted for the sabotage they did to
their IPv6 stack.  If you are using OpenBSD, you should move away now.
<p>

<h2>The Code</h2>

<p>
I used my experimental web server <i>gatling</i> to measure these
numbers.  All the benchmark programs are also part of the gatling
package.
<p>
You can download gatling via anonymous cvs from:
<p>
<pre>
  % cvs -d :pserver:cvs@cvs.fefe.de:/cvs -z9 co libowfat
  % cvs -d :pserver:cvs@cvs.fefe.de:/cvs -z9 co gatling
</pre>
<p>
libowfat contains my implementation of the IO API, gatling is the
webserver.  You need to build libowfat first.  If you are using Linux,
also check out
<pre>
  % cvs -d :pserver:cvs@cvs.fefe.de:/cvs -z9 co dietlibc
</pre>
<p>