User-level Threading: Have Your Cake and Eat It Too - Supplementary Material

Martin Karsten and Saman Barghi

This web page provides software, experiment data, and supplementary information.

Repositories and Downloads

Commit SHAs

Experiments

The experiment scripts are set up for our local environment and need to be adjusted before they can be used elsewhere. The AMD 64-core machine is named 'kosa64'. The Intel 32-core machine is named 'kosi32'. The memcached latency experiments are run on a cluster of machines named 'redXX'. Please contact me (mkarsten) with any questions.

Figures 3-5

Data is produced using the script exp-threadtest.sh. The variable $bindir needs to be set appropriately. The values for $wbase and $gbase in config.sh are obtained for 1 ms of work by running
./threadtest -c -u 1000 -w 1000
and
./gthreadtest -c -u 1000 -w 1000
respectively, on the target machine. The Makefile in the apps subdirectory of libfibre contains rules to make the necessary variants of threadtest.
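
The two calibration runs above can be wrapped in a small helper, sketched here under assumptions: the default location ./apps for $bindir and the function name calibrate are hypothetical and not part of the experiment scripts. The sketch only runs the two commands with the flags given above; how the printed results translate into $wbase and $gbase in config.sh is not covered here.

```shell
#!/bin/sh
# Sketch only: run both calibration commands from the text above.
# ASSUMPTION: bindir defaults to ./apps; set it to wherever the
# threadtest variants were actually built.
bindir="${bindir:-./apps}"

# Hypothetical helper: $1 is the binary name (threadtest or gthreadtest).
# The flags are copied verbatim from the instructions above.
calibrate() {
    "$bindir/$1" -c -u 1000 -w 1000
}

for prog in threadtest gthreadtest; do
    if [ -x "$bindir/$prog" ]; then
        echo "== $prog =="
        calibrate "$prog"
    else
        # Report missing binaries instead of failing the whole run.
        echo "skipping $prog: not found in $bindir" >&2
    fi
done
```

Run it on the target machine itself, since the calibration values are machine-specific.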

Figures 6-9

Data is produced using the script exp-webserver.sh. The variable $bindir needs to be set appropriately. The Makefile in the apps subdirectory of libfibre contains rules to make the necessary variants of webserver. The ULib webserver is part of the ULib distribution.

Figures 10-13,16

Data is produced using the script exp-memcached.sh. The variables $bindir and $libdir need to be set appropriately with precompiled variants of memcached and the libfibre shared library placed in the appropriate subdirectories. Mutilate is enhanced using the above patch.

Figures 14-15,17-21

Data is produced using the script runred.sh. The script coordinates programs on multiple machines, so machine names, IP addresses, and path names need to be adapted. Core affinity is added to vanilla Memcached using the above patch. Core affinity is enabled for fibre Memcached using the preprocessor variable TESTING_AFFINITY in memcached.h.

Arachne Variation

A variation of Arachne has been pointed out by the Arachne authors. When disabling its load estimator, Arachne keeps all cores busy-spinning at all times. It can then achieve somewhat higher throughput and scalability in the threadtest experiments (Figures 3-4), but at the expense of worse cycle efficiency (Figure 5).

We regard the load estimator and core allocation of Arachne as a necessary kludge to avoid busy-spinning all given cores. This particular way of solving the idle-sleep problem has inherent disadvantages (lag from the load estimator), and the experiments expose this weakness. Therefore, and to avoid confusion, the paper only contains the results with the load estimator. However, we provide Figures 3-5 with an extra line representing 'ArachneSpin' here:

Results on Intel machine

The paper reports that all localhost experiments (Figures 3-13,16) have also been run on a 32-core Intel machine. We provide those results below: