Analysis: bond_risk

Application version	TBA
Application licence	BSD
Trad4 version	TBA
Document version	beta01
Author	schevans
Date	21-05-2009

Introduction
The Hosts
Concepts
Test 1: Non-optimised
Test 2: Optimised
Test 3: Duration vs. Context Switches
Test 4: Context Switches over Time
Conclusion
How to run the tests

Introduction

This is an analysis of the trad4 app bond_risk. This document is still in beta - if there is anything you'd like clarified or any test you'd like to see run, please let me know.

What we'll be measuring is the duration of the initial flush on start-up under various conditions. A flush is when every node gets recalculated, as happens on start-up. As such we'll be ignoring the time it takes to load the objects from the DB. It's important to understand that this initial flush is only the start-up time of a particular app, and the performance profile during day-to-day activities will be different. During day-to-day operation the system will be responding to events that only affect a subset of objects, with the possible exception of date-rolls.

For each test the binary is locally compiled, and the test is run with the host in single-user mode using a non-root user and the default nice value. Only one run of each test was recorded, with the exception of the ST Mean runs, which gave us some measure of the standard deviation across all tests.

All these tests were run against the 440k data set: 40k bonds with 200k each of repo and outright trades.

The tests are summarised below:

Test 1 is a comparison between the Athlon and Opteron measuring the duration of the 440k flush against various numbers of threads using non-optimised binaries.
Test 2 is a comparison between the Athlon and Opteron measuring the duration of the 440k flush against various numbers of threads using optimised binaries.
Test 3 shows the 440k flush duration alongside the number of context switches, against various numbers of threads.
Test 4 shows context switches per second as a 440k flush runs against various numbers of threads.

The Hosts

The hosts are summarised below.

Host	Athlon	Opteron
Model	64 X2 Dual Core 4800+	Quad-Core Opteron 2354
CPU GHz*	2.4	2.1
Cores	2	8
Dies	1	2
OS Arch	i686	x86_64
Kernel	2.6.22-15-generic	2.6.27-7-server
Compiler	gcc 4.1.3	gcc 4.3.2

* As advertised

Concepts

Single-Threaded Mean (ST Mean)

The ST Mean value of a particular test is the mean value of 10 runs in single-threaded mode (NUM_THREADS=0). This gives us a good benchmark for the single-threaded performance of a particular arch/compiler combination.

This also gives us some information on the variance and standard deviation of these runs. This is shown in the table below:

Test	Mean	Variance	Standard Deviation
Athlon	22.3148	0.2297	0.4792
AthlonO3	8.2094	1.3066	1.1431
Opteron	19.9114	0.2944	0.5426
OpteronO3	13.2403	0.0020	0.0450
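As a quick consistency check on the table above, the standard deviation column should be the square root of the variance column. The sketch below recomputes it from the published variances. The names st_stats and recomputed_sd are illustrative only and are not part of trad4.

```python
import math

# (variance, standard deviation) as published in the ST Mean table.
st_stats = {
    "Athlon": (0.2297, 0.4792),
    "AthlonO3": (1.3066, 1.1431),
    "Opteron": (0.2944, 0.5426),
    "OpteronO3": (0.0020, 0.0450),
}


def recomputed_sd(variance):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(variance)


for test, (variance, published_sd) in st_stats.items():
    # The table is rounded to four decimal places, so small differences are expected.
    print(f"{test}: recomputed {recomputed_sd(variance):.4f}, published {published_sd:.4f}")
```

The recomputed values agree with the published ones to within rounding of the four-decimal variances.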

Single-Threaded Mode vs. NUM_THREADS=1

Single-threaded mode means the master thread does all the work. NUM_THREADS=1 means there is one master thread and one worker thread. Single-threaded mode is faster than NUM_THREADS=1 due to the lack of overhead of handing the work off to the worker thread.
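To put a rough number on that overhead, the sketch below divides the NUM_THREADS=1 durations from Tests 1 and 2 by the corresponding ST Means. All figures are copied from the tables in this document. The names one_worker and handoff_ratio are illustrative only.

```python
# NUM_THREADS=1 durations (Tests 1 and 2) and ST Means, copied from this document.
one_worker = {"Athlon": 70.05, "Opteron": 54.82, "AthlonO3": 49.98, "OpteronO3": 46.79}
st_mean = {"Athlon": 22.3148, "Opteron": 19.9114, "AthlonO3": 8.2094, "OpteronO3": 13.2403}


def handoff_ratio(test):
    """How many times longer the NUM_THREADS=1 run takes than single-threaded mode."""
    return one_worker[test] / st_mean[test]


overhead = {test: handoff_ratio(test) for test in st_mean}

for test, ratio in overhead.items():
    print(f"{test}: NUM_THREADS=1 takes {ratio:.2f}x the ST Mean")
```

In every case the ratio is well above 1, which is consistent with the hand-off to a single worker thread being pure overhead.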

The Context-Switch Cascade

The context-switch cascade occurs when there are too many threads on a host: the kernel spends all its time switching between threads rather than getting any work done, so the host becomes unresponsive and unusable. It is the hard limit on the number of threads we can run per host. The limits for the two hosts are given below:

Host	Max Threads
Athlon	128
Opteron	512

For this reason no data is given for the Athlon where NUM_THREADS>128.

Test 1: Non-optimised

Num Threads	Athlon Time	Opteron Time
1	70.05	54.82
2	35.83	24.59
4	22.13	12.12
8	15.06	6.11
16	14.15	3.51
32	13.66	2.99
64	13.3	2.73
128	13.2	2.8
256		2.74
512		2.87

br440k

From this test we can see several things. First, by comparing the ST Means of the Athlon and Opteron we can see that the single-core CPU speed is roughly comparable, with the Opteron coming in slightly faster.

Second, we can see that the multi-threaded Athlon is about twice as fast as the ST Mean. Likewise the multi-threaded Opteron is about eight times faster than the ST Mean. This we would expect as the Athlon is dual-cored and the Opteron is 8-cored.

Last, we can see that the multi-threaded Opteron is about four times as fast as the multi-threaded Athlon. Again we would expect this as the Opteron has four times as many cores.

Test 2: Optimised

Num Threads	Athlon Time	Opteron Time
1	49.98	46.79
2	30.68	21.24
4	12.37	10.13
8	8.37	5.19
16	6.52	2.66
32	6.17	2.25
64	6.01	1.92
128	5.85	1.91
256		1.94
512		2.08

br440k_opt

The first thing we notice on this graph is that the Athlon binary has improved with optimisation (-O3) significantly more than the Opteron binary: the Athlon's ST Mean is now not only faster than the Opteron's, but the disparity between the two ST Means is also bigger than before.

For this reason the multi-threaded Opteron is only about three times faster than the Athlon, when we expected a four-fold increase.

This is a somewhat unexpected result and one I'll be looking into in the future. My hypothesis is that the gcc 4.3.2 optimiser on x86_64 isn't as effective as the gcc 4.1.3 optimiser on i686. The fact the ST Means show the same pattern suggests I'm not hitting some not-yet-understood limit of the trad4 architecture.
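The size of that disparity can be read straight off the ST Mean table. The sketch below computes how much each host's ST Mean improved with -O3, using only figures from this document. The function name optimisation_gain is illustrative only.

```python
# ST Means from the Concepts section; the "O3" suffix marks the optimised binaries.
st_mean = {"Athlon": 22.3148, "AthlonO3": 8.2094, "Opteron": 19.9114, "OpteronO3": 13.2403}


def optimisation_gain(host):
    """Ratio of the non-optimised ST Mean to the optimised ST Mean for one host."""
    return st_mean[host] / st_mean[host + "O3"]


gain = {host: optimisation_gain(host) for host in ("Athlon", "Opteron")}

for host, ratio in gain.items():
    print(f"{host}: -O3 makes the single-threaded flush {ratio:.2f}x faster")
```

The Athlon gains noticeably more from -O3 than the Opteron does, which is the pattern the hypothesis above tries to explain.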

Test 3: Duration vs. Context Switches

Num Threads	Opteron Time	Opteron CS
1	46.79	105.75
2	21.24	55.69
4	10.13	30.76
8	5.19	16.94
16	2.66	11.13
32	2.25	9.2
64	1.92	8.89
128	1.91	10.01
256	1.94	13.63
512	2.08	23.78

br440k_context

This graph shows the duration and number of context switches against #Threads, using an optimised Opteron.

The reason for the high number of context switches for a low number of threads is simply because the run has a longer duration and therefore a longer context-switch sample period. This is better illustrated in the graph below.

Another observation we can make is that even after the number of context switches starts to climb, the duration continues to fall (albeit slightly). This suggests any optimisation strategy should not be an attempt to minimise the context switches: it is only when the cascade starts that we will see performance drop off.
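Both observations can be checked against the Test 3 table. The sketch below normalises the context-switch count by the run duration, and finds the thread counts with the fewest context switches and the shortest duration. The data is copied from the table; the units of the CS column are whatever the table uses, and the variable names are illustrative only.

```python
# (threads, duration, context switches) for the optimised Opteron, from the Test 3 table.
test3 = [
    (1, 46.79, 105.75),
    (2, 21.24, 55.69),
    (4, 10.13, 30.76),
    (8, 5.19, 16.94),
    (16, 2.66, 11.13),
    (32, 2.25, 9.2),
    (64, 1.92, 8.89),
    (128, 1.91, 10.01),
    (256, 1.94, 13.63),
    (512, 2.08, 23.78),
]

# Context switches per unit of run time, which removes the effect of run length.
cs_rate = {threads: cs / duration for threads, duration, cs in test3}

# Thread counts at which the context-switch count and the duration are lowest.
fewest_cs = min(test3, key=lambda row: row[2])[0]
fastest = min(test3, key=lambda row: row[1])[0]

for threads, rate in cs_rate.items():
    print(f"{threads:>3} threads: {rate:.2f} CS per unit time")
print(f"fewest context switches at {fewest_cs} threads, shortest run at {fastest} threads")
```

Once normalised, the rate is lowest for the NUM_THREADS=1 run, and the shortest run occurs at a higher thread count than the run with the fewest context switches.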

Test 4: Context Switches over Time

br440k_context_time

This data was collected by a second process recording CS/s stats while an optimised Opteron run was taking place. As such the x-axis shows real time, and each peak corresponds to a bond_risk run, with the number of threads doubled on each run.

You can now see why the context-switch count for the NUM_THREADS=1 run in Test 3 was inflated by the time it took to complete the run: the CS/run data in Test 3 corresponds to the integral of each peak on this graph.
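As a minimal sketch of that relationship, the code below integrates a series of CS/s samples over time with the trapezoid rule to recover a per-run total. The sample values are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from these tests, and integrate_trapezoid is an illustrative helper rather than part of the benchmark scripts.

```python
def integrate_trapezoid(samples):
    """Integrate (time, rate) samples with the trapezoid rule."""
    total = 0.0
    for (t0, r0), (t1, r1) in zip(samples, samples[1:]):
        # Area of one trapezoid: mean of the two rates times the time step.
        total += 0.5 * (r0 + r1) * (t1 - t0)
    return total


# Hypothetical peak: (seconds since start, context switches per second).
hypothetical_peak = [(0.0, 0.0), (1.0, 4.0), (2.0, 4.0), (3.0, 0.0)]

total_cs = integrate_trapezoid(hypothetical_peak)
print(f"context switches over the run: {total_cs}")
```

A wide, low peak and a narrow, tall peak can integrate to the same total, which is why a long run can show a high CS count even at a modest CS/s rate.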

You can see the beginnings of the context switch cascade on the NUM_THREADS=1024 spike. When this occurs the process monitoring CS/s can't get enough time on a CPU to record the CS/s, so the experiment is terminated.

Conclusion

In this section we'll be slicing the data already presented to prove that a trad4 application scales linearly with multiple cores - double the number of cores and you halve the calculation time.

First we'll examine each host separately (but on the same graph), testing to see how the multi-threaded runs stack up against the ST Mean given X number of cores. Second, we'll compare the run durations between the two hosts to see if we can see a similar pattern.

Intra-host

br440k_st_duration

The above graph shows the ratio of the ST Mean to each duration for all four test cases. This shows 'how much faster' any #Thread configuration is than the single-threaded case.

You can see that above 64 threads the Opteron cases are close to eight times faster than the Opteron ST Mean. The Athlon cases above 16 threads are around two times faster than the Athlon ST Mean, as expected.
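The ratios plotted in this graph can be recomputed from the tables in Tests 1 and 2 and the ST Means. The sketch below does so; the data is copied from this document, while the function name speedup is illustrative only.

```python
threads = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

# Flush durations from Tests 1 and 2. The Athlon series stop at 128 threads.
durations = {
    "Athlon": [70.05, 35.83, 22.13, 15.06, 14.15, 13.66, 13.3, 13.2],
    "AthlonO3": [49.98, 30.68, 12.37, 8.37, 6.52, 6.17, 6.01, 5.85],
    "Opteron": [54.82, 24.59, 12.12, 6.11, 3.51, 2.99, 2.73, 2.8, 2.74, 2.87],
    "OpteronO3": [46.79, 21.24, 10.13, 5.19, 2.66, 2.25, 1.92, 1.91, 1.94, 2.08],
}
st_mean = {"Athlon": 22.3148, "AthlonO3": 8.2094, "Opteron": 19.9114, "OpteronO3": 13.2403}


def speedup(test):
    """Ratio of the ST Mean to each multi-threaded duration, keyed by thread count."""
    # zip stops at the shorter list, so the Athlon series end at 128 threads.
    return {n: st_mean[test] / d for n, d in zip(threads, durations[test])}


for test in durations:
    row = ", ".join(f"{n}: {ratio:.2f}" for n, ratio in speedup(test).items())
    print(f"{test}: {row}")
```

A value below 1 means the configuration is slower than single-threaded mode, which is the case for the lowest thread counts.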

Inter-host

br440k_ath_opt

This graph shows the ratio of the Athlon/Opteron durations over #Thread configuration. Looking at the non-optimised data (blue line), as the #Threads increases the Opteron starts to get the edge due to the extra headroom for threads (threadroom), until it's around 4 times faster than the Athlon.

When optimised (red line), the picture looks a little different due to what looks to be an advantage the Athlon optimised binary has over the Opteron optimised binary. This is largely borne out by the difference in ST Means between the optimised and non-optimised versions as discussed in Test 2, but this will be further investigated.
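The two lines on this graph can be recomputed from the Test 1 and Test 2 tables. The sketch below divides each Athlon duration by the Opteron duration at the same thread count, up to the Athlon's 128-thread limit. The data is copied from this document; the function name host_ratio is illustrative only.

```python
threads = [1, 2, 4, 8, 16, 32, 64, 128]

# Durations up to 128 threads, the highest count for which Athlon data exists.
athlon = [70.05, 35.83, 22.13, 15.06, 14.15, 13.66, 13.3, 13.2]
opteron = [54.82, 24.59, 12.12, 6.11, 3.51, 2.99, 2.73, 2.8]
athlon_o3 = [49.98, 30.68, 12.37, 8.37, 6.52, 6.17, 6.01, 5.85]
opteron_o3 = [46.79, 21.24, 10.13, 5.19, 2.66, 2.25, 1.92, 1.91]


def host_ratio(athlon_times, opteron_times):
    """Athlon duration divided by Opteron duration for each thread count."""
    return [a / o for a, o in zip(athlon_times, opteron_times)]


non_optimised = host_ratio(athlon, opteron)
optimised = host_ratio(athlon_o3, opteron_o3)

for n, plain, opt in zip(threads, non_optimised, optimised):
    print(f"{n:>3} threads: non-optimised {plain:.2f}, optimised {opt:.2f}")
```

At 128 threads the non-optimised ratio is a little above four, while the optimised ratio is close to three.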

How to run the tests

To run the tests:

Download (TBA), unpack and set up as usual, sourcing bond_risk.conf
Set APP_DB to point to bond_risk_440k.db
Recompile with -O3 in CXXFLAGS if required
Run benchmarker.sh, which produces a benchmark.log.$$ file in $APP_ROOT.

Pro Tip: Run benchmarker.sh in the foreground so that it's easy to kill once you hit the context switch cascade. If in a multi-user/desktop environment and you don't run in the foreground you may have to hard-reboot your machine.