Stability and reproducibility

Why stability matters

When doing performance analysis, many different factors come into play. From hardware to software, many things can affect the results of your performance tests and make them less reproducible.

calcite tries its best to avoid false positives when detecting regressions, but removing noise from the benchmarks you run helps a lot. You cannot remove all the noise, but you should at least try to limit it and be aware of its major causes.

Having control over the various performance factors also gives you a more accurate understanding of what is happening. Developers very often draw wrong conclusions because they used a poor environment or poorly chosen metrics.

Run your tests more than once

calcite supports sending multiple results for the same test and commit, and uses statistical tools to detect whether a regression occurred.
A bias can very easily creep in if you run your benchmarks only once.
Running your tests multiple times is the only real way to improve reproducibility on the complex systems we have nowadays.
It might even increase the measured noise a bit, but it enables statistical analysis (provided enough iterations are executed).
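
For instance, here is a minimal sketch of this approach, where the workload, the iteration count, and the reported statistics are all placeholders rather than anything calcite prescribes:

    #include <chrono>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Placeholder for the code under test.
    static void workload() {
        volatile double x = 0.0;
        for (int i = 0; i < 1'000'000; ++i) x = x + std::sqrt(static_cast<double>(i));
    }

    int main() {
        constexpr int iterations = 30;  // arbitrary; more iterations give better statistics
        std::vector<double> samples;
        samples.reserve(iterations);

        for (int i = 0; i < iterations; ++i) {
            auto start = std::chrono::steady_clock::now();
            workload();
            auto stop = std::chrono::steady_clock::now();
            samples.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
        }

        // Report mean and standard deviation instead of a single timing.
        double mean = 0.0;
        for (double s : samples) mean += s;
        mean /= samples.size();
        double variance = 0.0;
        for (double s : samples) variance += (s - mean) * (s - mean);
        variance /= samples.size();

        std::printf("mean = %.3f ms, stddev = %.3f ms (%d runs)\n",
                    mean, std::sqrt(variance), iterations);
    }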

Machine stability

Independently of the type of performance test you will be doing, there is a set of common factors you can control to get a more stable environment.
This gives more consistent results and reduces the number of iterations required for reliable conclusions.
While such a controlled setup is very useful for detecting regressions, it might not reflect end-user results or suit your metrics. If you decide to skip these configurations (to stay closer to end-user performance), the best thing you can still do is run more iterations of your tests.
To help you configure your test machines, here is a non-exhaustive list of the usual culprits that hurt reproducibility.

Note

If you are using a Linux machine, know that the pyperf system tune command can fix quite a few of these issues. On Windows machines, powercfg can be used to control power usage and CPU settings.

We might provide a similar tool that works on multiple operating systems at a later time.

Power supply

Make sure that your power supply is reliable and your power cord is plugged in! A poor-quality power supply can affect the frequencies your processor and other hardware components run at.

Power-saving safeguards kick in when you run a laptop on battery, and they can greatly affect performance results.

Heat

Your motherboard and other hardware components (CPU, GPU, …) will throttle when facing high temperatures. Make sure that your machine has a good cooling system.

CPU

CPUs are now very complex, and it is hard to predict how different processors will react to the same code.
However, one of the main causes of variation in timing results is frequency (CPU clock speed) changes.

frequency

Usually, the operating system changes the power usage, and hence the frequency, of your processor based on the load of your machine.
This is commonly known as frequency scaling and can very quickly lead to high variance in results, even if you run the exact same performance test.
Many processors also support some kind of frequency boost, which raises the frequency of some cores when the others are idle.
Both can be controlled through your operating system, provided your account has sufficient privileges.
On Linux, set the CPU governor to performance and write 0 to /sys/devices/system/cpu/cpufreq/boost (see the kernel documentation); with the Intel pstate driver, boost is controlled through /sys/devices/system/cpu/intel_pstate/no_turbo instead.
Frequency boost can also usually be disabled through the BIOS.
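
As a rough sketch of what this can look like from code on Linux (root privileges, the acpi-cpufreq driver, and 8 logical CPUs are all assumptions here):

    #include <fstream>
    #include <string>

    // Sketch: force the "performance" governor and disable frequency boost on Linux.
    // Assumes root privileges and the acpi-cpufreq driver (with intel_pstate, write 1 to
    // /sys/devices/system/cpu/intel_pstate/no_turbo instead).
    int main() {
        const int logical_cpus = 8;  // adjust to your machine
        for (int cpu = 0; cpu < logical_cpus; ++cpu) {
            std::ofstream governor("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                                   "/cpufreq/scaling_governor");
            governor << "performance\n";
        }
        std::ofstream boost("/sys/devices/system/cpu/cpufreq/boost");
        boost << "0\n";  // 0 disables boost, 1 re-enables it
    }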

hyperthreading

Another feature of modern CPUs that can have a big impact on the stability of certain types of benchmarks is hyperthreading.
With this technology, multiple (virtual) cores share resources such as CPU caches.
Depending on the type of code you are benchmarking, this can seriously affect performance and its stability.
Hyperthreading is best disabled through the BIOS, but the operating system can also take the extra logical cores offline if needed.
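
A minimal sketch of taking logical cores offline from user space on Linux (requires root; the core numbers are placeholders, the real siblings can be read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list):

    #include <fstream>
    #include <string>
    #include <vector>

    // Sketch: take hyperthread sibling cores offline on Linux (requires root).
    // The core numbers below are placeholders; check
    // /sys/devices/system/cpu/cpuN/topology/thread_siblings_list for your machine.
    int main() {
        const std::vector<int> sibling_cores = {4, 5, 6, 7};  // placeholder
        for (int cpu : sibling_cores) {
            std::ofstream online("/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/online");
            online << "0\n";  // write 1 to bring the core back online
        }
    }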

Note

On some processors, hyperthreading also halves the number of PMU (Performance Monitoring Unit) counters available. If you rely on such events (cache miss rates, branch mispredictions, etc.), you might want to disable hyperthreading; otherwise the OS will have to multiplex the counters.

GPU

Similarly to CPUs, GPUs also use frequency scaling.

On Windows, make sure that your test application or profiler makes use of the ID3D12Device::SetStablePowerState function.
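
A minimal sketch of what that call can look like (assumes the Windows SDK and linking against d3d12.lib; error handling is reduced to a message):

    #include <windows.h>
    #include <d3d12.h>
    #include <wrl/client.h>
    #include <cstdio>
    #pragma comment(lib, "d3d12.lib")

    // Sketch: ask the driver for a stable GPU power state before benchmarking.
    // The call fails unless developer mode is enabled (see the note below).
    int main() {
        Microsoft::WRL::ComPtr<ID3D12Device> device;
        if (FAILED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device)))) {
            std::fprintf(stderr, "failed to create D3D12 device\n");
            return 1;
        }
        if (FAILED(device->SetStablePowerState(TRUE))) {
            std::fprintf(stderr, "SetStablePowerState failed (is developer mode enabled?)\n");
            return 1;
        }
        // ... run the GPU workload while the device (and stable power state) stays alive ...
        return 0;
    }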

Note

This requires your machine to have developer mode enabled.

On Linux, this usually depends on the device vendor and the installed drivers.

Operating system

Anything you ask of your operating system, such as scheduling multiple processes and threads, will have an impact on the reproducibility of your tests.
background processes

To mitigate the issue, the most basic advice is to avoid running unnecessary processes in the background. They will simply consume resources and cause contention, and the operating system might need to suspend your application, even if only for tiny amounts of time.

scheduler

The operating system is responsible for scheduling processes and threads. By default it will try to balance CPU time fairly between processes, which can be an issue when doing performance testing.

One simple way to mitigate this problem is to give a higher priority to your performance test processes. On Windows this is done using the SetPriorityClass function on an existing process, or by launching it with the start command and specifying the priority (/high or /realtime). On Linux this is controlled through the nice command (or the nice function).
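
A minimal sketch of raising the priority of the current process from code, assuming sufficient privileges on both platforms:

    #include <cstdio>
    #ifdef _WIN32
    #include <windows.h>
    #else
    #include <sys/resource.h>
    #endif

    // Sketch: raise the priority of the current benchmark process.
    int main() {
    #ifdef _WIN32
        // HIGH_PRIORITY_CLASS; REALTIME_PRIORITY_CLASS needs extra privileges
        // and can starve the rest of the system.
        if (!SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS))
            std::fprintf(stderr, "SetPriorityClass failed\n");
    #else
        // Negative nice values need elevated privileges (e.g. root or CAP_SYS_NICE).
        if (setpriority(PRIO_PROCESS, 0, -10) != 0)
            std::perror("setpriority");
    #endif
        // ... run the benchmark ...
        return 0;
    }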

Note

If you decide to isolate CPU cores, raising the process priority will have little to no additional impact.

Warning

If your benchmark involves communication with other processes, raising its priority can negatively impact performance. It usually still gives better reproducibility of the results, though.

cpu core isolation

On multi-core systems, you can also dedicate specific cores to your application by isolating them.

On Linux, you can use the cpuset utility to reserve CPUs for your benchmarks and keep the kernel and other processes off them.
You can also use the taskset command in combination with the isolcpus kernel parameter.
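
If you prefer to pin the process from within the benchmark itself rather than with taskset, here is a minimal Linux sketch (the core number is a placeholder and should match one of the cores you isolated):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <sched.h>
    #include <cstdio>

    // Sketch: pin the current process to a single (isolated) core on Linux.
    int main() {
        const int isolated_core = 3;  // placeholder: use a core listed in isolcpus
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(isolated_core, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {  // 0 = current process
            std::perror("sched_setaffinity");
            return 1;
        }
        // ... run the benchmark on the isolated core ...
        return 0;
    }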

Warning

While core isolation might not reflect the end-user case, it is generally a good idea when checking for regressions. It is particularly effective for compute-bound workloads.

IRQ affinity

Hardware interrupts can also cause your application to be suspended from time to time.
You can tell the kernel to avoid dispatching interrupts to the cores your application is using by setting /proc/irq/default_smp_affinity and /proc/irq/IRQ#/smp_affinity (see the kernel documentation).
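
A minimal sketch of restricting interrupts, assuming root privileges and a machine where cores 4-7 are reserved for the benchmark:

    #include <fstream>

    // Sketch: restrict newly-allocated hardware interrupts to cores 0-3
    // (hex bitmask 0f) so that cores 4-7 stay free for the benchmark.
    // Requires root; the mask is a placeholder for your own core layout,
    // and per-IRQ masks live in /proc/irq/<IRQ>/smp_affinity.
    int main() {
        std::ofstream affinity("/proc/irq/default_smp_affinity");
        affinity << "0f\n";
    }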

Address Space Layout Randomization

The operating system has a security feature named Address Space Layout Randomization (ASLR).
We recommend leaving it turned on, as it limits the impact of things such as environment variable values, the working directory, or the command line used.
As surprising as it may sound, those things can have an impact because the system might not lay out memory in exactly the same way when loading your program. [STI16]
With ASLR, your code is loaded at a different address every time, which mitigates the issue.

Note

In this specific case, our advice actually adds a bit of randomness to the results, but this is offset by the fact that you should run your benchmarks multiple times anyway.
This will, however, only work if you start a new process for each run of your test suites.
If you do not want to run the process multiple times, disabling ASLR might yield better results.
That is, IF you can make sure your tests are not affected by it:
at the very least, environment variables, the working directory, and the command line must not change.
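
If you do end up in that situation, here is a minimal sketch of disabling ASLR system-wide on Linux (requires root; writing 2 restores the usual default):

    #include <fstream>

    // Sketch: disable ASLR system-wide on Linux (requires root).
    // 0 = no randomization, 2 = full randomization (the usual default).
    // Alternatively, `setarch -R <command>` disables it for a single process.
    int main() {
        std::ofstream aslr("/proc/sys/kernel/randomize_va_space");
        aslr << "0\n";
    }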

References

STI16

Stinner, V. (2016, May 23). My journey to stable benchmark, part 3. Retrieved from https://vstinner.github.io/journey-to-stable-benchmark-average.html