By Chao Qian (Xixie)
This article describes why UnixBench scores vary across scenarios and what to consider when interpreting them.
Factor 1: CPU Pinning and Unpinning
Many cloud vendors in China provide VMs that are not pinned to CPUs, and such VMs may achieve high scores in multi-process scenarios; they may just as easily achieve low ones. The number of CPU cores matters even if you run only a single copy of the benchmark, because some test items run multiple processes concurrently, such as Shell Scripts (8 concurrent processes).
At the same or a similar clock speed, entry-level ECS instances (not pinned to CPUs) can be compared with these VMs, and the results are comparable. Nevertheless, we recommend enterprise-level ECS instances (pinned to CPUs). Their UnixBench scores may look low next to VMs from other vendors, but if you test enough VMs from those vendors, you will eventually hit a low-scoring VM that runs on a crowded physical machine.
Factor 2: Clock Speed
The CPU clock speed is crucial, and can even be decisive for any computation-related tasks. The higher the clock speed, the better.
Compare VMs from different vendors only when their clock speeds are at the same level. Even VMs with mid-range clock speeds can vary greatly in performance, so align the clock speed level before comparing results. If the clock speeds are not at the same level, check clock speed stability as well.
Factor 3: UnixBench Algorithm Defect 1: Pipe-Based Context Switching
Pipe-Based Context Switching is a UnixBench sub-test intended to measure parent-child process switching efficiency. Its results vary greatly across VMs from different cloud service vendors, because the CPU topology differs from vendor to vendor:
On some vendors' VMs, the parent and child processes run on the same CPU, which keeps the switching cost low and yields high switching-efficiency scores, sometimes even higher than those of physical machines, which is unreasonable. To ensure that the parent and child processes switch between different CPUs, we developed a simple optimization. Use the Alibaba Cloud version of UnixBench to test parent-child process switching efficiency: download and execute unixbench.sh.
Factor 4: UnixBench Algorithm Defect 2: Double-Precision Whetstone
For more information about how the floating-point operations are implemented, see the previous article on the UnixBench implementation.
As mentioned above, clock speed differences lead to performance differences, and higher is normally better. In this test, however, the relationship can invert: VMs with higher clock speeds can end up with lower floating-point scores.
The reason is that the Double-Precision Whetstone algorithm measures the maximum floating-point workload completed within 10 seconds. The operations include addition, subtraction, multiplication, division, condition switching, trigonometric functions, and exponential operations. The algorithm assumes that, within the 10-second window, the time these operations require grows linearly with the computation workload. Its design, however, did not anticipate clock speeds above 3.0 GHz, so the workload of the last pass becomes extremely large, and beyond a certain point the time consumed grows much faster than linearly. On high-frequency VMs, the initial passes finish quickly, allowing a large number of operations to complete within the first two seconds, so the time required for the final exponential-heavy pass is no longer linear as estimated. Here is the core loop extracted from the benchmark:
#include <math.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    long ix, i, x100 = 100;
    long xtra = (argc > 1) ? atol(argv[1]) : 10;  /* workload multiplier */
    long n8 = 93 * x100;
    double x = 0.75, t1 = 0.50000025;
    for (ix = 0; ix < xtra; ix++)
        for (i = 0; i < n8; i++)
            x = sqrt(exp(log(x) / t1));
    return 0;
}
On a high-frequency VM, the time consumed for single-thread operations is as follows:
Computation workload    Time consumed (s)
1000                    0.820
2000                    1.509
3000                    2.216
4000                    2.909
5000                    6.527
The table above shows that time consumption is no longer linear once the computation workload exceeds 4000, so the algorithm does not represent performance fairly: high-frequency VMs take on larger workloads, consume disproportionately more time, and receive lower scores. A simple modification restores linearity: re-initialize x = 0.75 at the start of each loop.
Factor 5: Differences in Compilation Options
GCC is used by default. If ICC is used instead, performance improves significantly; for example, the Dhrystone 2 integer test improves by about 7.9%.
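As a sketch, assuming the stock UnixBench Makefile (which reads the compiler from the standard CC variable) and that ICC is installed and on the PATH, the compiler can be swapped on the make command line:

```shell
# Hypothetical rebuild with ICC; assumes icc is installed and the
# UnixBench Makefile honors the standard CC make variable.
make clean
make CC=icc
./Run dhry2reg        # re-run only the Dhrystone 2 test item
```

Keep the compiler and flags identical across all VMs under comparison; otherwise the compiler difference, not the hardware, dominates the score gap.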
Factor 6: Kernel Parameter Adjustment
Enabling halt polling improves Pipe-Based Context Switching performance by more than twofold, but it also significantly increases power consumption. Cloud service vendors leave this option disabled by default.
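For reference, on KVM hosts halt polling is governed by the kvm module's halt_poll_ns parameter. This is a host-side, root-only setting that a guest normally cannot change, and the value below is illustrative:

```shell
# Host side only; requires root. A value of 0 disables halt polling.
cat /sys/module/kvm/parameters/halt_poll_ns      # current setting
echo 200000 > /sys/module/kvm/parameters/halt_poll_ns
```

Because this knob trades idle power for wakeup latency, two otherwise identical hosts with different halt_poll_ns values can produce very different context-switching scores.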
Factor 7: Process Interference
The UnixBench test items exercise very basic process operations, so to obtain the best possible scores, reduce interference from other processes as much as possible while the test runs.
In this article series, we have introduced UnixBench, described its implementation, and discussed common factors that can affect test scores.