Understanding Memory Performance
By Chao Qian (Xixie)
Memory is an essential part of a server, regardless of whether it is a physical machine or virtual machine. But the question is, how is memory performance measured?
Memory and Cache
Currently, most new CPUs have three levels of cache: L1 cache (32 KB to 256 KB), L2 cache (128 KB to 2 MB), and L3 cache (1 MB to 32 MB). Cache sizes are getting bigger. A CPU looks for data in the caches first, level by level. If the data is not found in any of them, the CPU fetches it from main memory.
Memory and Latency
CPUs can obtain data faster from closer locations. LMBench can be used to test data read latency.
The preceding figure shows that:
- The Intel(R) Xeon(R) Platinum 8163 CPU, running at 2.50 GHz, has a 32 KB L1D cache, a 32 KB L1I cache, a 1 MB L2 cache, and a 32 MB L3 cache.
- Each cache has a stable latency.
- The latency increases exponentially across different caches.
This means that if you want your service code to run more efficiently, keep the data it works on as close to the CPU as possible. However, the preceding figure shows that memory latency is measured in nanoseconds, while actual service response times are measured in milliseconds. Optimization should therefore focus first on the operations that take milliseconds; memory latency optimization is a long-tail concern.
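Latency tools of this kind typically work by chasing a chain of dependent loads, so that each read must finish before the next address is known. The following is a minimal sketch of that idea, not LMBench's actual implementation. The names build_chain, chase, and ns_per_load are hypothetical, and in Python the interpreter overhead dominates the timing, so the numbers show the method rather than real hardware latency.

```python
import random
import time


def build_chain(n, seed=0):
    """Return a list where chain[i] is the next index to visit.

    Sattolo's algorithm yields a single cycle through all n slots,
    so the access order is random and hard to prefetch.
    """
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j is always strictly less than i
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def chase(chain, steps):
    """Follow the chain for `steps` dependent loads and return the final index."""
    idx = 0
    for _ in range(steps):
        idx = chain[idx]
    return idx


def ns_per_load(n, steps=200_000):
    """Average time per dependent load, in nanoseconds, for a chain of n slots."""
    chain = build_chain(n)
    start = time.perf_counter()
    chase(chain, steps)
    return (time.perf_counter() - start) / steps * 1e9


if __name__ == "__main__":
    # Sweep the working set; in a compiled language the time per load
    # would step up as the chain outgrows each cache level.
    for n in (1 << 10, 1 << 14, 1 << 18):
        print(n, round(ns_per_load(n), 1))
```

Sweeping the chain size is what separates cache latency from memory latency: only a chain much larger than the L3 cache forces most loads to go to main memory.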
Memory latency is closely related to caches. Without a good understanding of memory latency, you may mistake cache latency for memory latency. The same applies to bandwidth: if a memory bandwidth test is performed incorrectly, you may end up measuring cache bandwidth instead.
To understand memory bandwidth, you first need to know how memory and CPUs are connected. In the past, CPUs were connected to memory through a northbridge. Now, CPUs read data directly from memory through their integrated memory controller (IMC).
You can test memory bandwidth using various tools. In Linux systems, the stream algorithm is generally used for this testing. The stream algorithm is briefly described as follows:
According to the preceding figure, the principle of the stream algorithm is very simple. Data is read from one memory block and written to another after a simple computation. The memory bandwidth is calculated using this formula: Size of data moved / Time elapsed. An appropriate test over the entire machine can reveal the bandwidth of the IMC. The following output shows the memory bandwidth data of a cloud product:
Function    Best Rate MB/s    Avg time (s)    Min time (s)    Max time (s)
Copy:       128728.5          0.134157        0.133458        0.136076
Scale:      128656.4          0.134349        0.133533        0.137638
Add:        144763.0          0.178851        0.178014        0.181158
Triad:      144779.8          0.178717        0.177993        0.180214
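The four rows correspond to the four stream kernels, and each rate follows from the formula above: size of data moved divided by time elapsed. The following is a minimal sketch of the kernels and the rate calculation, assuming 8-byte doubles and the usual accounting in which Copy and Scale touch two arrays while Add and Triad touch three. The names rate_mb_s and run_kernels are hypothetical, and pure Python loops measure the interpreter rather than the memory system, so this illustrates the arithmetic only.

```python
import time
from array import array

# Number of arrays each kernel reads or writes per element.
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def rate_mb_s(kernel, n, seconds, itemsize=8):
    """Bandwidth in MB/s (1 MB = 10**6 bytes) for n elements per array."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * itemsize * n
    return bytes_moved / seconds / 1e6


def run_kernels(n, scalar=3.0):
    """Run the four kernels once over arrays of n doubles and time each one."""
    a = array("d", [1.0]) * n
    b = array("d", [2.0]) * n
    c = array("d", [0.0]) * n
    seconds = {}

    start = time.perf_counter()
    for i in range(n):          # Copy: c = a
        c[i] = a[i]
    seconds["copy"] = time.perf_counter() - start

    start = time.perf_counter()
    for i in range(n):          # Scale: b = scalar * c
        b[i] = scalar * c[i]
    seconds["scale"] = time.perf_counter() - start

    start = time.perf_counter()
    for i in range(n):          # Add: c = a + b
        c[i] = a[i] + b[i]
    seconds["add"] = time.perf_counter() - start

    start = time.perf_counter()
    for i in range(n):          # Triad: a = b + scalar * c
        a[i] = b[i] + scalar * c[i]
    seconds["triad"] = time.perf_counter() - start

    return a, b, c, seconds


if __name__ == "__main__":
    n = 200_000
    _, _, _, seconds = run_kernels(n)
    for kernel, elapsed in seconds.items():
        print(f"{kernel:6s} {rate_mb_s(kernel, n, elapsed):10.1f} MB/s")
```

As a cross-check against the table, rate_mb_s("copy", 2**30, 0.133458) comes out at about 128,729 MB/s, which matches the Copy row. This suggests the test used arrays of roughly 2^30 doubles, although the source does not state the array size.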
Memory bandwidth is undoubtedly important, as it indicates the maximum data throughput of the memory. However, the result is only meaningful if the test is set up correctly. Pay attention to the following points:
- The size of the memory array must be far greater than the size of the L3 cache; otherwise, the test results indicate the cache throughput.
- The number of CPUs is important. Generally, one or two cores cannot saturate the memory bandwidth. The memory bandwidth can be measured effectively only when all the CPU cores of the machine are used. However, you can test the memory latency by running the stream algorithm with a single core.
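- Relationship between memory and NUMA: NUMA can effectively improve the memory throughput and reduce the memory latency, provided that each core mostly accesses memory on its local node. Accesses to a remote node are slower.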
- Selection of the stream algorithm compilation method: ICC compilation can effectively improve the performance score of memory bandwidth. The reason is that the Intel compiler optimizes the generated code, using techniques such as vectorization and prefetching, to accelerate data read/write operations and instruction execution. ICC compilation can also be applied to other C code to improve instruction efficiency.
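The first point in the list above can be checked before running a test. The sketch below assumes a common rule of thumb, also recommended in the comments of the STREAM source code, that each array should be at least four times the size of the last-level cache. The function names and the factor parameter are hypothetical and only serve to illustrate the sizing arithmetic.

```python
def min_stream_elements(l3_bytes, itemsize=8, factor=4):
    """Smallest per-array element count so that one array is `factor` times the L3 size."""
    return (factor * l3_bytes + itemsize - 1) // itemsize


def measures_memory(n, l3_bytes, itemsize=8, factor=4):
    """True if arrays of n elements are large enough to spill well past the L3 cache."""
    return n * itemsize >= factor * l3_bytes


if __name__ == "__main__":
    l3 = 32 * 2**20  # the 32 MB L3 cache of the CPU discussed earlier
    print(min_stream_elements(l3))        # elements needed per array
    print(measures_memory(1_000_000, l3)) # an 8 MB array would fit in cache
```

With the 32 MB L3 cache mentioned earlier, this gives 16,777,216 doubles (128 MB) per array as a lower bound. On a multi-socket machine, the L3 size passed in should be the sum over all sockets.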