Latency numbers and how to find them
It is supposed to be something every programmer should know: whenever you are coding, you should be aware of these numbers. I am talking about the Latency Numbers.
It's fairly easy to estimate them, given that every component of the path mentioned in that list comes with a throughput number. But how do the numbers actually compare when measured from a real process, running alongside all the other things a computer or system does?
I tried to estimate them, and even though I didn't get the exact numbers from the list, the difference in magnitude was pretty clear.
The code used to get these numbers is written in C. I tried my best to estimate the numbers as accurately as possible, but I may have made some mistakes.
The complete code can be found here.
The first thing to consider was how to measure elapsed time. Using the regular time routines is not advisable here, so I used the most opaque absolute number available: the CPU cycle counter (TSC), read with the rdtsc instruction.
// code taken from https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
// u_int64_t comes from <sys/types.h>
static inline u_int64_t rdtsc(void) {
    u_int64_t msr;
    asm volatile ( "rdtsc\n\t"          // Returns the time in EDX:EAX.
                   "shl $32, %%rdx\n\t" // Shift the upper bits left.
                   "or %%rdx, %0"       // 'Or' in the lower bits.
                   : "=a" (msr)
                   :
                   : "rdx");
    return msr;
}
This can be translated to seconds using the CPU clock speed, but I wouldn't advise it. Firstly, the cycle count alone is enough to see the difference in magnitude; secondly, the CPU clock speed is not always the one advertised or reported by system info, and it changes frequently. The listed speed can be extracted from the output of this command.
cat /proc/cpuinfo | grep -i 'cpu MHz'
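Just for illustration, converting a cycle count to wall-clock time is a single division by the clock rate; a minimal sketch, assuming a fixed 2.5 GHz clock (which, as said above, is not a safe assumption):
// Hedged sketch: convert a TSC delta to nanoseconds assuming a fixed 2.5 GHz clock.
// ASSUMED_GHZ is an assumption; real clocks scale up and down all the time.
#define ASSUMED_GHZ 2.5
static double cycles_to_ns(u_int64_t cycles) {
    return (double)cycles / ASSUMED_GHZ;   // 2.5 cycles elapse per nanosecond at 2.5 GHz
}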
The cache size is needed as well for some of the experiments.
cat /proc/cpuinfo | grep -i 'cache size'
When using the TSC in a multi-CPU environment, it is not always certain which CPU's counter a given call is reading. The counters mostly stay in sync, but when we are talking about single-digit cycle counts it matters. To make sure the program keeps getting scheduled on the same CPU, the process is pinned to a single CPU.
#define _GNU_SOURCE
#include <sched.h>

cpu_set_t set;
CPU_ZERO(&set);                            // start with an empty CPU set
CPU_SET(0, &set);                          // allow only CPU 0
sched_setaffinity(0, sizeof(set), &set);   // pin the calling process to it
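To double-check that the pin actually took, sched_getcpu() can be consulted before each sample; a tiny sketch (my addition, not part of the original setup):
// Hedged sketch: discard a sample if the process has been migrated off CPU 0.
static int still_on_cpu0(void) {
    return sched_getcpu() == 0;   // needs _GNU_SOURCE and <sched.h>, as above
}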
The next best thing would be to make this process non-preemptible, but that isn't possible for a user-land process on most desktop systems. The only thing that can be done is to make sure the scheduler is round-robin and to get its time slice. For now I am assuming that whatever impact yielding has will be offset by averaging across enough iterations, and since I am only interested in comparison and all the simulations suffer from the same issue, it should be okay.
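A rough sketch of how the round-robin policy and its time slice could be requested and inspected (setting SCHED_RR usually needs elevated privileges; this is an assumed setup step, not code from the experiments):
#include <sched.h>
#include <stdio.h>
#include <time.h>

// Hedged sketch: ask for round-robin scheduling and print the RR time slice.
static void try_round_robin(void) {
    struct sched_param sp = { .sched_priority = 1 };
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0)
        perror("sched_setscheduler");   // typically fails without root / CAP_SYS_NICE

    struct timespec slice;
    if (sched_rr_get_interval(0, &slice) == 0)
        printf("RR time slice: %ld.%09ld s\n", (long)slice.tv_sec, slice.tv_nsec);
}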
Another initialization step was to measure the cycles taken by the timestamping call itself, and then deduct that value from all the measurements.
u_int64_t a = rdtsc();
u_int64_t b = rdtsc();
double overhead = (double)(b - a);   // cost of the timestamping call itself
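A single back-to-back pair is noisy, so the same calibration can be averaged over a large number of iterations; a minimal sketch (CALIB_ITERS is an assumed constant, not from the original code):
// Hedged sketch: average the cost of the timestamping call itself.
#define CALIB_ITERS 1000000
static double tsc_overhead(void) {
    u_int64_t total = 0;
    for (int i = 0; i < CALIB_ITERS; i++) {
        u_int64_t a = rdtsc();
        u_int64_t b = rdtsc();
        total += b - a;
    }
    return (double)total / CALIB_ITERS;
}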
Finally come the individual cycle measurements.
Cache read
This is done by allocating some bytes and prefetching them using a GCC builtin. I then use an asm mov to read some data from that prefetched buffer and measure the time spent on that read.
// init
int *arr = (int *)malloc(CACHE_SIZE * sizeof(char));
int arr_zero;
__builtin_prefetch(arr);                 // pull the line holding arr[0] into cache
// code to timestamp (rdtsc before and after this read)
asm volatile ("mov %1, %0"
              : "=r" (arr_zero)
              : "r" (arr[0]));
Single-digit cycles: 5-9
Memory read
Exactly the same as before, except that I made sure the cache was loaded with something else just before making the timed read.
int *arr1 = (int *)malloc(CACHE_SIZE * sizeof(char));   // buffer whose read is timed
int *arr2 = (int *)malloc(CACHE_SIZE * sizeof(char));   // buffer used to occupy the cache
int arr2_zero = arr2[0];                                 // touch arr2 so its data is cached
__builtin_prefetch(arr2);                                // prefetch arr2 instead of arr1
Close to a hundred cycles: 80-100
Branch mis-prediction
To test this I again used a GCC builtin to hint to the compiler which branch is probable. I wrote two versions of the program: one where the hint was always correct, and another with 50% branch mis-prediction. Both cases ran the same code inside the branches. Unfortunately I didn't see much of a difference.
predicate = predicate_guess(predicate);    // helper that decides the branch outcome
u_int64_t a = rdtsc();
if (__builtin_expect(predicate, 1)) {      // hint: expect the true branch
// code
} else {
// code
}
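For illustration, one way to get a roughly 50% unpredictable predicate is a pseudo-random bit; a minimal sketch of such a stand-in (not necessarily what predicate_guess does):
#include <stdlib.h>

// Hedged sketch: 0 or 1 with ~50% probability, so the predictor cannot learn a pattern.
// Illustrative stand-in only; not necessarily the repository's predicate_guess().
static int random_predicate(void) {
    return rand() & 1;
}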
Inconclusive: this one needs a rethink
Sequential read from RAM
This experiment used the same trick as before to make sure the cache was holding something else; the sequential read itself was done with memcpy.
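A minimal sketch of what that measurement could look like (the buffer size and the cache-pollution step are spelled out here as assumptions; only the memcpy is timed):
#include <stdlib.h>
#include <string.h>

// Hedged sketch: time a 16 MB sequential copy out of RAM.
#define SIZE_16MB (16 * 1024 * 1024)

static double time_sequential_ram_read(void) {
    char *src = malloc(SIZE_16MB);
    char *dst = malloc(SIZE_16MB);
    memset(src, 1, SIZE_16MB);        // touch the pages so they are resident

    // ... walk another large buffer here so src is no longer in cache ...

    u_int64_t start = rdtsc();
    memcpy(dst, src, SIZE_16MB);      // the sequential read being measured
    u_int64_t end = rdtsc();

    free(src);
    free(dst);
    return (double)(end - start);
}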
For 16 MB of data it took ~2.5 million cycles
Sequential read from disk (SSD)
The same amount of data, 16 MB, is used here as well. To make sure the read happens from a fresh file through a new file descriptor, each iteration creates a file, writes the data to it, closes it, reopens it in read-only mode, and then performs the read call. The file is deleted at the end of each iteration. Only the time spent in the read call was recorded.
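Roughly, each iteration might look like the sketch below (the path and buffer handling are illustrative; only the read() call sits between the two timestamps, as described):
#include <fcntl.h>
#include <unistd.h>

// Hedged sketch: write 16 MB to a fresh file, reopen it read-only,
// time only the read() call, then delete the file.
static double time_ssd_read(const char *path, char *buf, size_t len) {
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    write(fd, buf, len);
    close(fd);

    fd = open(path, O_RDONLY);
    u_int64_t start = rdtsc();
    read(fd, buf, len);               // only this call is measured
    u_int64_t end = rdtsc();
    close(fd);

    unlink(path);                      // fresh file for the next iteration
    return (double)(end - start);
}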
For 16 MB of data it took ~130 million cycles
Data read over network (loopback)
Tested using a simple echo server: the client connects to the server over a TCP socket, sends 8 MB of data, and receives the 8 MB echoed back, so 16 MB in total crosses the loopback.
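On the client side, the timed exchange might look roughly like this sketch (the port, chunk size, and the echo protocol itself are assumptions; error handling is left out):
#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

// Hedged sketch: send 8 MB to a loopback echo server in chunks and read each
// chunk back, timing the whole exchange.
#define CHUNK (64 * 1024)
#define TOTAL (8 * 1024 * 1024)

static double time_loopback_echo(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9000);                       // assumed port for the echo server
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    connect(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[CHUNK];
    memset(buf, 1, sizeof(buf));

    u_int64_t start = rdtsc();
    for (int sent = 0; sent < TOTAL; sent += CHUNK) {
        write(fd, buf, CHUNK);                         // send one chunk
        read(fd, buf, CHUNK);                          // read the echoed chunk back
    }
    u_int64_t end = rdtsc();

    close(fd);
    return (double)(end - start);
}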
For 16 MB of data it took 1.5-2 million cycles
This experiment is still ongoing and a lot of the comparisons are still pending. I also intend to try variations in the number of bytes and in the system itself. The current data was captured on an Ubuntu 20.04 machine with an i7 @ 2.5 GHz, 4 MB cache, 16 GB RAM, and a 512 GB SSD.