Vala ABI and branch prediction

While maintaining Libgee for some time I ran into an interesting phenomenon – performance did not behave the way most people would expect. When the higher-order functions were planned it was pointed out that Vala doesn’t have the complex inlining required for fast use of higher-order functions.

As the previous link shows, the reverse turned out to be true. In hindsight the source of the discrepancy seems clear: it comes from the shorthands we employed while analyzing the performance.

Why do we need branch prediction

The simplest way of thinking about a processor is as a machine executing one instruction at a time. While this is a very simple description, it doesn’t allow fast computation. Usually there are several stages which can work independently or semi-independently, so the processor resembles an assembly line (pipeline) for instructions. Since each discrete stage needs to fit inside one cycle, splitting the work more finely allows a faster clock rate – and larger throughput (this is a simplified view, as current processors are also superscalar and out-of-order).

There are however a few things, called hazards, that prevent reaching the theoretical maximum performance (other than bottlenecks outside the processor). Consider the following program (in pseudo-assembly):

    mov [r0], r0
    jz r0, label
    mov [r1], r0
label:
    add r0, 1, r1

The processor does not know how to proceed after line 2 until the value of r0 is known. The simplest resolution of a hazard is to stall (do nothing until it is known how to resolve the problem). However in many cases the processor can guess (predict) how to proceed – this requires a bit of bookkeeping in order to restore a consistent state when the guess fails, but it avoids just waiting for the answer.

Openness and branch prediction

Traditionally branches were quite simple – a branch was either taken or not taken. Since the direction of the branch was just one bit of information, a number of smart methods were possible:

  • Counting how often the branch is taken versus not taken – we guess the direction that was taken most frequently in the past.
  • Considering the history. It is possible to take, say, the 5 past taken/not-taken outcomes and use them as part of the predictor’s lookup address. This improves accuracy, as a single method might be called from 2 different places and branch differently depending on how it is called.

However on modern processors (in practice – any produced during the past 30 years or so) we can jump to any place in the code through a function pointer:

int main() {
    // f is pointer to some function
    int (*f)() = getF();
    f();
}

This style of calling an unknown function is currently very common – the call target depends on the data passed to the function instead of the place of the call. From the processor’s point of view virtual method calls, closure calls etc. are calls to an arbitrary address. Now instead of two possible outcomes of a branch there can be an arbitrary number of them. While processors usually guess that the target taken last time will be taken again, I have not heard about processors doing anything more complicated (from what I understand, the history considered is still that of traditional taken/not-taken branches).

Functional and object-oriented languages usually deal with the problem in various ways. For example they can employ data/control-flow analysis to check which types can occur at a given pointer – an example in C++:

struct A {
    virtual int f() = 0;
};

struct B : A {
    int f() __attribute__((noinline)) {
        return 0;
    }
};

struct C : A {
    int f() {
        return 1;
    }
};

int main() {
    B b;
    A *a = &b;
    return a->f();
}

GCC can see that a cannot point to anything but a B. Hence it calls B::f directly (in fact it tail-calls it):

	.file	"test.cc"
	.section	.text._ZN1B1fEv,"axG",@progbits,_ZN1B1fEv,comdat
	.align 2
	.p2align 4,,15
	.weak	_ZN1B1fEv
	.type	_ZN1B1fEv, @function
_ZN1B1fEv:
.LFB0:
	.cfi_startproc
	xorl	%eax, %eax
	ret
	.cfi_endproc
.LFE0:
	.size	_ZN1B1fEv, .-_ZN1B1fEv
	.section	.text.startup,"ax",@progbits
	.p2align 4,,15
	.globl	main
	.type	main, @function
main:
.LFB2:
	.cfi_startproc
	leaq	-24(%rsp), %rdi
	jmp	_ZN1B1fEv
	.cfi_endproc
.LFE2:
	.size	main, .-main
	.ident	"GCC: (Gentoo 4.7.2-r1 p1.4, pie-0.5.5) 4.7.2"
	.section	.note.GNU-stack,"",@progbits

However such techniques are (usually) not employed by C compilers, as in many cases the programmer can go behind the compiler’s back and change the function pointers – so the compiler must assume they may change even when the programmer knows they won’t.

A note about symbols

A potential source of openness is symbol resolution. If the symbol may be resolved outside the current library, the call goes through the PLT. If f were a shared function, `int main() {return f ();}` would compile into:

	.file	"test-plt.c"
	.text
	.globl	f
	.type	f, @function
f:
.LFB0:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	popq	%rbp
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE0:
	.size	f, .-f
	.globl	main
	.type	main, @function
main:
.LFB1:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	movl	$0, %eax
	call	f@PLT
	popq	%rbp
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE1:
	.size	main, .-main
	.ident	"GCC: (Gentoo 4.7.2-r1 p1.4, pie-0.5.5) 4.7.2"
	.section	.note.GNU-stack,"",@progbits

f@PLT is a small stub which jumps to the appropriate function according to symbol resolution. In general it adds one additional indirect jump, which can be perfectly predictable after the first couple of misses. On x86 (the 32-bit version), due to technical limitations, there is one additional direct jump. A detailed description can be found in Ulrich Drepper’s guide “How To Write Shared Libraries”.

A note about real performance

Measuring the problems which can arise here is quite tricky. Modern processors and compilers are quite smart, so microbenchmarking a single instruction can turn into a fight against them – while at the same time, for the result to be meaningful, their smartness needs to be exploited.

I used the following code in my test:

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <inttypes.h>
#include <math.h>

static void f() __attribute__((noinline));
static void g() __attribute__((noinline));
 
static void f() {
    __asm__ __volatile__ ("");
    return;
}
 
static void g() {
    __asm__ __volatile__ ("");
    return;
}
 
 
typedef void (*fptr_t)();
 
#ifndef DIRECT
#ifdef CHANGE
static void swap(fptr_t *f, fptr_t *g) {
  fptr_t tmp = *f;
  *f = *g;
  *g = tmp;
}
#endif
#endif
 
#ifdef VOLATILE
#define LOAD_FPTR(n) \
  volatile fptr_t n ## ptr = n
#else
#define LOAD_FPTR(n) \
  static volatile fptr_t n ## ptr_static = n; \
  fptr_t n ## ptr = n ## ptr_static
#endif
 
int main() {
#ifndef DIRECT
  LOAD_FPTR(f);
#ifdef CHANGE
  LOAD_FPTR(g);
#endif
#else
    fptr_t fptr = f;
#ifdef CHANGE
    fptr_t gptr = g;
#endif
#endif
 
#ifndef TRIES
#define TRIES 10
#endif
    double times[TRIES];
    for (uint64_t j = 0; j < TRIES; j++) {
        struct timespec start, end;
        clock_gettime (CLOCK_THREAD_CPUTIME_ID, &start);
        for (uint64_t i = 0; i < UINT64_C(0xFFFFFFF); i++) {
#ifndef CHANGE
            fptr();
#else
#ifndef DIRECT
            swap (&fptr, &gptr);
            fptr();
#else
            uint64_t f_or_g;
            asm ("and $1, %0" : "=r"(f_or_g) : "0"(i));
            if (f_or_g) {
                fptr();
            } else {
                gptr();
            }
#endif
#endif
        }
        clock_gettime (CLOCK_THREAD_CPUTIME_ID, &end);
        times[j] = (end.tv_sec - start.tv_sec) + 1e-9*(end.tv_nsec - start.tv_nsec);
        printf("Time[%" PRIu64 "] = %f\n", j, times[j]);
    }
    double mean = 0.0;
    double mean2 = 0.0;
    for (int j = 0; j < TRIES; j++) {
        mean += times[j]/TRIES;
        mean2 += (times[j] * times[j])/TRIES;
    }
    double stdev = sqrt(mean2 - mean * mean);
    printf("Mean  = %f\n", mean);
    printf("Stdev = %f\n", stdev);
    return 0;
}
Type                                                 Mean [s]  Stdev [s]  Notes
Direct jump, same location                           0.540     0.036      This just measures the overhead of a direct jump
Direct jump, alternating locations                   0.524     0.031      All it adds is an additional jump which can be predicted from history
Indirect jump, same location, from register          0.459     0.031      An indirect jump whose target can be predicted and is known
Indirect jump, same location, from memory            0.459     0.029      This corresponds to a Vala call which can be predicted
Indirect jump, alternating locations, from register  1.570     0.035      This corresponds to a mis-predicted indirect jump
Indirect jump, alternating locations, from memory    2.080     0.033      This is a mis-predicted Vala call

The table is closer to a rough guide than to a realistic simulation. For example many Vala virtual calls involve 2 indirect jumps, and the last row performs additional stores to memory.

How does Vala fit into it

Vala uses GObject under the hood and is to some extent constrained by it. It needs to fit into the existing ABI of libraries written in GObject/C. Since virtual methods were implemented in C by function pointers, that is what Vala does as well. In this translation the benefits of more complicated optimizations are lost – the C compiler can no longer devirtualize the code. At the same time the open style (virtual methods) is encouraged more in Vala than in C:

List<int> list = new ArrayList<int>();
list.add(0);
// In C++ the add could point directly to ArrayList::add
// In Vala we do the indirect call

Even if such analysis were implemented in Vala, the optimization could not easily be implemented without tweaks to the ABI. It would not be impossible to simply export gee_array_list_add so that the compiler could look up the correct function and call it directly. Since this changes the ABI, the classes would need to be explicitly annotated.

Another problem present in Vala is abstract classes deriving from interfaces. If a class overrides a method of such an interface, currently the following happens:

  1. The interface method resolution function (my own naming – not a standard one) is called – 1 direct jump
  2. The interface vtable is looked up (a memory load) and the location of the abstract class resolution function is read from it – 1 indirect jump
  3. The abstract class vtable is looked up (a memory load) and the real function is read from it and called – 1 indirect jump
  4. The real method is executed

In total that is 2 indirect jumps (each depending on a memory load) and 1 direct jump. There seem to be at least 2 ways of improving it:

  • Instead of overriding just the abstract class, the child class might simply reimplement the interface under the hood, thus allowing a direct jump to the method. This seems to be possible given the existence of g_type_interface_peek_parent.

    As the interface implementation can be per derived class instead of one per hierarchy, the pointer to the real implementation can be inserted into the vtable. This makes it possible to skip the detour via the abstract class resolution function, saving one indirect call.

  • While technically not fully compatible with GObject, it is possible to resolve the method directly from outside the class. Consider the following loop:

    List<int> list = ...
    for (int i = 0; i < list.size; i++) {
        do_something (list.get (i));
    }
    

    It would be compiled into something roughly corresponding to:

    GeeList list = ...
    for (int i = 0; i < gee_list_get_size (list); i++) {
        int tmp = gee_list_get (list, i);
        do_something (tmp);
    }
    

    Instead it would be possible to fetch the function pointers for gee_list_get_size and gee_list_get once, before the loop – chances are the registers won’t be spilled to the stack and the pointers would be used right away. In a loop this simple it might not make much difference, but in more complicated ones it might have an impact.

Conclusion

Control flow may have an impact on the performance of a program. Usually the compiler, the JIT and the natural flow of the language direct the programmer towards certain usage patterns which can be exploited. Unfortunately Vala doesn’t have many such optimizations implemented (at least yet), and they cannot be done by the C compiler.

In many cases people simply don’t care about performance (does it matter if a calculator displays the result 17 ms faster?). Once people are interested in performance, they can start optimizing even in ways that are not natural in the language (for example struct-of-arrays in C/C++). For such cases it is worth keeping in mind that Vala does not optimize virtual calls as much as C++ or Java do.

This entry was posted in GNOME, GNU/Linux, Vala.