Virtual Functions Considered Not Harmful, Part I

Every now and then, the use of virtual functions gets questioned when it comes to runtime performance. I did a bit of work to find the (halfway) truth …

Recently on gamedev.net
Original post by NotAYakk
Note that implementing a faster/leaner virtual function object system in C++ is possible, but (A) it isn’t worth it in most every case, and (B) it isn’t easy to get right.


Did he mean a virtual-function object system or a virtual function-object system? Anyway, I can guess what he means.

I don’t believe it is possible to code something that is faster and leaner and at the same time as mighty as virtual functions. At least not in valid C++, because there I have already found something fast and lean, as the next paragraphs show. Anyway, feel free to prove me wrong with actual material, if you find some!

Original post by NotAYakk
Virtual function overhead requires something done on a per-frame and per-pixel basis. I.e., if on every frame you call the virtual function once per pixel, then you are starting to look at the point where you should worry about virtual function overhead.

Before that (or other similar cases), it isn’t a concern.

Calling virtual functions on a per-pixel, per-frame basis doesn’t mean anything by itself. Virtual functions are actually not that bad compared to ordinary, non-inlined calls when the functions actually contain code. For example, if you do some dot products, square roots, Reinhard operators, and the usual stuff one does at the per-pixel level, the call cost becomes negligible.
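To make that concrete, here is a minimal sketch of such a per-pixel loop, where each virtual call is paired with real work (the `PixelOp`/`Reinhard` names are my own for illustration, not from the test program below):

```cpp
// Hypothetical per-pixel operator interface; purely illustrative.
struct PixelOp {
    virtual float apply (const float *rgb) const = 0;
    virtual ~PixelOp() {}
};

// A Reinhard-style tone-mapping operator: the body alone involves a
// dot product and a division, which dwarfs the virtual call overhead.
struct Reinhard : PixelOp {
    float apply (const float *rgb) const {
        // luminance as a dot product with Rec. 601 weights
        const float lum = 0.299f*rgb[0] + 0.587f*rgb[1] + 0.114f*rgb[2];
        return lum / (1.0f + lum); // Reinhard operator
    }
};

float sum_mapped (const PixelOp &op, const float *pixels, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += op.apply (pixels + 3*i); // one virtual call per pixel
    return acc;
}
```

The call itself is a handful of instructions; the body is where the time goes.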

Let’s ask Dr. GCC about the cost of virtual function calls

Preparation

#include <cstdio>
 
#ifdef __GNUC__
#define MARK(x) do{asm volatile ("# "#x);}while(0)
#define NO_OPTIMIZE_AWAY() do{asm("");}while(0)
#else
#error "dies"
#endif
 
struct Interface {
    virtual int fun() const = 0;
    virtual int fun2 (int, int) const = 0;
    virtual float dot3 (const float *, const float *) const = 0;
    virtual ~Interface() {} // needed: main() deletes through an Interface&
};
 
struct Foo : public Interface {
    int const ff;
    Foo (int ff) : ff (ff) {}
 
    int fun() const {
        return ff;
    }
 
    int fun2 (int, int) const {
        return ff;
    }
 
    float dot3 (const float *lhs, const float *rhs) const {
        return lhs[0]*rhs[0] + lhs[1]*rhs[1] + lhs[2]*rhs[2];
    }
};
 
struct Oof : public Foo {
    int const ff;
    Oof (int ff) : Foo (0), ff (ff) {}
 
    int fun() const {
        return ff;
    }
};
 
struct Bar {
    int const ff;
    Bar (int ff) : ff (ff) {}
    int fun() const { return ff; }
    int fun2 (int, int) const {
        return ff;
    }
    float dot3 (const float *lhs, const float *rhs) const {
        return lhs[0]*rhs[0] + lhs[1]*rhs[1] + lhs[2]*rhs[2];
    }
};


Tests

int main () {
    int entropy;
    scanf ("%d", &entropy);
    Interface &foo = *new Foo(entropy);
    MARK("Foo starts here ... ");
    entropy = foo.fun();
    MARK("... and here it ends.");
    printf ("%d\n", entropy);
    delete &foo;
 
    scanf ("%d", &entropy);
    Interface &oof = *new Oof(entropy);
    MARK("Oof starts here ... ");
    entropy = oof.fun();
    MARK("... and here it ends.");
    printf ("%d\n", entropy);
    delete &oof;
 
    scanf ("%d", &entropy);
    Bar &bar = *new Bar(entropy);
    MARK("Bar starts here ... ");
    entropy = bar.fun();
    MARK("... and here it ends.");
    printf ("%d\n", entropy);
    delete &bar;
}

Results

# call to Foo::fun()
	movl	-12(%ebp), %eax
	movl	(%eax), %edx
	movl	-12(%ebp), %eax
	movl	%eax, (%esp)
	movl	(%edx), %eax
	call	*%eax
	movl	%eax, -8(%ebp)
 
# call to Oof::fun()
	movl	-16(%ebp), %eax
	movl	(%eax), %edx
	movl	-16(%ebp), %eax
	movl	%eax, (%esp)
	movl	(%edx), %eax
	call	*%eax
	movl	%eax, -8(%ebp)
 
# call to Bar::fun()
	movl	-20(%ebp), %eax
	movl	%eax, (%esp)
	call	__ZNK3Bar3funEv
	movl	%eax, -8(%ebp)


We see that the op count grew from 4 to 7 (a factor of 1.75). The cost of the bare call partially amortizes as the number of operands increases.
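What the extra instructions do can be sketched by hand: load the object’s vtable pointer, load the function pointer from the table, then call through it. The following emulation is my own illustration of that mechanism, not literal compiler output:

```cpp
// Hand-rolled equivalent of virtual dispatch. Names are made up.
struct VTable;

struct Object {
    const VTable *vptr; // the hidden pointer the compiler adds
    int ff;
};

struct VTable {
    int (*fun)(const Object *self); // one slot per virtual function
};

int foo_fun (const Object *self) { return self->ff; }

const VTable foo_vtable = { &foo_fun };

int call_virtual (const Object *obj) {
    // movl (%eax), %edx   -> load vptr from the object
    // movl (%edx), %eax   -> load function pointer from the table
    // call *%eax          -> indirect call
    return obj->vptr->fun (obj);
}
```

The two extra loads plus the indirect call are exactly the delta between the Foo and Bar listings above.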

Adding Parameters

Let’s add another test-case (you can copy this into the function main() if you’re trying this yourself):

*snip*
int main () {
    *snip*
    int entropy2, entropy3;
    scanf ("%d %d %d", &entropy, &entropy2, &entropy3);
    Interface &foo2 = *new Foo(entropy);
    MARK("Foo2 starts here ... ");
    entropy = foo2.fun2 (entropy2,entropy3);
    MARK("... and here it ends.");
    printf ("%d %d %d\n", entropy, entropy2, entropy3);
    delete &foo2;
 
    scanf ("%d %d %d", &entropy, &entropy2, &entropy3);
    Bar &bar2 = *new Bar(entropy);
    MARK("Bar2 starts here ... ");
    entropy = bar2.fun2 (entropy2,entropy3);
    MARK("... and here it ends.");
    printf ("%d %d %d\n", entropy, entropy2, entropy3);
    delete &bar2;
}


The results for this:

# Foo::fun2(int,int)
	movl	-32(%ebp), %eax
	movl	(%eax), %edx
	addl	$4, %edx
	movl	-28(%ebp), %eax
	movl	%eax, 8(%esp)
	movl	-24(%ebp), %eax
	movl	%eax, 4(%esp)
	movl	-32(%ebp), %eax
	movl	%eax, (%esp)
	movl	(%edx), %eax
	call	*%eax
	movl	%eax, -8(%ebp)
# Bar::fun2(int,int)
	movl	-28(%ebp), %eax
	movl	%eax, 8(%esp)
	movl	-24(%ebp), %eax
	movl	%eax, 4(%esp)
	movl	-36(%ebp), %eax
	movl	%eax, (%esp)
	call	__ZNK3Bar4fun2Eii
	movl	%eax, -8(%ebp)


This time the factor is 1.5 (12 ops virtual, 8 ops non-virtual).

Adding code at the target site

This was just the call! Now we’ll add up the code of the functions themselves. The function definitions all look the same, so I’ll just show one of them:

__ZNK3Bar4fun2Eii:
	pushl	%ebp
	movl	%esp, %ebp
	movl	8(%ebp), %eax
	movl	(%eax), %eax
	popl	%ebp
	ret


We see that on top of the call ops we have to add 6 more ops, so our previous factor of 1.75 decreases to (7+6) / (4+6) = 1.3, and the 1.5 decreases to (12+6) / (8+6) ≈ 1.29.

Adding actual code to the target site

Now let’s have a look at the assembly produced for our dot-product-functions. Remember, they looked as simple as:

float dot3 (const float *lhs, const float *rhs) const {
        return lhs[0]*rhs[0] + lhs[1]*rhs[1] + lhs[2]*rhs[2];
}

The assembly again looks the same for all versions, so here is just one of them:

	pushl	%ebp
	movl	%esp, %ebp
	movl	12(%ebp), %eax
	movl	16(%ebp), %edx
	flds	(%eax)
	fmuls	(%edx)
	movl	12(%ebp), %eax
	addl	$4, %eax
	movl	16(%ebp), %edx
	addl	$4, %edx
	flds	(%eax)
	fmuls	(%edx)
	faddp	%st, %st(1)
	movl	12(%ebp), %eax
	addl	$8, %eax
	movl	16(%ebp), %edx
	addl	$8, %edx
	flds	(%eax)
	fmuls	(%edx)
	faddp	%st, %st(1)
	popl	%ebp
	ret

And we find 22 instructions for a simple dot product. Our additional cost is now:

  • (12+22) / (8+22) = approx. 1.1333

So instead of 40 frames per second you get “only” about 36, but you get many benefits in return. Add some more operations and the relative overhead decreases further. And the blanket statement “virtual functions at the per-pixel, per-frame level considered harmful” vanishes.

We learn today (how pathetic :D)…

that the cost (*) of calling a virtual function, compared to a non-inlined, non-virtual function, shrinks with the number of arguments you pass to the function, with the complexity of what is inside the function, and with the number of return values (in C/C++, either 0 or 1)

(*): Actually with respect to size/op-count, as we haven’t done any benchmark yet; but see part 2.
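Just to show the shape such a benchmark could take (the harness, the `Iface`/`Impl` names, and the loop count are my own, not taken from part 2):

```cpp
#include <chrono>

// Minimal timing sketch: time a hot loop of virtual calls.
struct Iface {
    virtual int fun() const = 0;
    virtual ~Iface() {}
};

struct Impl : Iface {
    int ff;
    Impl (int ff) : ff (ff) {}
    int fun() const { return ff; }
};

// Returns elapsed nanoseconds; writes the accumulated result through
// acc_out so the compiler cannot discard the loop entirely.
long long time_calls (const Iface &obj, int iterations, long long *acc_out) {
    using clock = std::chrono::steady_clock;
    long long acc = 0;
    const auto t0 = clock::now();
    for (int i = 0; i < iterations; ++i)
        acc += obj.fun(); // one virtual call per iteration
    const auto t1 = clock::now();
    *acc_out = acc;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

One would compare this against the same loop over a non-virtual class; for meaningful numbers the call target must not be inlinable.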

sidenote:

Quote:

No, don’t do epic switch statements.


But sometimes, only sometimes, epic switch statements are the right thing. For example, if you write a performant software-based virtual machine with many trivial micro-operations (e.g. add or mul), then you want jump tables. And you generally only get jump tables from epic switch statements, or by relying on pointers to labels (GCC supports them as computed gotos, but they are not standard).
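For illustration, such a switch-dispatched stack VM might look like this minimal sketch (the opcodes and encoding are made up):

```cpp
#include <cstddef>
#include <vector>

// Tiny stack machine; a dense switch like this one is typically
// lowered to a jump table by the compiler.
enum Op { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

int run (const std::vector<int> &code) {
    std::vector<int> stack;
    for (std::size_t pc = 0; pc < code.size(); ) {
        switch (code[pc]) {
        case OP_PUSH:
            stack.push_back (code[pc+1]); // operand follows the opcode
            pc += 2;
            break;
        case OP_ADD: {
            const int b = stack.back(); stack.pop_back();
            stack.back() += b;
            ++pc;
            break;
        }
        case OP_MUL: {
            const int b = stack.back(); stack.pop_back();
            stack.back() *= b;
            ++pc;
            break;
        }
        case OP_HALT:
            return stack.back();
        }
    }
    return stack.empty() ? 0 : stack.back();
}
```

With one virtual call per micro-operation instead, the dispatch would cost more than the operation itself; here the jump table wins.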


2 Responses to Virtual Functions Considered Not Harmful, Part I

  1. Maciej says:

Interesting read, but in my opinion your virtual function profiling is a bit incomplete. The real performance trouble with virtual functions shows up when you have more complex code than just simple vector operations and you’re thrashing your CPU caches (L1, L2) a lot.

Imagine your CPU cache being completely invalidated – in that case, before calling a virtual function you must first load the virtual table into the cache, then read the address of the code to jump to, then load the function itself into the cache. A non-virtual function call needs just an address.

    If your application has hundreds of classes with thousands of virtual functions being called in random order all over the time, then you’re likely to be experiencing lots of cache misses for your virtual tables.

  2. phresnel says:

Maciej: That’s of course true. In very complex cases, the CPU will get into trouble. What I really wanted to show is that, if your code can significantly benefit from virtual functions, and if it is not too nested/complex, then you should consider using virtual functions.