Huh? » gamedev.net http://phresnel.org/blog D'Oh! (feel free to leave me a mail at "phresnel.gmail@com , swap @ with . for proper mail) Fri, 12 Feb 2010 09:17:43 +0000 en hourly 1 http://wordpress.org/?v=3.1.2 Virtual Functions Considered Not Harmful, Part II http://phresnel.org/blog/2009/02/virtual-functions-considered-not-harmful-part-ii/ http://phresnel.org/blog/2009/02/virtual-functions-considered-not-harmful-part-ii/#comments Fri, 20 Feb 2009 13:40:01 +0000 phresnel http://phresnel.org/blog/?p=58 Continue reading ]]>

For the lack of a profiler (and for the lack of time), I commited to the cardinal sin of making unproven statements in Virtual Functions Considered Not Harmful, Part I


Original post by dascandy
call *%eax

<snip>

call __ZNK3Bar4fun2Eii

The first causes (often) a pipeline stall. The second never does. Please try to *profile* your speed tests before making any statements. You’re not working on 386′s, 486′s or Pentiums, your CPU is not trivial anymore.


The key concepts here are the Branch History Table (as in “your CPU is not trivial anymore” [sic!]) on modern CPU and the fact that the advantage of non-virtual functions decreases with increasing complexity of the function itself.

You are right, I should have used a profiler, but unfortunately, I don’t have one handy on this box. Hence I wrote the following benchmark (mostly Std-C++ compatible, except for the asm()’s  and the __attribute(())’s for the sake of benchmarking).

Benchmark

I tried to be as careful as possible, if I have done something bogus, please let me know. So, here it is:

#ifdef __GNUC__
#define MARK(x) do{asm volatile ("# "#x);}while(0)
#define NO_OPTIMIZE_AWAY() do{asm("");}while(0)
#else
#error "dies"
#endif
 
#include <ctime>
#include <iostream>
#include <cmath>
 
struct StaticNoInline {
    float dist (float const volatile *lhs, float const volatile *rhs)
     __attribute__((noinline)) {
        float const diff [] = {
            rhs [0] - lhs [0],
            rhs [1] - lhs [1],
            rhs [2] - lhs [2],
        };
        float const lenSq = diff [0]*diff [0] + diff [1]*diff [1]
                                                + diff [2]*diff [2];
        return sqrt (lenSq);
    }
};
 
struct StaticForceInline {
    float dist (float const volatile *lhs, float const volatile *rhs)
     __attribute__((forceinline)) {
        float const diff [] = {
            rhs [0] - lhs [0],
            rhs [1] - lhs [1],
            rhs [2] - lhs [2],
        };
        float const lenSq = diff [0]*diff [0] + diff [1]*diff [1]
                                                + diff [2]*diff [2];
        return sqrt (lenSq);
    }
};
 
struct IVirtual {
    virtual float dist (float const volatile *lhs, float const volatile *rhs)
        __attribute__((noinline)) = 0;
};
 
struct Virtual : public IVirtual {
    float dist (float const volatile *lhs, float const volatile *rhs)
     __attribute__((noinline)) {
        float const diff [] = {
            rhs [0] - lhs [0],
            rhs [1] - lhs [1],
            rhs [2] - lhs [2],
        };
        float const lenSq = diff [0]*diff [0] + diff [1]*diff [1]
                                                + diff [2]*diff [2];
        return sqrt (lenSq);
    }
};
 
template <typename T, typename I, uint32_t count>
void test () {
    static volatile float dist;
    static volatile float lhs [3];
    static volatile float rhs [3];
 
    ::std::cout << "Beginning test for " << typeid (T).name()
                << ", count is " << count
                << 'n';
 
    I &subject = *new T ();
    clock_t const beginT = clock();
    MARK("entering loop");
    for (uint32_t i=0; i<count; ++i) {
        dist = subject.dist (lhs, rhs);
    }
    MARK("left loop");
    clock_t const endT = clock();
    delete &subject;
    ::std::cout << "Test ended, total time " << endT - beginT << "msec.n";
}
 
int main () {
    const uint32_t count = 1<<27;
    test<Virtual, IVirtual, count> ();
    test<StaticNoInline, StaticNoInline, count> ();
    test<StaticForceInline, StaticForceInline, count> ();
    // Re-execute to take care of CPU heat and priority switches
    test<Virtual, IVirtual, count> ();
    test<StaticNoInline, StaticNoInline, count> ();
    test<StaticForceInline, StaticForceInline, count> ();
    ::std::cout << ::std::flush;
}

Basically, this benchmark calculates many million vector3-lengths.

Results

On a Pentium(R) M, 1.8GHz, the program gives me really debunking numbers.

stdout of benchmark

  Beginning test for 7Virtual, count is 134217728
  Test ended, total time 5828msec.
  Beginning test for 14StaticNoInline, count is 134217728
  Test ended, total time 5839msec.
  Beginning test for 17StaticForceInline, count is 134217728
  Test ended, total time 5118msec.
  Beginning test for 7Virtual, count is 134217728
  Test ended, total time 5798msec.
  Beginning test for 14StaticNoInline, count is 134217728
  Test ended, total time 5809msec.
  Beginning test for 17StaticForceInline, count is 134217728
  Test ended, total time 5097msec.

interpretation

clock() is not too exact (especially on Windows(R)), so I ran it for over 5 seconds (134,217,728 loops). In more readable form, the results are:

  • virtual: 5.828 sec, 5.798 sec = 5.813 sec on average
  • static, not inlined: 5.839 sec, 5.809 sec = 5.824 sec on average
  • static, inlined: 5.118 sec, 5.097 sec = 5.1075 sec on average

The static and the virtual variant run at nearly the same speed (roughly 5.8185 sec on average), the inlined version runs in only roughly 87.78 % of the time.

Upshot

Certainly, inlining helps with a bunch of other optimizations, like auto-vectorisation or re-use of data, plus it increases locality, but has the disadvantage of potential code-bloat.

Also, at the time where the compiler decides to not inline a function, the performance difference between virtual and non-virtual decreases (w.r.t. several parameters and configurations), especially in tight loops. So if it would be a great benefit for your application to use virtual functions, then you might decide to use them.

Example for a “great benefit” of virtual functions

Just to give an example (blatant self promotion, sorry), my ray tracer picogen uses virtual functions for one of its shader systems (basically an executable abstract syntax tree; the performance drop is not too big as the shader is only called for definitive correct intersections, where in the meanwhile many other operations happened), plus it uses virtual functions for its intersectable objects (which include a quadtree, a BIH, a kd-tree oops that was the ray tracer I wrote before that, sorry, spheres of course, a cube, a simple DDA based heightmap, and some hacks where I tried things out). The general BIH is an aggregate that can house any other type of Intersectable, including specific and general BIH. Also, I implemented a Whitted-style Ray Tracer and a Path Tracer (called “Surface Integrators”, like pbrt calls it), which definitely are called for each pixel, even if there is no intersection at all.

Without virtual functions, this all would have been far clumsier, and not necessarily faster. Had I organized those objects in multiple lists, code could have even run slower than with virtual functions.

]]>
http://phresnel.org/blog/2009/02/virtual-functions-considered-not-harmful-part-ii/feed/ 0
Virtual Functions Considered Not Harmful, Part I http://phresnel.org/blog/2009/02/virtual-functions-considered-not-harmful-part-i/ http://phresnel.org/blog/2009/02/virtual-functions-considered-not-harmful-part-i/#comments Fri, 20 Feb 2009 12:52:15 +0000 phresnel http://phresnel.org/blog/?p=51 Continue reading ]]> Every now and then, the use of virtual functions gets questioned when it comes to runtime performance. I did a bit of work to find the (halfway) truth …

Recently on gamedev.net
Original post by NotAYakk
Note that implementing a faster/leaner virtual function object system in C++ is possible, but (A) it isn’t worth it in most every case, and (B) it isn’t easy to get right.


Did he now mean a virtual-function object system or a virtual function-object system? Anyways, I could guess what he means.

I don’t believe it is possible to code something that is faster and leaner and at the same time as mighty as virtual functions. At least not in valid C++, because there I already found something fast and lean, as follows in the next paragraphs. Anyways, feel free to slap the bitch with actual material, if you find some!

Original post by NotAYakk
Virtual function overhead requires something done on a per-frame and per-pixel basis. Ie, if on every frame, you call the virtual function once per pixel, then you are starting to look at the the point where you should worry about virtual function overhead.

Before that (or other similar cases), it isn’t a concern.

Calling virtual functions on per pixel per frame basis doesn’t mean anything. Actually, virtual functions are not that bad compared to ordinary, non-inlined calls, when they actually contain code. For example, if you do some dot-products, square roots, Reinhard operators, and the usual stuff one does at per-pixel level, then the cost gets negligible.

Let’s ask Dr. GCC about the cost of virtual function calls

Preparation

#include <cstdio>
 
#ifdef __GNUC__
#define MARK(x) do{asm volatile ("# "#x);}while(0)
#define NO_OPTIMIZE_AWAY() do{asm("");}while(0)
#else
#error "dies"
#endif
 
struct Interface {
    virtual int fun() const = 0;
    virtual int fun2 (int, int) const = 0;
    virtual float dot3 (const float *, const float *) const = 0;
};
 
struct Foo : public Interface {
    int const ff;
    Foo (int ff) : ff (ff) {}
 
    int fun() const {
        return ff;
    }
 
    int fun2 (int, int) const {
        return ff;
    }
 
    float dot3 (const float *lhs, const float *rhs) const {
        return lhs[0]*rhs[0] + lhs[1]*rhs[1] + lhs[2]*rhs[2];
    }
};
 
struct Oof : public Foo {
    int const ff;
    Oof (int ff) : Foo (0), ff (ff) {}
 
    int fun() const {
        return ff;
    }
};
 
struct Bar {
    int const ff;
    Bar (int ff) : ff (ff) {}
    int fun() const { return ff; }
    int fun2 (int, int) const {
        return ff;
    }
    float dot3 (const float *lhs, const float *rhs) const {
        return lhs[0]*rhs[0] + lhs[1]*rhs[1] + lhs[2]*rhs[2];
    }
};


Tests

int main () {
    int entropy;
    scanf ("", &entropy);
    Interface &foo = *new Foo(entropy);
    MARK("Foo starts here ... ");
    entropy = foo.fun();
    MARK("... and here it ends.");
    printf ("", entropy);
    delete &foo;
 
    scanf ("", &entropy);
    Interface &oof = *new Oof(entropy);
    MARK("Oof starts here ... ");
    entropy = oof.fun();
    MARK("... and here it ends.");
    printf ("", entropy);
    delete &oof;
 
    scanf ("", &entropy);
    Bar &bar = *new Bar(entropy);
    MARK("Bar starts here ... ");
    entropy = bar.fun();
    MARK("... and here it ends.");
    printf ("", entropy);
    delete &bar;
}

Results

# call to Foo::fun()
	movl	-12(%ebp), %eax
	movl	(%eax), %edx
	movl	-12(%ebp), %eax
	movl	%eax, (%esp)
	movl	(%edx), %eax
	call	*%eax
	movl	%eax, -8(%ebp)
 
# call to Oof::fun()
	movl	-16(%ebp), %eax
	movl	(%eax), %edx
	movl	-16(%ebp), %eax
	movl	%eax, (%esp)
	movl	(%edx), %eax
	call	*%eax
	movl	%eax, -8(%ebp)
 
# call to Bar::fun()
	movl	-20(%ebp), %eax
	movl	%eax, (%esp)
	call	__ZNK3Bar3funEv
	movl	%eax, -8(%ebp)


We see that the op-count grew from 4 to 7 (increased by a factor of 1.75). The costs for the pure call will partially vanish with an increasing number of operands.

Adding Parameters

Let’s add another test-case (you can copy this into the function main() if you’re trying yourself):

*snip*
int main () {
    *snip*
    int entropy2, entropy3;
    scanf ("", &entropy, &entropy2, &entropy3);
    Interface &foo2 = *new Foo(entropy);
    MARK("Foo2 starts here ... ");
    entropy = foo2.fun2 (entropy2,entropy3);
    MARK("... and here it ends.");
    printf ("", entropy, entropy2, entropy3);
    delete &foo2;
 
    scanf ("", &entropy, &entropy2, &entropy3);
    Bar &bar2 = *new Bar(entropy);
    MARK("Bar2 starts here ... ");
    entropy = bar2.fun2 (entropy2,entropy3);
    MARK("... and here it ends.");
    printf ("", entropy, entropy2, entropy3);
    delete &bar2;
}


The results for this:

# Foo::fun2(int,int)
	movl	-32(%ebp), %eax
	movl	(%eax), %edx
	addl	$4, %edx
	movl	-28(%ebp), %eax
	movl	%eax, 8(%esp)
	movl	-24(%ebp), %eax
	movl	%eax, 4(%esp)
	movl	-32(%ebp), %eax
	movl	%eax, (%esp)
	movl	(%edx), %eax
	call	*%eax
	movl	%eax, -8(%ebp)
# Bar::fun2(int,int)
	movl	-28(%ebp), %eax
	movl	%eax, 8(%esp)
	movl	-24(%ebp), %eax
	movl	%eax, 4(%esp)
	movl	-36(%ebp), %eax
	movl	%eax, (%esp)
	call	__ZNK3Bar4fun2Eii
	movl	%eax, -8(%ebp)


This time, it increased by 1.5 (12 ops virtual, 8 ops non-virtual).

Adding code at the target site

This was just the call! Now we’ll add upp the code of the functions themselves. The function definitions look all the same, so I just show one of them:

__ZNK3Bar4fun2Eii:
	pushl	%ebp
	movl	%esp, %ebp
	movl	8(%ebp), %eax
	movl	(%eax), %eax
	popl	%ebp
	ret


We see that to the call-ops, we have to add 6 more ops, so our previous growage of 1.75 decreases to (7+6) / (4+6) = 1.3, and 1.5 decreaes to (12+6) / (8+6) = approx. 1.29.

Adding actual code to the target site

Now let’s have a look at the assembly produced for our dot-product-functions. Remember, they looked as simple as:

float dot3 (const float *lhs, const float *rhs) const {
        return lhs[0]*rhs[0] + lhs[1]*rhs[1] + lhs[2]*rhs[2];
}

The assembly for all versions again look the same, so here is just one of them:

pushl	%ebp
	movl	%esp, %ebp
	movl	12(%ebp), %eax
	movl	16(%ebp), %edx
	flds	(%eax)
	fmuls	(%edx)
	movl	12(%ebp), %eax
	addl	$4, %eax
	movl	16(%ebp), %edx
	addl	$4, %edx
	flds	(%eax)
	fmuls	(%edx)
	faddp	%st, %st(1)
	movl	12(%ebp), %eax
	addl	$8, %eax
	movl	16(%ebp), %edx
	addl	$8, %edx
	flds	(%eax)
	fmuls	(%edx)
	faddp	%st, %st(1)
	popl	%ebp
	ret

And we find 22 instructions, for a simple dot-product. Our additional cost is now:

  • (12+22) / (8+22) = approx. 1.1333

So instead of 40 frames per second, you get “only” 36, but you get many benefits. Also, just add some more operations, and the relative overhead further decreases. And the blanket statement “virtual functions at per pixel per frame level considered harmful” (not sic!) vanishes.

We learn today (how pathetic :D)…

that the cost (*) of calling a virtual function compared to a non-inlined non-virtual function vanishes with the number of arguments you pass to the function, the complexity of what is inside the function and with the number of return values (in C/C++ either 0 or 1)

(*): Actually with respect to size/op-count, as we haven’t done any benchmark yet, but see part 2.

sidenote:

Quote:

No, don’t do epic switch statements.


But sometimes, only sometimes, epic switch statements are the right thing. For example if you write a performant software based virtual machine, with many trivial micro operations (e.g. add or mul), then you want jump tables. And jump tables you generally only get with epic switch statements, or by relying on pointers-to-labels (GCC has support for them, but they are not standard).

]]>
http://phresnel.org/blog/2009/02/virtual-functions-considered-not-harmful-part-i/feed/ 2