April 11, 2007

  • Loop performance in C#

    Lately I’ve been attempting to perform arithmetic-heavy computations with C#, and
    I encountered a somewhat surprising detail about loop performance. Computing the
    squared Euclidean distance between two vectors, I discovered that a variant of
    loop unrolling actually improves performance.

    Check it out. I ran the following two functions 25 million times each for random
    64-dimensional arrays, under ‘Release’ mode in Visual Studio 2005 on a 2GHz Core
    Duo.
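
    For context, the harness looked roughly like this (a sketch, not my exact code: the
    class name, seed, and the checksum trick are mine; the checksum just keeps the JIT
    from discarding the calls):

```csharp
using System;
using System.Diagnostics;

class Bench
{
    // Squared Euclidean distance between two int vectors.
    static int Distance(int[] a, int[] b)
    {
        int d_sum = 0;
        for (int i = 0; i < a.Length; ++i)
        {
            int d = a[i] - b[i];
            d_sum += d * d;
        }
        return d_sum;
    }

    static void Main()
    {
        Random rand = new Random(42);            // fixed seed for repeatability
        int[] a = new int[64], b = new int[64];
        for (int i = 0; i < 64; ++i) { a[i] = rand.Next(100); b[i] = rand.Next(100); }

        Stopwatch sw = Stopwatch.StartNew();
        long checksum = 0;                       // consume results so the JIT can't skip the work
        for (int run = 0; run < 25000000; ++run)
            checksum += Distance(a, b);
        sw.Stop();
        Console.WriteLine("{0} ms (checksum {1})", sw.ElapsedMilliseconds, checksum);
    }
}
```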

    Using this loop, the full operation takes about 10.0 seconds:


    int Distance(int[] a, int[] b)
    {
        int d_sum = 0;
        int d = 0;

        for (int i = 0; i < a.Length; ++i)
        {
            d = a[i] - b[i];
            d_sum += (d * d);
        }

        return d_sum;
    }

    Whereas this loop takes about 7.9 seconds!?:


    // Note: this assumes a.Length is a multiple of 4 (it's 64 here).
    int Distance(int[] a, int[] b)
    {
        int d_sum = 0;
        int d1 = 0, d2 = 0, d3 = 0, d4 = 0;

        for (int i = 0; i < a.Length; ++i)
        {
            d1 = a[i] - b[i]; ++i;
            d2 = a[i] - b[i]; ++i;
            d3 = a[i] - b[i]; ++i;
            d4 = a[i] - b[i];

            d_sum += (d1 * d1) + (d2 * d2) + (d3 * d3) + (d4 * d4);
        }

        return d_sum;
    }

    This is quite puzzling to me. Okay, so the loop unrolling
    reduces the branching rate, but this branch should (almost) always be predicted
    correctly, and, if unrolling makes sense at a lower level, this is one of those
    cases where the compiler should definitely optimize things on its own. (I even tried
    explicitly specifying ‘i < 64’ for the loop to increase the odds of this… no
    improvement.) I’m not sure what’s going on here, and I guess the next step would
    be to look at the IL code generated by the C# compiler.

    Other interesting points:

    • Using a temporary value for ‘a.Length’ doesn’t speed anything up. Apparently the
      compiler is able to notice that the loop has no chance of affecting the array’s
      size.
    • Explicitly inlining this function inside the outer loop (5,000 runs) actually
      slows performance back down to the 10-second range!
    • Switching this loop to use fixed unsafe pointers doesn’t improve performance either.
      Why? Either fixing the pointer takes too much time, or there are some fancy low-level
      optimizations that are helping loop #2 which aren’t being applied to the unsafe
      code.
    • Software pipelining (moving the use of d1, d2, … until the next loop iteration)
      doesn’t help, either.
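
    For reference, the fixed/unsafe variant I tried was shaped roughly like this (my
    reconstruction, not the original code; it needs the /unsafe compiler switch):

```csharp
// Compile with: csc /o+ /unsafe DistanceUnsafe.cs
using System;

class DistanceUnsafeDemo
{
    static unsafe int Distance(int[] a, int[] b)
    {
        int d_sum = 0;
        // 'fixed' pins both arrays so the GC can't move them while we hold raw pointers.
        fixed (int* pa = a, pb = b)
        {
            for (int i = 0; i < a.Length; ++i)
            {
                int d = pa[i] - pb[i];   // pointer indexing skips array bounds checks
                d_sum += d * d;
            }
        }
        return d_sum;
    }

    static void Main()
    {
        int[] a = { 0, 3 };
        int[] b = { 4, 0 };
        Console.WriteLine(Distance(a, b));   // squared distance: 16 + 9 = 25
    }
}
```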

Comments (5)

  • That is fascinating. I’d be curious to see the IL code.

    You know, if anyone has covered this, it’d have to be Rico Mariani. And sure enough, with a little help from Google:

    http://blogs.msdn.com/ricom/archive/2003/12/02/40777.aspx

  • You shouldn’t run performance tests from Visual Studio, not even in Release mode; the debugger turns off some IL optimizations.
    Run from the command line.
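
    For example (assuming a VS 2005 command prompt so csc.exe is on the path; the
    file name is hypothetical):

```shell
# Build with optimizations on, then run the exe directly (no debugger attached)
csc /o+ Distance.cs
Distance.exe
```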

  • Thanks for the code; it’s an excellent result.

    But please, I need to ask:

    since it computes the Euclidean distance, where is the square root of d_sum?

    And if we compute the square root, should the return type still be an integer? It could be double, but do I have to return it as an integer?

  • @mety2004 - you would want to return a double, since the square root won’t be an integer. I didn’t take the square root because you can often use squared distances directly in calculations. Square root can be quite slow if called often.
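
    In other words, the full-distance version would look something like this (a sketch;
    Math.Sqrt returns a double, so the method does too):

```csharp
using System;

class EuclideanExample
{
    // True Euclidean distance: returns double, since the root of an int sum is rarely integral.
    static double EuclideanDistance(int[] a, int[] b)
    {
        int d_sum = 0;
        for (int i = 0; i < a.Length; ++i)
        {
            int d = a[i] - b[i];
            d_sum += d * d;
        }
        return Math.Sqrt(d_sum);
    }

    static void Main()
    {
        // sqrt(3*3 + 4*4) = sqrt(25) = 5
        Console.WriteLine(EuclideanDistance(new int[] { 0, 0 }, new int[] { 3, 4 }));
    }
}
```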

