Month: April 2007

  • Performance of blurring a photo

    I’ve been working on the problem of blurring an image. Take, for example, this picture that I borrowed from Flickr:
    [Images: original photo and blurred result]

    The processing that I need to do on the image is as follows:

    • Convert to grayscale
    • Apply a Gaussian blur
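
    For concreteness, here is a minimal sketch of the kind of computation involved, written as a separable Gaussian blur over a grayscale image stored in a float array. The radius and sigma here are placeholder values, not the exact filter (122 contributing pixels per output pixel) behind the timings below:

    // Sketch only: separable Gaussian blur of a grayscale image held in a float[].
    // Requires 'using System;' for Math. Radius/sigma are illustrative values.
    static float[] GaussianBlur(float[] src, int width, int height,
                                float sigma, int radius)
    {
        // Build the 1-D kernel and normalize it so the weights sum to 1.
        float[] k = new float[2 * radius + 1];
        float sum = 0;
        for (int i = -radius; i <= radius; ++i)
        {
            k[i + radius] = (float)Math.Exp(-(i * i) / (2.0 * sigma * sigma));
            sum += k[i + radius];
        }
        for (int i = 0; i < k.Length; ++i)
            k[i] /= sum;

        float[] tmp = new float[src.Length];
        float[] dst = new float[src.Length];

        // Horizontal pass (edge pixels are clamped to the nearest valid column).
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x)
            {
                float acc = 0;
                for (int i = -radius; i <= radius; ++i)
                {
                    int xx = Math.Min(Math.Max(x + i, 0), width - 1);
                    acc += k[i + radius] * src[y * width + xx];
                }
                tmp[y * width + x] = acc;
            }

        // Vertical pass over the intermediate image.
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x)
            {
                float acc = 0;
                for (int i = -radius; i <= radius; ++i)
                {
                    int yy = Math.Min(Math.Max(y + i, 0), height - 1);
                    acc += k[i + radius] * tmp[yy * width + x];
                }
                dst[y * width + x] = acc;
            }

        return dst;
    }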

    The challenge is that the image was originally 3037×2007, or 6.1 megapixels. To do the blurring, each final pixel ends up being determined by a weighted average of 122 other pixels. So, that’s 0.7 billion floating-point (double) multiplications and adds. I split the work across two threads and tried to speed the process up as much as possible. Here are some timing results for doing this operation:

    • 1607 ms, unmanaged C++ single precision (float) with /fp:precise flag
    • 1431 ms, C# using fixed pointers
    • 1270 ms, managed C++, identical to the C# code
    • 938 ms, unmanaged C++ with SSE2 compiler flag
    • 845 ms, unmanaged with SSE2 & compiled-in filter size
    • 801 ms, unmanaged C++ with default compile options
    • 800 ms, unmanaged C++ with defaults & compiled-in filter size
    • 765 ms, unmanaged C++ single precision (float) with /fp:fast flag
    • 490 ms, C# with managed Direct3D on Radeon X1600
    • 171 ms, same as above with XFX GeForce 7900GS 600M (thanks Tony)
    • 140 ms, C# with XNA on XFX GeForce 7900GS 600M on WinXP

    As you can see, there is a pretty big jump (about 35%) between managed and unmanaged code, but I was surprised that the SSE2 optimizations actually slowed things down, apparently because the compiler does a poor job with them. Also striking was the effect of the floating-point precision mode. In the C++ compiler, you can select one of several floating-point models. If you don’t select /fp:fast, the compiler appears to do everything in double (64-bit) precision and then cast each result back to single (32-bit) precision before storing it. As a result, the single-precision version of the code under /fp:precise actually took the longest of all!

    However, it seems the best way to do this is to use a hardware-accelerated 3D graphics card. The method I used involved writing pixel shaders, which are tiny programs that execute on the graphics processor. It is quite a pain: I ended up making a function that writes the pixel shader source, which in turn gets compiled by DirectX on the fly and sent to the graphics card. Fortunately, all of that happens pretty quickly. Timings include copying the blurred result back into system memory (via PCI Express). At the extreme end, the GeForce card on Windows XP runs the operation 25 times faster than my Pentium D 2.0GHz. This was also my first pixel shader program, so I suspect it could be made more efficient. The other benefit is that the main CPU is free to do whatever it wants (say, blur another image!) in the meantime.
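
    To give an idea of what “a function which writes a pixel shader” looks like, here is a rough sketch that just builds the HLSL source for a one-dimensional blur pass as a string. The function name, sampler name, and parameters are placeholders of mine; the call that actually compiles the string differs between Managed DirectX and XNA, so it is omitted here:

    // Sketch only: emit HLSL for a horizontal blur pass with hard-coded weights.
    // Each tap becomes one literal line of shader code, which the shader compiler
    // can then constant-fold when it compiles the string at run time.
    static string BuildBlurShaderSource(float[] weights, float texelWidth)
    {
        System.Text.StringBuilder sb = new System.Text.StringBuilder();
        sb.AppendLine("sampler2D srcImage : register(s0);");
        sb.AppendLine("float4 main(float2 uv : TEXCOORD0) : COLOR0");
        sb.AppendLine("{");
        sb.AppendLine("    float4 acc = 0;");

        int radius = (weights.Length - 1) / 2;
        for (int i = -radius; i <= radius; ++i)
        {
            sb.AppendFormat("    acc += {0} * tex2D(srcImage, uv + float2({1}, 0));",
                            weights[i + radius], i * texelWidth);
            sb.AppendLine();
        }

        sb.AppendLine("    return acc;");
        sb.AppendLine("}");
        return sb.ToString();
    }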

    Vista versus Windows XP: The comparison between the GeForce card on Vista and then on XP originally seemed like proof that Vista and/or its drivers were terrible for this application, but it turned out there were incompatibilities when the program ran on XP, and the output wasn’t correct. After reworking the application to use XNA instead of Managed DirectX, the erroneous output went away, and we were left with only a modest improvement going from Vista to XP. Hopefully that remaining gap can be addressed with driver improvements.

    [Image: difference between the CPU and GPU blurring results]

    Accuracy with the graphics card: To use the graphics card, you have to be very careful to align pixels to the fake “3D” scene, or the GPU will take the liberty of interpolating between two neighboring pixels. I ran a pixel-to-pixel comparison between the C# and GPU output. In the center of the image (white region), pixel values differed by at most 0.00038% (less than 1/65536 of a gray level). But near the edges, some pixels were off by about 10%! I’m not surprised, because the pixel shader just blindly addressed texture values right off the edges of the image. I was hoping the GPU would magically do the “right” thing when tiling textures next to each other; no such luck. So the conclusion is that if you demand precise values near the edges of your images, you’ll need to either pad them appropriately before running the pixel shader, or make sure you understand exactly how the adjacent textures will be indexed.
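
    As an aside, the padding itself is cheap to do on the CPU before uploading the texture. A minimal sketch of replicate-padding (copying the border pixels outward by the kernel radius) might look like the following; the function and parameter names are mine, not from the actual program:

    // Sketch only: grow the image by 'pad' pixels on each side, replicating the
    // border pixels so the shader never samples past the real image data.
    // Requires 'using System;' for Math.
    static float[] PadReplicate(float[] src, int width, int height, int pad,
                                out int paddedWidth, out int paddedHeight)
    {
        paddedWidth = width + 2 * pad;
        paddedHeight = height + 2 * pad;
        float[] dst = new float[paddedWidth * paddedHeight];

        for (int y = 0; y < paddedHeight; ++y)
        {
            int sy = Math.Min(Math.Max(y - pad, 0), height - 1);
            for (int x = 0; x < paddedWidth; ++x)
            {
                int sx = Math.Min(Math.Max(x - pad, 0), width - 1);
                dst[y * paddedWidth + x] = src[sy * width + sx];
            }
        }

        return dst;
    }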

    Other things I might try:

    • Use 32-bit floating points because SSE2 can multiply 4 of those at once (vs. 2 64-bit doubles)
    • Write manual SSE2 assembly
    • Utilize DirectX and a hardware-accelerated pixel shader to do the job
    • Some better idea that you have…?

    Which one do you think I should try next?

  • Loop performance in C#

    Lately I’ve been attempting to perform arithmetic-heavy computations with C#, and
    I encountered a somewhat surprising detail about loop performance. Computing the
    squared distance (Euclidean) between two vectors, I discovered that using a variant
    of loop unrolling actually improves performance.

    Check it out. I ran the following two functions 25 million times each for random
    64-dimensional arrays, under ‘Release’ mode in Visual Studio 2005 on a 2GHz Core
    Duo.
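
    If you want to reproduce the setup, a harness along the following lines will do. This is just a sketch (the seed, warm-up, and how the result is kept alive are unimportant details), with Distance being one of the two listings that follow, pasted in as a static method:

    // Rough timing harness: 25 million calls on fixed 64-element arrays.
    using System;
    using System.Diagnostics;

    static class Benchmark
    {
        static void Main()
        {
            Random rand = new Random(1);
            int[] a = new int[64], b = new int[64];
            for (int i = 0; i < 64; ++i) { a[i] = rand.Next(256); b[i] = rand.Next(256); }

            Stopwatch sw = Stopwatch.StartNew();
            int checksum = 0;   // accumulate results so the JIT can't discard the calls
            for (int run = 0; run < 25000000; ++run)
                checksum += Distance(a, b);   // Distance = one of the listings below
            sw.Stop();

            Console.WriteLine("{0} ms (checksum {1})", sw.ElapsedMilliseconds, checksum);
        }
    }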

    Using this loop, the full operation takes about 10.0 seconds:


    int Distance(int[] a, int[] b)
    {
        int d_sum = 0;
        int d = 0;

        for (int i = 0; i < a.Length; ++i)
        {
            d = a[i] - b[i];
            d_sum += (d * d);
        }

        return d_sum;
    }

    Whereas this loop takes about 7.9 seconds!?:


    int Distance(int[] a, int[] b)
    {
        int d_sum = 0;
        int d1 = 0, d2 = 0, d3 = 0, d4 = 0;

        // Note: this version assumes a.Length is a multiple of 4 (here it is 64).
        for (int i = 0; i < a.Length; ++i)
        {
            d1 = a[i] - b[i]; ++i;
            d2 = a[i] - b[i]; ++i;
            d3 = a[i] - b[i]; ++i;
            d4 = a[i] - b[i];

            d_sum += (d1 * d1) + (d2 * d2) + (d3 * d3) + (d4 * d4);
        }

        return d_sum;
    }

    This is quite puzzling to me. Okay, so the loop unrolling
    reduces the branching rate, but this branch should (almost) always be predicted
    correctly, and, if unrolling makes sense at a lower level, this is one of those
    cases where the compiler should definitely optimize things on its own. (I even tried
    explicitly specifying ‘i < 64’ for the loop to increase the odds of this… no
    improvement.) I’m not sure what’s going on here, and I guess the next step would
    be to look at the IL code generated by the C# compiler.

    Other interesting points:

    • Using a temporary value for ‘a.Length’ doesn’t speed anything up. Apparently the
      compiler is able to notice that the loop has no chance of affecting the array’s
      size.
    • Explicitly inlining this function inside of the outer loop (5,000 runs) actually
      slows performance back down to the 10 sec range!
    • Switching this loop to use fixed unsafe pointers doesn’t improve performance either
      (a sketch of the pointer variant appears after this list). Why? Either fixing the
      pointer takes too much time, or there are some fancy low-level optimizations that
      are helping loop #2 which aren’t being applied to the unsafe code.
    • Software pipelining (moving the use of d1, d2, … until the next loop iteration)
      doesn’t help, either.
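
    For reference, the fixed-pointer variant mentioned in the third point was roughly the following sketch (it needs to be compiled with /unsafe):

    // Sketch of the unsafe variant: pin both arrays and index through raw pointers.
    unsafe int Distance(int[] a, int[] b)
    {
        int d_sum = 0;

        // 'fixed' pins the arrays so the GC cannot move them while we hold pointers.
        fixed (int* pa = a, pb = b)
        {
            int n = a.Length;
            for (int i = 0; i < n; ++i)
            {
                int d = pa[i] - pb[i];
                d_sum += d * d;
            }
        }

        return d_sum;
    }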