I’ve been working on the problem of blurring an image. Take, for example, this picture that I borrowed from flickr:

The processing that I need to do on the image is as follows:
- Convert to grayscale (a rough sketch of this step follows the list)
- Apply a Gaussian blur
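For the first step, here is a minimal sketch of what the grayscale pass might look like in C#. The post doesn’t say how the conversion was actually done, so the luminance weights and the packed 24-bit RGB layout below are assumptions:

```csharp
static class GrayscaleSketch
{
    // Hypothetical grayscale pass: the weights and packed-RGB layout are
    // assumptions, not taken from the original code.
    public static float[] ToGrayscale(byte[] rgb, int width, int height)
    {
        float[] gray = new float[width * height];
        for (int i = 0; i < gray.Length; i++)
        {
            int p = i * 3; // three bytes per pixel
            gray[i] = 0.299f * rgb[p] + 0.587f * rgb[p + 1] + 0.114f * rgb[p + 2];
        }
        return gray;
    }
}
```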
The challenge is that the image was originally 3037×2007, or 6.1 megapixels. To do the blurring, each final pixel ends up being determined by a weighted average of 122 other pixels, so that’s roughly 0.7 billion floating-point (double) multiplications and adds. I split the work across two threads and tried to speed the process up as much as possible. Here are some timing results for doing this operation:
- 1607 ms, unmanaged C++ single precision (float) with /fp:precise flag
- 1431 ms, C# using fixed pointers
- 1270 ms, managed C++, identical to the C# code
- 938 ms, unmanaged C++ with SSE2 compiler flag
- 845 ms, unmanaged with SSE2 & compiled-in filter size
- 801 ms, unmanaged C++ with default compile options
- 800 ms, unmanaged C++ with defaults & compiled-in filter size
- 765 ms, unmanaged C++ single precision (float) with /fp:fast flag
- 490 ms, C# with managed Direct3D on Radeon X1600
- 171 ms, same as above with XFX GeForce 7900GS 600M (thanks Tony)
- 140 ms, C# with XNA on XFX GeForce 7900GS 600M on WinXP
As you can see, there is a pretty big reward jump (35%) between managed and unmanaged code, but I was surprised that the SSE2 optimizations actually slowed things down, apparently because the compiler does a poor job with them. Also striking was the effect of the floating-point precision mode. The C++ compiler lets you pick one of several floating-point models, and unless you select /fp:fast it appears to do all of the arithmetic in double (64-bit) precision and then cast each result back to single (32-bit) precision before storing it. As a result, the single-precision version of the code actually took the longest of all!
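For reference, the CPU inner loop in the C# fixed-pointer version looks roughly like the sketch below (6.1 million pixels × 122 weighted taps each is where the ~0.7 billion multiply-adds come from). This is only a sketch, not the actual code: it assumes the blur is done as a separable horizontal pass over a single-precision grayscale buffer with a precomputed kernel, and it skips the edge columns entirely.

```csharp
static class BlurSketch
{
    // One horizontal pass of a separable blur using fixed pointers.
    // Single-precision buffers and kernel are assumed; edge columns skipped.
    public static unsafe void BlurRows(float[] src, float[] dst,
                                       int width, int height, float[] kernel)
    {
        int radius = kernel.Length / 2;
        fixed (float* pSrc = src, pDst = dst, pK = kernel)
        {
            for (int y = 0; y < height; y++)
            {
                float* row = pSrc + y * width;
                float* outRow = pDst + y * width;
                for (int x = radius; x < width - radius; x++)
                {
                    float sum = 0f;
                    for (int k = -radius; k <= radius; k++)
                        sum += pK[k + radius] * row[x + k]; // weighted average
                    outRow[x] = sum;
                }
            }
        }
    }
}
```

Splitting the outer y loop between the two threads gives the two-thread arrangement described above.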
However, it seems the best way to do this is to use a hardware-accelerated 3D graphics card. The method I used involved writing pixel shaders, which are tiny programs that execute on the graphics processor. It is quite a pain: I ended up making a function that writes a pixel shader, which in turn gets compiled by DirectX on the fly and sent to the graphics card. Fortunately, all of that happens pretty quickly. The timings include copying the blurred result back into system memory (over PCI Express). At the extreme end, the GeForce card on Windows XP runs the operation 25 times faster than my Pentium D 2.0 GHz. This was also my first pixel shader program, so I suspect it could be made more efficient. The other benefit is that the main CPU is free to do whatever it wants (say, blur another image!!) in the meantime.
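To make the “function that writes a pixel shader” idea concrete, here is a sketch of what such a generator might emit. This is not the author’s code: the unrolled taps, the single horizontal pass, and the shader entry point are all assumptions, and the step where DirectX compiles the string and hands it to the card is left out.

```csharp
using System.Globalization;
using System.Text;

static class ShaderSketch
{
    // Hypothetical generator: bakes the kernel weights and texel offsets
    // directly into HLSL source for one horizontal pass over the image.
    public static string BuildBlurShaderSource(float[] weights, float texelWidth)
    {
        StringBuilder sb = new StringBuilder();
        sb.AppendLine("sampler2D img : register(s0);");
        sb.AppendLine("float4 main(float2 uv : TEXCOORD0) : COLOR0 {");
        sb.AppendLine("  float sum = 0;");
        int radius = weights.Length / 2;
        for (int k = -radius; k <= radius; k++)
        {
            // One weighted texture tap per kernel entry, unrolled into the source.
            sb.AppendLine(string.Format(CultureInfo.InvariantCulture,
                "  sum += {0} * tex2D(img, uv + float2({1}, 0)).r;",
                weights[k + radius], k * texelWidth));
        }
        sb.AppendLine("  return float4(sum, sum, sum, 1);");
        sb.AppendLine("}");
        return sb.ToString();
    }
}
```

The generated string then goes through the normal on-the-fly compilation path described above before being set on the device.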
Vista versus Windows XP. The comparison of the GeForce card on Vista versus on XP originally looked like proof that Vista and/or the Vista drivers were terrible for this application, but it turned out the program had incompatibilities when run on XP and the output wasn’t correct. After reworking the application to run under XNA instead of Managed DirectX, the erroneous output went away, and we were left with a modest improvement going from Vista to XP. Hopefully the remaining gap can be addressed with driver improvements.
Difference between GPU and CPU blurring
Accuracy with the graphics card: To use the graphics card, you have to be very careful to align pixels to the fake “3D” scene, or the GPU will take the liberty of interpolating between two neighboring pixels. I ran a pixel-to-pixel comparison between the C# and GPU output. In the center of the image (the white region), pixel values differed by at most 0.00038% (less than 1/65536 gray levels). Near the edges, though, some pixels were off by about 10%! I’m not surprised, because the pixel shader just blindly addressed texture values right off the edges of the image. I was hoping the GPU would magically do the “right” thing when tiling textures next to each other; no such luck. So the conclusion is that if you demand precise values near the edges of your images, you’ll need to either pad them appropriately before executing a pixel shader, or make sure you understand exactly how the adjacent textures will be indexed.
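For the padding route, here is a minimal sketch, assuming replicate-style padding by the kernel radius (the post doesn’t specify a scheme; mirroring the border would work just as well):

```csharp
using System;

static class PaddingSketch
{
    // Copy the image into a larger buffer and replicate the border pixels,
    // so out-of-range taps in the shader read a sensible value instead of
    // whatever lies past the texture edge.
    public static float[] PadReplicate(float[] src, int width, int height, int radius)
    {
        int paddedWidth = width + 2 * radius;
        int paddedHeight = height + 2 * radius;
        float[] dst = new float[paddedWidth * paddedHeight];
        for (int y = 0; y < paddedHeight; y++)
        {
            int sy = Math.Min(Math.Max(y - radius, 0), height - 1);
            for (int x = 0; x < paddedWidth; x++)
            {
                int sx = Math.Min(Math.Max(x - radius, 0), width - 1);
                dst[y * paddedWidth + x] = src[sy * width + sx];
            }
        }
        return dst;
    }
}
```

After the blur, the result would be cropped back by the same radius so the output matches the original dimensions.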
Other things I might try:
- Use 32-bit floating points because SSE2 can multiply 4 of those at once (vs. 2 64-bit doubles)
- Write manual SSE2 assembly
- Utilize DirectX and a hardware-accelerated pixel shader to do the job
- Some better idea that you have…?
Which one do you think I should try next?