I would hope that someone would turn on GCC 3's optimization flags for various platforms, build a version geared toward each, pass it off to some users of that CPU and let them test it, get the results back, verify all is working well and check the benchmark times, then release a few CPU-specific clients. After all, it is a freebie that could potentially help a bunch and might not even require any code changes. What's the ratio of SSE/SSE2/3DNow-enabled users to Altivec-enabled users, again? Oops! It's in the chart. 95%-ish SSE/SSE2/3DNow potential users... Count everything on Win98 and below as MMX or worse: 20%... That leaves us with a conservative 75% to 3%? One would hope SSE would be evaluated first...
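For the CPU-specific builds idea, here is a minimal sketch of how a client could report which SIMD target it was compiled for, so testers can attach that to their benchmark results. It relies on macros GCC predefines when the matching option (for example -msse2 or -maltivec) is in effect. The function names and the banner text are made up for illustration and are not from the DF client.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: names the SIMD target this binary was built for,
   using macros GCC predefines when the matching -m option is active. */
const char *build_simd_target(void)
{
#if defined(__ALTIVEC__)
    return "altivec";   /* PowerPC build with AltiVec enabled */
#elif defined(__SSE2__)
    return "sse2";
#elif defined(__SSE__)
    return "sse";
#elif defined(__3dNOW__)
    return "3dnow";
#elif defined(__MMX__)
    return "mmx";
#else
    return "generic";   /* no SIMD option was enabled at compile time */
#endif
}

/* Hypothetical helper: writes a one-line banner for benchmark reports.
   Returns the value of snprintf, i.e. the length the banner needs. */
int format_build_banner(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "client build target: %s", build_simd_target());
}
```

This only reports what the compiler was told to target. It does not check what the user's CPU actually supports, so a build could still be handed to the wrong machine.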
(I'm sure this will stoke the flames)
Regarding users of CPUs, try this logic:
All Apple PPC machines are made by Apple.
Apple makes extensive use of Altivec in MacOS X.
Apple makes sure the compiler that ships with OS X Dev tools works great for CPU optimization and Altivec coding.
Among the 95% of potential SSE/SSE2/3DNow users, there are probably 40+ different motherboards and 20+ memory controllers.
Testing on Apple's machines requires maybe 14 different machines, but all are *very* similar, including the memory controllers.
Testing on SSE/SSE2/3DNow machines requires hundreds of different machines, some similar and some very different, with some using SDRAM paths and others of the same model using DDR paths.
All Apple users you are interested in are running MacOS X (since you aren't coding for OS 9).
All SSE/SSE2/3DNow-enabled CPU users are running, uh, lessee, 12 different *nix variants, 5 different WinXX variants, and 5+ different types of CPUs (I really don't know the exact specifics).
The above is why it is far, far easier and better to support MacOS running on PowerPCs. One client works for millions.
It is more bang for your coding time, less testing, fewer odd bugs you can't track down, fewer headaches.
As for the DF client, using FPU math could be a boost, and you'll get some of that with CPU optimization for free.
Multi-threading would be great, as there are many dual-CPU machines out there (not just Macs), and it would offer some gain.
Using cache instructions might help as well, if you know which data needs to stay around and gets reused. Again, you get some of that for free with CPU-specific optimization.
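As a rough illustration of a cache hint, GCC's __builtin_prefetch lets the code ask for data a few iterations before it is needed. This is only a sketch: the function name and the distance of 16 elements are assumptions, and a plain linear walk like this one is often handled well by the hardware already, so any gain would have to be measured on each CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed look-ahead distance in elements; would need tuning per CPU. */
#define PREFETCH_DISTANCE 16

/* Hypothetical example: sums an array while hinting upcoming elements
   into cache. Arguments to __builtin_prefetch: address, 0 = read access,
   3 = high temporal locality (keep the data in cache if possible). */
long sum_with_prefetch(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
#endif
        total += data[i];
    }
    return total;
}
```

The prefetch is only a hint and never changes the result, so the same source still builds and runs on compilers or CPUs where it does nothing.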
Using vector math would be a huge boost, and writing the math in vector form would make it easier to support SSE/SSE2/3DNow/Altivec in general.
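One way to write vector math once and let the compiler map it to whatever SIMD unit the target has is GCC's generic vector extension. The sketch below assumes a simple y += a * x loop. The function name and the operation are invented for illustration and say nothing about what the DF client actually computes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four packed floats (16 bytes), using GCC's vector_size attribute.
   The compiler picks SSE, AltiVec, or scalar code depending on the target. */
typedef float vec4f __attribute__((vector_size(16)));

/* Hypothetical example: y[i] += a * x[i] for i in [0, n). */
void scale_add(float *y, const float *x, float a, size_t n)
{
    assert((x != NULL && y != NULL) || n == 0);
    const vec4f va = { a, a, a, a };
    size_t i = 0;

    /* Main loop: four elements per step. memcpy is used for loads and
       stores so the arrays do not need 16-byte alignment. */
    for (; i + 4 <= n; i += 4) {
        vec4f vx, vy;
        memcpy(&vx, &x[i], sizeof vx);
        memcpy(&vy, &y[i], sizeof vy);
        vy += va * vx;
        memcpy(&y[i], &vy, sizeof vy);
    }

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; i++)
        y[i] += a * x[i];
}
```

Whether this beats what the compiler generates for a plain scalar loop depends on the compiler version and flags, so it would need the same benchmark pass as any other CPU-specific build.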
It all really comes down to whether you want to take the most advantage of the user's CPU, how quickly you want results and the willingness to explore new coding frontiers. Of course, there are lots of folks willing to help out, from coding to testing, so that is a big boon.