Altivec Enhancements

**runestar** · 10-04-2002, 03:28 PM

SSE, 3DNow!, and other such instruction sets are really oriented towards gamers and heavy-duty graphics modeling, which make heavy use of floating point.

There's a certain ratio of work to performance gained that must be maintained or it doesn't really pay off. As one person noted, floating point operations only make up about 5% of the calculations, so the amount of work Brian (Howard) would have to spend researching and then incorporating it would very likely outweigh the speed benefits.

It's not an unwillingness to incorporate them, but it's like putting high-speed performance tires on the car when all you do is drive around locally... It's really great, the car may run just a tad smoother... but is it really worth the extra cost?

I do think that Brian would be interested in optimizations that would make a significant improvement in the calculations. I wouldn't claim to speak for him, but for me I would guess at least a 20 to 25% performance increase for the extra time spent incorporating it.

As for the Mac community, even though they are a small chunk of the market, they tend to be a loyal bunch on Mac-related topics, so they shouldn't necessarily be ruled out for improvements just because they are Mac users. It's been too long that the WinIntel giant has cast its shadow.

Now as for SDRAM and DDR... remember that these are the max potential speeds that data can move. It's similar to hard drives in that the max transfer speed is not the speed that all data is going to transfer at, but rather how fast it could potentially travel.

Of course there are a lot more factors, as someone pointed out. The motherboard chipset does play a big role. Some designs just work out better than others, and over time even the same speed chipset gets new tweaks.
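To put rough numbers on "max potential speed", here is a minimal sketch of the usual peak-rate arithmetic: bus clock times transfers per clock times bus width. The PC133 and DDR266 figures, the 64-bit (8-byte) bus, and the function names are my own illustrative assumptions, not anything measured on a real system.

```c
#include <stdio.h>

/* Peak (theoretical) transfer rate in MB/s for a memory bus.
 * clock_mhz:         bus clock in MHz (nominal)
 * transfers_per_clk: 1 for single data rate SDRAM, 2 for DDR
 * bus_bytes:         bus width in bytes (8 for a 64-bit module) */
unsigned peak_mb_per_s(unsigned clock_mhz,
                       unsigned transfers_per_clk,
                       unsigned bus_bytes)
{
    return clock_mhz * transfers_per_clk * bus_bytes;
}

/* Print the two illustrative cases; both are ceilings, not typical rates. */
void print_peak_rates(void)
{
    printf("PC133 SDRAM: %u MB/s peak\n", peak_mb_per_s(133, 1, 8));
    printf("DDR266:      %u MB/s peak\n", peak_mb_per_s(133, 2, 8));
}
```

With these nominal inputs the DDR ceiling comes out at exactly twice the SDRAM one. What a program actually sees depends on the chipset, latency, and access pattern.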

From what I found, it's accepted that DDR is faster than the older SDRAM.

The rule of thumb is, if you are getting a newer system with support for a newer standard, go with the newer standard parts even if it supports the older standard. Just make sure to match it up with what your board supports; for example, don't put in slower or faster RAM than what the board is rated for. Slower RAM limits you, and the faster RAM is wasted since the board can't take advantage of it.

TTFN,

RS½

**mikkyo** · 10-04-2002, 06:11 PM

What's the ratio of SSE/SSE2/3DNOW enabled users to Altivec-enabled users, again? Oops! It's in the chart. 95%-ish SSE/SSE2/3DNow potential users... Call everything Win98 and below as MMX or worse. 20%... That leaves us with a conservative 75% to 3%? One would hope SSE would be evaluated first...

I would hope that someone would turn on GCC 3's optimization flags for various platforms, build a version geared toward each, pass it off to some users of that CPU and let them test it, get the results back, verify all is working well and check the benchmark times, then release a few CPU-specific clients. After all, it is a freebie that could potentially help a bunch and might not even require any code changes.
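One cheap way to keep such per-CPU builds straight is to have each binary report what it was compiled for. The sketch below is an assumption on my part rather than anything in the DF client: it relies on the macros GCC predefines when options such as `-maltivec`, `-msse2`, `-msse`, or `-mmmx` (or a `-march=` value that implies them) are in effect, and `build_simd_target` is a made-up name.

```c
/* Returns a label for the SIMD instruction set this binary was built for.
 * GCC predefines these macros when the matching target option is enabled;
 * a build with none of them enabled reports "generic". */
const char *build_simd_target(void)
{
#if defined(__ALTIVEC__)
    return "altivec";
#elif defined(__SSE2__)
    return "sse2";
#elif defined(__SSE__)
    return "sse";
#elif defined(__MMX__)
    return "mmx";
#else
    return "generic";
#endif
}
```

Testers could then include that label with their benchmark times, so results from, say, a `-march=pentium3` build and a plain build don't get mixed up.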

(I'm sure this will stoke the flames)
Regarding users of CPUs, try this logic:
All Apple PPC machines are made by Apple.
Apple makes extensive use of Altivec in MacOS X.
Apple makes sure the compiler that ships with OS X Dev tools works great for CPU optimization and Altivec coding.
Of the 95% of SSE/SSE2/3DNow potential users, there are probably 40+ different motherboards, and 20+ memory controllers.
Testing on Apple's machines requires maybe 14 different machines, but all are *very* similar, including the memory controllers.
Testing on SSE/SSE2/3DNow machines requires hundreds of different machines, some similar some very different, some using SDRAM paths, some of the same machine using DDR paths.
All Apple users you are interested in are running MacOS X (since you aren't coding for OS 9).
All SSE/SSE2/3DNow enabled CPU users are running, uh lessee, 12 different *nix variants, 5 different WinXX variants, 5+ different types of CPUs (I really don't know the exact specifics).

The above is why it is far, far easier and better to support the MacOS running on PowerPCs. One client works for millions.
It is more bang for your coding time, less testing, fewer odd bugs you can't track down, less headache.

As for the DF client, using FPU math could be a boost, and you'll get some of that with CPU optimization for free.
Multi-threading would be great as there are many dual CPUs out there (not just Macs) and would offer some gain.
Using Cache instructions might help as well if you know what needs to stay around and gets repeated, again some of that you get for free with CPU specific optimization.
Using Vector Math would be a huge boost and make it easier to support SSE/SSE2/3DNow/Altivec in general.
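As a rough illustration of the "write it as vector math once" idea, here is a minimal sketch using GCC's `vector_size` extension (Clang accepts it too), which lets the compiler pick SSE, AltiVec, or scalar code for the target. The routine `scale_add` and its arguments are hypothetical and have nothing to do with the actual DF code.

```c
#include <string.h>

/* Four packed floats. vector_size is a compiler extension, not standard C;
 * the compiler maps it to whatever SIMD unit the build target offers. */
typedef float v4f __attribute__((vector_size(16)));

/* out[i] = a[i] * scale + b[i].
 * Works on four elements per step, then finishes any leftover elements
 * one at a time. memcpy is used for loads and stores so the arrays do
 * not need 16-byte alignment. */
void scale_add(float *out, const float *a, const float *b, float scale, int n)
{
    const v4f s = { scale, scale, scale, scale };
    int i = 0;

    for (; i + 4 <= n; i += 4) {
        v4f va, vb, vr;
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        vr = va * s + vb;              /* element-wise on all four lanes */
        memcpy(out + i, &vr, sizeof vr);
    }
    for (; i < n; i++)                 /* scalar tail */
        out[i] = a[i] * scale + b[i];
}
```

The same source builds for x86 and PowerPC; how good the generated code is depends on the compiler version and flags, so it would still need benchmarking per CPU.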

It all really comes down to whether you want to take the most advantage of the user's CPU, how quickly you want results and the willingness to explore new coding frontiers. Of course, there are lots of folks willing to help out, from coding to testing, so that is a big boon.

**Jodie** · 10-04-2002, 10:00 PM

We do high-end video compression servers. I think I can speak with some authority on the topic of SSE/SSE2/etc.

Quite the contrary to what you posted, SSE's big strength is not in floating point, but rather in integer math. MMX had the floating boosts, SSE is a bit more in the float, but was focused on integer calculations. Just in the 3DNow! instruction set, for example, there are 19 additional integer calculation SIMD instructions.

The compiler can use SSE instructions, but can't do intelligent pipelining for integer calculations.

Add to that cache-hit-hinting (which compilers do basically nothing with) and your walking a tree gets substantially faster.
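For anyone who hasn't seen cache hinting in code, here is a minimal sketch of a tree walk that asks the CPU to start fetching the children early. It uses GCC's `__builtin_prefetch`; the `node` layout and `tree_sum` are made up for illustration and are not the DF client's data structures.

```c
#include <stddef.h>

struct node {
    long value;
    struct node *left, *right;
};

#if defined(__GNUC__)
/* A hint only: prefetching an invalid or NULL address does not fault. */
#define PREFETCH(p) __builtin_prefetch(p)
#else
#define PREFETCH(p) ((void)0)
#endif

/* Sum every value in the tree. Both children are requested before the
 * recursion starts, so the right subtree's root may already be in cache
 * by the time the left subtree has been processed. */
long tree_sum(const struct node *n)
{
    if (n == NULL)
        return 0;
    PREFETCH(n->left);
    PREFETCH(n->right);
    return n->value + tree_sum(n->left) + tree_sum(n->right);
}
```

Whether this helps at all depends on how the nodes are laid out in memory and on the CPU, so it needs measuring rather than assuming.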

SSE2 has 144 new instructions including a substantial number devoted to 128-bit SIMD integer arithmetic. In pure integer math, we see a 15% speed increase with Intel's compilers and a 254% speed increase in integer math with hand optimization over already highly optimized code.
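To make "128-bit SIMD integer arithmetic" concrete, here is a minimal hand-written sketch using the SSE2 intrinsics from `<emmintrin.h>`, with a plain C fallback when the build has no SSE2. The function `add_i32` is a toy example of mine, not code from our servers, and it says nothing about the speedups quoted above.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* dst[i] = a[i] + b[i] for 32-bit integers. */
void add_i32(int32_t *dst, const int32_t *a, const int32_t *b, int n)
{
    int i = 0;
#if defined(__SSE2__)
    /* Four 32-bit adds per instruction in one 128-bit register.
     * Unaligned loads/stores are used so callers need no special alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
    }
#endif
    /* Scalar loop: the whole array without SSE2, otherwise the leftover tail. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Real hand optimization goes well beyond this (instruction scheduling, alignment, keeping data in registers), but the basic shape is the same.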

SSE2 wasn't intended for gaming. It was intended for encryption, voice and video compression, financial analysis, engineering and scientific calculations.

That, as I remember it, is straight from the horse's mouth. I took a class from Intel on SSE/SIMD. I believe the web page describing the intent is at: http://www.intel.com/design/Pentium4/prodbref/index.htm

Aha -

Streaming SIMD Extensions 2 (SSE2) Instructions
With the introduction of SSE2, the Intel NetBurst microarchitecture now extends the SIMD capabilities that MMX technology and SSE technology delivered by adding 144 instructions. These instructions include 128-bit SIMD integer arithmetic and 128-bit SIMD double-precision floating-point operations. These instructions reduce the overall number of instructions required to execute a particular program task and as a result can contribute to an overall performance increase. They accelerate a broad range of applications, including video, speech, and image, photo processing, encryption, financial, engineering and scientific applications.

Data Prefetch Logic
Functionality that anticipates the data needed by an application and pre-loads it into the Advanced Transfer Cache, further increasing processor and application performance.

**Jodie** · 10-04-2002, 10:11 PM

And I call :bs: on your argument, Mikkyo. 3DNow, MMX, SSE, SSE2 is entirely motherboard, memory architecture, etc. independent. It's processor dependent. 3DNow! is Athlon only, so toss it out.

Athlons support MMX and SSE.

P4 is the only SSE2 processor other than Server P3. So toss that out.

If you code for MMX it runs on every Athlon and P-II or greater.

So that takes you to seventy-something percent of the total machines out there today.

If you code for SSE it runs on EVERY Athlon Tbird + and every P-3 +

Now you're still over 40% of the machines.

It's also operating system independent. The same MMX code that compiles under Windows compiles under any *nix that runs on that processor.

By autodetecting your processor, you can dynamically choose MMX, SSE, SSE2, etc.
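A minimal sketch of the autodetection step, assuming a present-day GCC or Clang on x86: `__builtin_cpu_supports` is a compiler builtin from much later GCC releases than the ones discussed in this thread, and reading CPUID directly is the older way to get the same answer. The function name `detect_simd` is made up.

```c
/* Returns the best instruction set the running CPU reports.
 * The builtins are x86-only, so every other target (or compiler)
 * falls through to "generic". */
const char *detect_simd(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();                       /* populate the feature data */
    if (__builtin_cpu_supports("sse2")) return "sse2";
    if (__builtin_cpu_supports("sse"))  return "sse";
    if (__builtin_cpu_supports("mmx"))  return "mmx";
#endif
    return "generic";
}
```

The check runs at program start on the user's machine, so one binary can carry several code paths and pick among them without the user choosing a download.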

Suggesting that it's smarter to code for an operating system that runs on less than 5% of the computers in the world is ludicrous. "Your entire market is a few thousand machines. So it's much easier to test your code. No one will run it, but that just means you have less to support!"

By your argument, the smartest machine to code for would probably be a PDP-03. I think I have one of a half dozen left running in the world... MUCH easier to test! Only code that was written on the '03 can run on the '03, and since there's only ONE board to test, look at how much wiser a decision that is!

**Darkness Productions** · 10-04-2002, 11:50 PM

Wellll...... he could do something like the d.net client did, and have a different *core* for it. Then, the client would detect what you're using, and use the appropriate core.
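A minimal sketch of what that could look like, with everything in it hypothetical: the feature bits, the core names, and the stand-in routine are mine, and this is not how the d.net client is actually written. Each "core" is just a function pointer with a note of what the CPU must support.

```c
#include <stddef.h>

/* CPU feature bits; filling them in (e.g. via CPUID) is outside this sketch. */
enum { FEAT_MMX = 1, FEAT_SSE = 2, FEAT_SSE2 = 4 };

typedef long (*core_fn)(const int *data, int n);

/* Stand-in cores. In a real client each would be a hand-tuned version of
 * the same routine; here they share one plain C body to stay runnable. */
static long sum_plain(const int *d, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += d[i];
    return s;
}
long core_generic(const int *d, int n) { return sum_plain(d, n); }
long core_mmx(const int *d, int n)     { return sum_plain(d, n); }
long core_sse(const int *d, int n)     { return sum_plain(d, n); }
long core_sse2(const int *d, int n)    { return sum_plain(d, n); }

struct core {
    const char *name;
    unsigned    needs;   /* feature bits this core requires */
    core_fn     run;
};

/* Ordered from most to least demanding; the last entry needs nothing. */
static const struct core cores[] = {
    { "sse2",    FEAT_SSE2, core_sse2    },
    { "sse",     FEAT_SSE,  core_sse     },
    { "mmx",     FEAT_MMX,  core_mmx     },
    { "generic", 0,         core_generic },
};

/* Return the first core whose requirements the feature mask satisfies. */
const struct core *select_core(unsigned cpu_features)
{
    const size_t count = sizeof cores / sizeof cores[0];
    for (size_t i = 0; i < count; i++)
        if ((cpu_features & cores[i].needs) == cores[i].needs)
            return &cores[i];
    return &cores[count - 1];   /* not reached: "generic" always matches */
}
```

The client would call `select_core` once at startup and then run everything through the returned `run` pointer, so adding an AltiVec core later means one more table entry and one more function.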

**Jodie** · 10-05-2002, 01:19 AM

Sure. In fact, back in my hacking days we used to do a universal executable. You could do a single executable that ran on every platform you wanted to support. But you carry a lot of extra "weight" doing that...

**bwkaz** · 10-05-2002, 09:21 AM

Not to mention a whole lot of time to develop each one...

**runestar** · 10-05-2002, 12:29 PM

Well put, but... Breathe Jodie BREATHE... =)

RS½
