Altivec Enhancements

**Scotttheking** · 04-05-2002, 02:14 AM

Now that Team MacNN is moving into DF, would it be possible to get some Altivec enhancements for the client?
We like seeing our processors fully used

Thx,

Scott

**Shaktai** · 04-05-2002, 02:33 AM

Originally posted by Scotttheking
Now that Team MacNN is moving into DF, would it be possible to get some Altivec enhancements for the client?
We like seeing our processors fully used

Thx,

Scott

Now there is an idea, if it is doable. I would love to give my G4 a crack at it with Altivec.

**Brian the Fist** · 04-05-2002, 11:49 AM

We have had a number of requests for Altivec enhancements, but we are currently unable to make them. We do not even have an Altivec machine to test on, let alone the knowledge to use it. Also, I suspect that our algorithm would not benefit much, because the majority of its time is spent doing pointer traversals, not math.

Can someone knowledgeable about Altivec tell me exactly what sorts of operations it is good at optimizing (and don't say 'everything')? We may be able to get a third party involved if there is enough interest and if it appears to be worth the time it would take.

**Shaktai** · 04-05-2002, 03:36 PM

Originally posted by Brian the Fist
We have had a number of requests for Altivec enhancements, however we are currently unable to perform such enhancements. We do not even have an Altivec machine to test on, let alone the knowledge to use it. Also, I suspect that our algorithm would not benefit that much because the majority of its time is spent doing pointer traversals, not math.

Can someone knowledgeable on Altivec tell me exactly what sorts of operations it is good at optimizing (and don't say 'everything' ) We may be able to get a 3rd party involved if there is enough interest and if it appears to be worth the time it would take.

It is my understanding that it primarily accelerates floating-point and vector calculations, but I am not an authority on it. Operations that can take advantage of it can see performance increases anywhere from 50% to 400% on the PowerPC G4 chips. The Altivec.org website provides information that you may find helpful in determining whether your process can benefit from it.

If you find that your process can benefit from it, you might approach Apple Computer for additional help with implementing it, or possibly even the loan of a "development/testing" machine. After all, they like to demonstrate how Altivec can benefit many different processes.

**Scotttheking** · 04-05-2002, 04:24 PM

The best place for info / help is the Apple developer mailing lists.

**rayson_ho** · 04-05-2002, 10:18 PM

If you can tell us where the program spends most of its time, and what it does there, then we may be able to determine whether AltiVec can be used or not.

Also, AltiVec is not the only option: there are other SIMD (Single Instruction, Multiple Data) implementations, like MMX, SSE, or 3DNow!. Each CPU vendor has its own implementation of SIMD instructions.

Rayson
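To make the SIMD idea concrete, here is a minimal sketch in plain Python (not real AltiVec code) that models a 128-bit vector register as four 32-bit lanes. The names `scalar_add` and `simd_add4` are made up for illustration; on real hardware each four-lane add would be a single instruction.

```python
# Toy model of SIMD: one "instruction" operates on four lanes at once.
# Assumption: 128-bit registers holding four 32-bit values, as on AltiVec or SSE.
LANES = 4

def scalar_add(a, b):
    """Add element by element; returns (result, number of add operations)."""
    out = [x + y for x, y in zip(a, b)]
    return out, len(out)

def simd_add4(a, b):
    """Add four elements per step; returns (result, number of vector operations)."""
    out, ops = [], 0
    for i in range(0, len(a), LANES):
        # On real hardware this whole slice would be one vector add.
        out.extend(x + y for x, y in zip(a[i:i + LANES], b[i:i + LANES]))
        ops += 1
    return out, ops

a = list(range(16))
b = [10] * 16
scalar_result, scalar_ops = scalar_add(a, b)
vector_result, vector_ops = simd_add4(a, b)
print(scalar_ops, vector_ops)  # number of scalar adds, number of vector adds
```

The results are identical; only the number of operations changes. That saving only appears when the same operation is applied to independent, contiguous data.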

**wheeles** · 04-06-2002, 07:36 AM

I have a dual-CPU PowerMac G4 that I am using to run the client.

Any Altivec enhancements would be appreciated. Any speed improvements to the client of this nature would result in more Mac people lending their cpu power to this project.

**wheeles** · 04-08-2002, 10:48 AM

Not sure if you guys have seen the following article on O'Reilly but it gives a bit of info about Altivec and how to code for it.

http://www.oreillynet.com/pub/a/mac/...5/altivec.html

Hopefully this will come in useful.

**Marc2211** · 04-08-2002, 12:51 PM

I'm also using a DP G4 Powermac...altivec would be very much welcomed, and I'm sure attract more G4/mac users to the project...

Marc

**SkiBikeSki** · 04-09-2002, 02:45 PM

I think everyone could benefit from reading this article from O'Reilly about AltiVec.

http://www.oreillynet.com/pub/a/mac/...5/altivec.html

**Shaktai** · 04-09-2002, 03:52 PM

Howard,

Hope some of the references that have been given are helpful. If you need more, let us know. Of course, the question remains: will there be enough G4 users to make the effort worthwhile? I think you will find the answer to be yes. RC-5 is nearing its end, and has been a haven for Mac G4 users because of its Altivec enhancements. You already have a large number of G4s working on this project, and can expect more. Not to mention the fact that the Mac community will get the word out for you if you are able to provide Altivec enhancements to your clients. You could soon have a few thousand Mac G4s crunching for you, and a good many of those folks will bring other computers with them as well. (I myself have 1 G4, 1 G3, 1 Celeron and 1 Athlon all crunching on this project.) In the end, though, only you and your team can decide if it is doable and worth the effort. Please keep us advised.

**Brian the Fist** · 04-09-2002, 09:48 PM

I may have someone who is willing to look into it for me (a Mac user, of course). I'm still not sure that it could really benefit from Altivec, though, as the bottleneck is not in floating-point operations. But there's only one way to find out, I guess. We probably couldn't start until the summer (i.e. May), as it would probably be a summer student doing it. We'll post a notice if/when we begin such a port.

**Shaktai** · 04-09-2002, 10:01 PM

Thanks for keeping an open mind.

**Scotttheking** · 04-09-2002, 10:17 PM

Originally posted by Brian the Fist
I may have someone who is willing to look into it for me (a Mac user of course ) Still not sure that it could really benefit from the Altivec though as the bottleneck is not in floating point operations. But only one way to find out I guess. Probably couldn't start doing it until the summer though (i.e. May) as it would probably be a summer student doing it. We'll post a notice if/when we begin such a port.

Thanks for looking into it.

Can you give us an idea what you think the bottleneck is?
Thx,

Scott

**Shaktai** · 04-30-2002, 05:48 PM

Just as a heads up. Vijay Pande at Folding@Home has announced that they are working on a new core that will support Altivec (and I presume some other vectorization technologies). It would be great if a way could be found to do this for dFold as well.

The Thread is http://forum.folding-community.org/viewtopic.php?t=102

**Scotttheking** · 05-11-2002, 06:59 AM

I know you said you'd post, but I'm impatient.

Any news?

**Jodie** · 05-11-2002, 09:13 AM

Bottom line: if you can't take advantage of 128-bit register math through vector math or other intense floating-point routines, then Altivec optimizations are a waste of time. The same is true for SSE/SSE2, 3DNow!, and the rest of the floating-point-optimized SIMD instruction sets.

SIMD isn't a magic bullet, guys. Apple may have sold you on that being the case - but it's not.

In fact, very few real-world tasks are sped up all that much by it.

Look at how dog slow the P4 is in reality. If you're encoding video, you're in floating-point land, and it screams. If you're doing a 500k-cell cut and paste in Excel, then it's going to come down to memory bandwidth.

Our team's off-the-cuff profiling has shown so far that it's all in the memory bandwidth for this project - and processor cache. That jibes with it being pointer math.

So for this project, the USparc 3s, Alphas and, to a lesser extent, higher-end Xeons should be the prom queens.

**Paratima** · 05-11-2002, 12:09 PM

Originally posted by Jodie
So for this project, the USparc 3's, Alpha's and to a lesser extent, higher-end Xeons should be the prom queens.

That's an interesting premise that I've read hereabouts before. Anyone have actual performance numbers on those machines?

**Scotttheking** · 05-11-2002, 07:35 PM

Jodie, I know you know more about this stuff than most of the rest of us, so could you explain a bit more?
Would it then make more sense to code prefetch into the app (can that even happen?) or something?

Also, would this mean that the new G4's L3 cache will provide a performance boost?

BTW, I'd still like to see an evaluation of the code and its Altivec potential from someone looking at the code. I never said Altivec would be faster, but I'd like to see if it can be.

Thx,

Scott

**Brian the Fist** · 05-12-2002, 04:04 PM

MHz for MHz, Alpha is definitely THE fastest processor for distributed folding. The majority of the program's time, as I have mentioned before, is traversing pointers (specifically in binary-tree-like data structures). This accounts for 50% or more of its time. Another good but smaller chunk is spent RLE decompressing the data in protein.trj, the protein data file. The expanddb utility that originally came with foldtrajlite uncompressed protein.trj, but we found this made things slower, not faster, probably due to increased loading from disk.
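A rough illustration of why this kind of work resists vectorization, written in Python rather than taken from the client's own code: in a binary-tree lookup, the address of the next node is only known after the current node has been read, so the loads form a serial chain. The `Node` class and the `insert` and `find` functions are hypothetical stand-ins, not anything from foldtrajlite.

```python
class Node:
    """Hypothetical binary search tree node."""
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def insert(root, key):
    """Insert key into the tree rooted at root and return the root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def find(root, key):
    """Return (found, hops). Each hop depends on the node loaded by the previous one."""
    node, hops = root, 0
    while node is not None:
        if key == node.key:
            return True, hops
        # The next address comes out of the node we just read, so there is
        # no way to fetch four "next" nodes at once the way SIMD fetches
        # four adjacent floats.
        node = node.left if key < node.key else node.right
        hops += 1
    return False, hops

root = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, k)

print(find(root, 60))
print(find(root, 65))
```

Every hop is also a potential cache miss, which is consistent with the cache size and memory bandwidth arguments made earlier in the thread.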

Altivec will be looked at this summer, if all goes well.
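For readers unfamiliar with the run-length encoding mentioned above, here is a minimal sketch. The real layout of protein.trj isn't described in this thread, so the (count, value) byte-pair format below is purely an assumption for illustration.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode as (count, value) byte pairs; runs are capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)

def rle_decode(packed: bytes) -> bytes:
    """Expand (count, value) byte pairs back into the original bytes."""
    out = bytearray()
    for i in range(0, len(packed), 2):
        count, value = packed[i], packed[i + 1]
        out += bytes([value]) * count
    return bytes(out)

raw = b"\x00" * 300 + b"\x07\x07\x09"
packed = rle_encode(raw)
print(len(raw), len(packed))
```

The trade-off is the one described above: the packed data is much smaller to read from disk, at the cost of CPU time spent in the decode loop.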

**Jodie** · 05-13-2002, 01:31 AM

That makes sense to me. I would expect, in order of performance (based on a suspect model that I have derived from what you just said):

Alpha
USparc3
R10k
P3 Xeon w/2M ->1M
P3 Xeon w/512k
P4 Xeon w/512k
P3 w/512k (they have those in server units like Dell, right?)
P4 Rambus 800
AMD XP
P4 Rambus 400
G3/G4
P3 standard
Celeron

The new G4 should then be faster due to greater cache. Place it probably betwixt the P4 Xeon and P3 Xeon

Increasing your memory bandwidth should help substantially. So DDR on the AMD would be a priority. (shoot - and here my whole cluster is optimized for G@H performance...)

Vectorization isn't a priority - so Crays, Hitachis and Fujitsus are right out.

If vectorization is out, how much response do you expect to see from Altivec or other SIMD optimization? A perfect parallelization shouldn't see you more than a 30% improvement. Realistically, probably what, 10%? Is it worth the effort?
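One way to put numbers on estimates like these is Amdahl's law: if a fraction p of the runtime is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The sketch below assumes an ideal 4x gain on the vectorized part (four 32-bit values per 128-bit register); that factor and the function names are assumptions for illustration, not measurements of the client.

```python
def overall_speedup(p: float, s: float) -> float:
    """Amdahl's law: fraction p of the runtime is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

def fraction_needed(target: float, s: float) -> float:
    """Fraction of runtime that must be sped up by s to reach an overall speedup of target."""
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / s)

LANES = 4  # assumption: four 32-bit values per 128-bit register, ideal scaling

for gain in (1.30, 1.10):
    p = fraction_needed(gain, LANES)
    print(f"{(gain - 1) * 100:.0f}% faster overall needs about "
          f"{p * 100:.0f}% of runtime vectorized")
```

Under those assumptions, a 30% overall gain needs roughly 31% of the runtime to be vectorizable, and a 10% gain needs roughly 12%.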

There's the perception thing here too, I think. If you do Altivec optimization, you're going to have to do at least SSE2 and most likely SSE1 and/or 3DNow!2 optimization... Otherwise, the PC-class users are going to scream that the 5% market distribution of Mac users got their optimization whilst the 90% of WinTel didn't. Sounds like RC all over again.

If it were I, and obviously it's not, I'd optimize for 64-256-bit register math (if you're doing binary trees, it's a natural), go with a register compiler where available (can you say Watcom - ZOOOM!) and hold out for the RealComputers like the US3, Sledgehammer (errr, I mean Opteron or whatever the heck the stupid marketing team came up with), Xeon, etc...

But I'm just rambling.

Scott - when I'm a tad less busy, I'll see if I can come up with something to help explain. It's a *really* complicated topic...

**Scotttheking** · 05-13-2002, 06:18 AM

Originally posted by Jodie
Scott - when I'm a tad less busy, I'll see if I can come up with something to help explain. It's a *really* complicated topic...

That's fine.
I can probably understand it in tech talk also, and if something's confusing I have translators.

Just curious, I forgot about the new new G4s

Is this the correct speed order?
L3 cache
256 on die L2
1MB backside L2

Also, since it's a lot of memory stuff, are the Altivec memory calls I've been reading about useful?

(Yeah, I know I'm wanting Altivec a lot, even if it is mainly psychological. There are a lot of people who'd join up just because of that word, even if there's not much benefit. And I want those users.)

--Scott

**eXXile** · 05-13-2002, 01:09 PM

Originally posted by Jodie

Increasing your memory bandwidth should help substantially. So DDR on the AMD would be a priority. (shoot - and here my whole cluster is optimized for G@H performance...)

How substantial is the difference between SDRAM and DDR?

**Jodie** · 05-15-2002, 12:51 AM

Hmm, I have a P4-400RD, a P4-800 and a P4-333 DDR. The downside is they're all different-speed chips.

So let me scrounge up some 1.6's (I know I have lots of those) and then compare.

Note they will be different chipsets so that could change things a bit.

Should be able to bite into that this weekend.

**Jodie** · 05-15-2002, 12:54 AM

Is there any way to do one set of 5000 and then exit? Or should I write a script to watch progress.txt and time from 5000 to 0 remaining? Not quite as accurate, 'cause my polling will slow things down a bit...

**Welnic** · 05-15-2002, 02:08 AM

For benchmarking I would just run with the -i f switch which will prevent the client from trying to upload. Then after running for a certain length of time just remove the foldtrajlite.lock and count the number of WU that were done.

**Scotttheking** · 05-28-2002, 03:15 AM

bump back up.

Jodie, got time for that explanation now?

**eXXile** · 05-28-2002, 01:18 PM

Iwill XP333 with an AXP 1800@1.66 turns out approximately 3400 structures per hour. The XP333 mobo uses an Ali Magik chipset which runs DDR.

Iwill KK266 with an Athlon 1.4@1.65 turns out approximately 3300 structures per hour. The KK266 mobo uses a VIA KT133A chipset which runs SDRAM.

I know there will be some argument that the Ali Magik chipset isn't the fastest DDR chipset, and that I'm not comparing the same processor. But the numbers are relatively close, so using DDR rather than SDRAM doesn't appear to make a dramatic difference in structure production.
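A quick way to sanity-check a comparison like this is to normalize by clock speed. The sketch below assumes the number after the "@" in each description is the actual running clock in GHz; the dictionary layout and the `per_ghz` helper are just for illustration.

```python
# Figures from the two runs above. Assumption: the number after "@" is the
# actual running clock speed in GHz.
runs = {
    "XP333 (DDR)": {"structures_per_hour": 3400, "ghz": 1.66},
    "KK266 (SDRAM)": {"structures_per_hour": 3300, "ghz": 1.65},
}

def per_ghz(run):
    """Structures per hour normalized to 1 GHz of clock speed."""
    return run["structures_per_hour"] / run["ghz"]

ddr = per_ghz(runs["XP333 (DDR)"])
sdram = per_ghz(runs["KK266 (SDRAM)"])
advantage = (ddr / sdram - 1) * 100

print(f"DDR:   {ddr:.0f} structures/hour/GHz")
print(f"SDRAM: {sdram:.0f} structures/hour/GHz")
print(f"DDR advantage: {advantage:.1f}%")
```

Per GHz, the DDR board comes out only a few percent ahead, and since the two processors are different models, even that small gap can't be pinned on the memory type alone.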

**MAD-ness** · 05-28-2002, 06:12 PM

You might try a SiS735 based board, it has good (KT266 level but not KT266a level) performance with DDR and it takes either SDRAM or DDR SDRAM.

If no one else steps up and runs one I might be convinced to open the case, dig up some SDRAM somewhere and do some benchmarks. I would prefer to avoid the extra work though.

**jamesa** · 07-14-2002, 12:06 PM

Originally posted by Jodie
If vectorization is out, how much response do you expect to see from Altivec or other SIMD optimization? A perfect parallelization shouldn't see you more than a 30% improvement. Realistically, probably what, 10%? Is it worth the effort?

Jodie,

I'm going to bump this for two reasons:
1. Because I think it's worthwhile, and
2. Because I have something new to add. Apple has an Architecture and Performance Group whose job is to look at algorithms and make them run fast on PPC (and Altivec) hardware. I know they exist, and although I don't know how to get hold of them directly, I can point you in the right direction should you be willing.

I know they exist because a guy whose screensaver is to be included in Jaguar (OS 10.2) just had his code optimised by these people (check it out: http://www.versiontracker.com/morein...d=11393&db=mac). What I'd suggest you do is contact him (email address is listed as calumr@mac.com ) and ask him if he can point you in the right direction.

I'd really like to see if they can help - I want to use my clock cycles where they are most effective, and right now, with a few Macs dedicated to the effort but no Altivec optimisations, I don't feel they're really being fully utilised. Surely there's somewhere in your code where it can make a difference?

Thanks

-- james

**Brian the Fist** · 07-14-2002, 12:28 PM

To clarify, Jodie is NOT a member of this project team; I am. She is a user (with lots of computers). So I'll assume your comments were directed at me.

We have rigorously optimized our code almost to the level of assembly code, and know exactly where the bottleneck is. The majority of its time is spent doing 32-bit pointer traversal, something which AltiVec cannot help with. Thus we expect only minimal improvement. Plus, if we looked at the actual number of users using the software, it would make more sense to optimize the Windows version, not the Mac one, if we were to make any CPU-specific optimized versions. However, with 13 platforms currently supported, we already have enough versions to maintain to keep our hands full.

**The_Equivocator** · 07-16-2002, 11:02 AM

Yeah, I looked into this a little while ago with some of the utilities that ship on the OS X developer CD. The utility I used looks at the core function calls that a program makes and tells you the percentage of time spent executing each of them. It is a really good way of seeing whether or not a program would benefit from Altivec enhancements.

Unfortunately, it was immediately obvious that Altivec would not be worth implementing in Distributed Folding.

**dtsang** · 08-11-2002, 09:06 AM

Howard,

Are you still looking into G4 optimization? If yes, may I suggest taking a look at this page:
http://developer.apple.com/hardware/ve/performance.html

It allows you to run a diagnostic to see if the distributed folding program can really be optimized.

From the page:

MONster and Shikari are most suitable for applications level or OS level performance measurement. They are a good way of identifying which applications or which functions might benefit from performance tuning and why. As such they are somewhat above the scope of the sorts of optimizations discussed here. These are the sorts of tools you should use first to discover what to vectorize. The actual process of verifying that your optimizations are working as intended however relies much more heavily on trace utilities and simulators like Sim_G4 and Acid.

I hope this helps!

**dtsang** · 08-11-2002, 09:29 AM

Also, I found another example of how Distributed Folding can benefit from AltiVec enhancement. On the Folding@Home website (which, I'm assuming, is running a similar protein folding initiative), they have a page dedicated to their new speed-oriented client. Read it here:

http://folding.stanford.edu/gromacs.html

On the page:

How can Gromacs be that much faster? Gromacs is built for speed. Everything about it has been optimized to be the very fastest MD code on the planet. ... Altivec is supported on Macs. The inner loops are handcoded in assembly. It has algorithms creatively designed for speed. It's an amazing feat. For us to include all of these optimizations into our current scientific code did not seem a judicious use of our programming resources (why reinvent the wheel?) and we instead decided to collaborate with the Gromacs team.

AltiVec enhancement is possible!

**Jodie** · 08-11-2002, 06:43 PM

As has been posted before - it appears that the algos involved in DF are very different from those employed by F@H.

**dtsang** · 08-11-2002, 10:04 PM

My boo-boo. I read most of the thread, but not all of it, I guess...

**mikkyo** · 10-02-2002, 05:52 AM

So I noticed some of the CHUD tools were mentioned but did anyone look at them and do some profiling?
Performance, Debugging, Profiling

Here is an example of the kind of info they can provide...
This is a partial analysis of a sample of 100000000 instructions of the foldtrajlite process running under OS X grabbed with amber and analyzed with acid.

Code:

Total Instruction Count = 100000000

------------------------------------------------------------------
Instruction Type          |        Count   |   % of Total
------------------------------------------------------------------
Integer                         39601893           39.60
Floating Point                   5291767            5.29
Altivec                                0            0.00
Branch                          18743879           18.74
Load                            18273513           18.27
Store                            9192360            9.19
Cache Control                         21            0.00
Data Stream                            0            0.00
Miscellaneous                    8896567            8.90
------------------------------------------------------------------

Revealing eh? The full report gives much more detail.

Looks to me like you do use floating point a bit but not as much as integer math.
You could also use altivec for your pointer tree traversal, if you wanted.
It sure would be nice to see the speed benefits of someone spending some time on this.
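For anyone who wants to slice the table differently, the percentages can be recomputed from the raw counts with a few lines of Python. The counts are copied from the report above; the category groupings and the `share` helper are only one possible way to combine them.

```python
# Instruction counts copied from the acid report above.
counts = {
    "Integer": 39601893,
    "Floating Point": 5291767,
    "Altivec": 0,
    "Branch": 18743879,
    "Load": 18273513,
    "Store": 9192360,
    "Cache Control": 21,
    "Data Stream": 0,
    "Miscellaneous": 8896567,
}

total = sum(counts.values())

def share(*names):
    """Percentage of all sampled instructions in the named categories."""
    return 100.0 * sum(counts[n] for n in names) / total

print(f"total instructions: {total}")
print(f"load + store:       {share('Load', 'Store'):.2f}%")
print(f"integer + branch:   {share('Integer', 'Branch'):.2f}%")
print(f"floating point:     {share('Floating Point'):.2f}%")
```

Keep in mind that an instruction count is not a time profile: a load that misses cache can cost far more than an integer add, so these shares don't directly show where the time goes.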

**Jodie** · 10-03-2002, 11:31 PM

What's the ratio of SSE/SSE2/3DNOW enabled users to Altivec-enabled users, again? Oops! It's in the chart. 95%-ish SSE/SSE2/3DNow potential users... Call everything Win98 and below as MMX or worse. 20%... That leaves us with a conservative 75% to 3%? One would hope SSE would be evaluated first...

**Brian the Fist** · 10-04-2002, 10:08 AM

Exactly. Not that I know the first thing about using SSE instructions either... though I THINK the Intel compiler automatically uses them where it sees fit.
