Thread: Altivec Enhancements

  1. #1
    The Cruncher From Hell
    Join Date
    Dec 2001
    Location
    The Depths of Hell
    Posts
    140

    Altivec Enhancements

    Now that Team MacNN is moving into DF, would it be possible to get some Altivec enhancements for the client?
    We like seeing our processors fully used

    Thx,

    Scott

  2. #2
    Mac since '86
    Join Date
    Apr 2002
    Location
    Silverdale, WA
    Posts
    51

    Re: Altivec Enhancements

    Originally posted by Scotttheking
    Now that Team MacNN is moving into DF, would it be possible to get some Altivec enhancements for the client?
    We like seeing our processors fully used

    Thx,

    Scott
    Now there is an idea, if it is doable. I would love to give my G4 a crack at it with Altivec.

  4. #4
    We have had a number of requests for Altivec enhancements; however, we are currently unable to perform them. We do not even have an Altivec machine to test on, let alone the knowledge to use it. Also, I suspect that our algorithm would not benefit much, because the majority of its time is spent doing pointer traversals, not math.

    Can someone knowledgeable about Altivec tell me exactly what sorts of operations it is good at optimizing (and don't say 'everything')? We may be able to get a third party involved if there is enough interest and if it appears to be worth the time it would take.
    Howard Feldman

  5. #5
    Mac since '86
    Join Date
    Apr 2002
    Location
    Silverdale, WA
    Posts
    51
    Originally posted by Brian the Fist
    We have had a number of requests for Altivec enhancements; however, we are currently unable to perform them. We do not even have an Altivec machine to test on, let alone the knowledge to use it. Also, I suspect that our algorithm would not benefit much, because the majority of its time is spent doing pointer traversals, not math.

    Can someone knowledgeable about Altivec tell me exactly what sorts of operations it is good at optimizing (and don't say 'everything')? We may be able to get a third party involved if there is enough interest and if it appears to be worth the time it would take.
    It is my understanding that it primarily accelerates floating point and vector calculations, but I am not an authority on it. Operations that can take advantage of it can see performance increases anywhere from 50% to 400% on the PowerPC G4 chips. The Altivec.org website provides information that you may find helpful in determining whether your process can benefit from it.

    If you find that your process can benefit from it, you might approach Apple Computer for additional help with implementing it, or possibly even the loan of a "development/testing" machine. After all, they like to demonstrate how Altivec can benefit many different processes.

  6. #6
    The Cruncher From Hell
    Join Date
    Dec 2001
    Location
    The Depths of Hell
    Posts
    140
    The best place for info / help is the apple developer mailing lists.

  7. #7
    If you can tell us where the program spends most of its time, and what it does there, then we may be able to determine whether we can use AltiVec or not.

    Also, AltiVec is not the only one; there are other SIMD (Single Instruction Stream, Multiple Data Streams) implementations, like MMX, SSE, or 3DNow!. Each CPU vendor has its own implementation of SIMD instructions.
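
    To make the SIMD idea a bit more concrete, here is a minimal sketch in plain Python (not real AltiVec, SSE or 3DNow! code). It models a 128-bit register as four 32-bit float lanes and counts how many "instructions" a scalar add and a 4-lane vector add would need for the same data. The function names and the `LANES` constant are illustrative assumptions only.

```python
LANES = 4  # assumption: one 128-bit register holding four 32-bit floats


def scalar_add(a, b):
    """Add element by element: one 'instruction' per element."""
    out = []
    ops = 0
    for x, y in zip(a, b):
        out.append(x + y)
        ops += 1
    return out, ops


def simd_add(a, b):
    """Model of a 4-lane vector add: one 'instruction' per group of four."""
    out = []
    ops = 0
    for i in range(0, len(a), LANES):
        # On real hardware these four adds would be a single vector instruction.
        out.extend(x + y for x, y in zip(a[i:i + LANES], b[i:i + LANES]))
        ops += 1
    return out, ops
```

    Both functions return the same sums; only the operation count differs. This only pays off when the same operation is applied to many independent values sitting next to each other in memory.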

    Rayson

  8. #8
    Junior Member
    Join Date
    Apr 2002
    Location
    London, UK
    Posts
    2
    I have a dual-CPU PowerMac G4 that I am using to run the client.

    Any Altivec enhancements would be appreciated. Any speed improvements to the client of this nature would result in more Mac people lending their cpu power to this project.

  9. #9
    Junior Member
    Join Date
    Apr 2002
    Location
    London, UK
    Posts
    2
    Not sure if you guys have seen the following article on O'Reilly but it gives a bit of info about Altivec and how to code for it.

    http://www.oreillynet.com/pub/a/mac/...5/altivec.html

    Hopefully this will come in useful.


  10. #10
    Junior Member
    Join Date
    Apr 2002
    Location
    Welwyn Garden City, UK
    Posts
    4
    I'm also using a DP G4 Powermac...altivec would be very much welcomed, and I'm sure attract more G4/mac users to the project...

    Marc

  11. #11
    I think everyone could benefit from reading this O'Reilly article about Altivec.

    http://www.oreillynet.com/pub/a/mac/...5/altivec.html

  12. #12
    Mac since '86
    Join Date
    Apr 2002
    Location
    Silverdale, WA
    Posts
    51
    Howard,

    Hope some of the references that have been given are helpful. If you need more, let us know. Of course, the question remains: will there be enough G4 users to make the effort worthwhile? I think you will find that answer to be yes. RC-5 is nearing its end, and has been a haven for Mac G4 users because of its Altivec enhancements. You already have a large number of G4s working on this project, and can expect more. Not to mention the fact that the Mac community will get the word out for you if you are able to provide Altivec enhancements to your clients. You could soon have a few thousand Mac G4s crunching for you, and a good many of those folks will bring other computers with them as well. (I myself have 1 G4, 1 G3, 1 Celeron and 1 Athlon all crunching on this project.) In the end, though, only you and your team can decide if it is doable and worth the effort. Please keep us advised.

  13. #13
    I may have someone who is willing to look into it for me (a Mac user, of course). I'm still not sure that it could really benefit from Altivec, though, as the bottleneck is not in floating-point operations. But there's only one way to find out, I guess. We probably couldn't start until the summer (i.e. May), as it would probably be a summer student doing it. We'll post a notice if/when we begin such a port.
    Howard Feldman

  14. #14
    Mac since '86
    Join Date
    Apr 2002
    Location
    Silverdale, WA
    Posts
    51
    Thanks for keeping an open mind.

  15. #15
    The Cruncher From Hell
    Join Date
    Dec 2001
    Location
    The Depths of Hell
    Posts
    140
    Originally posted by Brian the Fist
    I may have someone who is willing to look into it for me (a Mac user, of course). I'm still not sure that it could really benefit from Altivec, though, as the bottleneck is not in floating-point operations. But there's only one way to find out, I guess. We probably couldn't start until the summer (i.e. May), as it would probably be a summer student doing it. We'll post a notice if/when we begin such a port.
    Thanks for looking into it.

    Can you give us an idea what you think the bottleneck is?
    Thx,

    Scott

  16. #16
    Mac since '86
    Join Date
    Apr 2002
    Location
    Silverdale, WA
    Posts
    51
    Just as a heads up. Vijay Pande at Folding@Home has announced that they are working on a new core that will support Altivec (and I presume some other vectorization technologies). It would be great if a way could be found to do this for dFold as well.

    The Thread is http://forum.folding-community.org/viewtopic.php?t=102

  17. #17
    The Cruncher From Hell
    Join Date
    Dec 2001
    Location
    The Depths of Hell
    Posts
    140
    I know you said you'd post, but I'm impatient

    Any news?

  18. #18
    Bottom line is if you can't take advantage of 128-bit register math through vector math or other intense floating-point routines, then the Altivec optimizations are a waste of time. The same is true for SSE/SSE2, 3DNow!, and the rest of the floating-point-optimized SIMD instruction sets.

    SIMD isn't a magic bullet, guys. Apple may have sold you on that being the case, but it's not.

    In fact, very few real-world tasks are sped up all that much by it.

    Look at how dog slow the P4 is in reality. If you're encoding video, you're in floating-point land, and it screams. If you're doing a 500k-cell cut and paste in Excel, then it's going to come down to memory bandwidth.

    Our team's off-the-cuff profiling has shown so far that it's all in the memory bandwidth for this project - and processor cache. That jibes with it being pointer math.

    So for this project, the USparc 3's, Alpha's and to a lesser extent, higher-end Xeons should be the prom queens.

  19. #19
    Ancient Programmer Paratima's Avatar
    Join Date
    Dec 2001
    Location
    West Central Florida
    Posts
    3,296
    Originally posted by Jodie
    So for this project, the USparc 3's, Alpha's and to a lesser extent, higher-end Xeons should be the prom queens.
    That's an interesting premise that I've read hereabouts before. Anyone have actual performance numbers on those machines?

  20. #20
    The Cruncher From Hell
    Join Date
    Dec 2001
    Location
    The Depths of Hell
    Posts
    140
    Jodie, I know you know more about this stuff than most of the rest of us, so could you explain a bit more?
    Would it then make more sense to code prefetch into the app (can that even happen?) or something?

    Also, would this mean that the new G4's L3 cache will provide a performance boost?

    BTW, I'd still like to see an evaluation of the code and its Altivec potential from someone looking at the code. I never said Altivec would be faster, but I'd like to see if it can be.

    Thx,

    Scott
    Last edited by Scotttheking; 05-11-2002 at 07:50 PM.

  21. #21
    MHz for MHz, Alpha is definitely THE fastest processor for distributed folding. The majority of the program's time, as I have mentioned before, is traversing pointers (specifically in binary-tree-like data structures). This accounts for 50% or more of its time. Another good but smaller chunk is spent RLE decompressing the data in protein.trj, the protein data file. The expanddb utility that originally came with foldtrajlite uncompressed protein.trj, but we found this made things slower, not faster, probably due to increased loading from disk.
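
    As a rough illustration of the two hot spots described above, here is a minimal Python sketch. The `Node` layout, the `find` walk and the (count, value) run-length format are all assumptions made for the example; they are not the client's real data structures, and protein.trj may well use a different encoding. The point of `find` is that each step needs the node loaded by the previous step, so the loads form a serial chain that is bound by memory latency rather than by arithmetic.

```python
class Node:
    """Hypothetical binary-tree node (not the client's real layout)."""

    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def find(root, key):
    """Walk down the tree, counting visited nodes.

    The next node to visit is only known after the current one has been
    loaded, so the steps cannot be done side by side.
    """
    node = root
    steps = 0
    while node is not None:
        steps += 1
        if key == node.key:
            return node, steps
        node = node.left if key < node.key else node.right
    return None, steps


def rle_decode(pairs):
    """Expand (count, value) pairs into a flat list (assumed format)."""
    out = []
    for count, value in pairs:
        out.extend([value] * count)
    return out
```

    For example, `rle_decode([(3, "a"), (1, "b")])` expands to three copies of "a" followed by one "b".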

    Altivec will be looked at this summer, if all goes well.
    Howard Feldman

  22. #22
    That makes sense to me. I would expect, in order of performance (based on a suspect model that I have derived from what you just said):

    Alpha
    USparc3
    R10k
    P3 Xeon w/2M ->1M
    P3 Xeon w/512k
    P4 Xeon w/512k
    P3 w/512k (they have those in server units like Dell, right?)
    P4 Rambus 800
    AMD XP
    P4 Rambus 400
    G3/G4
    P3 standard
    Celeron

    The new G4 should then be faster due to its greater cache. Place it probably betwixt the P4 Xeon and P3 Xeon.

    Increasing your memory bandwidth should help substantially. So DDR on the AMD would be a priority. (shoot - and here my whole cluster is optimized for G@H performance...)

    Vectorization isn't a priority - so Crays, Hitachis and Fujitsus are right out.

    If vectorization is out, how much response do you expect to see from Altivec or other SIMD optimization? A perfect parallelization shouldn't see you more than a 30% improvement. Realistically, probably what, 10%? Is it worth the effort?
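
    The ceiling being described here is essentially Amdahl's law: if only a fraction p of the run time can be vectorized, and that part becomes s times faster, the overall speedup is 1 / ((1 - p) + p / s). A minimal sketch follows. The function names are illustrative, and any fractions you plug in are assumptions, not measurements of the client.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of run time is made s times faster."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


def max_speedup(p):
    """Upper bound on speedup when the fraction p becomes infinitely fast."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return 1.0 / (1.0 - p)
```

    For instance, if a quarter of the run time were vectorizable and ran four times faster, `amdahl_speedup(0.25, 4)` gives about 1.23, and even an infinitely fast vector unit could not do better than `max_speedup(0.25)`, about 1.33.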

    There's the perception thing here too, I think. If you do Altivec optimization, you're going to have to do at least SSE2 and most likely SSE1 and/or 3DNow!2 optimization. Otherwise, the PC-class users are going to scream that the 5% market distribution of Mac users got their optimization whilst the 90% of WinTel didn't. Sounds like RC all over again.

    If it were I, and obviously it's not, I'd optimize for 64-256bit register math (if you're doing binary trees, it's a natural), go with a register compiler where available (can you say Watcom? ZOOOM!) and hold out for the RealComputers like the US3, Sledgehammer (errr, I mean Opteron or whatever the heck the stupid marketing team came up with), Xeon, etc.

    But I'm just rambling.

    Scott - when I'm a tad less busy, I'll see if I can come up with something to help explain. It's a *really* complicated topic...


  23. #23
    The Cruncher From Hell
    Join Date
    Dec 2001
    Location
    The Depths of Hell
    Posts
    140
    Originally posted by Jodie
    Scott - when I'm a tad less busy, I'll see if I can come up with something to help explain. It's a *really* complicated topic...
    That's fine.
    I can probably understand it in tech talk also, and if something's confusing I have translators

    Just curious, I forgot about the new new G4s
    Is this the correct speed order?
    L3 cache
    256 on die L2
    1MB backside L2

    Also, since it's a lot of memory stuff, are the Altivec memory calls I've been reading about useful?

    (Yeah, I know I'm wanting Altivec a lot, even if it is mainly psychological. There are a lot of people who'd join up just because of that word, even if there's not much benefit. And I want those users.)

    --Scott

  24. #24
    Registered User
    Join Date
    Mar 2002
    Location
    [H]awaii
    Posts
    20
    Originally posted by Jodie

    Increasing your memory bandwidth should help substantially. So DDR on the AMD would be a priority. (shoot - and here my whole cluster is optimized for G@H performance...)
    How substantial is the difference between SDRAM and DDR?

  25. #25
    Hmm, I have a P4-400RD, a P4-800 and a P4-333 DDR. Downside is they're all different speed chips.

    So let me scrounge up some 1.6's (I know I have lots of those) and then compare.

    Note they will be different chipsets so that could change things a bit.

    Should be able to bite into that this weekend.

  26. #26
    Is there any way to do one set of 5000 and then exit? Or should I write a script to watch progress.txt and time from 5000 to 0 remaining? Not quite as accurate, 'cause my polling will slow things down a bit...

  27. #27
    Senior Member
    Join Date
    Apr 2002
    Location
    Santa Barbara CA
    Posts
    355
    For benchmarking I would just run with the -i f switch which will prevent the client from trying to upload. Then after running for a certain length of time just remove the foldtrajlite.lock and count the number of WU that were done.

  28. #28
    The Cruncher From Hell
    Join Date
    Dec 2001
    Location
    The Depths of Hell
    Posts
    140
    bump back up.

    Jodie, got time for that explanation now?

  29. #29
    Registered User
    Join Date
    Mar 2002
    Location
    [H]awaii
    Posts
    20
    Iwill XP333 with an AXP 1800@1.66 turns out approximately 3400 structures per hour. The XP333 mobo uses an Ali Magik chipset which runs DDR.

    Iwill KK266 with an Athlon 1.4@1.65 turns out approximately 3300 structures per hour. The KK266 mobo uses a VIA KT133A chipset which runs SDRAM.

    I know there will be some objections that the Ali Magik chipset isn't the fastest DDR chipset, and that I'm not comparing the same processor. But the numbers are relatively close, so using DDR rather than SDRAM doesn't appear to make a dramatic difference in structure production.
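
    For what it's worth, the gap between those two figures can be worked out directly. A small sketch, using only the rates and clock speeds quoted above. The helper names are illustrative, and reading "1.66" and "1.65" as GHz clock speeds is an assumption.

```python
def percent_faster(new_rate, old_rate):
    """Relative difference of new_rate over old_rate, in percent."""
    return (new_rate - old_rate) / old_rate * 100.0


def per_ghz(rate, ghz):
    """Normalize a structures-per-hour rate by clock speed."""
    return rate / ghz


DDR_RATE, DDR_GHZ = 3400, 1.66      # XP333 board, DDR
SDRAM_RATE, SDRAM_GHZ = 3300, 1.65  # KK266 board, SDRAM

raw_gain = percent_faster(DDR_RATE, SDRAM_RATE)
clock_adjusted_gain = percent_faster(
    per_ghz(DDR_RATE, DDR_GHZ), per_ghz(SDRAM_RATE, SDRAM_GHZ)
)
```

    With these inputs the raw gain is roughly 3%, and about 2.4% once the small clock difference is factored out. Since the CPUs and chipsets also differ, treat it as a ballpark rather than a clean DDR vs SDRAM comparison.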

  30. #30
    You might try a SiS735 based board, it has good (KT266 level but not KT266a level) performance with DDR and it takes either SDRAM or DDR SDRAM.

    If no one else steps up and runs one I might be convinced to open the case, dig up some SDRAM somewhere and do some benchmarks. I would prefer to avoid the extra work though.

  31. #31
    Junior Member
    Join Date
    Jun 2002
    Location
    Canberra, Australia
    Posts
    6
    Originally posted by Jodie
    If vectorization is out, how much response do you expect to see from Altivec or other SIMD optimization? A perfect parallelization shouldn't see you more than a 30% improvement. Realistically, probably what, 10%? Is it worth the effort?
    Jodie,

    I'm going to bump this for two reasons:
    1. Because I think it's worthwhile, and
    2. Because I have something new to add. Apple has an Architecture and Performance Group whose job is to look at algorithms and make them run fast on PPC (and Altivec) hardware. I don't know how to get hold of them directly, but I can point you in the right direction should you be willing.

    I know they exist because a guy whose screensaver is to be included in Jaguar (OS 10.2) just had his code optimised by these people (check it out: http://www.versiontracker.com/morein...d=11393&db=mac). What I'd suggest you do is contact him (email address is listed as calumr@mac.com) and ask him if he can point you in the right direction.

    I'd really like to see if they can help. I want to use my clock cycles where they are most effective, and right now, with a few Macs dedicated to the effort but no Altivec optimisations, I don't feel they're really being fully utilised. Surely there's somewhere in your code where it can make a difference.

    Thanks

    -- james
    Divide and conquer

  32. #32
    To clarify, Jodie is NOT a member of this project team; I am. She is a user (with lots of computers). So I'll assume your comments were directed at me.

    We have rigorously optimized our code almost to the level of assembly code, and know exactly where the bottleneck is. The majority of its time is spent doing 32-bit pointer traversal, something which AltiVec cannot help with. Thus we expect only minimal improvement. Plus, if we looked at the actual number of users using the software, it would make more sense to optimize the Windows version, not the Mac one, if we were to make any specific CPU-optimized versions. However, at 13 platforms currently supported, we already have enough versions to maintain to keep our hands full.
    Howard Feldman

  33. #33
    Junior Member
    Join Date
    Jul 2002
    Location
    Northfield, MN
    Posts
    7
    Yeah, I looked into this a little while ago with some of the utilities that ship on the OS X developer CD. The utility I used looks at the core function calls that a program uses and tells you the percentage of time spent executing each of them. It is a really good way of seeing whether or not a program would benefit from Altivec enhancements.

    Unfortunately, it was immediately obvious that Altivec would not be worth implementing in Distributed Folding.


    Crunch Something

  34. #34
    Junior Member
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    27
    Howard,

    Are you still looking into G4 optimization? If yes, may I suggest taking a look at this page:
    http://developer.apple.com/hardware/ve/performance.html

    It allows you to run a diagnostic to see if the distributed folding program can really be optimized.

    From the page:
    MONster and Shikari are most suitable for applications level or OS level performance measurement. They are a good way of identifying which applications or which functions might benefit from performance tuning and why. As such they are somewhat above the scope of the sorts of optimizations discussed here. These are the sorts of tools you should use first to discover what to vectorize. The actual process of verifying that your optimizations are working as intended however relies much more heavily on trace utilities and simulators like Sim_G4 and Acid.
    I hope this helps!
    Last edited by dtsang; 08-11-2002 at 09:17 AM.
    Derek

  35. #35
    Junior Member
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    27
    Also, I found another example of how Distributed Folding can benefit from AltiVec enhancement. On the Folding@Home website (which, I'm assuming, is running a similar protein folding initiative), they have a page dedicated to their new speed-oriented client. Read it here:

    http://folding.stanford.edu/gromacs.html

    On the page:
    How can Gromacs be that much faster? Gromacs is built for speed. Everything about it has been optimized to be the very fastest MD code on the planet. ... Altivec is supported on Macs. The inner loops are handcoded in assembly. It has algorithms creatively designed for speed. It's an amazing feat. For us to include all of these optimizations into our current scientific code did not seem a judicious use of our programming resources (why reinvent the wheel?) and we instead decided to collaborate with the Gromacs team.
    AltiVec enhancement is possible!
    Derek

  36. #36
    As has been posted before, it appears that the algorithms involved in DF are very different from those employed by F@H.

  37. #37
    Junior Member
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    27
    My boo-boo. I read most of the thread, but not all of it, I guess...
    Derek

  38. #38
    Junior Member
    Join Date
    Sep 2002
    Location
    Silly Valley, CA
    Posts
    5

    So I noticed some of the CHUD tools were mentioned but did anyone look at them and do some profiling?
    Performance, Debugging, Profiling

    Here is an example of the kind of info they can provide...
    This is a partial analysis of a sample of 100000000 instructions of the foldtrajlite process running under OS X grabbed with amber and analyzed with acid.
    Code:
    Total Instruction Count = 100000000
    
    ------------------------------------------------------------------
    Instruction Type          |        Count   |   % of Total
    ------------------------------------------------------------------
    Integer                         39601893           39.60
    Floating Point                   5291767            5.29
    Altivec                                0            0.00
    Branch                          18743879           18.74
    Load                            18273513           18.27
    Store                            9192360            9.19
    Cache Control                         21            0.00
    Data Stream                            0            0.00
    Miscellaneous                    8896567            8.90
    ------------------------------------------------------------------
    Revealing eh? The full report gives much more detail.

    Looks to me like you do use floating point a bit but not as much as integer math.
    You could also use altivec for your pointer tree traversal, if you wanted.
    It sure would be nice to see the speed benefits of someone spending some time on this.
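
    For anyone who wants to play with those numbers, here is a small sketch that recomputes the percentages from the counts in the table and lumps loads and stores together as memory operations. The counts are copied from the report above; the grouping and the function names are my own assumptions. Keep in mind that instruction counts are not the same thing as time spent, since a load that misses the cache costs far more than an integer add.

```python
# Instruction counts copied from the acid report above.
COUNTS = {
    "Integer": 39601893,
    "Floating Point": 5291767,
    "Altivec": 0,
    "Branch": 18743879,
    "Load": 18273513,
    "Store": 9192360,
    "Cache Control": 21,
    "Data Stream": 0,
    "Miscellaneous": 8896567,
}


def percentages(counts):
    """Share of each instruction type, as a percentage of the total."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


def memory_share(counts):
    """Loads plus stores as a percentage of all instructions."""
    total = sum(counts.values())
    return 100.0 * (counts["Load"] + counts["Store"]) / total
```

    The counts add up to exactly the 100000000 sampled instructions, and loads plus stores come to about 27.5% of them, against about 5.3% for floating point.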

  39. #39
    What's the ratio of SSE/SSE2/3DNOW enabled users to Altivec-enabled users, again? Oops! It's in the chart. 95%-ish SSE/SSE2/3DNow potential users... Call everything Win98 and below as MMX or worse. 20%... That leaves us with a conservative 75% to 3%? One would hope SSE would be evaluated first...

  40. #40
    Exactly. Not that I know the first thing about using SSE instructions either... though I THINK the Intel compiler automatically uses them where it sees fit.
    Howard Feldman
