Apple Mac PPC G5 proth_sieve update
Hello,
A quick update on this as I haven't posted anything about it recently.
First off, big thanks to Rogue (Mark R) who has been putting in a big effort on the assembly front, without which it would be about 1/4 of the speed it is now.
It's not ready for release yet as there's still quite a bit of work to do:
- Remove last of the C++ code (only the dat reading functions left as C++)
- Remove all dependencies on GMP (using it for gcd and 64-bit multinv)
- Go through the possible optimisations that Mikael had noted.
- Test it thoroughly against previous factors!
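On the GMP item: both operations named there (gcd and a 64-bit modular inverse) can be had from the extended Euclidean algorithm on native integers. A minimal sketch in Python, not the proth_sieve code itself; the names `egcd` and `mod_inverse` are made up for illustration:

```python
def egcd(a, b):
    """Return (g, x, y) such that g == gcd(a, b) and a*x + b*y == g."""
    x0, y0, x1, y1 = 1, 0, 0, 1
    while b:
        q, a, b = a // b, b, a % b
        # Carry the Bezout coefficients along with each remainder step.
        x0, x1 = x1, x0 - q * x1
        y0, y1 = y1, y0 - q * y1
    return a, x0, y0


def mod_inverse(a, p):
    """Return the inverse of a modulo p, or raise if gcd(a, p) != 1."""
    g, x, _ = egcd(a % p, p)
    if g != 1:
        raise ValueError("a has no inverse modulo p")
    return x % p
```

In C or assembly the same loop runs on 64-bit registers, with the coefficients kept reduced modulo p so they never overflow.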
It's running happily at around 600 kp/sec on a 2.5GHz G5. So my dual-CPU dual-core is crunching at about 2400 kp/sec :-) This is for a range up at 1000T.
That's a 1T range for SoB in just under 5 days.
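The arithmetic behind that figure, as a quick Python check. It assumes "kp/sec" means thousands of p advanced per second and "1T" means a range of 10^12, which is how the numbers above fit together; `days_to_sieve` is just an illustrative helper name:

```python
def days_to_sieve(range_width, kp_per_sec):
    """Days needed to cover range_width values of p at kp_per_sec."""
    seconds = range_width / (kp_per_sec * 1000)
    return seconds / 86400  # 86400 seconds in a day


per_core = 600          # kp/sec on one 2.5GHz G5 core
cores = 4               # dual-CPU, dual-core
total = per_core * cores
days = days_to_sieve(10**12, total)
print(f"{total} kp/sec, {days:.2f} days per 1T")
```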
I reckon it will go up to 700 kp/sec once we've finished all of the optimisations.
Once finished it should support p up to 2^52 and further work is in progress to push this to 2^60.
The G4 version will be looked at afterwards. Its performance will not be as good, as there is a lot of 64-bit, G5-specific assembly in there.
Congrats with a 'heads up'
Nicely done. ...
But, one issue I see, and offer a 'heads up' on, is the switch to Intel processors by Apple. Any plans for handling the platform migration?
Good job on the speed... very nice indeed. I get about 550 kp/sec on my Opterons. (I'm still using the dinosaur 250's)
Clarification...... with apologies
I am sorry.. I forgot to mention that I am using the Riesel.dat as well.
The 991-50M .dat file comes in at around 750 kp/sec. I will post up a log file of the two.
I'm sorry for the miscommunication.
C.
The prelim Linux benchmarks/first cut
Quote:
Originally Posted by Greenbank
OK, I'll take that challenge:-
991 -> 50M SoB.dat. 2.5GHz PPC G5 running MacOS X 10.4.
Sieving 1800892200000000 <= p <= 1800892800000000, 991 <= n <= 50000000
p = 1800892210000123 @ 748kp/s
p = 1800892220000131 @ 740kp/s
p = 1800892230000139 @ 746kp/s
p = 1800892240000153 @ 778kp/s
p = 1800892250000213 @ 739kp/s
Mainly thanks to Rogue's fantastic work on the magic number multiplication!
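The post does not spell out what the "magic number multiplication" is. One common technique that goes by that name replaces the division in a modular multiply with a multiplication by a precomputed reciprocal (Barrett-style reduction). The sketch below shows that idea in Python, as an assumption about what is meant rather than Rogue's actual assembly; `make_magic` and `mulmod` are hypothetical names:

```python
def make_magic(p):
    """Precompute the reciprocal ("magic number") for modulus p."""
    n = p.bit_length()            # 2^(n-1) <= p < 2^n
    shift = 2 * n                 # so that a*b < p*p < 2^shift
    magic = (1 << shift) // p     # floor(2^shift / p), computed once per p
    return magic, shift


def mulmod(a, b, p, magic, shift):
    """Return (a * b) % p for 0 <= a, b < p, without dividing by p."""
    x = a * b
    q = (x * magic) >> shift      # estimate of x // p, low by at most 1
    r = x - q * p                 # therefore 0 <= r < 2p
    if r >= p:
        r -= p
    return r
```

The division happens once per p in `make_magic`; every multiply after that costs only multiplications, a shift, and one conditional subtract, which is where the saving comes from in a sieve that does many multiplies per p.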
There'll be another 10 kp/sec or so in the new sieve code, plus I can make the sieve window larger to make the p-1 phase more efficient. Memory usage is up to around 140MB though.
Given that I've got a slight advantage (2.5GHz vs 2.4GHz) I'd say we're pretty equal!
Under 2^50 I'm getting roughly 780kp/sec:
Sieving 926804760000000 <= p <= 926804800000000, 991 <= n <= 50000000
p = 926804770000009 @ 772kp/s
p = 926804780000059 @ 768kp/s
926804780689751 | 55459*2^2415946+1 (xmod 300))
p = 926804790000071 @ 784kp/s
Done
3120 kp/sec on one machine is lovely. :-)
Nice specs indeed.
Here I'm running a dual AMD Opteron 2.4GHz dual-core machine w/ CL3 ECC, Gentoo on 2.6.13. I'm still working on the 32/64 bit port. I have some work to do thanks to GCC3 -> GCC4 and the 'twist' Red Hat put in FC4, breaking a lot of my 32/64 bit portability, but the additional registers I get in -m64 mode are making things nice across the board. :)
The other dual Opteron is running Windows / FC4 dual boot.
I have both 32 and 64 bit versions in test atm. Here's what I get with the 991-50M .dat file:
chuck@innsbruck ~/prs
$ ./prs092_32 -S
PRS (32bit) v0.92a for processor type X86
Found SoBStatus.dat with unfinished work.
Continuing from last save point.
Setting priority to nice.
Running in Sierpinski mode
Starting setup.
Expected 8 k values.
Found 8 k values.
Done.
Setup took 4.211024 seconds.
Starting sieve.
P Range 1800892334217726 <= p <= 1800898200000000
N Range 991 <= n <= 50000000
pmin=1800892401326581 @ 796kp/s
pmin=1800892468435399 @ 797kp/s
pmin=1800892535544259 @ 797kp/s
pmin=1800892602653173 @ 796kp/s
pmin=1800892669761989 @ 796kp/s
pmin=1800892736870819 @ 798kp/s
pmin=1800892803979721 @ 796kp/s
1800892814825837 | 10223*2^15515117+1 (duplicate)
pmin=1800892871088631 @ 798kp/s
pmin=1800892938197473 @ 796kp/s
pmin=1800893005306319 @ 794kp/s
pmin=1800893072415209 @ 796kp/s
pmin=1800893139524041 @ 795kp/s
pmin=1800893206632953 @ 795kp/s
pmin=1800893273741759 @ 797kp/s
pmin=1800893340850601 @ 793kp/s
pmin=1800893407959539 @ 796kp/s
pmin=1800893475068333 @ 797kp/s
^C
chuck@innsbruck ~/prs
$ ./prs092_64 -S
PRS (64bit) v0.92a for processor type X86_64/EM64T
Found SoBStatus.dat with unfinished work.
Continuing from last save point.
Setting priority to nice.
Running in Sierpinski mode
Starting setup.
Expected 8 k values.
Found 8 k values.
Done.
Setup took 3.460346 seconds.
Starting sieve.
P Range 1800893475068332 <= p <= 1800898200000000
N Range 991 <= n <= 50000000
pmin=1800893542177193 @ 958kp/s
pmin=1800893609286029 @ 958kp/s
pmin=1800893676394921 @ 959kp/s
pmin=1800893743503781 @ 957kp/s
pmin=1800893810612647 @ 960kp/s
pmin=1800893877721483 @ 959kp/s
pmin=1800893944830361 @ 959kp/s
pmin=1800894011939219 @ 958kp/s
pmin=1800894079048091 @ 957kp/s
pmin=1800894146156957 @ 957kp/s
pmin=1800894213265817 @ 959kp/s
pmin=1800894280374667 @ 960kp/s
pmin=1800894347483533 @ 959kp/s
pmin=1800894414592417 @ 959kp/s
pmin=1800894481701257 @ 958kp/s
pmin=1800894548810117 @ 958kp/s
^C
chuck@innsbruck ~/prs
$
Calling 958 and 796 my averages, that clocks me in at 3832 kp/sec (4 x 958) in 64-bit mode and 3184 kp/sec (4 x 796) in 32-bit mode. That's about what I expect for cache hit rate vs NUMA bus interaction and the slow ECC memory (3-3-3-8 w/ scrubber). I do have to work on the memory utilization a bit though. I would like it smaller/more strategic, since I don't have an L3 cache.
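For anyone wanting to get the averages from a log rather than eyeballing them, a small Python sketch. It assumes the "pmin=... @ NNNkp/s" line format shown in the logs above; `mean_rate` is an illustrative name and the samples are just the first three readings of each run:

```python
import re

# Matches the rate at the end of a progress line, e.g. "@ 796kp/s".
RATE = re.compile(r"@\s*(\d+)\s*kp/s")


def mean_rate(log_text):
    """Average of all kp/s readings found in a pasted log."""
    rates = [int(m.group(1)) for m in RATE.finditer(log_text)]
    if not rates:
        raise ValueError("no kp/s readings found")
    return sum(rates) / len(rates)


sample_32 = """\
pmin=1800892401326581 @ 796kp/s
pmin=1800892468435399 @ 797kp/s
pmin=1800892535544259 @ 797kp/s
"""
sample_64 = """\
pmin=1800893542177193 @ 958kp/s
pmin=1800893609286029 @ 958kp/s
pmin=1800893676394921 @ 959kp/s
"""

cores = 4
for label, sample in (("32-bit", sample_32), ("64-bit", sample_64)):
    per_core = mean_rate(sample)
    print(f"{label}: {per_core:.1f} kp/sec per core, {cores * per_core:.0f} total")
```

Lines without a rate, such as factor lines, are skipped because they don't match the pattern.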
It's nice to see that we are running really well overall across all the platforms and OSes.
Chuck
Cross Platform Perf standardization
Joe, Alex, Mark,... and anyone working with/on sieves:
Question for all:
Given we are trying to support projects at various stages of maturity (and/or 'Trial P' magnitude), any thoughts as to which ranges we use as a 'community acceptable' set of P ranges?
I'm suggesting we use multiple P ranges for our code to ensure that the target platform, X86, PPC, SGI, Alpha, etc... provides us with a consistent profile so we can create / tune software to be the best possible.
I toss out as example: P= 10E4, 10E6, 10E8, 10E12, 10E18, 10E20, etc
(this is not mathematically accurate, but I hope it conveys the 'target' points of interest)
Thoughts? (open to all)
Chuck