
View Full Version : Apple Mac PPC G5 proth_sieve update



Greenbank
01-31-2006, 10:16 AM
Hello,

A quick update on this as I haven't posted anything about it recently.

First off, big thanks to Rogue (Mark R), who has been putting in a huge effort on the assembly front, without which it would be running at about a quarter of its current speed.

It's not ready for release yet as there's still quite a bit of work to do:

Remove last of the C++ code (only the dat reading functions left as C++)
Remove all dependencies on GMP (using it for gcd and 64-bit multinv)
Go through the possible optimisations that Mikael had noted.
Test it thoroughly against previous factors!


It's running happily at around 600 kp/sec on a 2.5GHz G5, so my dual-cpu dual-core machine is crunching at about 2400 kp/sec :-) This is for a range up at 1000T.

That's a 1T range for SoB in just under 5 days.

I reckon it will go up to 700 kp/sec once we've finished all of the optimisations.

Once finished it should support p up to 2^52 and further work is in progress to push this to 2^60.

The G4 version will be looked at afterwards; its performance will not be as good, as there is a lot of 64-bit-specific G5 assembly in there.

R. Gerbicz
01-31-2006, 12:00 PM
Hi Greenbank!

Can you send me the C source code, to robert ( DOT ) gerbicz ( AT ) gmail ( DOT ) com? I'm really interested in how you implemented the sieving algorithm. Have you included the ideas I posted some years ago in this forum?

vjs
01-31-2006, 01:19 PM
Wow, the cat's out of the bag!

I wouldn't ask for the source just yet; this has been a long, drawn-out task that has taken months if not a year. I'm sure the code will be released once a final version is complete, but certainly not before then.

R. Gerbicz
01-31-2006, 01:33 PM
I wouldn't ask for the source just yet; this has been a long, drawn-out task that has taken months if not a year. I'm sure the code will be released once a final version is complete, but certainly not before then.
Assembly code isn't very interesting for me; my assembly knowledge is almost zero. But I know C/C++ and GMP, so the C source would be good for me, with some comments if possible.

On Mikael Klasson's homepage ( prothsieve ) the situation is much worse: there are only executables for various systems, without any source code.

I think some algorithmic improvements to these programs are also possible; it isn't trivial to implement these algorithms efficiently, or to choose good parameters for the program...

PS: Sorry for the new user name; I couldn't remember my previous password, so I've just created a new one.

Greenbank
01-31-2006, 02:14 PM
Assembly code isn't very interesting for me; my assembly knowledge is almost zero. But I know C/C++ and GMP, so the C source would be good for me, with some comments if possible.

On Mikael Klasson's homepage ( prothsieve ) the situation is much worse: there are only executables for various systems, without any source code.

I think some algorithmic improvements to these programs are also possible; it isn't trivial to implement these algorithms efficiently, or to choose good parameters for the program...

PS: Sorry for the new user name; I couldn't remember my previous password, so I've just created a new one.

I just asked Mikael Klasson for the proth_sieve source and he sent it to me. I did promise I would not share it with anyone else without asking him. Once I've finished with it I'll ask him whether I can pass it on.

I still need to do lots of cleaning up on it. I've just removed the last of the C++ code and it now compiles with gcc as a plain C file. That was ~15 hours work in itself.

More news soon.

Matt
01-31-2006, 02:21 PM
That sounds seriously good! Can't wait to have that awesome PC on our side with the sieve effort!

Chuck
01-31-2006, 08:39 PM
Nicely done. ...

But, one issue I see, and offer a 'heads up' on, is the switch to Intel processors by Apple. Any plans for handling the platform migration?

Good job on the speed... very nice indeed. I get about 550 kp/sec on my Opterons. (I'm still using the dinosaur 250's)

Joe O
01-31-2006, 09:42 PM
Yes, Congratulations.


Now you x86 users need not worry. The following was done with the 991-50M dat on an AMD3300/2400MHz


pmin=1800892207959409 @ 702kp/s
pmin=1800892275068279 @ 701kp/s
pmin=1800892342177121 @ 701kp/s
pmin=1800892409285933 @ 701kp/s
pmin=1800892476394871 @ 703kp/s
pmin=1800892543503701 @ 701kp/s
pmin=1800892610612521 @ 702kp/s
pmin=1800892677721447 @ 702kp/s
pmin=1800892744830313 @ 706kp/s
pmin=1800892811939173 @ 704kp/s

Chuck
01-31-2006, 10:00 PM
I'm sorry; I forgot to mention that I am using the Riesel.dat as well.

The 991-50M .dat file comes in around 750. I will post up a log file of the two.

I'm sorry for the miscommunication.


C.

Greenbank
02-01-2006, 05:44 AM
Chuck/Joe,

I'll drop you an email later on today.

Great work on the x86 code!

Greenbank
02-07-2006, 11:58 AM
Yes, Congratulations.

Now you x86 users need not worry. The following was done with the 991-50M dat on an AMD3300/2400MHz

pmin=1800892207959409 @ 702kp/s
pmin=1800892275068279 @ 701kp/s
pmin=1800892342177121 @ 701kp/s
pmin=1800892409285933 @ 701kp/s
pmin=1800892476394871 @ 703kp/s
pmin=1800892543503701 @ 701kp/s
pmin=1800892610612521 @ 702kp/s
pmin=1800892677721447 @ 702kp/s
pmin=1800892744830313 @ 706kp/s
pmin=1800892811939173 @ 704kp/s

OK, I'll take that challenge:-

991 -> 50M SoB.dat. 2.5GHz PPC G5 running MacOS X 10.4.

Sieving 1800892200000000 <= p <= 1800892800000000, 991 <= n <= 50000000
p = 1800892210000123 @ 748kp/s
p = 1800892220000131 @ 740kp/s
p = 1800892230000139 @ 746kp/s
p = 1800892240000153 @ 778kp/s
p = 1800892250000213 @ 739kp/s

Mainly thanks to Rogue's fantastic work on the magic number multiplication!
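For readers wondering what "magic number multiplication" refers to: in general terms it replaces division/remainder by a fixed modulus with a multiply by a precomputed scaled reciprocal plus a shift. The sketch below is a minimal, portable C illustration that is exact for 32-bit operands; it is not Rogue's PPC assembly, and the siever needs the same idea at 50+ bit p with wider products.

```c
#include <stdint.h>

/* "Magic number" division sketch: n / d and n % d via a multiply by a
 * precomputed ceil(2^64 / d). Exact for all 32-bit n and d >= 2.
 * Illustrative only; not the proth_sieve PPC code. */
static uint64_t magic_for(uint32_t d) {
    return UINT64_MAX / d + 1;            /* ceil(2^64 / d) */
}

static uint32_t fast_div(uint32_t n, uint64_t magic) {
    /* high 64 bits of the 128-bit product n * magic = floor(n / d) */
    return (uint32_t)(((unsigned __int128)magic * n) >> 64);
}

static uint32_t fast_mod(uint32_t n, uint32_t d, uint64_t magic) {
    return n - fast_div(n, magic) * d;
}
```

The payoff is that the magic number is computed once per modulus, after which every reduction avoids a hardware divide; that is the shape of the trick, even though the siever's version operates on much larger p.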

There'll be another 10 kp/sec or so in the new sieve code, plus I can make the sieve window larger to make the p-1 phase more efficient. Memory usage is up to around 140MB though.

Given that I've got a slight advantage (2.5GHz vs 2.4GHz) I'd say we're pretty equal!

Under 2^50 I'm getting roughly 780kp/sec:

Sieving 926804760000000 <= p <= 926804800000000, 991 <= n <= 50000000
p = 926804770000009 @ 772kp/s
p = 926804780000059 @ 768kp/s
926804780689751 | 55459*2^2415946+1 (xmod 300))
p = 926804790000071 @ 784kp/s
Done

3120 kp/sec on one machine is lovely. :-)

Chuck
02-07-2006, 07:51 PM
OK, I'll take that challenge:-

991 -> 50M SoB.dat. 2.5GHz PPC G5 running MacOS X 10.4.

Sieving 1800892200000000 <= p <= 1800892800000000, 991 <= n <= 50000000
p = 1800892210000123 @ 748kp/s
p = 1800892220000131 @ 740kp/s
p = 1800892230000139 @ 746kp/s
p = 1800892240000153 @ 778kp/s
p = 1800892250000213 @ 739kp/s

Mainly thanks to Rogue's fantastic work on the magic number multiplication!

There'll be another 10 kp/sec or so in the new sieve code, plus I can make the sieve window larger to make the p-1 phase more efficient. Memory usage is up to around 140MB though.

Given that I've got a slight advantage (2.5GHz vs 2.4GHz) I'd say we're pretty equal!

Under 2^50 I'm getting roughly 780kp/sec:

Sieving 926804760000000 <= p <= 926804800000000, 991 <= n <= 50000000
p = 926804770000009 @ 772kp/s
p = 926804780000059 @ 768kp/s
926804780689751 | 55459*2^2415946+1 (xmod 300))
p = 926804790000071 @ 784kp/s
Done

3120 kp/sec on one machine is lovely. :-)


Nice specs indeed.

Here I'm running a dual AMD Opteron, 2.4GHz dual-core machine w/ CL3 ECC, Gentoo on 2.6.13. I'm still working on the 32/64-bit port. I have some work to do thanks to GCC3 -> GCC4 and the 'twist' Red Hat put in FC4, breaking a lot of my 32/64-bit portability, but the additional registers I get in -m64 mode are making things nice across the board. :)

The other dual Opteron is running Windows / FC4 dual boot.

I have both 32 and 64 bit versions in test atm. Here's what I get with the 991-50M .dat file:

chuck@innsbruck ~/prs
$ ./prs092_32 -S

PRS (32bit) v0.92a for processor type X86
Found SoBStatus.dat with unfinished work.
Continuing from last save point.
Setting priority to nice.
Running in Sierpinski mode
Starting setup.
Expected 8 k values.
Found 8 k values.
Done.
Setup took 4.211024 seconds.
Starting sieve.
P Range 1800892334217726 <= p <= 1800898200000000
N Range 991 <= n <= 50000000
pmin=1800892401326581 @ 796kp/s
pmin=1800892468435399 @ 797kp/s
pmin=1800892535544259 @ 797kp/s
pmin=1800892602653173 @ 796kp/s
pmin=1800892669761989 @ 796kp/s
pmin=1800892736870819 @ 798kp/s
pmin=1800892803979721 @ 796kp/s
1800892814825837 | 10223*2^15515117+1 (duplicate)
pmin=1800892871088631 @ 798kp/s
pmin=1800892938197473 @ 796kp/s
pmin=1800893005306319 @ 794kp/s
pmin=1800893072415209 @ 796kp/s
pmin=1800893139524041 @ 795kp/s
pmin=1800893206632953 @ 795kp/s
pmin=1800893273741759 @ 797kp/s
pmin=1800893340850601 @ 793kp/s
pmin=1800893407959539 @ 796kp/s
pmin=1800893475068333 @ 797kp/s
^C


chuck@innsbruck ~/prs
$ ./prs092_64 -S
PRS (64bit) v0.92a for processor type X86_64/EM64T
Found SoBStatus.dat with unfinished work.
Continuing from last save point.
Setting priority to nice.
Running in Sierpinski mode
Starting setup.
Expected 8 k values.
Found 8 k values.
Done.
Setup took 3.460346 seconds.
Starting sieve.
P Range 1800893475068332 <= p <= 1800898200000000
N Range 991 <= n <= 50000000
pmin=1800893542177193 @ 958kp/s
pmin=1800893609286029 @ 958kp/s
pmin=1800893676394921 @ 959kp/s
pmin=1800893743503781 @ 957kp/s
pmin=1800893810612647 @ 960kp/s
pmin=1800893877721483 @ 959kp/s
pmin=1800893944830361 @ 959kp/s
pmin=1800894011939219 @ 958kp/s
pmin=1800894079048091 @ 957kp/s
pmin=1800894146156957 @ 957kp/s
pmin=1800894213265817 @ 959kp/s
pmin=1800894280374667 @ 960kp/s
pmin=1800894347483533 @ 959kp/s
pmin=1800894414592417 @ 959kp/s
pmin=1800894481701257 @ 958kp/s
pmin=1800894548810117 @ 958kp/s
^C


chuck@innsbruck ~/prs
$


Calling 958 and 796 my averages, that clocks me in at (4x958) 3832 kp/sec in 64 bit mode and (4x796) 3184 kp/sec in 32 bit mode. That's about what I expect for cache hit rate vs NUMA bus interaction and the slow ECC memory (3-3-3-8 w/ scrubber). I do have to work on the memory utilization a bit though. I would like it smaller/more strategic, since I don't have an L3 cache.

It's nice to see that we are running really well overall across all the platforms and OS's.

Chuck

Matt
02-08-2006, 08:54 AM
Are you both working on separate projects here, i.e. one for PPC and one for x86? Any chance the projects could be brought together? It seems to me we don't want too much confusion in the system if we can help it, and one program with different versions for PPC and x86 would seem better than one program for x86 and another for PPC. Also, as someone who's keen on doing GUI wrappers for these if/when they come out (I've already made one for the current sieve client that I need to polish off), it would be nice to have matching client output.

Greenbank
02-08-2006, 09:20 AM
Are you both working on separate projects here, i.e. one for PPC and one for x86? Any chance the projects could be brought together? It seems to me we don't want too much confusion in the system if we can help it, and one program with different versions for PPC and x86 would seem better than one program for x86 and another for PPC. Also, as someone who's keen on doing GUI wrappers for these if/when they come out (I've already made one for the current sieve client that I need to polish off), it would be nice to have matching client output.

The majority of the work on the clients is the associated assembly to make things faster.

When each of us (Chuck/Joe and Rogue/myself) is happy with our code, I'm sure we'll do a swap and aim towards one codebase with a set of #defines to allow architecture-specific compilation.

I can only speak for the PPC version when I say that there is still quite a bit of work to be done cleaning up the code and making it more readable and making more optimisations.

Also different algorithms work better on different processors. Many of the algorithmic changes made by Chuck/Joe may not be faster on PPC and many of the optimisations on PPC may be slower than the versions that Chuck/Joe are using on x86.

So, yes, I do expect the projects to come together eventually, but trying to do that too early may hinder progress rather than help.
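The single-codebase-with-#defines idea can be sketched as follows. The macro names are gcc's predefined architecture macros; the function is an illustrative stand-in, not code from either siever.

```c
#include <stdint.h>

/* Sketch of one codebase with per-architecture fast paths selected by
 * #defines and a portable fallback. In the real merge the #if branches
 * would dispatch to each team's tuned asm. Assumes a, b < p. */
static uint64_t addmod(uint64_t a, uint64_t b, uint64_t p) {
#if defined(__ppc64__) || defined(__x86_64__)
    /* 64-bit targets: single compare-and-correct; r < a detects wrap */
    uint64_t r = a + b;
    return (r >= p || r < a) ? r - p : r;
#else
    /* generic 32-bit-safe fallback */
    return ((a % p) + (b % p)) % p;
#endif
}
```

Both branches compute the same result, which is exactly what makes this layout testable: the portable path doubles as the reference implementation for the architecture-specific ones.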

[EDIT - additions]

The first goal is to get something that can use G5 PPC machines to sieve for proth/riesel. That way we can utilise however many of these machines people have available.

Release 1 will be a like-for-like replacement for the existing proth_sieve, so completely manual.

I would expect Release 2 will be closer to the codebase of Chuck's/Joe's work.

Release 3 would include G4 (32-bit) support and hopefully a move towards the fabled network client code.

R. Gerbicz
02-24-2006, 10:43 AM
I've gotten the source code of proth_sieve from MKlasson :thumbs:
I'm also trying to write a sieve for proth/riesel numbers; however, the slowest routine in my program is the modular multiplication. When I inserted the asm routine ( from proth_sieve ) for mulmod, I always got many errors when compiling. So how can I compile C code with asm?
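While the asm mulmod won't compile, a portable C version is a useful stopgap. This sketch assumes gcc's unsigned __int128 on a 64-bit target; it is far slower than proth_sieve's tuned asm, but works as a drop-in for debugging and for checking fast versions against.

```c
#include <stdint.h>

/* Portable reference mulmod: (a * b) % p for p < 2^63, via gcc's
 * unsigned __int128. Slow but correct; handy while the asm version
 * refuses to build. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p) {
    return (uint64_t)(((unsigned __int128)a * b) % p);
}

/* Typical sieve use: binary powmod built on mulmod, e.g. computing
 * 2^e mod p when testing candidate factors. */
static uint64_t powmod(uint64_t base, uint64_t e, uint64_t p) {
    uint64_t r = 1 % p;
    base %= p;
    while (e) {
        if (e & 1) r = mulmod(r, base, p);
        base = mulmod(base, base, p);
        e >>= 1;
    }
    return r;
}
```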

Greenbank
02-27-2006, 05:23 AM
Ah, Mikael didn't give you the other program he uses to allow gcc to compile the inline assembly. There are two choices:

1) You need the gccfixer program he has, he's sent me a copy so I'll forward it on to you.

2) Download the Intel C compiler (which is free for non-commercial use). That can compile the non-AT&T style inline assembly.

Also, when compiling always use the -fomit-frame-pointer flag, as Mikael relies on this by leaving various values in registers which would otherwise get trampled.

[EDIT] OK, email sent; let me know if you haven't received it, as my webmail service randomly forgets to deliver emails. Oh, and don't forget "-O2" as well when compiling. Other options (-O1 or -O3) can cause it to fail, as it doesn't place values on the stack the way the code expects.

Chuck
02-27-2006, 06:01 AM
If I may add: with a small change in syntax, gccfixer can be eliminated. The assembly is written in Intel syntax (the .intel_syntax directive). Simply editing that directive to '.att_syntax' permanently eliminates the need for gccfixer; .att_syntax is the GCC default.
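The practical difference between the two dialects is operand order, which a tiny inline-asm example makes concrete. This is an illustration of my own (not proth_sieve code), guarded so the file still builds on non-x86-64 targets.

```c
#include <stdint.h>

/* AT&T syntax (GCC default): op source, destination.
 * Intel syntax: op destination, source.
 * So Intel "add rax, rdi" becomes AT&T "addq %rdi, %rax"; converting a
 * .intel_syntax source means swapping operand order, not just the
 * directive. Guarded for portability. */
static uint64_t add_u64(uint64_t a, uint64_t b) {
#if defined(__x86_64__)
    uint64_t r = a;
    __asm__("addq %1, %0" : "+r"(r) : "r"(b));   /* AT&T operand order */
    return r;
#else
    return a + b;                                /* portable fallback */
#endif
}
```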

This gives us a single GCC source for all x86 flavors. VS6 allows for selection of ASM dialect (VC7 does not). VS6 also has 'tuned support' modules for extra customization of the AMD and Intel processors; it is a support add-on from MS.

Now you have two compilers which will handle the same source.

If you use GCC, you can cross compile to any of the supported
platforms.

Also, please don't forget to use Alex's specs on the command line. This
is critical for proper compilation with GCC 3.x

You may also use the -march= option. -march=k8 is optimum for AMD64's.
Use 'athlon-xp' for the XP cpus. 'info GCC' in linux or cygwin will give
all the details.


Chuck

R. Gerbicz
02-27-2006, 04:49 PM
Thanks Greenbank and Chuck!
I've solved the problem by gccfixer program.

Now I can do the (a*b)%p mulmod calculation ( here a,b,p<2^50 and (double) b/p is precomputed ) in about 170 clocks on a P4 Celeron. How good is that?
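The description matches the classic floating-point-reciprocal mulmod. A hedged C sketch of that technique (my reconstruction, not R. Gerbicz's actual code), assuming a, b < p < 2^50 with (double)b/p precomputed:

```c
#include <stdint.h>

/* FP-reciprocal mulmod: with a, b < p < 2^50 and bdivp = (double)b / p
 * precomputed, (a*b) % p needs one FP multiply, two integer multiplies
 * and a correction. a*b and q*p overflow 64 bits, but their difference
 * is exact modulo 2^64, and q is within +-1 of the true quotient
 * because the 53-bit double mantissa covers the 50-bit operands. */
static uint64_t mulmod50(uint64_t a, uint64_t b, uint64_t p, double bdivp) {
    uint64_t q = (uint64_t)((double)a * bdivp);   /* ~ floor(a*b / p) */
    int64_t  r = (int64_t)(a * b - q * p);        /* exact mod 2^64   */
    if (r < 0) r += (int64_t)p;                   /* q one too large  */
    if (r >= (int64_t)p) r -= (int64_t)p;         /* q one too small  */
    return (uint64_t)r;
}
```

The point Chuck raises below about P4 rounding behaviour applies exactly here: the correction step only absorbs a quotient error of one, so the FPU's rounding mode must stay predictable.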

Chuck
02-27-2006, 10:18 PM
Thanks Greenbank and Chuck!
I've solved the problem by gccfixer program.

Now I can do the (a*b)%p mulmod calculation ( here a,b,p<2^50 and (double) b/p is precomputed ) in about 170 clocks on a P4 Celeron. How good is that?

That's not too bad... (if I remember P4s correctly). It does depend a LOT on which version of the Celeron you have. I also caution (from personal experience) that some die revisions of the P4 FPU have an alternating round-up/round-down sequence. There are (again, I think) some FPU control codes you can send to control this. I do not know if this is still the case in current CPUs.

The lack of predictability in the FPU was the main reason I went to AMD processors. Correct, Incorrect, or otherwise, I would always get repeatable results.

Hope that helps.

/* Edit
Would someone with current Intel P4/Xeon knowledge help out here please? I can't be of any further assistance. Tnx.
End Edit */

Greenbank
03-13-2006, 04:55 AM
2.5GHz G5 PPC running MacOS X 10.4.

p = 1013542810000027 @ 1007kp/s
p = 1013542820000057 @ 1005kp/s
p = 1013542830000097 @ 1004kp/s
1013542836121463 | 24737*2^26816311+1 (xmod 421896))
p = 1013542840000217 @ 1008kp/s
p = 1013542850000281 @ 1005kp/s
p = 1013542860000347 @ 1004kp/s

That's 1Mp/sec at p ~= 1000T.

Once again thanks to Rogue (Mark R) for his amazing work on this.

Chuck
03-13-2006, 07:44 PM
Joe, Alex, Mark,... and anyone working with/on sieves:

Question for all:

Given we are trying to support projects at various stages of maturity (and/or 'Trial P' magnitude).... any thoughts as to which ranges we use as a 'community acceptable' set of P ranges ?

I'm suggesting we use multiple P ranges for our code to ensure that the target platform, X86, PPC, SGI, Alpha, etc... provides us with a consistent profile so we can create / tune software to be the best possible.

I toss out as an example: P = 10E4, 10E6, 10E8, 10E12, 10E18, 10E20, etc. (This is not mathematically accurate, but I hope it conveys the 'target' points of interest.)


Thoughts? (open to all)

Chuck

Greenbank
03-14-2006, 05:47 AM
Do you mean picking a set of standard ranges for testing and comparing speeds?

If so, I'd recommend these values:-

../proth_sieve_current -d -f 275000000000 275050000000
../proth_sieve_current -d -f 4398050000000 4398100000000
../proth_sieve_current -d -f 70368750000000 70368800000000
../proth_sieve_current -d -f 1125899940000000 1125899990000000
../proth_sieve_current -d -f 18014398510000000 18014398560000000
../proth_sieve_current -d -f 288230376160000000 288230376210000000
../proth_sieve_current -d -f 4611686018430000000 4611686018480000000

They are all 50M ranges just above the 38, 42, 46, 50, 54, 58, 62 bit boundaries.

I still have a bit of work to do before my version can handle all of these values automatically. It does everything up to 50 bits fine, but I need to extend MAX_PRIME and the aallprimes array separately for larger invocations.

I'd also suggest we publish results for different dat files, i.e.

SoB
PSP (without the extra k value that is needed to fiddle the parameters)
SoB+PSP combined
Riesel

Way down on my list of things to do is to allow mixed sieving of SoB + Riesel, but I'd still like to get this done...

Chuck
03-14-2006, 01:06 PM
Alex,
Yes, just below and above the boundaries and/or other conditions.

It would serve as our test set to make sure we all have the same
conditions to work from. I think input from the existing project leads
would help too.

I thought about this in the car while driving home. I was trying to think of a standardized test set that we have all had a chance to look at and all agree covers all the conditions a sieve would have to handle (range-related). It would make regression testing a no-brainer operation.

Having a set like this would allow us to make fixes and improvements
while still ensuring we have not lost any functionality.

I think we (the community at large) really need and will benefit from having such a set. Developers would know what limits (sizes, etc.) need to be handled and where problems are likely to come up. I think of it as our 'checklist' before releasing a sieve for community use/testing.

Sound like a nice QA-like idea?

C.

Greenbank
03-14-2006, 01:30 PM
Yup, sounds perfect.

I also added a separate feature into my version of the code.

Instead of sieving primes and finding the p-1 factors, next_sieved_p() reads from a flat file containing p and its p-1 factors. It then places these in the appropriate arrays (pminusfactors, asievewindowbits, etc) and runs as normal.

I've downloaded the results.txt.bz2 file and found all of the p-1 factors (that are < 64 bits). I then run through every single previously found factor and ensure that it is output where appropriate (the results.txt.bz2 file contains factors for entries no longer in the current .dat file!).
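The regression idea described here, that every previously found factor must still be reported, rests on the fact that a claimed factor is cheap to verify directly: p divides k*2^n + 1 exactly when k*2^n is congruent to -1 mod p. A sketch of such a checker (names are illustrative, not proth_sieve's):

```c
#include <stdint.h>

/* Verify a claimed factor p of k*2^n + c (c = +1 Sierpinski/Proth,
 * c = -1 Riesel) with a powmod; the basis of a regression check over
 * an old results file. Uses __int128 for a portable exact mulmod. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p) {
    return (uint64_t)(((unsigned __int128)a * b) % p);
}

static uint64_t powmod(uint64_t base, uint64_t e, uint64_t p) {
    uint64_t r = 1 % p;
    base %= p;
    while (e) {
        if (e & 1) r = mulmod(r, base, p);
        base = mulmod(base, base, p);
        e >>= 1;
    }
    return r;
}

/* 1 if p divides k*2^n + c, 0 otherwise */
static int divides_proth(uint64_t p, uint64_t k, uint64_t n, int c) {
    uint64_t v = mulmod(k % p, powmod(2, n, p), p);
    return v == ((c == 1) ? p - 1 : 1);
}
```

Pointed at a factor log, this confirms each line (e.g. the "926804780689751 | 55459*2^2415946+1" report earlier in the thread has exactly this form) independently of the siever that produced it.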

Greenbank
03-21-2006, 06:08 AM
Latest speed; still got a bit to go. Most importantly, I've removed the need to link against GMP. :-)

p = 1013542810000027 @ 1065kp/s
p = 1013542820000057 @ 1063kp/s
p = 1013542830000097 @ 1064kp/s
1013542836121463 | 24737*2^26816311+1 (xmod 421896))
p = 1013542840000217 @ 1064kp/s
p = 1013542850000281 @ 1063kp/s
p = 1013542860000347 @ 1060kp/s
p = 1013542870000363 @ 1061kp/s

Greenbank
03-22-2006, 06:53 AM
Almost there, just have to tidy up the code a bit and remove the code that limits it to 2^50 (as it can do up to 2^63).

p = 1013542810000027 @ 1081kp/s
p = 1013542820000057 @ 1063kp/s
p = 1013542830000097 @ 1079kp/s
1013542836121463 | 24737*2^26816311+1 (xmod 421896))
p = 1013542840000217 @ 1079kp/s
p = 1013542850000281 @ 1075kp/s
p = 1013542860000347 @ 1075kp/s

Greenbank
03-22-2006, 01:19 PM
Nice specs indeed.

Here I'm running a dual AMD Opteron, 2.4GHz dual-core machine w/ CL3 ECC, Gentoo on 2.6.13. I'm still working on the 32/64-bit port. I have some work to do thanks to GCC3 -> GCC4 and the 'twist' Red Hat put in FC4, breaking a lot of my 32/64-bit portability, but the additional registers I get in -m64 mode are making things nice across the board. :)

The other dual Opteron is running Windows / FC4 dual boot.

I have both 32 and 64 bit versions in test atm. Here's what I get with the 991-50M .dat file:

32-bit results snipped by Greenbank

chuck@innsbruck ~/prs
$ ./prs092_64 -S
PRS (64bit) v0.92a for processor type X86_64/EM64T
Found SoBStatus.dat with unfinished work.
Continuing from last save point.
Setting priority to nice.
Running in Sierpinski mode
Starting setup.
Expected 8 k values.
Found 8 k values.
Done.
Setup took 3.460346 seconds.
Starting sieve.
P Range 1800893475068332 <= p <= 1800898200000000
N Range 991 <= n <= 50000000
pmin=1800893542177193 @ 958kp/s
pmin=1800893609286029 @ 958kp/s
pmin=1800893676394921 @ 959kp/s
pmin=1800893743503781 @ 957kp/s
pmin=1800893810612647 @ 960kp/s
pmin=1800893877721483 @ 959kp/s
pmin=1800893944830361 @ 959kp/s


Same range as yours...

Sieving 1800893475068332 <= p <= 1800898200000000, 991 <= n <= 50000000
p = 1800893485068407 @ 1067kp/s
p = 1800893495068439 @ 1075kp/s
p = 1800893505068443 @ 1074kp/s
p = 1800893515068457 @ 1075kp/s
p = 1800893525068483 @ 1075kp/s

Figures for other bit sizes:-

36 bits: from 68720000000 = 998 kp/s
38 bits: from 275000000000 = 1052 kp/s
42 bits: from 4398050000000 = 1102 kp/s
46 bits: from 70368750000000 = 1089 kp/s
50 bits: from 1125899940000000 = 1111 kp/s
54 bits: from 18014398510000000 = 1000 kp/s

58 bits is giving me a bit of a problem, not least because it takes up large amounts of memory for the prime array. Given that a 2^54 limit gives us 15 times as much search space as 2^50, I'm happy capping this version at 2^54.

Greenbank
03-22-2006, 02:46 PM
*sigh* just when you think you've done everything I go and fiddle with some of the values and find that it can go slightly faster.

Up to ~1100kp/s now (for 1800893475068332 to 1800898200000000) mainly by reducing trial_primes down to 2 and only keeping a record of 2 p-1 factors for SPH in pminus1factors[].

Going to the pub to stop me playing even more.

Matt
03-22-2006, 03:00 PM
I suppose you might get to the point where the time taken to find new speed improvements could be offset by the extra work that could be done while you're tweaking.

Greenbank
03-22-2006, 03:15 PM
Indeed, but I'm sieving an 8T combined SoB+PSP range on the machine anyway (split across the 4 CPUs) in the background.

I only run the program in short bursts, although the final test suite uses one CPU for 4 or 5 hours.

All of the tweaks I've made have ruined the performance for Riesel Sieve, yet another thing to fix.

Bah, neverending.

Chuck
03-22-2006, 06:39 PM
Nice results... My compliments on the G5 ASM coding you've done.

If I may suggest though, please do keep an eye on cutting back the trial primes value.
It helps in places and hurts in others. The balance across the projects is the kicker.

C.

rogue
03-22-2006, 10:01 PM
Nice results... My compliments on the G5 ASM coding you've done.

If I may suggest though, please do keep an eye on cutting back the trial primes value.
It helps in places and hurts in others. The balance across the projects is the kicker.

C.

Thanks. Fortunately the ASM is limited to three places. I wonder if a similar method could be used on the AMD-64 to gain any speed.

Unfortunately, it is my sieving code that wastes so much memory. It would be possible to reduce the number of primes in the array (thus testing with a few composites) to reduce the memory requirements. If the number of additional tests is < .01 %, it is probably worth it on memory strapped machines.
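The memory-vs-extra-tests tradeoff rogue describes can be quantified: instead of storing every base prime, store (or enumerate) a cheap superset and accept a few composites as candidates. A mod-30 wheel is the simplest example; this sketch (illustrative, not the siever's data structure) counts how many wheel candidates are actually composite.

```c
#include <stdint.h>

/* Trade memory for a few wasted tests: enumerate the 8 residues mod 30
 * that are coprime to 2*3*5 instead of storing a prime list, then count
 * how many candidates up to `limit` are composite (i.e. wasted work). */
static int is_prime(uint32_t n) {
    if (n < 2) return 0;
    for (uint32_t d = 2; (uint64_t)d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

static const uint32_t wheel30[8] = {1, 7, 11, 13, 17, 19, 23, 29};

/* Count wheel candidates in (6, limit] and how many are composite. */
static void wheel_stats(uint32_t limit, uint32_t *cand, uint32_t *comp) {
    *cand = *comp = 0;
    for (uint32_t base = 0; base < limit; base += 30)
        for (int i = 0; i < 8; i++) {
            uint32_t n = base + wheel30[i];
            if (n <= 6 || n > limit) continue;   /* skip 1 and overflow */
            (*cand)++;
            if (!is_prime(n)) (*comp)++;
        }
}
```

Up to 100, for example, the wheel yields 25 candidates of which only 3 (49, 77, 91) are composite; the fraction of wasted tests shrinks as the limit grows, which is why accepting composites can be a good deal on memory-strapped machines.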

I've noticed that there are some oddities with the compiler (gcc) that cause performance to die with -O3. The performance for -O3 was less than 10% of -O2, which is really odd. I assume it has something to do with how memory is used, but I can't say for certain without profiling the code.

Chuck
03-22-2006, 10:32 PM
Thanks. Fortunately the ASM is limited to three places. I wonder if a similar method could be used on the AMD-64 to gain any speed.

Unfortunately, it is my sieving code that wastes so much memory. It would be possible to reduce the number of primes in the array (thus testing with a few composites) to reduce the memory requirements. If the number of additional tests is < .01 %, it is probably worth it on memory strapped machines.

I've noticed that there are some oddities with the compiler (gcc) that cause performance to die with -O3. The performance for -O3 was less than 10% of -O2, which is really odd. I assume it has something to do with how memory is used, but I can't say for certain without profiling the code.

Rogue,
If you are using GCC 3 (which sounds like the case), then your observations are consistent with mine. You are actually dealing with a GCC code generator issue and not a code issue. I recommend -O2 -march=xxx and then the appropriate '-f' options..... and/or switch to GCC 4, but expect COMPLETELY different stack frame construction. That is my biggest headache: GCC3 and Microsoft are very common. GCC4 and MS are different, but the code generator/optimizer improvement is worth it.

Given the size of the arrays we need to use, I would recommend using the additional memory and doing the testing. Yes, you will take a small hit in speed for high P values, but conversely you will gain in the lower P ranges. Remember Riesel... It will be your guide to performance. Get Riesel screaming and the SoB code will fall into place perfectly.

I believe this will also make your eventual G4 port easier, as well as making performance more linear and consistent. I suggest the use of the G4/G5 equivalent to the X86 'Prefetch' to compensate. You are going to 'cache bust' no matter what you do when you have the arrays balanced. It can't be helped. Just 'plan ahead'.

I think you've got a nice strong base for Sob, but experience has taught me to use the technique above to help with all processors, memory configs and linux kernel versions. Riesel's high K count will be a tough case to crack. I suggest you emphasize there. Make the array data 'flow' like a stream through the caches and cpu. Doing so will get you up to the efficiency you need and make the G4 performance fall into line where you want it.
We have only 2 small ASM routines, with C versions used most of the time. The ASM routines help with the 128 bit math. The code otherwise still hauls with 64 bit based C in 32 bit mode. IMHO, I suggest that be your basis for planning... Think 64, write in 32, #ifdef 64 bit vs 32 bit processors. My previous posts show this. This is the 'processor type' field which is printed out.

Please do try to take the ASM and minimize it or even eliminate it. We have a lot of platforms and OS's to port to, but leave the hooks for ASM. The X86 series exists in a fully C version with SSE and CMOV assembly routines as 'benefits'. You know the game, just use what you know.

For a goal, shoot for SoB of 2400+ kp/sec in both 32 and 64 bit mode while still doing 64 & 128 bit math. That is what I use. You should find you can exceed that value at high P easily and it's got a nice broad 'Q' quality to it to maintain performance for projects running at low P (1G and above). Watch the code in operation... you know the cpu, just smooth it out for large and small P, and everyone gets what we need.
With luck, we will be able to put any ASM routines (Gx / x86 / other) in as 'extras' for all to use as needed. That is another goal, but a good solid fast C version first is your target.

Hope that helps.

C.

rogue
03-23-2006, 12:22 PM
If you are using GCC 3 (which sounds like the case), then your observations are consistent with mine. You are actually dealing with a GCC code generator issue and not a code issue. I recommend -O2 -march=xxx and then the appropriate '-f' options..... and/or switch to GCC 4, but expect COMPLETELY different stack frame construction.

I am using gcc 4 (4.0.1 to be more precise).

I couldn't care less about G3/G4 implementations. I will leave it to Alex to take my ASM code and convert it for a 32-bit implementation. If necessary one can use GMP for those routines, which would at least abstract the asm from the casual person who wants to build the code. I don't know how easy/difficult it would be to eliminate GMP and ASM altogether.

Clearly there are some caching issues. I have had no time to work them out. Alex has been doing the vast majority of coding and testing. He's been using my ASM routines and I've been pointing out some optimizations along the way. If I have time this weekend, I intend to dig more into the memory issues and will try to find a happy medium. BTW, I hadn't thought about prefetching. I will have to think about where it would make the most sense to do it.

Mystwalker
03-23-2006, 03:41 PM
Did you consider trying the Intel compiler? AFAIK, it produces code that is a few % faster, is available for Windows & Linux and free for personal use. I think the syntax is a bit different sometimes, though.

rogue
03-24-2006, 08:20 AM
Did you consider trying the Intel compiler? AFAIK, it produces code that is a few % faster, is available for Windows & Linux and free for personal use. I think the syntax is a bit different sometimes, though.

You must be talking to Chuck. PowerPC is an IBM CPU, not an Intel CPU.

Greenbank
03-24-2006, 08:33 AM
I've used the Intel compiler (for x86) as I hadn't got the gccfixer stuff from Mikael and was trying to compile proth_sieve on x86.

It wasn't any faster than the existing binaries.

Matt
03-24-2006, 09:11 AM
Big question is, when are we gonna get this new, faster x86 siever?!

Joe O
03-24-2006, 12:18 PM
Big question is, when are we gonna get this new, faster x86 siever?!

What new faster x86 siever?

Matt
03-24-2006, 02:23 PM
I thought some people were working on improving the x86 siever... no?

Joe O
03-24-2006, 04:43 PM
I thought some people were working on improving the x86 siever... no?
Improving, Yes.
It goes to 2^52, page faults less, has yet to go into a loop or leak memory.

Matt
03-24-2006, 08:49 PM
Improving, Yes.
It goes to 2^52, page faults less, has yet to go into a loop or leak memory.

And I heard there were significant speed improvements, is this true? Any idea when it will be out?

rogue
03-24-2006, 08:51 PM
Clearly there are some caching issues. I have had no time to work them out. Alex has been doing the vast majority of coding and testing. He's been using my ASM routines and I've been pointing out some optimizations along the way. If I have time this weekend, I intend to dig more into the memory issues and will try to find a happy medium. BTW, I hadn't thought about prefetching. I will have to think about where it would make the most sense to do it.

I figured out how to resolve the thrashing issue, but I don't understand what causes it. When I pull some of the functions out and put them into a separate file, I am then able to compile with -O3 and get a 5% gain over -O2. Is gcc inlining one of those functions? If so, why is it killing performance so badly? Maybe one of you guys knows gcc much better than I do and can answer the question.

Chuck
03-24-2006, 09:20 PM
Mark,
Check the symbol table addresses of the functions in the external file for starters and see if you are getting page alignments. Conversely, pulling out the functions will allow the functions 'above' and 'below' the removed one(s) to potentially load and be put on the same page and/or fit in cache at the same time. Locality of reference with respect to cache is key.

You have control of inlining with GCC as you know, and it shouldn't inline functions unless you have a default CFLAGS set to do so or specify it in code or on the command line. Check the '-O3' flags set for your processor. It might do an 'auto inline' as you suspect. You can override it of course on the command line.
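The per-function override Chuck alludes to can be pinned down with gcc's noinline attribute, which gets the "separate file" effect without actually splitting the source. A hedged sketch (the function body is a stand-in of my own, not siever code):

```c
#include <stdint.h>

/* Mark a hot routine noinline so -O3's auto-inlining (and the code
 * layout changes that come with it) cannot touch it; the coarser
 * command-line equivalent is -fno-inline-functions. The body here is
 * just a cheap stand-in for a hot inner sieve routine. */
static __attribute__((noinline)) uint64_t sieve_step(uint64_t x) {
    return x * 2862933555777941757ULL + 3037000493ULL;
}

static uint64_t run(uint64_t seed, int iters) {
    uint64_t x = seed;
    for (int i = 0; i < iters; i++)
        x = sieve_step(x);   /* stays an out-of-line call at -O3 */
    return x;
}
```

Comparing -O2 vs -O3 builds of code structured this way isolates whether the slowdown really is inlining-driven layout, as rogue's separate-file experiment suggests.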

The X86 version behaves differently, but still is largely 'cache centric'. Cache row size vs memory bus interface is the other. The number of cycles required to get a cache row into the cpu makes a big difference. The AMD and Intel versions have major differences right here. I suspect you are seeing similar.

Email me if you wish and we can discuss off board.

C.

rogue
03-25-2006, 11:43 AM
Chuck,

Since Alex (who is aware of my findings) is leading the effort on the PPC port, I'll leave it to him to find a solution. To me it isn't an issue since this workaround works perfectly well.

Death
01-22-2007, 10:08 AM
Where can I download the G5 siever?

got a few macs around

cedricvonck
01-23-2007, 03:10 AM
I am also interested in a version, but for the Intel MacBook Pros.

Thank you