multi precision divide on 32-bit architecture

**Greenbank** · 11-01-2005, 06:09 AM

Hello,

I'm trying to knock up some core multi-precision math asm routines for a G4 (32-bit Apple mac) processor and I'm stuck on the divide/mod algorithm.

I've written one that does shifting and subtraction but it's slower than the algorithm used by GMP and I'm trying to work out how that works. If you've ever seen the source to GMP it's not that easy to work out.

I think the algorithm works something like this for a 32-bit machine:-

1. Get the most significant 16-bits of the divisor. Keep a note of how many bits you had to shift left/right to get this. Add one to it. (You want to make sure that the values from the successive divides, when multiplied by the original divisor, are less than the original number).

Take the number to be divided and find the most significant 32-bits, store this one register and keep note of how many left shifts were needed to get a 1 in the most significant bit.

Divide this number by the top 16 bits of the divisor.

Multiply this result by the original divisor. Making sure you use a multiply routine that is specifically coded for smaller operands. Why use a 64x64=128 bit routine when you know you'll be doing 32x64=96 bits.

Shift the number up by the appropriate number of bits (until it is slightly less than the current number).

Subtract it from the original number.

Lather rinse and repeat.

Here's an example...

a=382793255223295 89125,4294967295
b=129196911230775 30080,4294967095
p=882861709928411 205557,1117464539

So I want the top 16 bits of p, so I shift the top word right twice to get:

tmp_divisor=205557>>2 == 51389

then add one.

tmp_divisor++; // = 51390

a*b = 49455706214823442715900903625 = 0,2680999205,4277022889,201

Divide the top 32 bits of the answer by tmp_divisor (luckily they are all in one register already):-

2680999205 / 51390 = 52169

52169*p*2^30 = 49454414400157279869485449216

(The 2^30 comes from the fact that I'm working 2^32 bits higher up but I had to shift tmp_divisor right twice, 32-2=30.)

subtracting this from the original number I get:-

curr = 1291814666162846415454409 = 0,70029,1775432620,1073742025

Do it again...

Top 32-bits of curr is 2294723817.

2294723817 / 51390 = 44653

44653*p*2^15 = 1291793987450743566598144

subtracting this from curr I get:-

curr2 = 20678712102848856265

Again, although a slight twist since we are almost there. I only need the top 31-bits:

Top 31-bits of curr is 1203659462

1203659462 / 51390 = 23422

23422*p = 20678386969943242442

Subtracting this from curr2 I get:-

325132905613823

Which is less than 882861709928411 so it's the answer!

I'm starting to code this up in assembly but I'm still trying to get my head round it. I prefer to understand things properly before starting. Oh, and I've just found out that I'm off to Stockholm for work. In 5 hours. There's goes playing football this evening.

**Greenbank** · 11-03-2005, 03:04 PM

Ignore this.

I now have some appropriate reading material (Chapter 14 of Handbook of Applied Cryptography (HAC)).

"Efficient Implementation" detailing how to do powmod, mulmod, Chinese-Remainder-Theorem and GCD quickly.

That covers over 99% of the code that a siever performs. Woo.

**Greenbank** · 11-04-2005, 06:56 AM

The HAC is excellent. A good read over a couple of pints of Guinness and I now understand Montgomery Reduction and Montgomery Multiplication.

Took me a few reads to work out how to quickly go from the output of xy(R^-1) to xy without dividing or performing another modulus but I got it eventually.

Will see how quick this is to implement in C then G4 assembly.

**Greenbank** · 11-04-2005, 12:53 PM

One last tiny bug with carries fixed and it's lovely and quick...

gmp_mul() followed by gmp_mod() to calculate a*b mod p.

10M iterations in 3.6 seconds

My 32-bit mulmod_mont() call. (Including setup time to calculate yR and (-m)^-1 mod b using mpz_invert() etc)

10M iterations in 0.710 seconds.

So it's only slightly slower than my original mul_64_64_asm() function which did 10M iterations in 0.630 seconds (but of course it didn't do the mod part of it).

I likey....

Thanks you Rogue for pointing me in the direction of Montgomery multiplication...

**rogue** · 11-04-2005, 02:26 PM

Originally posted by Greenbank
Thanks you Rogue for pointing me in the direction of Montgomery multiplication...

You're welcome. I hope you give me an opportunity to profile your code in an attempt to further optimize it.

Will your program have the ability to start sieving a range of k/n from scratch like NewPGen? Will it have the ability to output in ABC format for LLR? I'm asking because it would be nice to use it for general purpose k*2^n+/-1 sieving beyond the SOB or RieselSieve projects.

**Greenbank** · 11-04-2005, 02:36 PM

Originally posted by rogue
You're welcome. I hope you give me an opportunity to profile your code in an attempt to further optimize it.

Will your program have the ability to start sieving a range of k/n from scratch like NewPGen? Will it have the ability to output in ABC format for LLR? I'm asking because it would be nice to use it for general purpose k*2^n+/-1 sieving beyond the SOB or RieselSieve projects.

As soon as I'm happy with it I'll let you know. I've still got an occasional carry bit bug and about 1 in every 1000 results are out by 1. Hmm.

[EDIT] Was using 'cmpw' instead of 'cmplw' to compare unsigned integers. Doh. Fixed now. Ran a 1M range of sieving with both mulmod algorithms in place (my ASM and a generic GMP algorithm. No different results! )

I'm doing this for several reasons:-

1. I want to learn PPC assembly.
2. I want to learn the maths behind making things fast (i.e. Montgomery Multiplication, etc).
3. The assembly routines I write will be contributed to the sieve project that Joe_O and Chuck are working on.

I'm also getting a PowerMac and will have fun porting bits of code to 64-bit G5.

I don't have any real plans to release a seperate Mac sieve program. I'll leave that to Joe_O and Chuck to release a Mac version of jjsieve. If, however, they don't release a Mac version and I get mine stable and tested then I would release it.

I have no plans to allow sieving from low n. My code is based on proth_sieve (by Mikael Klasson) and so I've inherited all of his limitations! At the moment I'm working on the internal core maths functions but I do plan to go over every bit of his code and rewrite it to remove as many limitations as possible.

I too would like to combine SoB and Riesel (and PSP and 321) sieving, however there are various bits in the code that need to be examined as they rely on knowing whether the numbers are + or - 1.

**Greenbank** · 11-06-2005, 11:37 AM

I should point out that the 10M iterations of my mulmod_mont() call in asm only does the initial setup once (and it is quite expensive).

Doing 10M iterations, including the setup each time, takes just over 23 seconds (compared with .7 seconds doing the setup once).

So assuming I can write a standard mulmod that is as fast as GMP then that will do 10M iterations in 3.6 seconds.

So the break even point is as follows. Make the numbers easier by multiplying by 10 (and assuming they actually took 10M as much time on top of that!):-

standard GMP-style mulmod time: 36 (3.6 seconds * 10)
mulmont setup time: 230 (23.0 seconds * 10)
mulmont exec time: 7 (0.7 seconds * 10)

At 7 iterations GMP style would take 7*36 = 252
At 7 iterations mulmod_mont() in asm: 230+(7*7)=279

At 8 iterations: GMP style = 8*36 = 288
At 8 iterations: mulmod_mont() = 230+(8*7) = 286

so if I have more than 7 iterations to perform it is worth going the Montgomery Multiplication method, otherwise I should stick to standard mulmod. This is quite easy to check in the various hash_table creation blocks (standard-BSGS, SPH_IDEA, CHECK_FACTORS_SPH) as I know exactly how many calcualtions I need to make before I start.

This boundary is set to change as:

1) I haven't implemented the GMP style mulmod in G4 assembly yet. (Still trying to work out exactly how it works!)
2) I haven't implemented the Montgomery Multiplication setup stage in G4 assembly yet. I'm still using mpz_invert() to find (-m)^-1 (mod b).

**Greenbank** · 11-08-2005, 10:25 AM

Hurrah, my mulmod_32bit_128_64() assembly routine looks like it is working properly. It's just trying 10,000,000 random numbers and comparing the results against the GMP algorithms. No errors (although I'll give it a harder test later on).

Times for doing 10M iterations:-

GMP: mul = 2.036 sec
GMP: mod = 3.278 sec

ASM: mul = 0.616 sec
ASM: mod = 1.575 sec

So a mulmod would be:-

GMP: 5.314 sec
ASM: 2.191 sec

2.4 times faster!

Now I have to rip the code out and make nice functions for mul, mulmod, mulmod_mont and sqrmod.
Then I have to use these to make pure assembly pow2mod and powmod functions.
Then I have to write the hash table creation blocks in pure assembly using the above code.

Then test the **** out of it!

Then get a PowerMac G5 and port it all to 64-bit.

**Mystwalker** · 11-08-2005, 02:25 PM

Sounds nice!

I would concentrate on borderline testing (numbers near (--> in front of resp. after) certain borders), e.g. the biggest possible numbers, numbers where internal behavior changes etc.

**Greenbank** · 11-08-2005, 03:42 PM

Yup, I've got two test harnesses:-

One harness picks random numbers and checks that my routines produce the same results as GMP does).

The other harness is proth_sieve itself. I run each routine against the GMP routines and compare the results for every computation. I just set it going to sieve for 10 minutes and see if any of the computed values differ.

Knocked up a pow2mod() call in 32-bit G4 assembly.

14-bit exponent (15840). Time for 2,000,000 calls:-

GMP mpz_powm( 2, 15840, p ) = 10.785 seconds
My pow2mod asm(2,15840,p) = 5.062 seconds

48-bit exponent (158401584015840). Time for 2,000,000 calls:-

GMP mpz_powm: = 33.635 seconds
My pow2mod asm= 16.116 seconds.

Roughly twice as fast!

Have to look at it again as I've had to use some nasty hacks as I was running out of registers (scary when you think there are 32). Luckily there are some to be had when I move it over to sqrmod() and fix my horrible bl/blr problems.

powmod( b, e, p) where b != 2 next.

Then the routines to do a^-1 (mod p).

Todays show is brought to you by the numbers 2, 15840 and the G4 assembly instruction "cntlzw".

**rogue** · 11-08-2005, 05:03 PM

Originally posted by Greenbank
GMP mpz_powm: = 33.635 seconds
My pow2mod asm= 16.116 seconds.

Roughly twice as fast!

Have to look at it again as I've had to use some nasty hacks as I was running out of registers (scary when you think there are 32). Luckily there are some to be had when I move it over to sqrmod() and fix my horrible bl/blr problems.

powmod( b, e, p) where b != 2 next.

Then the routines to do a^-1 (mod p).

Todays show is brought to you by the numbers 2, 15840 and the G4 assembly instruction "cntlzw".

Very nice on the speed.

BTW, you can use stw/lw to save/restore registers in memory if you need additional registers. It might save you from having to do anything too nasty. You only have to be careful about managing the stack when you do so.

**Matt** · 11-08-2005, 08:06 PM

All this is way over my head but I'm assuming all this is just for Macs? Any chance of this being applicable to us i386ers?

**Greenbank** · 11-09-2005, 06:24 AM

Rogue,

I'm being a bit cheeky in the pow2mod routine. I'm just trying to things hacked together quickly and so I've not been paying much attention to careful register allocation (and reuse).

I was 1 register short in the routine so I stored a temporary loop counter (which I should move over the the ctr register) in the memory used to store the final answer, whilst I called the sqrmod() block and the grabbed it back after the sqrmod.

If you're interested:
http://www.greenbank.org/sob/macasm/asm_mulmod.c
http://www.greenbank.org/sob/macasm/asm_pow2mod.c

I'm not going to apologise for them, I'm learning PPC assembly as I go.

Matt,

Chuck and Joe_O are working on the x86 stuff, I thought I'd have a go at seeing how fast I can get proth_sieve to run on a 32-bit Mac (and eventually a 64-bit Mac).

**Greenbank** · 11-28-2005, 09:38 AM

My G5 was delivered today, a bit of a surprise. Last time I checked it had an estimated ship date of 12th December.

Woohoo.

**Matt** · 11-28-2005, 01:27 PM

Congrats, it sounds like an absolute beast!

**Mystwalker** · 11-28-2005, 04:21 PM

I think you have to change your sig.

**rogue** · 12-01-2005, 10:26 AM

Is there a thread or a website dedicated to the jjsieve effort? I would like to know more about the features it will support.

**hhh** · 12-01-2005, 11:34 AM

Originally posted by rogue
Is there a thread or a website dedicated to the jjsieve effort? I would like to know more about the features it will support.

To make it a bit difficult:
head over to mersenneforum.org
go to the PSP section
Check for a thread intitulated Which program
There you find a link.

Have fun. H.

**Greenbank** · 12-01-2005, 11:37 AM

There's a bit of discussion over at the Riesel Sieve forum.

http://rieselsieve.com/forum/viewtopic.php?t=452

There's also some discussion in the "Distributed Framework Discussion" forum although this is a private forum. I'll drop a note in there saying that you might be requesting access and that granting you access would be useful. :-)

I haven't seen any of the source code, from what I've read they've put a lot of effort into cleaning up the existing proth_sieve code, removing the C++ vector/map parts that caused lots of page faults, lots of #ifdefs in there to allow assembly for many different architectures to be dropped in, fallback to native integer (somewhat optimised), etc.

I'm going on holiday for a week on Saturday so my G5 coding for proth_sieve will have to wait.

**Matt** · 12-01-2005, 12:39 PM

some disturbing revelations from jasong in that thread, but altogether sounds promising, a new client over the holiday season maybe?

**jasong** · 12-01-2005, 08:22 PM

Originally posted by Matt
some disturbing revelations from jasong in that thread, but altogether sounds promising, a new client over the holiday season maybe?

Being a paranoid skizophrenic, I immediately headed over to see what my "disturbing revelations" were. I'm guessing it's the long time between updates revelation you're referring to?

I scare super-easy, lol.

**Greenbank** · 12-02-2005, 07:25 AM

It was more your revelations from childhood. ;-)

**hhh** · 12-02-2005, 12:01 PM

Originally posted by Greenbank
It was more your revelations from childhood. ;-)

Could there be a link with the fact of being paranoid skizophrenic, as you say?

==>

?

No seriously, I deeply respect your problems, if they are problems, but I'd rather say: take it easy, if you can, it might help just to relax. It helped me often.
Believe me, there is no harm meant; I am the last one to make jokes about serious mental illness. I hope you are just fine, in spite of your traumatic experience.

H.

**jasong** · 12-02-2005, 10:29 PM

Originally posted by Greenbank
It was more your revelations from childhood. ;-)

I'll have to look again, I've written a lot of stuff in that forum.

I need to get myself a social life. Maybe my luck will change in January when the Spring Semester starts.

Edit: Crud, I think I've found it. The things I write to get attention. Although, it's true when they say you never forget your first true love.

**CaptainMooseInc** · 12-03-2005, 06:39 AM

The new proth_sieve client that Chuck is working on is going to be in pure C. He's getting rid of all ASM code and C++ and running it pure C. This will make it even easier to port across platforms. Speeds at the moment (for a Riesel Sieve based run) is around 1000kp/s. Proth_Sieve for most RS users is around 106kp/s. It's getting close to being able to be tested by a select few for stability. Once it's done it will be released and the world will be a better place. Plus the new Proth_Sieve won't skip over potential primes like the current one does. More discussion on it can be found on the RieselSieve forums (as stated above) in the Distributed Framework Discussion. If you are willing to sign up and PM bryan for approval to view then yeah, you'll enjoy what you read.

-Jeff

**Matt** · 12-03-2005, 02:51 PM

Wow that sounds exciting, obviously you'll let us all know when it's released?

**Mystwalker** · 12-05-2005, 10:55 AM

Originally posted by CaptainMooseInc
Speeds at the moment (for a Riesel Sieve based run) is around 1000kp/s. Proth_Sieve for most RS users is around 106kp/s.

:shocked:

Thread: multi precision divide on 32-bit architecture

Thread Tools

Rate This Thread

Display

multi precision divide on 32-bit architecture

Posting Permissions