Hello,

I'm trying to knock up some core multi-precision math asm routines for a G4 (32-bit Apple mac) processor and I'm stuck on the divide/mod algorithm.

I've written one that does shifting and subtraction but it's slower than the algorithm used by GMP and I'm trying to work out how that works. If you've ever seen the source to GMP it's not that easy to work out.

I think the algorithm works something like this for a 32-bit machine:-

1. Get the most significant 16-bits of the divisor. Keep a note of how many bits you had to shift left/right to get this. Add one to it. (You want to make sure that the values from the successive divides, when multiplied by the original divisor, are less than the original number).

Take the number to be divided and find the most significant 32-bits, store this one register and keep note of how many left shifts were needed to get a 1 in the most significant bit.

Divide this number by the top 16 bits of the divisor.

Multiply this result by the original divisor. Making sure you use a multiply routine that is specifically coded for smaller operands. Why use a 64x64=128 bit routine when you know you'll be doing 32x64=96 bits.

Shift the number up by the appropriate number of bits (until it is slightly less than the current number).

Subtract it from the original number.

Lather rinse and repeat.

Here's an example...

a=382793255223295 89125,4294967295
b=129196911230775 30080,4294967095
p=882861709928411 205557,1117464539

So I want the top 16 bits of p, so I shift the top word right twice to get:

tmp_divisor=205557>>2 == 51389

then add one.

tmp_divisor++; // = 51390

a*b = 49455706214823442715900903625 = 0,2680999205,4277022889,201

Divide the top 32 bits of the answer by tmp_divisor (luckily they are all in one register already):-

2680999205 / 51390 = 52169

52169*p*2^30 = 49454414400157279869485449216

(The 2^30 comes from the fact that I'm working 2^32 bits higher up but I had to shift tmp_divisor right twice, 32-2=30.)

subtracting this from the original number I get:-

curr = 1291814666162846415454409 = 0,70029,1775432620,1073742025

Do it again...

Top 32-bits of curr is 2294723817.

2294723817 / 51390 = 44653

44653*p*2^15 = 1291793987450743566598144

subtracting this from curr I get:-

curr2 = 20678712102848856265

Again, although a slight twist since we are almost there. I only need the top 31-bits:

Top 31-bits of curr is 1203659462

1203659462 / 51390 = 23422

23422*p = 20678386969943242442

Subtracting this from curr2 I get:-

325132905613823

Which is less than 882861709928411 so it's the answer!

I'm starting to code this up in assembly but I'm still trying to get my head round it. I prefer to understand things properly before starting. Oh, and I've just found out that I'm off to Stockholm for work. In 5 hours. There's goes playing football this evening.