
Thread: Math in programs - i would like to learn

  1. #1

    Math in programs - i would like to learn

    Greetings everybody,

    i have a few basic questions about using math in a computer program and i hope to get some help.

    I'm aware that some or all questions may be stupid and i might blame myself, but i have not figured out the answers on my own, and asking google doesn't help either. (A chinese proverb - i think - says: the fool who asks stays a fool for some time, the fool who does not ask stays a fool forever)

    So i post my questions here, because i believe there are people in this forum who know a lot about math.

    // I usually use C/C++ on a 32-bit computer;

    When it comes to programming and math, i'm interested in two things most:
    Primes and taxicab numbers.

    I will start with taxicab first (if you don't know what this is, you might get a start here: http://euler.free.fr/taxicab.htm )

    i have a 32-bit computer, so i assume i can use up to 2^32 bytes of address space, basically 4 GB;

    I know that _int64 exists, so i can use numbers up to 2^64 (sometimes called long).

    Question 1:

    How is this done on a 32-bit machine? I think that my registers (like EAX, EBX, ECX) are only 32-bit, which suggests 2 possible explanations:

    - other registers are used, like the FPU
    - there is a way to 'connect' registers (asm?), but how does it work? I think i might know how to with add, sub, maybe even mul, but what about div ?

    How many times can i 'connect' registers ?

    Question 2:

    I can perform add, sub, mul on a 'digit-per-digit' base.

    Is there a way to do a div digit-per-digit, too?

    For div, i only know to do a loop with subs until i get number1:number2 = something (mod someotherthing) - that doesn't work as intended, of course, if number2 is 'too big'

    Question 3:

    For the taxicab problem, i would need a _int128. Any chance to program it myself without using a BigInt lib or writing a LARGEINT class using char *? (My compiler doesn't have _int128, only _int64)

    (Note: BigInt refers to a lib downloaded from the internet, which i might not understand - and i would like to learn the 'stuff behind it' while LARGEINT refers to my own code using 'strings' to achieve arbitrary large numbers (>> int) - which is so slow)

    Question 4:

    I have written my own LARGEINT class using 'strings'. While i might be able to actually check the largest known prime, i think i'm stuck at 2^32 anyway. Right or wrong? This question goes specifically to a C/C++ program using
    char *temp = new char[SIZE] for the number in question;

    (I know i could somehow go down to bits and achieve 2^32 * 8 but there is still a limit)

    Question 5:

    SoB is calculating insanely big numbers. Some of these numbers are sieved.

    How is this done ? How so fast? What is 'sieve' exactly doing? Is it sieving every possible number starting with the lowest or just 'some' numbers?


    I would greatly appreciate any form of input, be it written here or linked from somewhere. What i would hope to get is specific and usable information, not something like: just check www.wikipedia.org and insert your question into search. I tried that before.



    Regards,


    Chris

  2. #2
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    Chris, only got a few minutes so I'll do a more detailed answer tomorrow.

    Take a look at GMP (http://www.swox.com/gmp). It's an arbitrary precision math library which is easy to use from C or C++.

    I know you said you're not interested in using this kind of thing, but the manual (as a PDF) has lots of discussion about the algorithms used for division and multiplication.

    Also consider investing in a copy of Knuth (Vol I,II and III):-

    ISBN: 0201485419

    It's £75 on amazon.co.uk and $100 on amazon.com.

    But it is well worth it. It is *THE* reference book for these types of questions.

  3. #3
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271

    Re: Math in programs - i would like to learn

    OK, longer response. Comments inline.

    Originally posted by chris
    Greetings everybody,

    i have a few basic questions about using math in a computer program and i hope to get some help.

    I'm aware that some or all questions may be stupid and i might blame myself, but i have not figured out the answers on my own, and asking google doesn't help either. (A chinese proverb - i think - says: the fool who asks stays a fool for some time, the fool who does not ask stays a fool forever)

    So i post my questions here, because i believe there are people in this forum who know a lot about math.

    // I usually use C/C++ on a 32-bit computer;

    When it comes to programming and math, i'm interested in two things most:
    Primes and taxicab numbers.

    I will start with taxicab first (if you don't know what this is, you might get a start here: http://euler.free.fr/taxicab.htm )

    i have a 32-bit computer, so i assume i can use up to 2^32 bytes of address space, basically 4 GB;

    I know that _int64 exists, so i can use numbers up to 2^64 (sometimes called long).


    "long long" is the usual type for a 64-bit integer on a 32-bit machine.

    Most 64-bit compilers will use a 64-bit integer for a long on a 64-bit machine.

    Question 1:

    How is this done on a 32-bit machine? I think that my registers (like EAX, EBX, ECX) are only 32-bit, which suggests 2 possible explanations:

    - other registers are used, like the FPU
    - there is a way to 'connect' registers (asm?), but how does it work? I think i might know how to with add, sub, maybe even mul, but what about div ?

    How many times can i 'connect' registers ?


    Consider addition:-

    A and B are 64-bit integers, you want R=A+B

    First split each of these into two 32-bit words:-

    A=a_low+(2^32*a_high)
    B=b_low+(2^32*b_high)

    r_low=a_low+b_low
    # The above may set the carry flag if there is an overflow...
    r_high=a_high+b_high+carry_bit

    R is now represented by r_low and r_high.

    This can be extended for 96-bit integers:-

    A+B=R

    r_low=a_low+b_low
    # The above may set the carry flag if there is an overflow...
    r_mid=a_mid+b_mid+carry_bit
    # Again the carry bit may be set
    r_high=a_high+b_high+carry_bit

    etc...

    Subtraction is similar. Go from low to high words and subtract with carry.
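
    Here's the same carry trick as a small C sketch (my own function names, not from any library; on x86 the add/adc instruction pair does exactly this in two instructions):-

    Code:
    #include <stdint.h>
    
    /* 64-bit add on a 32-bit machine: numbers held as (low,high)
       pairs of 32-bit words, carry propagated by hand. */
    void add64( uint32_t a_low, uint32_t a_high,
                uint32_t b_low, uint32_t b_high,
                uint32_t *r_low, uint32_t *r_high )
    {
        uint32_t low = a_low + b_low;       /* may wrap around */
        uint32_t carry = (low < a_low);     /* wrapped => carry out */
        *r_low  = low;
        *r_high = a_high + b_high + carry;
    }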

    Multiplication is a bit trickier; luckily most CPUs provide a couple of instructions that multiply two 32-bit numbers together and give you either the low or the high word of the result. On x86 the MUL instruction even spans the 64-bit result over a register pair (EDX:EAX).

    For PowerPC 970 (the 32-bit G4 processor in Apple Macs) you do this:-

    A*B=R

    A=a_low+(2^32*a_high)
    B=b_low+(2^32*b_high)

    Now the result can be up to 128 bits long, or 4 words.

    r2=0; r3=0;
    r0=mul_low(a_low,b_low);
    r1=mul_low(a_low,b_high)+mul_low(a_high,b_low);
    r2+=0+carry;
    r1+=mul_high(a_low,b_low);
    r2+=0+carry;
    r2+=mul_high(a_low,b_high);
    r3+=0+carry;
    r2+=mul_high(a_high,b_low);
    r3+=0+carry;
    r2+=mul_low(a_high,b_high);
    r3+=0+carry;
    r3+=mul_high(a_high,b_high);

    There are more efficient ways to interleave the adds and muls using other registers but it depends on the processor used. It's easy on the G4 as there are 32 general purpose registers :-)
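
    In C you can fake mul_low/mul_high with a 64-bit product of two 32-bit words. A sketch of the same schoolbook multiply (my own code, types from stdint.h):-

    Code:
    #include <stdint.h>
    
    /* Schoolbook 64x64 -> 128-bit multiply with 32-bit limbs.
       r[0] is the least significant word. */
    void mul64x64( uint32_t a_low, uint32_t a_high,
                   uint32_t b_low, uint32_t b_high,
                   uint32_t r[4] )
    {
        uint64_t ll = (uint64_t)a_low  * b_low;
        uint64_t lh = (uint64_t)a_low  * b_high;
        uint64_t hl = (uint64_t)a_high * b_low;
        uint64_t hh = (uint64_t)a_high * b_high;
    
        /* word 1 plus the carries out of it */
        uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;
        /* word 2 plus the carries out of it */
        uint64_t w2  = (mid >> 32) + (lh >> 32) + (hl >> 32) + (uint32_t)hh;
    
        r[0] = (uint32_t)ll;
        r[1] = (uint32_t)mid;
        r[2] = (uint32_t)w2;
        r[3] = (uint32_t)((w2 >> 32) + (hh >> 32));
    }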

    As for division: it's tricky. I'm trying to work it out at the moment (I don't own the Knuth books but my colleague does :-)). Right now my divide is (a) slightly broken and (b) slower than GMP's mpn_tdiv_qr().

    Question 2:

    I can perform add, sub, mul on a 'digit-per-digit' base.

    Is there a way to do a div digit-per-digit, too?

    For div, i only know to do a loop with subs until i get number1:number2 = something (mod someotherthing) - that doesn't work as intended, of course, if number2 is 'too big'


    As noted above the quickest way is to perform it base 2^32 (i.e. one 32-bit word at a time).

    Don't go down the per-digit route, it's horribly slow.
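
    If the divisor fits in a single word, the base-2^32 long division is easy; it's the pencil-and-paper method with digits of size 2^32. A sketch (mine, not GMP's code - the general multi-word divisor case is Knuth's Algorithm D and is much hairier):-

    Code:
    #include <stdint.h>
    
    /* Divide an n-limb number (a[0] = least significant limb,
       base 2^32) by a single 32-bit word d. Quotient overwrites a,
       remainder is returned. */
    uint32_t div_limbs( uint32_t *a, int n, uint32_t d )
    {
        uint64_t rem = 0;
        int i;
        for( i = n-1; i >= 0; i-- ) {
            uint64_t cur = (rem << 32) | a[i];  /* bring down next limb */
            a[i] = (uint32_t)(cur / d);
            rem  = cur % d;
        }
        return (uint32_t)rem;
    }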

    Question 3:

    For the taxicab problem, i would need a _int128. Any chance to program it myself without using a BigInt lib or writing a LARGEINT class using char *? (My compiler doesn't have _int128, only _int64)

    (Note: BigInt refers to a lib downloaded from the internet, which i might not understand - and i would like to learn the 'stuff behind it' while LARGEINT refers to my own code using 'strings' to achieve arbitrary large numbers (>> int) - which is so slow)


    Here's what I'd do.

    Use GMP first, it's really easy. Make sure your algorithm is correct and produces valid results, test it, test some more.

    Find out where the code spends most of its time (with a profiling tool).

    Read up on the algorithms used to do each of these operations.
    Write them in assembly and test the hell out of them.

    My mul(64bit, 64bit) -> 128bit result code is 5 times faster than GMP's mpz_mul() and about 2 times faster than the underlying bare bones mpn_mul().

    However, my mod(128bit,64bit) -> 64bit code is slower than mpn_tdiv_qr() because I haven't worked out how to do divide properly ;-)

    Question 4:

    I have written my own LARGEINT class using 'strings'. While i might be able to actually check the largest known prime, i think i'm stuck at 2^32 anyway. Right or wrong? This question goes specifically to a C/C++ program using
    char *temp = new char[SIZE] for the number in question;


    The number 2^32 is only 10 digits long in decimal. You're confusing the value of a number with the amount of memory needed to store it.

    A recent prime found by SoB: 4847 * 2^3321063 + 1 is still only 999744 digits long in decimal. So you'd need under 1MB of RAM to store the decimal representation of this number.

    GMP stores the numbers in 'limbs' which are just "unsigned long int". So for a 32-bit machine they will be 32-bits. Each number is just an array of these limbs with a header containing size and sign data.

    GMP would only use ~400KB to store the number above as it is stored in binary.
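
    Roughly speaking the layout is something like this (field names made up for illustration - GMP's real struct is the mpz_t internals):-

    Code:
    #include <stdint.h>
    
    /* Sketch of an arbitrary precision integer: */
    typedef struct {
        int       sign;    /* -1, 0 or +1 */
        int       nlimbs;  /* number of limbs in use */
        uint32_t *limbs;   /* limbs[0] = least significant word */
    } bignum;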

    [ Sieve question will be answered in a separate post. ]

  4. #4
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    Originally posted by chris
    Question 5:

    SoB is calculating insanely big numbers. Some of these numbers are sieved.

    How is this done ? How so fast? What is 'sieve' exactly doing? Is it sieving every possible number starting with the lowest or just 'some' numbers?

    OK, we'll simplify the case to just 3 values of k.

    k=10223, 19249 and 21181.

    The sieve is checking a range of numbers to see if any of them are factors.

    So we look through from p_min to p_max, ignoring as many p values as we can that we know are composite. There's no point checking to see if p is a factor if it's composite, because we would have found its factors earlier when checking lower ranges of p; i.e. there's no point checking to see if 15 is a factor of any of the numbers because it's much easier to check to see if either 3 or 5 is a factor.

    For a trivial case we can just ignore any even values of p, since we know these
    aren't prime, and check every odd p value in the range. If you want to make a sieve very fast then you'll need to spend some time investigating algorithms to remove composite candidate p values as quickly as possible. There is a trade-off here: every composite p you test for factors is wasted work, but every moment spent weeding out composites is a moment not spent on the actual factor checks.

    So we have our value for p and want to see if it factors any of the numbers we are interested in.

    We could calculate k*2^n+1 for each k and n value we are interested in but:
    1) There are hundreds of thousands of them
    2) If each one is 400KB then that means we need many Terabytes of memory.
    3) Checking to see if p is a factor of a 500,000 digit number is very slow.

    So we take a different tack.

    Here's what we want (equals sign means equivalence):-

    k*2^n+1 = 0 (mod p)

    i.e. when k*2^n+1 is divided by p there is no remainder, i.e. p is a factor of this number.

    Wouldn't it be nice if we could just rearrange this equation and solve for n. Well, we can almost do that (don't worry, I'll explain these steps):-

    k*2^n+1 = 0 (mod p)
    k*2^n = -1 (mod p) /* subtract 1 from each side */
    2^n = -1/k (mod p) /* divide by k */
    2^n = -(1/k) (mod p) /* move minus sign outside brackets */

    First off, what does "= -1 (mod p)" mean? Surely anything (mod p) will be a number between 0 and p-1.

    Yes and no. I'm using = to mean the equivalence sign, not the equals sign.

    What I mean by this statement is that the thing on the left (k*2^n) and the thing on the right (-1) give the same residue mod p.

    If I use an equals sign then it should be : k*2^n = p-1 (mod p)

    But it's just easier to write it as -1 (mod p).

    Next bit. This is modular arithmetic and I've divided 1 by k. Surely that can't be right!

    It's not a simple integer divide, that's why.

    What I mean by (1/k) is a number that, when multiplied by k gives 1 (mod p).

    Taking a simple example:-

    5 * x = 1 (mod 7).

    I want a number x such that 5 * x = 1 (mod 7).

    5*0 = 0 (mod 7)....nope
    5*1 = 5 (mod 7)....nope
    5*2 = 10 = 3 (mod 7)....nope
    5*3 = 15 = 1 (mod 7)....yes!

    5 * 3 = 1 (mod 7).

    Therefore:-

    3 = 1/5 (mod 7)

    The term (1/k) (mod p) is sometimes written as k^-1 (mod p).

    Finding this value is known as finding the modular inverse. There is a function in GMP to do this called: mpz_invert( answer, value, modulus ).

    To find the value above I would do:-

    // mpz_init() stuff skipped....
    mpz_set_ui( value, 5 );
    mpz_set_ui( modulus, 7 );
    mpz_invert( answer, value, modulus );

    The value 3 would then be placed in the answer variable.

    Sometimes, however, there isn't an answer. Here's an example:-

    3 * x = 1 (mod 9)

    It should be obvious why there is no answer. Since 3 is a factor of 9, every multiple of 3 will be either 0, 3 or 6 (mod 9), never 1.

    In this case mpz_invert() returns 0, if it found an answer it would have returned 1.

    OK, where were we. Ah yes:-

    2^n = -(k^-1) (mod p)

    We can now compute (k^-1) (mod p) using the mpz_invert() function. We do this for all k values.

    Taking our k values:-

    k=10223, 19249 and 21181.

    and a p value of 100003 (which is prime):-

    inverse of 10223 (mod 100003) = 95288
    inverse of 19249 (mod 100003) = 3486
    inverse of 21181 (mod 100003) = 1813

    These are the values of k^-1 (mod p) but we need -(k^-1) (mod p). So we just subtract each of these values from p:-

    100003-95288 = 4715
    100003-3486 = 96517
    100003-1813 = 98190

    The next step was:-

    2^n = -(k^-1) (mod p)

    So we now need to solve this equation:-

    2^n = x (mod p)

    where x is any of these values:- {4715, 96517, 98190}.

    If we were dealing with normal maths, rather than modulo arithmetic we would have this equation:-

    2^n = x.

    To solve for n we would take the logarithm (base 2) of x to get the answer.

    2^n = x
    n = log_2 (x).

    But it's not that easy. Because we are dealing with modulo arithmetic we can't just apply a function to the right-hand side to get the answer.

    This problem is known as the Discrete Logarithm problem.

    Anyway, back to the problem at hand:-

    2^n = {4715, 96517, 98190} (mod p)

    This is why sieving multiple k values is faster than sieving individually. We now need to find a value for n such that (2^n) (mod p) is one of these values.

    We do the relatively easy work of computing -(k^-1) (mod p) for each k value and then we have the hard part of finding the values of n such that an answer is available.

    Solving 2^n = x (mod p) can be done in many ways.

    The simplest (and probably slowest) approach is the following small C program, checking those values against p=100003:-

    Code:
    #include <stdio.h>
    
    int main( void )
    {
        long p = 100003, n_max = 1000000;
        long n = 0;
        long r = 1; /* 2^0 = 1 */
        while( n < n_max ) {
             n++;
             r = r*2; /* Multiply by 2 */
             if( r >= p ) {
                  r -= p; /* Perform (mod p) if necessary */
             }
             if( r == 4715 ) {
                  printf( "%ld is a factor of 10223*2^%ld+1!\n", p, n );
             }
             if( r == 96517 ) {
                  printf( "%ld is a factor of 19249*2^%ld+1!\n", p, n );
             }
             if( r == 98190 ) {
                  printf( "%ld is a factor of 21181*2^%ld+1!\n", p, n );
             }
        }
        return 0;
    }
    Some important points to note:-

    1. The values of (k^-1) (mod p) will change as p changes. These values have to be computed every time.
    2. The algorithm above is very slow when we get to large n values.
    3. Obviously we could make it a bit faster by putting the required values in a hashtable. This makes this algorithm much faster, but we'll find out why the same trick isn't possible in the same way when using a different (and much faster) algorithm.

    The next lesson will demonstrate the Baby-Steps Giant-Steps algorithm. I need a smoke and a coffee :-)

    BTW, I'll tidy up all of this stuff and post it on the Wiki at some point in the future...
    Last edited by Greenbank; 10-31-2005 at 07:50 AM.

  5. #5
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    Baby-Steps Giant-Steps.

    The BSGS method is due to Shanks: http://en.wikipedia.org/wiki/Baby-step_giant-step (where I've stolen some of the text from!)

    This starts to get a bit harder to understand. Feel free to ask any questions after you've read this (and the link above).

    We are trying to solve this equation:-

    2^n = x (mod p)

    For multiple values of x, however we'll simplify it to just one value of x for now.

    Calculate: m as ceiling( sqrt( n_max ) )

    Rewriting n = im+j (where 0 <= i,j < m ) we get:-

    2^(im+j) = x (mod p)

    or:-

    2^(im) * 2^j = x (mod p)

    Dividing each side by 2^(im) we get:-

    2^j = x / 2^(im) (mod p)

    2^j = x * 2^(-im) (mod p)

    We now compute 2^j (mod p) for all values for j = 0 to m-1 and store these in a hash table along with the value of j.

    Compute: 2^(-m). This is the 'giant-step' that we'll need later on.

    We do this by computing 2^-1 in the same way we computed k^-1. (mpz_invert).

    We then raise 2^-1 to the m'th power to get 2^-m. (GMP: see the mpz_powm() function).

    current = x

    We now loop on i from 0 to (m-1).
    1. Check to see if current is present in the hash table.
    2. If so, then the answer n is (i*m)+j. (Remember we stored the value 2^j (mod p) in the hash table along with the value of j).
    3. If not, current = current * 2^(-m) (mod p)

    m is usually chosen to be ceiling(sqrt(n_max)) which would be the most efficient value.

    This is why it gets its name of 'Baby-Steps Giant-Steps'. We compute the baby-steps (the 2^j values for 0 to m-1). And then store these results. We then calculate the giant-steps and use the hash table to reduce the number of comparisons required.

    [ In reality, proth_sieve stores the giant-steps values in the hash_table, not the baby-steps. This requires a bit of extra work but it means that the giant-steps table can be shared for each of the x values we are looking for (corresponding to each k value). ]
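
    To make the algorithm concrete, here's a self-contained C sketch for small p (all function names made up for this sketch; a linear search stands in for the hash table, and since p is prime I compute the inverse with Fermat's little theorem instead of mpz_invert). It prints the n it finds, or -1 if there is no solution:-

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    
    static uint32_t mulmod( uint32_t a, uint32_t b, uint32_t p )
    {
        return (uint32_t)( ((uint64_t)a * b) % p );
    }
    
    static uint32_t powmod( uint32_t b, uint64_t e, uint32_t p )
    {
        uint32_t r = 1;
        while( e ) {
            if( e & 1 ) r = mulmod( r, b, p );
            b = mulmod( b, b, p );
            e >>= 1;
        }
        return r;
    }
    
    /* Solve 2^n = x (mod p) for 0 <= n < n_max, p an odd prime
       below 2^32. Returns n, or -1 if there is no solution. */
    long bsgs( uint32_t x, uint32_t p, uint64_t n_max )
    {
        uint64_t m = 1, i, j;
        while( m * m < n_max ) m++;        /* m = ceiling(sqrt(n_max)) */
    
        /* baby steps: 2^j (mod p) for j = 0 to m-1 */
        uint32_t *baby = malloc( m * sizeof(uint32_t) );
        uint32_t t = 1;
        for( j = 0; j < m; j++ ) { baby[j] = t; t = mulmod( t, 2, p ); }
    
        /* giant step: 2^(-m) = (2^m)^(p-2) (mod p) by Fermat */
        uint32_t giant = powmod( powmod( 2, m, p ), p-2, p );
    
        uint32_t curr = x;
        for( i = 0; i < m; i++ ) {
            for( j = 0; j < m; j++ ) {     /* a hash table in real life! */
                if( baby[j] == curr ) { free( baby ); return (long)(i*m + j); }
            }
            curr = mulmod( curr, giant, p );  /* multiply by 2^(-m) */
        }
        free( baby );
        return -1;
    }
    
    int main( void )
    {
        /* x = 4715 corresponds to k=10223 in the text */
        printf( "n = %ld\n", bsgs( 4715, 100003, 100003 ) );
        return 0;
    }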

    Using real life sieving as an example. n_max is 50,000,000.

    If we chose m=2 then the first stage (computing 2^j (mod p) for all values j=0 to m-1) would mean calculating just 2 values (2^0 and 2^1). But the second stage would require n_max/m mulmod() operations (25,000,000) in point 3 of the loop.

    We want to minimise the number of calculations performed by doing half in each stage. This represents an 'm' value of sqrt( n_max ). The ceiling() function is used to make sure we do enough stages to cover everything. If n_max was 10 then the integer portion of sqrt(10) would be 3. 3*3 = 9, so we wouldn't be checking n=10. If we use m=4 then (4*4 > 10). This extra effort is relatively tiny when we are using m values such as 7072 (for n_max = 50,000,000).

    The values computed are all dependent on the value of p so we can't reuse anything we've calculated here when we move on to the next p value.

    This is how BSGS works. However there are some extra tricks that can be used to rule out k values.

    Next topic: Quadratic Residues and Quadratic Reciprocity.

  6. #6
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    Quadratic Residues and Quadratic Reciprocity.

    Getting a bit shaky here as I'm doing most of this from memory!

    Congratulations if you understand this after the first read. It took me several re-reads and some test programs plus a few dumb questions before I finally grasped it.

    Reading links:

    Quadratic Residues: http://en.wikipedia.org/wiki/Quadratic_residues
    Quadratic Reciprocity (Legendre & Jacobi symbols): http://en.wikipedia.org/wiki/Quadratic_reciprocity
    Legendre Symbol: http://en.wikipedia.org/wiki/Legendre_symbol

    OK, going back a bit we are looking to solve the equation:-

    2^n = x (mod p).

    We can use Quadratic Residues to rule out some k values before we even get to the Baby-Steps Giant-Steps (BSGS) algorithm.

    Consider even values of n (I'll come back to the odd/even distinction later):-

    2^n = x (mod p)

    Because n is even we can rewrite this as:-

    2^(2m) = x (mod p)

    Which is the same as:-

    (2^m)*(2^m) = x (mod p)

    or:-

    (2^m)^2 = x (mod p)

    So the LHS of this equation is a square, or another way of putting it, the LHS is a Quadratic Residue (QR) (mod p).

    So if the equation has a solution (remember there may not be a solution at all) then the RHS must be a QR (mod p) too.

    In fact it simplifies such that we only have to check if -k is a QR mod p. So we can do this before we even compute -(k^-1) (mod p).

    (Note -(k^-1) is the same thing as (-k)^-1, and a value and its inverse are either both QRs or both non-residues mod p, so -(k^-1) is a QR exactly when -k is.)

    A number can be tested to see if it is a Quadratic Residue by using the Legendre symbol. (GMP has mpz_legendre() ).
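
    In GMP the whole even-n check boils down to a couple of calls (skip_even_n is my name for it, not a GMP function):-

    Code:
    #include <gmp.h>
    
    /* Returns 1 if k can be skipped for even n, i.e. if -k is not
       a QR (mod p), so (2^m)^2 = -(k^-1) (mod p) has no solution.
       p must be an odd prime. */
    int skip_even_n( unsigned long k, unsigned long p )
    {
        mpz_t minus_k, mp;
        int qr;
        mpz_init_set_ui( mp, p );
        mpz_init_set_ui( minus_k, k );
        mpz_neg( minus_k, minus_k );        /* -k */
        qr = mpz_legendre( minus_k, mp );   /* 1 = QR, -1 = non-residue */
        mpz_clear( minus_k );
        mpz_clear( mp );
        return (qr == -1);
    }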

    For odd n values we can do this:-

    2^n = x (mod p)

    2^(2m+1) = x (mod p)
    2^(2m) = x/2 (mod p)
    (2^m)^2 = x/2 (mod p)

    Although again there is a shortcut: -(2k)^-1 is a QR (mod p) exactly when -2k is, so we only need to check to see if -2k is a QR (mod p). Again this can be done before we even compute -(k^-1) (mod p).

    This can be used to eliminate the need for checking particular k values. On average there is a 50% chance that a k will be eliminated at this stage.

    Earlier I glossed over the odd/even n distinction. Here's why.

    Consider a value of k that is not divisible by 3. We'll use k=10223 (== 2 mod 3).

    10223*2^1+1 = 20447 == 2 (mod 3)
    10223*2^2+1 = 40893 == 0 (mod 3)
    10223*2^3+1 = 81785 == 2 (mod 3)
    10223*2^4+1 = 163569 == 0 (mod 3)

    So for k=10223 we know that every even value of n gives a number divisible by 3. So we only ever have to check odd values of n for k=10223.

    For k=19249 it's the other way round:-

    19249*2^1+1 = 38499 == 0 (mod 3)
    19249*2^2+1 = 76997 == 2 (mod 3)
    19249*2^3+1 = 153993 == 0 (mod 3)
    19249*2^4+1 = 307985 == 2 (mod 3)

    So for k=19249 we know that every odd value of n gives a number with a factor of 3. So we only ever have to check even values of n for k=19249.

    k values which are divisible by 3 present a minor problem.

    k*2^n+1 == ? (mod 3)

    rewrite k as 3m, since k is a multiple of 3:-

    3m*2^n+1 == ? (mod 3)
    3(m*2^n) + 1 == ? (mod 3)
    3*something + 1 == ? (mod 3)

    3 can never be a factor of k*2^n+1 if k is a multiple of 3. Whatever n we choose the remainder (mod 3) will always be 1.

    So k values which are a multiple of 3 are treated slightly differently. We separate out the odd/even n values so we can use the QR tricks on them.

    But it does mean that we can rule out one half of the possible n values before we start our big computations.
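
    Since 2 == -1 (mod 3), the whole mod 3 rule is one line of arithmetic: k*2^n+1 == k*(-1)^n + 1 (mod 3). As a sketch (my helper, just to spell out the rule):-

    Code:
    /* Which parity of n is killed by the factor 3, for a given k? */
    const char *parity_with_factor_3( unsigned long k )
    {
        switch( k % 3 ) {
        case 1:  return "odd n";  /* k*(-1)+1 == 0 (mod 3) */
        case 2:  return "even n"; /* k*(+1)+1 == 0 (mod 3) */
        default: return "none";   /* 3|k: 3 never divides k*2^n+1 */
        }
    }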

    Next Topic: Silver-Pohlig-Hellman

  7. #7
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    Silver-Pohlig-Hellman.

    Further Reading Links (and credits for work stolen :-):

    SPH (Wikipedia): http://en.wikipedia.org/wiki/Pohlig-Hellman_algorithm
    SPH: http://www-math.cudenver.edu/~wchero...10/phexam.html
    SPH: http://www.adeptscience.co.uk/produc...HellmanEx.html

    OK, SPH is an algorithm that can be used when p-1 is relatively smooth. By this I mean p-1 is composed of various smaller prime factors.

    What is smoothness?

    100003 is prime.
    100002 is 2 * 3 * 7 * 2381.

    100002 is 2381-smooth because the highest prime factor is 2381.

    Compare this to:

    100043 is prime.
    100042 is 2 * 50021.

    So 100042 is only 50021-smooth.

    SPH works best on numbers whose largest prime factor is quite low. This is because we have to factor p-1 completely in order to use the SPH algorithm. If the number is 1000-smooth then it means we only have to check for prime factors of p-1 up to 1000, instead of all the way up to 50000 as seen above.

    As p increases over our sieve range we have to spend longer and longer finding the factors of p-1.

    SPH works by splitting the problem up into smaller chunks based on the prime factors of p-1.

    I'm not going to explain it, or demonstrate it, but I will discuss its implications on sieving.

    At each stage of SPH the result is a relation for the final value of n.

    There are many reasons why these relations are useful.

    Suppose we get a relation after doing SPH(2) for some prime p with k=10223.

    The relation we get back may be:-

    n=0 mod 2.

    This means this value of p will only ever be a factor of 10223*2^n+1 where n is even.

    But we know that for k=10223 every even value of n gives a number divisible by 3. We only need one factor to rule out a k,n pair as being possibly prime, so we can stop all work on k=10223 and this particular possible factor p.

    If we get through the SPH stages for each factor of p-1 we end up with a set of relations.

    These relations can be combined using the Chinese-Remainder-Theorem to one relation.

    i.e. for k=19249 and p=914666389281937

    p-1 = 2^4 * 3^5 * 443 * 3499 * 151771

    We get this relation from the remaining n values in the SoB.dat file:-

    n = 2 mod 4

    SPH(2) wasn't able to give us a relation (it doesn't work all of the time).

    SPH gives us the following relations:-

    n = 158 mod 243
    n = 439 mod 443

    These relations can be combined using the Chinese-Remainder-Theorem to give us:-

    n = 358826 mod 430596.

    You'll notice that I didn't produce relations for the p-1 factors of 3499 and 151771. There's no point really. The relation above gives us enough information that we can quickly check for a factor.
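
    The CRT combination itself is only a few lines with GMP (crt_pair is a name I've made up for this sketch; chaining it twice over the three relations above reproduces n = 358826 mod 430596):-

    Code:
    #include <gmp.h>
    
    /* Combine n == a1 (mod m1) and n == a2 (mod m2), for coprime
       m1 and m2, into n == a (mod m) with m = m1*m2. */
    void crt_pair( mpz_t a, mpz_t m,
                   const mpz_t a1, const mpz_t m1,
                   const mpz_t a2, const mpz_t m2 )
    {
        mpz_t inv, t;
        mpz_init( inv );
        mpz_init( t );
    
        mpz_invert( inv, m1, m2 );   /* inv = m1^-1 (mod m2) */
        mpz_sub( t, a2, a1 );
        mpz_mul( t, t, inv );
        mpz_mod( t, t, m2 );         /* t = (a2-a1)*m1^-1 (mod m2) */
    
        mpz_mul( m, m1, m2 );        /* combined modulus */
        mpz_mul( a, t, m1 );
        mpz_add( a, a, a1 );         /* a = a1 + m1*t */
        mpz_mod( a, a, m );
    
        mpz_clear( inv );
        mpz_clear( t );
    }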

    So instead of checking all 50,000,000 values we have whittled this down to:-

    25,000,000 by only having to check even n values, thanks to 3 being a factor of the number for every odd n, and thanks to the QR check.
    12,500,000 by looking at the .dat file and seeing that only n = 2 mod 4 (1/4 of all values) are needed.

    Now SPH has whittled this down to 50,000,000 / 430596 = 116 possible n values.

    n = 358826, 358826+430596, 358826+(2*430596), etc...

    This is why proth_sieve sometimes outputs factors outside the proper range (to 50,000,000). Since we only have to check 116 values up to 50,000,000 we may as well check another 116 values, or 232 in total, which covers us all the way up to n=100,000,000.

    If SPH wasn't able to give us more information we'd just fall back on the BSGS algorithm and check all values.

  8. #8
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    OK, that covers most of it.

    There are some bits I haven't touched on:

    1. Further enhancements from Alex Kruppa and Bob Silverman on the mersenneforum.org:-

    "
    BTW, the fact that 2 may not be a generator of (Z/ZN)* should not be
    much of a problem. Choose a generator g, solve g^m==2 and g^n==k (the
    relations and matrix need to be done only once, and since 2 is trivially
    B-smooth, solving g^m==2 should be very easy). Then express (g^m)^(n/m)
    == 2^(n/m) == k, where the division n/m should be performed modulo p-1,
    i.e. by a modular inversion of m (mod p-1). This will only work if the
    denominator of the cancelled fraction n/m has no factor in common with
    p-1, which once again amounts to the condition that 2 must not be a
    power in Z/Zp that k is not.
    "

    2. It's not necessary to compute every value in the baby-steps portion of the algorithm.

    After inspecting the .dat file we find a suitable giant-step size such that the number of different residues (mod giant_step_size) is as low as possible.

    For the current SoB.dat file proth_sieve uses a giant-step size of 15840.

    All of the n values in the .dat file then fall into one of 2690 different residues mod 15840.

    So we only need to calculate these 2690 values for the baby-steps stage. We're not interested in factors for any numbers other than these.

    2690/15840 is just under 17%, so we are saving ourselves 83% of the work.

  9. #9

    Re: Re: Math in programs - i would like to learn

    Originally posted by Greenbank
    For PowerPC 970 (the 32-bit G4 processor in Apple Macs)
    Minor correction, the 970 is a 64-bit CPU, aka the G5. The G3 is the 750 and the G4 was 7400/7450. There are others, but these are the main ones. The G3 and G4 were 32-bit only.

  10. #10
    Greenbank, thank you very much for giving me so much input. This is far more than i could have ever hoped for. This is just fantastic

    It will take some time to read and understand all of it.

    GMP is far too advanced a starting point for me (and i'm unfortunately using the wrong operating system), but GMP has a link to NTL, which seems to be an easier learning choice (and NTL has a free book about this stuff).

    Again, thank you very much for your help !!



    Update:

    Oh boy, what a day ...

    Originally posted by Greenbank


    Take a look at GMP (http://www.swox.com/gmp). It's an arbitrary precision math library which is easy to use from C or C++.


    Well, after i wrote this post, i decided to give it a try. I realized from your answers that my LARGEINT library, where i use 'strings' to do the math on a digit per digit level, is so bad i should really use GMP. Initially, i tried to avoid this because i feared 'the pain' of learning to use it. (that's why i keep delaying migrating to Linux, too)

    So i downloaded GMP and started looking for the library. I might have overlooked it, because it unzipped to 87 directories and i simply didn't find it.

    Then i tried to compile it myself. How, i wondered, do i do this in my old M$ Visual C and which of the 1000s of files do i need?

    After a while, i downloaded MinGW to have gcc on my machine. After another while, i downloaded MSYS to have a Unix-like environment. Then i watched in amazement how this thing compiled for more than half an hour on my 2.6 GHz computer.

    Finally, the library was there.

    So i started VC and wrote a simple test program and linked the library.

    You can already guess it, it didn't work.

    Then i downloaded Codeblocks to have an IDE for the gcc ...

    Now the test program shows that 5+7=12.

    Besides that, when i write a post, the two buttons below the editor window say 'Post Reply' and 'Preview Reply'. As i edited this post, i'm now typing it all again.

    For those who don't know, if you edit a post, the buttons say 'Save Changes' and 'Reset Message'.



    I have read all your posts and understood everything until BSGS, which i will reread soon.

    The GMP manual indeed seems to answer the question about the div algorithm, still have to read that.
    Last edited by chris; 10-31-2005 at 08:36 PM.

  11. #11
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    Rogue,

    Yes, sorry, I mixed up the G5 and G4 processors. I'm currently lusting over the new Quad PowerMac G5. 10GHz of PPC 970 64-bit goodness. I reckon my sieve code will run at a minimum of 4000kp/sec on that machine. That's the reason why I'm learning PPC assembly (and starting off on my Mac Mini with its PPC 7447A G4).

    Chris,

    GMP is very easy really. Here's the proth_test in GMP:-

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <gmp.h>
    
    int main(int argc, char **argv)
    {
            int intn, intk, inta;
            mpz_t N,a,e,r;
    
            if( argc != 4 ) {
                    printf( "%s k n a\n", argv[0] );
                    return(0);
            }
            intk=atoi(argv[1]);
            intn=atoi(argv[2]);
            inta=atoi(argv[3]);
            mpz_init( N );
            mpz_init( a );
            mpz_init( e );
            mpz_init( r );
    
            mpz_set_si( N, intk );
            mpz_mul_2exp( N, N, intn );     /* N = k*2^n */
    
            mpz_set_si( a, inta );
    
            mpz_fdiv_q_2exp( e, N, 1 );     /* e = k*2^(n-1), i.e. (N-1)/2 after the +1 below */
    
            mpz_add_ui( N, N, 1);           /* N = k*2^n+1 */
    
            mpz_powm( r, a, e, N );         /* Proth test: is a^((N-1)/2) == -1 (mod N)? */
            mpz_sub( r, r, N );
            mpz_add_ui( r, r, 1);           /* r == 0 iff the residue was N-1, i.e. -1 */
            if( mpz_cmp_si( r, 0 ) == 0 ) {
                    printf( "PRIME k=%d n=%d a=%d\n", intk, intn, inta );
            } else {
                    gmp_printf( "NOTPRIME k=%d n=%d a=%d r=%Zd\n", intk, intn, inta, r );
            }
            return(0);
    }
    It is very slow though, it takes 8 seconds when n is up at 12000. But it was useful for proving primality for the other k candidates below 78557.

    The real proth test used by SoB has many parts written in assembly and uses an FFT for multiplication. Way beyond my understanding (at the moment!).

    I'm trying to work out how to do divide myself. I had a quick read of Knuth but I just wasn't understanding it.

    I've got some questions myself on divide so I'll start a new thread.

  12. #12
    Hi Greenbank,

    as a silent reader of this thread i wanted to congratulate you on your very clear description of sieving. Now i also have something to think about.
    I have not read the BSGS part yet, but the rest is described much more clearly than in the articles i found before.

    Thanks again,

    Lars
    www.Rechenkraft.net - most comprehensive German website about distributed computing projects

  13. #13
    Originally posted by Greenbank
    Rogue,

    Yes, sorry, I mixed up the G5 and G4 processors. I'm currently lusting over the new Quad PowerMac G5. 10GHz of PPC 970 64-bit goodness. I reckon my sieve code will run at a minimum of 4000kp/sec on that machine. That's the reason why I'm learning PPC assembly (and starting off on my Mac Mini with its PPC 7447A G4).
    I have a DP G5 at 2.5 GHz, although it is not dual-core. If you want me to compile and test on the G5, let me know. I also know some PPC asm and am familiar with Shark (a profiling tool), so I might be of assistance to you with respect to optimizations on the G3/G4/G5.

  14. #14
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    rogue,

    Excellent, I'll keep that in mind. I've managed to convince the gf that I can have one more computer at home (the only other one is my Mac Mini which is hidden in a cupboard) and so I guess I'll reward myself with a brand new Quad PowerMac.

    They are still 3-4 weeks away from shipping in the UK and I'm going to be off on holiday in South Africa then but I'm already itching to get my hands on one.

    I've only written some of the 32-bit G4 assembly so far, and not that much of it. I'm trying to get a 128bit/64-bit mod (or a 64*64 % 64 mulmod) as quick as I can. I can thrash GMP's mpz_mul() but I'm still getting beaten by mpz_mod().

  15. #15
    Originally posted by Greenbank
    rogue,

    Excellent, I'll keep that in mind. I've managed to convince the gf that I can have one more computer at home (the only other one is my Mac Mini which is hidden in a cupboard) and so I guess I'll reward myself with a brand new Quad PowerMac.

    They are still 3-4 weeks away from shipping in the UK and I'm going to be off on holiday in South Africa then but I'm already itching to get my hands on one.

    I've only written some of the 32-bit G4 assembly so far, and not that much of it. I'm trying to get a 128bit/64-bit mod (or a 64*64 % 64 mulmod) as quick as I can. I can thrash GMP's mpz_mul() but I'm still getting beaten by mpz_mod().
    I don't know how familiar you are with PPC vs. x86, but on PPC you have many additional registers. If you are not doing a divide (or mod), then you should be able to do four multiplies per six cycles (on the G5, assuming there are no dependencies). By taking advantage of this, the G5 version of MultiSieve is twice the speed of the x86 version. BTW, if the divisor does not exceed 52 bits, then you can use Montgomery multiplication to help you do the mod without using a divide, but I assume you already know that.

  16. #16
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    Originally posted by rogue
    I don't know how familiar you are with PPC vs. x86, but on PPC you have many additional registers. If you are not doing a divide (or mod), then you should be able to do four multiplies per six cycles (on the G5, assuming there are no dependencies). By taking advantage of this, the G5 version of MultiSieve is twice the speed of the x86 version. BTW, if the divisor does not exceed 52 bits, then you can use Montgomery multiplication to help you do the mod without using a divide, but I assume you already know that.
    I did the bulk of my assembly on the Z80 and 6502.

    More recently it was the ARM processor in the Acorn Archimedes computers. That was my first sight of proper RISC. 16 registers, compound instructions that even allowed conditionals on each and every opcode.

    i.e. say you wanted to subtract r8 from r9 only if r9 was greater than or equal to r8 (one of the fundamental operations in a shift/subtract mul implementation).

    PPC still requires a branch (or at least I can't think of a way not to have a branch):-

    cmpw r9,r8 // compare them
    blt no_sub // if r9 is less than r8 then we don't want to subtract
    subf r9,r8,r9 // r9 = r9 - r8 (subf rd,ra,rb computes rb-ra)
    no_sub:

    ARM RISC allows you to apply the 'lt' condition to any opcode, so it would become (a mixture of PPC opcodes and ARM conditionals):-

    cmpws r9,r8 // compare them, [s] on the end affects status registers
    subflt r9,r8,r9 // if less than then subtract

    Much cleaner, no branch required, pipeline kept clean. If the condition is not met then the instruction just acts as a NOP.

    I've done some x86 assembly (mainly 80386 stuff, nothing of the newer SIMD/MMX/3DNow! stuff). That's probably as good as my sparc assembly.

    I don't know if the G4 (PPC 7447A?) can handle concurrent muls. The 32 registers available make it easy to do the 4 mullw and 4 mulhw ops with no overlaps and no dependencies.

    As for Montgomery multiplication, no is the simple answer. I'd seen it in the BN library as part of OpenSSL but I've never found a good explanation of it that I could follow. Most of the google responses refer to it but don't explain it. Plus I'd completely forgotten about it as a possibility. Got any references I can follow up?

    [EDIT - Google groups, however, has given me some stuff to read up on. ]

    [EDIT 2 - And one of the links was to the Handbook of Applied Cryptography (HAC). Why hadn't I ever seen this before, it's perfect! http://www.cacr.math.uwaterloo.ca/hac ]

    The 52 bit limit makes me think of the FPU mod hack; I thought that Montgomery multiplication used modular inverses and not fp reciprocals?

  17. #17
    Originally posted by Greenbank
    PPC still requires a branch (or at least I can't think of a way not to have a branch):-

    cmpw r9,r8 // compare them
    blt no_sub // if r9 is less than r8 then we don't want to subtract
    subf r9,r8,r9 // r9 = r9 - r8 (subf rd,ra,rb computes rb-ra)
    no_sub:

    ARM RISC allows you to apply the 'lt' condition to any opcode, so it would become (a mixture of PPC opcodes and ARM conditionals):-

    cmpws r9,r8 // compare them, [s] on the end affects status registers
    subflt r9,r8,r9 // if less than then subtract

    Much cleaner, no branch required, pipeline kept clean. If the condition is not met then the instruction just acts as a NOP.
    I don't think there is anything like that in the integer unit, but there is one in the FPU. I believe it is the fsel instruction, but I don't remember for certain.

    Originally posted by Greenbank
    I don't know if the G4 (PPC 9447A?) can handle concurrent muls. The 32 registers available make it easy to do the 4 mullw and 4 mulhw ops with no overlaps and no dependencies.
    I believe that the G4 can do two at a time, but it might be four. Even if it can only do two, doing four won't hurt.

    For the 52-bit mulmod, check out this link on how to do it without using a divide:
    http://groups.yahoo.com/group/primenumbers/message/9636

  18. #18
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    Yup, I've seen Phil post that before here:-

    http://groups.google.com/group/sci.c...s&rnum=5&hl=en

    It is useful for the giant-step portion of the BSGS algorithm which is (almost directly lifted from proth_sieve.cc):-

    Code:
    exp = 1;
    j = 0;
    curr = p-1;
    for( ; ; ) {
        store_hash( curr, exp++ );
        j += grange;
        if( j > nhigh ) break;
        curr = curr * gstep (mod p);   /* one giant step */
    }
    grange is the size of the giant step. For current SoB .dat this is 15840.
    gstep = 2^(-grange) (mod p)

    So we are multiplying and modding by the same values, over 3000 times in a row, with the relatively simple store_hash() code each time.

    I plan to use the quickest mulmod available (be it with the FPU, classic shift/subtract mod, Montgomery or Barrett) and code this entire loop in G4 assembly. Once I've done it in G4 32-bit assembly the transition to G5 64-bit will be easy (although I'll go back and check to see which mulmod code is fastest!).
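
    For what it's worth, on a 64-bit compiler with a 128-bit type (gcc's unsigned __int128 - no help on the 32-bit G4, of course) the reference mulmod is a one-liner, handy as a correctness baseline to test the assembly versions against:-

    Code:
    #include <stdint.h>
    
    /* (a*b) mod p without overflow, for any 64-bit a, b and p > 0. */
    static inline uint64_t mulmod64( uint64_t a, uint64_t b, uint64_t p )
    {
        return (uint64_t)( ((unsigned __int128)a * b) % p );
    }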

    There are several other loops within the code (baby-steps, SPH routines) that could be improved with hand crafted assembly.

    However I want the code to be fast over the 2^52 limit although I really don't mind having various routines that work at different bit sizes, i.e. < 50 bits, 51 and 52 bits, 53 to 64 bit.

    I'm not worried about > 64bit. If it's taken us 2 years to get to 2^50 then, at the current rate with no increase in CPU power, efficiency or whatever that gives us 8192 years until we have to think about >64 bit. :-)

  19. #19
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    Couple of other things.

    By writing separate routines for different bit sizes I can get as much optimisation in there as possible.

    It does mean huge amounts of duplication of code but it means that I can make useful assumptions that lead to optimisations.

    For instance, if p is 50 bits then the product of a * b will be a maximum of 100 bits. On a 32-bit machine this means the most significant bits would be in the lower 4 bits of the 4th word. That means I can skip the top 28 bits in the top word. Although this would be redundant if I used montgomery or some other routine it can be useful.

    The reason for writing the whole loops in assembly (again, lots of duplication would be needed) is that it minimises the saving/restoration of clobbered registers.

    Looking at the gcc -S (v 4.0.0 on Darwin) output for my mul_64_64_asm() routine it has to store the values of all of the registers I've marked as clobbered (this is the hairy AT&T style syntax) and restore them after my code.

    For the 30 instruction mul there are 15 instructions to save/restore registers. (I know this isn't the same thing as a true cycle count but these are a nightmare to calculate with pipelining and concurrent mul instructions).

    I'd rather run roughshod over all 32 registers (plus some fp regs if this is the way to go), doing all 3000 mulmods() and storing the results in the hash rather than incur these penalties each iteration. The save/restore would then be ~60 instructions but I don't really care when we're talking about 300,000 instructions in between.

    Here's my current 64 bit x 64 bit mul:-

    Code:
            __asm__ (
    
                    /* r[0]=mullw(al,bl) */
                    "mullw r20,%0,%2\n\t"
                    "stw r20,0(%4)\n\t"
    
                    /* r[1]=mulhw(al,bl) */
                    "mulhwu r2,%0,%2\n\t"
                    /* t[1]+=mullw( al, bh ) */
                    "mullw r5,%0,%3\n\t"
                    /* t[1]+=mullw( ah, bl ); */
                    "mullw r6,%1,%2\n\t"
    
                    /* t[2]+=mullw( ah, bh ); */
                    "mullw r7,%1,%3\n\t"
                    /* t[2]+=mulhw( al, bh ); */
                    "mulhwu r8,%0,%3\n\t"
                    /* t[2]+=mulhw( ah, bl ); */
                    "mulhwu r9,%1,%2\n\t"
    
                    /* t[3]+=mulhw( ah, bh ); */
                    "mulhwu r10,%1,%3\n\t"
    
                    "li r11, 0\n\t"
                    "li r12, 0\n\t"
    
                    /* r[1]=r2+r5+r6 */
                    "addco r20,r2,r5\n\t"
                    "addeo r11, r11, r12\n\t"
                    "addco r20,r20,r6\n\t"
                    "addeo r11, r11, r12\n\t"
                    "stw r20,4(%4)\n\t"
    
                    /* overflow in r11 */
                    /* r[2]=r7+r11 + r8+r9 */
                    "addco r20,r11,r7\n\t"
                    "addeo r10, r10, r12\n\t"
                    "addco r11,r8,r9\n\t"
                    "addeo r10, r10, r12\n\t"
                    "addco r7,r20,r11\n\t"
                    "addeo r10, r10, r12\n\t"
                    "stw r7,8(%4)\n\t"
    
                    /* r[3]=r10 */
                    "stw r10,12(%4)\n\t"
    
                    : /* no outputs */
                    : "b"(al), "b"(ah), "b"(bl), "b"(bh), "b"(r)
                    : "r20","r2", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12"
            );
    al,ah are the low and high 32-bit words of a.
    bl,bh are the low and high 32-bit words of b.
    r is a 4 word array for the result.

    This is nowhere near the finished article. If mul's can run concurrently then I'll move the first 'stw' until after the last 'mulhwu' so that all 8 mul's are one after another.

    I do like RISC. Having lots of registers rocks!

  20. #20
    Have you considered putting the asm into a .s file instead of using asm blocks? This will not hurt performance, but will improve readability.

  21. #21
    Former QueueMaster Ken_g6[TA]'s Avatar
    Join Date
    Nov 2002
    Location
    Colorado
    Posts
    184
    In the past, I've tried to write sieving programs for k*2^n, without much speed success, and now I see why. I'm understanding most everything but the BSGS algorithm right now. I haven't looked at the Wikipedia page yet, but I don't see what the keys for the hashtable are.

    Anyway, for 64x64 multiplication, take a look at this post I made on MersenneForum. It shows how to do it with only 3 multiplies, if multiplies are expensive on RISC.
    Proud member of the friendliest team around, Team Anandtech!
    The Queue is dead! (Or not needed.) Long Live George Woltman!

  22. #22
    Former QueueMaster Ken_g6[TA]'s Avatar
    Join Date
    Nov 2002
    Location
    Colorado
    Posts
    184
    I've been going over the BSGS, and other algorithms, and I think I see a few improvements to BSGS.

    First, m=sqrt(n_max) is not sacrosanct. It's just a product of minimizing calculations=i+j, for i*j=n_max. So it occurs to me that one should time each part of the algorithm separately. If each baby-step of j takes B nanoseconds, and each giant-step of i takes A ns, then we need to minimize time=Ai+Bj, for i*j=n_max

    j = n_max/i
    time = Ai+B*n_max/i
    time'= A-B*n_max/i^2 = 0 finds the minimum
    i = sqrt(B*n_max/A)

    So m = j = n_max/i = ceiling(sqrt(A*n_max/B))

    Second, quadratic residues effectively halve the time B for baby-steps. Recalculate m accordingly. I wonder if cubic residues can be efficiently computed? If so, I think that could remove another 1/3 of n's.

    Third, you say you chose m so that the number of baby-steps that had to be computed was reduced. I'm not sure how the number of baby-steps relates to m, but if that m was halved, even if the baby-steps doubled, you would save a few thousand calculations.

    Finally, it occurs to me that we have already proven most of the smaller n's composite with two PRP tests. So I assume we don't need to factor below some n around 3,000,000. Then taking your equation from above:

    2^j = x * 2^(-im) (mod p)

    Let's rearrange this so it looks better. We're trying to solve:

    x * 2^(-im) = 2^j (mod p)

    where x, m, and p are known, and i and j are variable.
    But say we've already primality tested up to some n0. Then we can write this as:

    n = im+j+n0

    x * 2^(-n0) * 2^(-im) = 2^j (mod p)

    and set m = ceiling(sqrt(A*(n_max-n0)/B)) for optimal performance.
    Proud member of the friendliest team around, Team Anandtech!
    The Queue is dead! (Or not needed.) Long Live George Woltman!

  23. #23
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    3 multiplies, not 4...

    G4 is 32-bit, so I'm stuck with 8 at the moment.

    When I get on to 64-bit G5 I'll have the benefit of 64-bit * 64-bit high/low mul instructions.

    Ta for the link though.

    Originally posted by Ken_g6[TA]
    I've been going over the BSGS, and other algorithms, and I think I see a few improvements to BSGS.

    First, m=sqrt(n_max) is not sacrosanct. It's just a product of minimizing calculations=i+j, for i*j=n_max. So it occurs to me that one should time each part of the algorithm separately. If each baby-step of j takes B nanoseconds, and each giant-step of i takes A ns, then we need to minimize time=Ai+Bj, for i*j=n_max

    j = n_max/i
    time = Ai+B*n_max/i
    time'= A-B*n_max/i^2 = 0 finds the minimum
    i = sqrt(B*n_max/A)

    So m = j = n_max/i = ceiling(sqrt(A*n_max/B))

    Second, quadratic residues effectively halve the time B for baby-steps. Recalculate m accordingly. I wonder if cubic residues can be efficiently computed? If so, I think that could remove another 1/3 of n's.


    No idea about Cubic residues, it's just that QR are easy to compute via Jacobi/Legendre symbols (which are easy to implement in C).

    Third, you say you chose m so that the number of baby-steps that had to be computed was reduced. I'm not sure how the number of baby-steps relates to m, but if that m was halved, even if the baby-steps doubled, you would save a few thousand calculations.


    m is usually set to 'grange' which is computed by proth_sieve. It is a multiple of the LCM of the .dat file residue relations (see below).

    For current SoB.dat grange is 15840, and with n_high=50M that means there will be 3156 giant steps.

    And there are a few extra things at play here.

    There are multiple k values (if the siever were able to combine SoB, Riesel, PSP and 321 projects we'd have > 100 k's).

    The Baby-Steps portion of BSGS must be run once for each k value that is still in play (i.e. it hasn't been removed via QR).

    Each k value also comes with a relation which is obtained from looking at the remaining n values within the .dat file.

    For k=21181 in the current SoB.dat file we know that n=20 mod 24.

    So in the baby-steps portion, for this k, we loop from 20 up to m-1 in steps of 24.

    So there will be roughly 15840/24 = 660 baby-step values we need to check.

    If, on average, there are 5 k's that make it past the QR checks then we will be doing 5 * 660 baby-steps, which is 3300 baby steps. Roughly equivalent to the number of giant steps.

    However, the .dat residues vary. The k's that get past QR vary, etc, etc. Coming up with a perfect formula is impossible, and I don't know how much time would be taken up computing the best option instead of just making a reasonably good effort without testing it first.

    Finally, it occurs to me that we have already proven most of the smaller n's composite with two PRP tests. So I assume we don't need to factor below some n around 3,000,000. Then taking your equation from above:

    2^j = x * 2^(-im) (mod p)

    Let's rearrange this so it looks better. We're trying to solve:

    x * 2^(-im) = 2^j (mod p)

    where x, m, and p are known, and i and j are variable.
    But say we've already primality tested up to some n0. Then we can write this as:

    n = im+j+n0

    x * 2^(-n0) * 2^(-im) = 2^j (mod p)

    and set m = ceiling(sqrt(A*(n_max-n0)/B)) for optimal performance.


    I've looked into this, it doesn't provide that much of a performance increase and, personally, I'd much prefer to get factors for k,n pairs rather than matching PRP residues.

    It may be of help if we get to the point where we have to start sieving up to n=100M and we've sieved all p < 2^50 to n=50M. But predicting the future ain't easy.

  24. #24
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    Having a deeper wander through the proth_sieve source I found that it does calculate 2^(min_n) (mod p) and work from there if the minimum required n is greater than 500,000.

    However it does come with a comment from Mikael of "isn't worthwhile below this limit (500,000) (not really tested though...)".

    I'll add it on to my list of things to investigate although this is a very long list! :-)

  25. #25
    Former QueueMaster Ken_g6[TA]'s Avatar
    Join Date
    Nov 2002
    Location
    Colorado
    Posts
    184
    Originally posted by Greenbank
    3 multiplies, not 4...

    G4 is 32-bit, so I'm stuck with 8 at the moment.

    When I get on to 64-bit G5 I'll have the benefit of 64-bit * 64-bit high/low mul instructions.

    Ta for the link though.
    I was just going over this again (which means you've created a very nice reference for Proth sieving math, thank you! ). Anyway, about the 3 instead of 4 multiplies thing, it's recursive. This PDF covers it starting on page 13, and I gather it's covered in one of Knuth's books, but I got it from Sedgewick's Algorithms, pages 527-528.

    Anyway, I didn't mention the problem with it, which is that given a number split into chunks of n bits, you have to multiply together two numbers of size n+1. I haven't figured out a way out of it, other than bit shifting the entire number multiple times.
    Proud member of the friendliest team around, Team Anandtech!
    The Queue is dead! (Or not needed.) Long Live George Woltman!
