Results 1 to 21 of 21

Thread: FPGA siever?

  1. #1
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271

    FPGA siever?

    Has anyone given any thought to an FPGA siver for SoB?

    For anyone who doesn't know, an FPGA is a Field Programmable Gate Array. You design a circuit in a procedural language and it is implemented via circuit-logic.

    Traditionally they are no good at mathematical operations (division, exponentiation, multiplication by non-powers of 2) but I think there may be a way to implement a sieve for proth numbers relatively easily.

    I've got an FPGA development kit (costs around $120 - Spartan-3 from Xilinx) and I'm willing to have a go but I've never done anything near as complicated as this with it.

    A naive sieve, that would be difficult to implement in an FPGA, would be to calculate N=(k*2^n)+1 and use trial division by increasing p's, however this just wouldn't fly.

    In a seperate project I'm trying to implement a reasonably fast pure C (i.e. portable to non-x86 architectures) siever and I'm using the principle of the cycle generated by computing the remainder for n=1 for a given k and p and then iterating this function.

    Given r=(k*2^n)+1 mod p

    The remainder for n+1 would be:-

    (2*(r-1))+1 mod p

    or 2r-1 mod p.

    If r is ever equal to 1 then it will remain at 1 for ever more. You would not bother computing it any further. p will never factor k*(2^n)+1 whatever n you use.

    If r is ever equal to 0 then you have a factor. (You can also just set r=p-1 for the next iteration).

    The algorithm looks like this:-

    Code:
    Consider a given k and p.
    Compute initial_r=k*2+1 mod p. (This is the remainder for n=1)
    r = initial_r; /* Thanks for spotting this missing line Joe O */
    if( r == 1 ) {
       stop. p will never be a factor for any n.
    }
    set n=1
    while( n <= upper_bound_n)
       if( r == 0 ) {
            factor found at n. record f1=n. stop.
       }
       r=r<<1;  /* left shift r by one bit == multiply by 2 */
       r--; /* decrement r by 1 */
       if( r >= p ) {
           r=r-p; /* simple subtraction is quicker than modulus */
       }
       n++;
       if( r == initial_r ) {
           stop. in a non-factoring loop of period n-1
       }
    }
    (You don't have to stop when r=0, you can continue by setting it to p-1 and waiting for it to reach r=initial_r. When it does you know that the remainder runs in a cycle of period=n-1. This also tells you that p factors k*2^n+1 for n = f1, f1+period, f1+2*period, f1+3*period, etc...

    As far as an FPGA goes, the code in the while loop is relatively easy to do. It only involves comparisons, a left shift, decrement by one (r is guaranteed to be >1 so this is simple), a 'complex' subtraction (FPGA's have ALU's to support this) and increment by one.

    My FPGA has 200,000 gates, which gives 4320 logic cells equivalents.

    As a wild guess, I think the design would fit into ~100 logic cells, which would mean roughly 40 concurrent tests could be performed at the same time.

    The kit I have would interface to a controlling PC via Parallel port direct to the I/O pins, or via the onboard RS-232 Serial port.

    The controlling computer would do the initial maths that FPGA's aren't good at:-
    - pick k value
    - pick p value
    - compute intial_r=k*2+1 mod p
    - send these to a piece of code on the board that listens on the port and stores the data in the 1M-byte of onboard memory.

    The individual sievers speak to this controller code and ask for a chunk of work. The work is marked as given out and off the siever goes.

    When it finishes it tells the controller code the result and it stores this in the memory. This can be retreived by the controlling computer via the serial/parallel port.
    The controlling computer would simply be sending it work and grabbing back the results.

    What you have to get your head around when dealing with FPGAs is that everything can happen at once, it's not a processor following instructions. For example, the 3 steps involving r (left shift, decrement and comparison) will have to executed one after each other, but at the same time it can be decrementing n.

    Now imagine 40 sievers on the same board, all running at the same time. For a bit more money you can get larger and larger FPGAs.

    Also the on board control code (which handles the interface between controlling computer and memory, and sievers and memory) does not take up any 'time' of the FPGA's processing, only chip 'real-estate' (i.e. gates, logic cells).

    The board is only 50MHz, but with 40 sievers you're getting the same performance as a 2GHz single instance siever on the same chip. The actual level of performance is completely unknown until I implement it.

    If you have a look at some of the other things that FGPA's are good at (encryption) you'll see that $1000 spent on FPGA's performs hundreds, if not thousands, of times faster than $1000 of PC.

    The recent FPGA hack on the encryption used on RFIDs was a good example. http://rfidanalysis.org/. Cracking one key would have taken 2 weeks using 10 very fast PCs. A single $200 FPGA was able to do the same work in just over 10 hours. With 16 FPGAs (cost ~ $3500) they cracked 5 unique keys in 2 hours (compared to 10 weeks for 10 very fast PCs).

    The board also has a few other nifty features (push button switches, 8 LEDs, 4 7 segment displays) and a VGA-port . Not sure what I'd do with that as the VGA controller would take up lots of space on the FPGA but it could be amusing. At least a basic screen using blocks of colour to indicate progress/status.

    I will let you know of the progress although my time is a bit limited in the next few months due to fact I have to go watch us England get beaten by the Australians at cricket over and over again (but you can't beat a whole weekend drinking in the sun).

    Anyway, if anyone has any FPGA experience or would like to comment/suggest/advise please let me know.
    Last edited by Greenbank; 06-30-2005 at 06:21 AM.

  2. #2
    Certainly looks like a cool thing to do. I have fiddled around with FPGA board but nothing of this scale. I would really like to know how did your experimentation go with the board if you do something.

    Another place where FPGA's are beating everybody is chess. The Hydra just beat Adams 5.5-0.5 in a 6 game match. And Adams is supposed to be one of the best players against computers.
    http://tournament.hydrachess.com/ahydra.php

    jaat

  3. #3
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    I posted a few questions (and then a long detailed explanation of Proth Sieving) to the comp.arch.fpga newsgroup:-

    http://groups-beta.google.com/group/...92aced4c599b2a

    All responses seem very positive, including the fact that it's a good project for an FPGA novice.

    They've also told me that there's no reason that it should be limited to 50MHz, each
    siever could run much faster than this (300MHz possibly).

    I'll be having an initial go at it this evening. The FPGA designing software is mind-bogglingly complex but it comes with a modelling program which is much more friendly. The 'program' to implement a single siever should only be 30 lines or so and will work for any k,p value I want (I'll just have to hard-code them at first).

  4. #4
    Moderator Joe O's Avatar
    Join Date
    Jul 2002
    Location
    West Milford, NJ
    Posts
    643
    Shouldn't this test
    Code:
    if( r == 1 ) {
       stop. p will never be a factor for any n.
    }
    be in the while loop as well?

    or are you saying that if initial_r != 1 then r will never be 1?

    and shouldn't there be an r=initial_r before the first test for r?
    Joe O

  5. #5
    forgive my igorance but....

    is not a number of 2^10,000,000 (ie about current prp tests) 10,000,000 bits long?

    10,000,000 bits = 1,200,000 bytes or 1.2MB

    so my calclation says your going to not have enough room on your 1MB memory.

    or have i missed something here?

  6. #6
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    Joe_O wrote: "or are you saying that if initial_r != 1 then r will never be 1?

    and shouldn't there be an r=initial_r before the first test for r?"


    Both.

    If initial_r != 1 then r will never reach 1. No matter what n you go to.

    And yes, it should say r=initial_r before the first test for r. Not sure if I can edit that post and fix it.

  7. #7
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    ShoeLace wrote:
    "forgive my igorance but....

    is not a number of 2^10,000,000 (ie about current prp tests) 10,000,000 bits long?

    10,000,000 bits = 1,200,000 bytes or 1.2MB

    so my calclation says your going to not have enough room on your 1MB memory.

    or have i missed something here?"


    2^10,000,000 is 10,000,001 bits long. (2^0 = 1 (1 bit), 2^1 = 2 (2 bits)...) :-) but anyway...

    I'm not keeping track of N (N=k*2^n+1). I'm keeping track of the remainder of N mod p, which I reference as r.

    Taking the range I'm sieving now (882000G to 883000G) those are roughly 50 bit numbers.

    So if p is a 50-bit number, the largest r could be is p-1 (still 50 bits). Then we double it (could be 51 bits here), then we subtract p if it is greater or equal to p. (back to 50 bits max).

    Lather, rinse and repeat all the way to n=50,000,000.

  8. #8
    Moderator Joe O's Avatar
    Join Date
    Jul 2002
    Location
    West Milford, NJ
    Posts
    643
    Originally posted by Greenbank
    Lather, rinse and repeat all the way to n=50,000,000.
    No need to stop at 50M<GRIN> I'm keeping track of all factors found. Just send them to factrange at yahoo dot com

    So how is your project progressing?
    Joe O

  9. #9
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    I've only spent about 1 hour with the FPGA development software and I can't even get a simple 8-bit counter working (I've had it working before).

    I'm going to try and get a friendly person on comp.arch.fpga to give me a hand as I'm pretty much out of my depth on this. One of the good things about the internet is that there's always people out there willing to help out for free.

    As for stopping at n=50M, it'll probably be 2^26 (67108864). It's easier to check for a single bit being set than doing a 26 bit comparison after each round.

    If you really wanted to push it then there's no point stopping at all, you can check all the way up to n=p-1 and find any loops in the remainders.

    But I'm not sure it helps the cause to know that:-

    8820000103 is a factor of 4847*2^589573502. (n ~ 590M)

    It's also a factor of 4847*2^4999573553+1. Yup, n ~ 5G.

    This means that it is a factor of 4847*2^(589573502+(4410000051*x))+1 for integer x >= 0.

  10. #10
    Greenbank,

    i've seen you posts here and elsewhere about your advancments in sieveing/factoring understading..

    i was wondering if you made any progress with the FPGA?

  11. #11
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    Nothing yet, I've been a bit busy at work and also outside work (spent the weekend at Lord's watching us get thrashed by Australia).

    Now I understand it more I can modify the algorithm to remove the decrement:-

    k*2^n+1 = 0 (mod p)
    k*2^n = -1 (mod p)
    2^n = -1 * (k^-1) mod p

    The server-PC side of the system will calculate -1 * k^-1 (mod p) for each of the k values, remove any candidates by checking the quadratic residues (as described by akruppa) and then pass these to the FPGA.

    This means that I just have to keep left shifting and possibly subtracting p and also checking that the value is one of the 9 search values.

    I'll see what I can do in the next month but free time is pretty rare...

    [EDIT] Further to this, I should be able to do something to help Riesel sieve at the same time, and possibly even PSP as each of these reduces to solving the Discrete Log equation 2^n = <some_value> (mod p).

    Due to the inherent parallel nature of FPGAs I can check for the current LHS value against many values without any performance hit. The problem becomes checking for many values as this may take up more 'space' on the FPGA reducing the number of sievers that can be placed on one FPGA and therefore reducing the overall performance of the FPGA as a whole.

    [EDIT 2] I had a chance meeting in a pub with someone who works with FPGAs and was able to pick their brains about an FPGA proth/riesel sieve.

    Everything looks good, I have a few pointers on things and two recommended books on order from Amazon. I should have a prototype (that would run a hardcoded p value) in a couple of weeks. From then I have to implement the memory interface, the serial interface on both the FPGA and the controlling PC and then see how many sievers I can get on one FPGA.

    I may even be able to implement the Baby-Steps Giant-Steps algorithm in an FPGA and that would be insanely fast. It's a completely different way of thinking about things when you have some much parallel processing available.
    Last edited by Greenbank; 07-26-2005 at 08:04 AM.

  12. #12
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    Minor update.

    I successfully *SIMULATED* (i.e. no real FPGA action yet) a very simple siever last night.

    I wasn't really sure what I was doing but I'd implemented the algorithm using the standard VHDL INTEGER type (signed 32bit) and watched it run for a while with p=101 and k=19249. Not big I know, but still a start.

    But it correctly threw n=21 to the output port.

    No idea how fast it could run as it was purely being simulated. An interesting thing to note is that I don't have to put any timing constraints in the main loop so it should run as fast as possible.

    The only timing constraints will be reading in the appropriate values (p and the 9 target values to search for) and writing them back out after the main loop has executed.

    I'm supposed to be going out on the booze with some friends tonight so I doubt I'll get any more time to hack at it until Friday.

  13. #13
    Here

    http://www.mersenneforum.org/showthread.php?t=4414

    you might find some other interested people. They talk about factorization, but...
    H.

  14. #14
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    Ta.

  15. #15
    From recollection the integer type can only be used for simulation and is not synthesisable. You made need to convert it to a vector of 32 bits. Nothing too bad, but something you should keep in mind is that you will need to change the code before synthesis and so probably should resimulate.

  16. #16
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    Indeed, I'm currently playing around with a 64 bit-vector although I might scale it down (to 51 bits, then 52, etc) if gains can be made. I'm also reading up on pipelining tricks etc to see what can be done (especially with a 64-bit subtractor).

  17. #17
    Sceptical Member
    Join Date
    Jun 2004
    Location
    Oslo, Norway
    Posts
    44
    How's it going with the FPGA siever project Greenbank?
    I must say that this is a very exciting project.
    I have many vacant pci-slots which is dying to mate with a siever (or PRP) card

    Sceptic
    Violently sceptical!

  18. #18
    Member
    Join Date
    Dec 2002
    Location
    new york
    Posts
    76
    Ummm...wow. Okay. Greenbank, this thread hasnt been updated in a couple months. Did you stop because you got bored (need more outside interest) or because you don't think this is possible?

    Some of the things you said about FPGA are astounding. If we can get any of these algorithms into hardware then I suppose we shift to light speed.

  19. #19
    Senior Member
    Join Date
    Jun 2005
    Location
    London, UK
    Posts
    271
    I've been meaning to give an update, thanks for prompting me.

    Short story:

    No, it's not feasible with the, relatively small, FPGA I have.

    Longer story:

    Several problems:

    The naive algorithm (just iterating n, shifting up to double the number and conditionally subtracting p) is extremely slow.

    It would require massive parallelisation to even get close to the speed of an average PC.

    Unfortunately the number of gates required to implement even the naive algorithm means that a limited number of sievers can be placed on a single FPGA, several orders of magnitude below what would be necessary.

    And an FPGA with enough gates to would cost 10 times as much as a barebones PC to perform the same amount of work.

    I started to look down the FPGA before I understood sieving properly (see the Math thread for details of how it actually works).

    Theoretically it is possible to implement the various algorithms (BSGS and SPH) in an FPGA but these algorithms do not lend themselves well to the parallel processing world of an FPGA.

    One example would be the hash table used within BSGS. If I make it generic to any prime being tested < 2^64 then I would need a hash table with storage for 8192 (2^13) 64-bit values and the corresponding 8192 16-bit exponents.

    8192 * ((64+16)/8) = 81920 bytes.

    Memory is limited on the FPGA, and memory bandwidth is much lower than that of a typical PC, especially as I have to "write" the memory access code in the FPGA myself (taking up precious gates).

    Another problem is candidate p generation. The current proth_sieve code uses a modified Eratosthenes sieve algorithm which uses up relatively large (in the FPGA world) amounts of memory. The current proth_sieve code requires 2^22 bits of memory. This is 1/2MB of memory, and the FPGA dev kit I have only has 1MB.

    The remaining 1/2MB limits the number of concurrent sieve processes to a maximum of 6 (5 realistically with correct boundaries and leaving space for other bits of memory required).

    Clock frequencies are much lower on FPGAs so I wouldn't even be coming close to an average PC.

    So, after I learnt a lot more about how sieving actually works, I can't find a good way to use an FPGA to implement any of it.

    I can't even farm out some computationally expensive bit of the algorithm to the FPGA as the host computer to FPGA communication latencies are so big that it would be quicker to just use the CPU to do that work.

    The problem is that sieving involves relatively little work per p value. If you consider that at a prime density of 1 in 23 around 882T then 300 kp/sec means that proth_sieve is testing 300,000/23 =~ 13000 different p values per second.

    Transferring 13000 64-bit values a second would require 1Mbit/sec comms. Putting something that can handle 1Mbit/sec comms on an FPGA would again consume even more valueable realestate (gates) leaving even less space for the sievers.

    --

    FPGAs may make their mark in other areas when large amounts of computation are required on a small set of input data. I know that people are working on ECM factorisation within FPGAs but most of these are limited to small (sub 100 digit base 10) numbers.

    FPGAs do work very well when you have something that is massively parallelisable, or where computations do not rely on the values from previous iterations. This is why they work very well for symmetric encryption and/or hashing algorithms. Generation of candidate values is easy (just an increasing binary counter) and the steps/rounds of the encryption can be pipelined to very good effect.

    I haven't abandoned the FPGA completely. If I can find one part of the sieving algorithm that has minimal input/output requirements, has a relatively large computational complexity but does not have dependencies on the results of previous iterations, then I can consider it.
    Quad 2.5GHz G5 PowerMac. Mmmmm.
    My Current Sieve Progress: http://www.greenbank.org/cgi-bin/proth.cgi

  20. #20
    Former QueueMaster Ken_g6[TA]'s Avatar
    Join Date
    Nov 2002
    Location
    Colorado
    Posts
    184
    It occurred to me that, if sieving with an FPGA was out, factoring with one might still work. Since I don't know much about Lenstra's Elliptic Curve Method, I looked for Pollard Rho instead. As it turned out, that was a fortuitous choice. Googling that showed me that Pollard Rho is actually a DLP solver that's often used for factoring. The really nice thing is that Pollard Rho uses only O(1) space!

    So I would suggest trying Pollard Rho for solving the DLPs, in particular with Brent's loop-finding algorithm. Unfortunately, that Wikipedia entry is for the factoring algorithm, because the DLP algorithm is just a stub.

    Also, there's an interesting PDF here that describes how Pollard Rho can solve for several k's faster than testing them individually. I would guess what you could do is synchronize 8 instances and have them look at each other's saved Brent points.
    Proud member of the friendliest team around, Team Anandtech!
    The Queue is dead! (Or not needed.) Long Live George Woltman!

  21. #21
    Former QueueMaster Ken_g6[TA]'s Avatar
    Join Date
    Nov 2002
    Location
    Colorado
    Posts
    184
    Any progress on this?

    I just saw an interesting Slashdot article about an FPGA that has access to all of system memory. It might be interesting if anyone who's doing sieving ever gets one. But it costs $5000, so I won't be an early adopter.
    Proud member of the friendliest team around, Team Anandtech!
    The Queue is dead! (Or not needed.) Long Live George Woltman!

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •