
Thread: dnetc RC5-72 + nvidia 8800 GTX = 84,343,980 keys/sec

  1. #1
    Free-DC's Prime Search
    Join Date
    Apr 2004
    Posts
    2,518

    dnetc RC5-72 + nvidia 8800 GTX = 84,343,980 keys/sec

    I purchased an 8800 GTX a little while back (after watching Ian Buck's presentation on CUDA {Stanford Univ. EE380 video}).

    I initially focused my efforts toward accelerating xvid. I implemented the half-pel and quarter-pel interpolation algorithms and found that the overhead of moving data to and from the GPU was killing the performance gain.

    So, I started looking at the motion compensation routines (where xvid spends most of its time). The current MC code has a large number of conditionals, and I was wary of attempting any kind of implementation without a reasonably good idea of what all of the conditional paths are for.

    I decided to look for an algorithm that has a relatively small kernel and is seriously compute bound. RC5 fit the bill. I started hacking a CUDA core into dnetc on Sat. afternoon and finally got things working smoothly an hour ago.
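To show why RC5 counts as a small, compute-bound kernel, here is a minimal sketch of RC5-32/12/9 (32-bit words, 12 rounds, 72-bit key) and a brute-force key check. It is plain Python for readability, not the CUDA core from the repo; the helper names (`expand_key`, `encrypt`, `search`) and the tiny 256-key search range are illustrative assumptions.

```python
MASK = 0xFFFFFFFF                     # 32-bit words (w = 32)
P32, Q32 = 0xB7E15163, 0x9E3779B9     # RC5 magic constants for w = 32
ROUNDS = 12                           # r = 12
KEY_BYTES = 9                         # b = 9 bytes = 72 bits

def rotl(x, n):
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK

def rotr(x, n):
    n &= 31
    return ((x >> n) | (x << (32 - n))) & MASK

def expand_key(key: bytes):
    """Build the 26-word round-key table S from a 9-byte key."""
    c = (len(key) + 3) // 4           # key words: 3 for a 72-bit key
    L = [0] * c
    for i in range(len(key) - 1, -1, -1):
        L[i // 4] = ((L[i // 4] << 8) + key[i]) & MASK
    t = 2 * (ROUNDS + 1)
    S = [(P32 + i * Q32) & MASK for i in range(t)]
    a = b = i = j = 0
    for _ in range(3 * max(t, c)):    # mixing pass: the bulk of the work per key
        a = S[i] = rotl((S[i] + a + b) & MASK, 3)
        b = L[j] = rotl((L[j] + a + b) & MASK, a + b)
        i = (i + 1) % t
        j = (j + 1) % c
    return S

def encrypt(S, a, b):
    a = (a + S[0]) & MASK
    b = (b + S[1]) & MASK
    for r in range(1, ROUNDS + 1):
        a = (rotl(a ^ b, b) + S[2 * r]) & MASK
        b = (rotl(b ^ a, a) + S[2 * r + 1]) & MASK
    return a, b

def decrypt(S, a, b):
    for r in range(ROUNDS, 0, -1):
        b = rotr((b - S[2 * r + 1]) & MASK, a) ^ a
        a = rotr((a - S[2 * r]) & MASK, b) ^ b
    return (a - S[0]) & MASK, (b - S[1]) & MASK

def search(prefix: bytes, plain, cipher):
    """Try all 256 values of the last key byte (toy stand-in for a work unit)."""
    for last in range(256):
        key = prefix + bytes([last])
        if encrypt(expand_key(key), *plain) == cipher:
            return key
    return None

# Hypothetical key and plaintext, chosen only for this demonstration.
secret = bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0, 0x42])
plain = (0x01234567, 0x89ABCDEF)
cipher = encrypt(expand_key(secret), *plain)
found = search(secret[:8], plain, cipher)
```

Each candidate key needs only a few dozen words of state and a fixed number of add/xor/rotate steps, with no data-dependent branching, which is roughly the shape of workload that maps well onto a GPU.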

    The CUDA core is totally UN-optimized and still manages to be well over 12x the performance of the next fastest core on my E4300 (stock speed).

    I have posted the sources in my Mercurial repo.

    http://dungeon.darktech.org/hg/dnetc_cuda/

    Because I used the public dnetc snapshot, it is not possible to build an official client with this code. Also, I hacked up the configure script, so I doubt it is even sane on any archs other than x86-linux with CUDA and nvcc present. But, if you are an enterprising hacker, have fun with the code.


    paul@sr71 ~/code/dnetc_cuda $ ./dnetc -test RC5-72 10

    distributed.net client for Linux Copyright 1997-2006, distributed.net
    Please visit http://www.distributed.net/ for up-to-date contest information.


    dnetc v2.9012-497-CFR-06032022 for Linux (Linux 2.6.20).
    Please provide the *entire* version descriptor when submitting bug reports.
    The distributed.net bug report pages are at http://www.distributed.net/bugs/

    [Mar 12 04:15:42 UTC] Automatic processor type detection did not
    recognize the processor (tag: "6547:06F2")
    [Mar 12 04:15:42 UTC] RC5-72: using core #10 (CUDA 1-pipe).
    [Mar 12 04:15:42 UTC] RC5-72: Test 01 passed: C9:0C0353C04E1FE85-C9:0C0353C04E1FE85
    [Mar 12 04:15:42 UTC] RC5-72: Test 02 passed: DE:EE0C6279:BF66F898-DE:EE0C6279:BF66F898
    [Mar 12 04:15:42 UTC] RC5-72: Test 03 passed: 0F:556979E7:6C009260-0F:556979E7:6C009260
    [Mar 12 04:15:42 UTC] RC5-72: Test 04 passed: 9E8B648C6:00003A3C-9E8B648C6:00003A3C
    [Mar 12 04:15:42 UTC] RC5-72: Test 05 passed: C8:B3631100:0000EAF0-C8:B3631100:0000EAF0
    [Mar 12 04:15:42 UTC] RC5-72: Test 06 passed: FE:40080000:00006F64-FE:40080000:00006F64
    [Mar 12 04:15:42 UTC] RC5-72: Test 07 passed: 28:69000000:0000204D-28:69000000:0000204D
    [Mar 12 04:15:42 UTC] RC5-72: Test 08 passed: 6E:00000000:0000172F-6E:00000000:0000172F
    [Mar 12 04:15:42 UTC] RC5-72: Test 09 passed: C6:E9386A44:C0F9D107-C6:E9386A44:C0F9D107
    [Mar 12 04:15:42 UTC] RC5-72: Test 10 passed: 2B:E01C5B9D65CCAD7-2B:E01C5B9D65CCAD7
    [Mar 12 04:15:42 UTC] RC5-72: Test 11 passed: 97:2C0F244D:EFC54E4F-97:2C0F244D:EFC54E4F
    [Mar 12 04:15:42 UTC] RC5-72: Test 12 passed: A8:8960B40B:1F46AD1F-A8:8960B40B:1F46AD1F
    [Mar 12 04:15:42 UTC] RC5-72: Test 13 passed: B1:FFE95917:B38E4396-B1:FFE95917:B38E4396
    [Mar 12 04:15:42 UTC] RC5-72: Test 14 passed: C6:46E7E19D:9CD65C85-C6:46E7E19D:9CD65C85
    [Mar 12 04:15:42 UTC] RC5-72: Test 15 passed: E3686400B:7EFB2180-E3686400B:7EFB2180
    [Mar 12 04:15:42 UTC] RC5-72: Test 16 passed: 85:EA3678CF:91DB0D2C-85:EA3678CF:91DB0D2C
    [Mar 12 04:15:42 UTC] RC5-72: Test 17 passed: D6:BE71026E:348165EE-D6:BE71026E:348165EE
    [Mar 12 04:15:42 UTC] RC5-72: Test 18 passed: 5F:71AD1E37:82BC4D50-5F:71AD1E37:82BC4D50
    [Mar 12 04:15:42 UTC] RC5-72: Test 19 passed: 11:4134BDB0:175A077F-11:4134BDB0:175A077F
    [Mar 12 04:15:42 UTC] RC5-72: Test 20 passed: 94:888FF8CB:282E6E5F-94:888FF8CB:282E6E5F
    [Mar 12 04:15:42 UTC] RC5-72: Test 21 passed: D9:48A2E6E4:CD610000-D9:48A2E6E4:CD610000
    [Mar 12 04:15:42 UTC] RC5-72: Test 22 passed: E5:71448E830860001-E5:71448E830860001
    [Mar 12 04:15:42 UTC] RC5-72: Test 23 passed: 3E:ED6D9F85:A6D70002-3E:ED6D9F85:A6D70002
    [Mar 12 04:15:42 UTC] RC5-72: Test 24 passed: 2504F6B0E:16AD0003-2504F6B0E:16AD0003
    [Mar 12 04:15:42 UTC] RC5-72: Test 25 passed: 05:45C2E10D:273D0000-05:45C2E10D:273D0000
    [Mar 12 04:15:42 UTC] RC5-72: Test 26 passed: 56:30E19DF4:8C460000-56:30E19DF4:8C460000
    [Mar 12 04:15:42 UTC] RC5-72: Test 27 passed: 85:3B37FFD3:9F140000-85:3B37FFD3:9F140000
    [Mar 12 04:15:42 UTC] RC5-72: Test 28 passed: 80:B75263C5:41660000-80:B75263C5:41660000
    [Mar 12 04:15:42 UTC] RC5-72: Test 29 passed: 03:52A1DF428A30000-03:52A1DF428A30000
    [Mar 12 04:15:42 UTC] RC5-72: Test 30 passed: 87:23A58F8F5940000-87:23A58F8F5940000
    [Mar 12 04:15:42 UTC] RC5-72: Test 31 passed: CC:9661BA34:7604002A-CC:9661BA34:7604002A
    [Mar 12 04:15:42 UTC] RC5-72: Test 32 passed: 21:E765D2F6:C6110000-21:E765D2F6:C6110000
    [Mar 12 04:15:42 UTC] RC5-72: 32/32 Tests Passed (0.064004 seconds)





    paul@sr71 ~/code/dnetc_cuda $ ./dnetc -bench RC5-72

    distributed.net client for Linux Copyright 1997-2006, distributed.net
    Please visit http://www.distributed.net/ for up-to-date contest information.


    dnetc v2.9012-497-CFR-06032022 for Linux (Linux 2.6.20).
    Please provide the *entire* version descriptor when submitting bug reports.
    The distributed.net bug report pages are at http://www.distributed.net/bugs/

    [Mar 12 04:11:47 UTC] Automatic processor type detection did not
    recognize the processor (tag: "6547:06F2")
    [Mar 12 04:11:47 UTC] RC5-72: using core #0 (SES 1-pipe).
    [Mar 12 04:12:07 UTC] RC5-72: Benchmark for core #0 (SES 1-pipe)
    0.00:00:17.08 [3,716,277 keys/sec]
    [Mar 12 04:12:07 UTC] RC5-72: using core #1 (SES 2-pipe).
    [Mar 12 04:12:27 UTC] RC5-72: Benchmark for core #1 (SES 2-pipe)
    0.00:00:17.25 [6,228,036 keys/sec]
    [Mar 12 04:12:27 UTC] RC5-72: using core #2 (DG 2-pipe).
    [Mar 12 04:12:45 UTC] RC5-72: Benchmark for core #2 (DG 2-pipe)
    0.00:00:16.59 [4,967,345 keys/sec]
    [Mar 12 04:12:45 UTC] RC5-72: using core #3 (DG 3-pipe).
    [Mar 12 04:13:05 UTC] RC5-72: Benchmark for core #3 (DG 3-pipe)
    0.00:00:16.57 [6,231,719 keys/sec]
    [Mar 12 04:13:05 UTC] RC5-72: using core #4 (DG 3-pipe alt).
    [Mar 12 04:13:24 UTC] RC5-72: Benchmark for core #4 (DG 3-pipe alt)
    0.00:00:17.46 [5,665,622 keys/sec]
    [Mar 12 04:13:24 UTC] RC5-72: using core #5 (SS 2-pipe).
    [Mar 12 04:13:43 UTC] RC5-72: Benchmark for core #5 (SS 2-pipe)
    0.00:00:16.30 [5,274,208 keys/sec]
    [Mar 12 04:13:43 UTC] RC5-72: using core #6 (GO 2-pipe).
    [Mar 12 04:14:03 UTC] RC5-72: Benchmark for core #6 (GO 2-pipe)
    0.00:00:17.11 [6,207,954 keys/sec]
    [Mar 12 04:14:03 UTC] RC5-72: using core #7 (SGP 3-pipe).
    [Mar 12 04:14:22 UTC] RC5-72: Benchmark for core #7 (SGP 3-pipe)
    0.00:00:16.63 [6,567,384 keys/sec]
    [Mar 12 04:14:22 UTC] RC5-72: using core #8 (MA 4-pipe).
    [Mar 12 04:14:42 UTC] RC5-72: Benchmark for core #8 (MA 4-pipe)
    0.00:00:16.95 [5,364,069 keys/sec]
    [Mar 12 04:14:42 UTC] RC5-72: using core #9 (MMX 4-pipe).
    [Mar 12 04:15:01 UTC] RC5-72: Benchmark for core #9 (MMX 4-pipe)
    0.00:00:16.64 [4,298,758 keys/sec]
    [Mar 12 04:15:01 UTC] RC5-72: using core #10 (CUDA 1-pipe).
    [Mar 12 04:15:19 UTC] RC5-72: Benchmark for core #10 (CUDA 1-pipe)
    0.00:00:16.28 [84,343,980 keys/sec]
    From Ars Forum
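As a quick check of the "well over 12x" figure, the benchmark numbers above can be compared directly. This is only a sketch of the arithmetic; the dictionary layout and variable names are mine, the rates are copied from the log.

```python
# keys/sec for each CPU core, copied from the -bench output above
cpu_cores = {
    "SES 1-pipe": 3_716_277,
    "SES 2-pipe": 6_228_036,
    "DG 2-pipe": 4_967_345,
    "DG 3-pipe": 6_231_719,
    "DG 3-pipe alt": 5_665_622,
    "SS 2-pipe": 5_274_208,
    "GO 2-pipe": 6_207_954,
    "SGP 3-pipe": 6_567_384,
    "MA 4-pipe": 5_364_069,
    "MMX 4-pipe": 4_298_758,
}
cuda_rate = 84_343_980  # core #10 (CUDA 1-pipe)

# Fastest CPU core and the CUDA core's speedup over it
best_name = max(cpu_cores, key=cpu_cores.get)
speedup = cuda_rate / cpu_cores[best_name]
print(f"fastest CPU core: {best_name}, CUDA speedup: {speedup:.2f}x")
```

The fastest CPU core in this run is SGP 3-pipe, and the CUDA core comes out at roughly 12.8 times its rate.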

  2. #2
    Senior Member
    Join Date
    Jan 2005
    Location
    Wisconsin USA
    Posts
    129
    Too bad there's no windoze client...

    I would have given it a try, I have 1 of them here.
    Since you can read this, Thank a Teacher.

    Since you are reading this in English, Thank a Veteran.




    Personal Web Page
    http://www.maqs.net/~jmerlin/

  3. #3
    Free-DC's Prime Search
    Join Date
    Apr 2004
    Posts
    2,518
    Latest optimization:

    ./dnetc -bench RC5-72 10

    distributed.net client for Linux Copyright 1997-2006, distributed.net
    Please visit http://www.distributed.net/ for up-to-date contest information.


    dnetc v2.9012-497-CFR-06032022 for Linux (Linux 2.6.20).
    Please provide the *entire* version descriptor when submitting bug reports.
    The distributed.net bug report pages are at http://www.distributed.net/bugs/

    [Mar 17 22:00:06 UTC] Automatic processor type detection did not
    recognize the processor (tag: "6547:06F2")
    [Mar 17 22:00:06 UTC] RC5-72: using core #10 (CUDA 1-pipe).
    [Mar 17 22:00:24 UTC] RC5-72: Benchmark for core #10 (CUDA 1-pipe)
    0.00:00:16.26 [113,784,744 keys/sec]

  4. #4
    good lord, that's a block every 40 seconds or so.
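The "block every 40 seconds" estimate can be reproduced with a short calculation, assuming a "block" here means a work unit of 2^32 keys (that size is my assumption, not something stated in the thread).

```python
KEYS_PER_UNIT = 2 ** 32          # assumed size of one RC5-72 work unit
rate = 113_784_744               # keys/sec from the benchmark in post #3

seconds_per_unit = KEYS_PER_UNIT / rate
print(f"{seconds_per_unit:.1f} seconds per 2^32-key unit")
```

Under that assumption, one unit takes a little under 38 seconds, which is consistent with "40 seconds or so".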


  5. #5
    almost retired the-mk's Avatar
    Join Date
    Jan 2003
    Location
    KI/OOE/Austria
    Posts
    1,921
    Blog Entries
    6
    I need more GPU power

    Impressive number!

    Is this thing able to do OGR-25 work?
    the-mk

  6. #6

    Interesting

    How would this work on a 6600GT SLI?

  7. #7
    Free-DC's Prime Search
    Join Date
    Apr 2004
    Posts
    2,518
    Latest optimization

    ./dnetc -bench RC5-72 10

    distributed.net client for Linux Copyright 1997-2006, distributed.net
    Please visit http://www.distributed.net/ for up-to-date contest information.


    dnetc v2.9012-497-CFR-06032022 for Linux (Linux 2.6.20).
    Please provide the *entire* version descriptor when submitting bug reports.
    The distributed.net bug report pages are at http://www.distributed.net/bugs/

    [Mar 24 03:41:20 UTC] Automatic processor type detection did not
    recognize the processor (tag: "6547:06F2")
    [Mar 24 03:41:20 UTC] RC5-72: using core #10 (CUDA 1-pipe).
    [Mar 24 03:41:39 UTC] RC5-72: Benchmark for core #10 (CUDA 1-pipe)
    0.00:00:17.10 [124,925,397 keys/sec]

  8. #8
    Dungeon Master alpha's Avatar
    Join Date
    Mar 2002
    Location
    Norfolk, UK
    Posts
    1,700
    Quote Originally Posted by the-mk
    Is this thing able to do OGR-25 work?
    Apparently:

    I agree that OGR is much more interesting but reading the documentation surrounding the core algorithm, the non-constant execution times may present a significant challenge WRT getting good performance out of a GPU implementation. But, I haven't looked too closely at the underlying code, so my comments may be irrelevant.
    The work this guy has done so far is astounding so if anything like this can be done for OGR it would be great! The sooner we get OGR-25 finished with, the sooner we can start OGR-26.
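The concern about non-constant execution times can be illustrated with a toy model. It assumes that GPU threads run in lockstep groups (32 lanes on NVIDIA hardware), so a group is busy until its slowest lane finishes. The per-lane work figures below are made up for illustration and are not measurements of RC5 or OGR.

```python
import random

WARP_SIZE = 32  # lanes that execute in lockstep

def lane_efficiency(work_per_lane):
    """Useful work divided by lane-time consumed when all lanes wait for the slowest."""
    return sum(work_per_lane) / (len(work_per_lane) * max(work_per_lane))

# Constant-time workload: every lane does the same number of steps (RC5-like).
constant = [100] * WARP_SIZE

# Variable-time workload: hypothetical, uniformly random step counts per lane.
rng = random.Random(0)
variable = [rng.randint(1, 100) for _ in range(WARP_SIZE)]

eff_constant = lane_efficiency(constant)
eff_variable = lane_efficiency(variable)
print(f"constant work: {eff_constant:.2f}, variable work: {eff_variable:.2f}")
```

In this simplified model a fixed-cost kernel keeps every lane busy, while a kernel whose cost varies per work item leaves lanes idle, which is one reason a search with uneven run times may be harder to speed up on a GPU.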

  9. #9
    yep, it is really an eye-opener. He mentioned that he has been in contact with the distributed.net folks, so hopefully it may come to pass for the rc5 effort anyway. Those video cards aren't cheap, though the performance would be well worth it if this evolves into an official client.

  10. #10
    Senior Member
    Join Date
    Apr 2004
    Location
    Florianopolis - Santa Catarina - Brazil
    Posts
    114
    Quote Originally Posted by alpha
    The work this guy has done so far is astounding so if anything like this can be done for OGR it would be great! The sooner we get OGR-25 finished with, the sooner we can start OGR-26.
    "Future Projects
    RSA Prime Factoring:

    The inability to quickly factor large composite numbers into its prime factors is one of the underlying assumptions of many cryptographic systems. RSA Labs is sponsoring a series of challenges to factor successively larger numbers, each with an increasing prize amount."

    http://www.distributed.net/projects.php

  11. #11
    Free-DC's Prime Search
    Join Date
    Apr 2004
    Posts
    2,518
    Latest optimization

    Just today, I optimized the result calculation and I am now seeing ~ 144 Mkeys/sec on my 8800 GTX.
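To put the successive results in this thread side by side, here is a small sketch that expresses each reported rate relative to the first CUDA benchmark. The labels are mine; the last figure is the approximate "~144 Mkeys/sec" quoted above, so its ratio is approximate too.

```python
# (label, keys/sec) as reported in the thread
results = [
    ("Mar 12 benchmark", 84_343_980),
    ("Mar 17 benchmark", 113_784_744),
    ("Mar 24 benchmark", 124_925_397),
    ("result-calculation optimization (approx.)", 144_000_000),
]

baseline = results[0][1]
gains = [(label, rate / baseline) for label, rate in results]
for label, gain in gains:
    print(f"{label}: {gain:.2f}x the first CUDA result")
```

By these numbers the optimizations add up to roughly a 1.7x improvement over the initial un-optimized core.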

  12. #12
    Unholy Undead Death's Avatar
    Join Date
    Sep 2003
    Location
    Kyiv, Ukraine
    Posts
    907
    Blog Entries
    1
    too bad that rc5 seems to be coming to a close
    wbr, Me. Dead J. Dona \


