dnetc RC5-72 + nvidia 8800 GTX = 84,343,980 keys/sec



em99010pepe
03-12-2007, 03:31 AM
I purchased an 8800 GTX a little while back (after watching Ian Buck's presentation on CUDA {Stanford Univ. EE380 video}).

I initially focused my efforts toward accelerating xvid. I implemented the half-pel and quarter-pel interpolation algorithms and found that the overhead of moving data to and from the GPU was killing the performance gain.
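A rough way to see why the copies can eat the gain is the back-of-the-envelope model below. It is only a sketch: the function name, the frame size, the timings, and the bus rate are all hypothetical and were not measured on the xvid code or on this card.

```python
def offload_speedup(cpu_s, gpu_compute_s, bytes_moved, bus_bytes_per_s):
    """Speedup from offloading one step, counting host<->device copies."""
    transfer_s = bytes_moved / bus_bytes_per_s
    return cpu_s / (gpu_compute_s + transfer_s)


# Hypothetical numbers, for illustration only.
# One 720x480 4:2:0 frame copied to the GPU and back again.
frame_bytes = 2 * 720 * 480 * 3 // 2

no_copy = offload_speedup(cpu_s=2e-3, gpu_compute_s=0.2e-3,
                          bytes_moved=0, bus_bytes_per_s=1e9)
with_copy = offload_speedup(cpu_s=2e-3, gpu_compute_s=0.2e-3,
                            bytes_moved=frame_bytes, bus_bytes_per_s=1e9)

print(f"kernel only: {no_copy:.1f}x, including copies: {with_copy:.1f}x")
```

With these made-up inputs a kernel that is about 10x faster on its own ends up well under 2x once the copies are counted, which is the kind of effect described above.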

So, I started looking at the motion compensation routines (where xvid spends most of its time). The current MC code has a large number of conditionals, and I was wary of attempting any kind of implementation without having a reasonably good idea of what all of the conditional paths are for.

I decided to look for an algorithm that has a relatively small kernel and is seriously compute bound. RC5 fit the bill. I started hacking a CUDA core into dnetc on Sat. afternoon and finally got things working smoothly an hour ago.
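For readers who have not seen RC5, the sketch below shows how small the per-key work is. It is a plain Python reference for RC5-32/12/9 (32-bit words, 12 rounds, 9-byte key, i.e. the RC5-72 parameters) written from the published algorithm description. It is not taken from the dnetc sources or from the CUDA core, and the function names are my own.

```python
MASK = 0xFFFFFFFF
P32, Q32 = 0xB7E15163, 0x9E3779B9   # RC5 magic constants for 32-bit words
ROUNDS = 12
T = 2 * (ROUNDS + 1)                # 26 subkey words


def rotl(x, n):
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK


def rotr(x, n):
    n &= 31
    return ((x >> n) | (x << (32 - n))) & MASK


def expand_key(key: bytes):
    """Key schedule: mix the secret key into the subkey table S."""
    c = max(1, (len(key) + 3) // 4)          # key length in 32-bit words
    L = [0] * c
    for i in range(len(key) - 1, -1, -1):    # little-endian byte load
        L[i // 4] = ((L[i // 4] << 8) + key[i]) & MASK
    S = [(P32 + i * Q32) & MASK for i in range(T)]
    a = b = i = j = 0
    for _ in range(3 * max(T, c)):
        a = S[i] = rotl((S[i] + a + b) & MASK, 3)
        b = L[j] = rotl((L[j] + a + b) & MASK, a + b)
        i = (i + 1) % T
        j = (j + 1) % c
    return S


def encrypt_block(S, a, b):
    """Encrypt one 64-bit block given as two 32-bit words."""
    a = (a + S[0]) & MASK
    b = (b + S[1]) & MASK
    for r in range(1, ROUNDS + 1):
        a = (rotl(a ^ b, b) + S[2 * r]) & MASK
        b = (rotl(b ^ a, a) + S[2 * r + 1]) & MASK
    return a, b


def decrypt_block(S, a, b):
    """Inverse of encrypt_block."""
    for r in range(ROUNDS, 0, -1):
        b = rotr((b - S[2 * r + 1]) & MASK, a) ^ a
        a = rotr((a - S[2 * r]) & MASK, b) ^ b
    return (a - S[0]) & MASK, (b - S[1]) & MASK


# Arbitrary 72-bit key and plaintext, chosen only to exercise the code.
subkeys = expand_key(bytes(range(9)))
ct = encrypt_block(subkeys, 0x12345678, 0x9ABCDEF0)
pt = decrypt_block(subkeys, *ct)
```

Everything is 32-bit adds, XORs, and rotates on a few dozen words of state, with no large tables to fetch from memory. That is why an exhaustive key search over this routine is compute bound: each candidate key costs one `expand_key` plus one `encrypt_block`.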

The CUDA core is totally UN-optimized and still manages to be well over 12x the performance of the next fastest core on my E4300 (stock speed).

I have posted the sources in my mercurial repo.

http://dungeon.darktech.org/hg/dnetc_cuda/

Because I used the public dnetc snapshot, it is not possible to build an official client with this code. Also, I hacked up the configure script, so I doubt it is even sane on any archs other than x86-linux with CUDA and nvcc present. But, if you are an enterprising hacker, have fun with the code.


paul@sr71 ~/code/dnetc_cuda $ ./dnetc -test RC5-72 10

distributed.net client for Linux Copyright 1997-2006, distributed.net
Please visit http://www.distributed.net/ for up-to-date contest information.


dnetc v2.9012-497-CFR-06032022 for Linux (Linux 2.6.20).
Please provide the *entire* version descriptor when submitting bug reports.
The distributed.net bug report pages are at http://www.distributed.net/bugs/

[Mar 12 04:15:42 UTC] Automatic processor type detection did not
recognize the processor (tag: "6547:06F2")
[Mar 12 04:15:42 UTC] RC5-72: using core #10 (CUDA 1-pipe).
[Mar 12 04:15:42 UTC] RC5-72: Test 01 passed: C9:0C0353C0:D4E1FE85-C9:0C0353C0:D4E1FE85
[Mar 12 04:15:42 UTC] RC5-72: Test 02 passed: DE:EE0C6279:BF66F898-DE:EE0C6279:BF66F898
[Mar 12 04:15:42 UTC] RC5-72: Test 03 passed: 0F:556979E7:6C009260-0F:556979E7:6C009260
[Mar 12 04:15:42 UTC] RC5-72: Test 04 passed: 9E:D8B648C6:00003A3C-9E:D8B648C6:00003A3C
[Mar 12 04:15:42 UTC] RC5-72: Test 05 passed: C8:B3631100:0000EAF0-C8:B3631100:0000EAF0
[Mar 12 04:15:42 UTC] RC5-72: Test 06 passed: FE:40080000:00006F64-FE:40080000:00006F64
[Mar 12 04:15:42 UTC] RC5-72: Test 07 passed: 28:69000000:0000204D-28:69000000:0000204D
[Mar 12 04:15:42 UTC] RC5-72: Test 08 passed: 6E:00000000:0000172F-6E:00000000:0000172F
[Mar 12 04:15:42 UTC] RC5-72: Test 09 passed: C6:E9386A44:C0F9D107-C6:E9386A44:C0F9D107
[Mar 12 04:15:42 UTC] RC5-72: Test 10 passed: 2B:E01C5B9D:D65CCAD7-2B:E01C5B9D:D65CCAD7
[Mar 12 04:15:42 UTC] RC5-72: Test 11 passed: 97:2C0F244D:EFC54E4F-97:2C0F244D:EFC54E4F
[Mar 12 04:15:42 UTC] RC5-72: Test 12 passed: A8:8960B40B:1F46AD1F-A8:8960B40B:1F46AD1F
[Mar 12 04:15:42 UTC] RC5-72: Test 13 passed: B1:FFE95917:B38E4396-B1:FFE95917:B38E4396
[Mar 12 04:15:42 UTC] RC5-72: Test 14 passed: C6:46E7E19D:9CD65C85-C6:46E7E19D:9CD65C85
[Mar 12 04:15:42 UTC] RC5-72: Test 15 passed: E3:D686400B:7EFB2180-E3:D686400B:7EFB2180
[Mar 12 04:15:42 UTC] RC5-72: Test 16 passed: 85:EA3678CF:91DB0D2C-85:EA3678CF:91DB0D2C
[Mar 12 04:15:42 UTC] RC5-72: Test 17 passed: D6:BE71026E:348165EE-D6:BE71026E:348165EE
[Mar 12 04:15:42 UTC] RC5-72: Test 18 passed: 5F:71AD1E37:82BC4D50-5F:71AD1E37:82BC4D50
[Mar 12 04:15:42 UTC] RC5-72: Test 19 passed: 11:4134BDB0:175A077F-11:4134BDB0:175A077F
[Mar 12 04:15:42 UTC] RC5-72: Test 20 passed: 94:888FF8CB:282E6E5F-94:888FF8CB:282E6E5F
[Mar 12 04:15:42 UTC] RC5-72: Test 21 passed: D9:48A2E6E4:CD610000-D9:48A2E6E4:CD610000
[Mar 12 04:15:42 UTC] RC5-72: Test 22 passed: E5:71448E83:D0860001-E5:71448E83:D0860001
[Mar 12 04:15:42 UTC] RC5-72: Test 23 passed: 3E:ED6D9F85:A6D70002-3E:ED6D9F85:A6D70002
[Mar 12 04:15:42 UTC] RC5-72: Test 24 passed: 25:D04F6B0E:16AD0003-25:D04F6B0E:16AD0003
[Mar 12 04:15:42 UTC] RC5-72: Test 25 passed: 05:45C2E10D:273D0000-05:45C2E10D:273D0000
[Mar 12 04:15:42 UTC] RC5-72: Test 26 passed: 56:30E19DF4:8C460000-56:30E19DF4:8C460000
[Mar 12 04:15:42 UTC] RC5-72: Test 27 passed: 85:3B37FFD3:9F140000-85:3B37FFD3:9F140000
[Mar 12 04:15:42 UTC] RC5-72: Test 28 passed: 80:B75263C5:41660000-80:B75263C5:41660000
[Mar 12 04:15:42 UTC] RC5-72: Test 29 passed: 03:52A1DF42:D8A30000-03:52A1DF42:D8A30000
[Mar 12 04:15:42 UTC] RC5-72: Test 30 passed: 87:23A58F8F:D5940000-87:23A58F8F:D5940000
[Mar 12 04:15:42 UTC] RC5-72: Test 31 passed: CC:9661BA34:7604002A-CC:9661BA34:7604002A
[Mar 12 04:15:42 UTC] RC5-72: Test 32 passed: 21:E765D2F6:C6110000-21:E765D2F6:C6110000
[Mar 12 04:15:42 UTC] RC5-72: 32/32 Tests Passed (0.064004 seconds)





paul@sr71 ~/code/dnetc_cuda $ ./dnetc -bench RC5-72

distributed.net client for Linux Copyright 1997-2006, distributed.net
Please visit http://www.distributed.net/ for up-to-date contest information.


dnetc v2.9012-497-CFR-06032022 for Linux (Linux 2.6.20).
Please provide the *entire* version descriptor when submitting bug reports.
The distributed.net bug report pages are at http://www.distributed.net/bugs/

[Mar 12 04:11:47 UTC] Automatic processor type detection did not
recognize the processor (tag: "6547:06F2")
[Mar 12 04:11:47 UTC] RC5-72: using core #0 (SES 1-pipe).
[Mar 12 04:12:07 UTC] RC5-72: Benchmark for core #0 (SES 1-pipe)
0.00:00:17.08 [3,716,277 keys/sec]
[Mar 12 04:12:07 UTC] RC5-72: using core #1 (SES 2-pipe).
[Mar 12 04:12:27 UTC] RC5-72: Benchmark for core #1 (SES 2-pipe)
0.00:00:17.25 [6,228,036 keys/sec]
[Mar 12 04:12:27 UTC] RC5-72: using core #2 (DG 2-pipe).
[Mar 12 04:12:45 UTC] RC5-72: Benchmark for core #2 (DG 2-pipe)
0.00:00:16.59 [4,967,345 keys/sec]
[Mar 12 04:12:45 UTC] RC5-72: using core #3 (DG 3-pipe).
[Mar 12 04:13:05 UTC] RC5-72: Benchmark for core #3 (DG 3-pipe)
0.00:00:16.57 [6,231,719 keys/sec]
[Mar 12 04:13:05 UTC] RC5-72: using core #4 (DG 3-pipe alt).
[Mar 12 04:13:24 UTC] RC5-72: Benchmark for core #4 (DG 3-pipe alt)
0.00:00:17.46 [5,665,622 keys/sec]
[Mar 12 04:13:24 UTC] RC5-72: using core #5 (SS 2-pipe).
[Mar 12 04:13:43 UTC] RC5-72: Benchmark for core #5 (SS 2-pipe)
0.00:00:16.30 [5,274,208 keys/sec]
[Mar 12 04:13:43 UTC] RC5-72: using core #6 (GO 2-pipe).
[Mar 12 04:14:03 UTC] RC5-72: Benchmark for core #6 (GO 2-pipe)
0.00:00:17.11 [6,207,954 keys/sec]
[Mar 12 04:14:03 UTC] RC5-72: using core #7 (SGP 3-pipe).
[Mar 12 04:14:22 UTC] RC5-72: Benchmark for core #7 (SGP 3-pipe)
0.00:00:16.63 [6,567,384 keys/sec]
[Mar 12 04:14:22 UTC] RC5-72: using core #8 (MA 4-pipe).
[Mar 12 04:14:42 UTC] RC5-72: Benchmark for core #8 (MA 4-pipe)
0.00:00:16.95 [5,364,069 keys/sec]
[Mar 12 04:14:42 UTC] RC5-72: using core #9 (MMX 4-pipe).
[Mar 12 04:15:01 UTC] RC5-72: Benchmark for core #9 (MMX 4-pipe)
0.00:00:16.64 [4,298,758 keys/sec]
[Mar 12 04:15:01 UTC] RC5-72: using core #10 (CUDA 1-pipe).
[Mar 12 04:15:19 UTC] RC5-72: Benchmark for core #10 (CUDA 1-pipe)
0.00:00:16.28 [84,343,980 keys/sec]
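
The "well over 12x" figure can be checked directly against the benchmark output above. A small sketch, with the rates copied from the log and variable names of my own choosing:

```python
# keys/sec per core, copied from the -bench output above
cpu_cores = {
    "SES 1-pipe": 3_716_277,
    "SES 2-pipe": 6_228_036,
    "DG 2-pipe": 4_967_345,
    "DG 3-pipe": 6_231_719,
    "DG 3-pipe alt": 5_665_622,
    "SS 2-pipe": 5_274_208,
    "GO 2-pipe": 6_207_954,
    "SGP 3-pipe": 6_567_384,
    "MA 4-pipe": 5_364_069,
    "MMX 4-pipe": 4_298_758,
}
cuda_rate = 84_343_980

# Fastest CPU core and the CUDA core's advantage over it
best_name, best_rate = max(cpu_cores.items(), key=lambda kv: kv[1])
speedup = cuda_rate / best_rate

print(f"fastest CPU core: {best_name} at {best_rate:,} keys/sec")
print(f"CUDA core speedup: {speedup:.2f}x")
```

The fastest CPU core in this run is SGP 3-pipe, and the CUDA core comes out roughly 12.8 times faster.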


From Ars Forum (http://episteme.arstechnica.com/eve/forums/a/tpc/f/122097561/m/766004683831)

Merlin45
03-12-2007, 11:23 AM
Too bad there's no windoze client...

I would have given it a try, I have 1 of them here.

em99010pepe
03-18-2007, 03:47 PM
Latest optimization:


./dnetc -bench RC5-72 10

distributed.net client for Linux Copyright 1997-2006, distributed.net
Please visit http://www.distributed.net/ for up-to-date contest information.


dnetc v2.9012-497-CFR-06032022 for Linux (Linux 2.6.20).
Please provide the *entire* version descriptor when submitting bug reports.
The distributed.net bug report pages are at http://www.distributed.net/bugs/

[Mar 17 22:00:06 UTC] Automatic processor type detection did not
recognize the processor (tag: "6547:06F2")
[Mar 17 22:00:06 UTC] RC5-72: using core #10 (CUDA 1-pipe).
[Mar 17 22:00:24 UTC] RC5-72: Benchmark for core #10 (CUDA 1-pipe)
0.00:00:16.26 [113,784,744 keys/sec]

Nitrousine
03-18-2007, 04:16 PM
good lord, that's a block every 40 seconds or so.

:eek:
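
The "40 seconds" estimate can be reproduced with a quick calculation, assuming that a "block" here means one RC5-72 work unit of 2^32 keys. That assumption is mine, but it matches the figure quoted. The rates are the two CUDA benchmark results posted so far in this thread.

```python
KEYS_PER_BLOCK = 2 ** 32  # assumed size of one RC5-72 work unit


def seconds_per_block(keys_per_sec):
    """Time to exhaust one block at a given key rate."""
    return KEYS_PER_BLOCK / keys_per_sec


first_run = seconds_per_block(84_343_980)    # initial CUDA core
optimized = seconds_per_block(113_784_744)   # Mar 17 benchmark

print(f"initial core:   {first_run:.1f} s per block")
print(f"optimized core: {optimized:.1f} s per block")
```

Under that assumption the optimized core needs a little under 38 seconds per block, down from about 51 seconds for the first version.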

the-mk
03-18-2007, 05:04 PM
I need more GPU power :D

Impressive number!

Is this thing able to do OGR-25 work?

KriZp
03-22-2007, 10:03 AM
How would this work on a 6600GT SLI?

em99010pepe
03-24-2007, 01:35 PM
Latest optimization


./dnetc -bench RC5-72 10

distributed.net client for Linux Copyright 1997-2006, distributed.net
Please visit http://www.distributed.net/ for up-to-date contest information.


dnetc v2.9012-497-CFR-06032022 for Linux (Linux 2.6.20).
Please provide the *entire* version descriptor when submitting bug reports.
The distributed.net bug report pages are at http://www.distributed.net/bugs/

[Mar 24 03:41:20 UTC] Automatic processor type detection did not
recognize the processor (tag: "6547:06F2")
[Mar 24 03:41:20 UTC] RC5-72: using core #10 (CUDA 1-pipe).
[Mar 24 03:41:39 UTC] RC5-72: Benchmark for core #10 (CUDA 1-pipe)
0.00:00:17.10 [124,925,397 keys/sec]

alpha
03-24-2007, 02:37 PM
Is this thing able to do OGR-25 work?

Apparently:



I agree that OGR is much more interesting but reading the documentation surrounding the core algorithm, the non-constant execution times may present a significant challenge WRT getting good performance out of a GPU implementation. But, I haven't looked too closely at the underlying code, so my comments may be irrelevant.


The work this guy has done so far is astounding so if anything like this can be done for OGR it would be great! The sooner we get OGR-25 finished with, the sooner we can start OGR-26.

Brucifer
03-25-2007, 02:25 PM
yep, it is really an eye-opener. He mentioned that he has been in contact with the distributed.net folks, so hopefully it may come to pass for the rc5 effort anyway. Those video cards aren't cheap, but the performance would be well worth it if it evolves into an official client.

Guilherme
03-25-2007, 03:05 PM
The work this guy has done so far is astounding so if anything like this can be done for OGR it would be great! The sooner we get OGR-25 finished with, the sooner we can start OGR-26.

"Future Projects
RSA Prime Factoring:

The inability to quickly factor large composite numbers into its prime factors is one of the underlying assumptions of many cryptographic systems. RSA Labs is sponsoring a series of challenges to factor successively larger numbers, each with an increasing prize amount."

http://www.distributed.net/projects.php

em99010pepe
04-08-2007, 08:08 AM
Latest optimization


Just today, I optimized the result calculation and I am now seeing ~ 144 Mkeys/sec on my 8800 GTX.

Death
05-21-2007, 07:32 AM
too bad that rc5 seems to be closing

the-mk
05-21-2007, 01:41 PM
Why they are closing:
http://n0cgi.distributed.net/cgi/dnet-finger.cgi?user=bovine