DF and SMP



pelligrini
03-29-2002, 03:45 PM
What I've noticed is that the production rate of the client running multiple instances on a multi-CPU system is not as good as one client on a single-processor box. The production falloff gets much worse as the number of processors goes up.

I have three P3-1000s: one by itself and two on a VP6. Production per CPU (one client on each chip) was about 30% less on the dual setup, so across the pair that's roughly 60% of one chip's output lost by going dual instead of single. There are many variables in the comparison: one's Win98, the other's Win2k; the single CPU is on a 115MHz FSB, the dual is at 145; the dual has 256MB of RAM and the other 384.

After about 12 hours:
Single - 55006
Dual - 38680, 39900

I did some more tests with 4 different dual P-Pro ALR boards.
All w/ 2x 512k-cache P-Pro 200 @ 233, 2x 64MB SIMMs, Win2k-Pro
After about 12hrs of crunching:
ALR1 - 10953 (1xDF & 1G@H)
ALR2 - 7103 & 7000 (2xDF)
ALR3 - 11800 (1xDF)
ALR4 - 8128 & 8213 (2xDF, w/ affinity set)

That's about a 20-30% reduction in production per SMP client.

pelligrini
03-29-2002, 03:49 PM
These are some posts I made earlier on other boards:

I'm not too sure about running it on an SMP machine. I've run into some strange and disappointing production numbers. Production on a hex and even a dual is <unusual>. Yesterday I noticed that my hexes are doing poorly. Over 16-18 hrs my hexes produced around 8k per client, my dual P-Pros did about 12-13k, and my single P-Pro boards did 26-28k. The real <frustration> is that the regular P-Pros are running at 233MHz and 3 of the hexes have 333MHz PII Overdrive chips, yet the 233s tripled their output? Even a P200 Overdrive did 11.6k. On one of the hexes, I stopped 2 clients so it would run just 4 instances; production wasn't that much better. One of my other hexes crashed overnight, as did a dual. I haven't had time to play around with the affinity and such.

After around 2 1/2 hours of crunching: (all P-Pros @ 233MHz)
3300 - Single P-Pro, Win98 (I don't have a single-CPU machine under W2K)
2200 - Dual P-Pro w/ One instance, W2k-Pro
2700 - Dual P-Pro w/ One instance & 1 G@H, W2k-Pro
1500ea - Dual P-Pro w/ Two instances, W2k-Pro
1700ea - Hex P-Pro w/ 3 DF and 3 G@H, W2k Adv-server

These three hexes are running PII Overdrives @ 333MHz:
2400ea - 2 instances & 3 G@H
2x1600 & 1800 - 3 instances & 3 G@H
2000ea - 3 instances of DF only

I think all my SMP machines are going to go back on G@H. Too many wasted cycles. At least the hexes are going back, until I can figure out <what> is going on with them.

<edited for language> :D

Darkness Productions
03-29-2002, 04:24 PM
Maybe the context switches are killing it. Or maybe it's because it keeps switching procs? Have you tried processor affinity yet? That might or might not help...
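
Something like this is what I mean by setting affinity -- a rough Win32 sketch, untested, with the mask bit and the wrapper approach chosen just for illustration. You'd start the client from a wrapper like this so it inherits the mask:

/* pin_cpu0.c -- pin this process (and anything it launches) to CPU 0.
   Illustrative sketch only, not anything from the DF client. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* one bit per CPU: 0x1 = CPU 0, 0x2 = CPU 1, 0x3 = both */
    if (!SetProcessAffinityMask(GetCurrentProcess(), 0x1)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    printf("Pinned to CPU 0; child processes inherit the mask.\n");
    /* launch the DF client from here (e.g. CreateProcess) so it inherits it */
    return 0;
}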

rayson_ho
03-30-2002, 11:38 PM
This is due to limited memory bandwidth...

Intel's SMP architecture is different from RS/6000, Sun Fire, or even AMD Athlon MP... Intel SMP sucks: all the processors share the same bus, so it is often limited by memory bus bandwidth. On the Athlon MP and those other architectures, each processor has a separate channel to memory, and thus the performance is much better. (That's why not that many people buy Intel SMP boxes!!)
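
If anyone wants to test the shared-bus theory, a crude probe along these lines should show it: run one copy, then two copies at once on a dual, and compare the per-copy rate. (Rough sketch only, not a real benchmark like STREAM; the array size is just a guess meant to blow past the caches.)

/* bw_probe.c -- crude memory-bandwidth probe, illustrative only */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (4 * 1024 * 1024)     /* 4M doubles per array = 32MB each */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    if (!a || !b)
        return 1;
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    clock_t t0 = clock();
    for (int pass = 0; pass < 10; pass++)          /* 10 passes */
        for (long i = 0; i < N; i++)
            a[i] = a[i] + 3.0 * b[i];              /* STREAM-style triad */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* traffic per element: read a + read b + write a = 24 bytes */
    printf("%.1f MB/s\n", 10.0 * N * 24.0 / secs / 1e6);
    free(a);
    free(b);
    return 0;
}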

Processor affinity also counts, but AFAIK Linux and WinNT/2K take care of that already.

Rayson

mnx
03-31-2002, 10:18 PM
I've got access to an IBM 2 x 933MHz P3 system with RDRAM. Should I run 1 or 2 instances of the client?

mnx

DATA
03-31-2002, 11:21 PM
Originally posted by mnx
I've got access to an IBM 2 x 933MHz P3 system with RDRAM. Should I run 1 or 2 instances of the client?

mnx

I run two instances on all my twin systems.

pelligrini
04-01-2002, 12:41 AM
Originally posted by rayson_ho
This is due to limited memory bandwidth...

Intel's SMP architecture is different from RS/6000, Sun Fire, or even AMD Athlon MP... Intel SMP sucks: all the processors share the same bus, so it is often limited by memory bus bandwidth. On the Athlon MP and those other architectures, each processor has a separate channel to memory, and thus the performance is much better. (That's why not that many people buy Intel SMP boxes!!)

Processor affinity also counts, but AFAIK Linux and WinNT/2K take care of that already.

Rayson
Sounds like a plausible explanation. I'm pretty sure that if that's the case, the problem still lies within the client and its use of memory bandwidth. I've been running the Genome@Home client for about a year now, and it works nicely in any SMP arrangement (one instance per CPU). I have not noticed any difference in production rates between a single-processor system, a dual, a quad, or a hex.

I've noticed the reduction on a VIA chipset as well, still with Intel CPUs. I've also got an ASUS dual with two XP-1800s; if I find the time I'll see whether there's a reduction on it too.

Scoofy12
04-01-2002, 09:22 AM
After playing a bit with the Linux client (dual P3/800 boxes), with only one client per box so far, it also appears to me that the DF client makes a LOT of system calls or something like that. Top reports the processor time as about 80% user and 20% system when running one client, and there seems to be more than one user thread, because sometimes this 80% is distributed between both processors (and sometimes not; the system usage follows the user usage back and forth between processors). This is compared to almost no system usage by F@H, which is the only other client I've tried. I haven't done any benchmarking or anything, but maybe I should soon.
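
if i do, i'll probably start with a quick hack like this -- getrusage() reports the same user-vs-system split that top shows, so it should confirm the 80/20 thing per process (rough sketch, untested):

/* cpu_split.c -- report this process's user vs. system CPU time,
   the same split top shows. Sketch only. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

int main(void)
{
    /* burn a little CPU so there's something to measure */
    volatile double x = 0.0;
    for (long i = 0; i < 50000000L; i++)
        x += i * 0.5;

    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        printf("user:   %ld.%06ld s\n",
               (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
        printf("system: %ld.%06ld s\n",
               (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    }
    return 0;
}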

Brian the Fist
04-01-2002, 11:15 AM
Just for the record, the distributed folding client makes NO system calls (to my knowledge, anyway :p )
As for running it on dual-CPU machines, we do it all the time, but we haven't bothered to check whether we get 200% out of it or less, so I'll take your word for it on that one.

Scoofy12
04-01-2002, 02:54 PM
Hrm... yeah, it didn't really make sense to me either, for something ported to so many platforms... Does anyone know what the "system" CPU usage in top reports? It might be some kind of overhead to maintain the process; maybe it does a lot of CPU jumping. Maybe I'll fire it up on all the #2 processors in the lab and do some benchmarking :)

rayson_ho
04-02-2002, 12:00 AM
On Linux, we can monitor which system calls the DF client makes with "strace", and on Solaris we can use "truss".

There are many system calls issued by the client; here is strace output from my machine:

open("foldtrajlite.lock", O_RDONLY|O_LARGEFILE) = 3
close(3) = 0
open("foldtrajlite.lock", O_RDONLY|O_LARGEFILE) = 3
close(3) = 0

The client checks the state of the lock file very often -- I believe it makes more than 10 syscalls every second!!
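
A loop along these lines would produce exactly that open/close pattern. This is only my guess at what foldtrajlite does internally, not its actual source:

/* lockpoll.c -- guess at the polling that generates the trace above */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        /* each fopen/fclose pair shows up as one open() + one close() */
        FILE *fp = fopen("foldtrajlite.lock", "r");
        if (fp == NULL)
            break;          /* lock file gone -> user asked us to stop? */
        fclose(fp);

        /* ... a slice of folding work would go here ... */
        usleep(100000);     /* ~10 checks/sec, matching the estimate */
    }
    return 0;
}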

Rayson

bwkaz
04-02-2002, 07:46 AM
Do you allocate (malloc()) memory, ever? Do you then free() it?

Those, too, are system calls.

Brian the Fist
04-02-2002, 09:26 AM
If you call those system calls, then yes, of course we malloc, and of course we fopen, fclose, etc. I was referring to actual system() calls (i.e., see UNIX 'man system') when I said we don't use those.
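
To spell out the distinction (an illustrative snippet, NOT our actual code):

/* not foldtrajlite source -- just illustrating the terminology */
#include <fcntl.h>      /* open(): 'man 2 open', a true kernel call  */
#include <stdlib.h>     /* malloc(), free(), system(): 'man 3', libc */
#include <unistd.h>     /* close()                                   */

int main(void)
{
    int fd = open("foldtrajlite.lock", O_RDONLY);   /* kernel call */
    if (fd != -1)
        close(fd);                                  /* kernel call */

    char *p = malloc(1024);     /* libc; only touches the kernel   */
    free(p);                    /* (brk/mmap) when the heap grows  */

    system("echo hi");          /* THIS is what we don't use: it   */
    return 0;                   /* forks a whole shell per call    */
}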

rayson_ho
04-02-2002, 12:00 PM
Can we decrease the frequency of checking the lock file?? That could translate into better throughput...

Thanks,
Rayson

ulv
04-02-2002, 04:01 PM
Tyan 2466 MPX - 2 x XP1800 (1.53GHz): 165000 on each.
Abit KG7 - Tbird 1.4GHz: 206000.
Same memory (512MB, same brand), same video card, started within 1 minute.

pelligrini
04-02-2002, 07:03 PM
Thanks ulv, I thought I was going to have to build a single AMD XP-1800 to do a little testing on. :rolleyes: :D (Maybe I'll just have to see for myself ;) )

That's about a 25% decrease in production per instance, right in line with my Intels. Would you mind doing another test running just one client on the dual?

guru
04-02-2002, 07:39 PM
I've noticed that I get 2x the results on my dual-processor Sun Ultra 60 systems. I'm getting 6x on the E4500 with 6 processors.

I think the limit on the Intel SMP systems is the shared bus and the small cache. My Sun systems have anywhere from 2MB to 8MB of cache per processor.

guru

rayson_ho
04-02-2002, 10:19 PM
The server machines (Sun, IBM, etc.) scale better (their memory bandwidth scales linearly with processor count), and Solaris scales better too.

I remember the Linux kernel has (or used to have) a big lock around the filesystem code, but I am not 100% sure. And since the DF client checks the lock file so often, running DF on an SMP machine does not scale linearly.

A better way to check if the user wants to stop the client is via signals. This way, we don't need to poll the OS for the lock file status.
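
Something like this is what I have in mind -- a minimal sketch of the signal approach, with made-up names. The real client would also need to checkpoint its work and handle Windows differently:

/* sigstop.c -- stop on a signal instead of polling a lock file */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t stop_requested = 0;

static void handle_stop(int sig)
{
    (void)sig;
    stop_requested = 1;     /* only set a flag inside the handler */
}

int main(void)
{
    signal(SIGTERM, handle_stop);   /* 'kill <pid>' asks us to stop */
    signal(SIGINT,  handle_stop);   /* so does Ctrl-C               */

    while (!stop_requested) {
        sleep(1);                   /* stand-in for a slice of folding */
    }
    puts("stop requested -- exit cleanly here");
    return 0;
}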

Brian, can you compare the difference (between checking and not checking on a uni-CPU machine)??

Thanks,
Rayson

ulv
04-03-2002, 02:22 AM
Pelligrini: I don't have much time today, so a quick one. I killed one client on the Tyan dual XP1800. By the time the KG7 Tbird 1.4 had run 10000, the Tyan had completed 9600.........? Don't know why it did less than the Tbird, but it made me reboot and check the BIOS; everything was OK, so I started up again with two clients. I can do some better testing tomorrow.