
Thread: Performance hit w/ multiple clients on Linux SMP

  1. #1
    Junior Member | Join Date: Aug 2003 | Location: Concord, CA | Posts: 15

    Performance hit w/ multiple clients on Linux SMP

    I don't know if anyone else has noticed a degradation in performance when running multiple Linux clients. I noticed this first when running on a dual AthlonMP system. If I started the SB client in one directory I would get a rate of 1 block per 13 minutes. If I then started another client in a separate directory, the rates of both clients dropped to 1 block per 16 minutes.

    The situation is far worse on a dual Xeon machine where rates go from 1 block per 7.3 minutes to 1 block per 13.5 minutes!!! This kills any benefit of having an SMP machine. Just thought I would bring up the issue because it doesn't seem like many folks are using the Linux client.
    Last edited by MereMortal; 08-12-2003 at 12:43 PM.

  2. #2
    How much RAM do you have in those things? Each CPU has its own swap file... and so on...

  3. #3
    I've never tried it so the following is purely theoretical:

    As far as I can figure out, the bottleneck in all of these big-FFT clients (GIMPS, S@H, SoB, etc.) is memory: the working set doesn't fit into the CPU cache, so the process spends much of its time waiting for data to come in from main memory. In an SMP machine both processors share the memory bus, so the problem gets worse -- each processor can access memory only while the other one isn't.

    If I had an SMP machine I'd run a few trial runs with some client that is CPU-limited rather than memory-limited (say the ECC2 client), see what that gives me, and see whether it interacts as badly with another instance of itself or with a SoB process (and if not, I'd simply run those two concurrently).
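
    Something like the little test program below might settle it (purely hypothetical code I just sketched, not from any client; the array size and pass count are guesses). Build it, run one copy and note the MB/s figure, then run two copies at once and see whether the per-process figure drops the way the SB block rates do.

    Code:
    /* bw_test.c -- rough memory-bandwidth probe (hypothetical, not part of any
     * client).  Build: gcc -O2 -o bw_test bw_test.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    #define N      (16 * 1024 * 1024)   /* 16M doubles = 128 MB, far bigger than any cache */
    #define PASSES 20

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        struct timeval t0, t1;
        double secs, mb;
        long i;
        int p;

        if (!a || !b) { perror("malloc"); return 1; }
        memset(a, 0, N * sizeof(double));
        memset(b, 0, N * sizeof(double));

        gettimeofday(&t0, NULL);
        for (p = 0; p < PASSES; p++)
            for (i = 0; i < N; i++)
                a[i] = a[i] + 1.5 * b[i];   /* streaming reads and writes, vaguely FFT-like */
        gettimeofday(&t1, NULL);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        mb = (double)PASSES * N * 24.0 / (1024.0 * 1024.0);  /* read a, read b, write a = 24 bytes/element */
        printf("%.0f MB in %.2f s  =>  %.0f MB/s\n", mb, secs, mb / secs);
        free(a); free(b);
        return 0;
    }

    If the per-process MB/s roughly halves with two copies running, the bus is saturated and no amount of extra RAM will help.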

    But of course, who can afford an SMP system ...

  4. #4
    Junior Member | Join Date: Aug 2003 | Location: Concord, CA | Posts: 15
    The Xeon systems have 2 GB each, and the Athlon systems have 4 GB each, so I don't think that the amount of RAM is a problem.

    I'm not familiar with the algorithm used for the FFTs here, so I don't know what array sizes to expect. It is an odd problem, because I don't think it exists when running dual Windows clients.

  5. #5
    Next question...which of the myriad of versions of Linux are you using? Seems a shame to have so much power and not be running BSD on those badboys

  6. #6
    Junior Member | Join Date: Aug 2003 | Location: Concord, CA | Posts: 15
    Well, I may be at the birthplace of BSD, but we're running RedHat.

    Athlons: RH 7.3, 9.0; SB v1.02
    Xeons: RH 7.3; SB v1.00

  7. #7
    Hmm...wonder with that much memory...if you could have SB running on one CPU and P-1 factoring on the other...and pull some benchmarks...because P-1 factoring is such a memory hog it could truly point out if it's a memory issue.
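
    Something like the helper below would do the pinning on Linux, by the way (just a sketch, not part of any client; it assumes a kernel and glibc new enough to provide sched_setaffinity, which a stock RedHat 7.3 install may not have). You'd start the SB client and the P-1 run, look up their PIDs, and pin one to CPU 0 and the other to CPU 1.

    Code:
    /* pin.c -- pin an already-running process to one CPU (hypothetical helper).
     * Build: gcc -O2 -o pin pin.c     Usage: ./pin <pid> <cpu> */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/types.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        pid_t pid;
        int cpu;
        cpu_set_t mask;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <pid> <cpu>\n", argv[0]);
            return 1;
        }
        pid = (pid_t)atoi(argv[1]);
        cpu = atoi(argv[2]);

        CPU_ZERO(&mask);            /* empty CPU set ...           */
        CPU_SET(cpu, &mask);        /* ... containing just one CPU */
        if (sched_setaffinity(pid, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pid %d pinned to cpu %d\n", (int)pid, cpu);
        return 0;
    }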

  8. #8
    Junior Member | Join Date: Aug 2003 | Location: Concord, CA | Posts: 15
    I don't currently have the time to do that sort of benchmarking (especially because I know nothing about the P-1 factoring effort).

    I did run a quick test with a WinXP SMP system, and the same type of effect shows up. Running the service w/ 1 cpu nets 276 kcEM/s, but when I run two cpus, the rate drops to 218 kcEM/s (adding about 4 minutes to each block). This system had 768 MB of RAM, using SB v1.10.

    It is disappointing, to say the least...

  9. #9
    Have you tried the smp-client? It isn't listed on the download-page, but can be found at:

    http://www-personal.engin.umich.edu/~lhelm/sb-smp.exe

    I guess it's windows only, but it might reveal something.

  10. #10
    I doubt it is the amount of memory that is the problem. My guess is that the memory bandwidth (the FSB) is the bottleneck. I'm not sure about the newer Xeons, but I know that the Athlons don't have dedicated memory bandwidth for each CPU.

    On my Dual Athlon 1800+ I get 261KcEM/sec with one process and 218KcEM/sec each for two processes. That is using version 1.1.0 and running the client as a service.

  11. #11
    Originally posted by MereMortal
    The Xeon systems have 2 GB each, and the Athlon systems have 4 GB each, so I don't think that the amount of RAM is a problem.
    Oh, but the amount of RAM is entirely beside the point: the bottleneck is between the processor(s) and the RAM, not between the RAM and anything else. Only so much code and data fits into the cache; the processor can work on that at native speed, but once it's exhausted you have to pull data in from RAM at the (possibly MUCH lower) speed at which your memory subsystem operates.

    On my Athlon Thunderbird 1.2, for example, the core runs at 1.2 GHz, but that is 9x133, i.e. it takes (more than) nine times as long to get a particular byte INTO the processor as to do any one thing to it (this is all oversimplified, of course). In general you only get the core speed of your processor on data sets small enough to fit into the cache, and since these FFTs operate on large chunks of data (16 MB in the case of Seti, dunno about SoB), the processor spends much of its time twiddling its thumbs waiting for memory, so to speak.
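
    To put a rough number on that, a toy program like the one below (again purely illustrative, not from any client) reads the same number of doubles twice: once cycling through a working set small enough to stay in cache, once through one that has to keep coming back from RAM. The gap between the two times is the thumb-twiddling I mean.

    Code:
    /* cache_vs_ram.c -- crude illustration of the cache/FSB gap (hypothetical
     * test code).  Build: gcc -O2 -o cache_vs_ram cache_vs_ram.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    /* Read 'total' doubles in all, cycling through a working set of 'n' of them. */
    static double sweep(double *buf, long n, long total)
    {
        double s = 0.0;
        long i, done;
        for (done = 0; done < total; done += n)
            for (i = 0; i < n; i++)
                s += buf[i];
        return s;
    }

    static double seconds(struct timeval a, struct timeval b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
    }

    int main(void)
    {
        long small = 8L * 1024;             /* 64 KB working set: fits in L2 cache */
        long big   = 8L * 1024 * 1024;      /* 64 MB working set: comes from RAM   */
        long total = 512L * 1024 * 1024;    /* same number of reads in both cases  */
        double *buf = calloc(big, sizeof(double));
        struct timeval t0, t1;
        double s1, s2;

        if (!buf) { perror("calloc"); return 1; }

        gettimeofday(&t0, NULL); s1 = sweep(buf, small, total); gettimeofday(&t1, NULL);
        printf("in-cache sweep:     %.2f s (sum %g)\n", seconds(t0, t1), s1);

        gettimeofday(&t0, NULL); s2 = sweep(buf, big, total); gettimeofday(&t1, NULL);
        printf("out-of-cache sweep: %.2f s (sum %g)\n", seconds(t0, t1), s2);

        free(buf);
        return 0;
    }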

    That's why I suggested trying one instance of ECC2 and one instance of SoB in parallel: the ECC2 client is all of 86k or so and can keep one CPU busy without putting much additional strain on the memory bus (there's always going to be a little loss in an SMP system due to overhead, but this kind of approach would minimize that loss IF my understanding of the problem is right).

    That's also the reason why all these projects take to RDRAM so well: the faster you can shovel data into and out of the processor, the faster the overall computation gets done, because the computational core itself isn't the limiting factor. Compare that with, say, ECC2, where memory speed becomes secondary to sheer number-crunching capability (which is where the Athlons do better).

  12. #12
    Originally posted by MereMortal
    I don't currently have the time to do that sort of benchmarking (especially because I know nothing about the P-1 factoring effort).

    I did run a quick test with a WinXP SMP system, and the same type of effect shows up. Running the service w/ 1 cpu nets 276 kcEM/s, but when I run two cpus, the rate drops to 218 kcEM/s (adding about 4 minutes to each block). This system had 768 MB of RAM, using SB v1.10.

    It is disappointing, to say the least...
    Are you saying that the *total* rate went down with 2 CPUs running? That's not what I see at all on mine (dual 2.2 GHz Xeon, hyperthreaded, running XP).

    I see a drop in the per processor rate when adding a second instance (presumably due to chipset contention as Lagardo mentioned above), but the total is still a little less than twice the single processor rate. In fact, I still get a slight overall improvement when adding a third instance (because hyperthreading is enabled).

    My benchmarks were as follows (all of this was done quite a while ago when I was writing the service handler -- the numbers would be higher now due to the drift of the cEM/sec measurement).

    One client: 250 cEM/sec
    Two clients: 220 cEM/sec each [total of 440]
    Three clients: first two run 180 cEM/sec and third runs 120 cEM/sec [total of 480]

    In order to get these results the CPU affinity has to be set appropriately on all the clients (at least in the Win XP case), but the service handler should do that by default.
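
    For the curious, "setting the affinity" amounts to something like the little helper below (a hypothetical standalone sketch using the plain Win32 calls, NOT the actual service handler code). You pass it the client's process ID and the CPU you want it stuck to.

    Code:
    /* affinity.c -- pin an existing process to one CPU on Windows (hypothetical
     * sketch).  Build with a Win32 compiler, e.g. MinGW: gcc -o affinity affinity.c
     * Usage: affinity <pid> <cpu> */
    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        DWORD pid;
        int cpu;
        HANDLE h;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <pid> <cpu>\n", argv[0]);
            return 1;
        }
        pid = (DWORD)atoi(argv[1]);
        cpu = atoi(argv[2]);

        h = OpenProcess(PROCESS_SET_INFORMATION, FALSE, pid);
        if (h == NULL) {
            fprintf(stderr, "OpenProcess failed (error %lu)\n", (unsigned long)GetLastError());
            return 1;
        }
        if (!SetProcessAffinityMask(h, (DWORD_PTR)1 << cpu)) {   /* one bit per CPU */
            fprintf(stderr, "SetProcessAffinityMask failed (error %lu)\n", (unsigned long)GetLastError());
            CloseHandle(h);
            return 1;
        }
        printf("pid %lu pinned to cpu %d\n", (unsigned long)pid, cpu);
        CloseHandle(h);
        return 0;
    }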

    Edit: I just remembered something that may be affecting what you're seeing. XP (at least in certain configurations) is totally useless for benchmarking unless you let things settle down for a ridiculously long time and then restart the clients. If you set the service handler to restart at 2:00 a.m. or so, start the machine during the day, and check the speed the *next* day, you get reliable numbers.

    I'm currently working on a scheme to normalize this automatically on startup. I think I've got it worked out; for now I do it by hand on my SMP XP machine after a reboot (it takes about 30 minutes of elapsed time to get it normalized). I had just gotten so used to doing this that I forgot about the issue completely!

    I've got one machine running XP that this affects and one that it doesn't affect. However, the big guy (dual Xeon, 2G RAM) *is* affected. If I do nothing, right after a restart it runs with a total of about 350 cEM/sec whereas it's clicking along at about 540 cEM/sec as I type this...

    Last edited by MathGuy; 08-16-2003 at 10:48 AM.

  13. #13
    Junior Member | Join Date: Aug 2003 | Location: Concord, CA | Posts: 15
    Ok, some clarifications.

    The *total* rate goes up (if that didn't happen by this point in the project, I'd be worried). I only have one Windows dualie, but I have never noticed the "rates" (cEM/sec) go up over time. I put rates in quotes because that is a useless benchmark to me.

    I understand wall-clock/block benchmarks much better. And of course I realize that running two processes optimized for serial efficiency won't achieve perfectly parallel performance. I brought the topic up because in the case of the dual Xeons, the "parallel" performance is poor. I suppose I live in a fantasy world where the DC movement has progressed far enough to see that there is something to gain by exploiting and optimizing clients for the shared-memory architecture. Contrary to what Lagardo thinks, there are plenty of dualies out there.

    Anyway, as a rough performance meter, I've made up an SMP rating for my machines that measures the relative time to complete two tests. The results are normalized so that the time for a single instance of the client to complete one test = 1. The rating then ranges from 1 (perfectly parallel: both tests complete in the time it takes a single client to complete one test) to 2 (perfectly serial: the time it takes a single client to complete two tests back to back). Yes, I'm bored.

    Athlon MP 1900+: 1.23
    Xeon 2.4, 533: 1.84

    What this boils down to is that on the Athlon system, you save 2.7 days per two tests by running the clients concurrently rather than running a single instance sequentially. On the Xeon system, you save a measly 8 hours (the time to complete a single test is halved, however). Since this seems to be a memory-bandwidth issue, there is little/no advantage to having a dualie with fast CPUs as opposed to slower CPUs (within reason). Oh, well.
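
    If anyone wants to play with the numbers, the arithmetic behind those savings is just this (the single-test times in the little table are illustrative values roughly consistent with the savings quoted above, not measured figures):

    Code:
    /* smp_rating.c -- the "SMP rating" arithmetic from the post above.
     * T = days for ONE test with a single client; R = (time for two
     * concurrent tests) / T, so 1 <= R <= 2.  Build: gcc -o smp_rating smp_rating.c */
    #include <stdio.h>

    int main(void)
    {
        struct { const char *name; double T_days; double R; } sys[] = {
            { "Athlon MP 1900+", 3.5, 1.23 },   /* T_days values are illustrative */
            { "Xeon 2.4, 533",   2.1, 1.84 },
        };
        int i;

        for (i = 0; i < 2; i++) {
            double concurrent = sys[i].R * sys[i].T_days;   /* 2 tests, 2 clients at once    */
            double sequential = 2.0 * sys[i].T_days;        /* 2 tests, 1 client, back to back */
            printf("%-16s  concurrent %.1f d, sequential %.1f d, saved %.1f d per 2 tests\n",
                   sys[i].name, concurrent, sequential, sequential - concurrent);
        }
        return 0;
    }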

    All right, I'll go back into my hole now and let all of you fine people get back to crunching.
