DF on a HT CPU? [Archive]

Platinum [JSI]

04-14-2003, 05:41 PM

I've been running DF on my 3.06 for a while now, and just recently noticed that in Task Manager it was only using around 68% CPU time, with CPU 1 under full load and CPU 2 under average load. I then loaded up a second instance of Distributed Folding and, sure enough, the CPU usage went up to 100% and both CPUs showed full load. Both clients still seemed to be speeding along. I was under the impression that Hyperthreading didn't give that much more raw CPU power, more that it optimised the use of it?

Platinum [JSI]

04-14-2003, 05:45 PM

Also, with HT enabled I'm getting slow benchmark scores: 1.08 structures per second, whereas my AMD rig at 1.7GHz gets 1.17 structures per second :confused:. The CPU is clocked at 3.84GHz, so shouldn't it be a lot faster?

m0ti

04-14-2003, 06:09 PM

I've tried out DF on an HT machine (dual Xeon 1.8GHz) and didn't see much of an improvement at all... I think the bus was already saturated from 2 running processes, let alone 4.

Platinum [JSI]

04-14-2003, 06:15 PM

Any Idea why my 3.84GHz PIV is no faster than my slower AMD machines?

IronBits

04-14-2003, 06:26 PM

You are using the -rt switch on it, right?

Grumpy

04-14-2003, 06:37 PM

AMD CPUs are faster than P4s for Folding, it is that simple. My XP1800 outfolds a 2.4 P4. A Duron 1300 gives a 2 GHz P4 a nasty surprise... How fast is the AMD that is faster than the 3.84 P4? That sounds a bit funny. Time to spank the computer... works for me :haddock:

IronBits

04-14-2003, 06:55 PM

:|ot|:
Platinum [JSI]
Change your Setiqueue -- I'm shutting mine down very shortly (dBestern.com) .

Welnic

04-14-2003, 07:33 PM

Originally posted by Platinum [JSI]
Also, with HT enabled I'm getting slow benchmark scores: 1.08 structures per second, whereas my AMD rig at 1.7GHz gets 1.17 structures per second :confused:. The CPU is clocked at 3.84GHz, so shouldn't it be a lot faster?

Well, the P4 is running 2 clients so its total output is 2.16 folds per second. That sounds about right to me for how much faster per cycle the Athlon is compared to the P4.
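
For anyone who wants the arithmetic spelled out, here is a rough sketch in Python. It assumes, as the post above does, that the 1.08 figure is per client and that both HT clients run at the same rate; the variable names are only for illustration.

```python
# Figures quoted in the thread. Assumption: both HT clients run at
# the same 1.08 structures/s, so the total is simply doubled.
p4_clients = 2
p4_rate_per_client = 1.08   # structures/s per client on the P4
p4_clock_ghz = 3.84
athlon_rate = 1.17          # structures/s on the AMD rig
athlon_clock_ghz = 1.7

p4_total = p4_clients * p4_rate_per_client
p4_per_ghz = p4_total / p4_clock_ghz
athlon_per_ghz = athlon_rate / athlon_clock_ghz
advantage = athlon_per_ghz / p4_per_ghz

print(f"P4 total:       {p4_total:.2f} structures/s")
print(f"P4 per GHz:     {p4_per_ghz:.3f}")
print(f"Athlon per GHz: {athlon_per_ghz:.3f}")
print(f"Athlon advantage per clock: {advantage:.2f}x")
```

On these numbers the Athlon does roughly a fifth more work per clock, while the P4 still comes out ahead in total output because of its much higher clock.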

Platinum

04-15-2003, 08:33 AM

Originally posted by IronBits
:|ot|:
Platinum [JSI]
Change your Setiqueue -- I'm shutting mine down very shortly (dBestern.com) .

Cheers

Angus

04-15-2003, 10:35 AM

I've been running DF (both 'real' and beta) on a farm of dual Xeon 2.8s with HT - Platinum's observations seem to concur with what I see.

Running a second process on the HT CPUs gives about 1.7 times the output, not fully double.

The beta is really bad for benchmarking, since there are some pretty wild variations in how each protein folds. Some take days longer to get to the 250 generations, and the folds per second vary greatly from generation to generation.

So, the Intels may be a little slower than the AMDs, but they sure are stable :)

Grumpy

04-15-2003, 11:22 AM

If you are implying some correlation between Intel & Stable, or AMD & Not Stable, I have but 1 reply.....

:moon:

bwkaz

04-15-2003, 12:30 PM

Originally posted by Angus
Running a second process on the HT CPUs gives about 1.7 times the output, not fully double.

And this concurs with my understanding of how HT actually works (rather than the marketing hype). There are not, in fact, actually two independent fully functional CPUs in the HT core. There are two CPUs, sort of, but they have interdependencies (in the instruction rewriting stage, and probably in others) that make it more like you have two slightly dependent CPUs in one package.

There are also, I believe, issues with floating point and HT -- you don't get the full benefit that you would get with 2 CPUs if both clients are using floating point. Perhaps the FP unit is shared? I don't remember for sure.

But yeah. About 1.7x the output makes perfect sense when you consider that you don't, in fact, have 2 CPUs in one package.

Scoofy12

04-15-2003, 07:31 PM

it's not just the FP units that are shared... all the execution units are shared. that's why it's still only one processor even though the OS thinks it's two. haven't heard about the FP thing though, but DF doesn't use much (any?) in the way of FP anyway.

Grumpy

04-15-2003, 08:02 PM

I wonder if there will be a big difference between Dual Channel and Single Channel on HT enabled systems with 2 Clients.

Angus

04-16-2003, 01:32 AM

Right, it's just one CPU unit.

I just read an article by some Intel guy explaining in English how it works. I'll try to find the link again.

Basically, the CPU core speeds are so much higher than the memory bus speeds that the CPU ends up waiting for data, so the processor switches to its other virtual CPU, which should already have data waiting, and so on...

Standard disclaimer:
Not my explanation, and I don't work at Intel.
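
A toy model of that explanation, in Python, purely as a sketch. It assumes each thread spends a fixed fraction of its time stalled on memory, that the second thread fills those stalls perfectly, and that the two threads never get in each other's way. None of that is how a real P4 behaves in detail, and the function names are made up for this example.

```python
def ht_speedup(stall_fraction: float) -> float:
    """Throughput gain from a second thread in this toy model.

    stall_fraction is the share of time one thread spends waiting
    on memory (must be below 1.0). Contention between the threads
    is ignored, which is a deliberate simplification.
    """
    busy = 1.0 - stall_fraction
    one_thread = busy                   # core utilisation with one thread
    two_threads = min(1.0, 2.0 * busy)  # cannot exceed a fully busy core
    return two_threads / one_thread


def stall_for_speedup(speedup: float) -> float:
    """Stall fraction that would give the observed speedup (1.0 to 2.0)."""
    return 1.0 - 1.0 / speedup


observed = 1.7  # the figure reported earlier in the thread
print(f"Implied stall fraction: {stall_for_speedup(observed):.2f}")
print(f"Model speedup at that stall: {ht_speedup(stall_for_speedup(observed)):.2f}")
```

Under these assumptions a 1.7x gain would mean a single client leaves the core idle about 41% of the time. Treat that as a back-of-envelope figure only, since shared caches and execution units would pull the real number around.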

DViD

04-20-2003, 05:52 PM

Originally posted by Angus
I've been running DF (both 'real' and beta) on a farm of dual Xeon 2.8s with HT - Platinum's observations seem to concur with what I see.

Running a second process on the HT CPUs gives about 1.7 times the output, not fully double.

The beta is really bad for benchmarking, since there are some pretty wild variations in how each protein folds. Some take days longer to get to the 250 generations, and the folds per second vary greatly from generation to generation.

So, the Intels may be a little slower than the AMDs, but they sure are stable :)

Sorry, please tell me the date when you last used AMD config... K5 times? Or never? :rolleyes: :sleepy:

robi2106

04-23-2003, 08:51 PM

Another possible reason:

Intel systems have a much deeper instruction queue (many instructions lined up), so that if a branch does not go as predicted, the contents of the instruction queue must be flushed and new entries loaded, based on the actual result of the branch calculation.

AMD systems have a shallow but wide queue (many instructions in parallel), so the penalty for an incorrect branch prediction is fewer queue-reloading operations than in a P4 system.

As usual IANACoE (I am not a computer engineer). I just read a lot of white papers.

robi

Scoofy12

04-23-2003, 11:47 PM

well, IAACoE (well, minus 12 credit hours, but they are all electives :) )

<lecture>
and yeah, that's close... by "instruction queue" you mean "pipeline", in which the instructions are not so much lined up waiting as actually in various stages of execution. but what you are saying is basically correct. a P4 system executes instructions in 20 very small steps. these steps are short, so you can do them very fast (hence the high clock rates), but it still takes 20 stages, which, as you said, exacts a high toll in the event of a branch misprediction (the pipeline now has a bunch of instructions in it that it shouldn't be executing, so now it has to flush them and start over). also you really need a fast FSB to keep this pipeline full, because "bubbles" in the pipeline (spaces where you didn't have an instruction to issue) hang around a lot longer.

the Athlon XP pipeline is about 10 stages (i think), so each step does more work and takes longer (hence the slower clock speeds but more instructions per clock), but if you guess wrong on a branch it's not so catastrophic.

also, as to executing many instructions in parallel, both systems do this as much as they can, having multiple arithmetic units, floating point units, etc. and trying to keep them all busy (this is called superscalar architecture).
in fact, the whole point of HT is to try to keep these multiple execution units more busy by feeding them 2 threads at once. essentially the whole pipeline before the execution units is duplicated and now the scheduler has 2 threads to use to keep those execution units busy. essentially the parallelism is moved from the instruction level to the thread level (well, it's still on the instruction level too). in theory anyway. it works well in some applications, and not in others.
</lecture>
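
To put rough numbers on the branch-misprediction point from the lecture, here is a small Python sketch using the usual textbook approximation: average cycles per instruction equals a base figure plus (fraction of instructions that are branches) x (misprediction rate) x (cycles lost per flush). The flush cost is taken as roughly the pipeline depth, which is a simplification, and the base CPI, branch fraction and misprediction rate are illustrative assumptions rather than measurements of either chip.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Average cycles per instruction including branch-flush cost."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


# Illustrative assumptions only, not measured values.
BASE_CPI = 1.0          # cycles per instruction with perfect prediction
BRANCH_FRACTION = 0.20  # one instruction in five is a branch
MISPREDICT_RATE = 0.05  # 5% of branches are guessed wrong

# Flush penalty approximated by pipeline depth: 20 stages vs about 10.
deep = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 20)
shallow = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 10)

print(f"20-stage pipeline: {deep:.2f} cycles per instruction")
print(f"10-stage pipeline: {shallow:.2f} cycles per instruction")
```

With the same prediction accuracy, the deeper pipeline pays twice the extra cost per instruction for wrong guesses, which is the trade-off described above: it has to make that back with clock speed.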