Check Your Boxen FDC



Bullet2urBrain
04-08-2006, 01:49 PM
Check your boxen for WUs from the 4.97 version. There is almost a 100% failure rate with some of these.


Check out XS's full topic here:

http://www.xtremesystems.org/forums/showthread.php?t=95458

and the R@H Coverage

http://boinc.bakerlab.org/rosetta/forum_thread.php?id=1106



cheers.
XS Bullet2urbrain

Bok
04-08-2006, 02:02 PM
Yeah, I noticed it too. What a PIA. I was just letting a number of the machines finish up their stash of Malaria Control and QMC so I could switch back to Rosetta as well. The Linux boxes are OK as they aren't on 4.97. Think I'll switch them all back to QMC until this is resolved.

Bok

PCZ
04-08-2006, 02:45 PM
Well, I have just wasted a day trying to get my new dualies running Rosetta WUs without errors, only to discover that the WUs are BAD :swear:

Chuck
04-08-2006, 02:55 PM
I am getting the bad WUs as well... TONS of them!!!

I have had them run anywhere from 30-40 seconds up to 60 minutes before failure.

All boxen, regardless of OS, are failing, singles and dualies alike.

C.

Paladin*
04-08-2006, 04:44 PM
:confused: I'm running the v4.97 Ralph WUs without any errors so far. I wonder what the difference could be between the Ralph and Rosetta WUs in the same version. Luckily all my Rosetta WUs are still v4.83 ... :banana:

But one thing I notice is that the Ralph WUs I'm running are BARCODE_30s, and most of the complaints in the Rosetta forum are about the HBLR WUs ... I have some HBLR Ralph WUs too but won't get to them for a few days yet. I'll have to keep an eye on them to see if they are a problem or not ...

IronBits
04-08-2006, 06:14 PM
Got it up and running finally, what a PITA!!! :bang:
I'll create another post in the Hardware Forum when it's time...

Anyways, doing DNET OGR to burn it in at stock speeds and until Rosetta problems go away.

Saw this in another thread elsewhere - not good !!!


Originally Posted by David Baker
I'm really sorry about these problems. I checked yesterday on RALPH and everything seemed fine, but there clearly is a problem. Unfortunately, I'm just leaving for a family weekend trip so can't figure things out right away. Please bear with us for a couple of days.

IronBits
04-08-2006, 06:23 PM
Originally Posted by PCZ
Well, I have just wasted a day trying to get my new dualies running Rosetta WUs without errors, only to discover that the WUs are BAD :swear:

Slap em on DPAD or something then. :thumbs:
Whilst we all wait a couple of days for Rosetta to get their shite together. :cry:

PCZ
04-08-2006, 06:26 PM
Actually, IB, I turned 'em off to save electricity :D

LAURENU2
04-08-2006, 07:41 PM
I just posted over at Rosetta and got this back

Moderator9
Forum moderator
Message 13288 - Posted 8 Apr 2006 23:11:41 UTC

I just got this message from David Kim who is currently addressing this problem.

"I just reverted back to the previous app. You should notice a version
4.98 now, which is really version 4.83 for windows and mac, and 4.82
for linux."

You all should see some relief very soon. If you force an update it should load the new version once the server is set up.

Chuck
04-09-2006, 12:13 AM
As of this moment... 2300 CDT 8-Apr-06...

I did a regrettably full reset of all machines, hence dumping the WUs.
I had too many fail within 20 minutes of completion.

Since the reset, the new exe downloaded as promised and the WU's are starting to run. I will advise if I see anything other than complete success.


C.

IronBits
04-09-2006, 12:19 AM
Watching the new Quad fight for bandwidth to download WUs, and having the Dimes clients running isn't helping, I'm sure...
But I got enough work to start running it. ;)
:banana:

Chuck
04-09-2006, 12:36 AM
I only have about 8 hrs of work / cpu at this point also.... I'm sure we're going to drain their WU generator and, as you cited, saturate their bandwidth.

I did shorten my Queue to 24 hours worth of work / machine. It seems to be helping.

IB, how is your quad behaving? Its natural tendency to swizzle isn't starving anything, is it? Let me know if there is anything I can do to help tune.

C.

Jeff
04-09-2006, 10:07 AM
Watching the new Quad...

Yummmm... :notworthy

IronBits
04-09-2006, 10:42 AM
Only had one bad WU error out on just one boxen so far :)
Just bumped the Quad to 2GHz and still kicking, no voltage bumps required yet :thumbs:

LAURENU2
04-09-2006, 12:37 PM
All seems better now, all the red lines are fading over the hills :thumbs:

IronBits
04-09-2006, 03:04 PM
Survived over here! :D
2.356 GHz running DNET OGR on the Quad, until I'm sure it's stable, then back to Rosetta.
30% OC :banana:
CPU1 reads 117F/48C and CPU2 reads 135F/57C (after two applications of two different thermal greases) :bang:

Chuck
04-09-2006, 05:39 PM
Originally Posted by IronBits
Survived over here! :D
2.356 GHz running DNET OGR on the Quad, until I'm sure it's stable, then back to Rosetta.
30% OC :banana:
CPU1 reads 117F/48C and CPU2 reads 135F/57C (after two applications of two different thermal greases) :bang:


That's a great temp for CPU1, but experience makes me concerned about CPU2. When you switched greases, did you do the usual purging and lapping? What compound are you using? AS-5 w/ aluminum or copper?

My machines typically run under 35C, but they aren't OC'd a full 30%.

This makes me wonder... do you think I can push mine up closer to 30%?

IronBits
04-09-2006, 08:08 PM
Well, it finally rebooted itself, so I've detuned it down to 2.2 for now.
I noticed the HS is not nearly as smooth as the one that came on CPU1, so it might need some lapping to see if that helps.
30% is a typical Opteron OC mark to shoot for as a minimum. ;)
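
For anyone wanting to sanity-check the percentages being thrown around, here is a rough sketch of the arithmetic in Python. The stock clock is never stated in the thread, so the value below is only what the quoted "2.356 GHz at 30% OC" implies; treat it as an assumption, and the function names are made up for illustration.

```python
def implied_stock_ghz(overclocked_ghz, oc_percent):
    """Back out the stock clock implied by an overclocked speed and an OC percentage."""
    return overclocked_ghz / (1 + oc_percent / 100)


def oc_percent(stock_ghz, overclocked_ghz):
    """Overclock expressed as a percentage above the stock clock."""
    return (overclocked_ghz / stock_ghz - 1) * 100


# Assumption: stock clock derived from the "2.356 GHz, 30% OC" figures quoted above.
stock = implied_stock_ghz(2.356, 30)

# The detuned 2.2 GHz setting, expressed against that implied stock clock.
detuned = oc_percent(stock, 2.2)

print(f"implied stock clock: {stock:.2f} GHz")
print(f"2.2 GHz is roughly {detuned:.0f}% over that")
```

Under that assumption, detuning to 2.2 lands a bit short of the 25% to 30% marks mentioned in this thread.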

Bok
04-09-2006, 08:12 PM
My Opty 170 typically runs at 57C all the time, overclocked to 2.6GHz right now. At 2.8GHz, where it was still stable, it was 64C, which is just a bit too hot for my liking...

Using AS5 and the stock cooler which comes with the Optys (and I believe the X2s above 3800+, at least it's the same one I got with my 4200+)

Bok

Chuck
04-10-2006, 03:53 AM
I realize this is off-topic, but I have a theory about the .130 vs .090 chips.

May I ask for you gents to send me a PM regarding

a: FSB
b: Core voltage
c: RAM voltage
d: Ram Timing assuming 3-3-3-8 (aka... std PC-3200)
e: Other latency timers, etc you tweaked.
f: Any memory tuning you changed (SPD setting vs what you run now)

I'm asking because I have a few chips that are unlocked and are 939 pin,
090 preproduction as well as some 065. Goal is to find the limits of the existing
chipset(s) as well as new DDR2 and quad-channel / split bus switch fabric.

I am going to use copper coolers (Gigabyte 3D-Max circular cooler) and
cool using air cooling. I am going to be testing a quad dualie first.
If that works, I am going to test linking the boards and running full 8xx series
CPUs using NUMA link as well as plain Reflective memory.

I'm going to do both SMP and quasi-ASMP (loose-SMP) testing.

When done, I will hopefully be able to share the results privately for future
use, less some proprietary details.


This ties into what we are doing in that, if successful, it will result in a smaller footprint for all of us.

Also, if anyone knows which client(s) support defining CPU affinity, I
would greatly appreciate hearing about it. I know Trux (?) used to support it under 4.x.


TIA,
C

/* edit: Spec goals are: 25% or better OC and maintain CPU at 40-45C under full load with standard ram.
All suggestions welcome... please remember I have the LDT and Numa fabric to contend with on an 8GB memory machine */

PY 222
04-13-2006, 02:53 PM
Guys, sorry for being out of the loop but is everything ok on Rosetta and if not, what should I be looking for?

I am going to bring the clients back up now so any advice would be helpful.

Bok
04-13-2006, 02:56 PM
It was a minor glitch with one version (4.97), corrected pretty quickly too. You might want to do a reset on the clients if they haven't been running for some time, in case they have existing jobs whose deadlines have already passed.

They will download new jobs and the 4.98 version which is running fine.

Bok

PY 222
04-13-2006, 03:03 PM
Thanks Bok for the heads up.

How do I reset the client without getting a new ID for the box? If there is no way around this, then I'll just rerun my script and install a new client on the box.

Bok
04-13-2006, 03:05 PM
boinc -reset_project <url>

will do it, keeps the same id, just dumps all the existing work and gets new work.
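
If you are scripting this across a farm, a minimal sketch like the one below just builds the command quoted above so it can be logged or handed to whatever runs it. Assumptions: the project URL is inferred from the Rosetta forum link earlier in this thread, the binary is called `boinc` on your boxes, and `reset_command` is a made-up helper name. Nothing is executed here.

```python
import shlex

# Assumption: project URL inferred from the forum link posted earlier in the thread.
ROSETTA_URL = "http://boinc.bakerlab.org/rosetta/"


def reset_command(project_url, boinc_binary="boinc"):
    """Build the argument list for the reset invocation quoted above (hypothetical helper)."""
    if not project_url.startswith(("http://", "https://")):
        raise ValueError("project URL should start with http:// or https://")
    return [boinc_binary, "-reset_project", project_url]


cmd = reset_command(ROSETTA_URL)

# Print the command for review instead of running it.
print(shlex.join(cmd))
```

Keeping it as an argument list means it can be passed to `subprocess.run` later without any shell quoting issues, once you have confirmed the option works with your client version.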

Bok

willebenn
04-21-2006, 06:49 AM
The client has been upgraded to 5.01 so keep an eye on your systems.

Chuck
04-21-2006, 06:57 AM
Thank you for the heads up!

Just what we need... MORE CHANGE !

<cross fingers> Hope this works </cross fingers>

:eek:

C.

Digital Parasite
04-21-2006, 09:40 AM
Except this version is supposed to squash some bugs and also help with resets so should be a good improvement. It also adds some new science for hopefully better results.

LAURENU2
04-21-2006, 06:46 PM
OUCH, getting a lot of 300-hour long-running jobs. I am aborting all of them :swear:

Digital Parasite
04-22-2006, 01:07 PM
300 hour? Is that with the new 5.01 client? I thought the latest client automatically aborted after 24 hours in case of trouble like that.

DragonOrta
04-23-2006, 12:55 AM
Well, some of the new ones seem to make progress, but go on and on and on. I had one WU today that ran for just over 24 hours and another that ran just over 18 hours.

If you have any FACONTACTS or HBLR WUs, I'd suggest keeping an eye on them. It seems that most of the long runners are of those types.

LAURENU2
04-23-2006, 02:53 AM
Well, it seems all is better now that they canceled the bad WUs after 5.01, and I purged the ones I had.

I said 300 because that is what I thought it would take to finish the WUs at their rate of increase.
Full Steam Ahead:bouncy: