April 28th client



Welnic
04-29-2004, 11:53 AM
I have 12 Athlon procs in my dedicated farm. At the peak on April 28th I had 400-600 gens buffered on each client. They had all been running continuously since I restarted them from scratch on Saturday. I replaced two of the clients with the April 28th client. This morning I have 0 buffered structures on all of the clients except for the new ones, which have 245 and 264.

pfb
04-29-2004, 11:54 AM
What does the error.log show for those with 0 buffered gens? I'd be very surprised if there wasn't a 910 in there...

Welnic
04-29-2004, 12:04 PM
Well yeah. The new clients' error logs are also filled with 910s. Out of the ~6 million points that I had buffered, my total credit for the day was about 600,000, which also includes all of my borged boxen. I just didn't want to sound whiny.

pfb
04-29-2004, 12:16 PM
I haven't noticed any difference with the new client in terms of upload, except for the slightly quicker verification process (probably due to the reduced delay time)...

Galuvian
04-29-2004, 12:41 PM
Why doesn't this auto-update?
/sigh

Backwoodz
04-29-2004, 05:12 PM
========================[ Apr 29, 2004 5:09 PM ]========================
Starting foldtrajlite built Apr 28 2004
Thu Apr 29 17:09:44 2004 ERROR: [000.000] {foldtrajlite2.c, line 4741} File .\3x2gqv42_0_3x2gqv42_protein_2_0000004_min.val is corrupt, missing or has been tampered with; cannot continue - replace file and start again, or manually delete filelist.txt
Thu Apr 29 17:09:44 2004 ERROR: [000.000] {foldtrajlite2.c, line 5015} Error during upload: Data file checksum failed

any ideas what this is about?..............fooooock.....I hate this damn thing anymore.

pfb
04-29-2004, 05:20 PM
When I've had that, running with -purgeuploadlist 1 has allowed me to upload again...

RandomCritterz
04-29-2004, 06:30 PM
The .val and _min.val files are essentially interchangeable. Look to see if you have a
3x2gqv42_0_3x2gqv42_protein_2_0000004.val
If so, make a copy named
3x2gqv42_0_3x2gqv42_protein_2_0000004_min.val
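
The copy described above can be scripted. This is only a minimal sketch of that manual workaround in Python: the function name `copy_val_to_min_val` is made up for illustration, it is not part of the client, and it assumes the only goal is to create the missing `_min.val` next to an existing `.val` without overwriting anything.

```python
import shutil
from pathlib import Path


def copy_val_to_min_val(val_path):
    """Copy <name>.val to <name>_min.val if the latter is missing.

    Hypothetical helper for illustration only. Returns the path of the
    new copy, or None if nothing was copied.
    """
    src = Path(val_path)
    # "foo.val" -> "foo_min.val" in the same directory
    dst = src.with_name(src.stem + "_min.val")
    if not src.is_file():
        return None  # nothing to copy from
    if dst.exists():
        return None  # never overwrite an existing _min.val
    shutil.copy2(src, dst)
    return dst
```

Stop the client before touching its directory, and keep a backup of the folder in case the copy does not help.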

Backwoodz
04-29-2004, 07:03 PM
Thanks guyz........I know everyonez patience is a little thin these dayz.
It seemz to be ok again for now.

AMD_is_logical
04-29-2004, 10:02 PM
Originally posted by Backwoodz
========================[ Apr 29, 2004 5:09 PM ]========================
Starting foldtrajlite built Apr 28 2004
Thu Apr 29 17:09:44 2004 ERROR: [000.000] {foldtrajlite2.c, line 4741} File .\3x2gqv42_0_3x2gqv42_protein_2_0000004_min.val is corrupt, missing or has been tampered with; cannot continue - replace file and start again, or manually delete filelist.txt
Thu Apr 29 17:09:44 2004 ERROR: [000.000] {foldtrajlite2.c, line 5015} Error during upload: Data file checksum failed

any ideas what this is about?..............fooooock.....I hate this damn thing anymore.

It seems that the client can sometimes try to upload one more generation than it should. I got this after stopping the client by removing the .lock file. It had made the <handle>.....val file for that generation, but it had not yet made the <handle>.....min.val file. After uploading all complete generations with -ut, the client tried to upload this unfinished one, causing errors. When I restarted the client, it behaved normally, and didn't try to upload the unfinished generation.

Fortunately, -purgeuploadlist doesn't have the bug and won't purge the unfinished generation.

When this bug hits, I believe it's safe to simply ignore the message and restart the client.
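
To check whether a directory is in the state described above (a `.val` file without its `_min.val` partner), something like the following Python sketch could be used. The function name `unfinished_generations` is hypothetical, and the sketch assumes the naming pattern seen in the error messages in this thread, where the minimized file is the `.val` name with `_min` inserted before the extension.

```python
from pathlib import Path


def unfinished_generations(directory):
    """List .val files that have no matching _min.val file.

    Hypothetical helper; assumes <name>.val pairs with <name>_min.val.
    """
    missing = []
    for val in sorted(Path(directory).glob("*.val")):
        if val.name.endswith("_min.val"):
            continue  # this is already a minimized file
        partner = val.with_name(val.stem + "_min.val")
        if not partner.exists():
            missing.append(val.name)
    return missing


if __name__ == "__main__":
    # Report on the current directory (run from the client folder).
    for name in unfinished_generations("."):
        print("no _min.val yet for", name)
```

An entry in the output only means the generation was not finished when the client stopped; per the post above, restarting the client should be enough.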

Of course, we do need to bug Howard into fixing this bug. :bonk:

RandomCritterz
04-30-2004, 02:12 PM
I think you're on to something there. In my case I wasn't running -ut, but had merely switched from -if to -it now that the servers are healthy. It had its error at the end of the big upload, continued folding, and has been fine since.

Fri Apr 30 12:08:18 2004 ERROR: [000.000] {foldtrajlite2.c, line 4741} File .\handle_3_handle_protein_13_0000059_min.val is corrupt, missing or has been tampered with; cannot continue - replace file and start again, or manually delete filelist.txt
Fri Apr 30 12:08:18 2004 ERROR: [000.000] {foldtrajlite2.c, line 5015} Error during upload: Data file checksum failed
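
When comparing logs across several boxes, it can help to count errors by where they were raised. The Python sketch below parses lines shaped like the ones quoted in this thread (timestamp, `ERROR:`, a bracketed code, then `{file, line N}` and the message). The regular expression is inferred only from those samples, so other line formats in error.log will simply be skipped, and the names `LOG_LINE` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the sample lines in this thread (an assumption,
# not a documented format).
LOG_LINE = re.compile(
    r"^(?P<timestamp>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"ERROR: \[(?P<code>[\d.]+)\] "
    r"\{(?P<source>[^,]+), line (?P<line>\d+)\} "
    r"(?P<message>.*)$"
)


def summarize(lines):
    """Count ERROR lines by (source file, line number)."""
    counts = Counter()
    for raw in lines:
        match = LOG_LINE.match(raw.strip())
        if match:
            counts[(match["source"], int(match["line"]))] += 1
    return counts


if __name__ == "__main__":
    with open("error.log", encoding="utf-8", errors="replace") as fh:
        for (source, line), n in summarize(fh).most_common():
            print(f"{n:5d}  {source}:{line}")
```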

deranged128[OCAU]
04-30-2004, 06:26 PM
I ran across this same error on one instance on my dual MP2000. Every folded gen thereafter went through without comment in the error log. It wasn't after a large upload though, rather it had 3 buffered gens and seemed to try and upload one that was still W.I.P.

All is good, it finished that set and is happily crunching along.

I did however get this error on another box:
Sat May 01 08:16:36 2004 ERROR: [001.001] {bbox.c, line 445} Tried to BDRemove non-existent (1.#QNAN,1.#QNAN,1.#QNAN)
Sat May 01 08:16:36 2004 ERROR: [001.001] {bbox.c, line 445} Tried to BDRemove non-existent (1.#QNAN,1.#QNAN,1.#QNAN)

The client then stalled. Any suggestions?

tpdooley
05-01-2004, 06:16 AM
Perform a quick scan of your system for spyware and viruses/worms/trojan horses (install Ad-Aware 6.181 and Spybot Search & Destroy 1.2, update them, and then scan for spyware, and run your antivirus package of choice as well). Even if they don't cause problems, they waste CPU cycles. :)

Else, start up a new client in a new directory since Howard and Elena generally won't respond until Monday.. Save the problem directory until they respond, in case the abnormal error is something they want to see your whole directory to help diagnose..

Brian the Fist
05-01-2004, 08:58 PM
Originally posted by deranged128[OCAU]
I ran across this same error on one instance on my dual MP2000. Every folded gen thereafter went through without comment in the error log. It wasn't after a large upload though, rather it had 3 buffered gens and seemed to try and upload one that was still W.I.P.

All is good, it finished that set and is happily crunching along.

I did however get this error on another box:
Sat May 01 08:16:36 2004 ERROR: [001.001] {bbox.c, line 445} Tried to BDRemove non-existent (1.#QNAN,1.#QNAN,1.#QNAN)
Sat May 01 08:16:36 2004 ERROR: [001.001] {bbox.c, line 445} Tried to BDRemove non-existent (1.#QNAN,1.#QNAN,1.#QNAN)

The client then stalled. Any suggestions?

What are the specs of this machine, esp. the CPU? This is probably caused by a floating point error. That is most likely if you used hardware similar but not identical to that which we compiled on. It is something that compiler flags may fix/prevent so tell me which version of the client (OS-wise) it was running too. Thanks
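
For background on the message itself: `1.#QNAN` is how the Microsoft C runtime of that era printed a quiet NaN ("not a number"). The Python sketch below is not the client's code, and nothing here is known about how bbox.c actually stores points. It only illustrates the general behaviour: once a coordinate becomes NaN it compares unequal to everything, including itself, so a removal that searches by value cannot find the point. The function `remove_point` is hypothetical.

```python
import math


def remove_point(points, target):
    """Remove the first point whose coordinates all equal target's.

    Returns True if a point was removed, False if none matched.
    """
    for i, point in enumerate(points):
        if all(a == b for a, b in zip(point, target)):
            del points[i]
            return True
    return False


# inf - inf is one way a floating point computation yields NaN.
nan = float("inf") - float("inf")
assert math.isnan(nan)
assert nan != nan  # NaN never compares equal, even to itself

points = [(1.0, 2.0, 3.0), (nan, nan, nan)]
print(remove_point(points, (1.0, 2.0, 3.0)))  # ordinary point is found
print(remove_point(points, (nan, nan, nan)))  # NaN point is never found
```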

Brian the Fist
05-01-2004, 09:00 PM
Originally posted by RandomCritterz
The .val and _min.val files are essentially interchangeable. Look to see if you have a
3x2gqv42_0_3x2gqv42_protein_2_0000004.val
If so, make a copy named
3x2gqv42_0_3x2gqv42_protein_2_0000004_min.val

When you see the client saying 'minimizing energy' (if not in quiet mode), it is making the _min.val from the .val file. If this seems to be a problem, can someone give me steps to reliably reproduce the error? If you can tell us how to reproduce it, it is usually easy to fix; otherwise it is very tricky...

deranged128[OCAU]
05-02-2004, 07:27 PM
Originally posted by Brian the Fist
What are the specs of this machine, esp. the CPU? This is probably caused by a floating point error. That is most likely if you used hardware similar but not identical to that which we compiled on. It is something that compiler flags may fix/prevent so tell me which version of the client (OS-wise) it was running too. Thanks

Hi Howard,

The machine was running a Superlocked Barton core XP2500+ on an Abit NF7-S with Corsair XMS2700 RAM at 2-2-2-6. The machine was overclocked to 200 FSB (RAM at 166) with a vcore of 1.75, air cooled via a TT TR2-M2, with a CPU temp of 53C at an ambient of 17C (it's mid autumn).

OS was Windows XP SP1a using the text based windows client, running as a service.

I've since increased the vcore to 1.85 and have not had a problem with DF since. The lower vcore may have been the problem, but I'm not sure.

Thanks for checking it out.

Brian the Fist
05-04-2004, 02:13 PM
The BDRemove error is most likely a result of your overclocking messing up the FPU, I would say.