
View Full Version: DF Client not stable with Windows Multi-CPU systems



bguinto1
10-21-2003, 07:50 PM
I've been running the DF client for some months now and I'm consistently getting the same "file write" or "faulty-RAM" messages on my multi-CPU Windows/Intel servers. It's always the same: over time, if I'm running multiple instances of the client on a multi-CPU system, I will get these error messages on all but one instance. For example, if I'm running 8 instances of the client on an 8-way system, over time I will end up with only 1 instance still running.

I am also using single CPU servers and clients, and I've noticed that I NEVER, EVER have a problem with DF on these systems.

I have 100% confidence in the memory I use in my systems, since I've used these systems to test much more memory- and CPU-intensive applications than DF over long durations and have not seen any issues.

Can you please investigate this issue with multi-CPU systems? As you can probably guess from my output thus far, I'm running a sizable farm and it's getting tiresome to have to maintain these systems. With the latest protein, my hourly production has see-sawed between as low as 190K and 500K. Currently I'm at the lower end of that range, not because I removed any clients from DF, but because the DF application has not been stable over time and I do not have the time to restart all the instances that have "failed".

On a separate topic, are you ever going to optimize this client for Intel-based 64-bit CPUs? I've run this on a 64-way server and it's very slow, even though the entire application can fit in the L2 cache of each CPU.

Thanks.

PCZ
10-21-2003, 08:34 PM
I run DF on multi-CPU boxes when I get the chance, and I'm also seeing the same errors.
They need a lot of nursing to keep running.
I have just checked a quad Xeon and 2 of the instances of DF have died.

One instance had this error at the bottom of the log
FATAL ERROR: [023.024] {trajtools.c, line 2507} RLEUnPack failed, size=3, should=400*400 - likely this is caused by overclocked or faulty RAM chips, please test your RAM

The other instance had this error
FATAL ERROR: CoreLib [002.005] {ncbifile.c, line 715} File write error

This is a brand-new Dell server with four Xeons, each with 2 MB of cache.
I doubt very much that it has bad RAM; it is ECC, so it would correct any errors itself.
It also has lots of disk space left.

I have also checked a few older dual P3 servers and a high percentage of them have broken instances of DF.
They show the same two errors posted above.
A couple of them had no error in the log but refused to start until I deleted filelist.txt.

Fortunately for me, the corporate boxes I run this client on only contribute about 20% of my output.
I usually check them once a week and restart all the stalled instances.

Paratima
10-21-2003, 09:56 PM
Have you guys seen/tried IronBits' directions in this thread (http://www.free-dc.org/forum/showthread.php?s=&threadid=4508)?

If it works for two CPUs...

Angus
10-21-2003, 10:50 PM
I've experienced all the same issues with multi-CPU W2K servers - in my case, new Dell multi-Xeon rack-mount server-class machines.

In my opinion, the client should take care of its files properly. Using the IronBits workaround may help, but again, it's having to make up for deficiencies in the client. I don't mind too awfully much having to install a client for each CPU in its own folder, but that's as far as I want to go. Having to customize batch files and make extra TEMP folders is too much.

Ultimately, the client should figure out how many CPUs are in the box, and start up enough processes to run one protein for each CPU, whether it's a 'real' CPU or a virtual HT CPU. I realize that only XP recognizes the HT virtual CPUs for what they are, but W2K recognizes them as CPUs; it just doesn't know they are HT.
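
In the meantime, a launcher batch file along these lines would at least start the right number of instances automatically. It's only an untested sketch - the folder names (C:\df1 through C:\dfN), the low priority, and launching each copy through its own foldit.bat are my assumptions, so adjust them to your own layout.

@echo off
REM Start one DF client per logical CPU (W2K/XP count HT CPUs here too).
REM ASSUMPTION: one client install per folder, C:\df1 through C:\dfN.
for /L %%i in (1,1,%NUMBER_OF_PROCESSORS%) do start "DF instance %%i" /D C:\df%%i /LOW cmd /c foldit.bat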

Anyway, back to the topic. Yes, it's broken, and it's a PITA to keep them running.

IronBits
10-21-2003, 11:54 PM
I've seen this one on single cpu boxen :(

FATAL ERROR: CoreLib [002.005] {ncbifile.c, line 715} File write error

Anarchy99
10-22-2003, 12:20 AM
I can confirm the multi-CPU problem; even with IB's fixes it is still happening to me.

I don't have any advice on how to fix it, but I agree that it should be remedied.

Ned
10-22-2003, 08:10 AM
Gentlemen,

I did NOT see anyone complaining who indicated they were using Linux or other non-Windows operating systems. This is a Windows problem, NOT an application problem! True, the application may be stretching Windows' limits, but so be it.


Originally posted by Angus
I don't mind too awfully much having to install a client for each CPU in its own folder, but that's as far as I want to go. Having to customize batch files and make extra TEMP folders is too much.

Set it up once in a generic manner and it's a piece of cake... "set TMP=..\tmp" works for me in Windows.
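
For example, one generic start.bat copied into every client folder does the job. This is only a sketch of the idea - the tmp subfolder and the call to foldit.bat are assumptions about your layout, so adapt the names as needed.

@echo off
REM Run this from anywhere; it changes to the folder the batch file lives in.
cd /d "%~dp0"
REM Give this instance a private temp dir so instances never share temp files.
if not exist tmp mkdir tmp
set TMP=%~dp0tmp
set TEMP=%~dp0tmp
call foldit.bat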




Originally posted by Angus
Ultimately, the client should figure out how many CPUs are in the box, and start up enough processes to run one protein for each CPU, whether it's a 'real' CPU or a virtual HT CPU. I realize that only XP recognizes the HT virtual CPUs for what they are, but W2K recognizes them as CPUs; it just doesn't know they are HT.

Now, this is NOT the way to go if you are trying to get an application to run on multiple platforms... Applications should NOT dig into the platform to make decisions about how to run unless the application itself is managing that environment. Here the object of the application is to perform work.

Ask Microsoft to fix its operating system... (I know... lost cause!)

My two cents worth... Ned

:cool:

Brian the Fist
10-22-2003, 11:24 AM
While I have seen these errors before, this is the first time I recall someone indicating that they may be caused by multi-CPU machines. I agree it is likely a Windows-only issue, but it could be a problem with a mutex or something. We will be updating the binary next Tuesday, and after this I would request that someone (how about bguinto, since you have an 8-way!?) take a test version from me to get some further information on these errors and why they are happening. It should not be something as simple as conflicting filenames, as long as each instance is installed in its own directory; great pains were taken to ensure temporary files are named uniquely.

^7_of_9
10-22-2003, 12:34 PM
I'm running a dual AMD on Windows 2000 Server and it's running perfectly w/o problems. It's been running non-stop 24/7 for ages now.

I HAVE seen that error before on a different Windows machine, but only once and not for a long time now.

bguinto1
10-22-2003, 02:42 PM
Howard,

How do I go about getting a test version from you?

Thanks.

BTW, regarding the second half of my question about IA-64 optimization - any timeline?

Brian the Fist
10-23-2003, 11:34 AM
Originally posted by bguinto1
Howard,

How do I go about getting a test version from you?

Thanks.

BTW, regarding the second half of my question about IA-64 optimization - any timeline?

Easy, you give me your e-mail (PM if you like), and I give you the test version :D

No IA-64 until we get an IA-64 box in house; it's out of my hands, sorry.

bguinto1
10-23-2003, 02:17 PM
Howard,

My email address is bguinto1@yahoo.com.

Thanks.

Brian the Fist
10-24-2003, 12:50 PM
Thanks, I'll try to get a test version to you after next week's update then. :Pokes: Remind me if I should forget...

bguinto1
10-30-2003, 02:00 PM
Originally posted by Brian the Fist
Thanks, I'll try to get a test version to you after next week's update then. :Pokes: Remind me if I should forget...

Howard,

A friendly reminder to send me the test version.

Thanks. :Pokes: ;)

tquade
02-26-2004, 08:45 PM
Has there been any progress on this matter? I have stopped running the client on my dual PIII as a result of this problem.

Ted Quade

Brian the Fist
02-27-2004, 01:02 PM
We made some changes to the code a while back to try to improve this, but I do not think it is 100% solved. Please feel free to try it and see if it works any better for you now.

Galuvian
02-27-2004, 06:47 PM
This problem just showed up on the machine I've been running the Beta client on.

Dual Xeon 1.7 GHz
Win2k Pro
I have just been running the default foldit.bat so I can watch progress on the client.
The beta client had been running fine for about a week, so I started running the other 3rd-party apps to see what would happen. I was running DFMon on another machine and remotely checking the status of the beta client. DFMon had been running for 3-4 days before this problem showed up.

Thu Feb 26 13:06:20 2004 ERROR: [777.000] {ncbi_socket.c, line 1258} [SOCK::s_Connect] Failed pending connect to www.distributedfolding.org:80 (Unknown) {errno=No such file or directory}
Thu Feb 26 13:06:20 2004 ERROR: [777.000] {ncbi_connutil.c, line 801} [URL_Connect] Socket connect to www.distributedfolding.org:80 failed: Unknown
Thu Feb 26 13:23:31 2004 ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
Thu Feb 26 13:32:19 2004 ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
Thu Feb 26 13:37:12 2004 ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
Thu Feb 26 13:42:36 2004 ERROR: [777.000] {ncbi_socket.c, line 1258} [SOCK::s_Connect] Failed pending connect to www.distributedfolding.org:80 (Unknown) {errno=No such file or directory}
Thu Feb 26 13:42:36 2004 ERROR: [777.000] {ncbi_connutil.c, line 801} [URL_Connect] Socket connect to www.distributedfolding.org:80 failed: Unknown
Thu Feb 26 23:48:03 2004 FATAL ERROR: [023.024] {trajtools.c, line 2637} RLEUnPack failed, size=3, should=400*400 - likely this is caused by overclocked or faulty RAM chips, please test your RAM

Brian the Fist
03-01-2004, 10:31 AM
These are not true errors, except for the last one. They only indicate an inability to contact the server (network busy or unplugged?).

The last one, as it says, is more often than not caused by faulty RAM. It refers to a corrupt .trj file. Look at the most recent *.trj file and see if it looks OK - not size zero or anything weird like that. To recover, you can overwrite it with the .trj from the previous generation. For example, if you have protein_3.trj and protein_4.trj in your directory, protein_4.trj is the messed-up one, so copy protein_3.trj over protein_4.trj and you should be able to continue. If you repeatedly receive this error, test your RAM, of course.
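
In other words, something like this from a command prompt in the affected instance's folder (the filenames are just the example above - substitute whatever your current protein's files are actually called):

copy /Y protein_3.trj protein_4.trj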

Galuvian
03-01-2004, 10:51 AM
The most recent .trj file is 170 KB. There are no previous .trj files in the directory. The beta client still refuses to recover. Renaming filelist.txt does make it start over, though.

Also, progress.txt contained:
Building structure 6 generation 53
4 until next gen

So whatever corruption occurred happened while it was crunching structures, not during the between-generation calculations.

The live client ran fine all weekend on the same machine. If there is a RAM problem, it isn't a huge one.

tpdooley
03-01-2004, 03:49 PM
Building structure 6 generation 53
4 until next gen
----------
Since the structures should add up to 100 (or 50 if you're running a rather old client), it looks like you're missing a digit or two...

Galuvian
03-01-2004, 04:02 PM
No, the beta client is only doing 10 strucs per gen.

Welnic
03-01-2004, 04:37 PM
The default update rate is 5, so it only writes to progress.txt every 5 structures. Since the last one it wrote was number 6, it was probably doing the end-of-generation calculation.

Galuvian
03-01-2004, 05:01 PM
Good catch.