
Thread: DF Client not stable with Windows Multi-CPU systems

  1. #1

    DF Client not stable with Windows Multi-CPU systems

    I've been running the DF client for some months now and I'm consistently getting the same "file write" or "faulty-RAM" messages on my multi-CPU Windows-Intel servers. It's always the same: over time, if I'm running multiple instances of the client on a multi-CPU system, I will get these error messages on all but one instance of the client. For example, if I'm running 8 instances of the client on an 8-way system, over time I will end up with only 1 instance still running.

    I am also using single CPU servers and clients, and I've noticed that I NEVER, EVER have a problem with DF on these systems.

    I have 100% confidence in the memory I use in my systems, since I've used these systems to test much more memory- and CPU-intensive applications than DF over long durations and have not seen any issues.

    Can you please investigate this issue with multi-CPUs? As you can probably guess from my output thus far, I'm running a sizable farm and it's getting tiresome to have to maintain these systems. Thus far with the latest protein, my hourly production has see-sawed from as low as 190K to 500K. Currently, I'm on the lower end of this range not because I removed any clients from DF, but because the DF application has not been stable over time, and I do not have the time to restart all the instances that have "failed".

    On a separate topic, are you ever going to optimize this client for Intel-based 64-bit CPUs? I've run this on a 64-way server and it's very slow, even though the entire application can fit in the L2 cache of each CPU.

    Thanks.

  2. #2
    Administrator PCZ's Avatar
    Join Date
    Jun 2003
    Location
    Chertsey Surrey UK
    Posts
    2,428
    I run DF on multi CPU boxes when I get the chance, and am also seeing the same errors.
    They need a lot of nursing to keep running.
    I have just checked a quad Xeon and 2 of the instances of DF have died.

    One instance had this error at the bottom of the log
    FATAL ERROR: [023.024] {trajtools.c, line 2507} RLEUnPack failed, size=3, should=400*400 - likely this is caused by overclocked or faulty RAM chips, please test your RAM

    The other instance had this error
    FATAL ERROR: CoreLib [002.005] {ncbifile.c, line 715} File write error

    This is a brand new Dell server with 4 Xeons, each with 2 MB of cache.
    I doubt very much that it has bad RAM; it is ECC, so it would self-repair any errors.
    It also has lots of disk space left.

    I have also checked a few older dual P3 servers and a high percentage of them have broken instances of DF.
    They have the same 2 errors as posted above.
    A couple of them had no error in the log but refused to start until I deleted filelist.txt

    Fortunately for me the corporate boxes I run this client on only contribute about 20% of my output.
    I usually check them once a week and restart all the stalled instances.

  3. #3
    Ancient Programmer Paratima's Avatar
    Join Date
    Dec 2001
    Location
    West Central Florida
    Posts
    3,296
    Have you guys seen/tried IronBits' directions in this thread?

    If it works for two CPUs...
    HOME: A physical construct for keeping rain off your computers.

  4. #4
    Ancient Haggis Hound Angus's Avatar
    Join Date
    Jan 2002
    Location
    Seattle/Norfolk Island
    Posts
    828
    I've experienced all the same issues with multi-CPU W2K servers - in my case, new Dell multi-Xeon rack-mount server class machines.

    In my opinion, the client should take care of its files properly. Using the IronBits workaround may help, but again it's having to make up for deficiencies in the client. I don't mind too awfully much having to install a client for each CPU in its own folder, but that's as far as I want to go. Having to customize batch files and make extra TEMP folders is too much.

    Ultimately, the client should figure out how many CPUs are in the box, and start up enough processes to run one protein for each CPU, whether it's a 'real' CPU or a virtual HT CPU. I realize that only XP recognizes the HT virtual CPUs for what they are, but W2K recognizes them as CPUs, it just doesn't know they are HT.

    Anyway, back to the topic. Yes, it's broken, and it's a PITA to keep them running.

  5. #5
    Target Butt IronBits's Avatar
    Join Date
    Dec 2001
    Location
    Morrisville, NC
    Posts
    8,619
    I've seen this one on single cpu boxen
    FATAL ERROR: CoreLib [002.005] {ncbifile.c, line 715} File write error

  6. #6
    Resident Lunatic Anarchy99's Avatar
    Join Date
    Jan 2003
    Location
    South Dakota
    Posts
    538
    I agree about the multi-CPU problem. Even with IB's fixes it is still happening to me.

    I don't have any advice to fix it, but I agree that it should be remedied.
    Entry: lunatic
    Function: adjective
    Definition: crazy
    Synonyms: absurd, baked, balmy, bananas, batty, bonkers, cracked, crazed, daft, demented, deranged, dippy, flaky, flipped out, foolish, freaked out, gone ape, idiotic, insane, irrational, kooky, loco, loony, mad, maniac, maniacal, nonsensical, nuts, nutty, preposterous, psyched out, psycho, psychotic, schizoid, screwy, stupid, unhinged, unsound, wacky, whacko, zany
    Concept: health (poor)

  7. #7
    Ol' retired IT geezer
    Join Date
    Feb 2003
    Location
    Scarborough
    Posts
    92

    Deficiencies in the client -- NOT !!!

    Gentlemen,

    I did NOT see anyone complaining who indicated they were using Linux or other non-Windows operating systems. This is a Windows problem, NOT an application problem! True, the application may be stretching Windows' limits, but so be it.

    I don't mind too awfully much having to install a client for each CPU in its own folder, but that's as far as I want to go. Having to customize batch files and make extra TEMP folders is too much.
    Set it up once in a generic manner and it's a piece of cake... "set TMP=..\tmp" works for me in Windows....
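
    The one-time generic setup described above can be sketched in Python. This is only an illustration under stated assumptions: the folder names `df1`, `df2`, ... and the per-instance `tmp` subfolder are hypothetical, not something the client or IronBits' directions prescribe, and the sketch prepares folders and environment settings without starting the client.

```python
import os
from pathlib import Path


def plan_instances(base_dir, count=None):
    """Create one folder per instance, each with its own tmp subfolder,
    and return the working directory and environment for each instance."""
    # Default to one instance per CPU reported by the operating system.
    count = count or os.cpu_count() or 1
    plans = []
    for i in range(1, count + 1):
        instance_dir = Path(base_dir) / f"df{i}"  # hypothetical folder naming
        tmp_dir = instance_dir / "tmp"            # assumption: private temp folder
        tmp_dir.mkdir(parents=True, exist_ok=True)
        # Copy the current environment and point TMP/TEMP at the private folder.
        env = dict(os.environ, TMP=str(tmp_dir), TEMP=str(tmp_dir))
        plans.append({"cwd": str(instance_dir), "env": env})
    return plans
```

    Each entry could then be handed to `subprocess.Popen` through its `cwd` and `env` arguments, together with whatever command starts the client in that folder.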



    Ultimately, the client should figure out how many CPUs are in the box, and start up enough processes to run one protein for each CPU, whether it's a 'real' CPU or a virtual HT CPU. I realize that only XP recognizes the HT virtual CPUs for what they are, but W2K recognizes them as CPUs, it just doesn't know they are HT.
    Now, this is NOT the way to go if you are trying to get an application to run on multiple platforms.... Applications should NOT dig into the platform to make decisions about how to run unless the application is running the environment. Here the object of the application is to perform work.

    Ask Microsoft to fix its operating system... (I know... lost cause!)

    My two cents worth... Ned


  8. #8
    While I have seen these errors before, this is the first time I recall that someone has indicated they may be caused by multi-CPU machines. I agree it is likely a Windows-only issue, but it could be a problem with a mutex or something. We will be updating the binary next Tuesday, and after this, I would request that someone (how about bguinto, since you have an 8-way!?) take a test version from me to get some further information on these errors and why they are happening. It should not be something as simple as interfering filenames, as long as each instance is installed to its own directory. Great pains were taken to ensure temporary files are named uniquely.
    Howard Feldman

  9. #9
    Member
    Join Date
    Jul 2002
    Location
    Down the road from Mr. Fist :D
    Posts
    76
    I'm running a dual AMD on Windows 2000 Server and it is running perfectly without problems. It's been running non-stop 24/7 for ages now.

    I HAVE seen that error before on a different Windows machine, but only once and not for a long time now.

  10. #10
    Howard,

    How do I go about getting a test version from you?

    Thanks.

    BTW, regarding the second half of my question ... regarding IA-64 optimization. Any time line?

  11. #11
    Originally posted by bguinto1
    Howard,

    How do I go about getting a test version from you?

    Thanks.

    BTW, regarding the second half of my question ... regarding IA-64 optimization. Any time line?
    Easy, you give me your e-mail (PM if you like), and I give you the test version

    No IA-64 until we get an IA-64 box in house. It's out of my hands, sorry.
    Howard Feldman

  12. #12
    Howard,

    My email address is bguinto1@yahoo.com.

    Thanks.

  13. #13
    Thanks, I'll try to get a test version to you after next week's update then. Remind me if I should forget...
    Howard Feldman

  14. #14
    Originally posted by Brian the Fist
    Thanks, I'll try to get a test version to you after next week's update then. Remind me if I should forget...
    Howard,

    A friendly reminder to send me the test version.

    Thanks.

  15. #15
    Has there been any progress on this matter? I have stopped running the client on my dual PIII as a result of this problem.

    Ted Quade

  16. #16
    We made some changes to the code a while back to try to improve this but I do not think it is 100% solved. Please feel free to try it and see if it works any better for you now.
    Howard Feldman

  17. #17
    This problem just showed up on the machine I've been running the Beta client on.

    Dual Xeon 1.7 GHz
    Win2k Pro
    I have been just running the default foldit.bat to allow me to watch progress on the client.
    The beta client had been running fine for about a week. So I started running the other 3rd party apps to see what would happen. I was running DFMon on another machine and remotely checking the status of the beta client. DFMon had been running for 3-4 days before this problem showed up.

    Thu Feb 26 13:06:20 2004 ERROR: [777.000] {ncbi_socket.c, line 1258} [SOCK::s_Connect] Failed pending connect to www.distributedfolding.org:80 (Unknown) {errno=No such file or directory}
    Thu Feb 26 13:06:20 2004 ERROR: [777.000] {ncbi_connutil.c, line 801} [URL_Connect] Socket connect to www.distributedfolding.org:80 failed: Unknown
    Thu Feb 26 13:23:31 2004 ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
    Thu Feb 26 13:32:19 2004 ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
    Thu Feb 26 13:37:12 2004 ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
    Thu Feb 26 13:42:36 2004 ERROR: [777.000] {ncbi_socket.c, line 1258} [SOCK::s_Connect] Failed pending connect to www.distributedfolding.org:80 (Unknown) {errno=No such file or directory}
    Thu Feb 26 13:42:36 2004 ERROR: [777.000] {ncbi_connutil.c, line 801} [URL_Connect] Socket connect to www.distributedfolding.org:80 failed: Unknown
    Thu Feb 26 23:48:03 2004 FATAL ERROR: [023.024] {trajtools.c, line 2637} RLEUnPack failed, size=3, should=400*400 - likely this is caused by overclocked or faulty RAM chips, please test your RAM

  18. #18
    These are not true errors except the last one. They only indicate inability to contact the server (network busy or unplugged?)
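
    The distinction drawn here (only the FATAL ERROR line stops the client, while the ERROR lines report failed contact with the server) can be applied to a log mechanically. The sketch below is an illustration only: the regular expression is inferred from the log lines quoted in this thread, not from any documented log format, and the function names are made up for the example.

```python
import re

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not a documented format): optional timestamp, level, optional "CoreLib",
# error code, source file and line number, then the message.
LOG_LINE = re.compile(
    r"^(?:(?P<timestamp>\w{3} \w{3} +\d+ [\d:]+ \d{4}) )?"
    r"(?P<level>FATAL ERROR|ERROR): (?:CoreLib )?"
    r"\[(?P<code>\d+\.\d+)\] "
    r"\{(?P<source>[^,]+), line (?P<line>\d+)\} "
    r"(?P<message>.*)$"
)


def parse_log_line(text):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(text.strip())
    return match.groupdict() if match else None


def fatal_entries(lines):
    """Keep only the entries whose level is FATAL ERROR."""
    parsed = (parse_log_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] == "FATAL ERROR"]
```

    Run over the log posted above, `fatal_entries` would leave just the RLEUnPack entry from trajtools.c.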

    The last one, as it says, is likely caused by faulty RAM. It refers to a corrupt .trj file. Look at the most recent *.trj file and see if it looks OK (not size zero or anything weird like that). To recover, you can overwrite it with the .trj from the previous generation. For example, if you have protein_3.trj and protein_4.trj in your directory, protein_4.trj is the messed-up one, so copy protein_3.trj to protein_4.trj and you should be able to continue. If you repeatedly receive this error, test your RAM of course.
    Howard Feldman
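
    The recovery step described in this post can be sketched as a small script. It is a minimal illustration, assuming the files follow the `name_N.trj` pattern from the example (protein_3.trj, protein_4.trj) and that the directory holds the files of a single protein. The function name is hypothetical, and the sketch does not back up the corrupt file first.

```python
import re
import shutil
from pathlib import Path

# Assumed naming pattern, taken from the protein_3.trj / protein_4.trj example.
TRJ_NAME = re.compile(r"^(?P<stem>.+)_(?P<gen>\d+)\.trj$")


def restore_previous_generation(directory):
    """Overwrite the newest .trj with the one from the generation before it.

    Returns (source, target), or None when no previous-generation file exists.
    """
    by_generation = {}
    for path in Path(directory).glob("*.trj"):
        match = TRJ_NAME.match(path.name)
        if match:
            by_generation[int(match.group("gen"))] = path
    if not by_generation:
        return None
    newest = max(by_generation)
    previous = by_generation.get(newest - 1)
    if previous is None:
        return None  # nothing to recover from
    shutil.copyfile(previous, by_generation[newest])
    return previous, by_generation[newest]
```

    When only one .trj file is present, the function returns None and changes nothing.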

  19. #19
    The most recent .trj file is 170kb. There are no previous .trj files in the directory. The beta client continues to refuse to recover. Renaming filelist.txt does make it start over though.

    Also, progress.txt contained:
    Building structure 6 generation 53
    4 until next gen

    So whatever corruption occurred happened while it was crunching structures, not doing the between-generation calculations.

    The live client ran fine all weekend on the same machine. If there is a RAM problem it isn't a huge one.

  20. #20
    Senior Member
    Join Date
    Jul 2002
    Location
    Kodiak, Alaska
    Posts
    432
    Building structure 6 generation 53
    4 until next gen
    ----------
    Since the structures should add up to 100 or 50 if you're running a rather old client.. it looks like you're missing a digit or two..
    www.thegenomecollective.com
    Borging.. it's not just an addiction. It's...

  21. #21
    No, the beta client is only doing 10 strucs per gen.

  22. #22
    Senior Member
    Join Date
    Apr 2002
    Location
    Santa Barbara CA
    Posts
    355
    The default update rate is 5, so it only writes to progress.txt every 5 structures. Since the last one that it wrote was number 6 it probably was doing the end of the generation calculation.
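
    The reasoning above can be made concrete with a little arithmetic. The sketch assumes that progress.txt is rewritten once every `update_rate` structures, counting from the last structure written, and that nothing is written in between. That is an interpretation of this post, not confirmed client behaviour, and the function name is made up.

```python
def possible_positions(last_written, update_rate, structs_per_gen):
    """Return the structures the client may have been on when it stopped,
    and whether it may already have reached the end-of-generation step."""
    # The next write would occur at last_written + update_rate, so anything
    # before that point is not visible in progress.txt.
    upper = min(last_written + update_rate - 1, structs_per_gen)
    structures = list(range(last_written, upper + 1))
    # If the next write would fall past the end of the generation, the client
    # could have finished every structure without updating progress.txt again.
    may_be_between_generations = last_written + update_rate > structs_per_gen
    return structures, may_be_between_generations
```

    With the values from this thread (structure 6 last written, update rate 5, 10 structures per generation), the client could have been on any of structures 6 to 10, or in the end-of-generation calculation.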
