
Thread: New Protein Likes To Die Quietly

  1. #1
    OCworkbench Stats Ho
    Join Date
    Jan 2003
    Posts
    519

    New Protein Likes To Die Quietly

    I am finding that this new protein just dies for no reason and with no error message; it quits before it can write to the log. It was very rare on the last couple of proteins, but with this one I am having several deaths a day on various PCs, some 45 km apart. Reinstalling makes no difference. I now run all PCs with the window visible (still in quiet mode) so I can monitor them. And yes, DFGUI has no clue about their demise either.
    I am not a Stats Ho, it is just more satisfying to see that my numbers are better than yours.

  2. #2
    Anyone else have this problem?
    Howard Feldman

  3. #3
    I have seen this happen now and then, but only on my WIN2K servers running dual processors. It just seems to shut itself down with no entry in the log. Sometimes it can be restarted, but usually I have to delete filelist.txt and start it over.

  4. #4
    Ancient Haggis Hound Angus
    Join Date
    Jan 2002
    Location
    Seattle/Norfolk Island
    Posts
    828
    I have 20 clients running on dual Xeon boxes. Every morning, 3 to 5 of them have stopped in the last 24 hours. No errors; they just stop at the end of a generation. In most cases, I have had to dump the entire work unit and start over with gen 0.

    All boxes are W2K Adv server, with dual HT Xeons running 1 client per virtual CPU, from different folders. This has not been a problem until this protein.

  5. #5
    Registered User gOhAsE
    Join Date
    Jul 2003
    Location
    Germany
    Posts
    24
    I have the same problem.
    The client has died twice so far.
    Quietly, with no error log, and even dfGUI thought it was still running.


  6. #6
    Target Butt IronBits
    Join Date
    Dec 2001
    Location
    Morrisville, NC
    Posts
    8,619
    I have all AMD boxen, 3 of which are duals, and none are exhibiting this behavior... yet.
    Some Win98, mostly W2K, 12 Mandrake boxen and 1... XP box.


  7. #7
    I am still getting random freezes under FreeBSD. Everything looks fine in ps -ax, but I have to kill -9 the foldtrajlite process to stop it, because it ignores removal of the lock file. This happens on several machines: I get one or two every day across a group of 20. If I run Linux on the same machines, the problem does not occur, and they seem to get through a set of 250 faster than the same machine running FreeBSD. This also happened on the previous protein. All machines are running with the -q flag.

  8. #8
    I have 17 AMD systems running, most overclocked about 125%, and 1 P4 system. The OSes are mostly Win2K, with some Win98SE.

    None have exhibited this behavior....
    Too many computers, too little time......

  9. #9
    It happened to me the other day. I'm running an AMD Athlon XP 2000+ on Win2K.

    I only found out because my HSF (Volcano 9) is on auto fan speed, so the computer was much quieter than usual.

  10. #10
    Could someone possibly provide a screenshot immediately after this mysterious death? And could someone explain why you have to delete all your work and start again, is there an error message of some kind after this happens? What is it? Help me help you.
    Howard Feldman

  11. #11
    I have never seen it show any error message in the logs. All my clients run as a service, so there's no window to display a message in. dfGUI still shows the client as running. I have tried deleting the lock file and restarting (which fails), restarting the service (which fails), and issuing a "Recover" command through the dfGUI interface (which sometimes actually restarts it, although not very often). It is at this point that I delete filelist.txt and restart, because nothing else seems to bring it back. Hope this helps some.

  12. #12
    R.I.P GHOST
    Join Date
    Mar 2003
    Location
    north dakota
    Posts
    385
    I just found this on a Linux box: ERROR: [001.001] {foldtrajlite2.c, line 2026} Caught sig 11
    I was running with -if -qt -rt and had 127 structures buffered, which I was able to upload. It had been down for 24 hours.

  13. #13
    Target Butt IronBits
    Join Date
    Dec 2001
    Location
    Morrisville, NC
    Posts
    8,619
    I've seen sig11, but I chalked it up to a processor running too hot. I re-greased it, cleaned up the heatsink and fired it back up.
    Hasn't happened since. It was on only one box.

  14. #14
    Ancient Programmer Paratima
    Join Date
    Dec 2001
    Location
    West Central Florida
    Posts
    3,296
    I had one that died with sig11. I regreased the muffler bearings, changed the air in the tires, updated the BIOS, and it's still been folding just fine ever since!
    HOME: A physical construct for keeping rain off your computers.

  15. #15
    R.I.P GHOST
    Join Date
    Mar 2003
    Location
    north dakota
    Posts
    385
    If it gives me more problems, I will at least kick the tires.

  16. #16
    Had some odd things happening to me last night. I have a few screenshots at home that I will upload later.

    I noticed dfGUI was showing -1 gens buffered. I made the client visible, and it showed the same.

    I shut down the client, but it errored and was killed by Dr. Watson (I'll post the dump later too).

    I restarted, all saved generations were gone (lost about 24 buffered gens)

    I woke up this morning and found that the protein had killed itself. Lost 12 hours of work.

    I'll post the stuff when I get home. Hope it helps

  17. #17
    Here is a ZIP containing all the relevant stuff.

    Hope this helps. If there is anything else I can provide, let me know.


    It seems that my upload (around 250 gens) caused this. There are numerous errors in error.log related to the upload, and I noticed that I didn't receive points for any of the generations I uploaded. Maybe that's it...

    I'm on dial-up, by the way. It goes through another computer on my LAN via ICS. I also used an HTTP proxy occasionally, but not this time.

  18. #18
    I thought it was just me!! I haven't found anything to show you, Howard, but I find my client turned off every morning when I wake up and every evening when I get home from work.

  19. #19
    Hmm... I thought I didn't have it, but maybe this is it:

    To start with, the GUI seems to sit at 46/4 (structures complete/structures remaining) for a long time... I fold offline.

    I'm finding systems sitting there forever. Typically they might have 20 or so buffered generations, while other systems are now up to 150 or so buffered. And they all get uploaded around the same time of day, over the course of 2-3 hours.

    I thought the overclock was hanging them up, but they do it even when underclocked. Checking the CPU temperature, I can tell the client is not running.

    Seems to happen on only 2-3 systems and not all... I used the "distribfold-update.exe" to update them after one system downloaded it.

    When the worst system gets to the "10K run", I'll install the "new" downloaded client package and see what happens....

    And finally, like others have indicated, the error log shows nothing...

  20. #20
    You guys on Windows machines might want to try dfDetect, something I wrote a long time ago (I think you need to be able to run VBScript files as well). If your client goes down for any reason, it'll start it again. You can also set it up to shut down DF while other programs are running on the computer.

    If anyone is interested in the source, give me a PM.

    It's very small so it should be pretty easy to port to Linux, too.
    Last edited by m0ti; 10-11-2003 at 10:24 AM.
    Team Anandtech DF!
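The idea behind a restart watchdog like dfDetect can be sketched in a few lines. This is not m0ti's VBScript; it is a minimal Python sketch, assuming the client is launched by a command such as `cmd /c foldit.bat` from its own folder. `watchdog`, `max_starts` and `delay` are hypothetical names. It only notices a client process that has actually exited; a client that hangs while still running (as in the FreeBSD report earlier in the thread) will not be restarted.

```python
import subprocess
import sys
import time


def watchdog(cmd, cwd=None, max_starts=None, delay=5.0):
    """Run cmd and start it again each time it exits (hypothetical helper).

    Returns the list of exit codes once max_starts launches have happened;
    with max_starts=None it loops forever.
    """
    codes = []
    while True:
        proc = subprocess.Popen(cmd, cwd=cwd)
        codes.append(proc.wait())  # blocks until the client process exits
        if max_starts is not None and len(codes) >= max_starts:
            return codes
        print(f"client exited with code {codes[-1]}; restarting in {delay}s",
              file=sys.stderr)
        time.sleep(delay)  # short pause so a crash loop does not spin the CPU
```

Usage would be something like `watchdog(["cmd", "/c", "foldit.bat"], cwd=r"C:\distribfold1")`, where the folder is an assumed example. As several posts above note, a plain restart is not always enough; some clients only came back after filelist.txt was deleted, which this sketch deliberately does not do.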

  21. #21
    Senior Member
    Join Date
    Jul 2003
    Location
    Hamburg/Germany
    Posts
    386
    I agree with Rebels Haven; it just seems to sit there forever, although I can tell by looking at Task Manager and MB Monitor that the client takes all the CPU time it can get.
    Then it suddenly jumps to the next generation.
    I didn't have that with the old updated version, but after I did a fresh install, it started doing it...
    Don't know what's wrong...

    Greets Thor

  22. #22
    Senior Member
    Join Date
    Mar 2002
    Location
    MI, U.S.
    Posts
    697
    If you set it to use -g 1 (that is, in dfGUI, set "progress update" to 1 and restart the client), I bet it "hangs" at 49 structs (or maybe 50) instead of 46.

    In other words, I bet this is completely normal behavior, because the client is either minimizing energy, or doing whatever else it does between generations.
    "If you fail to adjust your notion of fairness to the reality of the Universe, you will probably not be happy."

    -- Originally posted by Paratima

  23. #23
    Hmmm... So Thor, you're saying it didn't do it with the updated protein, but it does it with the new one?

    Now that I'm watching, I have one machine that is always falling behind the others. I happened to crash the OS yesterday and reloaded 2K, but it's still doing it. I'm going to try the new client package when I finish the 250 structures, and I'll report back.


  24. #24
    bwkaz, anything is possible I guess, but the problem is confined to only 1 or 2 out of 18 systems I'm running...

  25. #25
    Any ideas what happened with my client?

    I installed the new windows package, and have yet to see the same thing happen. I'll let you know if it happens again.

  26. #26
    Senior Member
    Join Date
    Mar 2002
    Location
    MI, U.S.
    Posts
    697
    Wait a minute, it seems that I was confused. Oops.

    I thought you were seeing it hang, then start up again. But I see after re-reading some of the old posts that the client just plain dies.

    So never mind. I just don't know what I'm reading, that's all.

  27. #27
    Ancient Haggis Hound Angus
    Join Date
    Jan 2002
    Location
    Seattle/Norfolk Island
    Posts
    828
    Somehow I missed this earlier.

    All mine that 'quietly stop' are recording an error in the error.log
    FATAL ERROR: CoreLib [002.005] {ncbifile.c, line 715} File write error
    Not a permissions problem: the client can run for days before giving this error, and it runs as an admin user.

    Not a disk space error - 10 to 20 GB of free space on every box.

    Not every client on every box fails at the same time.

    Always stops at Structure 50 at the end of a generation.

    All boxes are W2K Adv. Server, multi HT Xeon, lots of RAM

    I had 3 client sessions on 3 different boxes fail over the weekend with the same error.
    Last edited by Angus; 10-13-2003 at 01:23 PM.
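When several boxes fail, it helps to know whether they are all dying with the same error. The sketch below counts error lines across one or more error.log files. The line format is an assumption inferred only from the two errors quoted in this thread (the `File write error` above and the `Caught sig 11` reported earlier); `parse_error_line` and `tally` are hypothetical helpers, not part of the client.

```python
import re
from collections import Counter

# Pattern inferred from the two error.log lines quoted in this thread:
#   FATAL ERROR: CoreLib [002.005] {ncbifile.c, line 715} File write error
#   ERROR: [001.001] {foldtrajlite2.c, line 2026} Caught sig 11
# Lines of any other shape are skipped.
ERROR_RE = re.compile(
    r"(FATAL ERROR|ERROR):\s*(?:(\w+)\s+)?\[(\d+\.\d+)\]\s*"
    r"\{([^,}]+),\s*line\s+(\d+)\}\s*(.*)"
)


def parse_error_line(line):
    """Return a dict for a recognised error line, or None."""
    m = ERROR_RE.search(line)
    if m is None:
        return None
    level, lib, code, src, lineno, msg = m.groups()
    return {"level": level, "lib": lib, "code": code, "file": src,
            "line": int(lineno), "message": msg.strip()}


def tally(lines):
    """Count errors by (source file, source line, message)."""
    counts = Counter()
    for line in lines:
        rec = parse_error_line(line)
        if rec is not None:
            counts[(rec["file"], rec["line"], rec["message"])] += 1
    return counts
```

Feeding it the error.log from each failing box (for example `tally(open(path))` per file) shows at a glance whether every box is hitting the same `{ncbifile.c, line 715}` write error or something different.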

  28. #28
    Senior Member
    Join Date
    Jan 2003
    Location
    North Carolina
    Posts
    184
    Angus, are you running more than one instance of the client on all the boxes that are having trouble? Did you tell each instance to use a different directory for its temp files?

  29. #29
    Ancient Haggis Hound Angus
    Join Date
    Jan 2002
    Location
    Seattle/Norfolk Island
    Posts
    828
    I'm running four instances on each box, each living in and started with its own dfGUI from its own folder.

    I haven't done anything else, and it's been working fine like this for a long time.

    This is something new with the latest update.

  30. #30
    Target Butt IronBits
    Join Date
    Dec 2001
    Location
    Morrisville, NC
    Posts
    8,619

    DFPTEMP

    Create two folders
    mkdir \distribfold1
    mkdir \distribfold2
    extract the client into each folder
    make sure you put handle.txt in each folder
    make sure autoupdate.cfg is in each folder

    edit foldit.bat in each folder, look for the line that looks like this
    .\foldtrajlite -f protein -n native

    change it to this (Windows; needs 256 MB RAM)
    .\foldtrajlite -f protein -n native -rt -qt

    You should also edit foldit.bat in each folder to add a DFPTEMP variable that points each client's swap files somewhere different from the others.

    for example...
    set DFPTEMP=\distribfold1\TEMP (in the first case)
    set DFPTEMP=\distribfold2\TEMP (in the second case)

    then
    mkdir \distribfold1\TEMP
    mkdir \distribfold2\TEMP

    This prevents the clients from stepping on each other's temp scratch files, which has happened before when using duals.

    open a cmd prompt
    Then
    cd \distribfold1
    foldit

    open a cmd prompt
    cd \distribfold2
    foldit

    There is no way that I know of to dedicate each client to a different processor.
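The foldit.bat edits above are easy to get wrong by hand on many folders. The following is a minimal Python sketch that applies them to one instance folder. It assumes the client line in foldit.bat begins exactly with `.\foldtrajlite -f protein -n native` as quoted above; `patch_foldit` and `setup_instance` are hypothetical helpers. It does not extract the client, copy handle.txt or autoupdate.cfg, or pin clients to CPUs.

```python
from pathlib import Path

# Client command line as quoted in the steps above; adjust if yours differs.
CLIENT_CMD = r".\foldtrajlite -f protein -n native"


def patch_foldit(bat_text, temp_dir):
    """Return foldit.bat text with DFPTEMP set and -rt -qt added.

    Any existing 'set DFPTEMP=' line is dropped and a fresh one is written
    just before the client line, so patching twice changes nothing.
    """
    out = []
    for line in bat_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("set dfptemp="):
            continue  # replaced by the line written before the client line
        if stripped.startswith(CLIENT_CMD):
            out.append(f"set DFPTEMP={temp_dir}")
            flags = line.split()
            for flag in ("-rt", "-qt"):
                if flag not in flags:
                    line = line.rstrip() + " " + flag
        out.append(line)
    return "\n".join(out) + "\n"


def setup_instance(folder):
    """Create a TEMP subfolder and point this folder's foldit.bat at it."""
    folder = Path(folder)
    temp = folder / "TEMP"
    temp.mkdir(exist_ok=True)
    bat = folder / "foldit.bat"
    bat.write_text(patch_foldit(bat.read_text(), str(temp.resolve())))
    return temp
```

Run once per folder, e.g. `setup_instance(r"C:\distribfold1")` and `setup_instance(r"C:\distribfold2")` (the drive letter is an assumption), then start each client from its own command prompt as described.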

  31. #31
    Ancient Haggis Hound Angus
    Join Date
    Jan 2002
    Location
    Seattle/Norfolk Island
    Posts
    828
    This is for *WINDOWS* boxes???

    I see nothing in c:\WINNT\foldtraj.ini or in the DF folders that would lead one to think the client uses anything other than its own .\ folder for temp files.

    What is the client's default TEMP location, and what would the file names be? There's nothing in C:\WINNT\TEMP, C:\TEMP or C:\; in fact, I couldn't find *anything* that looked like a DF TEMP file.

  32. #32
    Senior Member
    Join Date
    Mar 2002
    Location
    MI, U.S.
    Posts
    697
    Angus -- open a command prompt and:

    echo %TEMP%

    $5 says that TEMP is pointing into your profile somewhere (usually TEMP is set to %userprofile%\Local Settings\Temp on 2K / XP boxes).

    The DF temp files are named file<some stuff>.cdx, file<some stuff>.dbf, file<some stuff>.fpt, and file<some stuff> with no extension.

  33. #33
    Ancient Programmer Paratima
    Join Date
    Dec 2001
    Location
    West Central Florida
    Posts
    3,296
    Angus, I run my DF stuff in RAM disk. This is my DFPTEMP directory as of right now:

    Volume in drive Z is RAMDISKNT
    Volume Serial Number is FE2D-F000

    Directory of Z:\DFPTEMP

    09/09/2003 07:00p <DIR> .
    09/09/2003 07:00p <DIR> ..
    10/13/2003 09:10p 3,072 1442_1168_730.cdx
    10/13/2003 09:10p 15,202 1442_1168_730.dbf
    10/13/2003 09:10p 106,004 1442_1168_730.fpt
    10/13/2003 09:10p 57,341 1443_1168_767
    4 File(s) 181,619 bytes
    2 Dir(s) 12,032,000 bytes free

    Hope this helps.
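To check whether a client is actually writing temp files where you expect, the following is a minimal Python sketch that lists candidate DF temp files by extension. It assumes, as suggested above, that the client uses DFPTEMP when it is set and otherwise falls back to the usual temp directory; `df_temp_dir` and `find_df_temp_files` are hypothetical names. Only the `.cdx`, `.dbf` and `.fpt` extensions from the listing above are matched; the companion file with no extension is not, since its naming is not pinned down in this thread.

```python
import os
import tempfile
from pathlib import Path

# Extensions seen in the DFPTEMP listing above (matched case-insensitively).
DF_TEMP_SUFFIXES = {".cdx", ".dbf", ".fpt"}


def df_temp_dir():
    """DFPTEMP if set, else the directory Python treats as the temp dir.

    tempfile.gettempdir() checks TMPDIR, TEMP and TMP before platform
    defaults; whether the client does the same is an assumption.
    """
    return os.environ.get("DFPTEMP") or tempfile.gettempdir()


def find_df_temp_files(directory=None):
    """List files in directory whose extension matches a DF temp file."""
    directory = Path(directory or df_temp_dir())
    return sorted(p for p in directory.iterdir()
                  if p.is_file() and p.suffix.lower() in DF_TEMP_SUFFIXES)
```

Keep in mind that a client running as a service sees the environment of the service account, so its TEMP (and any DFPTEMP) may differ from what your own command prompt reports.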

  34. #34
    Senior Member
    Join Date
    Mar 2002
    Location
    MI, U.S.
    Posts
    697
    Hmm...

    Well, I am using the Linux version, but I thought the files would be the same on both platforms. Guess not...

  35. #35
    Ancient Haggis Hound Angus
    Join Date
    Jan 2002
    Location
    Seattle/Norfolk Island
    Posts
    828
    I'll see if I can get some of the boxes set to use specific TEMP folders tomorrow. If not, they'll have to slowly die by themselves, as I'm away on a trip for a week.

    None of this explains why this problem started with this update.

  36. #36
    I'm not sure it started with this update. I believe I've seen it before, but I put it down to a local machine glitch since there was no error output. It does seem to happen more frequently with the current update, though.

  37. #37
    Originally posted by Rebels Haven
    bwkaz, anything is possible I guess, but the problem is confined to only 1 or 2 out of 18 systems I'm running...
    As a useful experiment, what if you take a 'broken' client, copy the whole directory to a 'good' machine, and then start it up. Is it still 'broken' or is it fixed? If it works after copying it, likely there is something physically wrong with your machine.
    Howard Feldman

  38. #38
    I've read IronBits' suggestions about making different temp folders. I found out that this does not work for services, or maybe I made a mistake. So my question: how do you set DFPTEMP for services? Is there a switch in service.cfg which does the same as editing foldit.bat?

  39. #39
    DFPTEMP should work with the service, but you must remember that a service is not normally run under your user account. The variable must be set for All Users, or at least for the Admin user...
    Howard Feldman

  40. #40
    Not only have I had this quiet crashing problem on my Win XP boxes with no error output, but I have noticed that after a while (which could be up to 30 hrs) the client stops using the extra RAM feature.
    Usually the client uses about 95 MB of RAM while folding, but occasionally it slows down for no apparent reason, and when this happens it only uses about 25 MB. I have to restart the client to make it use the extra RAM again. This has happened at least 6 times (probably more) over the course of this protein.
    On a different matter, just today one of my Win XP boxes had just finished the 250 generations and restarted from the beginning; it got to the 11th structure and crashed, with Win XP showing the crash dialog asking if I wanted to send the info to Microsoft. Every time I restarted it I got the same crash, but no message in the error log. I lost 8 hrs from this crash. I deleted the filelist.txt file and all was well again.

    Hope this helps.
    Googlybear
