
Thread: Repeatable crash @ Gen 338

  1. #1
    Member
    Join Date
    May 2003
    Location
    Portland, OR USA
    Posts
    79

    Repeatable crash @ Gen 338

    I've left a few machines folding offline for the past week or so. So far, every one of them that has reached 337 generations buffered has crashed some time during that 338th generation, leaving me with 45MB of data that I can't upload.

    When I try to upload, the program freezes after saying that it's checking for newer versions.

    I'm running with the -rt switch and either the -if or -ut switch, depending on whether I'm internet-connected or not.

    Any suggestions for recovering this work?
    -djp
    I'm not a Stats Ho either. I just want to go and check to see that all my spare boxen are busy. Hang on a minute....

  2. #2
    Fixer of Broken Things FoBoT's Avatar
    Join Date
    Dec 2001
    Location
    Holden MO
    Posts
    2,137


    i have the exact same problem!!!

in XP it crashes with a weird system message, in W2K Server, it just sticks, won't do anything

    i have 3 sets of 337 just like that, it has to be a common problem

    here is the thread i started in the Bug Fix forum
    Use the right tool for the right job!

  3. #3


    I thought there are only 250 generation...

  4. #4
    Member
    Join Date
    May 2003
    Location
    Portland, OR USA
    Posts
    79
You're right. There are only 250 generations of optimization. After #250, your computer will generate another random structure and spend 250 generations of tweaking and computing to try to optimize it. Once that's done, the process repeats.

    The process repeats until they make some major change in Toronto, like a new protein or even a new Phase of the project.

    If I leave my machine folding offline for more than generation 250, it will continue folding on a new randomization. (until it crashes in the manner I've reported)
    -djp
    I'm not a Stats Ho either. I just want to go and check to see that all my spare boxen are busy. Hang on a minute....

  5. #5
    Got you. I haven't had that problem yet.

  6. #6
    Fixer of Broken Things FoBoT's Avatar
    Join Date
    Dec 2001
    Location
    Holden MO
    Posts
    2,137
    a guy on my team forum said the following sorta fixes it (you at least can upload your work, but then it starts over at gen 1, oh well). i haven't tried it yet, will tonight when i get home

all you need to do to recover is to delete the last entry in filelist.txt - make a backup first though. It uploaded the 336 prior gens but it restarted at Gen 0
    Use the right tool for the right job!

  7. #7
    Member
    Join Date
    May 2003
    Location
    Portland, OR USA
    Posts
    79
    Thanx, FoBoT! I was about to re-read the SneakerNet FAQ and try to see if that was possible. Unfortunately, DF.o's web server was undergoing maintenance when I tried to look at the FAQ earlier.

-djp
    I'm not a Stats Ho either. I just want to go and check to see that all my spare boxen are busy. Hang on a minute....

  8. #8
    Fixer of Broken Things FoBoT's Avatar
    Join Date
    Dec 2001
    Location
    Holden MO
    Posts
    2,137
    uhg, this isn't working so far

    i removed the last two .val .log file entries, that didn't work, it said "filelist.txt tampered with, start over"

    i tried removing the very last line that appears to be an md5sum key, it stuck/crashed again
    Use the right tool for the right job!

  9. #9
    Member
    Join Date
    May 2003
    Location
    Portland, OR USA
    Posts
    79
Originally posted by FoBoT
uhg, this isn't working so far

    i removed the last two .val .log file entries, that didn't work, it said "filelist.txt tampered with, start over"

    i tried removing the very last line that appears to be an md5sum key, it stuck/crashed again
    Read the SneakerNet FAQ....
    The last line is a checksum, just like you thought.
    The next-to-last line is a pile of statistics. One of those numbers is a counter of how many generations in the buffer have been completed. I will try it when I get a slow spot in the day.

Here's the FAQ section in case the DF website is down tomorrow:
    Is there an easier way to 'sneaker net' than copying the whole directory?
    'Sneaker net' refers to running the client on one or more machines that never connect to the internet, then copying the files (via CD for example) to a different machine which IS on the internet, to upload. The problem is that since uploading occurs on a different machine, the first machine(s) (the one(s) doing the actual work) never know that work has been uploaded, and so you must manually 'tell them'.

    To do this, you need to understand a little about filelist.txt and how phase II works. filelist.txt tracks all the files that have been generated and that need to be uploaded to our server, plus some other information. Whenever you upload, filelist.txt gets modified and some of the files that were listed in it get uploaded, and deleted from disk. However in some cases, you still need to keep a file locally, AFTER it has been uploaded. This can all be figured out from filelist.txt. Consider the following contents for filelist.txt:

    .\fold_0_xxxxxxxx_0_xxxxxxxx_protein_24.log.bz2
    .\xxxxxxx_0_xxxxxxxx_protein_24_0000023.val
    .\fold_0_xxxxxxxx_0_xxxxxxxx_protein_25.log.bz2
    .\xxxxxxx_0_xxxxxxxx_protein_25_0000034.val
    CurrentStruc 0 41 126 25 1 0 10000000.000 10000000.000 .....

    Pay special attention to the fifth number after 'CurrentStruc'. In this case it is a '1'. This tells us the number of file pairs (log and val file pairs) in filelist.txt, starting from the top, that have already been uploaded to the server. Thus fold_0_xxxxxxxx_0_xxxxxxxx_protein_24.log.bz2 and xxxxxxx_0_xxxxxxxx_protein_24_0000023.val have already been sent to the server in this case, while the other 2 files have not yet. When an upload is initiated, all files that have not yet been uploaded will be, except the last file pair which is normally from the current, partially completed generation. These will not be uploaded until the generation has been completed of course. In the above example, nothing is ready to be uploaded yet.

    Also note that while in filelist.txt the val file is always listed as xxx.val, on disk there may be one or more of the following files: xxx.val, xxx_min.val, xxx.val.bz2 or xxx_min.val.bz2. Thus the name may not exactly match what is in filelist.txt, and this is OK. Sometimes more than one of these filenames will be present on disk, in which case all of them go together.

So at any point in time, what files should be present? The answer is: any rows in filelist.txt that have NOT been uploaded yet (as per the above explanation), as well as the .val file(s) from the last file pair that WAS uploaded, if it is still listed in filelist.txt. In our above example, this would be xxxxxxx_0_xxxxxxxx_protein_24_0000023.val (and/or xxxxxxx_0_xxxxxxxx_protein_24_0000023.val.bz2, xxxxxxx_0_xxxxxxxx_protein_24_0000023_min.val or xxxxxxx_0_xxxxxxxx_protein_24_0000023_min.val.bz2).

Thus to update the sneaker net machines after an upload, you simply need to copy back the filelist.txt from the internet-enabled machine after the upload, and then delete files that have been uploaded. Do this by looking at the filelist.txt. Only files listed in there are needed, and use the fifth number after CurrentStruc to figure out if any other files can be deleted as well. But remember to keep the .val files from the last uploaded generation, if any, to avoid breaking the software. Lastly, remember for any .val entry in the filelist.txt, there may be any one or more of four actual filenames on disk, as explained above. The best way to get the hang of it is just look at what the uploader leaves behind after it uploads and see if it matches what you expect based on these instructions. Once you've done this a few times, you should get the hang of it and can probably script a somewhat automated way of updating the no-network machines.
    I have also written a DOS .BAT file to automate the sneakernet process when you have a network of offline machines. I'll post this as soon as I've made it a little more generic and well commented.
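For anyone who wants to script this, here's a rough sketch of the FAQ's fifth-number rule in Python. This is not an official DF tool; it just assumes the filelist.txt layout quoted above (.log.bz2/.val entry pairs, then the CurrentStruc stats line, then the checksum line).

```python
# Sketch of the FAQ's bookkeeping, assuming the layout shown above.
def parse_filelist(lines):
    """Return (file_pairs, uploaded_count) from filelist.txt lines."""
    entries = []
    uploaded = 0
    for line in lines:
        line = line.strip()
        if line.startswith("CurrentStruc"):
            # The fifth number after 'CurrentStruc' is the count of
            # file pairs (from the top) already sent to the server.
            uploaded = int(line.split()[5])
            break
        if line:
            entries.append(line)
    # Pair up entries top to bottom: (log, val), (log, val), ...
    pairs = list(zip(entries[0::2], entries[1::2]))
    return pairs, uploaded
```

Running it on the FAQ's example should report 2 pairs with 1 already uploaded, so the protein_24 pair could be cleaned up (keeping its .val file, per the FAQ).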
    -djp
    I'm not a Stats Ho either. I just want to go and check to see that all my spare boxen are busy. Hang on a minute....

  10. #10
    Fixer of Broken Things FoBoT's Avatar
    Join Date
    Dec 2001
    Location
    Holden MO
    Posts
    2,137
    i must be dumb as a stump, i am not getting it to work

    i'll try to get the other guy to give a more detailed procedure
    Use the right tool for the right job!

  11. #11
    OCworkbench Stats Ho
    Join Date
    Jan 2003
    Posts
    519
What happens to the data from these higher Gens? Does the server accept the data and use it like any other result from Gens #1-250, or does it see that it is above 250 and store it somewhere different? And what happens to the scoring system above 250? Does it still work, or will it give a reading of 0 for those generations above 250? Just thought I would ask, as I am in the mood for some dumb question submitting.
    I am not a Stats Ho, it is just more satisfying to see that my numbers are better than yours.

  12. #12
    Member
    Join Date
    May 2003
    Location
    Portland, OR USA
    Posts
    79

    this just worked for me...

    This just worked for me:

    I edited the filelist.txt file by deleting the line that starts with CurrentStruct and everything downward. Then I looked at the list of filenames. They all seem to have a substring within the name of "protein_??" where the ?? is some number that increments by 1 for every pair of files. In one of my crashed runs, the last file in the list didn't have a partner with the same protein_?? number, so I removed this widow from the list.

    After I saved the edited text file, I ran the client with a -ut switch and it started uploading 336 files cleanly.

    On a second botched upload, I didn't have an un-paired file at the end, so I just truncated the file at CurrentStruct and it is currently uploading 337 files to the server!

    Oddly enough, after uploading successfully it wrote a fresh filelist.txt file and left a pair of matching *protein_86* files and a trj file. I think I'll move these over to an idle client and see if it will resume cleanly with generation 86 or 87.
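In case anyone wants to script the edit above, here's a rough sketch (untested against the real client, so back up filelist.txt first). It assumes the protein_?? naming I described: truncate at the CurrentStruct line and drop a widowed last entry whose protein_?? number has no partner.

```python
# Hedged sketch of the manual filelist.txt edit described above.
import re

def protein_num(name):
    """Extract the NN from a 'protein_NN' filename, or None."""
    m = re.search(r"protein_(\d+)", name)
    return m.group(1) if m else None

def trim_filelist(lines):
    """Return the entry lines this workaround would keep."""
    kept = []
    for line in lines:
        line = line.strip()
        # Delete the CurrentStruc(t) line and everything below it.
        if line.startswith("CurrentStruc"):
            break
        if line:
            kept.append(line)
    # Remove an unpaired ("widowed") trailing entry, if any.
    if kept:
        last = protein_num(kept[-1])
        if last and sum(1 for l in kept if protein_num(l) == last) < 2:
            kept.pop()
    return kept
```

Write the returned lines back over filelist.txt (after backing it up) and then run the client with -ut as described.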
    Last edited by djp; 07-01-2003 at 06:12 PM.
    -djp
    I'm not a Stats Ho either. I just want to go and check to see that all my spare boxen are busy. Hang on a minute....

  13. #13
    Fixer of Broken Things FoBoT's Avatar
    Join Date
    Dec 2001
    Location
    Holden MO
    Posts
    2,137
    i got it to work, you remove the last two lines, the checksum line and the info line, leaving just the list of .val/.log files

    it is sending them anyway in W2K server, will try it from XP when i get home tonight

    Grumpy, they are not post 250 generations, they are a second set of 1-250 generations, just another "run"

    i.e. if you run no-net/offline long enough, the client will complete a first "run" of 250 generations. since the client is offline (-if) , it can't upload the cached 250 generations, so it starts a second "run" of 250 generations, the .val/.log files append a _1_ into the file names to distinguish them from the first "run" of 250 generations

    the bug we are seeing is that once it caches 337 generations (250 from the first "run", and 87 from the second "run") , the client halts/crashes and won't run at all. even if you try to upload/send the results back, it won't run. the above workaround appears to upload the results, but then restarts the client at generation 0 vs generation 88 (the next generation of the second "run"). not a huge loss if the cached results are sent

the bottom line is that you can only run the current client offline/no-net for < 337 generations (1 full "run" of 250 and 87 subsequent generations of the 2nd "run") before sending out results. i encountered this error after being gone for 1 week (actually 9 days of no-netting) on the following speed PCs: P3 1.26GHz, AMD 2100+, AMD 1700+ . i am not sure at what point of the nine days they quit, i guess i could look at the time/date stamps of the .val/.log files to figure it out

bottom line, until the bug is fixed, don't run the client offline without uploading for too long, like maybe only 3-5 days on fast PCs, maybe a week on slower PCs
    Use the right tool for the right job!

  14. #14
    Fixer of Broken Things FoBoT's Avatar
    Join Date
    Dec 2001
    Location
    Holden MO
    Posts
    2,137

    Re: this just worked for me...

    Originally posted by djp
    ...left a pair of matching *protein_86* files and a trj file. I think I'll move these over to an idle client and see if it will resume cleanly with generation 86 or 87.
    i will try it at home, i haven't started a new client folder on those two pc's
    Use the right tool for the right job!

  15. #15
    Stats God in Training Darkness Productions's Avatar
    Join Date
    Dec 2001
    Location
    The land of dp!
    Posts
    4,164
    I just tried this same thing, and I get stuck with a "previous generation missing" error. Thoughts?

  16. #16
    OCworkbench Stats Ho
    Join Date
    Jan 2003
    Posts
    519
    Thanks Fobot, my Cognitive Functions should be back online this year sometime
    I am not a Stats Ho, it is just more satisfying to see that my numbers are better than yours.

  17. #17
    Target Butt IronBits's Avatar
    Join Date
    Dec 2001
    Location
    Morrisville, NC
    Posts
    8,619
    Originally posted by Darkness Productions
    I just tried this same thing, and I get stuck with a "previous generation missing" error. Thoughts?
    Me too
    http://www.free-dc.org/forum/showthr...1987#post31987

  18. #18
    Fixer of Broken Things FoBoT's Avatar
    Join Date
    Dec 2001
    Location
    Holden MO
    Posts
    2,137
    Originally posted by Darkness Productions
    I just tried this same thing, and I get stuck with a "previous generation missing" error. Thoughts?
    my first XP box at home that i am trying it on did that also , hmm... i am searching around, i thought i read a trick to fix that type error

    the other XP box did the same thing
    weird , it works on W2K , maybe i will burn these two folders to cdr and take them to work and try them on W2K

    i dunno
    Last edited by FoBoT; 07-01-2003 at 11:48 PM.
    Use the right tool for the right job!

  19. #19
    Sadly, this new bug will just be added to the list of those we are already in the process of trying to fix.

    Thanks for the informative posts, and for your patience with this new set of bugs.

    I hope we will be able to offer you solutions soon, although I do not yet have a time frame.
    Elena Garderman

  20. #20
    Fixer of Broken Things FoBoT's Avatar
    Join Date
    Dec 2001
    Location
    Holden MO
    Posts
    2,137
    some more info on a possible workaround to get this work uploaded was posted in this thread

    basically, he says to verify every line of the filelist.txt to ensure all those files really exist, remove any invalid ones and try it again (at least that is how i understood the post, i will be trying it later tonight on my home XP boxen)
    Use the right tool for the right job!

  21. #21
    Senior Member
    Join Date
    Apr 2002
    Location
    Oosterhout, Netherlands
    Posts
    223
Is there any progress in resolving this (imho) HUGE problem? Due to this you're practically unable to run DFolding on fast, offline clients. A lot of potential CPU power is lost. Such a shame...
    Proud member of the Dutch Power Cows

  22. #22
    Member
    Join Date
    May 2003
    Location
    Portland, OR USA
    Posts
    79
    If you harvest and upload your work every day or two, your fastest Windows-based machines should still never reach 337 generations buffered. Sure, it's a pain working around this bug, but there is a workaround.
    -djp
    I'm not a Stats Ho either. I just want to go and check to see that all my spare boxen are busy. Hang on a minute....

  23. #23
    Senior Member
    Join Date
    Apr 2002
    Location
    Oosterhout, Netherlands
    Posts
    223
3 days (a long weekend) on an XP2700+ put me in this position.

None of the workarounds posted helped me, and the same goes for other people, as you can read in the several threads...

    Maybe you have another idea?
    Proud member of the Dutch Power Cows

  24. #24
    Senior Member
    Join Date
    Apr 2002
    Location
    Santa Barbara CA
    Posts
    355
    Originally posted by [DPC]Mobster
3 days (a long weekend) on an XP2700+ put me in this position.

None of the workarounds posted helped me, and the same goes for other people, as you can read in the several threads...

    Maybe you have another idea?
    If you have enough RAM you could run two instances of the client, then things will build up half as fast.

  25. #25
    Could someone please e-mail the filelist.txt with 337 buffered generations to trades@mshri.on.ca? Thanks.
    Howard Feldman

  26. #26
    Senior Member
    Join Date
    Apr 2002
    Location
    Oosterhout, Netherlands
    Posts
    223
Just did that. I hope this will be helpful....
    Proud member of the Dutch Power Cows
