Thread: Any bugs old or new?

  1. #1

    Any bugs old or new?

    I haven't heard any complaints or whining about the latest client (aside from the wrong native.val incident), so does that mean everything is fixed now?

    If you are experiencing any bugs, old or new, please let us know. Specifically, if you are still getting 'missing previous generation trajectory file' or 'missing/corrupt structure from previous generation' with the latest client, please let us know. The same goes for errors like 'filelist has been tampered with', etc.

    Do NOT include errors you received while the servers were 'overloaded' on Friday unless they permanently messed up your client or never went away. Thanks, we're hoping it is finally super-stable now!
    Howard Feldman

  2. #2
    Target Butt IronBits
    Join Date
    Dec 2001
    Location
    Morrisville, NC
    Posts
    8,619
    The only thing I can think of so far is when an (unknown) error condition happens while the client is running with -qt.
    What appears to be happening is the program is putting out the error condition, and waiting for a <RETURN>, which leaves the client hanging in a zero CPU processing state.
    When I check for the running process, yup it's there alright, but it's not doing anything...
    Sorry, I'm being vague, but I've seen it several times in the past week on several boxen and I haven't been able to put my finger on it so-to-speak.
    I kill the task and restart the client and it takes off...
    It's not often any more and I'm not sure what caused it...
    Other than that

  3. #3
    I think an overloaded Server could cause this error ...
    At a protein changeover or a client update the server goes offline, and if you start the client you need to press the "Enter" button. After the blackout the server was also unreachable, and the same message appeared on my PC: I had to press "Enter".

    A solution would be to remove this (stupid) message and let the client do its work as though the server were online (you can upload the results after the changeover, and I think in any case a working client is much better than an idling client, isn't it?).
    The German DC Community : Team Rechenkraft.net - Join now ! Rechenkraft.net

  4. #4
    Senior Member
    Join Date
    Sep 2002
    Location
    Meridian, Id
    Posts
    742


    I had to take my boxen offline last Friday because of the server-down problems and because I was leaving town. All the clients were updated and running fine. I have had 2 boxen come up with the previous generation missing error when I went to upload, stranding over 2500 generations. Would you like me to zip this and send it to you?


    ./deep breath in through the nose out through the mouth....

    and as always

  5. #5
    Did you try the -purgeuploadlist switch as described in other threads or the readme file?
    The German DC Community : Team Rechenkraft.net - Join now ! Rechenkraft.net

  6. #6
    Senior Member
    Join Date
    Apr 2002
    Location
    Santa Barbara CA
    Posts
    355
    None of my OSX boxen auto-updated. They all downloaded the distrib-update.tar.gz file but then hung. Trying this now, once I Control-C out I see a message saying "Authenticity not verified. Install anyways? ('y' or 'n')". If I do hit "y" and return while it is hung, nothing happens.

    Trying to open that file manually with StuffIt results in StuffIt just quitting. Trying gzip -d, it says that the file is not in gzip format.

  7. #7
    Not sure if this was a bug or a hiccup!

    Had 2 boxen detect the update and auto-update. A few days later I had to reboot the 2 boxen and restart the client. Upon restart, each detected an update and went thru the update process again, even though all the files seemed to be from the newer client and all of the work was for the newer protein. All work was accepted by the server as coming from the newer client and protein.

    Box #1 did this one time.
    Box #2 did this four times.

    The only thing that I noticed before they stopped this, was that on the final update, I received an update successful message. Then the constant update on restart stopped. I don't think I was getting the "update successful" message before.

    The 2 boxen take their updates from my in-house web server, so I don't think it was in any way related to server overload or blackout problems.

    Bug / Feature / Hiccup -- your call!

  8. #8
    Senior Member
    Join Date
    Sep 2002
    Location
    Meridian, Id
    Posts
    742
    Originally posted by Keller
    Did you try the -purgeuploadlist switch as described in other threads or the readme file?
    I was "REALLY" hoping not to lose all that work.




  9. #9
    You don't lose your work ...
    BUT: if the last generation on your box has already been uploaded to the server and you try to upload it again, you receive this message (this can be caused by a server overload or an abruptly cancelled upload). If you delete the last generation in your buffer (-purgeuploadlist 1), the client no longer tries to upload this gen and proceeds with the next generation. It CAN be a solution, but it NEED NOT be, so make a backup of your directory(s).
    The German DC Community : Team Rechenkraft.net - Join now ! Rechenkraft.net
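
    The backup advice in this post can be scripted. Below is a minimal sketch in Python, assuming the whole client lives in one directory. The function name backup_client_dir and the "-backup-<timestamp>" naming are made up for illustration; they are not part of the DF client.

```python
import shutil
import tempfile
import time
from pathlib import Path


def backup_client_dir(client_dir, backup_root):
    """Copy the whole client directory to a timestamped folder and return its path."""
    src = Path(client_dir).resolve()
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}-backup-{stamp}"
    # copytree refuses to overwrite an existing destination, so an older
    # backup is never clobbered silently.
    shutil.copytree(src, dest)
    return dest


if __name__ == "__main__":
    # Demo on a throwaway directory standing in for a real client folder.
    with tempfile.TemporaryDirectory() as tmp:
        client = Path(tmp) / "df_client"
        client.mkdir()
        (client / "filelist.txt").write_text("example entry\n")
        copy = backup_client_dir(client, Path(tmp) / "backups")
        print("backup written to", copy)
```

    Keep the backup root outside the client directory, then run the purge switch by hand afterwards. If the purge makes things worse, the copy can simply be moved back.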

  10. #10
    Senior Member
    Join Date
    Sep 2002
    Location
    Meridian, Id
    Posts
    742
    Originally posted by Keller
    You don't lose your work ...
    BUT: if the last generation on your box has already been uploaded to the server and you try to upload it again, you receive this message (this can be caused by a server overload or an abruptly cancelled upload). If you delete the last generation in your buffer (-purgeuploadlist 1), the client no longer tries to upload this gen and proceeds with the next generation. It CAN be a solution, but it NEED NOT be, so make a backup of your directory(s).
    Thanks

    what thread is this info in?




  11. #11
    The thread is here
    The German DC Community : Team Rechenkraft.net - Join now ! Rechenkraft.net

  12. #12
    Senior Member
    Join Date
    Sep 2002
    Location
    Meridian, Id
    Posts
    742
    Originally posted by Keller
    The thread is here
    I don't know how I missed that thread but THANKS!

    ./me stocks Target Butts fridge with several containers of this with note attached "for the consumption of our good friend keller only!"




  13. #13
    Well, there really is nothing better than a bit (more) good German beer
    The German DC Community : Team Rechenkraft.net - Join now ! Rechenkraft.net

  14. #14
    Senior Member
    Join Date
    May 2002
    Location
    New Jersey USA
    Posts
    115
    I don't know if it is just me, but the Linux clients seem to have (still have?) a memory leak problem. After running for several hours, the ICC version never gives back enough memory. When run for 12 or more hours it uses up memory until the system starts using swap space.
    This is with RedHat 8.0 and 9.0. No xwindows running, console mode only.
    Client switches are as follows:
    -qt -if -rt

    Any ideas?
    I use free and top to check on the memory usage. If there is a util that can check memory more thoroughly I'll try it.

  15. #15
    Originally posted by Welnic
    None of my OSX boxen auto-updated. They all downloaded the distrib-update.tar.gz file but then hung. Trying this now, once I Control-C out I see a message saying "Authenticity not verified. Install anyways? ('y' or 'n')". If I do hit "y" and return while it is hung, nothing happens.

    Trying to open that file manually with StuffIt results in StuffIt just quitting. Trying gzip -d, it says that the file is not in gzip format.
    This was a bug, apparently specific to MacOSX, which has been fixed (it will work right for the update after next; you'll possibly still have to get the next one manually).
    Howard Feldman

  16. #16
    Junior Member
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    27
    Originally posted by Brian the Fist
    This was a bug, apparently specific to MacOSX, which has been fixed (it will work right for the update after next; you'll possibly still have to get the next one manually).
    I can attest to that. Besides, updating manually will be worth your time - the client actually works with the trajectory distribution! I'm doing gen. 249 right now... almost done...
    Derek

  17. #17
    I have a Windows df directory I could send you. I had the client running as a -qt. It seemed to be running slow, so I checked in Task Manager and there were two foldtrajlite processes. I stopped the client and tried to upload the 250 or so generations. About 60 went up and left the rest with no entries in the filelist.txt so the remaining ones are stranded with no hope of a home.

  18. #18
    Member
    Join Date
    Apr 2003
    Location
    Germany
    Posts
    59
    I have had the following error today:

    ========================[ Aug 19, 2003 2:07 PM ]========================
    ERROR: [000.000] {foldtrajlite2.c, line 4440} Cannot find structure from previous generation .\XXXXXXXX_1_XXXXXXXX_protein_235_0000006_min.val; find it manually or delete filelist.txt to continue
    ERROR: [000.000] {foldtrajlite2.c, line 4616} Error during upload: Previous generation missing
    .\fold_1_XXXXXXXX_5_XXXXXXXX_protein_235.log.bz2
    .\XXXXXXXX_1_XXXXXXXX_protein_235_0000006.val
    fold_1_XXXXXXXX_45_XXXXXXXX_protein_236.log.bz2
    XXXXXXXX_1_XXXXXXXX_protein_236_0000046.val
    .\fold_1_XXXXXXXX_10_XXXXXXXX_protein_237.log.bz2
    .\XXXXXXXX_1_XXXXXXXX_protein_237_0000011.val
    .\fold_1_XXXXXXXX_46_XXXXXXXX_protein_238.log.bz2
    .\XXXXXXXX_1_XXXXXXXX_protein_238_0000047.val
    .\fold_1_XXXXXXXX_36_XXXXXXXX_protein_239.log.bz2
    .\XXXXXXXX_1_XXXXXXXX_protein_239_0000037.val
    .\fold_1_XXXXXXXX_2_XXXXXXXX_protein_240.log.bz2
    .\XXXXXXXX_1_XXXXXXXX_protein_240_0000003.val
    fold_1_XXXXXXXX_36_XXXXXXXX_protein_241.log.bz2
    XXXXXXXX_1_XXXXXXXX_protein_241_0000037.val
    .\fold_1_XXXXXXXX_6_XXXXXXXX_protein_242.log.bz2
    .\XXXXXXXX_1_XXXXXXXX_protein_242_0000007.val
    .\fold_1_XXXXXXXX_13_XXXXXXXX_protein_243.log.bz2
    .\XXXXXXXX_1_XXXXXXXX_protein_243_0000014.val
    .\fold_1_XXXXXXXX_31_XXXXXXXX_protein_244.log.bz2
    .\XXXXXXXX_1_XXXXXXXX_protein_244_0000032.val
    .\fold_1_XXXXXXXX_8_XXXXXXXX_protein_245.log.bz2
    .\XXXXXXXX_1_XXXXXXXX_protein_245_0000009.val
    .\fold_1_XXXXXXXX_47_XXXXXXXX_protein_246.log.bz2
    .\XXXXXXXX_1_XXXXXXXX_protein_246_0000048.val
    .\fold_1_XXXXXXXX_24_XXXXXXXX_protein_247.log.bz2
    .\XXXXXXXX_1_XXXXXXXX_protein_247_0000025.val
    .\fold_1_XXXXXXXX_35_XXXXXXXX_protein_248.log.bz2
    .\XXXXXXXX_1_XXXXXXXX_protein_248_0000036.val
    .\fold_1_XXXXXXXX_47_XXXXXXXX_protein_249.log.bz2
    .\XXXXXXXX_1_XXXXXXXX_protein_249_0000048.val
    .\fold_1_XXXXXXXX_48_XXXXXXXX_protein_250.log.bz2
    .\XXXXXXXX_1_XXXXXXXX_protein_250_0000049.val
    .\fold_2_XXXXXXXX_73_protein.log.bz2
    .\XXXXXXXX_2_protein_0000074.val
    CurrentStruc 2 245 128 0 1 74 39.990 -389.673 706.316 3.608 10102177.000 0.850 1.500 250.000 -------------------HHHHHHHHHHHHHHHH---------------------
    602c641520eedd5bfc0644c23db550c6
    I also rared the whole directory (about 7 MB)

    I had nearly finished my first set of this protein :/
    After trying the purgelist option (without success), I started at zero.

    I hope that my info was of some help to you!

    Running the following options: .\foldtrajlite -f protein -n native -rt -g 1
    Running XP Pro with 512 MB DDR

    Chaser

  19. #19
    Senior Member
    Join Date
    May 2002
    Location
    New Jersey USA
    Posts
    115
    Some memory usage numbers. I just started the GCC version of the Linux client to try overnight.
    RH 8.0
    Switches: -qt -if -rt
    Numbers from the free command:
    Clean system boot: used 36 MB, free 220 MB
    After running the client for about a minute: used 106 MB, free 149 MB
    After exiting the client: used 49 MB, free 206 MB
    Is this 13 MB increase in used memory from a lib or something loading on the first run, and not part of the problem?
    Will add more numbers after some run time.

  20. #20
    Senior Member
    Join Date
    Mar 2002
    Location
    MI, U.S.
    Posts
    697
    Are those numbers from free's "-/+ buffers/cache" line, or from the other line (the first one)?

    Because if they're from the first line, then they mean almost nothing. The first line includes, in the "used" column, memory that's holding filesystem cache information. This memory is the first to get released if any process needs RAM, so it's not really "in use" as far as userspace is concerned. The value of "used" RAM just keeps increasing, never really coming back down, if you only look at that line.
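
    To make the difference between the two lines concrete, here is a small Python sketch of the arithmetic behind free's buffers/cache row: buffers and cache are subtracted from "used" and added to "free". The function name apps_view and the sample numbers are assumptions for illustration, not readings from any machine in this thread.

```python
def apps_view(used, free, buffers, cached):
    """Return (used, free) as seen by programs, i.e. with reclaimable
    buffer and cache memory counted as free instead of used."""
    reclaimable = buffers + cached
    return used - reclaimable, free + reclaimable


if __name__ == "__main__":
    # Made-up numbers in MB, chosen only to show the arithmetic.
    used, free = apps_view(used=106, free=149, buffers=10, cached=50)
    print(f"used by programs: {used} MB, available to programs: {free} MB")
```

    With figures like these, most of the apparent "used" growth would be cache that the kernel hands back on demand.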

    But even better than using numbers from free would be using numbers from ps aux or top. Like these (this machine has been running the DF client nonstop since some point after the servers came back online after the power outage, three days ago):

    Code:
    USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
    <user>   26520 99.9 14.5 52420 37288 ?       SN   Aug16 3872:47 ./foldtrajlite -f protein -n native
    The VSZ and RSS columns (52420 and 37288 here) are the important ones. VSZ is the total amount of virtual memory that's in use, and RSS is the resident set size (approximately the total physical RAM in use, but some strange programs like X can do weird things to both values).

    I am using the ICC client, though. So if it's a compiler thing, then yours still may be leaking -- check with ps aux or top sometime, for the reported VSZ and RSS of the foldtrajlite process.

    (I saw something like a leak with the beta client, which I believe was gcc-only, so this still is very much possible.)
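
    For anyone who wants to log these two values over time, here is a rough Python sketch that pulls VSZ and RSS out of a single ps aux row like the one quoted in this post. It assumes the usual 11-column Linux layout, where both values are reported in KiB; the helper name parse_ps_aux_line is made up for illustration.

```python
def parse_ps_aux_line(line):
    """Split one `ps aux` data row into its columns and pick out the memory fields."""
    # The COMMAND column may contain spaces, so split at most 10 times.
    fields = line.split(None, 10)
    if len(fields) < 11:
        raise ValueError("not a complete ps aux row")
    return {
        "pid": int(fields[1]),
        "vsz_kib": int(fields[4]),
        "rss_kib": int(fields[5]),
        "command": fields[10],
    }


# The row quoted in the post, copied as-is.
SAMPLE = "<user>   26520 99.9 14.5 52420 37288 ?       SN   Aug16 3872:47 ./foldtrajlite -f protein -n native"

if __name__ == "__main__":
    row = parse_ps_aux_line(SAMPLE)
    print(f"VSZ {row['vsz_kib'] / 1024:.1f} MiB, RSS {row['rss_kib'] / 1024:.1f} MiB")
```

    Sampling the row every few minutes and watching whether RSS keeps climbing would be one way to tell a real leak from ordinary cache growth.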

  21. #21
    Senior Member
    Join Date
    May 2002
    Location
    New Jersey USA
    Posts
    115
    Ok, yes those are from the top line, was not sure what all was included. The same numbers appear with the top command (at the top). The mem number shown for foldtrajlite does seem to get restored to the free mem total.
    The only thing that is bothersome is that the free mem goes down until the swap file starts getting used. That is what started me on this whole witch hunt.
    I usually use the ICC version also but thought I'd see if the GCC worked the same.
    Thanks for the explanation.
    Is there a way to limit it so that the swap file stays inactive?

  22. #22
    Senior Member
    Join Date
    Mar 2002
    Location
    MI, U.S.
    Posts
    697
    You can turn swap off... but then your kernel will just kill DF (if it's what's using the memory).

    The kernel FS cache starts at nothing, and grows from there, as you use the filesystem. So that might be causing this behavior, depending on what you're monitoring.

    And swap being used isn't always necessarily bad, either. If it's being used a lot it is, but occasionally needing to page some memory out that hasn't been accessed in a while (and other memory in that just was accessed) is no big deal, as long as it doesn't happen a lot. Is your disk getting hit hard once this swapping starts?

    What were your VSZ and RSS numbers for foldtrajlite?

  23. #23
    Senior Member
    Join Date
    Sep 2002
    Location
    Meridian, Id
    Posts
    742
    Originally posted by Keller
    You don't lose your work ...
    BUT: if the last generation on your box has already been uploaded to the server and you try to upload it again, you receive this message (this can be caused by a server overload or an abruptly cancelled upload). If you delete the last generation in your buffer (-purgeuploadlist 1), the client no longer tries to upload this gen and proceeds with the next generation. It CAN be a solution, but it NEED NOT be, so make a backup of your directory(s).
    update

    1 boxen uploaded after 1x .\foldtrajlite -purgeuploadlist 1
    1 boxen failed to upload after 300x .\foldtrajlite -purgeuploadlist 1
    Total lost file sets: 650.
    I finally gave up. I'm not sure what is happening when these "generations" get kidnapped, but it seems bizarre that the proggie could continue to build filesets when this situation exists.

    Thanks again keller... ( I see the Spaten is all gone )




  24. #24
    Originally posted by Chaser
    I have had the following error today:





    I also rared the whole directory (about 7 MB)

    I had nearly finished my first set of this protein :/
    After trying the purgelist option (without success), I started at zero.

    I hope that my info was of some help to you!

    Running the following options: .\foldtrajlite -f protein -n native -rt -g 1
    Running XP Pro with 512 MB DDR

    Chaser
    Is there somewhere I could grab the RAR from? (reply to trades@mshri.on.ca so you don't have to publicize it)
    Howard Feldman

  25. #25
    Senior Member
    Join Date
    Jul 2002
    Location
    Kodiak, Alaska
    Posts
    432
    I got this from the latest client on a machine I was testing out overnight..

    ========================[ Aug 20, 2003 2:49 PM ]========================
    ERROR: [000.000] {foldtrajlite2.c, line 4504} File .\myhandle_0_myhandle_protein_27_0000038_min.val is corrupt, missing or has been tampered with; cannot continue - replace file and start again, or manually delete filelist.txt
    ERROR: [000.000] {foldtrajlite2.c, line 4616} Error during upload: Data file checksum failed


    will post more detail in the bugs forum
    www.thegenomecollective.com
    Borging.. it's not just an addiction. It's...

  26. #26
    Member
    Join Date
    Apr 2003
    Location
    Germany
    Posts
    59
    Yes, I could upload it to the webspace of a good friend of mine.
    However, I can't upload it until Sunday.
    Is that too late? If not, I'll upload it then and send a mail... (I would be happy if you, Monika, or whoever could send me an email when you have received the archive.)

    Alternatively, I could also zip it?!

    cya

  27. #27
    Sure, just send us a mail with the link when it's there.
    Howard Feldman

  28. #28
    Here's an odd error message from one of my off-line herd:
    Code:
     ERROR: [001.001] {trajtools.c, line 3512} Unable to open trajectory distribution file handle_protein_251.trj
    FATAL ERROR: [002.003] {foldtrajlite2.c, line 5307} Unable to read trajectory distribution handle_protein_251, please create a new one
    Generation 251?

    Hmmm, I wonder if the 39 C temperature in the computer room had anything to do with it.

  29. #29
    Senior Member
    Join Date
    Jan 2003
    Location
    North Carolina
    Posts
    184
    Originally posted by RandomCritterz
    Generation 251?
    I got that once with the last protein: http://www.free-dc.org/forum/showthr...&threadid=3615

    That was also from an offline cruncher. It looks like a real (but very rare) bug.

  30. #30
    Originally posted by AMD_is_logical
    It looks like a real (but very rare) bug.
    Dang. On the theory that it was heat related errors, I gave the machine a fresh folding directory.

    FWIW, it's an Athlon box running a minimal, text-only Debian install; ICC client with switches -qt -if -rt -p5 -g50.

  31. #31
    Sounds like a bug, I'll add it to the list. It COULD be because you are using -g50 (and there are only 50 strucs/gen) but maybe not..
    Howard Feldman

  32. #32
    Senior Member
    Join Date
    Jan 2003
    Location
    North Carolina
    Posts
    184
    Originally posted by Brian the Fist
    Sounds like a bug, I'll add it to the list. It COULD be because you are using -g50 (and there are only 50 strucs/gen) but maybe not..
    I was using -g5 when it hit me.

    Everything else was almost exactly the same as RandomCritterz. (Minimal text-only linux using a SuSE 8.0 kernel, running on an athlon node, ICC version of client, flags -rt -if -qt -p0 -g5 .)

  33. #33
    R.I.P GHOST
    Join Date
    Mar 2003
    Location
    north dakota
    Posts
    385
    six times in nine days i see this in my windows event viewer. i have had the ' foldtrajlite has encountered a problem and needs to close, please tell microsoft' a few times. i do not think that comes up every time though.



    i got this from windows event viewer:

    0000: 41 70 70 6c 69 63 61 74 Applicat
    0008: 69 6f 6e 20 46 61 69 6c ion Fail
    0010: 75 72 65 20 20 66 6f 6c ure fol
    0018: 64 74 72 61 6a 6c 69 74 dtrajlit
    0020: 65 2e 65 78 65 20 30 2e e.exe 0.
    0028: 30 2e 30 2e 30 20 69 6e 0.0.0 in
    0030: 20 66 6f 6c 64 74 72 61 foldtra
    0038: 6a 6c 69 74 65 2e 65 78 jlite.ex
    0040: 65 20 30 2e 30 2e 30 2e e 0.0.0.
    0048: 30 20 61 74 20 6f 66 66 0 at off
    0050: 73 65 74 20 30 30 30 34 set 0004
    0058: 39 30 31 63 901c

    the second error in the event viewer is quite long so i won't post unless asked, but here is the short version:

    fault bucket 59792471

    if there is another place to look for info let me know.


    this is a box for crunching that i check once a day.
    this is my error log for that time period- starts with

    ========================[ Aug 17, 2003 12:05 AM ]========================

    ========================[ Aug 17, 2003 12:58 AM ]========================
    ERROR: [000.000] {taskapi.c, line 576} Arguments must start with '-' (the offending argument #1 was: 'install')
    FATAL ERROR: [000.000] {foldtrajlite2.c, line 6786} Missing/Invalid arguments. For usage, run program, with no arguments.

    ========================[ Aug 17, 2003 12:58 AM ]========================

    has lots of 'failed to connect' in the middle
    and ends with-

    ========================[ Aug 18, 2003 12:10 AM ]========================

    ========================[ Aug 19, 2003 2:23 AM ]========================
    ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
    ERROR: [000.000] {foldtrajlite2.c, line 4616} Error during upload: NO RESPONSE FROM SERVER - WILL TRY AGAIN LATER

    ========================[ Aug 22, 2003 1:10 AM ]========================
    ERROR: [777.000] {ncbi_socket.c, line 1258} [SOCK::s_Connect] Failed pending connect to www.distributedfolding.org:80 (Unknown) {errno=No such file or directory}
    ERROR: [777.000] {ncbi_connutil.c, line 801} [URL_Connect] Socket connect to www.distributedfolding.org:80 failed: Unknown
    ERROR: [777.000] {ncbi_socket.c, line 1258} [SOCK::s_Connect] Failed pending connect to www.distributedfolding.org:80 (Unknown) {errno=No such file or directory}
    ERROR: [777.000] {ncbi_connutil.c, line 801} [URL_Connect] Socket connect to www.distributedfolding.org:80 failed: Unknown

    ========================[ Aug 23, 2003 4:29 AM ]========================

    ========================[ Aug 25, 2003 1:44 AM ]========================
    Last edited by GHOST; 08-25-2003 at 03:07 AM.

  34. #34
    R.I.P GHOST
    Join Date
    Mar 2003
    Location
    north dakota
    Posts
    385
    i have this on another box.

    FATAL ERROR: [000.000] {foldtrajlite2.c, line 1462} Upload list has been tampered with, please delete filelist.txt and try again

    date is august 16. probably a victim of the power failure. where i live we just had a 'bump'. i heard my computers turn off and restart instantly. had client running as service so i was not worried. did not look at it for a couple days.

    filelist was blank. client would not purge when i tried purgeuploadlist 1.
    when i deleted filelist and restarted, it made its 10,000, then its normal generation and upload, but left all the old files, like it does not see them.

    this folder has been renamed and put off to the side.
    running a fresh install.

  35. #35
    Member
    Join Date
    Apr 2003
    Location
    Germany
    Posts
    59
    @Howard
    I uploaded the archive and sent a mail!

    Have a nice week!

    edit:
    Your message

    To: trades@mshri.on.ca
    Subject: Error: Previous Generation Missing
    Sent: Mon, 25 Aug 2003 05:46:08 -0400

    did not reach the following recipient(s):

    Elena Garderman on Mon, 25 Aug 2003 05:52:10 -0400
    The recipient was unavailable to take delivery of the message
    MSEXCH:MSExchangeIS:slri:EX
    So shit! I'm leaving for vacation now...
    Oh, an idea: I'll send you the email via PM.

    cya

    edit2:
    I couldn't PM you.
    I have now tried to send it to your personal mail address. I hope that this works... but I can't wait any longer....
    Last edited by Chaser; 08-25-2003 at 06:48 AM.
