Any bugs old or new?



Brian the Fist
08-19-2003, 11:12 AM
I haven't heard any complaints :swear: or whining :cry: about the latest client (aside from the wrong native.val incident), so does that mean everything is fixed now?

If you are experiencing any bugs, old or new, please let us know. Specifically, if you are still getting 'missing previous generation trajectory file' or 'missing/corrupt structure from previous generation' with the latest client, please let us know. Also errors like 'filelist has been tampered with', etc.

Do NOT include errors you received while the servers were 'overloaded' on Friday unless they permanently messed up your client or never went away. Thanks, we're hoping it is finally super-stable now!

IronBits
08-19-2003, 11:57 AM
The only thing I can think of so far is when an (unknown) error condition happens while the client is running with -qt.
What appears to be happening is that the program prints the error and then waits for a <RETURN>, which leaves the client hanging and using zero CPU.
When I check for the running process, yup it's there alright, but it's not doing anything...
Sorry, I'm being vague, but I've seen it several times in the past week on several boxen and I haven't been able to put my finger on it so-to-speak.
I kill the task and restart the client and it takes off...
It's not often any more and I'm not sure what caused it...
Other than that :thumbs: :D

Keller
08-19-2003, 12:06 PM
I think an overloaded server could cause this error ...
At a protein changeover or a client update the server goes offline, and if you start the client then, you need to press the "Enter" key. After the blackout the server was not reachable either, and the same message appeared on my PC: I had to press "Enter".

A solution would be to remove this (stupid) message and let the client do its work as though the server were online (you can upload the results after the changeover, and I think in any case a working client is much better than an idling client, isn't it?).

cygnussphere
08-19-2003, 12:23 PM
:bang: :bang: :bang:

I had to take my boxen offline last Friday because of the server down problems and I was leaving town. All the clients were updated and running fine. I have had 2 boxen come up with the previous generation missing error when I went to upload, stranding over 2500 generations :swear: Would you like me to zip this and send it to you? :bang:


./deep breath in through the nose out through the mouth....

and as always :cheers:

Keller
08-19-2003, 12:25 PM
Did you try the -purgeuploadlist switch as described in other threads or the readme file ?

Welnic
08-19-2003, 12:28 PM
None of my OSX boxen auto-updated. They all downloaded the distrib-update.tar.gz file but then hung. Trying this now, once I Control-C out I see a message saying "Authenticity not verified. Install anyways? ('y' or 'n')". If I hit "y" and return while it is hung, nothing happens.

Trying to open that file manually with StuffIt results in StuffIt just quitting. Trying gzip -d, it says that the file is not in gzip format.

PinHead
08-19-2003, 12:36 PM
Not sure if this was a bug or a hiccup!

Had 2 boxen detect the update and auto-update. A few days later I had to reboot the 2 boxen and restart the client. Upon restart, it detected an update and went thru the update process again, even though all the files seemed to be the newer client and all of the work was the newer protein. All work was accepted by the server as the newer client and protein.

Box #1 did this one time.
Box #2 did this four times.

The only thing that I noticed before they stopped this, was that on the final update, I received an update successful message. Then the constant update on restart stopped. I don't think I was getting the "update successful" message before.

The 2 boxen take their updates from my in-house web server, so I don't think it was in any way related to server overload or blackout problems.

Bug / Feature / Hiccup -- your call!

cygnussphere
08-19-2003, 12:39 PM
Originally posted by Keller
Did you try the -purgeuploadlist switch as described in other threads or the readme file ?

I was "REALLY" hoping not to lose all that work.

:tempted:

:cheers:

Keller
08-19-2003, 12:42 PM
You don't lose your work ...
BUT: if the last generation on your box has already been uploaded to the server and you try to upload it again, you receive this message (this can be caused by a server overload or an abruptly cancelled upload). If you delete the last generation in your buffer (-purgeuploadlist 1), the client no longer tries to upload this gen and proceeds with the next generation. It CAN be a solution but it NEEDN'T be, so make a backup of your directory(s) first.
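
For anyone who wants to script the backup step before trying the purge, here is a minimal sketch in Python. It only copies the client directory to a timestamped folder; the function name and the example paths are made up for illustration and are not part of the client.

```python
import shutil
import time
from pathlib import Path


def backup_client_dir(client_dir, backup_root):
    """Copy the whole client directory to a timestamped folder and return the new path.

    Both arguments are hypothetical example paths chosen by the caller.
    """
    src = Path(client_dir).resolve()
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}-backup-{stamp}"
    # copytree raises an error if dest already exists, so an older backup is never overwritten
    shutil.copytree(src, dest)
    return dest


# Example (paths are assumptions, adjust to your own setup):
# backup_client_dir(r"C:\distribfold", r"D:\df-backups")
```

Run it with the client stopped, then try -purgeuploadlist 1 on the original directory; if things get worse you can copy the backup back.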

cygnussphere
08-19-2003, 01:26 PM
Originally posted by Keller
You don't lose your work ...
BUT: if the last generation on your box has already been uploaded to the server and you try to upload it again, you receive this message (this can be caused by a server overload or an abruptly cancelled upload). If you delete the last generation in your buffer (-purgeuploadlist 1), the client no longer tries to upload this gen and proceeds with the next generation. It CAN be a solution but it NEEDN'T be, so make a backup of your directory(s) first.

Thanks

what thread is this info in?

:tempted:

:cheers:

Keller
08-19-2003, 01:29 PM
The thread is here (http://www.free-dc.org/forum/showthread.php?s=&threadid=3914)

cygnussphere
08-19-2003, 02:00 PM
Originally posted by Keller
The thread is here (http://www.free-dc.org/forum/showthread.php?s=&threadid=3914)

I don't know how I missed that thread but THANKS!

./me stocks Target Butts fridge with several containers of this (http://www.spatenusa.com/3_products/3_3_prod_spectrum/3_2_1_produkt/optimator/content_r.htm) with note attached "for the consumption of our good friend keller only!":thumbs: :notworthy

:tempted:

:cheers:

Keller
08-19-2003, 02:04 PM
Well, there really is nothing better than a bit (more :) ) of good German beer
:cheers: :cheers: :cheers:

willebenn
08-19-2003, 02:13 PM
I don't know if it is just me but the Linux clients seem to have (still have?) a memory leak problem. The ICC version after running several hours never gives back enough memory. When run for 12 or more hours it uses memory up so that the system starts using swap file space.
This is with RedHat 8.0 and 9.0. No xwindows running, console mode only.
Client switches as follows:
-qt -if -rt

Any ideas?
I use free and top to check on the memory usage. If there is a util that can check memory more thorough I'll try it.

Brian the Fist
08-19-2003, 02:16 PM
Originally posted by Welnic
None of my OSX boxen auto-updated. They all downloaded the distrib-update.tar.gz file but then hung. Trying this now, once I Control-C out I see a message saying "Authenticity not verified. Install anyways? ('y' or 'n')". If I hit "y" and return while it is hung, nothing happens.

Trying to open that file manually by using stuffit results in stuffit just quitting. Trying gzip -d it says that the file is not in gzip format.

This was a bug, apparently specific to MacOSX, which has been fixed (for the next next update it will work right; you'll possibly still have to get the next one manually).

dtsang
08-19-2003, 02:38 PM
Originally posted by Brian the Fist
This was a bug, apparently specific to MacOSX, which has been fixed (for the next next update it will work right; you'll possibly still have to get the next one manually).

I can attest to that. Besides, updating manually will be worth your time - the client actually works with the :bang: trajectory distribution! I'm doing gen. 249 right now... almost done... :thumbs:

HaloJones
08-19-2003, 03:10 PM
I have a Windows df directory I could send you. I had the client running as a -qt. It seemed to be running slow, so I checked in Task Manager and there were two foldtrajlite processes. I stopped the client and tried to upload the 250 or so generations. About 60 went up and left the rest with no entries in the filelist.txt so the remaining ones are stranded with no hope of a home.

Chaser
08-19-2003, 03:50 PM
I have had the following error today:


========================[ Aug 19, 2003 2:07 PM ]========================
ERROR: [000.000] {foldtrajlite2.c, line 4440} Cannot find structure from previous generation .\XXXXXXXX_1_XXXXXXXX_protein_235_0000006_min.val; find it manually or delete filelist.txt to continue
ERROR: [000.000] {foldtrajlite2.c, line 4616} Error during upload: Previous generation missing


.\fold_1_XXXXXXXX_5_XXXXXXXX_protein_235.log.bz2
.\XXXXXXXX_1_XXXXXXXX_protein_235_0000006.val
fold_1_XXXXXXXX_45_XXXXXXXX_protein_236.log.bz2
XXXXXXXX_1_XXXXXXXX_protein_236_0000046.val
.\fold_1_XXXXXXXX_10_XXXXXXXX_protein_237.log.bz2
.\XXXXXXXX_1_XXXXXXXX_protein_237_0000011.val
.\fold_1_XXXXXXXX_46_XXXXXXXX_protein_238.log.bz2
.\XXXXXXXX_1_XXXXXXXX_protein_238_0000047.val
.\fold_1_XXXXXXXX_36_XXXXXXXX_protein_239.log.bz2
.\XXXXXXXX_1_XXXXXXXX_protein_239_0000037.val
.\fold_1_XXXXXXXX_2_XXXXXXXX_protein_240.log.bz2
.\XXXXXXXX_1_XXXXXXXX_protein_240_0000003.val
fold_1_XXXXXXXX_36_XXXXXXXX_protein_241.log.bz2
XXXXXXXX_1_XXXXXXXX_protein_241_0000037.val
.\fold_1_XXXXXXXX_6_XXXXXXXX_protein_242.log.bz2
.\XXXXXXXX_1_XXXXXXXX_protein_242_0000007.val
.\fold_1_XXXXXXXX_13_XXXXXXXX_protein_243.log.bz2
.\XXXXXXXX_1_XXXXXXXX_protein_243_0000014.val
.\fold_1_XXXXXXXX_31_XXXXXXXX_protein_244.log.bz2
.\XXXXXXXX_1_XXXXXXXX_protein_244_0000032.val
.\fold_1_XXXXXXXX_8_XXXXXXXX_protein_245.log.bz2
.\XXXXXXXX_1_XXXXXXXX_protein_245_0000009.val
.\fold_1_XXXXXXXX_47_XXXXXXXX_protein_246.log.bz2
.\XXXXXXXX_1_XXXXXXXX_protein_246_0000048.val
.\fold_1_XXXXXXXX_24_XXXXXXXX_protein_247.log.bz2
.\XXXXXXXX_1_XXXXXXXX_protein_247_0000025.val
.\fold_1_XXXXXXXX_35_XXXXXXXX_protein_248.log.bz2
.\XXXXXXXX_1_XXXXXXXX_protein_248_0000036.val
.\fold_1_XXXXXXXX_47_XXXXXXXX_protein_249.log.bz2
.\XXXXXXXX_1_XXXXXXXX_protein_249_0000048.val
.\fold_1_XXXXXXXX_48_XXXXXXXX_protein_250.log.bz2
.\XXXXXXXX_1_XXXXXXXX_protein_250_0000049.val
.\fold_2_XXXXXXXX_73_protein.log.bz2
.\XXXXXXXX_2_protein_0000074.val
CurrentStruc 2 245 128 0 1 74 39.990 -389.673 706.316 3.608 10102177.000 0.850 1.500 250.000 -------------------HHHHHHHHHHHHHHHH---------------------
602c641520eedd5bfc0644c23db550c6


I also rared the whole directory (about 7 mb)

I nearly finished my first set of this protein :/
after trying the purgelist option (without success), i started at zero

I hope my info was of some help to you!

running the following options: .\foldtrajlite -f protein -n native -rt -g 1
running xp pro with 512 mb ddr

Chaser

willebenn
08-19-2003, 04:23 PM
Some memory usage numbers. I just started the GCC version Linux client to try overnight.
RH8.0
switches -qt -if -rt
Numbers from free command
Clean system boot
used 36mb free 220mb
run client about a minute
used 106mb free 149mb
exit client
used 49mb free 206mb
Is this 13mb used increment from a lib or something loading for first time run and not part of the problem?
Will add more numbers after some run time.

bwkaz
08-19-2003, 06:21 PM
Are those numbers from free's "-/+ buffers/cache" line, or from the other line (the first one)?

Because if they're from the first line, then they mean almost nothing. The first line includes, in the "used" column, memory that's holding filesystem cache information. This memory is the first to get released if any process needs RAM, so it's not really "in use" as far as userspace is concerned. The value of "used" RAM just keeps increasing, never really coming back down, if you only look at that line.
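
To make the arithmetic concrete, here is a small Python sketch of how the second line of free relates to the first: "used" minus buffers and cache, and "free" plus buffers and cache. The function names are mine and the numbers in the example are invented, not taken from anyone's machine.

```python
def used_excluding_cache(used, buffers, cached):
    """Memory actually held by processes: first-line "used" minus buffers and cache."""
    return used - buffers - cached


def free_including_cache(free, buffers, cached):
    """Memory a process could still get: first-line "free" plus buffers and cache."""
    return free + buffers + cached


# Invented example values in MB, purely for illustration.
used, free, buffers, cached = 106, 149, 20, 50
print(used_excluding_cache(used, buffers, cached))  # 36
print(free_including_cache(free, buffers, cached))  # 219
```

So a first-line "used" figure that keeps climbing can still correspond to a flat number once the cache is subtracted.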

But even better than using numbers from free would be using numbers from ps aux or top. Like these (this machine has been running the DF client nonstop since some point after the servers came back online after the power outage, three days ago):


USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
<user> 26520 99.9 14.5 52420 37288 ? SN Aug16 3872:47 ./foldtrajlite -f protein -n native

The VSZ and RSS columns (52420 and 37288 above) are the important ones. VSZ is the total amount of virtual memory that's in use, and RSS is the resident set size (approximately the total physical RAM in use, but some strange programs like X can do weird things to both values).

I am using the ICC client, though. So if it's a compiler thing, then yours still may be leaking -- check with ps aux or top sometime, for the reported VSZ and RSS of the foldtrajlite process.
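
If you want to log those two values over time instead of eyeballing them, a rough Python sketch like the following pulls VSZ and RSS out of one line of ps aux output. The function name is made up; the sample line is the one from my machine above, and the values are in KiB as ps reports them.

```python
def parse_ps_aux_line(line):
    """Extract PID, VSZ and RSS from one data line of `ps aux` output."""
    # ps aux prints 11 columns; the last one (COMMAND) may contain spaces,
    # so split on whitespace at most 10 times to keep the command in one piece.
    fields = line.split(None, 10)
    return {
        "pid": int(fields[1]),
        "vsz_kib": int(fields[4]),
        "rss_kib": int(fields[5]),
        "command": fields[10],
    }


sample = "<user> 26520 99.9 14.5 52420 37288 ? SN Aug16 3872:47 ./foldtrajlite -f protein -n native"
info = parse_ps_aux_line(sample)
print(info["vsz_kib"], info["rss_kib"])  # 52420 37288
```

If RSS for foldtrajlite keeps growing between samples taken hours apart, that would point at a real leak; if it stays flat, the shrinking "free" number is most likely just cache.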

(I saw something like a leak with the beta client, which I believe was gcc-only, so this still is very much possible.)

willebenn
08-19-2003, 08:08 PM
Ok, yes those are from the top line, was not sure what all was included. The same numbers appear with the top command (at the top). The mem number shown for foldtrajlite does seem to get restored to the free mem total.
The only thing that is bothersome is that the free mem goes down until the swap file starts getting used. That is what started me on this whole witch hunt.
I usually use the ICC version also but thought I'd see if the GCC worked the same.
Thanks for the explanation.
Is there a way to limit this so the swap file stays inactive?

bwkaz
08-19-2003, 09:23 PM
You can turn swap off... but then your kernel will just kill DF (if it's what's using the memory).

The kernel FS cache starts at nothing, and grows from there, as you use the filesystem. So that might be causing this behavior, depending on what you're monitoring.

And swap being used isn't necessarily bad, either. If it's being used a lot it is, but occasionally needing to page out some memory that hasn't been accessed in a while (and page in other memory that was just accessed) is no big deal, as long as it doesn't happen a lot. Is your disk getting hit hard once this swapping starts?

What were your VSZ and RSS numbers for foldtrajlite?

cygnussphere
08-20-2003, 12:20 AM
Originally posted by Keller
You don't lose your work ...
BUT: if the last generation on your box has already been uploaded to the server and you try to upload it again, you receive this message (this can be caused by a server overload or an abruptly cancelled upload). If you delete the last generation in your buffer (-purgeuploadlist 1), the client no longer tries to upload this gen and proceeds with the next generation. It CAN be a solution but it NEEDN'T be, so make a backup of your directory(s) first.

update

1 boxen uploaded after 1x .\foldtrajlite -purgeuploadlist 1 :thumbs:
1 boxen failed to upload after 300x .\foldtrajlite -purgeuploadlist 1
total lost file sets: 650. :bang:
I finally gave up. I :dunno: what is happening when these "generations" get kidnapped, but it seems bizarre that the proggie can continue to build filesets when this situation exists.

Thanks again keller... ( I see the Spaten is all gone :D )

:tempted:

:cheers:

Brian the Fist
08-20-2003, 11:39 AM
Originally posted by Chaser
I have had the following error today:





I also rared the whole directory (about 7 mb)

I nearly finished my first set of this protein :/
after trying the purgelist option (without success), i started at zero

I hope my info was of some help to you!

running the following options: .\foldtrajlite -f protein -n native -rt -g 1
running xp pro with 512 mb ddr

Chaser

Is there somewhere I could grab the RAR from? (reply to [email protected] so you don't have to publicize it)

tpdooley
08-20-2003, 07:13 PM
I got this from the latest client on a machine I was testing out overnight..

========================[ Aug 20, 2003 2:49 PM ]========================
ERROR: [000.000] {foldtrajlite2.c, line 4504} File .\myhandle_0_myhandle_protein_27_0000038_min.val is corrupt, missing or has been tampered with; cannot continue - replace file and start again, or manually delete filelist.txt
ERROR: [000.000] {foldtrajlite2.c, line 4616} Error during upload: Data file checksum failed


will post more detail in the bugs forum

Chaser
08-21-2003, 05:31 AM
Yes, I could upload it to the webspace of a good friend of mine.
However, I can only upload it on Sunday.
Is that too late? If not, I'll upload it then and send a mail... (I would be happy if you, or Monika, or whoever, could send me an email when you have received the archive.)

Alternatively, I could also zip it?!

cya :cool:

Brian the Fist
08-21-2003, 02:50 PM
Sure, just send us a mail with the link when it's there.

RandomCritterz
08-22-2003, 06:53 AM
Here's an odd error message from one of my off-line herd:
ERROR: [001.001] {trajtools.c, line 3512} Unable to open trajectory distribution file handle_protein_251.trj
FATAL ERROR: [002.003] {foldtrajlite2.c, line 5307} Unable to read trajectory distribution handle_protein_251, please create a new one
Generation 251? :confused:

Hmmm, I wonder if the 39 C temperature in the computer room had anything to do with it. :)

AMD_is_logical
08-22-2003, 09:34 AM
Originally posted by RandomCritterz
Generation 251? :confused:

I got that once with the last protein: http://www.free-dc.org/forum/showthread.php?s=&threadid=3615

That was also from an offline cruncher. It looks like a real (but very rare) bug.

RandomCritterz
08-22-2003, 10:39 AM
Originally posted by AMD_is_logical
It looks like a real (but very rare) bug.

Dang. On the theory that it was heat-related errors, I gave the machine a fresh folding directory.

FWIW, it's an Athlon box running a minimal, text-only Debian install; ICC client with switches -qt -if -rt -p5 -g50.

Brian the Fist
08-22-2003, 02:32 PM
Sounds like a bug, I'll add it to the list. It COULD be because you are using -g50 (and there are only 50 strucs/gen) but maybe not..

AMD_is_logical
08-22-2003, 03:09 PM
Originally posted by Brian the Fist
Sounds like a bug, I'll add it to the list. It COULD be because you are using -g50 (and there are only 50 strucs/gen) but maybe not..

I was using -g5 when it hit me.

Everything else was almost exactly the same as RandomCritterz. (Minimal text-only linux using a SuSE 8.0 kernel, running on an athlon node, ICC version of client, flags -rt -if -qt -p0 -g5 .)

GHOST
08-25-2003, 02:28 AM
six times in nine days i see this in my windows event viewer. i have had the ' foldtrajlite has encountered a problem and needs to close, please tell microsoft' a few times. i do not think that comes up every time though.



i got this from windows event viewer:

0000: 41 70 70 6c 69 63 61 74 Applicat
0008: 69 6f 6e 20 46 61 69 6c ion Fail
0010: 75 72 65 20 20 66 6f 6c ure fol
0018: 64 74 72 61 6a 6c 69 74 dtrajlit
0020: 65 2e 65 78 65 20 30 2e e.exe 0.
0028: 30 2e 30 2e 30 20 69 6e 0.0.0 in
0030: 20 66 6f 6c 64 74 72 61 foldtra
0038: 6a 6c 69 74 65 2e 65 78 jlite.ex
0040: 65 20 30 2e 30 2e 30 2e e 0.0.0.
0048: 30 20 61 74 20 6f 66 66 0 at off
0050: 73 65 74 20 30 30 30 34 set 0004
0058: 39 30 31 63 901c
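
In case it helps anyone else read these event viewer dumps, here is a rough Python sketch that turns the hex bytes back into text. It assumes the layout shown above (an offset, a colon, then up to 8 two-digit hex bytes per row, followed by the ASCII column); the function name is just something I made up.

```python
import string


def decode_hex_dump(dump):
    """Rebuild the text from a dump with rows like '0000: 41 70 70 ... Applicat'."""
    data = bytearray()
    for line in dump.splitlines():
        if ":" not in line:
            continue  # skip blank lines or anything that is not a dump row
        _, rest = line.split(":", 1)
        # Assumption: at most 8 byte values per row, as in the dump above.
        for token in rest.split()[:8]:
            if len(token) == 2 and all(c in string.hexdigits for c in token):
                data.append(int(token, 16))
            else:
                break  # reached the ASCII column at the end of the row
    return data.decode("ascii", errors="replace")


rows = """0000: 41 70 70 6c 69 63 61 74 Applicat
0008: 69 6f 6e 20 46 61 69 6c ion Fail"""
print(decode_hex_dump(rows))  # Application Fail
```

Decoding from the hex column avoids relying on the ASCII column, where runs of spaces tend to get squashed when the dump is pasted into a post.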

the second error in the event viewer is quite long so i won't post unless asked, but here is the short version:

fault bucket 59792471

if there is another place to look for info let me know.


this is a box for crunching that i check once a day.
this is my error log for that time period- starts with

========================[ Aug 17, 2003 12:05 AM ]========================

========================[ Aug 17, 2003 12:58 AM ]========================
ERROR: [000.000] {taskapi.c, line 576} Arguments must start with '-' (the offending argument #1 was: 'install')
FATAL ERROR: [000.000] {foldtrajlite2.c, line 6786} Missing/Invalid arguments. For usage, run program, with no arguments.

========================[ Aug 17, 2003 12:58 AM ]========================

has lots of 'failed to connect' in the middle
and ends with-

========================[ Aug 18, 2003 12:10 AM ]========================

========================[ Aug 19, 2003 2:23 AM ]========================
ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
ERROR: [000.000] {foldtrajlite2.c, line 4616} Error during upload: NO RESPONSE FROM SERVER - WILL TRY AGAIN LATER

========================[ Aug 22, 2003 1:10 AM ]========================
ERROR: [777.000] {ncbi_socket.c, line 1258} [SOCK::s_Connect] Failed pending connect to www.distributedfolding.org:80 (Unknown) {errno=No such file or directory}
ERROR: [777.000] {ncbi_connutil.c, line 801} [URL_Connect] Socket connect to www.distributedfolding.org:80 failed: Unknown
ERROR: [777.000] {ncbi_socket.c, line 1258} [SOCK::s_Connect] Failed pending connect to www.distributedfolding.org:80 (Unknown) {errno=No such file or directory}
ERROR: [777.000] {ncbi_connutil.c, line 801} [URL_Connect] Socket connect to www.distributedfolding.org:80 failed: Unknown

========================[ Aug 23, 2003 4:29 AM ]========================

========================[ Aug 25, 2003 1:44 AM ]========================

GHOST
08-25-2003, 04:41 AM
i have this on another box.

FATAL ERROR: [000.000] {foldtrajlite2.c, line 1462} Upload list has been tampered with, please delete filelist.txt and try again

date is august 16. probably a victim of the power failure. where i live we just had a 'bump'. i heard my computers turn off and restart instantly. had client running as service so i was not worried. did not look at it for a couple days.

filelist was blank. client would not purge when i tried purgeuploadlist 1.
when i deleted filelist and restarted, it made its 10,000, then its normal generation and upload, but left all the old files, like it does not see them.

this folder has been renamed and put off to the side.
running a fresh install.

Chaser
08-25-2003, 06:39 AM
@Howard
I uploaded the archive and sent a mail!

Have a nice week!

edit:

Your message

To: [email protected]
Subject: Error: Previous Generation Missing
Sent: Mon, 25 Aug 2003 05:46:08 -0400

did not reach the following recipient(s):

Elena Garderman on Mon, 25 Aug 2003 05:52:10 -0400
The recipient was unavailable to take delivery of the message
MSEXCH:MSExchangeIS:slri:EX



So, shit! I'm leaving on vacation now...
Oh, an idea.. I'll send you the email via PM :)

cya

edit2:
I couldn't PM you :(
I have now tried to send it to your personal mail address. I hope that this works... but I can't wait any longer....