FATAL Error didn't kill client

Paratima
08-06-2002, 08:05 AM
Got the following in error.log this morning:

FATAL ERROR: [001.010] {randwalk.c, line 990} Unable to find atom 2HD1 in dictionary entry 31

Unfortunately, it didn't kill the client process; it just hung the program. Therefore, my keepalive daemon didn't know to restart it.

Paratima
08-07-2002, 06:51 PM
Originally posted by Paratima
Got the following in error.log this morning:

FATAL ERROR: [001.010] {randwalk.c, line 990} Unable to find atom 2HD1 in dictionary entry 31

Unfortunately, it didn't kill the client process - it just hung the program. Therefore, my keepalive daemon didn't know to restart.

Hello? Anyone there? If it gets an error severe enough to be called "Fatal", the program should "die", yes? No?
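What Paratima is asking for is the usual convention: once an error is serious enough to be called fatal, report it and exit with a non-zero status, so whatever supervises the process can see that it is gone. A minimal shell sketch of that convention (the function name "fatal" is hypothetical; the real client is written in C and this is not its code):

```shell
# Hypothetical helper illustrating "fatal means exit".
fatal() {
    # Report the problem on stderr (a real program might also append it
    # to its error log)...
    printf 'FATAL ERROR: %s\n' "$*" >&2
    # ...then terminate with a non-zero status instead of hanging, so a
    # keepalive job can notice that the process is no longer running.
    exit 1
}
```

Calling fatal "Unable to find atom 2HD1 in dictionary entry 31" in a script ends that script immediately with exit status 1.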

IronBits
08-07-2002, 08:31 PM
Yes :D

Kosh
08-08-2002, 12:45 AM
========================[ Aug 6, 2002 1:27 AM ]========================
ERROR: [001.000] {crease.c, line 187} Internal inconsistency during crease energy calculation
ERROR: [000.010] {foldtrajlite.c, line 4322} An error occurred in CalcCreaseEnergy

I got this, and I definitely had a problem where foldit was still listed as a process but wasn't using any CPU time. I'm not sure if this error message is related to the problem; maybe it was just the OS rejecting its new kernel :p. I didn't pay much attention to it until I saw your post. Of course, mine was just a regular error, so I guess the program is allowed to remain in a coma.

BTW, what do you use as the keepalive daemon?

wirthi
08-08-2002, 10:10 AM
Perhaps you could tell your keepalive daemon to restart the client when there has been no change in filelist.txt or progress.txt for some reasonable time...

Or let it restart when the client stops using all of the available CPU time (meaning it is no longer working at full speed). Of course, you have to take care that the client isn't interrupted during an upload or an upgrade...
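wirthi's first suggestion can be sketched as a function that reports whether progress.txt has gone unmodified for too long. This is an illustration under assumptions: it relies on the -mmin test of GNU or BSD find (not POSIX), it looks only at progress.txt rather than filelist.txt, and the 60-minute default is an arbitrary placeholder, not a value recommended by the client's authors.

```shell
# Usage: progress_is_stale DIR [MINUTES]
# Exit status 0 if DIR/progress.txt is missing or older than MINUTES
# minutes (default 60, an arbitrary placeholder).
progress_is_stale() {
    dir=$1
    minutes=${2:-60}
    # A missing progress file also counts as "no progress".
    [ -f "$dir/progress.txt" ] || return 0
    # find prints the path only if the file was last modified more than
    # $minutes minutes ago (-mmin is a GNU/BSD extension).
    [ -n "$(find "$dir/progress.txt" -mmin +"$minutes")" ]
}
```

A keepalive script could then restart the client when progress_is_stale /home/distribfold 60 succeeds; as wirthi warns, it should first make sure the client is not in the middle of an upload or upgrade.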

Paratima
08-08-2002, 07:32 PM
Hope you're not expecting anything too slick. I go out of my way to keep it simple. :D

This is Linux, where you can do almost anything with scripts. First, the crontab. I built a file called mycron in my home directory:

00 * * * * /home/les/dofold
15 * * * * /home/les/dofold
30 * * * * /home/les/dofold
45 * * * * /home/les/dofold

"crontab mycron" loads it up. The fields are minute, hour, day of month, month and day of week, followed by the command to run, which here is "dofold".
With the minute set to 00, 15, 30 and 45, this checks on folding every 15 minutes.
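The four entries can also be collapsed into one. A comma-separated minute list is standard crontab syntax; the */15 step form is an extension supported by Vixie cron and its descendants (such as cronie), so check man 5 crontab on your system before relying on it. Only one of the two lines should be active, otherwise dofold runs twice each time:

```
# minute      hour  day-of-month  month  day-of-week  command
0,15,30,45    *     *             *      *            /home/les/dofold
# Step form, equivalent on Vixie-derived crons:
# */15        *     *             *      *            /home/les/dofold
```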

"dofold" looks like this, and is marked executable with chmod:

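#!/bin/sh
cd /home/distribfold || exit 1  # This is where the client lives...
if [ -e foldtrajlite.lock ]     # lock file present: assume the client is running
then
    exit 0
else
    exec ./foldit               # start the client (replaces this shell)
fi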

All it does is look for the lock file. If it doesn't find it, it tries to start folding. Now, there's lots more neat stuff that you CAN do, like checking when files were last updated, as wirthi mentioned. But simple is how I like it best.

Now my real beef with the client, HOWARD or ELENA, is that the blasted thing FAILED and DIDN'T DIE.
And I think that should be corrected, which is why I started this thread in the first place.

And no, Kosh, the client shouldn't "go into a coma", ever. Do, or do not; there is no "try". -Yoda

Kosh
08-08-2002, 10:24 PM
Right, cron... I've heard of it but haven't played with it.

I was originally thinking of making a script with an endless loop (while true; do ... done); this is much nicer. Thanks!

Right now what I have running on startup is:
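# A lock file left behind (e.g. after a crash): remove it and run
# foldtrajlite once with these flags before restarting.
if [ -e foldtrajlite.lock ]; then
    rm foldtrajlite.lock
    ./foldtrajlite -f protein -n native -ut
fi
./foldit &
sleep 5   # give the client a moment to start (or fail)
# Warn if no process with "fold" in its name is running.
if [ -z "$(ps -A | grep -i fold)" ]; then
    echo "error in foldit startup"
fi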

Searching the ps output for a process named "fold" might catch the case where the lock file exists but the client has died. I hope this is of some use to you; your suggestion was certainly of use to me.
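The lock-file check and the process check catch different failures, so combining them distinguishes three states. A sketch, assuming (this is not confirmed in the thread) that the running client appears in the process list under the name foldtrajlite and that a crashed client leaves its lock file behind; it uses pgrep from procps instead of ps | grep. Neither check can detect a client that is still running but hung, which is the case that started this thread.

```shell
# Usage: fold_state DIR  -> prints "running", "stale-lock" or "stopped"
fold_state() {
    if [ -e "$1/foldtrajlite.lock" ]; then
        # -x: the process name must be exactly "foldtrajlite" (assumed name).
        if pgrep -x foldtrajlite >/dev/null 2>&1; then
            echo running
        else
            # Lock file without a process: the client probably died.
            echo stale-lock
        fi
    else
        echo stopped
    fi
}
```

A cron job could start the client on "stopped" and, on "stale-lock", either clear the lock and restart as Kosh's startup script does or simply alert the administrator.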

Paratima
08-08-2002, 10:30 PM
Thanks, Kosh! Don't mind if I do!! :thumbs:

Brian the Fist
08-09-2002, 11:25 AM
Paratima, I think you had best calm down first. :smoking:
FATAL errors are not to be taken lightly. They should NEVER occur, and if they do, something is seriously wrong. In your case, that error suggests corruption of bstdt.val or skel.prt, or perhaps a full disk. There is also a small chance that a random memory read error occurred just by fluke.

Either way, a FATAL error generally means it cannot continue, until some action is taken by you, the computer administrator. Thus restarting automatically would be unwise in this case.
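Following that advice, a keepalive script could refuse to restart automatically once a FATAL error has been logged, leaving the problem for the administrator. A sketch, assuming error.log sits in the client directory (the thread does not say where it lives) and that fatal entries always contain the text "FATAL ERROR", as in the message quoted at the top of the thread:

```shell
# Usage: safe_to_restart DIR
# Exit status 0 if DIR/error.log records no FATAL error.
safe_to_restart() {
    log="$1/error.log"
    # No log at all means nothing fatal has been recorded.
    [ -f "$log" ] || return 0
    # Ordinary "ERROR:" lines are allowed; only "FATAL ERROR" blocks a restart.
    ! grep -q 'FATAL ERROR' "$log"
}
```

A script like dofold could call safe_to_restart /home/distribfold before exec ./foldit and exit otherwise. After fixing the cause (for example by re-downloading corrupt files), the old FATAL line has to be cleared from error.log, or the check will keep blocking restarts.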

If you continue to receive this message, try re-downloading to ensure the files are not corrupt, and if it still fails, let me know.

As for the crease energy error, this one should in theory never occur either. That one would not freeze up the program, though, as it is not fatal. Again, if this error occurs more than once, please let me know, as I'd like to figure out how it is possible. Any other error messages not already mentioned in the FAQs or 'known bugs' on the web site should also be reported to us ASAP so we can deal with them. Thanks.

Paratima
08-09-2002, 06:35 PM
Om...mani padme...om. :D

Never happened before or since. I will notify you if/when it does again.