clients randomly stopping



statsman
03-20-2003, 12:37 PM
Ever since the Jan 31st version of the text client, which I run on a variety of Win 2k/XP systems, I have had the client finish a WU (5000 iterations), write the fold*.bz2 and filelist.txt files, and then just stop. There are no messages in the error.log file. All of these machines are on always-on Internet connections (some DSL, some cable), and the problem never coincides with any network outage. None of the machines are overclocked, and the client runs as a service on all of them.

I have tried stopping the service and/or removing the lock file, but the process will not die. I have to set the service to start manually, reboot the machine, and then run foldit.bat manually to transmit the unit. Then I stop foldit.bat with the 'Q' key, reset the service to start automatically, and restart the service. After that it will run and produce and transmit many units. I have over 35 machines running DF and this happens totally randomly.

In the past, if there was a problem transmitting (or any network problem at all), there would be an error.log entry and the client would just start another unit. It would retry all stored units at the end of each new unit. Something happened in the code between the Jan 6th build and the Jan 31st/Feb 25th builds. The new builds have never produced any error.log output, nor saved off a completed unit and gone on to a new one.

Please, Brian, if you know the fix for this, don't wait until the next protein changeover; please provide a fixed client sooner than that.

Brian the Fist
03-20-2003, 02:34 PM
Okay...
1st, how do you know it stops at the end of 5000 structures if it is running as a service - could it be stopping in random places?

2nd, are you running any tasks/doing anything that may periodically wipe the TEMP directory, or is your drive with the TEMP dir on it nearly full?

The most likely cause is that a fatal error has occurred. When this happens for the service, it can get locked into a state as you described, waiting for the user to press Enter (which you obviously cannot do when running a service).
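
A minimal sketch of the failure mode described above, written in Python rather than taken from the client's own code: a "press Enter to continue" prompt blocks forever when no console is attached, so one possible guard is to wait only when standard input is an interactive terminal. The function name `pause_on_fatal` and its parameters are hypothetical, and this is not necessarily how the beta addresses the problem.

```python
import sys


def pause_on_fatal(message, stdin=sys.stdin, stderr=sys.stderr):
    """Report a fatal error; wait for Enter only if someone can press it.

    Returns True if the function waited for input, False if it skipped
    the prompt because no interactive console was available.
    (Hypothetical helper for illustration only.)
    """
    print(f"FATAL ERROR: {message}", file=stderr)
    # A service typically has no console: stdin may be missing entirely
    # or may not be a terminal. Prompting in that case would hang.
    if stdin is not None and stdin.isatty():
        print("Press Enter to exit...", file=stderr)
        stdin.readline()
        return True
    return False
```

Run from a DOS box, such a guard still pauses so the message can be read; run as a service, it returns immediately so the process can exit and be restarted.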

I have fixed this particular problem in the current beta, so you may wish to try downloading it onto a few machines and see what happens (instructions are in the main thread of this forum).

However, it still doesn't explain why you are getting a fatal error - could this be related to question 2 above?

statsman
03-20-2003, 04:03 PM
1. The progress.txt file says that it has completed 5000 structures, the date and time of the fold*.bz2 and filelist.txt files correspond to the timestamp of the progress.txt file, and no new <handle>_protein_#######.val.bz2 file is created.

2. No, there are no processes that wipe the TEMP directory (at least none that I have put in), and all systems have several GB of free disk space.

Again, prior to the 1/31/03 download package, I never had this problem.
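
The timestamp check described in point 1 can be sketched as a small script. This is only an illustration under stated assumptions: it uses the file names mentioned above (progress.txt, filelist.txt, fold*.bz2), does not parse progress.txt because its format is not shown here, and the function name `looks_stalled` and both thresholds are hypothetical.

```python
import glob
import os
import time


def looks_stalled(client_dir, tolerance_s=120, idle_s=1800, now=None):
    """Heuristic: the unit's output files were written together, then nothing.

    tolerance_s: how close the modification times must be (assumed value).
    idle_s: how long with no further writes counts as stalled (assumed value).
    """
    now = time.time() if now is None else now
    paths = [
        os.path.join(client_dir, "progress.txt"),
        os.path.join(client_dir, "filelist.txt"),
    ]
    paths += glob.glob(os.path.join(client_dir, "fold*.bz2"))
    # Need both text files plus at least one fold*.bz2 archive.
    if len(paths) < 3 or not all(os.path.exists(p) for p in paths):
        return False
    mtimes = [os.path.getmtime(p) for p in paths]
    written_together = max(mtimes) - min(mtimes) <= tolerance_s
    idle = now - max(mtimes) >= idle_s
    return written_together and idle
```

Calling `looks_stalled(r"C:\path\to\client")` on each machine (the path is a placeholder) would flag directories where a finished unit has been sitting untouched, which is the pattern reported here.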

The problems I am having are VERY similar to the ones in the "01/31/03 build DF Win2k Service will not start folding without internet connection" thread, except I always have an Internet connection. I suspect that at random times, there is some minor hiccup in the networking code when transmitting the finished unit and the process just hangs.

HaloJones
03-20-2003, 04:37 PM
I have just installed 19 PCs with the text client installed as a service with DFGui to monitor them. Each is a Win2K Pro box. No other software is running. Overnight, fully half of the machines stopped with DFGui showing 5000 structures as if the box was trying to upload. The service will not stop. Running foldit.bat from a text box says that the server is down. This is with an always-on corporate connection and an identical proxy.cfg on each box.

After rebooting the box with the service set to manual, using DFGui to start the service does not work. It is necessary to uncheck the "Service" box in DFGui and save the config; then foldit.bat works.

All the machines are currently running with the text client running but not as a service. I will post as to how many are still running tomorrow morning.

Both my home machines have had more outages with this protein than ever before. Sometimes they just seem to stop...

There are no errors in any of the error logs.

EDIT: Ten minutes after posting this, my home W2K box decided to stop. I tried to restart it and it wouldn't, saying that it was already started. foldtrajlite.exe was running but there was no .lock file or accompanying .bz* files.

I could not kill foldtrajlite nor stop the service. Had to reboot.

While it was rebooting, the XP machine, which connects through the W2K box, finished 5000 and tried to upload. It didn't upload and it didn't time out either; it just sat there. I could not stop the service. Rebooting now.

Brian the Fist
03-20-2003, 05:15 PM
If you want to get this problem fixed please do what I asked.

Try the beta, and see if it has the problem or not. See the beta threads in the main info for links to get it and instructions to install (you may have to go back to the beta 1 or 2 thread to see the instructions; they haven't changed).

Secondly, or additionally, please try running it NOT as a service. Just for testing purposes. Run it with '-qf' (quiet mode off) so it runs in a DOS box. Then watch and see if the same thing happens, and if a specific error message appears IN the DOS box rather than the error.log. This may identify the problem for us, allowing me to fix it.

Let me know as soon as you've tried either of the above, and what you find. Thanks.

Insidious
03-22-2003, 04:04 PM
could this be related to question2 above?

Sounds the same to me :bang:

-Sid

m0ti
04-08-2003, 02:00 AM
This may be related to the following (not sure): I've had DF hang up when set to upload only, with the following error:

========================[ Apr 4, 2003 8:33 AM ]========================
ERROR: [010.003] {taskapi.c, line 1199} [ReadServerResponse] Timeout waiting for response, got 13002 chars.

========================[ Apr 4, 2003 8:36 AM ]========================

========================[ Apr 4, 2003 8:37 AM ]========================
ERROR: [010.003] {taskapi.c, line 1199} [ReadServerResponse] Timeout waiting for response, got 11561 chars.

========================[ Apr 4, 2003 8:39 AM ]========================
ERROR: [010.003] {taskapi.c, line 1199} [ReadServerResponse] Timeout waiting for response, got 11561 chars.
ERROR: [010.003] {taskapi.c, line 1199} [ReadServerResponse] Timeout waiting for response, got 213 chars.

========================[ Apr 4, 2003 8:46 AM ]========================
ERROR: [010.003] {taskapi.c, line 1199} [ReadServerResponse] Timeout waiting for response, got 0 chars.

========================[ Apr 4, 2003 8:49 AM ]========================
ERROR: [010.003] {taskapi.c, line 1199} [ReadServerResponse] Timeout waiting for response, got 213 chars.


The final one (with the 213 chars) appears the most. It didn't hang every time, it just started doing that (about once a day), and then stopped.
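
To check which timeout variant dominates across a whole error.log, the entries can be tallied with a short script. This is a rough sketch that assumes the log keeps the layout shown in the excerpt above (a bracketed timestamp header followed by ERROR lines); the function names are made up for illustration.

```python
import re
from collections import Counter

# Header lines look like: ====[ Apr 4, 2003 8:33 AM ]====
HEADER = re.compile(r"^=+\[ (.+?) \]=*$")
TIMEOUT = re.compile(r"Timeout waiting for response, got (\d+) chars")


def timeout_events(log_text):
    """Return (timestamp, chars_received) for each timeout line in the log."""
    events = []
    stamp = None  # most recent header timestamp seen so far
    for line in log_text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            stamp = header.group(1)
            continue
        timeout = TIMEOUT.search(line)
        if timeout:
            events.append((stamp, int(timeout.group(1))))
    return events


def chars_histogram(events):
    """Count how often each 'got N chars' value occurs."""
    return Counter(chars for _, chars in events)
```

Passing the contents of error.log to `timeout_events` and then calling `chars_histogram(...).most_common()` on the result lists the character counts by frequency, which makes it easy to confirm whether one value really does appear the most.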

Brian the Fist
04-08-2003, 11:32 PM
Can you tell me exactly what you've got there? Is this the beta? Can you tell me the steps you take to reproduce the problem, and describe what, if anything you see on the screen? I can't tell from your post.

m0ti
04-09-2003, 06:45 AM
No, that was the non-beta client, set with the -u t (upload only) switch and nothing else. After the error listed, DF would completely freeze; deleting foldtrajlite.lock does not cause it to exit, so the process has to be forcibly killed.

I was just going to say that it hasn't happened since the protein switchover... but it has:

========================[ Apr 8, 2003 7:27 PM ]========================
ERROR: [010.003] {taskapi.c, line 1199} [ReadServerResponse] Timeout waiting for response, got 0 chars.

Brian the Fist
04-09-2003, 09:53 AM
The code for the current client is 'frozen' as it will soon be replaced by the 'beta'. If you continue to experience this problem with the beta (if you have tried it) let us know, but we have fixed a number of issues in it which may include this one.

Grumpy
05-15-2003, 05:24 AM
Well, my dual machine has gone berko... the beta won't work and now the current client is doing the same... BUT... I got error messages :bang:

========================[ May 15, 2003 7:09 PM ]========================
ERROR: [-04.000] {rotlib.c, line 100} Bzip decompression of SCWRL library error occurred
FATAL ERROR: [001.008] {foldtrajlite.c, line 2646} Cannot open rotamer library

========================[ May 15, 2003 7:11 PM ]========================

And this

========================[ May 15, 2003 5:38 PM ]========================
ERROR: [001.023] {rotate.c, line 641} Invalid PMMD given to GetRMSD
ERROR: [001.010] {foldtrajlite.c, line 3385} Abnormally small RMSD: 0.000000 (157 1 1)

========================[ May 15, 2003 6:52 PM ]========================
ERROR: [000.000] {mmdbapi1.c, line 5018} Biostruc ASN.1 Internal Indexing Failure
Freeing Biostruc

========================[ May 15, 2003 7:06 PM ]========================



:help:

I have deleted and reinstalled countless times..I shall look through the posts for answers, but if you can give me a quick reply of "Your PC Sux" it will help :(

Update: Here is another

========================[ May 15, 2003 8:44 PM ]========================
ERROR: [001.023] {rotate.c, line 641} Invalid PMMD given to GetRMSD
ERROR: [001.010] {foldtrajlite.c, line 3385} Abnormally small RMSD: 0.000000 (157 1 1)

========================[ May 15, 2003 8:47 PM ]========================

========================[ May 15, 2003 8:55 PM ]========================

Darn, thought I had it beat :-

========================[ May 16, 2003 6:29 AM ]========================
ERROR: [000.000] {mmdbapi1.c, line 5018} Biostruc ASN.1 Internal Indexing Failure
Freeing Biostruc

Damn, looks like the RAM is stuffed...sigh :(