Clients getting stuck



PY 222
11-01-2005, 04:03 PM
Alright people, I am not a happy camper.

So far I've put R@H on 9 machines using the optimized client that Bok compiled. 4 of them have died on me since Oct 30th, and I had to restart BOINC to get them running again.

Has anyone had this problem?

I also read that BOINC will re-benchmark the client every 5 days or so. Could this be the reason?

Bok
11-01-2005, 04:11 PM
Make sure you have BOINC set to stay in memory in your general preferences; that might be the problem. There have been known problems when benchmarks run and the application is switched out of memory.

I've had no problems except on a couple of machines which give problems with most projects (they need disassembling and re-assembling).

Is this Linux or Windows? Anything in the logs?

Bok

PCZ
11-01-2005, 04:29 PM
Yes

I have that problem on my linux boxes.
The benchmarks run and Rosetta stops.

Boincview shows all well because boinc is still running. :swear:

I have to nurse my linux boxes through the benchmarking every 5 days.
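
A minimal sketch of a watchdog for this situation, where the BOINC client is still up but the science application has gone away. The process-name prefixes `boinc` and `rosetta` are assumptions, so check what `ps` actually shows on your own boxes before relying on them.

```python
import subprocess


def science_app_missing(process_names, client_prefix="boinc", app_prefix="rosetta"):
    """Return True when the client is running but no science app is.

    The default prefixes are assumptions, not names confirmed in this thread.
    """
    names = [name.strip().lower() for name in process_names]
    client_running = any(name.startswith(client_prefix) for name in names)
    app_running = any(name.startswith(app_prefix) for name in names)
    return client_running and not app_running


def current_process_names():
    """Return the command name of every running process (Unix `ps`)."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling `science_app_missing(current_process_names())` from cron every few minutes would flag a box that needs a BOINC restart. It only reports the condition; restarting is left to whoever runs the box.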

Bok
11-01-2005, 04:39 PM
PCZ,

do you have it set to stay in memory ? I don't get those problems at all on my boxen since I did that.

Bok

PY 222
11-01-2005, 04:45 PM
Bok, no, I haven't checked the logs yet. I still have two more stuck machines at a different location without SSH, so I'll have to walk to that facility to check them out. I'll let you know more when I get around to that.

PY 222
11-01-2005, 05:37 PM
Confirmed. The other two clients were stuck because BOINC was running benchmarks.

I've since changed my General Preferences to leave the application in memory. I'll report back next week.

Bok
11-01-2005, 07:28 PM
hmmm,

I've been reading up on this and it appears that indeed this is the issue.

When running benchmarks, or switching applications, Rosetta can fail and require a restart.

Setting 'keep in memory' will fix this, BUT some versions of the boinc client ignore that when running the benchmarks and switch out of memory anyway.

I'll try and find or compile some definitive versions which don't do this.

Preferably I'd like to get some 5.x versions optimized, but I haven't found any on the net yet.

Bok

LAURENU2
11-11-2005, 10:01 AM
I have a stubborn client that will not download work. It has 256 MB of memory and a 2 GB HD with 1.2 GB free.
I get this when it tries to download:
11/11/2005 8:36:29 AM||Starting BOINC client version 5.2.6 for windows_intelx86
11/11/2005 8:36:29 AM||libcurl/7.14.0 OpenSSL/0.9.8 zlib/1.2.3
11/11/2005 8:36:29 AM||Data directory: C:\PROGRAM FILES\BOINC
11/11/2005 8:36:29 AM||Processor: 1 AuthenticAMD AMD Athlon(tm) XP 2400+
11/11/2005 8:36:29 AM||Memory: 223.32 MB physical, 150.00 MB virtual
11/11/2005 8:36:29 AM||Disk: 1.96 GB total, 1.18 GB free
11/11/2005 8:36:29 AM|rosetta@home|Computer ID: 60502; location: home; project prefs: default
11/11/2005 8:36:29 AM||General prefs: from rosetta@home (last modified 2005-11-08 20:46:28)
11/11/2005 8:36:29 AM||General prefs: no separate prefs for home; using your defaults
11/11/2005 8:36:29 AM||Remote control not allowed; using loopback address
11/11/2005 8:36:29 AM|rosetta@home|Deferring communication with project for 23 hours, 55 minutes, and 51 seconds
11/11/2005 8:43:07 AM|rosetta@home|Sending scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi
11/11/2005 8:43:07 AM|rosetta@home|Reason: To fetch work
11/11/2005 8:43:07 AM|rosetta@home|Requesting 17280 seconds of new work
11/11/2005 8:43:16 AM|rosetta@home|Scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
11/11/2005 8:43:16 AM|rosetta@home|Message from server: No work sent
11/11/2005 8:43:16 AM|rosetta@home|Message from server: (there was work but you don't have enough disk space allocated)
11/11/2005 8:43:16 AM|rosetta@home|Message from server: Not enough disk space (only 147.4 MB free for BOINC). Review preferences for minimum disk free space allowed.
11/11/2005 8:43:16 AM|rosetta@home|No work from project
11/11/2005 8:43:22 AM|rosetta@home|Deferring communication with project for 23 hours, 59 minutes, and 53 seconds

What am I doing wrong? :cry:

Bok
11-11-2005, 10:07 AM
Check the general preferences on your account on the Rosetta home page, especially the Disk and Memory settings.

fyi this is what I have



Disk and memory usage
Use no more than 4 GB disk space
Leave at least 0.05 GB disk space free
Use no more than 98% of total disk space
Write to disk at most every 600 seconds
Use no more than 95% of total virtual memory



I think it is probably the 'leave at least' setting which is giving you problems; try decreasing it.
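
A rough sketch of how the three disk preferences could interact. It assumes the space offered to BOINC is the smallest of the three limits, and that BOINC is not yet using any disk (`boinc_used_gb=0.0`); neither assumption is confirmed by the log above, and the function name is made up for illustration.

```python
def disk_available_to_boinc_gb(total_gb, free_gb, boinc_used_gb,
                               max_used_gb, min_free_gb, max_used_pct):
    """Estimate the disk space BOINC may still use, in GB.

    Assumption: the three preference limits combine as a simple minimum.
    """
    by_absolute_cap = max_used_gb - boinc_used_gb        # "Use no more than X GB"
    by_percent_cap = total_gb * max_used_pct / 100.0 - boinc_used_gb  # "no more than N%"
    by_min_free = free_gb - min_free_gb                  # "Leave at least Y GB free"
    return max(0.0, min(by_absolute_cap, by_percent_cap, by_min_free))


# Disk figures from the log above; boinc_used_gb is an assumed value.
host = dict(total_gb=1.96, free_gb=1.18, boinc_used_gb=0.0)

# Preferences as listed above: 4 GB max, 0.05 GB left free, 98% of total.
with_listed_prefs = disk_available_to_boinc_gb(
    max_used_gb=4, min_free_gb=0.05, max_used_pct=98, **host)

# Hypothetical stricter "leave at least" value, to show its effect.
with_large_min_free = disk_available_to_boinc_gb(
    max_used_gb=4, min_free_gb=1.0, max_used_pct=98, **host)

print(round(with_listed_prefs, 2), round(with_large_min_free, 2))
```

On a small disk like this one, the 'leave at least' term is the binding limit in both cases, which is why lowering it frees up the most room.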

Bok

LAURENU2
11-11-2005, 11:25 AM
Thanks Bok, that helped. It is working now :kiss:

MerePeer
11-11-2005, 12:33 PM
Nice thread... I thought about posting this issue earlier in the week when I saw a node seem to hang for a couple of hours, but then I convinced myself that Rosetta might have kicked back in (boinc was still running) if I had been more patient. Meanwhile, I had actually switched the in-memory setting back to not-allowed earlier in the week, because I saw on a 256MB PXE node that it seemed to be keeping both Rosetta and WCG in memory.
A bit of background: I was thinking it would be nice to have a backup BOINC project that would run ONLY when Rosetta wasn't. That doesn't seem to be supported by BOINC configurations yet, so what I did was set the WCG allotment to 1%, leaving 99% for Rosetta, and BoincView confirms this. I'm hoping that if Rosetta is taken down AND I run out of work, WCG will kick in full time. I also set the application-switch timer to 20 minutes so it wouldn't spend a full hour on WCG whenever it did fill out the 1% each day. But when I saw both in a "ps -ef", it seemed like the 'keep memory resident' setting was controlling that (?), so I changed it. Now it sounds like I'll need to revert to yes-in-memory until we can get this issue under control. I could just detach the node from WCG for now.
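
A back-of-the-envelope sketch of what a 99/1 split means per day. It assumes CPU time is divided strictly in proportion to the shares over 24 hours, which is a simplification of whatever the BOINC scheduler really does; the function name is made up for illustration.

```python
def daily_minutes_by_share(shares):
    """Split one day of CPU time across projects in proportion to their shares.

    Assumption: strict proportional split over 24 hours.
    """
    total = sum(shares.values())
    return {project: 24 * 60 * share / total for project, share in shares.items()}


minutes = daily_minutes_by_share({"Rosetta": 99, "WCG": 1})
for project, value in minutes.items():
    print(f"{project}: {value:.1f} minutes per day")
```

Under that assumption WCG is owed roughly a quarter of an hour per day, so a 20-minute switch interval covers its share in a single slice.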

:rolleyes:

PY 222
11-11-2005, 12:43 PM
I just saw one of my nodes get stuck again. This time it's not the benchmark issue but something else:


2005-11-10 16:52:38 [rosetta@home] Sending scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi
2005-11-10 16:52:38 [rosetta@home] Reason: To report results
2005-11-10 16:52:38 [rosetta@home] Requesting 0 seconds of work, returning 1 results
2005-11-10 16:52:44 [rosetta@home] Scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
2005-11-10 16:52:50 [rosetta@home] Deferring communication with project for 1 minutes and 55 seconds
2005-11-10 16:52:50 [rosetta@home] Deferring communication with project for 1 minutes and 55 seconds


It has been sitting there for the past 17 hours without doing anything. I know for a fact that it has workunits to crunch, and it had done its benchmarking a day before getting stuck.

So what else could be the reason?

Bok
11-11-2005, 01:05 PM
OS? BOINC version?

PY 222
11-11-2005, 01:10 PM
OS is RedHat 9.

The BOINC version is the optimized version that you gave me.

Bok
11-11-2005, 01:18 PM
hmm,

not sure,

can you get me the rest of the log ?

Or can I get on the box via ssh at all?

Bok

PY 222
11-11-2005, 03:16 PM
Not to worry Bok, I'll try to keep an eye on this box for now. If this crops up again next week, I'll let you know.

Angus
11-11-2005, 05:19 PM
@ PY222

Was it stuck on one of the 1% things, or just simply not doing anything?

I seem to still get about one of the "stuck at 1%" WUs a day, across the ~ 20 concurrent WUs being crunched - but all on Windoze boxes.

Since I've gotten the return_results_immediately flag working again, it's real easy to catch it by looking at the last connect times in the computers list on the Rosetta site. Any box not reporting for more than a couple of hours is suspect. Unfortunately BoincView isn't an option in my environment.
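
The "not reporting for more than a couple of hours" check can be sketched as below. The host names and timestamps are invented for illustration, and how the last-contact times get copied out of the computers list is left open.

```python
from datetime import datetime, timedelta


def suspect_hosts(last_contact, now, max_silence=timedelta(hours=2)):
    """Return the hosts whose last contact is older than max_silence, sorted by name."""
    return sorted(
        host for host, seen in last_contact.items() if now - seen > max_silence
    )


# Hypothetical data, standing in for the computers list on the project site.
now = datetime(2005, 11, 11, 17, 0)
last_contact = {
    "node-a": datetime(2005, 11, 11, 16, 40),  # reported 20 minutes ago
    "node-b": datetime(2005, 11, 11, 12, 5),   # silent for almost 5 hours
    "node-c": datetime(2005, 11, 10, 23, 52),  # silent since last night
}

print(suspect_hosts(last_contact, now))
```

The two-hour threshold only makes sense when results are returned immediately; with longer reporting intervals it would need to be raised.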

PY 222
11-17-2005, 06:03 PM
Not sure if this is the same problem, but here's one more:


2005-11-15 03:37:15 [---] Suspending computation and network activity - running CPU benchmarks
2005-11-15 03:37:15 [rosetta@home] Pausing result 1hz6A_abrelaxmode_random_length20_jitter02_omega_sim_aneal_17127_0 (removed from memory)
2005-11-15 03:37:15 [rosetta@home] Pausing result 1hz6A_abrelaxmode_random_length20_jitter02_omega_17413_0 (removed from memory)
2005-11-15 03:37:18 [---] Running CPU benchmarks
2005-11-15 03:37:26 [---] Failed to stop applications; aborting CPU benchmarks
2005-11-15 03:37:26 [---] Failed to stop applications; aborting CPU benchmarks
2005-11-15 03:37:27 [---] Resuming computation and network activity
2005-11-15 03:37:27 [---] request_reschedule_cpus: Resuming activities
2005-11-15 03:37:27 [---] ACTIVE_TASK_SET::check_app_exited(): pid 17307 not found
2005-11-15 03:37:27 [---] ACTIVE_TASK_SET::check_app_exited(): pid 17307 not found
2005-11-15 03:37:29 [---] ACTIVE_TASK_SET::check_app_exited(): pid 17306 not found
2005-11-15 03:37:29 [---] ACTIVE_TASK_SET::check_app_exited(): pid 17306 not found
2005-11-15 20:40:26 [rosetta@home] Sending scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi
2005-11-15 20:40:26 [rosetta@home] Reason: To report results
2005-11-15 20:40:26 [rosetta@home] Requesting 0 seconds of work, returning 4 results
2005-11-15 20:40:31 [rosetta@home] Scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
2005-11-15 20:40:36 [rosetta@home] Deferring communication with project for 3 minutes and 57 seconds
2005-11-15 20:40:36 [rosetta@home] Deferring communication with project for 3 minutes and 57 seconds
2005-11-16 13:37:35 [---] Suspending work fetch because computer is overcommitted.
2005-11-16 13:37:35 [---] Using earliest-deadline-first scheduling because computer is overcommitted.

So did the overcommit stop the client? :confused:
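
One way to narrow this down is to look for long silent stretches in the log. The sketch below assumes every entry starts with a `YYYY-MM-DD HH:MM:SS` timestamp, as in the excerpt above, and reports consecutive entries that are more than an hour apart. The function name and the one-hour threshold are arbitrary choices.

```python
from datetime import datetime, timedelta


def find_gaps(log_lines, min_gap=timedelta(hours=1)):
    """Return (before, after) timestamp pairs separated by at least min_gap."""
    gaps = []
    previous = None
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and stamp - previous >= min_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps


# Four entries taken from the log excerpt above.
sample = [
    "2005-11-15 03:37:29 [---] ACTIVE_TASK_SET::check_app_exited(): pid 17306 not found",
    "2005-11-15 20:40:26 [rosetta@home] Sending scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi",
    "2005-11-15 20:40:36 [rosetta@home] Deferring communication with project for 3 minutes and 57 seconds",
    "2005-11-16 13:37:35 [---] Suspending work fetch because computer is overcommitted.",
]

gaps = find_gaps(sample)
for before, after in gaps:
    print(f"silent from {before} to {after} ({after - before})")
```

On this excerpt, the first silent stretch begins right after the aborted benchmark, well before the overcommitted message appears. That is worth checking against the full log before blaming the overcommit.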