More work to maintain R@H using BOINC



PY 222
11-14-2005, 07:09 PM
This is not a beatdown but more of a :rant: on my part.

OK, I am not sure if it's Rosetta's client or the BOINC client (most likely BOINC), but it seems like I have spent more time babysitting 9 boxen on this project than the 180 boxen on FaD.

I understand that R@H is still a new client, and they have been very responsive in trying to resolve a few issues as quickly as they can, but I am seriously scared of deploying so many servers on this project, as I will be overwhelmed by the number of restarts I'll need to make on so many boxen.

I hope that they'll be able to get all the problems resolved before Dec 16.

For those of you that were from FaD, are you all experiencing this?

Again, this post is not meant as a flame towards the project, just an observation from a corporate pharmer.

black_civic55
11-14-2005, 07:17 PM
I just received my first frozen WU in like a month or so. So yeah, I guess there are still problems, but they seem to have gone down a lot.

Bok
11-14-2005, 07:45 PM
Are these still on Linux, PY?

I'm getting a few stopping after running CPU benchmarks still too.

How about I script something to check for the condition and auto-restart the client?

Or someone else could do this before I get to it...

Bok
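A minimal sketch of the kind of watchdog suggested above, in Python. Nothing here comes from BOINC itself: the stall test (CPU time that stops advancing between checks), the 10-minute interval, and the `read_cpu_seconds` and `restart` callables are all assumptions that would have to be filled in for a real box.

```python
import time
from typing import Callable


def is_stalled(cpu_before: float, cpu_after: float, min_progress: float = 1.0) -> bool:
    """True if the client gained less than min_progress CPU seconds between samples."""
    return (cpu_after - cpu_before) < min_progress


def watch(read_cpu_seconds: Callable[[], float],
          restart: Callable[[], None],
          checks: int,
          interval: float = 600.0,
          sleep: Callable[[float], None] = time.sleep) -> int:
    """Sample CPU time `checks` times after a baseline and restart on each stall.

    read_cpu_seconds and restart are hypothetical hooks supplied by the caller.
    Returns the number of restarts performed.
    """
    restarts = 0
    previous = read_cpu_seconds()
    for _ in range(checks):
        sleep(interval)
        current = read_cpu_seconds()
        if is_stalled(previous, current):
            restart()
            restarts += 1
            # A restarted process starts a fresh CPU counter, so re-read the baseline.
            current = read_cpu_seconds()
        previous = current
    return restarts
```

Passing the reader and the restart action in as functions keeps the stall logic separate from whatever platform-specific commands a given box needs.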

LAURENU2
11-14-2005, 08:22 PM
Yes, this project is not a set and send out client like FAD was,
and it does suck up a lot more resources.
I too have a few that get stuck and need a kick to get them going again :trash:

PY 222
11-14-2005, 08:22 PM
Yep... all on Linux.

I've got one client that died over the weekend and I've only just been able to get to it.

What I saw on the screen was "Client overcommitted" or something along those lines. I killed the client and restarted it, and it is now running fine.

Don't know what overcommitted means but it shouldn't have stopped the client. :(

Bok
11-14-2005, 08:28 PM
Hopefully we'll have optimized 5.3.1 clients in the next few weeks too. I think that will solve a lot of the problems on Linux.

Bok

Angus
11-14-2005, 09:44 PM
The freezing Rosetta WUs happen on Windows boxes as well.

I'm pretty sure it's a Rosetta client issue, since the other ready-for-prime-time BOINC projects I've run don't seem to have the problem. All bets are off for the alpha and beta projects.

I think your overcommitted message is the BOINC scheduler getting too smart. If it adjusts the average crunch time up as the result of a longer-than-normal WU, it assigns the new average crunch time to every WU in the queue, concludes that the rest of the queue has overcommitted the box, and quits asking for new work. As you continue to work through the queue, the overcommit status should go away.

The scheduler still needs work.
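A toy model of the behaviour described above, in Python. This is not BOINC's scheduler code: the function name, the fixed `window_hours` budget and the numbers are assumptions, chosen only to show how one slow WU can flip the whole queue to "overcommitted" and how draining the queue clears it.

```python
def is_overcommitted(queued_wus: int,
                     avg_crunch_hours: float,
                     window_hours: float,
                     cpus: int = 1) -> bool:
    """Assumed rule: every queued WU is charged the current average crunch time.

    The box counts as overcommitted when that estimate exceeds the hours
    available in the (hypothetical) scheduling window.
    """
    estimated_hours = queued_wus * avg_crunch_hours / cpus
    return estimated_hours > window_hours


# 20 WUs at a 3 h average fit in a 72 h window (60 h of work).
print(is_overcommitted(20, 3.0, 72.0))
# One long WU drags the average to 5 h: the same queue now looks like 100 h.
print(is_overcommitted(20, 5.0, 72.0))
# After six WUs finish, 14 x 5 h = 70 h fits again and work requests would resume.
print(is_overcommitted(14, 5.0, 72.0))
```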

Bok
11-14-2005, 10:06 PM
Actually, the Linux client never freezes a WU; that's a Windows-only problem (the 1% for many hours).

Bok

PY 222
11-14-2005, 10:21 PM
I just hope that the latest BOINC client and their server upgrades will help ensure a stable project for us all.

The last thing I want is to restart my BOINC clients during lunch. :jester:

PCZ
11-15-2005, 04:32 AM
PY

You are right, Rosetta needs a lot more babysitting than FAD.
It is getting better though.

To ease the pain of babysitting you can use task scheduler or cron depending on the OS.

Basically you restart boinc every 24 hrs.
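One way to sketch the restart step in Python, so that the same script can be called from cron or Task Scheduler once a day. The stop and start commands are deliberately left as parameters because they differ per install; the function name and the 60-second timeout are assumptions, and the sketch presumes the start command returns promptly (for example an init script) rather than staying in the foreground.

```python
import subprocess
from typing import Sequence


def restart_client(stop_cmd: Sequence[str],
                   start_cmd: Sequence[str],
                   timeout: float = 60.0) -> bool:
    """Run the stop command, then the start command.

    Both commands are supplied by the caller (hypothetical, install-specific).
    Returns True only if the start command exits with status 0.
    """
    # A failed stop is tolerated: the client may already have died.
    subprocess.run(stop_cmd, timeout=timeout, check=False)
    started = subprocess.run(start_cmd, timeout=timeout, check=False)
    return started.returncode == 0
```

The scheduler entry itself only has to run this script at a fixed time each day with the two commands for that box.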

ronbo54
11-15-2005, 09:33 AM
Windows only here running as a service. Yesterday I had 8000's at this time, today it is only in the 2000's. Still doing the work, just wasn't reporting in. Workload was set at 2 days. Is it now only going to automatically upload every 2 days? Wasn't the case for the last week.:confused:

n7vxj
11-15-2005, 09:44 AM
My Linux box appears to be stopping after running benchmarks as well, and I have to restart it. On my windoz boxes I force an update a couple of times a day to get completed work sent in.

Bok
11-15-2005, 10:14 AM
I do that with Windows too. There is a fix out there to allow return_results_immediately, though. I *think* it was with the trux clients too. IIRC, you just added the option to the remote_hosts.cfg file. I'll try and find it again, as it would help me too.

It might be worth testing with the 5.3.1 client albeit not totally optimized to see if it has the same problems. I'll set one that I compiled up off on a box and monitor it.

Bok

Angus
11-15-2005, 10:41 AM
The return_results_immediately option requires a patched client. BOINC removed that functionality back around the 4.2x time. The trux Windows 5.3.1 client restores the functionality.

If you can stand to read it, there's a thread on this (http://setiathome.berkeley.edu/forum_thread.php?id=21944) over on the SETI board.

PCZ
11-15-2005, 12:17 PM
Boinc developers know best.

Us volunteers are a lower form of life who couldn't possibly be intelligent enough to use the return_results_immediately switch properly.

I am so pleased that some higher form of life from Berkeley removed the functionality for me.


Now how do we get it back in the Linux client?

Digital Parasite
11-15-2005, 03:54 PM
Originally posted by PY 222
The last thing I want is to restart my BOINC clients during lunch. :jester:

If you let me borg your boxen, I will restart them at lunch for you. :smoking:

MerePeer
11-15-2005, 04:11 PM
Looking over threads as well as my node's message logs I find:

Results are always "uploaded" as soon as they are complete. What *sometimes* happens at that point is a request for "new work" which seems to always include ", and reporting nnn results". However new work is not requested unless the "work buffer" (boincview column on Projects tab) has fallen below the threshold on the Rosetta acct web page under "Connect to network about every ". So if you had 2 days of work, you processed a 1 hour job, then it would upload the 1 job and then since you only have 1 day, 23 hrs of buffered work it would immediately request new work -> and while it was at it, report the 1 result.

The other way that results are reported is if no results have been reported for the entire period you have set in "Connect to network about every", then it will go ahead and report after that period. I saw this today for a node which had gone 1.5 days (my current web setting) without "reporting", but which had consistently "uploaded" throughout that period: it reported all 13 results.

What seemed to happen recently is the calculations for "work buffer" got inflated somehow (not sure how). This meant that if a node thought it had 1.5 days in the "work buffer", it was now seen as 3.5 days. Thus no new work is going to be requested for a couple days while it reduces this work buffer -- until it goes below your threshold (in my case, 1.5 day threshold).

So it all seems to be working according to the rules, but I do think something odd with the work buffer calcs changed because I've NEVER had my cache set higher than 1.5 days yet I am seeing "work buffer" values across my nodes in the 2-5 day range. To research this further I have just reduced my web setting to 1 day, perhaps less tomorrow, to experiment.
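The two rules observed above can be written down as a small Python sketch. It models only what the message logs suggested, not BOINC's actual source; the function and parameter names are made up for illustration.

```python
from typing import Tuple


def contact_decision(buffer_days: float,
                     connect_every_days: float,
                     days_since_last_report: float,
                     results_ready: int) -> Tuple[bool, bool]:
    """Return (request_work, report_results) under the observed rules.

    Rule 1: new work is requested only when the work buffer is below the
            "Connect to network about every" setting; any finished results
            are reported along with that request.
    Rule 2: otherwise, results are reported once a full connect interval
            has passed without a report.
    """
    request_work = buffer_days < connect_every_days
    interval_elapsed = days_since_last_report >= connect_every_days
    report_results = results_ready > 0 and (request_work or interval_elapsed)
    return request_work, report_results


# 2-day setting, one 1-hour job just finished: buffer is 1 day 23 hrs.
print(contact_decision(2.0 - 1.0 / 24.0, 2.0, 0.0, 1))
# 1.5-day setting, buffer inflated to 3.5 days, 13 results waiting for 1.5 days.
print(contact_decision(3.5, 1.5, 1.5, 13))
# Same inflated buffer, only half a day since the last report: nothing happens.
print(contact_decision(3.5, 1.5, 0.5, 4))
```

Under this model an inflated work buffer delays both new work and reporting, which matches the 2-5 day "work buffer" values and the late reports described above.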

PCZ
11-15-2005, 04:55 PM
Forget all the technical mumbo jumbo.

return_results_immediately should do what it says on the tin.
It used to and now it doesn't.

If I use ver 4.45, all is fine: results are returned and reported at the same time.
Use any later version and BOINC reports when it feels like it.

The longer you set the connect-to-network interval, the worse it gets.
I use 0.1 days, so it doesn't really affect me too badly.

If I increase this to a few days to get a bigger buffer to see me through server outages, then I see pages and pages of results ready to report.

I can't speak for the rest of you, but I want results reported as SOON as they are done.

No amount of boinc developer bullshit would ever convince me otherwise.

Longbow
11-15-2005, 06:46 PM
Listening to all this makes me glad that I continue to stay away from BOINC. I was beginning to look at it seriously when all the Rosetta hype started building, but I keep coming back to the four things that keep me away: 1) problematic, 2) resource hoggish, 3) no options for sneaker-netting, and 4) I hate the points system.

PCZ
11-15-2005, 07:20 PM
Don't let the crappy boinc framework put you off doing Rosetta.

It's a very worthy project. :thumbs: