server gone south



graeme
04-16-2005, 12:00 PM
The server is not responding to any queries as of 10:30 am on Sat. It's been a difficult week with communication problems, and I propose a weekend off from the chaos. I'll reset things first thing on Monday.

IronBits
04-16-2005, 12:04 PM
:cry: Sorry to hear that! Oh well, enjoy your week-end :cheers:

72CJ5
04-16-2005, 12:08 PM
ahhhhh rats!!! Thank you for posting though Graeme. :)

IronBits
04-16-2005, 01:36 PM
Need a free host to run this project on? Maybe that will help ;)

72CJ5
04-17-2005, 03:31 PM
Man.......... having some **severe** withdrawal pains here..............


:)

graeme
04-18-2005, 11:53 AM
It looks like a partially failed disk in the raid system was forcing constant attempts to resync the data. Things might be a little spotty today, but we'll have the bad disk out, get things going again, and have it replaced later this week.

em99010pepe
04-18-2005, 05:58 PM
Server is up and running.

Carlos

black_civic55
04-18-2005, 06:23 PM
I'm the only one with a positive amount of points and I'm not in the top 10 graph!!!

birdman2584
04-18-2005, 06:39 PM
i guess they took away a day or so worth of work...shucks...haha:bang:

graeme
04-18-2005, 09:08 PM
I'm not sure how the free-dc stats updated on the 17th. The machine was down all day, so there were no work units completed. Work units were lost between midnight on the 16th and 10:30 am.

Bok
04-18-2005, 09:27 PM
They didn't... 'yesterday' is a misnomer actually, it's just the last day the stats changed :/

AMDave
04-19-2005, 04:39 AM
Thanks graeme.

It sounds like time well spent.

sir spuddly buddly
04-22-2005, 07:47 AM
Seems like the server is sleeping again.....:(

graeme
04-22-2005, 10:11 AM
There was a power outage on campus. Things will be back to normal in an hour.

AMDave
04-24-2005, 08:19 AM
graeme,

I now see the server has been offline for more than 12 hours.
I recall you were going to do some testing.
How is it going ?
Will it be back up tomorrow ?

em99010pepe
04-24-2005, 04:53 PM
Server is up.

Omer, YGPM.

Carlos

graeme
04-24-2005, 05:42 PM
If it's not one thing, it's another. The server was being moved to a UPS to avoid power outages. Things were looking good at first until the ethernet line was accidentally pulled out. It's been reconnected.

AMDave
04-27-2005, 08:22 AM
graeme,

Time for some feedback ... The last couple of changes seem to have done the trick. I/O errors are down, sleep states are shorter and the WU return rate is up.

Great job !

I notice we are still dealing with the larger WU, so the overall results returned rate is still lower than it used to be. As I cannot see it from this end, I thought it pertinent to ask: Has the overall growth calculation rate come back up to / beyond previous levels for the project ?

graeme
04-27-2005, 09:51 AM
I'm not sure how close to the limit we are now. The machine is connected on a new port, and the load is lower both due to the longer work unit and maybe fewer clients with all the down time we have had. I guess it would be good to test the limit, but for now I'm enjoying the calm.

AMDave
04-29-2005, 09:44 AM
vgriffin seems to have the engines for stress testing though !
My word ! :notworthy:

sir spuddly buddly
04-30-2005, 03:07 PM
Can someone please kick the server again?! :(

sir spuddly buddly
04-30-2005, 04:02 PM
Thank you! :)

KWSN_Dagger
05-01-2005, 11:36 AM
Looks like she's down again. Almost 24 hrs now.

"Could not connect to MySQL:Too many connections" as reported on the main site when I tried to look at the stats. Might be why the stats on the Free DC page aren't updating.

graeme
05-01-2005, 12:24 PM
Thanks, I've restarted mysqld. The stats should update properly.

Mustard
05-01-2005, 02:11 PM
Originally posted by graeme
Thanks, I've restarted mysqld. The stats should update properly.


hmmm...... still not updating. :(

alok vaid
05-01-2005, 05:53 PM
The stats are back, and updating.

:|party|:

Alok

sir spuddly buddly
05-03-2005, 05:49 AM
There seems to be a lot more sleeping from the server recently. Do we need to buy some coffee to keep it awake, or is it suffering from narcolepsy, or is this a way to slow down the WUs? :(

AMDave
05-03-2005, 06:07 AM
I concur, sir spuddly buddly
Lots and lots of ZZZ

I noticed the slow down in results across the board.
It seems that we are down to 1/3 or even 1/4 of capable output again.

I just ran a couple of traceroute and ping tests.
--- (eonserver) ping statistics ---
34 packets transmitted, 33 received, 2% packet loss, time 33048ms
rtt min/avg/max/mdev = 237.498/1034.440/3888.186/1166.986 ms, pipe 5

The average rtt is over 1 second again, so the client is sleeping a lot.
the traceroute tests were inconclusive.
the slowdown seems to come and go every few seconds by the look of the full ping trace.

It could be as innocuous as a poor capacitor in a router or NIC somewhere.

V frustrating.
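
For anyone wanting to repeat this kind of check, here is a minimal Python sketch that pulls the packet loss and rtt figures out of a Linux-style ping summary. The function name and the 1000 ms "slow" threshold are assumptions for illustration only; they are not taken from the eOn client.

```python
import re

def parse_ping_summary(text):
    """Extract packet loss (%) and rtt stats (ms) from Linux-style ping summary text."""
    loss = re.search(r"(\d+(?:\.\d+)?)% packet loss", text)
    rtt = re.search(
        r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", text
    )
    if not loss or not rtt:
        raise ValueError("not a recognised ping summary")
    lo, avg, hi, mdev = (float(x) for x in rtt.groups())
    return {"loss_pct": float(loss.group(1)), "min": lo, "avg": avg, "max": hi, "mdev": mdev}

# The two summary lines quoted in the post above.
summary = """34 packets transmitted, 33 received, 2% packet loss, time 33048ms
rtt min/avg/max/mdev = 237.498/1034.440/3888.186/1166.986 ms, pipe 5"""

stats = parse_ping_summary(summary)
# Hypothetical threshold: treat an average round trip above 1 second as "slow".
slow = stats["avg"] > 1000.0
```

The large mdev next to a modest min is what suggests the delay comes and goes rather than being constant.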

sir spuddly buddly
05-03-2005, 06:33 AM
Thanks for running the check! :)
Something's amiss, and it started getting slower on Saturday after the long stop. Ho hum...it's a bit hard to get people interested in eOn when it dozes off every 5 minutes! ;)

AMDave
05-03-2005, 08:14 AM
after a couple more tests I can confirm that
the server itself is up,
the web server is up,
the database server is up,
the tcp/ip delays are intermittently clearing,
but we are still getting no WUs.

Could alok / graeme restart the WU splitter please ?

N.V.M.
05-03-2005, 07:59 PM
this project is reminding me of the last 6 months of Distributed Folding. :bs:

AMDave
05-04-2005, 10:36 AM
The project is trotting along again.
Thanks to the project Admins.
Well done chaps.

sir spuddly buddly
05-05-2005, 01:44 AM
I still seem to be getting sleep messages quite often when I try to return work and get new work. Is there some ongoing problem? :(

KWSN_Dagger
05-05-2005, 02:53 AM
I've seen up to 8 sleeps of 60 seconds each for a single WU

sir spuddly buddly
05-05-2005, 03:04 AM
:sleepy: it seems to.....:sleepy: have ground.....:sleepy: to a halt....:sleepy:
Ni!
(edit - some 15 minutes later - it seems we're back underway again :) )

rcoulter
05-07-2005, 07:41 AM
Well, the server is down again. Needs a reboot. Down almost all of Friday night and Saturday morning from what I see.

Randy

AMDave
05-07-2005, 08:02 AM
the server is up
the database is up
the web service is up
the WU splitter has stopped

could there be a pattern here ?
It seems the WU splitter goes into decline until it stops or kills something else on the server. In my mind I get the impression of a malloc leak or something like that.

graeme,
just before you restart the service can you / omer perhaps check the memory availability and the memory usage in this frozen state. Then you could compare with the same just after a cold-reboot in a running state. It's just an itch but it'll feel good if it's scratched.
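
A rough sketch of that before/after comparison in Python, assuming the server is a Linux box exposing /proc/meminfo (the thread does not say what it runs). The function names and the choice of MemFree and SwapFree are illustrative assumptions.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: kilobytes} dict."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values

def memory_delta(frozen, fresh, fields=("MemFree", "SwapFree")):
    """Return fresh minus frozen (in kB) for each field present in both snapshots."""
    return {f: fresh[f] - frozen[f] for f in fields if f in frozen and f in fresh}

def read_snapshot(path="/proc/meminfo"):
    """Take a snapshot on a Linux host; call once while frozen, once after reboot."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

If the leak theory holds, the frozen snapshot should show far less free memory and free swap than the one taken just after the reboot.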

sir spuddly buddly
05-14-2005, 04:39 AM
The time is approx 3.40 Texas time and I'm beginning to see ZZZ messages with most of the recent WUs. Is something amiss? :(

em99010pepe
05-14-2005, 05:02 AM
I'm also getting those ZZZZZZZZZZZZZZZZZZZZZZZZZZZ.
It's funny but that always happens on Saturdays.

Carlos

Mustard
05-14-2005, 01:14 PM
I've noticed the same thing too Carlos. And now I'm getting lots and lots of the send message failed stuff along with lots and lots of zzzzzzzzzzzzzz

em99010pepe
05-14-2005, 01:17 PM
Funny but I restarted eOn and now everything is running well. No zzzzzzz.

Carlos

sir spuddly buddly
05-19-2005, 07:48 AM
From about 06.30 Texas time lots of ZZZ - time is now 06.50. :(

sir spuddly buddly
05-19-2005, 07:59 AM
And the minute I post, it all starts up again....:)
( EDIT - for 5 minutes only! :mad: )

rcoulter
05-19-2005, 09:06 AM
It's 9:11 EDT and we are down. Think you need to kick the server.

Randy

AMDave
05-20-2005, 06:40 AM
graeme

we aint gittin' no work from the server.

the server is up
the web service is up
the database is up
it aint the day of the week
or the net traffic
or the ups
or the raid array
or the NIC / router / gateway

The WU splitter seems to have gone splat again :stretcher:

I know it is a work-around as opposed to fixing the problem, but can you cron a wu splitter restart every 24 hours ?

If that works we could go back to the shorter WU's because the availability would be back up again.


it has been over 24 hours since the start of the current issue, so I sent an email to admin

I can add...
after yet a couple more tests I can be more specific.
"wu splitter" is not quite accurate.
killing off the current stalled clients & restarting them mostly results in a successful connection and receipt of a new wu.
It is only when the completed return is required that the return stalls.

Hence it is not actually the wu splitter that is failing here, but specifically the "WU Collector" that is failing to accept the completed WUs.

As the I/O at the network levels and application level on the server is apparently clean for the WU splitter, this would place the problem squarely in the code of the WU collector service in the application layer.

My suggested cron scheduled restart may treat the symptom in the meantime, but perhaps a code review of the wu collector might prove worthwhile for the long term.
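
As a variation on the scheduled restart, here is a hedged Python sketch of a watchdog that cron could call: it restarts the collector only if no completed WU has been accepted for a while. Everything here is hypothetical, including the function names, the one-hour idle limit and the restart command, which the caller has to supply since the real service name is not known from this end.

```python
import subprocess
from datetime import datetime, timedelta

def collector_is_stalled(last_result_at, now, max_idle=timedelta(hours=1)):
    """True if no completed WU has been accepted for longer than max_idle."""
    return now - last_result_at > max_idle

def restart_if_stalled(last_result_at, now, restart_cmd, run=subprocess.run):
    """Run restart_cmd (a hypothetical command list) only when the collector looks stalled.

    `run` is injectable so the logic can be exercised without touching a real service.
    """
    if collector_is_stalled(last_result_at, now):
        run(restart_cmd, check=True)
        return True
    return False
```

Where the timestamp of the last accepted result comes from (database, log file) is left open, because that depends on server internals not visible in this thread.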

AMDave
05-20-2005, 11:41 AM
It's back.:cool:

graeme
05-20-2005, 01:33 PM
I didn't realize how bad things were getting. I was watching the server yesterday. There were a few errors, but I didn't feel it was catastrophically bad. I've reset it now.

Your comment that a newly started client is more likely to get through than a client reporting results is really interesting. I didn't realize this, and it might help point to a problem section in the code.

Omer is working on restructuring the code right now. Actually, it is not the communication code he is working on, but rather the eon server code. There is a small memory leak in the server which eventually causes the machine to swap, and this is when the communication problems get bad. He is adding proper destructors to classes, and during the process, I hope the leak will be resolved.

Another change is that the work unit time is going to increase again. We are going to do a few runs on a Mg surface, and then move to diffusion on MgO. This ionic material has longer ranged (Coulomb) interactions which are evaluated using a Fourier transform. A side benefit of this system is that the communication to the server will be less frequent.

As we start working on these different systems, new problems will likely show up, but I think it will be worth the trouble.

sir spuddly buddly
05-20-2005, 02:16 PM
Thanks for the update!
The problem doesn't seem to be the frequency of communication, but the lack of response by the server. I hope this means the "Saturday" problem has been sorted out too, we'll see tomorrow!
Ni! :)

Mustard
05-22-2005, 03:43 PM
Sigh...................................... we aren't getting results tallied again............ :(

graeme
05-22-2005, 03:51 PM
thanks, I think we're good again

Mustard
05-22-2005, 04:19 PM
Rough life huh Graeme, living around and having to deal with a herd of stats whores????? :)

Mustard
05-26-2005, 12:08 AM
It appears that the stats are not incrementing again............. :(

Mustard
05-28-2005, 01:45 AM
Getting a lot of timeouts again......... :(

rcoulter
06-06-2005, 03:24 PM
The server is hung again. Time is 15:30 EDT on June 6th.

Randy

graeme
06-08-2005, 12:56 AM
Thanks for the posts. Andreas did restart the server when we saw your last post. Switching to a different system has been relatively painless. This system seems to produce even faster work units than the last one. Just a warning, though, the next one we get going will be much slower. We are moving to metals on ionic solids, which is important for making efficient catalysts, and for hydrogen storage. The ionic potentials are much more expensive because the interaction between charged species is very long ranged, and all atoms interact with all other atoms in our simulation cell. I'll post before we make that change.
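
To give a feel for why "all atoms interact with all other atoms" is expensive, here is a minimal Python sketch of a direct pairwise Coulomb sum for a small, non-periodic cluster. This is an illustration only and not the eOn code: the function names and the unit constant k are assumptions, and a real periodic simulation cell would need something like the Fourier-transform approach mentioned earlier in the thread rather than this brute-force loop.

```python
import math
from itertools import combinations

def coulomb_energy(charges, positions, k=1.0):
    """Direct pairwise Coulomb sum: E = k * sum over i<j of q_i * q_j / r_ij."""
    energy = 0.0
    for i, j in combinations(range(len(charges)), 2):
        r = math.dist(positions[i], positions[j])  # Euclidean distance (Python 3.8+)
        energy += k * charges[i] * charges[j] / r
    return energy

def pair_count(n_atoms):
    """Number of distinct atom pairs, which grows as n^2."""
    return n_atoms * (n_atoms - 1) // 2
```

With no distance cutoff, doubling the number of atoms roughly quadruples the number of pairs to evaluate, which is the cost that short-ranged potentials avoid.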

Mustard
06-08-2005, 01:13 AM
That was a nice post Graeme! :) It is nice to know why the next series will be slower, and what will be accomplished with the effort. Too bad more projects don't pass that kind of information on to their volunteer crunchers. Lets us know that you think enough about our help to keep us in the information loop. You have some volunteers here that are really interested in your project. Thank you very much! :)

Bruce

black_civic55
06-08-2005, 02:56 AM
I for one am very interested and appreciate all of the info we receive from you guys!! Thanks! :thumbs: