Stats Servers Borked Again?



n7vxj
08-04-2005, 11:33 PM
I just dumped between 800 and 1000 WUs today. I dumped them in time to make today's stats, but they didn't show up. Did the servers crap the bed again, or what? This is really getting old!!!!! :mad:

IronBits
08-05-2005, 01:02 AM
Wait and see what happens tomorrow.
I try to make sure I get them in there before 4:30 PM PST.
You could also use Bok's proxy server to keep an eye on how it's going all day long. ;)

IronBits
08-05-2005, 11:53 PM
http://stats.distributed.net/participant/phistory.php?project_id=8&id=447981
Date Blocks
04-Aug-2005 1,293
02-Aug-2005 1,396
:cheers:

n7vxj
11-03-2005, 10:44 AM
Damn stats must be screwed again!! I've dumped a couple of hundred WUs over the last 2 days, and none of them show up in the stats. :mad:

PY 222
11-03-2005, 12:34 PM
Originally posted by n7vxj
Damn stats must be screwed again!! I've dumped a couple of hundred WUs over the last 2 days, and none of them show up in the stats. :mad:


Don't worry, the D.net people are working on it right now:

http://www.free-dc.org/forum/showthread.php?s=&postid=94053#post94045

n7vxj
11-03-2005, 01:35 PM
Thanks for the update PY!! I was beginning to get a bit concerned!!

N.V.M.
11-16-2005, 09:13 AM
still down...:geezer:

the-mk
11-17-2005, 12:16 AM
Yes, and no new .plan (http://n0cgi.distributed.net/cgi/dnet-finger.cgi) available... :sleepy:

IronBits
11-17-2005, 12:48 AM
That's why it's important to use the pproxy server at
Free-DC.org on port 2064 ;)
Twice the stats, twice the fun! :D

the-mk
11-17-2005, 12:51 PM
Yeah, that's right :D

I've got some news from decibel's .plan (http://n0cgi.distributed.net/cgi/dnet-finger.cgi?user=decibel):

:: 17-Nov-2005 09:17 CST (Thursday) ::

Here's the situation so far with stats:

Thanks to poor driver support, we had been running for who knows how long with
3 failing drives in the raid10 array that housed the database. But that wasn't
actually what caused the outage... if a machine with an 8500 in it goes down
unexpectedly (think power failure), the controller can't trust the data on the
drives to be in-sync, so it needs to rebuild the array. Unfortunately, one
of the drives it picked to be authoritative was failing, and decided that it
wasn't going to give up its data.

Unfortunately we've been unable to recover the array. We tried using spinrite
as a last resort, but at the rate it was going it would have taken something
like a week to recover the drive. This means that when we get back online,
we'll be running from a stats backup taken Nov. 6, about 4 days before the
failure. Any changes made to participant accounts or teams in the meantime will
have been lost.

In an ironic twist of fate, we've been working on getting a new machine in
production that would have allowed replicating user-modifiable tables (ie:
participant accounts and teams) to another machine. Had that been in place we
would have lost very little, if any, of this data.

The current situation is that we've bought 3 new drives and used them to
rebuild the array. We've also taken this opportunity to upgrade to FreeBSD 6.0.
But now any time we try to access the array, the machine reboots.

Once someone is on-site to investigate we'll hopefully know more.

PY 222
11-17-2005, 12:59 PM
Damn... it's possible that we might lose some data.

Pray that we don't!

evilfix
11-17-2005, 01:52 PM
does that mean that all the WUs we sent in won't be accounted for?

PY 222
11-17-2005, 01:59 PM
Originally posted by evilfix
does that mean that all the WUs we sent in won't be accounted for?

It is possible that we might lose a few days of work, but let's hope that they can recover everything from the drives.

Mustard
11-17-2005, 07:48 PM
hmmmm..... time to start stock-piling results until they get this array issue fixed and backups running again.

alpha
11-18-2005, 08:40 AM
Glad to hear they are upgrading to FreeBSD 6.0 which offers better performance than previous versions, not to mention the bug/security fixes and other updates.

I hope they will reliably re-issue the work units (packets/stubs) which were issued whilst this problem has been ongoing.

N.V.M.
11-18-2005, 09:25 AM
do i see stats? :eek:

:bouncy:


edit: looks like Nov 5th was the last day.

the-mk
11-18-2005, 11:49 AM
It seems the stats are coming back!

They look a bit strange (stylesheets seem to be missing or changed) and there are some funny messages (STATS NOTICE: Only variables should be assigned by reference) :D

They currently contain data up to Nov 5th, but I think the server is now computing the stats for the days after the 5th... and that'll take some time...

:cheers:

PY 222
11-18-2005, 12:44 PM
There's still no stats on my end.

http://stats.distributed.net/

the-mk
11-18-2005, 01:19 PM
so then they are gone...

I think they need to update a lot of things before they can go into production again with the stats...

N.V.M.
11-18-2005, 06:31 PM
aaaargg. the teasers...:bang: guess they were just testing.

the-mk
11-19-2005, 02:51 AM
News, once again from decibel's .plan (http://n0cgi.distributed.net/cgi/dnet-finger.cgi?user=decibel):


:: 18-Nov-2005 17:09 CST (Friday) ::

Can someone tell me what's wrong with this picture?

decibel@fritz.1[16:52]~:60>sha1 fritz-20051002.sql.bz2
SHA1 (fritz-20051002.sql.bz2) = 6b3bb0796f7025fc243b2bfe8e9ec8b2c661045b
decibel@fritz.1[16:55]~:61>sha1 fritz-20051002.sql.bz2
SHA1 (fritz-20051002.sql.bz2) = c012f152f05d5e33a88e027948d5e267e7003e2b
decibel@fritz.1[17:01]~:62>

In a nutshell; fritz is throwing random errors when reading from either drive
array. Obviously not a feature one looks for in a database server. I suspect
it's the 3ware controller, but we'll need to do more testing to find out.

The machine is also being moved this weekend, probably on Sunday. Between the
hardware issues and the move, people probably shouldn't expect stats to be back
up until next week at the earliest.

Also, stats were inadvertently turned back on last night. Unfortunately, any
modifications that were made last night will be lost. So, for example, if you
created a team, or changed some of your participant information last night,
that will be gone when we come back.

Hopefully they get this issue ironed out!
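
The check decibel ran by hand above (hash the same file twice, compare the results) is easy to script. Here is a minimal sketch in Python; it is not what the d.net staff used, and the function names and the script name in the usage comment are made up for illustration.

```python
import hashlib
import sys


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_digests(path, passes=2):
    """Hash the same file several times and return every digest seen."""
    return [sha1_of_file(path) for _ in range(passes)]


def reads_are_consistent(path, passes=2):
    """True if every pass over the file produced the same digest."""
    return len(set(repeated_digests(path, passes))) == 1


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python hashcheck.py fritz-20051002.sql.bz2
    target = sys.argv[1]
    for digest in repeated_digests(target):
        print(f"SHA1 ({target}) = {digest}")
    if not reads_are_consistent(target):
        print("Digests differ between passes: reads are not reliable.")
```

Keep in mind that on a small file the second pass may be served from the OS cache instead of the disk, so matching digests prove less than mismatching ones do.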

the-mk
11-20-2005, 05:25 AM
Here is some more news from nugget's .plan (http://n0cgi.distributed.net/cgi/dnet-finger.cgi?user=nugget):


:: 19-Nov-2005 12:32 CST (Saturday) ::

We made good progress this morning in diagnosing the problems with the
stats server. As Decibel mentioned last night, we started seeing random
read errors when pulling data off the drives. Running a SHA1 or MD5 hash
off the PostgreSQL backup file (10GB) twice in a row would never yield
the same hash twice in a row. Quite creepy to see.

At first we thought we might be dealing with an OS issue, since we'd
taken this downtime as a good opportunity to upgrade the server from
FreeBSD 5.x to 6.0-STABLE, so we got a little sidetracked debugging
UFS2 and newfs options (which we'd also experimented with during the
restore). In that experimenting, Leto managed to ferret out a weird
bug in FreeBSD 6 where the system will panic if you copy a large
directory structure to a drive which has been tuned with a large
average filesize parameter. (Sent PR amd64/89202 to the FreeBSD team)

http://www.freebsd.org/cgi/query-pr.cgi?pr=amd64/89202

Once we moved past that, though, we were still facing the weird read
errors. This morning I nicked two drives out of the raid10 volume (which
was empty anyway) and plugged them in to a spare 9500S card that we've
got on hand. We're unable to repro the read errors off that card, which
would seem to indicate that the problem is indeed the old 3Ware 8506.

Sadly, the 9500S card is only the four port model, so we can't just
swap it in and start using it, we'll have to order a new card for
the stats server.

I'm quite encouraged that we seem to have isolated the problem to the
controller card. It's under warranty, but it's a depot repair and
the vendor won't just cross-ship us a replacement. We'll have to
order a new card if we want to get the server back up and running in
a reasonable amount of time.

Mustard
11-20-2005, 11:39 AM
Well I'm glad that they *think* they have the issue located......... coming from a hardcore commercial operations background, I sure wouldn't screw around with that setup anymore. If that card is that hard to obtain, then it is time to move to something much more common and mainstream..... really poor thinking for a 24x7 operations center. Business doesn't tolerate outages well at all....... it costs major sums of money, not to mention credibility.

Meanwhile I'm moving off to another world.

N.V.M.
11-20-2005, 03:14 PM
i'm still good for caching for another week or so because the little machines i have running rc-5 aren't the fastest crunchers, and i always try to keep it topped up with my 1000 WUs. i haven't even tried, but is work still available? if so, i don't see a reason for switching.

the-mk
11-23-2005, 12:28 AM
nugget's .plans (http://n0cgi.distributed.net/cgi/dnet-finger.cgi?user=nugget)

:: 22-Nov-2005 20:31 CST (Tuesday) ::

The new raid controller for statsbox arrived today (3Ware 9550SX-8) and
I've got it plugged up and running. Everything looks great so far,
although the "SX" series cards are a bit new for FreeBSD stable and we'll
have to tapdance a bit on startup to get the proper twa driver loaded. I
see that the driver version we need was committed to FreeBSD current
about two weeks ago, so the awkwardness should be short-lived, I'd
expect an MFC into stable before too long.

The universe just keeps piling on, though, and one of the new 300GB
drives we bought died today while I was trying to initialize the
RAID10 volume. I ran to Fry's to pick up a new, new drive and this
one seems fine. Right now I'm working on moving the contents of the
200GB RAID1 system volume (the OS and home directories) onto a new
300GB mirror made from two of the new drives. This will give us an
extra 100GB to play around with in our home directories, which ought
to be nice. Once I've verified that the system volume has copied to
the 300GB drives I'll wipe the old ones and rebuild the RAID10
(database) volume from the six remaining 200GB drives.

I should have all that wrapped up by tomorrow, which means we'll be
in a position to restore the stats database backup and kick off the
catchup runs from all the keymaster log files that have been piling
up during this downtime.

Thanks again for your patience and understanding as we bring stats
back to life. Hopefully this means we'll have gotten the next few
years' worth of problems out of the way all in this one massive crash.

Moo.
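
The capacity figures in nugget's update follow from how mirroring works: a two-drive mirror holds one drive's worth of data, and RAID10 (striped mirrors) holds half the raw total. Here is a quick sketch of that arithmetic in Python. The helper names are made up, and it assumes equal-sized drives and two-way mirrors and ignores filesystem overhead.

```python
def raid1_usable_gb(drive_gb):
    """A two-way mirror stores one drive's worth of data."""
    return drive_gb


def raid10_usable_gb(drive_count, drive_gb):
    """RAID10 stripes across mirrored pairs, so half the raw space is usable."""
    if drive_count < 4 or drive_count % 2:
        raise ValueError("RAID10 needs an even number of drives, at least 4")
    return (drive_count // 2) * drive_gb


old_system = raid1_usable_gb(200)        # old 200GB RAID1 system volume
new_system = raid1_usable_gb(300)        # new 300GB mirror
database = raid10_usable_gb(6, 200)      # six remaining 200GB drives

print(new_system - old_system)  # 100 (the "extra 100GB" for home directories)
print(database)                 # 600
```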

the-mk
11-24-2005, 12:37 AM
It should come back soon, per decibel's .plan (http://n0cgi.distributed.net/cgi/dnet-finger.cgi?user=decibel):


:: 23-Nov-2005 23:05 CST (Wednesday) ::

23:02 <+dctievent> (statsbox-iv/r72) Daily processing for 20051106 has
completed

As soon as fritz is moved back into a datacenter we should be all set. In the
meantime, it's playing catchup.

IronBits
11-24-2005, 02:18 AM
:thumbs:

russkris
11-27-2005, 06:35 AM
Has anybody got any more updates on the servers?

the-mk
11-27-2005, 09:04 AM
Originally posted by russkris
Has anybody got any more updates on the servers?

Generally you can find distributed.net news, such as new clients or stats server issues, here: http://n0cgi.distributed.net/cgi/dnet-finger.cgi

Last update about the stats server: http://n0cgi.distributed.net/cgi/dnet-finger.cgi?user=decibel

:: 23-Nov-2005 23:05 CST (Wednesday) ::

23:02 <+dctievent> (statsbox-iv/r72) Daily processing for 20051106 has
completed

As soon as fritz is moved back into a datacenter we should be all set. In the
meantime, it's playing catchup.

I hope I can see my stats again in the next few days :D

russkris
11-27-2005, 09:53 PM
Thank you for that link

N.V.M.
11-28-2005, 08:48 AM
stats! :elephant:

PY 222
11-28-2005, 03:31 PM
And stats are back:


:: 28-Nov-2005 13:29 CST (Monday) ::

In case anyone didn't notice... stats are back. :)

Do the happy dance :elephant:

the-mk
11-28-2005, 03:47 PM
That's great!!

:cheers: :elephant:

PY 222
11-30-2005, 02:15 AM
Here we go again:


:: 30-Nov-2005 00:21 CST (Wednesday) ::

Well... when it rains...

Nov 30 05:39:02 fritz kernel: twa0: INFO: (0x04: 0x000b): Rebuild started: unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0026): Drive ECC error reported: port=5, unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x002d): Source drive error occurred: unit=1, port=5
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0004): Rebuild failed: unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0002): Degraded unit: unit=1, port=3
Nov 30 05:51:47 fritz kernel: twa0: INFO: (0x04: 0x000b): Rebuild started: unit=1

In plain english... another drive has failed. I've heard it's common for drives
from the same manufacturing run to all fail at the same time; I guess this is
proof.

I'm going to turn stats back on again, but I highly recommend you not make any
changes to team or participant information until this is all cleared up. It is
very possible that we will end up losing the entire array again, which right
now would mean reverting to a backup that could be days (or possibly even
weeks, depending on how long this takes) old.

We've already RMA'd 2 200G drives. Once those come back it shouldn't be much of
an issue for us to deal with drive failures, since we'll have some spares
on-hand. I'm also going to setup replication of critical data so that even if
we do lose the database again loss of user-modified data should be minimal.

Thanks for your patience.

From decibel's .plan

http://n0cgi.distributed.net/cgi/dnet-finger.cgi?user=decibel
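
If you want to see at a glance which ports are throwing errors in a log like the one above, a few lines of Python will do it. This is only a sketch: the regular expression is derived from the six lines quoted in the .plan, not from any 3ware driver documentation, and all the names in it are made up.

```python
import re
from collections import Counter

# Pattern inferred from the quoted kernel messages only (an assumption).
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) kernel: "
    r"(?P<device>twa\d+): (?P<level>[A-Z]+): "
    r"\((?P<source>0x[0-9a-fA-F]+): (?P<code>0x[0-9a-fA-F]+)\): "
    r"(?P<message>[^:]+): (?P<fields>.*)$"
)

SAMPLE = """\
Nov 30 05:39:02 fritz kernel: twa0: INFO: (0x04: 0x000b): Rebuild started: unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0026): Drive ECC error reported: port=5, unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x002d): Source drive error occurred: unit=1, port=5
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0004): Rebuild failed: unit=1
Nov 30 05:48:01 fritz kernel: twa0: ERROR: (0x04: 0x0002): Degraded unit: unit=1, port=3
Nov 30 05:51:47 fritz kernel: twa0: INFO: (0x04: 0x000b): Rebuild started: unit=1
"""


def parse_line(line):
    """Split one log line into a dict, or return None if it doesn't match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    # Turn "port=5, unit=1" into {"port": 5, "unit": 1}.
    pairs = (item.split("=", 1) for item in event.pop("fields").split(", "))
    event["fields"] = {key: int(value) for key, value in pairs}
    return event


events = [e for e in map(parse_line, SAMPLE.splitlines()) if e]
errors = [e for e in events if e["level"] == "ERROR"]
ports = Counter(e["fields"]["port"] for e in errors if "port" in e["fields"])

for port, count in sorted(ports.items()):
    print(f"port {port}: {count} error line(s)")
```

On the sample this reports one error line for port 3 and two for port 5, which matches what decibel describes: port 5 is the failing source drive.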

the-mk
12-07-2005, 01:04 PM
http://n0cgi.distributed.net/cgi/dnet-finger.cgi?user=decibel

:: 06-Dec-2005 19:05 CST (Tuesday) ::

*sigh*

Got a background fsck failure on /usr which I wasn't able to handle remotely.
My attempt ended up rendering the box off the net, so we're now stuck until
someone can get to the console, which might well be tomorrow. Ooops.

Sorry for the continued delay...


:: 06-Dec-2005 13:46 CST (Tuesday) ::

Replacement drives are finally here. We're working on getting a backup before
doing the RAID rebuild, which is why stats are down. They should hopefully be
back up in time for statsrun.

Another reason to switch my last few boxen to R@H...

the-mk
03-26-2006, 12:59 AM
http://n0cgi.distributed.net/cgi/dnet-finger.cgi?user=decibel

:: 25-Mar-2006 06:44 CST (Saturday) ::

Stats are currently down, and I'm unable to ssh into the box. Since I'm in
Belgium right now, there's not much I can do, but someone in the states should
be up and able to look at it in the next few hours.


:: 24-Mar-2006 09:31 CST (Friday) ::

I'll be updating PostgreSQL on stats shortly; there will be a brief outage.

missing stats :cry:

the-mk
03-27-2006, 04:28 PM
http://n0cgi.distributed.net/cgi/dnet-finger.cgi?user=nugget

:: 27-Mar-2006 11:54 CST (Monday) ::

I've got statsbox back up and online, and the raid10 volume is
currently rebuilding. It looks like the drive tray fans on
drives 7 and 8 have stopped working, which may be the source
of the problem. All those SATA drives are crammed in close
together and perhaps the drive weirdness we've seen lately is
the result of heat issues from the failing fans.

I've got the stats website shut off for now while the volume
rebuilds and Decibel can get a chance to nose around and make
sure that all the data looks sane.

EDIT: crap, drive 7 just disconnected itself again.

the-mk
03-28-2006, 11:30 AM
http://n0cgi.distributed.net/cgi/dnet-finger.cgi?user=decibel

:: 28-Mar-2006 02:33 CST (Tuesday) ::

Another of the original drives in fritz has died. Fortunately there was no data
corruption like last time, but we've decided to keep stats offline until we can
get a new replacement installed. I'm not sure when exactly that will happen,
since I'm currently 8 time-zones away from the machine. I would expect it to be
this week, however.

IronBits
03-28-2006, 07:41 PM
Thanks for all the updates :thumbs:

Sorry to hear they are having so much trouble with their RAID array :(

the-mk
04-04-2006, 12:15 AM
If anyone didn't notice: stats are back :D

the-mk
04-11-2006, 03:36 PM
and again and again and again and again.... it is offline... no .plans since 28th of March 2006...

did something bad happen to their RAID array? :dunno:

IronBits
04-11-2006, 08:55 PM
I would assume so, as it has been down for several days now.

the-mk
04-12-2006, 06:45 AM
funny thing: it went up again, but as of now the stats are only from the 7th of April 2006...

I think in a few hours or days we'll have the current stats back again :D

from the .plans of nugget:

:: 11-Apr-2006 16:36 CDT (Tuesday) ::

I made good progress on statsbox today and I think we've finally found the
fundamental problem that keeps taking drive 8 offline. Each of the 9
drive bays in fritz's case has a little hotswap backplane board which
connects to the drive's SATA and power connectors on the front, and to
the case power supply and SATA cable on the back side. It looks like
cable tension for the bundle of cables for the last three bays has been
pulling down on those three cables and loosening the connection between
the SATA cable and the backplane board. The cables for all three bays
are really, really loose and bay 7 even has broken plastic.
...
...
more details on http://n0cgi.distributed.net/cgi/dnet-finger.cgi?user=nugget (haven't read all details yet)

the-mk
04-13-2006, 06:06 AM
from decibel's .plan (http://n0cgi.distributed.net/cgi/dnet-finger.cgi?user=decibel):

:: 12-Apr-2006 15:51 CDT (Wednesday) ::

As nugget mentioned yesterday, we think we've discovered the reason why drives
keep dropping out of the array. Nugget tried to fix the problem, but it looks
like he was unsuccessful as we're back to degraded mode again.

Rather than continue without stats while we try and fix this, we're going to
turn them back on and switch to nightly backups for now. It is possible we
could end up losing some user changes if we lose another drive in the array,
but hopefully that won't happen...