
Thread: Server problems addressed and grace period extension

  1. #1

    Server problems addressed and grace period extension

    The source of the problem is simply an overwhelming number of user connections. Nothing was broken or malfunctioning, but both our web servers and the database were running at the maximum number of connections.

    This may be due to large numbers of buffered results, the faster speed of crunching through the new smaller protein, or more than likely a combination of both. Looking through the logs it seems that the connections are starting to level out to a nice balance, so hopefully most of the people have flushed their results, or at least enough have gone through to not constantly overload our resources.

    In the meantime, if you cannot connect, please wait a bit and try again later.

    As for the grace period for old results, it is extended so that the 24 and 48 hour rules will apply anew starting at 12:00 pm EST today (18/02/2004). Try not to flush enormous amounts of data at once, as that prevents other users from obtaining any connections. If at all possible, upload at some evenly balanced intervals.
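The request to "upload at some evenly balanced intervals" could be handled on the client side by spacing uploads with random jitter, so that many machines do not all hit the server at once. A minimal sketch (the function name and parameters are illustrative, not part of the real client):

```python
import random

def schedule_uploads(n_batches, base_interval_s=3600, jitter_s=600):
    """Return upload times (in seconds) spread over balanced intervals.

    Each upload is separated by roughly base_interval_s seconds, plus
    random jitter so that a fleet of clients does not synchronize and
    slam the server at the same moment.
    """
    times = []
    t = random.uniform(0, jitter_s)  # random initial offset per client
    for _ in range(n_batches):
        times.append(t)
        t += base_interval_s + random.uniform(-jitter_s, jitter_s)
    return times
```

With the defaults above, consecutive uploads land between 50 and 70 minutes apart, never all at once.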

    We will be keeping an eye on the resources, but feel free to let us know if any new errors start popping up.
    Elena Garderman

  2. #2
    Stats God in Training Darkness Productions's Avatar
    Join Date
    Dec 2001
    Location
    The land of dp!
    Posts
    4,164
    Do you all need more/bigger hardware for either of these, or is it just an OS configuration issue? I can't imagine how many simultaneous connections you all have, but I don't think it would cause a properly configured server setup to cave like that....

    Just a thought.

  3. #3
If it was ever doubted, the current situation cries out for a configurable upload interval: not every generation, but every ten, or even every fifty for a protein this small.

  4. #4
    cant stop buying hardware... rofn's Avatar
    Join Date
    Nov 2003
    Location
    Vienna, Austria, Europe ;)
    Posts
    136
i cry, too

this is my 3rd time begging for a personal proxy like the one dnet uses

or, like HaloJones says, a configurable upload # of gens
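The configurable upload the posters are asking for amounts to a client-side buffer that flushes every N generations. A minimal sketch, where the `send` callable stands in for the real client's upload step (not part of the actual software):

```python
class GenerationBuffer:
    """Buffer completed generations and flush every `flush_every` gens.

    `send` is whatever function actually uploads a list of results;
    it is a parameter here because the real client's upload call is
    not part of this sketch.
    """

    def __init__(self, flush_every, send):
        self.flush_every = flush_every
        self.send = send
        self.pending = []

    def add(self, result):
        self.pending.append(result)
        if len(self.pending) >= self.flush_every:
            self.flush()

    def flush(self):
        if self.pending:
            self.send(self.pending)  # one upload for the whole batch
            self.pending = []
```

Setting `flush_every=10` or `flush_every=50` gives exactly the "every ten or every fifty" behaviour asked for above.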

  5. #5
    Ol' retired IT geezer
    Join Date
    Feb 2003
    Location
    Scarborough
    Posts
    92

Multi-Generation Input

    The source of the problem is simply an overwhelming number of user connections
Sounds like you have reached the limit of your current design. You either
need faster hardware or a more efficient design. If your client gathered
the results from several successive generations into one data record to
send to the server, you would have the opportunity to insert their data into
the DB as one "logical unit of work", with a significant reduction in
resources used compared with your current method. If you need to change
your design to do that, so be it. You are currently rewriting the data
upload anyway...

Yes, it will take a little longer to process a connection. But you would have
far fewer connections....
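Ned's "logical unit of work" can be illustrated with a batched insert. The schema and the sqlite3 backend here are assumptions for the sketch (the project's real database is not named); the point is one connection and one commit per batch of generations, rather than one round trip per result:

```python
import sqlite3

def insert_generations(conn, user_id, generations):
    """Insert all of a client's buffered generations in one transaction.

    One connection, one commit - a single logical unit of work -
    instead of one connection and one commit per generation.
    """
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO results (user_id, gen, energy) VALUES (?, ?, ?)",
            [(user_id, g["gen"], g["energy"]) for g in generations],
        )
```

Per-row commits force the database to flush to disk for every result; batching amortizes that cost across the whole upload.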


    Ned

  6. #6
    With an upload every few minutes, the latencies in making the connection, uploading and breaking the connection rapidly add up. I can't babysit my farm. It needs to be able to upload on a time period which would not then change between proteins.

    This latest outage started at around 8am GMT and nothing is available at present. This cannot be user connection related.


    EDIT:

    I had a directory with 43 gens of the old protein in it. I wrote an upload batch file to keep trying "-u t" until all the files were gone. It took more than two hours to upload them. With 25 P4s running non-stop I got 36,336 points (not gens, points) uploaded in the last two hours!



    Last edited by HaloJones; 03-19-2004 at 11:32 AM.

  7. #7
The fault lies with members of the community, for not bothering to plan a more consistent upload.
Certainly one takes account of the worst case, but it doesn't make sense for them to expend resources to deal with the fact that the servers get slammed once a month because some people hoard results.

    It's extremely difficult to design capacity to handle that sort of scenario. If users would make it a policy to execute a predictable and consistent upload of results, they could maximize efficient use of resources on their end.

    In this case we shouldn't be irritated with them, but with ourselves.

  8. #8
I have not cached anything, although I agree others may have. I am deeply unhappy when I see that some have managed to get 2 million points through in a two-hour window when I have achieved 100,000 but have millions that can't get through. That's not from caching - these are points from the new protein that can't get through.

  9. #9
    Senior Member
    Join Date
    Oct 2003
    Location
    an Island off the coast of somewhere
    Posts
    540
    Originally posted by allenfinch
The fault lies with members of the community, for not bothering to plan a more consistent upload.
Certainly one takes account of the worst case, but it doesn't make sense for them to expend resources to deal with the fact that the servers get slammed once a month because some people hoard results.

    It's extremely difficult to design capacity to handle that sort of scenario. If users would make it a policy to execute a predictable and consistent upload of results, they could maximize efficient use of resources on their end.

    In this case we shouldn't be irritated with them, but with ourselves.
    The community doesn't necessarily choose to buffer every time a protein change occurs.

    The clients start buffering on their own when the server is taken down for hours for switching over to a new protein. Those buffered results are trying to upload at the same time the servers are trying to download new client or protein packages.

    There are some in the community who do choose to buffer during the protein change - either before getting the update, or after - simply to keep their production rate up instead of wasting cycles waiting for the server to respond or time out.

Those in the community who fold off-line have little choice but to try to upload all their cached work within the 24-hour period immediately following the update, or upload early and have the machines sit idle while waiting for the update - and perhaps have to visit the offline farm twice: once to harvest and upload, then again to install the update.

The project, by its design, forces these situations.

    willy1






  10. #10
    The Cruncher From Hell
    Join Date
    Dec 2001
    Location
    The Depths of Hell
    Posts
    140
    If you crunch offline before a protein switch, whether for a week or the whole protein run, you aren't benefitting the project. If you haven't uploaded all that you can before the switch, but instead choose to buffer until after, you really should find a different project.

  11. #11
    Member
    Join Date
    Oct 2002
    Location
    southeastern North Carolina
    Posts
    66
I'm on dial-up... I upload every morning and twice every evening.

AFTER downloading the new software, within a day I had hundreds of generations waiting on each of 3 machines.
Some of them I babysat for hours, connected,
and essentially NOTHING was getting through....
    " All that's necessary for the forces of evil to win in the world is for enough good men to do nothing."-
    Edmund Burke

    " Crunch Away! But, play nice .."

    --RagingSteveK's mom


  12. #12
We should be able to handle the load, even with everyone uploading at once. I am starting to think it may be wiser to optimize the database and the queries made to it - reduce the total number of connections happening. Unfortunately, I am not a database whiz by any means. And this would not be a minor change, so I'm just not sure.
    Howard Feldman

  13. #13
    Senior Member
    Join Date
    Jul 2002
    Location
    Kodiak, Alaska
    Posts
    432
Things have changed dramatically since we used to upload once every one or two days...

Is it possible to send all the files for a generation together (zipped, or packed with an algorithm that has a really low processing hit for decompressing), if that would help in addition to the optimization of the database access? I.e., grab it all at once (reducing the handshaking threefold) and verify it all at once, instead of doing it once per file.
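Packing a generation's files into one compressed blob, as suggested here, might look like the following sketch (gzip decompression is cheap on the receiving side; the filenames and contents are purely illustrative):

```python
import io
import tarfile

def pack_generation(files):
    """Pack a generation's files into a single gzipped tar blob.

    files: dict of filename -> bytes. Sending one blob replaces
    several per-file handshakes, and the server decompresses it all
    in one go.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def unpack_generation(blob):
    """Reverse of pack_generation: one read yields all the files."""
    out = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        for member in tar.getmembers():
            out[member.name] = tar.extractfile(member).read()
    return out
```

The server could then verify the whole generation after a single transfer instead of validating file by file.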
    www.thegenomecollective.com
    Borging.. it's not just an addiction. It's...
