
View Full Version : Some Words About the DF Problems



Dyyryath
03-17-2002, 12:48 PM
Hey guys. Word seems to have gotten around that I'm 'working on the problem' with Howard. That may be a little too strong a phrase.

Yesterday, I sent Howard an email offering to help if I could. He replied (quickly, I might add) that he thought he had the problem fixed. However, he also took the time to answer some questions I had concerning the way the data was handled on their end.

From that conversation, I made some guesses and offered a couple of opinions about what might be done to streamline their upload process, since that appeared to be the bottleneck.

We sent a flurry of emails back and forth where Howard graciously took the time to answer more of my questions and further define their upload process for me.

After getting a good idea of what was *supposed* to happen, I broke out a packet sniffer and spent some time analyzing the data passed between my clients and the servers. I found that the problem wasn't really a network-level problem, but rather what appeared to be a logic problem on the part of the upload script. The client *was* sending the data in its entirety (the first WU), but it wasn't receiving an OK from the server afterward. It would then quit trying to upload data (hence only the first WU would actually go) and give the 'Network Not Available' message.

I sent some data back to Howard on what I was seeing along with a couple of guesses as to what might be causing this. He suggested that I try his uploaded patch and see if I could sniff an upload that worked as it should.

I did this and determined that the original client (which was uploading structures in groups of 5000) was timing out while waiting for the server to return the OK. This was always after 10 seconds. The new client (which was working at the time) not only had the structure cache size decreased, but it also had a longer timeout on the client side; up from 10 seconds to 30 seconds. Howard confirmed that the original timeout was 10 seconds, and that he had increased it on the new client.
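To illustrate the failure mode, here's a toy sketch in Python. The server, ports, payloads, and timings are all invented stand-ins for the real DF upload process; the point is only that a client with too short a timeout gives up even though the server eventually answers:

```python
import socket
import threading
import time

def start_slow_server(ack_delay):
    """Toy stand-in for the upload server: it receives the whole work unit,
    then spends `ack_delay` seconds 'processing' before sending its OK."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))        # let the OS pick a free port
    srv.listen(1)
    def run():
        conn, _ = srv.accept()
        conn.recv(65536)              # the WU arrives in full
        time.sleep(ack_delay)         # decompress/validate/insert delay
        try:
            conn.sendall(b"OK")       # may come too late for the client
        except OSError:
            pass                      # client already gave up and closed
        conn.close()
        srv.close()
    threading.Thread(target=run, daemon=True).start()
    return srv.getsockname()[1]

def upload(port, timeout):
    """Toy client: send a WU, then wait up to `timeout` seconds for the OK."""
    c = socket.socket()
    c.connect(("127.0.0.1", port))
    c.sendall(b"structure-data")
    c.settimeout(timeout)
    try:
        return c.recv(16) == b"OK"    # acknowledgement arrived in time
    except socket.timeout:
        return False                  # -> 'Network Not Available'
    finally:
        c.close()

# The server takes 0.5s to acknowledge: a 0.1s timeout fails even though
# the data was sent in full, while a 2s timeout succeeds.
short_ok = upload(start_slow_server(0.5), 0.1)
long_ok = upload(start_slow_server(0.5), 2.0)
print(short_ok, long_ok)
```

The data transfer itself succeeds in both cases; only the wait for the acknowledgement differs, which matches what the sniffer showed.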

Using the new executable, I was able to upload about a dozen of the 5000 structure WUs without issue. It appears that the timeout was more important than the WU size, though together they should have an even greater impact.

What it came down to (as near as I could guess with limited information) was that the server was taking too long to decompress the info, check it for validity, and insert it into the database (which I suspect may be the biggest hold up). While the server *was* doing what it was supposed to do, it wasn't doing it fast enough to finish before the client timed out. This is why the longer timeout helps (as does the shorter WUs which take less time to process and return an OK).

I sent Howard my speculation on all of this and we tend to agree that the problem is a combination of factors causing a delay in returning a confirmation to the client. I've also sent him a couple of suggestions for streamlining the process, but they may or may not be valid given the infrastructure and code they are using.

Howard (quite correctly) doesn't want to give out the source code for his upload process, so there is not much more that I can do.

The problems we've seen today with invalid handles and such, I can't even begin to speculate on since I just woke up and found them.

What I can tell everybody is that Howard spent most of yesterday answering a plethora of questions on my behalf when he didn't have to and he seems genuinely interested in fixing these things and moving on.

Give him some time. I've shut down all my clients for now, but I'm not abandoning the project. I'm sure that he'll get things worked out and when he does, I'll come right back. He's been as willing to work with, listen to, and reply to the users as anyone I've seen in a DC project yet.

ColinT
03-17-2002, 12:55 PM
OK, the User DB was busted this morning. Hence no Handles. Here's what Howard says in the Yahoo Forum:

- Just a general note to users - please be patient when there is a
problem - we will do our best to get problems resolved as quickly as
possible. However, they will get fixed FASTER if I am spending my
time fixing the problem, not wading through e-mails all telling me
the same thing. I have set up a Bug report database in this
Discussion Forum so please enter any bugs you find there, and if the
bug has already been entered, assume we know about it or will
shortly. I am just trying to avoid getting floods of e-mails. Thank
you.

- We have had a short outage of the user database. This has been
rebuilt so if you were experiencing problems logging in please try
again now. You may possibly have to enter your handle into the
software again. The database index was somehow corrupted, that is
all. No need to re-register or anything.

- With the help of some knowledgeable users, we have been able to
pinpoint at least one cause of the problem since the new protein.
Apparently the client was set to timeout after 10 seconds, so if
your upload didn't complete within 10 seconds it would be cutoff
(and thus 'No response from server' - wow). It is not a limitation of
our server bandwidth or computational power (although it was
stretched to its limits still). In tomorrow's version the timeout has
been extended to 150 seconds to ensure all data gets through.
Combined with a shorter upload interval of 1000 structures at a
time, there should hopefully be no more problems. Note that even if
you have files buffered, the time out applies to each file
individually so don't worry if you have lots of buffered work.

Howard

FoBoT
03-17-2002, 04:37 PM
:cool:

pointwood
03-18-2002, 08:21 AM
Thanks Dyyryath! It's great that you are taking the time and using your expertise to give Howard some hints at what could be made better. It's not that I don't think Howard is capable of troubleshooting these problems; the history so far has shown he is quite capable of that. But nobody can be an expert in every field, and it seldom hurts to have an extra expert making qualified hints at possible solutions. I believe it is often the case that getting another person to look at something can point to solutions which you yourself would never have thought of, or would have taken considerably longer to find.

I completely agree with you, Dyyryath, that Howard has been very responsive and very open to suggestions and good at working together with the community (us).

dnar
03-18-2002, 12:07 PM
According to the changelog (http://www.distributedfolding.org/news.html), a new version was released on 18 March with increased timeout periods. You can find this client as the full install on the download page. The "upgrade" link is bogus.

I have fully upgraded all machines, only to discover they still report network errors!

dnar
03-18-2002, 12:23 PM
Dyyryath - I have run a packet sniff on the transfer, and guess what? The first file is sent OK, but there is no acknowledgement from the server... Looks to me like the client timeout period is not an issue (at least NOW it is not).

guru
03-18-2002, 12:42 PM
The new client does seem to be helping. After I upgraded my systems with the upgrade patch, they seem to be uploading more work than they have been. It's still slow going, as the server is still overloaded with all the requests from the old clients. Please upgrade now. This will help everyone in the long run. Do expect it to sit idle while it tries to upload the work.

guru

xj10bt
03-18-2002, 01:14 PM
I'm still having the same upload problems with the new version. I'm stopping until things get straightened out.

Dyyryath
03-18-2002, 01:40 PM
I had a feeling that the increased timeouts wouldn't completely solve the problem. I haven't looked at the actual connection stream yet today, but it would appear from casual observation of the client that this won't be completely resolved without some rather major backend procedural changes, or more powerful servers.

They do intend to upgrade servers which will help. They also expect the next set of proteins to be longer which may or may not help as their user count continues to grow.

While Howard seems extremely competent (more so than most DC project programmers I've seen lately), he's said that he doesn't have a great deal of experience in network programming. With this in mind, he's made a good choice in using CGI to handle uploads. It's proven, and stable. Unfortunately, this does incur some extra overhead, and when you combine that with the checking and storage routines that the upload process has to handle, it adds up to a pretty decent load for every connection.

However, without using multiple systems, some fairly serious backend changes, or some type of dedicated, streamlined upload process, I'm not sure that even a newer, large server is going to permanently fix this problem. There's an awful lot of overhead associated with each upload in their current design.

Howard strikes me as a smart, capable fellow, though, so let's give him some time to work on this. I'm sure that in the end he'll come up with something that works.

Guru is correct in pointing out that the more people that upgrade to the shorter WU clients, the better off we'll all be.

xj10bt: I'm shut down now, too. I'll come back once this all gets fixed. I'm sure they'll work it out, but I don't have the patience to be constantly babysitting 20 or 30 clients.

guru
03-18-2002, 02:04 PM
It sounds like they need to split the server into three: one that accepts the requests, and two others that actually do the work of uncompressing and storing the data. This would allow them to scale up without rebuilding the main server; they could simply add servers.
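A minimal sketch of that split, with a thread-safe queue standing in for whatever hand-off the real servers would use (the names, payload format, and worker count here are all invented):

```python
import queue
import threading
import zlib

work_queue = queue.Queue()   # hand-off between the accept server and workers
database = []                # stand-in for the structure database
db_lock = threading.Lock()

def accept_upload(payload):
    """Front end: cheap, just queue the raw upload and return fast."""
    work_queue.put(payload)

def worker():
    """Back-end worker: the slow decompress-and-store part, run separately."""
    while True:
        payload = work_queue.get()
        if payload is None:          # shutdown sentinel
            break
        data = zlib.decompress(payload)
        with db_lock:
            database.append(data)
        work_queue.task_done()

workers = [threading.Thread(target=worker) for _ in range(2)]
for w in workers:
    w.start()

for i in range(5):                   # five uploads hit the front end
    accept_upload(zlib.compress(b"structures-%d" % i))

work_queue.join()                    # workers drain the queue in the background
for _ in workers:
    work_queue.put(None)
for w in workers:
    w.join()
print(len(database))                 # all five stored
```

Adding capacity then means adding workers, without touching the accept front end.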

guru

Dyyryath
03-18-2002, 02:17 PM
That's a good idea (and quite close to one of my original suggestions), but part of the upload process is to checksum the information before accepting it from the client. This lets the server tell the client that the data is corrupted (possibly in transit) or has some other problem, so the client can display an error message.
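Roughly the idea, though the actual DF protocol and checksum algorithm aren't public; the MD5 digest and packet layout here are just placeholders:

```python
import hashlib

def package_upload(data):
    """Client side: prepend a 16-byte digest of the payload."""
    return hashlib.md5(data).digest() + data

def accept_upload(packet):
    """Server side: recompute the digest and reject the upload on mismatch."""
    digest, data = packet[:16], packet[16:]
    if hashlib.md5(data).digest() != digest:
        return None, "checksum mismatch - please re-send"
    return data, "OK"

good = package_upload(b"structure data")
data, status = accept_upload(good)                 # accepted

corrupted = good[:-1] + bytes([good[-1] ^ 0xFF])   # a bit-flip in transit
bad_data, bad_status = accept_upload(corrupted)    # rejected immediately
print(status, bad_status)
```

The catch is that this check has to happen before the client gets its OK, which is exactly the synchronous work that's slowing the acknowledgement down.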

My feelings at this point have more to do with their database structure, which I think is an area that could generate fairly substantial performance gains with the correct modifications. I don't want to really go into detail about this because I'm not sure how much Howard would like me to say about their setup, but I think that some changes in this area could be very effective.

I also think that something like the personal proxies that dnet uses could be helpful. Howard (for his own reasons) isn't particularly fond of that idea, so I've not pursued it any further than just a brief suggestion.

If they were to use multiple servers, I'd think that they'd probably keep much the same structure as they have now with some sort of round-robin DNS (or other load balancing tool) to hand off connections to each of the servers.

guru
03-18-2002, 04:27 PM
I understand the problems that a caching proxy could cause. How about some sort of special registry for people who want to host dedicated caching proxy servers? The proxy software could do a pre-check of the data to help offload that function from the server. I'd sign an NDA to host one. Then we could have a race to see whose proxy server had the most uptime. :0

guru

pointwood
03-19-2002, 06:10 AM
Is it really necessary to validate the uploaded data immediately? I mean, if that is the problem, couldn't you somehow just let the "upload" server accept the data and let the client return to crunching? The server (or another server) could then check the data, and the client could get an error message the next time it connects if something went wrong.

Would that be possible, or is it complete nonsense? (probably the latter :D)
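Roughly what that would look like as a sketch (every name here is invented, and "validation" is reduced to a trivial non-empty check, so this is only an illustration of the accept-now, validate-later idea):

```python
import queue

pending = queue.Queue()   # raw uploads persisted without validation
errors = {}               # handle -> error found by the later validation pass

def quick_accept(handle, data):
    """Fast path: store the raw upload and return at once, so the client can
    resume crunching. Any error found in a *previous* upload is reported
    on this contact."""
    pending.put((handle, data))
    return errors.pop(handle, None)

def validate_pending():
    """Slow path, run out-of-band on the same or another server."""
    while not pending.empty():
        handle, data = pending.get()
        if not data:                  # 'invalid' here just means empty
            errors[handle] = "previous upload was invalid - please re-send"

first = quick_accept("some_user", b"")       # a bad upload, accepted instantly
validate_pending()                           # checked later, out-of-band
second = quick_accept("some_user", b"good")  # error surfaces on next contact
print(first, second)
```

The trade-off is that the client only learns about a corrupt upload one connection later, instead of getting the error message right away.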