Dyyryath
03-17-2002, 12:48 PM
Hey guys. Word seems to have gotten around that I'm 'working on the problem' with Howard. That may be a little too strong a phrase.
Yesterday, I sent Howard an email offering to help if I could. He replied (quickly, I might add) that he thought he had the problem fixed. However, he also took the time to answer some questions I had concerning the way the data was handled on their end.
From that conversation, I made some guesses and offered a couple of opinions about what might be done to streamline their upload process, since that appeared to be the bottleneck.
We sent a flurry of emails back and forth where Howard graciously took the time to answer more of my questions and further define their upload process for me.
After getting a good idea of what was *supposed* to happen, I broke out a packet sniffer and spent some time analyzing the data passed between my clients and the servers. I found that the problem wasn't really a network-level problem, but rather what appeared to be a logic problem on the part of the upload script. The client *was* sending the data in its entirety (the first WU), but it wasn't receiving an OK from the server afterward. It would then quit trying to upload data (hence only the first WU would actually go) and give the 'Network Not Available' message.
I sent some data back to Howard on what I was seeing along with a couple of guesses as to what might be causing this. He suggested that I try his uploaded patch and see if I could sniff an upload that worked as it should.
I did this and determined that the original client (which was uploading structures in groups of 5000) was timing out while waiting for the server to return the OK. This always happened after 10 seconds. The new client (which was working at the time) not only had a smaller structure cache size, it also had a longer timeout on the client side: up from 10 seconds to 30 seconds. Howard confirmed that the original timeout was 10 seconds, and that he had increased it on the new client.
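To make the timeout behavior above concrete, here is a minimal sketch of a client that sends its data and then waits a fixed time for an OK. It is not the project's code (the source isn't available); the names `slow_server`, `upload`, and `try_upload`, the loopback address, and the "OK" reply format are all assumptions made up for illustration, and the delays are scaled down from 10 and 30 seconds so it runs quickly.

```python
import socket
import threading
import time

def slow_server(listener, processing_seconds):
    """Accept one upload, pretend to process it, then reply OK (hypothetical server)."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(65536)                # read the uploaded work unit
        time.sleep(processing_seconds)  # stand-in for decompress/validate/insert
        try:
            conn.sendall(b"OK")
        except OSError:
            pass                        # client already gave up and closed

def upload(port, payload, timeout_seconds):
    """Send payload, then wait up to timeout_seconds for the server's OK."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.settimeout(timeout_seconds)
        sock.sendall(payload)
        try:
            return sock.recv(16) == b"OK"
        except socket.timeout:
            # No OK in time: a client like the one described would stop here
            # and report a network error, even though the data was sent.
            return False

def try_upload(processing_seconds, timeout_seconds):
    """Run one simulated upload against a local server; True if OK arrived."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))     # let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=slow_server,
                              args=(listener, processing_seconds))
    worker.start()
    ok = upload(port, b"work unit data", timeout_seconds)
    worker.join()
    listener.close()
    return ok
```

With a simulated processing time of 0.3 seconds, `try_upload(0.3, 0.1)` should give up before the OK arrives, while `try_upload(0.3, 0.9)` should receive it. Nothing about the network changes between the two calls; only the client's patience does.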
Using the new executable, I was able to upload about a dozen of the 5000-structure WUs without issue. It appears that the timeout was more important than the WU size, though together they should have an even greater impact.
What it came down to (as near as I could guess with limited information) was that the server was taking too long to decompress the info, check it for validity, and insert it into the database (which I suspect may be the biggest holdup). While the server *was* doing what it was supposed to do, it wasn't doing it fast enough to finish before the client timed out. This is why the longer timeout helps (as do the shorter WUs, which take less time to process and return an OK).
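The reasoning above can be put into a rough back-of-the-envelope model. This is only a sketch: the function names are invented, and every cost figure below is a made-up placeholder, not a measurement from the real server. The only numbers taken from the discussion are the 5000-structure WU size and the 10 and 30 second timeouts.

```python
def server_ack_delay(structures, decompress_s, validate_per_structure_s,
                     insert_per_structure_s):
    """Estimated seconds before the server can send its OK (hypothetical model)."""
    per_structure = validate_per_structure_s + insert_per_structure_s
    return decompress_s + structures * per_structure

def upload_acknowledged(structures, timeout_s, **costs):
    """True if the OK would arrive before the client's timeout expires."""
    return server_ack_delay(structures, **costs) < timeout_s

# Placeholder costs, chosen only so the example is easy to follow.
ASSUMED_COSTS = {
    "decompress_s": 1.0,
    "validate_per_structure_s": 0.0005,
    "insert_per_structure_s": 0.0025,   # database insert assumed to dominate
}

for structures, timeout_s in [(5000, 10), (5000, 30), (2000, 10)]:
    ok = upload_acknowledged(structures, timeout_s, **ASSUMED_COSTS)
    print(structures, "structures,", timeout_s, "s timeout ->",
          "OK" if ok else "timed out")
```

Under these assumed costs, a 5000-structure WU misses a 10 second timeout but fits comfortably inside 30 seconds, and a smaller WU (2000 structures here, an arbitrary figure) fits inside the original 10 seconds. That matches the shape of what I saw: either change helps, and the two together leave the most headroom.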
I sent Howard my speculation on all of this and we tend to agree that the problem is a combination of factors causing a delay in returning a confirmation to the client. I've also sent him a couple of suggestions for streamlining the process, but they may or may not be valid given the infrastructure and code they are using.
Howard (quite correctly) doesn't want to give out the source code for his upload process, so there is not much more that I can do.
I can't even begin to speculate on the problems we've seen today with invalid handles and such, since I just woke up and found them.
What I can tell everybody is that Howard spent most of yesterday answering a plethora of questions on my behalf when he didn't have to, and he seems genuinely interested in fixing these things and moving on.
Give him some time. I've shut down all my clients for now, but I'm not abandoning the project. I'm sure that he'll get things worked out and when he does, I'll come right back. He's been as willing to work with, listen to, and reply to the users as anyone I've seen in a DC project yet.