Thank you
Please read the following carefully - it should address all of the concerns that have been brought up over the last few days. If there is a question I missed, please ask it again and I will add a detailed answer.
--------------------------------------------------------
1. What is the ticketing system and how does it work?
The ticketing system is mainly implemented on the server and is meant to reduce idle-time during a client upload process. It also safeguards against timed-out connections and lost results. Each uploaded fileset receives a receipt from the server. This receipt allows the client to verify when the fileset has been fully processed on the server. The receipt is stored on disk in a file named receipt.txt. The contents of the file should never be tampered with.
Client behaviour with a receipt present on disk is as follows:
*Check status of fileset referenced in receipt
*If the fileset has been fully validated on the server:
-if client was just started or running with -ut: continue uploading buffered files
-if upload was occurring at the end of a generation: proceed to fold next generation
-delete local copy of receipt.txt
*If the fileset has not been processed or receipt status cannot be queried:
-if running with -ut: quit
-if running with -it: continue folding locally. Receipt will be checked when it is time to upload again.
-the fileset referenced by the receipt is kept on disk (and in filelist) until receipt is validated.
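For those who think in code, the rules above can be restated as a small sketch. This is purely illustrative (the real client is C code, judging by the log excerpts later in this thread); the function and value names here are invented for the sketch, not actual client internals:

```python
def decide(receipt_exists, status, just_started, ut_flag):
    """Illustrative restatement of the receipt rules above.

    status is one of "validated", "pending", "unreachable".
    Without -ut (e.g. normal runs or -it), an unvalidated receipt
    just means the client keeps folding locally and rechecks later.
    """
    if not receipt_exists:
        return "upload normally"
    if status == "validated":
        # receipt.txt gets deleted once validation is confirmed
        if just_started or ut_flag:
            return "upload buffered files"
        return "fold next generation"
    # fileset stays on disk (and in filelist) until the receipt validates
    if ut_flag:
        return "quit"
    return "continue folding locally"
```

Note the key property: a pending or unreachable receipt never discards work - the fileset stays buffered until validation succeeds.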
2. What happens if I delete receipt.txt?
Since the fileset referenced by the receipt is kept on disk until the receipt has been validated, removing the receipt will cause the same fileset to be uploaded again. While this will not have adverse effects on your generations, it will cause duplicate data to be uploaded to our servers, so it is best left alone.
3. Why do I have many buffered files and no errors in my log?
Some of the servers may accumulate a greater backlog of uploaded results to validate. If your fileset has been sent to one of the slower servers, it will take longer to get validated, because the server may still be processing files that were uploaded before yours. This means your receipt will not get validated for a while, and the client will continue working locally, producing more buffered results. This is NOT an error. Once your original fileset has been processed you will be able to upload your remaining buffered results.
4. Why does deleting receipt.txt sometimes allow me to upload all my buffered files?
Deleting the receipt.txt will cause the client to re-upload the same fileset the receipt was already waiting on. If this upload happens to hit a less loaded server machine, the receipt will get validated much faster, allowing further upload of remaining results.
5. Why does the upload time seem longer than previously?
With the new system, a small time overhead is added because the receipt must be fully validated before the client can proceed to uploading the following filesets. Although this may add to overall upload time, it ensures a more robust uploading mechanism.
6. What is the advantage of the ticketing system?
The ticketing system was implemented to reduce wasted client time during busy server periods. Previously the client would attempt to connect to the servers until the connection timed out, and then attempt to reconnect numerous times before proceeding to work locally. This time is now used to allow the client to work locally, as only a short check for the receipt status is performed.
The ticketing system removes any long connections to the servers, avoiding connection time-outs, which have been known to cause client and server results to become out of sync.
Elena Garderman
Originally posted by Stardragon
...snip...
*Check status of fileset referenced in receipt
*If the fileset has been fully validated on the server:
-if client was just started or running with -ut: continue uploading buffered files
-if upload was occurring at the end of a generation: proceed to fold next generation
-delete local copy of receipt.txt
...snip...
So it seems that if the client was just started or running with -ut there could be more than one generation uploaded. But if the client is just running normally then it can only upload one file at the end of a generation. So if it misses a generation it doesn't ever get a chance to catch up if it just continues to run?
Is this the reason why we can't upload more than one unit before we get disconnected? If you use a 56k modem the connection time has to be long.
Originally posted by Stardragon
The ticketing system removes any long connections to the servers, avoiding connection time-outs, which have been known to cause client and server results to become out of sync.
Last edited by HansArne; 04-26-2004 at 02:44 PM.
This is all just too amazingly delightful to know.
Now, when will you goose the servers so that we can get rid of these backed-up gens? At the current rate of accumulation, you will not be able to process them all before Universe End.
HOME: A physical construct for keeping rain off your computers.
If the client is already running, and has accumulated some buffered generations during its 'up time', the buffered files will not ALL get uploaded when the client finishes its current generation. A complete upload will happen when you restart the client (since it checks for buffered data before starting folding), or if you run the client with -ut. Keep in mind that this behaviour has NOT changed - the client operated like this before.
Elena Garderman
Exactly, how about uploading all buffered generations before continuing to crunch? We shouldn't have to resort to using -ut or restarting the client just to clear out our (currently substantial) buffers.
Originally posted by Welnic
So it seems that if the client was just started or running with -ut there could be more than one generation uploaded. But if the client is just running normally then it can only upload one file at the end of a generation. So if it misses a generation it doesn't ever get a chance to catch up if it just continues to run?
"If angels have voices, then surely they must sound like Loreena McKennitt" - me 1/2/04, somewhere over Illinois
Member of Free-DC
Thanks for this, Stardragon! I really appreciate the intent of the new ticketing model; I have been watching all my clients crunch at 100% CPU all weekend, whereas with the previous client they would pause while waiting for the server. I hope this fact is not lost on all those commenting on the weekend problems. This is a big improvement; my thanks!
Now my comment/question:
So far I would not characterize the added time overhead as "small". My clients are falling further and further behind; they're crunching generations faster than they can be uploaded with the new system, so lots are getting buffered. So I'm curious whether you're seeing this as well (and are working on diagnosis and fixing; if so I think everyone would be happy to hear it)... or whether things are working "as designed", in which case I'm not sure how we'll ever catch up on uploads.
Originally posted by Stardragon
5. Why does the upload time seem longer than previously?
With the new system, a small time overhead is added because the receipt must be fully validated before the client can proceed to uploading the following filesets. Although this may add to overall upload time, it ensures a more robust uploading mechanism.
Right now all my idle time is being wasted as I can't get results uploaded. This also limits us farmers who want to upload all the work from non-networked systems. We need a better way to upload the results than have the client simply try once then give up. I'm getting fewer results turned in now than when the servers were busy on the last protein. And this was supposed to be an upgrade?
6. What is the advantage of the ticketing system?
The ticketing system was implemented to reduce wasted client time during busy server periods. Previously the client would attempt to connect to the servers until the connection timed out, and then attempt to reconnect numerous times before proceeding to work locally. This time is now used to allow the client to work locally, as only a short check for the receipt status is performed.
The ticketing system removes any long connections to the servers, avoiding connection time-outs, which have been known to cause client and server results to become out of sync.
I do have a few systems that are connected to the internet and run 24/7. However they are all getting a large backlog of results that aren't being uploaded.
I think it's time to dump these slower systems as they seem to be causing problems. They seem to be the weak link in the ticketing system and may be doing more harm than good.
If your fileset has been sent to one of the slower servers, it will take longer to get validated
guru
SOME buffered gens? Elena, you better re-read the posts around here!
And that's the point: A complete upload doesn't happen at all. The client is able to upload three or four gens and then gets a receipt. With this behaviour of the servers we're not able to catch up with the uploads.
Originally posted by Stardragon
A complete upload will happen when you restart the client (since it check for buffered data before starting folding), or if you run the client with -ut.
I tested it all over the weekend....
I would much rather lose a generation or 2 to 910 errors than have this crippling overhead on uploads.
Once again a project has gone towards the online crunchers where this has far less impact.
It's hysterical to think that you are putting in a system that requires an overhead when it is evident that your backend cannot handle a fast protein with the older system.
This looks more like a method for drip feeding the ailing backend than validating uploads.
DF has no IDEA of the problems we are having. That's it, switching off. What a load of crap; do they have no idea? They should read the posts.
When I am talking about upload time, I am referring to the time it takes to upload the files and verify the receipt. Once the client has established that the receipt cannot yet be validated, and continues crunching locally, that time is not factored into the "active upload time".
Originally posted by Veneficus Fortis
<snip>
So far I would not characterize the added time overhead as "small". My clients are falling further and further behind; they're crunching generations faster than they can be uploaded with the new system, so lots are getting buffered. So I'm curious whether you're seeing this as well (and are working on diagnosis and fixing; if so I think everyone would be happy to hear it)... or whether things are working "as designed" in which case I'm not sure how we'll ever catch up on uploads.
We are working on a way to remove the backlog on the servers, and I will let you know once they appear to be caught up.
Elena Garderman
Sorry, Elena, but I'm calling :bs: When we were having trouble getting 58 AA gens uploaded, I would run nonet for a while (for efficiency's sake), then after I connected again, a few gens would get through. As the day went on (and I left the clients online), my load of buffered gens would drop, as the client would upload some of them whenever it could. Heck, I even tracked this behavior for a while in an attempt to determine the best time to leave my computers connected.
Originally posted by Stardragon
Keep in mind that this behaviour has NOT changed - the client operated like this before.
Now, if I'm misunderstanding your statement above, please accept my apologies. But if you are trying to tell me that the client never uploaded more than one gen after completing a gen, then my experience differs.
But this behaviour HAS changed. The client would upload all the buffered gens at the first opportunity, -ut or restart not needed. As it stands, this client will continue to accumulate buffered gens ad infinitum, unless the user constantly babysits.
Originally posted by Stardragon
Keep in mind that this behaviour has NOT changed - the client operated like this before.
Hardly anyone is crunching for God's sake.
Have you looked at the active crunchers?
Have you seen the amount of uploads?
Why on earth is there a backlog?
Why did you not state categorically that NO-one should delete receipt.txt?
If we are left to hash up workarounds because no-one properly tested this system then you are bound to get problems with such a fragile system.
I can see many frustrated people going "Bollocks to this" and scripting a receipt.txt deletion and upload routine.
I know I was tempted when only 5 gens uploaded in 4 hours.
Sorry but EVERYONE is having this problem. Helix_Von_Smelix is correct...they have no idea what is going on. I too am pulling the plug. We can bitch all we want to in this forum but nothing gets done. I've witnessed it after every changeover. It's time to voice your dissatisfaction by pulling the plug!
Originally posted by Fozzie
Once again a project has gone towards the online crunchers where this has far less impact.
That will be fairly obvious, since the stats will then pick up to a more normal pace and we'll be able to upload more than we are caching.
Originally posted by Stardragon
... I will let you know once they appear to be caught up.
Use the right tool for the right job!
Thanks for the clarification, and:
Thanks! Much appreciated.
Originally posted by Stardragon
We are working on a way to remove the backlog on the servers, and I will let you know once they appear to be caught up.
Unless the back-end becomes capable of absorbing several hundred units, all no-net options are gone. The ability to run a client off-line, zip up the entire directory, unzip to a connected computer and set it to -ut and forget, was crucial to many of the participants. Such dump clients are not able to just go back to off-line crunching while waiting for the results to be inserted. It is running on another computer just to upload.
Do you have any indication of why this backlog is so horrendous? Are the servers waiting for the database again?
Hm. With 319 generations buffered on one client, I tried stopping it and running with -ut (it contacted the server and quit on its own), and then I restarted the client... but it still has 319 generations buffered. So it doesn't appear that it did a "complete upload", if I'm understanding you correctly.
Originally posted by Stardragon
If the client is already running, and has accumulated some buffered generations during its 'up time', the buffered files will not ALL get uploaded when the client finishes its current generation. A complete upload will happen when you restart the client (since it check for buffered data before starting folding), or if you run the client with -ut.
My apologies. It seems my brain is being warped by all the issues I'm trying to fix on the server side. I have just double-checked everything, to make sure it is in fact me that is losing my mind. I have indeed posted a misleading explanation.
The behaviour of the client is as follows:
When it is time to upload (whether at startup, or while running), it will FIRST check the receipt - if the receipt has been validated, it will proceed to upload ALL files in filelist.txt, one by one - that is, files are uploaded until the client gets a receipt it cannot validate because it is pending on the server. In that case it will revert to local folding until it is time to upload again. The reason you are currently falling behind is the backlog on some of our servers. If your files end up on those servers, it will take longer to validate them, as files are processed in a first-in-first-out (FIFO) manner.
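Restated as a small sketch (illustrative Python only - the real client is C, and the `server`, `upload`, and `validate` names here are hypothetical, not actual client internals):

```python
def upload_cycle(pending_receipt, filelist, server):
    """Sketch of the upload behaviour described above.

    server.upload(fileset)   -> a new receipt for that fileset
    server.validate(receipt) -> True once the server has processed it (FIFO)
    Returns (receipt still pending or None, filesets still buffered).
    """
    # FIRST check any receipt left over from the previous upload.
    if pending_receipt is not None:
        if not server.validate(pending_receipt):
            return pending_receipt, filelist   # still pending: keep folding locally
        pending_receipt = None                 # validated: receipt.txt is deleted

    # Then upload ALL buffered files, one by one, until a receipt
    # comes back that cannot be validated yet.
    while filelist:
        receipt = server.upload(filelist[0])
        if not server.validate(receipt):
            return receipt, filelist           # pending on server: revert to folding
        filelist.pop(0)                        # validated: fileset can be removed
    return None, filelist
```

On a fast server every buffered fileset drains in one cycle; on a backlogged server the loop stops at the first pending receipt, which matches the behaviour people in this thread are seeing.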
Again, please accept my apologies for the confusion I have created. We are working on speeding up the backlogged servers so all results can get validated quickly, as they should be.
Elena Garderman
Nice, when you do, I have a backlog right here waiting for you to also catch up with.
Originally posted by Stardragon
We are working on a way to remove the backlog on the servers, and I will let you know once they appear to be caught up.
or perhaps I can think of a way to remove it.
If you're a Cow or FreeDC'er, I agree 100% and rally around your cry! Pull the plug!
Originally posted by SmallFry
It's time to voice your dissatisfaction by pulling the plug!
If you're a member of TSF, come join us in the TSF forum and we'll talk you down from the ledge.
Honestly, people, this is a volunteer project. There is no cause for angry accusations of incompetency toward the project coordinators. I've lost track of how many times I heard people say this weekend they were "just too stressed". Wha? It's perhaps (perhaps) a few days' cycles lost while DF makes some improvements for the long-term. If you're stressed, go do something else for a while and come back. Don't keep posting about how you're going to leave, just quietly take a time-out and come back when your stress is gone.
So how are we expected to upload gens from off-line servers?
Originally posted by Stardragon
Please read the following carefully - it should address all of the concerns that have been brought up over the last few days. If there is a question I missed, please ask it again and I will add a detailed answer.
--------------------------------------------------------
1. What is the ticketing system and how does it work?
*If the fileset has not been processed or receipt status cannot be queried:
-if running with -ut: quit
I currently have 3 dual 2.4 GHz Xeon servers - a total of 6 DF directories - producing around 1000 gens in a typical working day. At the end of the day, I zip up these directories, copy them to a USB disk drive and take them home to upload. The server directories are cleared using "foldtrajlite -purgeupload xxx" and restarted.
If the upload at home just quits whenever it can't validate, then I have ABSOLUTELY NO CHANCE of ever getting these generations uploaded unless I sit at the machines all night and babysit them, i.e. no chance of getting any :sleepy:
That's 14.4 GHz of processing power you've just lost in one fell swoop
Yea!
We are working on speeding up the backlogged servers so all results can get validated quickly
Overall I feel that the ticket feature is a good thing. But right now it's being overshadowed by the fact that it's not working 100%, i.e. the ticket servers are too slow.
guru
Since a lot of my "backlogged" generations are still from the old protein, now that it has been more than 48 hours, what happens? Do those proteins just get discarded by the server and it will continue to upload more generations? Or will the client get stuck forever on that generation and never proceed to the current protein's completed generations?
Originally posted by Stardragon
Again, accept my aplogies for the confusion I have created. We are working on speeding up the backlogged servers so all results can get validated quickly, as they should be.
Thanks,
Jeff.
Whatever you do, don't do a purgeupload list to get rid of them.
Originally posted by Digital Parasite
Since a lot of my "backlogged" generations are still from the old protein, now that it has been more than 48 hours, what happens? Do those proteins just get discarded by the server and it will continue to upload more generations? Or will the client get stuck forever on that generation and never proceed to the current protein's completed generations?
Thanks,
Jeff.
I did and now I have missing generations all over the place.
Looks like we have to send them to get a valid receipt.txt file
Greetings Elena.
Quick question.....why are there differences in the error log?
I've got 5 systems with approx 350-400 generations buffered and nothing in the error log, and with this computer, on a new install this morning, I get the following:
========================[ Apr 26, 2004 7:38 AM ]========================
Starting foldtrajlite built Apr 22 2004
========================[ Apr 26, 2004 7:39 AM ]========================
Starting foldtrajlite built Apr 22 2004
Mon Apr 26 09:24:00 2004 ERROR: [777.000] {ncbi_socket.c, line 1173} [SOCK::s_Connect] Failed SOCK_gethostbyname(anteater.blueprint.org)
Mon Apr 26 09:24:00 2004 ERROR: [777.000] {ncbi_connutil.c, line 801} [URL_Connect] Socket connect to anteater.blueprint.org:80 failed: Unknown
Mon Apr 26 09:24:15 2004 ERROR: [777.000] {ncbi_socket.c, line 1173} [SOCK::s_Connect] Failed SOCK_gethostbyname(anteater.blueprint.org)
Mon Apr 26 09:24:15 2004 ERROR: [777.000] {ncbi_connutil.c, line 801} [URL_Connect] Socket connect to anteater.blueprint.org:80 failed: Unknown
Mon Apr 26 09:24:30 2004 ERROR: [777.000] {ncbi_socket.c, line 1173} [SOCK::s_Connect] Failed SOCK_gethostbyname(anteater.blueprint.org)
Mon Apr 26 09:24:30 2004 ERROR: [777.000] {ncbi_connutil.c, line 801} [URL_Connect] Socket connect to anteater.blueprint.org:80 failed: Unknown
Mon Apr 26 09:24:30 2004 ERROR: [777.000] {ncbi_http_connector.c, line 101} [HTTP] Too many failed attempts, giving up
Mon Apr 26 09:24:32 2004 ERROR: [000.000] {foldtrajlite2.c, line 4380} Failed to query status for ticket 192.168.10.108_1082989220_27854
Just trying to understand how the client is working
How many generations do I have to buffer before I can start getting worried they may go to waste??
Here's the way I see it. Before, when we uploaded, it was buffered, verified and entered in the dB. Now, WE buffer, upload, the server verifies, and enters.
This takes resource strain off the upload servers by DOING LESS...Theoretically, it should go faster...IT'S NOT. Keep workin' at it....
Last edited by Gortok; 04-26-2004 at 03:46 PM.
Just to be curious: why was such a server-intensive solution chosen to solve this problem? Wouldn't it be easier to have two separate filelists: one with results that are currently being sent to the server (this one is locked until the results are validated by the server) and one as a temporary holder for results found while the first list is locked. This way uploads can be done in a separate thread of the client (even by a separate process) and computation can continue without interference.
Before I get flamed, please let me emphasize that this is the first time I've read something about the internals of this client, and I have no clue whether this idea is actually feasible for this particular project. (The idea is sound in principle, though.)
Last edited by SWfreak; 04-26-2004 at 03:55 PM.
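For what it's worth, the background-uploader idea proposed above might look roughly like this. Every name below is invented for the sketch; this is not actual client code, just the general producer/consumer pattern being suggested:

```python
import queue
import threading

def run(compute_results, upload, n_results):
    """Toy sketch of the proposal: computation keeps producing results
    while a separate thread drains them to the server."""
    pending = queue.Queue()        # results awaiting upload/validation
    done = threading.Event()

    def uploader():
        # Drain the queue until computation is finished AND it is empty.
        while not (done.is_set() and pending.empty()):
            try:
                result = pending.get(timeout=0.1)
            except queue.Empty:
                continue
            upload(result)         # may block on the server; folding continues
            pending.task_done()

    t = threading.Thread(target=uploader)
    t.start()
    for r in compute_results(n_results):
        pending.put(r)             # computation never waits on the server
    done.set()
    t.join()                       # let the uploader finish draining
```

The point of the pattern is exactly what the post argues: a slow validation server stalls only the uploader thread, not the folding loop.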
Though I'm no longer on the DF project, I will post a bit of advice... Concentrate on one thing at a time server side. Never, never, never (etc) change multiple things at the same time. If you implement something in the client, don't do anything else until that has been pushed out to the users. That way, you can compare and see what effect it had.
Originally posted by Stardragon
My apologies. It seems my brain is being warped by all the issues I'm trying to fix on the server side. I have just double-checked everything, to make sure it is in fact me that is losing my mind. I have indeed posted a misleading explanation.
There was indication that a couple of the servers (105, 106 and 110) were slower than the others - are these just backlogged or plain slow?
Originally posted by Stardragon
We are working on speeding up the backlogged servers so all results can get validated quickly, as they should be.
Thanks for the FAQ though - theory sounds good, but the infrastructure seems to be holding it back from becoming reality...
I buy the parts that make up the computers I use and pay for the electricity that powers these computers that run the DF client. Therefore I am allowed to get slightly pissed off when changes are introduced and I get left high and dry over the weekend when things go pear-shaped.
Originally posted by Veneficus Fortis
If you're a Cow or FreeDC'er, I agree 100% and rally around your cry! Pull the plug!
If you're a member of TSF, come join us in the TSF forum and we'll talk you down from the ledge.
Honestly, people, this is a volunteer project. There is no cause for angry accusations of incompetency toward the project coordinators. I've lost track of how many times I heard people say this weekend they were "just too stressed". Wha? It's perhaps (perhaps) a few days' cycles lost while DF makes some improvements for the long-term. If you're stressed, go do something else for a while and come back. Don't keep posting about how you're going to leave, just quietly take a time-out and come back when your stress is gone.
I would not say "stressed" is the way I feel about this project, more annoyed.
I fear getting everyone to be able to Upload is beyond all hope under the current circumstances...back to the pub
I am not a Stats Ho, it is just more satisfying to see that my numbers are better than yours.
Hicks!!!!!
Originally posted by Grumpy
I fear getting everyone to be able to Upload is beyond all hope under the current circumstances...back to the pub
As the temps in my room are at 30C before summer has hit these parts, it is becoming irritating realising that work is going to waste and just sitting here. More annoying is that the old protein work is now out of date. I have already switched off 4 machines to ease the temps here and the noise levels; another 7 of those less than 2000xp rating will follow tomorrow. Friday/Saturday I will really ease my temps if things are still the same. F@h looks like it will gain some of those machines, the rest will be retired permanently; CA$250/month for electricity that is basically going to waste is too much to ask me to bear.
Originally posted by Richard Clyne
I buy the parts that make up the computers I use and pay for the electricity that powers these computers that run the DF client. Therefore I am allowed to get slightly pissed off when changes are introduced and get left high and dry over the weekend when things go pear shape.
I would not say "stressed" is the way I feel about this project, more annoyed.
I no longer feel pissed off, I feel pissed on
The servers are processing the tickets slower than they should, indeed. We are going to get it fixed but it will take a few days to isolate and correct the problem. This is a complex system and not trivial to 'debug'. It sounds to me like the only real problem people are having is that it is buffering more than it is uploading. Once we fix the backend this will allow the buffered stuff to come through.
So no, we will not leave it as is; do not worry. As for all the posts, we cannot possibly answer everyone's questions - it has taken me several hours just to read a day's postings. Please see the FAQ we added on the web site (and in this forum) to answer some of your questions, and we will add to this if we feel it is necessary. Otherwise just hold tight and let us look at and fix the speed issues. If anyone has issues other than the buffering, let us know, preferably in a new thread with a constructive description of the problem.
We apologize for the inconvenience this may cause for off-line folders but we believe the new system will work much better once all the kinks are ironed out. Hopefully you will bear with us until then, or at least come back later when it is.
Howard Feldman