Missing Generations..



tpdooley
09-25-2003, 02:18 PM
I've got a collection of machines that are on 24/7 - and some that are only run for a few hours a day. All are hooked up to a cable modem attached to a Linksys BEFSX41 cable/DSL router; the traffic is in turn transmitted to a satellite, bounced back to Earth and put on a 1500 mile fiber optic cable to Seattle.

I'm noticing a fair number of my machines getting "error writing body at offset" messages - sometimes (not always) it's followed by the dreaded Missing Previous Generation error.. which continues till gen 250.

I'm seeing this on a WinXP Pro machine, a variety of WinXP home machines, and I'll have to go verify the Win2k and Win98 machines.

What kind of hardware and network connections are others having this kind of problem using?

AMD_is_logical
09-26-2003, 11:18 AM
Someone on our team had a similar problem with a cable ADSL connection. The connection was usually good, but at a certain time of day it became flaky for a while. He would often end up with the dreaded missing generation errors. :(

It's clear that a flaky network connection can cause generations to slip through the cracks and into the abyss. That should never happen. :mad: The client should never delete the generation until after it learns that the result has been successfully stored away on the server.
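
The rule in that last sentence can be sketched in a few lines. This is only an illustration of the ordering (upload, wait for confirmation, then delete), not the actual DF client code; `upload_then_delete` and the `send` callable are hypothetical names invented for the example.

```python
import os

def upload_then_delete(path, send):
    """Delete `path` only after `send` confirms the server stored it.

    `send` is a hypothetical callable standing in for the real upload:
    it should return True only once the server has acknowledged the
    data. A False result or a network error leaves the file in place.
    """
    try:
        acknowledged = bool(send(path))
    except OSError:
        # Dropped connection, timeout, etc.: treat as "not stored".
        acknowledged = False
    if acknowledged:
        os.remove(path)  # safe now: the server has confirmed the result
    return acknowledged
```

With this ordering a flaky connection can only cause a retry later; it can never cause the local copy of a generation to vanish before the server has it.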

This is a very serious bug, and we need to keep after the DF team until they fix it!
:whip:

Brian the Fist
09-26-2003, 11:38 AM
It will be fixed as soon as I find out why/where the client thinks that the file should be deleted. Still unable to reproduce this but working on another idea that may avoid it altogether.

tpdooley
09-26-2003, 03:26 PM
My Win98 machines at work were all fine; but the Win2k machine had the lost gen error message in the error.log

I believe that I updated the client on all those machines when you announced having a new client that we had to actually download - weeks ago. Perhaps the Win2k machine didn't have the error.log removed. (Would it be possible to have the version number of foldtrajlite appended to the start notice in the error log?)


So I can say that the new client on Win98 machines doesn't seem to be very susceptible to the lost gen problem; but every one of my WinXP and Win2k machines with the old client seem to have been hit with it at least once or twice.

If the Win2k machine at work had its error.log deleted when I updated it to the new client - then the newer client doesn't work perfectly with Win2k.
I'll keep an eye out for this when we get all new clients with the next protein change.

bwkaz
09-26-2003, 06:58 PM
Originally posted by tpdooley
(Would it be possible to have the version number of foldtrajlite appended to the start notice in the error log?)

Dunno if you saw this or not, but we're going to be getting timestamps in error.log (every message) in the next release.

:|party|:

(This should accomplish the same thing as you're thinking, right?)

tpdooley
09-26-2003, 08:31 PM
I have two different versions of foldtrajlite.exe: ones that I manually updated a few weeks ago, and ones that were updated on an older protein update.

The time stamps will show when an error occurred. I'd find it a bit more helpful to also see what foldtrajlite version is running. It would make it easier to see when you updated foldtrajlite - and whether problems disappeared. :)

(I noticed another of my problem WinXP machines was running the old client)..

tpdooley
09-28-2003, 09:30 PM
I've run 3 machines here at work with the newest client running on Win98; no 910 missing gen errors since they got the new client (for weeks).

I've got 3 machines here at work that were running an Aug 12th version of foldtrajlite.exe; so this weekend, I upgraded them to the latest client, and ran all 3 Friday and Saturday.. and showed up today and noticed ALL three machines had difficulty writing packets, and after enough of that difficulty, they switched to the 910 errors.

The win98se machines are also having occasional problems writing packets - either at the 8k or 12k range, and then they merrily upload the next time without the 910 missing gen error.

Both Win2k machines were running nonet (-if); although I can switch one of those back to running online.

What is it about these WinXP machines that allows them to lose gens when Win98se doesn't?

tpdooley
10-04-2003, 09:24 PM
I've been running just one WinXP machine live all week; all the other WinXP machines are running nonet. Friday, I left 2 machines running live; and miraculously.. sometime after 6:50pm, the pair had repeated problems writing data to the servers (at 8192 and 12k bytes); and both switched to the 910 Missing Previous Generation error messages after every uploaded gen.


The Win98SE machines are all smiling along and folding away (even after the errors writing at 8k and 12k bytes). The Win2k machine is running nonet and doesn't have a problem. The WinXP machines that run nonet have no problem (yet)..

What more would you like? I can run Ethereal next weekend and leave 2 WinXP machines running live - and send you a burned DVD with all the network traffic over the weekend..

(30 megs an hour per machine of Battlefield communication adds up in a hurry.. :)

tpdooley
10-10-2003, 01:15 AM
My now sole Win2k machine ran live for a few days, and lost a generation after a write error at 8192 bytes.. (and after dozens of 910 error messages, it looks like the gen got lost in the ether..)

And my cable modem ISP has admitted that they've been kicking people off every night to try and recycle ip#s. (They should just make us all use NAT! :)

I've switched to the new client as of today, and will see if it handles the mistreatment at night better than the older client did.

tpdooley
10-12-2003, 11:59 PM
okay.. here's my home machine - running WinXP sp1 with all the critical updates. Axp2600+ 333mhz front side bus, nforce2 chipset, and I have Norton Anti virus 2003 running.

--------------------------------

========================[ Oct 10, 2003 12:31 PM ]========================
Starting foldtrajlite built Oct 2 2003
Sun Oct 12 14:47:09 2003 ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
Sun Oct 12 14:47:09 2003 ERROR: [000.000] {foldtrajlite2.c, line 4693} Error during upload: NO RESPONSE FROM SERVER - WILL TRY AGAIN LATER
Sun Oct 12 14:56:42 2003 ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
Sun Oct 12 14:56:42 2003 ERROR: [000.000] {foldtrajlite2.c, line 4693} Error during upload: NO RESPONSE FROM SERVER - WILL TRY AGAIN LATER
Sun Oct 12 15:26:13 2003 ERROR: [777.000] {ncbi_http_connector.c, line 244} [HTTP] Error writing body at offset 8192
Sun Oct 12 15:28:55 2003 ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
Sun Oct 12 15:28:55 2003 ERROR: [000.000] {foldtrajlite2.c, line 4693} Error during upload: NO RESPONSE FROM SERVER - WILL TRY AGAIN LATER
Sun Oct 12 15:34:09 2003 ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
Sun Oct 12 15:34:09 2003 ERROR: [000.000] {foldtrajlite2.c, line 4693} Error during upload: NO RESPONSE FROM SERVER - WILL TRY AGAIN LATER
Sun Oct 12 15:36:26 2003 ERROR: [777.000] {ncbi_http_connector.c, line 244} [HTTP] Error writing body at offset 8192
Sun Oct 12 15:36:57 2003 ERROR: [777.000] {ncbi_http_connector.c, line 244} [HTTP] Error writing body at offset 8192
Sun Oct 12 15:37:27 2003 ERROR: [777.000] {ncbi_http_connector.c, line 244} [HTTP] Error writing body at offset 8192
Sun Oct 12 15:37:27 2003 ERROR: [777.000] {ncbi_http_connector.c, line 101} [HTTP] Too many failed attempts, giving up
Sun Oct 12 15:37:27 2003 ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
Sun Oct 12 15:37:27 2003 ERROR: [000.000] {foldtrajlite2.c, line 4693} Error during upload: NO RESPONSE FROM SERVER - WILL TRY AGAIN LATER
Sun Oct 12 15:44:02 2003 ERROR: [000.000] {foldtrajlite2.c, line 4616} Warning during upload: STATUS 910 MISSING PREVIOUS OR ILLEGAL GENERATION
-------------------------------------------------
I'll compare that to the Win98se machines, Win2k machine and WinXP machine(s) running live at work.
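
For comparing logs across machines, a small script can tally how often each error source shows up. The sketch below is based only on the line layout visible in the excerpt above (timestamp, `ERROR:`, bracketed code, `{file, line N}`, message); the function name `tally_errors` is made up for the example, and other client builds may format their lines differently.

```python
import re
from collections import Counter

# Layout taken from the error.log excerpt above, e.g.
# Sun Oct 12 15:26:13 2003 ERROR: [777.000] {ncbi_http_connector.c, line 244} ...
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}) ERROR: "
    r"\[(?P<code>\d+\.\d+)\] \{(?P<src>[^,]+), line (?P<line>\d+)\} (?P<msg>.*)$"
)

def tally_errors(log_text):
    """Count ERROR lines per (source file, line number).

    Lines that do not match the layout (separators, the
    "Starting foldtrajlite built ..." notice) are skipped.
    """
    counts = Counter()
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            counts[(match["src"], int(match["line"]))] += 1
    return counts
```

Running `tally_errors(open("error.log").read())` on each machine and comparing the counts for `ncbi_http_connector.c` line 244 (write errors) against `foldtrajlite2.c` line 4616 (the 910 warning) would show whether write errors are followed by lost generations on one OS but not another.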

From what I'm seeing - it looks like 1. I don't have a perfect internet connection (ya, I knew that), and 2. Win2k and WinXP for some reason are seeing an acknowledgement that a packet was accepted by the server and throwing away the generation - but the server didn't send that. Or the client didn't finish sending the whole generation - and is being told to erase the whole generation.

(My win98se machines.. 2 without NAV, and 1 with Nav 2003 don't seem to have these problems).

I believe I see 3 files for every generation on my nonet machines. I assume there's some form of handshaking going on. Can the acknowledgement from the server be personalized? For each of the 2 or 3 files being uploaded.. a log, a min.val and a .trj file.
i.e. usercode-generation#-filetype-(parity code on the file.. parity code for this receive packet).

(JhihadAgainstBarney)-gen50-log-(pc file)-(pc packet).

(with some kind of provision for either the client or server not receiving the packet or acknowledgement).
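
To make the idea above concrete, here is a rough sketch of such a personalized acknowledgement. It is not how the DF server works; it only illustrates the proposed field order. CRC32 stands in for the "parity code", and the names `checksum`, `make_ack` and `ack_matches` are invented for the example.

```python
import zlib

def checksum(data: bytes) -> str:
    """CRC32 as 8 hex digits; a stand-in for the 'parity code'."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")

def make_ack(user: str, generation: int, filetype: str,
             file_bytes: bytes, packet: bytes) -> str:
    """Build usercode-generation-filetype-(file code)-(packet code)."""
    return (f"{user}-gen{generation}-{filetype}-"
            f"{checksum(file_bytes)}-{checksum(packet)}")

def ack_matches(ack: str, user: str, generation: int, filetype: str,
                file_bytes: bytes, packet: bytes) -> bool:
    """Client-side check: does the ack describe exactly what was sent?

    The client rebuilds the token from its own copy of the data, so a
    missing, stale or garbled acknowledgement simply fails to match.
    """
    return ack == make_ack(user, generation, filetype, file_bytes, packet)
```

The client would keep each file until `ack_matches` returns True for it; a lost packet or lost acknowledgement then just means the match never happens and the upload is retried.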


------------------------------
Getting someone to run this client as a service is enough of a hassle - getting them to switch to win98se and run it as a visible client is going to be impossible. :)

As usual... I'll hold onto a copy of this directory for awhile - and nuke the filelist and files.. so it can start over and stop the long string of 910 errors that it currently has..

and thanks for both the time stamp and build date info, Howard.

Brian the Fist
10-14-2003, 12:42 PM
We are working on a more complex handshaking procedure when uploading to the server which should help eliminate these sorts of problems. We hope to have it going in a couple of weeks so please just bear with us until then and hopefully all these problems will disappear (or new ones will appear :whistle: )

tpdooley
10-14-2003, 05:54 PM
Originally posted by Brian the Fist
so please just bear with us

As a person from Kodiak, Alaska I have to say BOO to the unintended pun. :) Especially as how we had a sickly starving old bear gobble up a set of misguided bear "experts" on the mainland this past week.

.. will impatiently wait for the release of the new client and hope it puts this problem to rest for my poor Win2k and WinXP machines. :)
Thanks..

tpdooley
11-04-2003, 05:26 AM
Well.. after a long weekend (5 days total) of monitoring 5 XP machines and 1 Win2k machine at one location and 1 WinXP machine at another - I've yet to see the 910 error reappear. (We had several problems with my cable service over the last few days, which always caused these errors to spring up, especially with several machines behind a router.)

So the new handshaking really seems to help. Thanks! :cheers: :thumbs:

Now to wait for the next client executable so the error logs stop getting filled up with the ignorable coordinates error message. :)

Brian the Fist
11-04-2003, 10:48 AM
Originally posted by tpdooley

Now to wait for the next client executable so the error logs stop getting filled up with the ignorable coordinates error message. :)

Yep, I've located exactly WHERE that is coming from, and WHEN, but not WHY... :confused: