Just started getting messages suggesting faulty RAM:
RLEUnpack failed; size=10926; should=400*251...
The box is my DVD player. Nothing else is ever running on it. How good does my RAM need to be, all of a sudden???
The initial and the new RAM chip both give the same results.
Reinstalled the (freshly downloaded) client to make sure I hadn't scrambled something.
I was watching a flick and crunching, with this result. It would let me upload but won't let me crunch. Any ideas???
" All that's necessary for the forces of evil to win in the world is for enough good men to do nothing."-
Edmund Burke
" Crunch Away! But, play nice .."
--RagingSteveK's mom
http://www.free-dc.org/forum/showthr...&highlight=RAM
Did a clean install, but not from generation 0. Will try that...
[Minutes later] The problem seems to be resolved with a brand new, 100% clean install, beginning with gen 0.
Last edited by RaginSteveK; 12-24-2003 at 08:41 AM.
I just got the same thing:
Thu Jan 15 20:20:42 2004 FATAL ERROR: [023.024] {trajtools.c, line 2637} RLEUnPack failed, size=3, should=400*400 - likely this is caused by overclocked or faulty RAM chips, please test your RAM
This is the text client on WinXP and I am not overclocking at all. The RAM passes the Prime95 torture tests AND memtest86.
Howard: Is this an RLEUnPack problem on the .val.bz2 files?
I tried re-installing a new fresh copy of the DF client over top of what I have now, didn't help. Once I deleted filelist.txt to force it to start at gen 0 again, the problem went away and now the DF client doesn't complain.
I have kept all the relevant files if you are interested.
Jeff.
This error refers to the most recent .trj file being corrupt. It may even be size zero. Probably caused by a system crash/program kill at an inopportune time.
Howard Feldman
Originally posted by Brian the Fist:
"This error refers to the most recent .trj file being corrupt. It may even be size zero. Probably caused by a system crash/program kill at an inopportune time."
I have had this twice on two different machines, both duals and in both cases the second service. I have not restarted the process after the second crash so I still have the files if you want them. There was no system failure or stopping of the program as the machines were only being used to run DC projects.
I think this may be due to a memory leak. I have been running the client on some machines which have multiple language support. On leaving the machines over the weekend I logged off. On returning on the Monday, both machines insisted that the logon should be in French, even though the main language the previous week was English. On logging on, the default language was English, but if I logged out of either machine it insisted that the default language was French, including the keyboard layout.
HTH
Originally posted by Nanobot:
"I have had this twice on two different machines, both duals and in both cases the second service."
For me the EXACT same thing, on a "dual" (P4 with HT) and always the 2nd client (or at least the same client dir).
I have had this problem on the second instance of the application running on a Dual 2 GHz G5 Mac so the problem is cross-platform. I am not overclocked. I checked my RAM and it was good. And the crash occurred in the middle of a run.
As has been mentioned before, this error means that the latest trajectory distribution file (.trj) is corrupted. It could be related to your RAM (faulty or being used up by other applications) as well as some sort of crash on your system.
If you are running on a dual machine, make sure you have your temp directories separated as well. You can simply add it to the foldit script of each installation, as follows (one line per installation):
set DFPTEMP=/distribfold1/TEMP
set DFPTEMP=/distribfold2/TEMP
where distribfold1 and distribfold2 are the installation directories. Make sure to create the different TEMP directories first.
Elena Garderman
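The separate-temp-directory advice above can be sketched as a small helper. This is only an illustration under assumptions: the only name taken from the thread is the DFPTEMP variable; the functions `instance_env` and `temp_dirs_are_distinct` are hypothetical, and how the client is actually launched is not shown.

```python
import os
from pathlib import Path


def instance_env(install_dir, base_env=None):
    """Create <install_dir>/TEMP and return an environment dict whose
    DFPTEMP entry points at it (hypothetical helper, not part of the client)."""
    temp_dir = Path(install_dir) / "TEMP"
    # The thread's advice: create the TEMP directory before starting the client.
    temp_dir.mkdir(parents=True, exist_ok=True)
    env = dict(os.environ if base_env is None else base_env)
    env["DFPTEMP"] = str(temp_dir)
    return env


def temp_dirs_are_distinct(envs):
    """Return True when no two instances share the same DFPTEMP value."""
    temps = [env["DFPTEMP"] for env in envs]
    return len(set(temps)) == len(temps)
```

Each returned dict could then be passed as the environment when starting the client from its own installation directory, one dict per instance.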
"As has been mentioned before, this error means that the latest trajectory distribution file (.trj) is corrupted. It could be related to your RAM (faulty or being used up by other applications) as well as some sort of crash on your system.
If you are running on a dual machine, make sure you have your temp directories separated as well."
I'm running a dual G5 Mac with your PowerPC/Darwin distribution. As far as I know, it doesn't need temp directories, merely that the path to each instance be different and I've run two instances for several weeks without difficulty using that method.
The failure of one of my instances occurred in mid-run. I checked the RAM and it was fine. I downloaded a complete new application from you, obtained a new work unit and the problem continued. I doubt it was corruption at my end. I'm from Toronto.
I had three clients crash over the weekend with the same error, on three different boxes.
I have multiple clients running on W2K server - one for each CPU.
Each client runs from its own folder, with its own TEMP space.
These are very high-end servers, with excellent memory that has exhibited no other problems. The machines have had no crashes or other failures - the remaining clients on the boxes each kept crunching. They have at least 2GB of RAM in each box.
I *seriously* doubt that this many people would start to have memory problems all at once. It would be a remarkable coincidence.
I think the project needs to look elsewhere rather than blaming bad memory.
willy1
Originally posted by Stardragon:
"As has been mentioned before, this error means that the latest trajectory distribution file (.trj) is corrupted. It could be related to your RAM (faulty or being used up by other applications) as well as some sort of crash on your system."
If the same client crashes over and over on all kinds of different platforms, OSes and (especially dual) machines, couldn't it just be that maybe, just maybe, there is something wrong with the client?
There are numerous people with a dual setup, be it actually two CPUs or HT, who have to deal with random crashes of one of the clients, and yes, we did set up the temp dirs.
I've posted a quite elaborate description of my (home) system and attached some files, and saved the complete dirs for weeks, but nobody wants them.
So with all due respect, since this distributed computing is just a means to your end, could you maybe give these kinds of posts, from users who do this only for competition and science, a bit more serious attention instead of waving them away as a problem with our hardware/system?
Taking into consideration the server problems of the last few days, the least of your worries should be OUR hardware.
So again, a quick summary of the error:
- on dual setups
- on Windows, Linux and MacOSX
- always the same (2nd) client
Last edited by Escrimador; 03-29-2004 at 04:53 PM.
We will try to reproduce the error on our end. If you indeed have saved folders from this crash, please upload them to ftp.blueprint.org/incoming, and send a descriptive e-mail to trades@mshri.on.ca notifying us of your upload.
Are there any errors appearing in the error log immediately before the RLE Unpack error?
Elena Garderman
Thu Mar 25 17:05:16 2004 ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
Thu Mar 25 17:05:16 2004 ERROR: [000.000] {foldtrajlite2.c, line 4933} Error during upload: NO RESPONSE FROM SERVER - WILL TRY AGAIN LATER
Thu Mar 25 17:35:23 2004 ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
Thu Mar 25 18:02:53 2004 ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
Thu Mar 25 18:02:53 2004 ERROR: [000.000] {foldtrajlite2.c, line 4933} Error during upload: NO RESPONSE FROM SERVER - WILL TRY AGAIN LATER
Thu Mar 25 18:28:22 2004 ERROR: [000.000] {foldtrajlite2.c, line 4933} Error during upload: NO RESPONSE FROM SERVER - WILL TRY AGAIN LATER
Thu Mar 25 19:20:29 2004 ERROR: [010.003] {taskapi.c, line 1218} [ReadServerResponse] Timeout waiting for response, got 0 chars.
Thu Mar 25 19:25:29 2004 FATAL ERROR: [023.024] {trajtools.c, line 2637} RLEUnPack failed, size=5604, should=400*20 - likely this is caused by overclocked or faulty RAM chips, please test your RAM
willy1
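To answer the "any errors immediately before the RLE Unpack error?" question without reading the log by eye, something like the following could be used. It is a rough sketch that assumes every log line follows the layout seen in the excerpts in this thread (timestamp, level, bracketed code, source file and line, message); the function names and the `max_context` parameter are made up for illustration.

```python
import re
from datetime import datetime

# Line layout assumed from the log excerpts posted in this thread.
LOG_LINE = re.compile(
    r"^(?P<ts>\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"(?P<level>FATAL ERROR|ERROR): "
    r"\[(?P<code>\d{3}\.\d{3})\] "
    r"\{(?P<file>[^,]+), line (?P<line>\d+)\} "
    r"(?P<msg>.*)$"
)


def parse_line(text):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_LINE.match(text.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%a %b %d %H:%M:%S %Y")
    entry["line"] = int(entry["line"])
    return entry


def errors_before_rle_failure(lines, max_context=5):
    """For each fatal RLEUnPack entry, return (fatal_entry, preceding_entries),
    where preceding_entries holds up to max_context earlier log entries."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    results = []
    for i, entry in enumerate(entries):
        if entry["level"] == "FATAL ERROR" and "RLEUnPack failed" in entry["msg"]:
            results.append((entry, entries[max(0, i - max_context):i]))
    return results
```

Run over the log above, this pairs the 19:25:29 fatal error with the server timeouts that precede it, the last of which was logged five minutes earlier.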
Code:
========================[ Mar 23, 2004 12:22 AM ]========================
Starting foldtrajlite built Jan 12 2004
========================[ Mar 23, 2004 9:44 AM ]========================
Starting foldtrajlite built Jan 12 2004
========================[ Mar 25, 2004 9:50 AM ]========================
Starting foldtrajlite built Jan 12 2004
========================[ Mar 26, 2004 9:31 AM ]========================
Starting foldtrajlite built Jan 12 2004
========================[ Mar 26, 2004 10:49 AM ]========================
Starting foldtrajlite built Jan 12 2004
========================[ Mar 26, 2004 11:26 AM ]========================
Starting foldtrajlite built Jan 12 2004
========================[ Mar 29, 2004 5:24 PM ]========================
Starting foldtrajlite built Jan 12 2004
========================[ Mar 29, 2004 5:33 PM ]========================
Starting foldtrajlite built Jan 12 2004
========================[ Mar 30, 2004 10:00 AM ]========================
Starting foldtrajlite built Jan 12 2004
Tue Mar 30 12:41:43 2004 FATAL ERROR: [023.024] {trajtools.c, line 2637} RLEUnPack failed, size=3, should=400*400 - likely this is caused by overclocked or faulty RAM chips, please test your RAM
========================[ Mar 31, 2004 10:15 PM ]========================
Starting foldtrajlite built Jan 12 2004
Wed Mar 31 22:15:29 2004 FATAL ERROR: [023.024] {trajtools.c, line 2637} RLEUnPack failed, size=3, should=400*400 - likely this is caused by overclocked or faulty RAM chips, please test your RAM
========================[ Mar 31, 2004 10:34 PM ]========================
Starting foldtrajlite built Jan 12 2004
========================[ Apr 1, 2004 9:53 AM ]========================
Starting foldtrajlite built Jan 12 2004
Thu Apr 01 09:53:09 2004 FATAL ERROR: [023.024] {trajtools.c, line 2637} RLEUnPack failed, size=3, should=400*400 - likely this is caused by overclocked or faulty RAM chips, please test your RAM
========================[ Apr 2, 2004 10:45 AM ]========================
Starting foldtrajlite built Jan 12 2004
Fri Apr 02 10:45:24 2004 FATAL ERROR: [023.024] {trajtools.c, line 2637} RLEUnPack failed, size=3, should=400*400 - likely this is caused by overclocked or faulty RAM chips, please test your RAM
and again on a P4-HT 2.4 GHz, not o/c, Win XP Pro
Code:
========================[ Apr 8, 2004 4:41 PM ]========================
Starting foldtrajlite built Jan 12 2004
========================[ Apr 8, 2004 4:42 PM ]========================
Starting foldtrajlite built Jan 12 2004
========================[ Apr 8, 2004 4:46 PM ]========================
Starting foldtrajlite built Jan 12 2004
========================[ Apr 8, 2004 4:47 PM ]========================
Starting foldtrajlite built Jan 12 2004
========================[ Apr 8, 2004 4:48 PM ]========================
Starting foldtrajlite built Jan 12 2004
========================[ Apr 8, 2004 5:03 PM ]========================
Starting foldtrajlite built Jan 12 2004
Thu Apr 08 19:53:17 2004 FATAL ERROR: [023.024] {trajtools.c, line 2637} RLEUnPack failed, size=8470, should=400*162 - likely this is caused by overclocked or faulty RAM chips, please test your RAM
Any hope that the bad RAM problem on multiple cpus will be solved in the April 20 protein release?
We have not solved this yet, sorry. The number in the error message is important. If it says size=3, should=400*400, this usually means the .trj file is missing or empty for some reason. If it is some other number, it almost certainly IS a result of faulty RAM: the .trj is there and readable, but is essentially failing a consistency check.
I suspect the 2-CPU problems may be a result of mutexes and semaphores in the NCBI toolkit which we are using. We have been unable to pinpoint it just yet though. It is an extremely difficult problem to debug, as it occurs apparently at random and relatively infrequently. It's on our list of things to fix, of course!
Howard Feldman
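The rule of thumb above (size=3 versus any other size) can be turned into a quick triage of error-log messages. A minimal sketch, assuming the message wording matches the FATAL ERROR lines quoted in this thread; the function name and the two labels are invented here, and the mapping simply restates the explanation above rather than anything in the client's source.

```python
import re

# Matches the "size=..., should=A*B" part of the fatal error message.
RLE_ERROR = re.compile(r"RLEUnPack failed, size=(\d+), should=(\d+)\*(\d+)")


def classify_rle_error(message):
    """Return a rough label for an RLEUnPack failure message, or None
    when the message is not an RLEUnPack failure."""
    match = RLE_ERROR.search(message)
    if match is None:
        return None
    size = int(match.group(1))
    if size == 3:
        # Per the explanation above: usually a missing or empty .trj file.
        return "trj-missing-or-empty"
    # Any other size: the .trj was read but failed the consistency check.
    return "consistency-check-failed"
```

By this rule, the repeated size=3 entries posted earlier point at a missing or empty .trj, while the size=8470 and size=5604 entries fall into the consistency-check case.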