Anyone else have this problem?
I am finding that this new protein just dies for no reason and with no error message. It quits before it can write to the log. This was very rare on the last couple of proteins, but with this one I am having several deaths a day on various PCs, some 45 km apart :bs: Reinstalling makes no difference. I run all PCs with the window visible now (still in quiet mode) so I can monitor them. And yes, dfGUI has no clue as to their demise either.
I am not a Stats Ho, it is just more satisfying to see that my numbers are better than yours.
Anyone else have this problem?
Howard Feldman
I have seen this happen now and then, but only on my WIN2K servers, running dual procs. It just seems to shut itself down, with no entry to the log. Sometimes it can be restarted, but usually I have to dump the filelist.txt and start it over
I have 20 clients running on dual Xeon boxes. Every morning, there are 3 to 5 that have stopped in the last 24 hours. No errors, just stopped at the end of a generation. In most cases, I have had to dump the entire work unit and start over with gen 0.
All boxes are W2K Adv server, with dual HT Xeons running 1 client per virtual CPU, from different folders. This has not been a problem until this protein.
I have the same problem.
The client has died twice so far.
Quietly, with no error log, and even dfGUI thought it was still running.
I am still getting random freezes under FreeBSD. Everything looks fine when you do a ps -ax, but you have to do a kill -9 on the foldtrajlite process to get it to stop; it ignores the removal of the lock file. This happens on several machines: I get one or two freezes every day across a group of 20 different machines. If I run Linux on the same machines the problem does not occur, and they seem to get through a set of 250 faster than the same machine on FreeBSD. This also happened on the previous protein. All machines are running with the -q flag.
I have 17 AMD systems running, most overclocked to about 125%, and 1 P4 system. The OSes are mostly Win2k; some are Win98SE.
None have exhibited this behavior....
Too many computers, too little time......
Happened to me the other day. I'm running an AMD Athlon XP 2000+ on Win2K.
I only found out because my HSF (Volcano 9) is on auto fan speed, so the computer was much quieter than usual.
Could someone possibly provide a screenshot immediately after this mysterious death? And could someone explain why you have to delete all your work and start again, is there an error message of some kind after this happens? What is it? Help me help you.
Howard Feldman
I have never seen it show any error message in the logs. All my clients are running as a service, so there's no window to display a message in. dfGUI still shows the client as running. I have tried killing the lock file and restarting (which fails), restarting the service (which fails), and issuing a "Recover" command through the dfGUI interface (which will sometimes actually restart it, although not very often). It is at this point that I kill filelist.txt and restart it, because nothing else seems to bring it back. Hope this helps some.
I just found this on a Linux box: ERROR: [001.001] {foldtrajlite2.c, line 2026} Caught sig 11
It was running with -if -qt -rt. It had 127 structures buffered, which I was able to upload. It was down for 24 hours.
I had one that died with sig11. I regreased the muffler bearings, changed the air in the tires, updated the BIOS, and it's still been folding just fine ever since!
HOME: A physical construct for keeping rain off your computers.
If it gives me more problems I will at least kick the tires.
Had some odd things happening to me last night. I have a few screenshots at home that I will upload later.
I noticed dfGUI was showing -1 gens buffered. I made the client visible, and it showed the same.
I shut down the client, but it errored and was killed by Dr. Watson (I'll post the dump later too).
I restarted, and all saved generations were gone (lost about 24 buffered gens).
I woke up this morning and found that the protein had killed itself. Lost 12 hours of work.
I'll post the stuff when I get home. Hope it helps
Here is a ZIP containing all the relevant stuff.
Hope this helps. If there is anything else I can provide, let me know.
It seems that my upload (around 250 gens) caused this. There are numerous errors in the error.log related to the upload, and I noticed that I didn't receive points for any of the generations that I uploaded. Maybe that's it...
I'm on dial-up, by the way. It's going through another computer on my LAN, via ICS. I also use an HTTP proxy occasionally, but not this time.
I thought it was just me!! I haven't found anything to show you, Howard, but I find my client turned off every morning when I wake up and every evening when I get home from work.
Hmm... I thought I didn't have it, but maybe this is it:
To start with, it seems like the GUI sits at 46/4 "structures complete/structures remaining" for a long time to begin with.... I fold off-line...
I'm finding systems sitting there forever. Typically they might have 20 or so buffered generations, while other systems are now up to 150 or so buffered...? And they all get up-loaded around the same time of day over the course of 2-3 hours.
I thought the Overclock was hanging them up, but they will do it even when Underclocked...... Checking CPU temperature, I can tell the client is not running....
Seems to happen on only 2-3 systems and not all... I used the "distribfold-update.exe" to update them after one system downloaded it.
When the worst system gets to the "10K run", I'll install the "new" downloaded client package and see what happens....
And finally, like others have indicated, the error log shows nothing...
You guys on Windows machines might want to try dfDetect, something I wrote a long time ago (I think you need to be able to run VBScript files as well). If your client goes down for any reason, it'll start it again. You can also set it up to shut down DF while running other programs on the computer.
If anyone is interested in the source, send me a PM.
It's very small so it should be pretty easy to port to Linux, too.
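For anyone who wants to try the same restart-on-exit idea without VBScript, here is a minimal sketch in Python. This is not the dfDetect source; the function name keep_running and its parameters are made up for illustration, and the command you pass in is whatever starts your client.

```python
import subprocess
import time


def keep_running(cmd, max_restarts=3, delay=5.0, cwd=None):
    """Run cmd and start it again each time it exits, up to max_restarts restarts.

    Returns the list of exit codes seen, one per run.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        proc = subprocess.Popen(cmd, cwd=cwd)
        exit_codes.append(proc.wait())  # blocks until the client process exits
        if attempt < max_restarts:
            time.sleep(delay)  # short pause so a crash loop does not spin the CPU
    return exit_codes
```

A call might look like keep_running(["./foldtrajlite", "-f", "protein", "-n", "native"], cwd="."), using the command line from foldit.bat. Note that this only catches a client that actually exits; it would not notice a process that is still alive but frozen.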
Last edited by m0ti; 10-11-2003 at 09:24 AM.
Team Anandtech DF!
I agree with Rebels Haven, it just seems to sit there forever, although I can tell by looking at the Task Manager and MB Monitor that the client takes all the CPU time it can get.
It then suddenly jumps to the next generation.
I didn't have that with the old updated version, but after I did a fresh install, it did that...
Don't know what's wrong....
Greets Thor
If you set it to use -g 1 (that is, in dfGUI, set "progress update" to 1 and restart the client), I bet it "hangs" at 49 structs (or maybe 50) instead of 46.
In other words, I bet this is completely normal behavior, because the client is either minimizing energy, or doing whatever else it does between generations.
"If you fail to adjust your notion of fairness to the reality of the Universe, you will probably not be happy."
-- Originally posted by Paratima
Hmmm... So Thor, you're saying it didn't do it with the updated protein, but it does it with the new one...???
Now that I'm watching, I have one machine that is always falling behind the others.. I happened to crash the OS yesterday and reloaded 2K, but it's still doing it...... I'm gonna try the new client package when I finish the 250 structures, and I'll report back....
bwkaz, anything is possible I guess, but the problem is confined to only 1 or 2 out of 18 systems I'm running....
Any ideas what happened with my client?
I installed the new windows package, and have yet to see the same thing happen. I'll let you know if it happens again.
Wait a minute, it seems that I was confused. Oops.
I thought you were seeing it hang, then start up again. But I see after re-reading some of the old posts that the client just plain dies.
So never mind. I just don't know what I'm reading, that's all.
Somehow I missed this earlier.
All mine that 'quietly stop' are recording an error in the error.log:
FATAL ERROR: CoreLib [002.005] {ncbifile.c, line 715} File write error
Not a permissions problem - client can run for days before giving this error, run as admin user.
Not a disk space error - 10 to 20 GB of free space on every box.
Not every client on every box fails at the same time.
Always stops at Structure 50 at the end of a generation.
All boxes are W2K Adv. Server, multi HT Xeon, lots of RAM
I had 3 client sessions on 3 different boxes fail over the weekend with the same error.
Last edited by Angus; 10-13-2003 at 12:23 PM.
Angus, are you running more than one instance of the client on all the boxes that are having trouble? Did you tell each instance to use a different directory for its temp files?
I'm running four instances on each box, each living in its own folder and started with its own dfGUI from that folder.
I haven't done anything else, and it's been working fine like this for a long time.
This is something new with the latest update.
Create two folders
mkdir \distribfold1
mkdir \distribfold2
extract the client into each folder
make sure you put handle.txt in each folder
make sure autoupdate.cfg is in each folder
edit foldit.bat in each folder, look for the line that looks like this
.\foldtrajlite -f protein -n native
change it to this (Windows OS - needs 256 MB RAM)
.\foldtrajlite -f protein -n native -rt -qt
You should also edit foldit.bat in each folder to add a DFPTEMP variable, so that each client's swap files go somewhere different.
for example...
set DFPTEMP=\distribfold1\TEMP (in the first case)
set DFPTEMP=\distribfold2\TEMP (in the second case)
then
mkdir \distribfold1\TEMP
mkdir \distribfold2\TEMP
This prevents the clients from stepping on each other's temp scratch files, which has happened before when using duals.
open a cmd prompt
Then
cd \distribfold1
foldit
open a cmd prompt
cd \distribfold2
foldit
There is no way that I know of to dedicate each client to a different processor.
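The per-folder setup above could also be scripted. What follows is only a sketch in Python, assuming the client picks up DFPTEMP from its environment as described; the helper names client_env and start_client are made up for illustration and are not part of the client package.

```python
import os
import subprocess


def client_env(instance_dir, base_env=None):
    """Return an environment whose DFPTEMP points at instance_dir/TEMP.

    The TEMP folder is created if it does not exist yet.
    """
    temp_dir = os.path.join(instance_dir, "TEMP")
    os.makedirs(temp_dir, exist_ok=True)
    env = dict(os.environ if base_env is None else base_env)
    env["DFPTEMP"] = temp_dir
    return env


def start_client(instance_dir, cmd):
    """Start one client from its own folder with its own DFPTEMP (does not wait for it)."""
    return subprocess.Popen(cmd, cwd=instance_dir, env=client_env(instance_dir))
```

For example, start_client(r"\distribfold1", ["cmd", "/c", "foldit.bat"]) followed by the same call for \distribfold2 would start both clients, each with its own scratch folder. If foldit.bat also sets DFPTEMP itself, the value in the batch file wins.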
This is for *WINDOWS* boxes???
I see nothing in c:\WINNT\foldtraj.ini or in the DF folders that would lead one to think that the client is using anything other than its own .\ folder for temp files.
What is the client's default TEMP location, and what would the file names be? There's nothing in C:\WINNT\TEMP or C:\TEMP or C:\ - in fact I couldn't find *anything* that looked like a DF TEMP file.
Angus -- open a command prompt and:
echo %TEMP%
$5 says that TEMP is pointing into your profile somewhere (usually TEMP is set to %userprofile%\Local Settings\Temp on 2K / XP boxes).
The DF temp files are named file<some stuff>.cdx, file<some stuff>.dbf, file<some stuff>.fpt, and file<some stuff> with no extension.
Angus, I run my DF stuff in RAM disk. This is my DFPTEMP directory as of right now:
Volume in drive Z is RAMDISKNT
Volume Serial Number is FE2D-F000
Directory of Z:\DFPTEMP
09/09/2003 07:00p <DIR> .
09/09/2003 07:00p <DIR> ..
10/13/2003 09:10p 3,072 1442_1168_730.cdx
10/13/2003 09:10p 15,202 1442_1168_730.dbf
10/13/2003 09:10p 106,004 1442_1168_730.fpt
10/13/2003 09:10p 57,341 1443_1168_767
4 File(s) 181,619 bytes
2 Dir(s) 12,032,000 bytes free
Hope this helps.
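If you want to check whether a folder holds client scratch files like the ones in the listing above, a small sketch in Python could look like this. The function name find_scratch_files is made up, and matching on the .cdx, .dbf, and .fpt extensions is an assumption based on that listing; the file with no extension is not matched.

```python
import os


def find_scratch_files(directory, extensions=(".cdx", ".dbf", ".fpt")):
    """List the files in directory whose extension is one of the scratch-file extensions."""
    matches = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Compare extensions case-insensitively, since Windows file names may vary in case.
        if os.path.isfile(path) and os.path.splitext(name)[1].lower() in extensions:
            matches.append(name)
    return matches
```

Pointing it at whatever DFPTEMP or TEMP resolves to on the box, for example find_scratch_files(os.environ["TEMP"]), shows whether two clients are sharing the same scratch location.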
Hmm...
Well, I am using the Linux version, but I thought the files would be the same on both platforms. Guess not...
I'll see if I can get some of the boxes set to use specific TEMP folders tomorrow - if not, then they'll have to slowly die by themselves as I'm away on a trip for a week.
None of this explains why this problem started with this update.
I'm not sure it started with this update-- I believe I've seen it prior to this, but I put it down to a local machine glitch since there was no error output. It does seem to happen more frequently with the current update though.
Originally posted by Rebels Haven
bwkaz, anything is possible I guess, but the problem is confined to only 1 or 2 out of 18 systems I'm running....
As a useful experiment, what if you take a 'broken' client, copy the whole directory to a 'good' machine, and then start it up? Is it still 'broken' or is it fixed? If it works after copying it, likely there is something physically wrong with your machine.
Howard Feldman
I've read Ironbits' suggestions about making different temp folders. I found that this does not work for services, or maybe I made a mistake. So my question: how do you set DFPTEMP for services? Is there a switch in service.cfg which does the same as editing foldit.bat?
DFPTEMP should work with service, but you must remember that a service is not normally run under your user account. The variable must be set for All Users, or the Admin user at least...
Howard Feldman
Not only have I had this quiet crashing problem on my Win XP boxes with no error output, but I have noticed that after a while (which could be up to 30 hrs) the client stops using the extra RAM feature.
Usually the client uses about 95 MB of RAM while folding, but occasionally it will slow down for no apparent reason, and when this happens it only uses about 25 MB of RAM. I have to restart the client to make it use the extra RAM. This has happened at least 6 times (probably more) over the course of this protein.
On a different matter, just today one of my Win XP boxes had finished generation 250 and restarted from the beginning. It got to the 11th structure and crashed, with Win XP showing the crash error message asking if I wanted to send the info to Microsoft. Every time I restarted it I got the same crash, but no message in the error log. I lost 8 hrs from this crash. I deleted the filelist.txt file and all was well again.
hope this helps.
Googlybear