Random Crashes



PS2pcGAMER
08-24-2005, 09:40 PM
I just started running BOINC for the first time a few weeks ago, so I am still getting used to it. On one computer, it keeps crashing. A Windows dialog box comes up telling me that mfoldb125_4.28_windows_intelx86.exe has crashed and asks me to submit a bug report. I don't use this computer much, so by the time I notice it, Predictor is running again. Dr. Watson also sometimes seems to crash along with it.

I set it up as a scheduled task to run in the background as soon as the computer is turned on. I don't think it is a heat or stability issue: the machine runs at stock speeds, the temperatures are fine, and it ran FAD for over a year with absolutely no problems.

Since I am new to BOINC, I am not even sure where to begin with diagnosing the problem, so any suggestions would be appreciated.

Bok
08-24-2005, 09:43 PM
Can't say I've had it crash at all, but I haven't run Predictor on many XP-based machines. Have you tried a memory test, or perhaps a Prime95 test? Maybe it's more memory intensive than FAD?

Bok

PS2pcGAMER
08-24-2005, 09:46 PM
Nope, not yet. I'll try both Prime95 (the memory stress test) and Memtest86+ later tonight. Thanks for the suggestion.

Chuck
08-24-2005, 10:13 PM
INFO: I have ECC memory and pass Prime95 as well, and I still get the crash from time to time. It goes down so hard that it even takes boincmgr with it. When it does give that error, hit RETRY so it doesn't leave your XML file in a messed-up state; that way you get credit for the CPU time used after you restart. Otherwise, it will report back as failed and you lose the time.

I suspect boundary-condition data, but that's only a guess: it happens on several host configurations (OS, CPU, etc.), although SP2 does seem to be more sensitive to it because of DEP mode on my AMDs. I will try to find and resurrect my mfold WU dump tool so I can examine the next bad WU and report whether the WU is indeed corrupt, or at least what is causing mfold to roll over and die.

PS2pcGAMER
08-24-2005, 10:24 PM
Thanks for the response. I am a little confused about what you mean by "hit the RETRY". Is this in the BOINC manager?

Also are there logs that BOINC keeps on my computer that might help me have a better idea what is going on?

So far memtest has run for ~30 minutes with no errors. I'll let it run for a while longer though.

Chuck
08-24-2005, 10:38 PM
The .xml file simply keeps track of what your BOINC projects were doing at the last shutdown. A clean shutdown is what you want; that is what I was referring to there.

I am also referring to the popup that gives you the 'abort, ignore, retry' choice. I think that one comes from within boincmgr itself. (Someone please correct me if I'm wrong.)

((Apologies, it's been a long day in very hot Texas weather... a bit of brain fry here.))

As for your PC's memory, CPU, etc., I doubt you will find any hardware problems.

Granted, it doesn't hurt to check, but I would not waste too much time on it. Just run one full memtest86+ pass (extended tests) to make sure it's all shipshape.

If Prime95 doesn't fry the CPU, nothing will, unless you want my core-melter (a rather deliberate program used to test liquid cooling). I can run two Prime95s with no errors, yet mfold will still go cuckoo. So just write up the errored WU number and submit it, and let Chahm and David (@ Scripps) figure out if there is a pattern.

PS2pcGAMER
08-24-2005, 11:05 PM
Ah, gotcha. Thanks for your reply. Yeah, I agree, I am beginning to doubt it is a hardware issue, but I'll keep looking into it.

Memtest86+ passed one full test with no errors. I am running the Prime95 torture test on it right now.

I may switch it over to Einstein@Home, at least temporarily, until I can figure out what is going on. It has crashed at least 8 times in the past few days, whereas my laptop hasn't crashed at all running Predictor.

Chuck
08-25-2005, 12:26 AM
Can you take a look at the installed updates from MS and determine the differences between the two machines? If you can, it would definitely point us in a direction.

As for the Prime95 test and memtest86+, I would stop now... I'm willing to bet you are just fine.

If you can mount drive 'C$' of one machine on the other, look at the 'system' and 'system32' directories and see if any obvious date/time stamp differences show up in the .dll files (use the detailed list view, sorted either alphabetically or by date/time).

You definitely have a nice simple case... I am jealous, as you have what I would consider the ideal debugging situation.
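
For anyone who would rather script that .dll comparison than eyeball two Explorer windows, here is a minimal sketch. It assumes Python is installed on a machine that can see both directories (for example the local system32 and the other machine's mounted C$ share). The function names `dll_listing` and `compare_listings` are made up for illustration, and this only compares file names, sizes and modification times, not file versions or contents.

```python
import os


def dll_listing(directory):
    """Map lower-cased .dll file name -> (size in bytes, modification time)."""
    listing = {}
    for entry in os.scandir(directory):
        if entry.is_file() and entry.name.lower().endswith(".dll"):
            info = entry.stat()
            # Whole seconds are plenty for spotting a patched DLL.
            listing[entry.name.lower()] = (info.st_size, int(info.st_mtime))
    return listing


def compare_listings(good, bad):
    """Return (only_in_good, only_in_bad, differing) as sorted name lists."""
    only_in_good = sorted(set(good) - set(bad))
    only_in_bad = sorted(set(bad) - set(good))
    # Present on both machines, but size or timestamp does not match.
    differing = sorted(
        name for name in set(good) & set(bad) if good[name] != bad[name]
    )
    return only_in_good, only_in_bad, differing
```

To use it, call `dll_listing` once per directory (the paths depend on your own setup, so none are hard-coded here) and pass both results to `compare_listings`. The third list is the interesting one: DLLs that exist on both machines but differ in size or date, which is where a Windows update would show up.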

PS2pcGAMER
08-25-2005, 01:28 AM
Prime95 went for two hours with no errors.

The problem machine has SP2+all updates. My laptop (the problem-free machine) has only SP2 and no updates.

I have removed Predictor from the problem computer and will attach it again later. I don't know if that will help or not, but we'll see. I'll compare the DLL files and keep diagnosing things tomorrow, as I have given up for tonight. I'll just have it run Einstein overnight. Thanks again.

Chuck
08-25-2005, 01:49 AM
That's very, very good info, thanks. I think you have nearly enough to pass on to P@H HQ; it is starting to suggest they need to relink and re-release the app because of the MS updates. Looking forward, since you are now detached, a re-install / upgrade of the client (is there a newer one than 4.45?) before you reattach might be in order as well, unless the effect on Einstein would be too significant, as it will link up differently in the registry.

PS2pcGAMER
08-25-2005, 02:05 AM
The client was 4.45 and that is still the newest release.


Just for kicks, I might try updating my working Predictor computer to the latest Windows updates and see how it behaves.

Chuck
08-25-2005, 02:59 AM
Please make sure you have a Ghost backup or a GOOD checkpoint so you can undo the updates. If any are not 'undoable', then Ghost it. But it definitely sounds like a good plan.

PS2pcGAMER
08-29-2005, 02:52 AM
I haven't gotten around to looking into this since my last post, until tonight.

I had completely removed Predictor from the problem machine. Tonight I reinstalled it and I am going to let it run for a few days to see what happens.

I haven't gotten around to updating Windows on the good machine to see if that is an issue. I'll wait and see if the problem computer acts up again.