Possible problems with SMB client...or my machine?



Halon50
11-09-2002, 06:16 AM
I have a Dual P3-450 setup running the regular client at idle priority and the SMB version at normal priority.

Looking through the error logs, I see a lot of interesting errors. I am currently running Memtest86 on it to see if there's a problem with memory, or if it's something more serious.

Here are clips from the log:


[Wed Oct 30 18:59:53 2002] internal computation error [mismatched sums]! check your memory/processor.
[Wed Oct 30 19:04:54 2002] restarting proth test from cache (k=27653, n=2261337) [58.1%]

[Sat Nov 02 06:21:41 2002] internal computation error [excessive roundoff]! check your memory/processor. test will restart in 5 minutes.
[Sat Nov 02 06:26:41 2002] restarting proth test from cache (k=27653, n=2309757) [21.9%]

[Sat Nov 02 13:50:20 2002] internal computation error [excessive roundoff]! check your memory/processor. test will restart in 5 minutes.
[Sat Nov 02 13:55:21 2002] restarting proth test from cache (k=27653, n=2321025) [7.6%]

[Sat Nov 02 15:48:51 2002] internal computation error [excessive roundoff]! check your memory/processor. test will restart in 5 minutes.
[Sat Nov 02 15:53:51 2002] restarting proth test from cache (k=27653, n=2309757) [31.1%]

[Sun Nov 03 15:32:17 2002] internal computation error [mismatched sums]! check your memory/processor. test will restart in 5 minutes.
[Sun Nov 03 15:37:17 2002] restarting proth test from cache (k=27653, n=2321025) [32.2%]
[Sun Nov 03 15:45:17 2002] internal computation error [excessive roundoff]! check your memory/processor. test will restart in 5 minutes.
[Sun Nov 03 15:50:17 2002] restarting proth test from cache (k=27653, n=2321025) [32.2%]
[Sun Nov 03 15:54:25 2002] internal computation error [excessive roundoff]! check your memory/processor. test will restart in 5 minutes.
[Sun Nov 03 15:59:25 2002] restarting proth test from cache (k=27653, n=2321025) [32.2%]

...

[Sat Nov 09 02:06:53 2002] internal computation error [mismatched sums]! check your memory/processor. test will restart in 5 minutes.
[Sat Nov 09 02:11:53 2002] restarting proth test from cache (k=27653, n=2443281) [75.2%]

[Sat Nov 09 02:25:22 2002] error writing the cache files. will retry again in 10 minutes.


That last error really caught my eye, and is the point at which I stopped the machine altogether to run Memtest.

This machine has had problems with overheating in the past, but only during the summer when ambient temperature rose above 80F or so. Even then, the OS (WinXP) caught the problems and stopped the process threads, effectively halting the cause of overheating until I got to it and rebooted.

Well, I guess I'll come back and post the results of Memtest when it finishes in the morning. Meantime, do you have any suggestions on what else (other than memory or CPU heat) could cause intermittent errors like this?

System info: 440BX dual Pentium 3-450, 384MB RAM (assorted brands), 20GB HD with no errors (last checked 2 days ago).
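
For anyone wanting to see how often each error type shows up in a log like the clips above, here is a minimal sketch in Python. It assumes only the line format visible in the clips (a bracketed timestamp, then "internal computation error [kind]"); the function name `tally_errors` and the `sample` list are made up for illustration and are not part of the client.

```python
import re
from collections import Counter

# Matches lines shaped like the clips in this post, e.g.
# [Sat Nov 02 06:21:41 2002] internal computation error [excessive roundoff]! ...
ERROR_RE = re.compile(
    r"^\[(?P<when>[^\]]+)\] internal computation error \[(?P<kind>[^\]]+)\]"
)

def tally_errors(lines):
    """Count computation errors by kind (hypothetical helper, not part of the client)."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.match(line.strip())
        if match:
            counts[match.group("kind")] += 1
    return counts

# Three lines copied from the log clips above.
sample = [
    "[Wed Oct 30 18:59:53 2002] internal computation error [mismatched sums]! check your memory/processor.",
    "[Sat Nov 02 06:21:41 2002] internal computation error [excessive roundoff]! check your memory/processor. test will restart in 5 minutes.",
    "[Sat Nov 02 06:26:41 2002] restarting proth test from cache (k=27653, n=2309757) [21.9%]",
]

for kind, count in tally_errors(sample).items():
    print(kind, count)
```

Feeding it the whole log file (for example `tally_errors(open(path))`, with `path` pointing at your own log) would show whether one error type dominates, which may help when deciding between memory and heat as the cause.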

Halon50
11-09-2002, 06:18 AM
Here's the full log (limited to 100K).

Halon50
11-09-2002, 07:12 AM
ASDkflj'skfdaj; :bang: :swear:

From the log it looks like the problems started before I got the SMB client loaded and running on the machine, so it's probably not the SMB client.

Still, I'm not entirely convinced it's heat (yet). Memtest just finished its default tests with no errors, and I have it running the full test. I'm going to bed!
:sleepy:

Halon50
11-09-2002, 03:50 PM
Well, I restarted the dual 450 after no problems with Memtest running for several hours.

Someone commented in another thread about losing blocks? My machine seems to have lost around 50 blocks from one of the counts; attached is the tail end of the log.

EDIT: The block loss is explained (but not accounted for) in this thread (http://bane.free-dc.org/forum/showthread.php3?s=&threadid=1884).

Halon50
11-16-2002, 04:21 PM
An update: The machine's been fine since the Memtest run. Either Memtest put the fear of Cod into it, or there was something borked with the way blocks remaining were calculated, and it fixed itself.

I did come across an interesting-looking "hiccup" in block transmissions in the log though:


[Tue Nov 12 16:04:24 2002] n.high = 434655 . 68 blocks left in test
[Tue Nov 12 16:12:41 2002] logging into server
[Tue Nov 12 16:12:41 2002] login successful
[Tue Nov 12 16:12:41 2002] n.high = 1005053 . 48 blocks left in test
[Tue Nov 12 17:44:36 2002] logging into server
[Tue Nov 12 17:44:36 2002] login successful
[Tue Nov 12 17:44:36 2002] n.high = 468090 . 67 blocks left in test
[Tue Nov 12 17:56:54 2002] temporarily unable to connect -- block added to submit queue
[Tue Nov 12 19:24:44 2002] logging into server
[Tue Nov 12 19:24:44 2002] login successful
[Tue Nov 12 19:24:45 2002] n.high = 501525 . 66 blocks left in test
[Tue Nov 12 19:40:46 2002] logging into server
[Tue Nov 12 19:40:46 2002] login successful
[Tue Nov 12 19:40:47 2002] n.high = 1074367 . 46 blocks left in test
[Tue Nov 12 21:04:55 2002] logging into server
[Tue Nov 12 21:04:55 2002] login successful
[Tue Nov 12 21:04:56 2002] n.high = 534960 . 65 blocks left in test

jjjjL
11-16-2002, 06:04 PM
I'm pretty sure you're actually using the SMP client, correct? (what would SMB be?)

In my experience, memory controllers on dual boards are normally better than regular mobos but perhaps one of the channels has flaked out. Abit boards have notoriously poor controllers. Hopefully you aren't using an Abit board with more than one stick of ram... that's just asking for trouble. I remember back in the days of the BH6 when people would have to move their ram to slot 2 because 1 and 3 would lose connection to the south bridge. :rolleyes: Many of their current boards still have similar issues.

Ok, you probably have 4 ram slots. Usually they are paired 1+2 on chan 1 and 3+4 on chan 2. You should experiment with moving all the ram to one or the other. I'm hoping you have a 256MB and a 128MB stick. If you're using smaller sticks, you should consider removing them. If both channels pass with all the ram on each, I'd recommend having one stick on each channel (slot 1 + 3, probably). Having ram on both channels is definitely a plus... especially for SMP systems. Look up the specs for your board as they may break the channels up differently than I've described.

Now for your other issues. I think the reason you have "missing" blocks is because the new version of SB uses larger blocks than the last.

--- here you start a new test using v0.9.8 --
[Fri Nov 08 18:28:07 2002] got proth test from server (k=27653, n=2589225)
[Fri Nov 08 18:45:17 2002] logging into server
[Fri Nov 08 18:45:17 2002] login successful
[Fri Nov 08 18:45:18 2002] n.high = 14916 . 172 blocks left in test
[Fri Nov 08 19:02:28 2002] logging into server
[Fri Nov 08 19:02:28 2002] login successful
[Fri Nov 08 19:02:28 2002] n.high = 29832 . 171 blocks left in test
[Fri Nov 08 19:19:38 2002] logging into server
[Fri Nov 08 19:19:38 2002] login successful
[Fri Nov 08 19:19:39 2002] n.high = 44748 . 170 blocks left in test
[Fri Nov 08 19:36:49 2002] logging into server
[Fri Nov 08 19:36:49 2002] login successful
[Fri Nov 08 19:36:49 2002] n.high = 59664 . 169 blocks left in test
[Fri Nov 08 19:54:11 2002] block processing paused
--- here you re-start a new test using v0.9.9 --
[Fri Nov 08 19:54:33 2002] got k and n from cache
[Fri Nov 08 19:54:33 2002] restarting proth test from cache (k=27653, n=2589225) [2.9%]
[Fri Nov 08 19:55:04 2002] logging into server
[Fri Nov 08 19:55:04 2002] login successful
[Fri Nov 08 19:55:05 2002] n.high = 74580 . 67 blocks left in test

...and now since the blocks are 2.5x larger, you go from 169 blocks to (169 / 2.5) ~= 67 blocks.
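
The arithmetic can be checked against the log lines above with a short Python sketch. Two things here are assumptions rather than anything confirmed in the thread: that the displayed count is the remaining part of n divided by the block size with the fraction dropped, and that the new block size is exactly 2.5 times the 14916 step seen between consecutive n.high values. The function name `blocks_left` is made up for illustration.

```python
N_TOTAL = 2589225          # n of the test, from the log
OLD_BLOCK = 14916          # step between consecutive n.high values under v0.9.8
NEW_BLOCK = OLD_BLOCK * 5 // 2   # assumed: blocks are exactly 2.5x larger in v0.9.9

def blocks_left(n_total, n_high, block_size):
    """Assumed display rule: whole blocks remaining, fraction dropped."""
    return (n_total - n_high) // block_size

print(blocks_left(N_TOTAL, 59664, OLD_BLOCK))  # 169, last line before the upgrade
print(blocks_left(N_TOTAL, 74580, NEW_BLOCK))  # 67, first line after the upgrade
```

Under those assumptions the sketch reproduces both the 169 and the 67 from the log, so no work was actually lost; only the unit the count is expressed in changed.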

As for the "hiccup" you mention, there are a number of reasons it could have failed to transmit that block. Of course, since it resumed transmitting with the next block, there is absolutely no problem. It was supposed to do that. There is a small chance that was during the period where I was upgrading the server and it was down approximately three minutes while I had to wait for dead sockets to be closed by the Linux networking kernel. I noticed that 3 clients tried to connect during that period. One of them was mine ;), but you could have been one of the other two.

Hope the info helps. Let me know if playing with the ram configs produces any insights.

-Louie

Halon50
11-17-2002, 01:39 AM
Thanks for the response.

It's a Tyan Tiger 1833D. Attached is a pic. I think the problems may be due to the proximity of the memory controller to the second CPU's heatsink. It's been better since I attached spare fans to the bottoms of the heatsinks (see picture), but I think I'm going to need some sort of directed airflow to help cool off the memory controller.

Another interesting point is, even though all 4 slots are filled, I did mix up the order of the DIMMs in between the first Memtest run, and the other 5 runs. Since it's been running fine since these tests, I figure I'll leave it alone.

Sorry about the "SMP" vs "SMB" confusion. I've always mixed up the terms ever since I took a course on Intel Microprocessors a few years back.

I should have been more specific with the "hiccup." The interesting part of the log wasn't that one block didn't get transmitted, but rather that the other test (being run on the SMP client, I assume) skipped a count and went from 48 to 46 blocks remaining. I can track down exactly which n values were being crunched at the time if you need them.

jjjjL
11-17-2002, 01:47 AM
i wouldn't worry about the block skip. i see what you mean now... it means nothing. the blocks left number is actually dividing the length of the full test by 250McEMs and then rounding (because the last block is actually smaller than 250McEMs). It can also occur that it will say "0 blocks left" right before it does a very small block that completes the test. both these errors are really just a rounding in the display and don't amount to a real problem.
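
A rough Python sketch of the "0 blocks left" case described above. The 250 is the block size from this post (250 McEMs); the test length of 1100 and both function names are invented for illustration, and the display rule (fraction dropped) is an assumption about how the client rounds, not its actual code.

```python
import math

BLOCK = 250    # block size in McEMs, per the post above
TOTAL = 1100   # hypothetical test length: four full blocks plus a short last one

def displayed_blocks_left(remaining):
    """Assumed display rule: the fractional final block is dropped."""
    return remaining // BLOCK

def actual_blocks_left(remaining):
    """Blocks still to be processed, counting the short final one."""
    return math.ceil(remaining / BLOCK)

remaining = TOTAL
while remaining > 0:
    remaining -= min(BLOCK, remaining)
    print(remaining, displayed_blocks_left(remaining), actual_blocks_left(remaining))
```

With 100 units of work remaining, the displayed count is already 0 while one short block is still to come, which is the display quirk being described and not lost work.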

I'll fix it in the next release so other people don't get confused too. Thanx for being on the look out. :)

-Louie

ps - sweet mobo. i want one! :D