Getting a sumout error on last block in test [Archive]

Bok

04-17-2005, 06:26 PM

[Sun Apr 17 19:06:35 2005] n.high = 8772315 . 2 blocks left in test
[Sun Apr 17 19:11:20 2005] resolving hostname
[Sun Apr 17 19:11:24 2005] opening connection
[Sun Apr 17 19:11:24 2005] logging into server
[Sun Apr 17 19:11:24 2005] login successful
[Sun Apr 17 19:11:24 2005] n.high = 8775558 . 1 blocks left in test
[Sun Apr 17 19:16:03 2005] internal computation error [sumout error]! check your memory/processor. test will restart in 5 minutes.

This is on a non-overclocked, fairly new Dell P4 3.2 with 1GB DDR2. I also copied the test to another machine and it gives the same error on restart, but I guess this might be due to it being corrupt already???

Anything I can do? I'm pretty confident the machine is stable, but I can't check as it's a remote machine in MA..

:cheers:

Bok

IronBits

04-17-2005, 06:38 PM

I have the same problem with an amd xp2600 box, not OC'd.
I have it on dpad instead. Any other project runs fine.

Frodo42

04-17-2005, 08:36 PM

I have also seen this problem a few times (but not all that often) on my P4 3GHz with 1024 MB dual RAM, not overclocked or overheated ... not all that big a problem when I do double testing ... a test only takes a few hours ... I just restart and don't get the problem ... I have had no problems with P-1 factoring or stress testing.

vjs

04-18-2005, 09:49 AM

Bok,

That machine would be great on holepatch. Do you think you could run the username holepatch for a while, if you don't mind not getting personal credit? It would probably fly through quite a few tests quickly; they are around 1.7M right now. If not, I'd go with Frodo's suggestion of secondpass.

Also, IronBits, would you consider running sieve on that xp2600? We could use the help getting to 2^52. You should be able to do close to 50G per day with that machine.

Bok

04-18-2005, 09:53 AM

Haven't decided. I might just put it on Fad.

I'm getting somewhat pissed at it as I've now got a second machine (another P4 3.2 which has been running at full stress with uptime of 240 or so days) with the same error on the last block.

[Mon Apr 18 08:48:19 2005] resolving hostname
[Mon Apr 18 08:48:19 2005] opening connection
[Mon Apr 18 08:48:19 2005] logging into server
[Mon Apr 18 08:48:19 2005] login successful
[Mon Apr 18 08:48:19 2005] n.high = 8820152 . 1 blocks left in test
[Mon Apr 18 08:48:56 2005] internal computation error [sumout error]! check your memory/processor. test will restart in 5 minutes.
Segmentation fault
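
A rough way to scan client logs like the two excerpts in this thread for the failure pattern is sketched below. It assumes only the line format visible in the excerpts; the function name `find_sumout_errors` and the record layout are made up for illustration and are not part of the SoB client.

```python
import re

# Matches lines such as:
# [Mon Apr 18 08:48:19 2005] n.high = 8820152 . 1 blocks left in test
LINE = re.compile(r"^\[(?P<stamp>[^\]]+)\]\s+(?P<msg>.*)$")
BLOCKS = re.compile(r"n\.high = (?P<n_high>\d+)\s*\.\s*(?P<left>\d+) blocks left in test")


def find_sumout_errors(lines):
    """Return one record per sumout error, with the last progress line seen before it."""
    errors = []
    last_progress = None
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # lines without a timestamp, e.g. a bare "Segmentation fault"
        msg = m.group("msg")
        p = BLOCKS.search(msg)
        if p:
            # Remember (n.high, blocks left) from the most recent progress line.
            last_progress = (int(p.group("n_high")), int(p.group("left")))
        elif "sumout error" in msg:
            errors.append({"time": m.group("stamp"), "last_progress": last_progress})
    return errors


sample = [
    "[Mon Apr 18 08:48:19 2005] login successful",
    "[Mon Apr 18 08:48:19 2005] n.high = 8820152 . 1 blocks left in test",
    "[Mon Apr 18 08:48:56 2005] internal computation error [sumout error]! "
    "check your memory/processor. test will restart in 5 minutes.",
    "Segmentation fault",
]

for err in find_sumout_errors(sample):
    print(err)
```

Collecting the "blocks left" value next to each error makes it easy to see whether failures cluster on the last block, as they do in both excerpts.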

It seems to me that it must be a program fault...

I may just pull everything off SoB until I see some answers.

Bok :swear:

vjs

04-18-2005, 10:05 AM

Bok,

I hate to say this, but it's probably the machine (yes, I know it's new) or the particular test. This is why I suggested that you run holepatch as a user name.

First, it will totally clear your registry/test k/n pair/etc. Second, the tests are short, so you won't have to wait many days before you see whether you get a sumout error. You could also run secondpass if you don't mind mucking with the registry.

It's quite possible that one of the sticks of memory is bad from the factory, or that cooling is insufficient. You could also try relaxing the CAS setting.

The fact that processors overheat and produce errors in this project is actually a sign of how optimized the client is. It's really forcing the P4s to show their strength in calculations. My processor actually runs at 41C with SoB and 38-39C if you run seti; haven't tried folding lately...

Bok

04-18-2005, 10:12 AM

I'm well aware of that. This is linux btw (no registry), but these machines have already run many tests successfully.

I'm not convinced it's the machine(s).

Bok

IronBits

04-18-2005, 10:29 AM

Originally posted by vjs
Also Ironbits would you consider running sieve on that xp2600 we could use the help getting to 2^52. Should be able to do close to 50G perday with that machine.

I would if you guys could find a way to automate it. I don't feel comfortable with manual intervention and reservations... ;)
I have enough on SOB now, with more xeons on the way :)

vjs

04-18-2005, 11:03 AM

I guess it's more to each his own. I personally like the manual intervention :D it feels more real to me. But if I had a horde of machines I might feel different.

I do use sobistrator now, however. It's a lot easier, just one click to submit factors, although the reservation is still manual (I sort of see it as a manual brag, however).

Keep up the good work, IronBits. Looking at your scores, have you considered changing your name from ironbits to HEAVYIRON??? :cheers:

Keroberts1

04-18-2005, 01:47 PM

If the reason you believe the program is at fault is that two computers are both having a sumout error, keep in mind that when you transferred the data from one machine to the other, any errors already in the cache would have been transferred too. Also, the error could have happened many blocks back and only have been caught when the data was checked at the end to see that it made sense.
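
For readers wondering what a sum-based sanity check can look like, here is a minimal sketch. It relies on a standard identity: when a number stored as a list of words is squared by cyclic convolution, the sum of the output words equals the square of the sum of the input words. Prime95 reports a failure of a check of this kind as SUM(INPUTS) != SUM(OUTPUTS). That the SoB client's "sumout error" is exactly this check is an assumption, and real clients work with floating-point FFTs and compare within a tolerance rather than on exact integers as done here.

```python
def cyclic_square(words):
    """Square a word list as a cyclic convolution (schoolbook version, no FFT)."""
    n = len(words)
    out = [0] * n
    for i, a in enumerate(words):
        for j, b in enumerate(words):
            out[(i + j) % n] += a * b
    return out


def sum_check(inputs, outputs):
    """For a cyclic convolution square, sum(outputs) must equal sum(inputs) ** 2."""
    return sum(outputs) == sum(inputs) ** 2


words = [3, 1, 4, 1, 5, 9, 2, 6]      # arbitrary illustrative input
result = cyclic_square(words)
print(sum_check(words, result))       # a correct result passes the check

corrupted = list(result)
corrupted[2] += 1                     # simulate one wrong value (bad hardware or bad code)
print(sum_check(words, corrupted))    # the corrupted result fails the check
```

A check like this says that a result is inconsistent, not why: a memory fault and a software bug both trip it the same way.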

Bok

04-18-2005, 02:02 PM

Yes, that's what I mentioned earlier on in the thread, but it's now happened on a totally separate machine & test, as I mentioned just a few posts up. Nothing to do with the original failure....

Bok

vjs

04-18-2005, 02:23 PM

Originally posted by Bok
I'm not convinced it's the machine(s).

There are three things that can be the problem: the client, that particular k/n, or the computer.

The only thing you can really test is whether it's the machine or the test. To be positive it's not the machine, de-clock (under-clock the FSB) and increase the CAS numerically. For the life of me I couldn't get one of my nforce2 machines to complete a test without some failure reported, even at normal clock speeds... it ran prime95 torture tests etc. just fine. However, when I relaxed the memory timings to 2.5-3-3-8 (slower than the rated 2-2-2-6), the tests started to run fine. I think it's more likely that manufacturers are not totally up to par on dual-channel memory. It may only produce one memory error every other day, but when it takes 3 days to run a test...

It's also possible that there is some error with a particular range of n, say around n=8.85M, and you may be the first to notice. Seriously, first try the smaller tests, then larger tests with lower FSB, then original FSB and memory timings; see what happens and let us know.

Bok

04-18-2005, 02:33 PM

I'm afraid that's not going to happen...

One machine is a linux server I have running at our offices in Boston (I'm in NC) and the other is none other than the machine hosting this very forum in a secure datacenter in Dallas somewhere :p

Maybe I'll try one of the k/n pairs on one of my dual xeons or my desktop P4.

Bok

pixl97

04-19-2005, 11:28 PM

What linux kernel are you running, Bok? I have had some issues with 2.6.11+

pixl97

Bok

04-19-2005, 11:36 PM

One of the machines is

Linux version 2.6.10-gentoo-r5

the other is RHEL so it's

Linux version 2.4.21-20.ELsmp

These were independent tests too..

I've only updated a few of my boxen to 11 so far.

Bok

prime95

04-20-2005, 10:11 AM

Can you email the k/n values and the save file? I'm at [email protected]
Hopefully I can use the save file in prp.exe to see if I can reproduce the problem.

Bok

04-20-2005, 10:45 AM

email sent..

In case it doesn't get there, you can download the relevant info from

http://stats.free-dc.org/sobfail.tar.gz

Bok :cheers:

vjs

04-20-2005, 07:41 PM

Bok,

Sorry about pointing my finger at your machine so forcefully... Turns out one of the members of my home team was/is having problems with the client and linux lately, pretty sure it's a similar box P4. I've posted a link to this thread and I'll let you know if he responds.

Bok

04-20-2005, 08:13 PM

No problems,

it's always better to get a response at least :)

Bok

prime95

04-20-2005, 08:48 PM

My latest (unreleased) PRP executable returns the following using a 768K FFT size.

55459*2^8821390+1 is not prime. RES64: FEF527DAA2A46CD4. OLD64: FCDF778FE7ED4677

Using the last released PRP, I get the same result but it used an 896K zero-padded FFT size. The SoB executable should have been built with an FFT library somewhere between these two executables.

I concur that this is likely an SoB executable problem. Yes, errors are almost always hardware related. The key difference is that yours is reproducible - happening in the same spot every time (at least we think so; SoB does not print the exact iteration number that went awry).

To make any more progress, the SoB folks will need to look into this further.
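
As a toy illustration of what a probable-prime test on a number of the form k*2^n+1 does, the sketch below runs a base-3 Fermat test with Python's built-in modular exponentiation. It is only practical for small n: real clients use FFT-based multiplication and report a 64-bit residue, as in the RES64 value above. Whether PRP or the SoB client uses exactly this test is not stated in the thread, so treat the choice of test and base as assumptions; `is_fermat_prp` is a made-up name.

```python
def is_fermat_prp(k, n, base=3):
    """Return True if N = k*2^n + 1 passes a Fermat test to the given base."""
    N = k * (1 << n) + 1
    # Fermat's little theorem: if N is prime and does not divide base,
    # then base^(N-1) is congruent to 1 modulo N.
    return pow(base, N - 1, N) == 1


print(is_fermat_prp(3, 2))   # N = 13, which is prime
print(is_fermat_prp(3, 3))   # N = 25, which is composite
```

Two independent programs should produce the same final residue for the same k/n, which is why matching results from two PRP builds are good evidence about the number itself.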

jjjjL

04-20-2005, 11:45 PM

I agree that there is reason to believe our exe file is malfunctioning.

I got an email report about this problem a few days ago and followed up on it. I was able to finish the test on a Pentium 3 but the original box was a well maintained P4.

So this looks to be a P4 issue relating to the code.

I'll be posting a test client here shortly for you to try.

Thanks for bringing this issue to our attention. I'm glad you were persistent enough and diligent enough to collect good data. Good going Bok!

Cheers,
Louie

Bok

04-20-2005, 11:52 PM

Thanks Louie, anything you need just let me know :)

And thanks prime95 for troubleshooting it...

Bok

*edit* I just finished one of the failed tests on an opteron proc

Trying the other one on a dual MP machine right now.....so it certainly does point to a problematic P4 issue. I'm sure you guys can solve it. If you need an account on the box one of the tests failed on let me know...

I've kept the original status of the tests on the P4's to test the new clients when ready.

vjs

04-21-2005, 11:16 AM

Does this problem only affect P4s running linux with n>8.8M?

Or is it all P4s running n>8.8M, windows included?

jjjjL

04-22-2005, 12:48 AM

Got a new test client compiled for windows.

Can you folks recreate the error outside of Linux or do I need to recompile that version before I know if this change helps?

Here's the bare exe. Drop it into a healthy windows install to experiment.
http://www.thecorporatedrone.com/sb240test.zip

Cheers,
Louie

IronBits

04-22-2005, 01:31 AM

It's FASTER!!!
Have it on a Dual xeon 2.6GHz with 2 instances and an AMD FX55.
AMD used to run about 1.9, now it's about 2.3 :)
xeons are at 4.5 cEMs/sec total was 4.2 cEMs

Mystwalker

04-22-2005, 07:07 AM

Originally posted by IronBits
It's FASTER!!!

I wouldn't use the test client for more than checking this particular error and maybe DoubleCheck - and even that only after explicit permission...

Bok

04-22-2005, 07:11 AM

Originally posted by jjjjL
Got a new test client compiled for windows.

Can you folks recreate the error outside of Linux or do I need to recompile that version before I know if this change helps?

Here's the bare exe. Drop it into a healthy windows install to experiment.
http://www.thecorporatedrone.com/sb240test.zip

Cheers,
Louie

I've only seen this error on linux at the moment. Got another error yesterday, completed the test on an AMD 64 no problems. So if you compile a linux version I could test it within minutes on that particular test...

Bok

IronBits

04-22-2005, 08:55 AM

Originally posted by Mystwalker
I wouldn't use the test client for more than checking this particular error and maybe DoubleCheck - and even that only after explicit permission...

:blush: Oh! I thought we were supposed to test the client as well. :bang:

CaptainMooseInc

04-22-2005, 12:28 PM

Is it OK to put this in place of 2.3 and start returning results??? I am noticing a speed difference too and I'd like to go as fast as I can. :)

:beep:

-Jeff

vjs

04-22-2005, 02:45 PM

I'd hold off until louie gives us the ok

Ken_g6[TA]

04-22-2005, 03:02 PM

Originally posted by Bok
I've only seen this error on linux at the moment. Got another error yesterday, completed the test on an AMD 64 no problems. So if you compile a linux version I could test it within minutes on that particular test...

Bok

If you post the K and N, we could all test it on Windows, and see if there is a problem there.

CaptainMooseInc

04-22-2005, 03:26 PM

I'll replace version 2.4 with 2.3 when I get home tonight then. I accidentally switched them out and left 2.4 running...

I guess if it's still running when I get back then it'll be a successful "beta test". :) :thumbs:

-Jeff

Bok

04-22-2005, 03:35 PM

Originally posted by Ken_g6[TA]
If you post the K and N, we could all test it on Windows, and see if there is a problem there.

(k=55459, n=8821390)

Of course you could have downloaded the file I referenced earlier in the thread.....

Bok :p

jjjjL

04-25-2005, 03:29 AM

I have some linux test client binaries ready. They appear to complete the "Bok test" on a Xeon box (the SB server). However, so does v2.3, so I need others to double check that these new versions actually behave correctly:

http://www.thecorporatedrone.com/sb24test-static
http://www.thecorporatedrone.com/sb24test-non-static

chmod +x sb24test-static sb24test-non-static

The gethostbyname function in linux does not like being statically linked. Try running these on a number of machines and let me know your experiences good and bad.

I noticed this version is faster. If someone could benchmark in windows a little more that would be cool. I don't have a huge problem with people running the windows version on regular tests. If you don't mind upgrading again soon once the final windows client comes out -- deploy away. However, I don't recommend heavily deploying the linux test versions for real tests at this time.

Cheers,
Louie

Bok

04-25-2005, 04:16 AM

Tested the non-static version on both servers which failed before using the tests that failed and they both completed successfully... :)

Bok :cheers:

Frodo42

04-25-2005, 04:35 AM

Thank you Louie.

As soon as I get back home (in ~8 hours' time) I will deploy these new clients on secondpass ... that should also give the opportunity to check if they give reasonable results (on n~1.8M anyway).

jamroga

04-25-2005, 03:12 PM

Before you release version 2.4, perhaps you might consider some
very small changes. Specifically: increase user name space so
"usernameQQQsecondpass" can be entered without the need for
registry editing. Also, have you considered changing the transmit
intermediate blocks option to be based on time? For example, if
intermediate blocks are sent only once an hour, it should have negligible
effect on the statistics but reduce network traffic significantly.

PS - The sb 2.4 windows client seems to run slightly slower on my PIII machines.
Have not yet tested with linux or P4 machines.

Frodo42

04-26-2005, 04:40 AM

From the log on my P4 3GHz it looks as if there might be a speed increase of something like 4% on secondpass tests with the non-static linux-version.

I haven't seen any problems ... if you want to check residues all tests I (Frodo42) have reported within the last 12 hours have been done with the 2.40 non static version.

Bok

04-26-2005, 07:25 AM

Just a report. Got another failure on one instance running on a dual 3GHz Xeon (Gentoo 2.6.11 kernel). Restarted it with the new non-static version and it completed successfully. I'll continue running this one with a new test.

Bok

hhh

04-27-2005, 04:52 AM

I'll give the test version a try on my 1.3 GHz Celeron M laptop next week, while on vacation. So far it seems to me that it's neither faster nor slower, but just the same. You will get exact numbers in a week.
DC blocks take about 25 minutes on this machine.
Yours H.

[DPC]Tweakert

04-27-2005, 05:09 PM

AMD64 3000+ from 1,4M with the old client
to 1,65M with the new test-client. :D

hhh

04-28-2005, 01:03 AM

Well, I can give you the numbers right now:

26:08 min/block -->25:19 min/block

So, 3.125% less time needed, about 3.23% speed increase.

(on my 1.3 GHz Celeron without load).

It's not sooo much, but sounds fine to me. I hope everything else works fine, too. :)

See you, H.
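
The two percentages in this post come from the same pair of block times but answer different questions: how much less time a block takes, and how many more blocks get done per unit of time. A short sketch of the arithmetic, with hypothetical variable names:

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


old = to_seconds("26:08")   # time per block with the old client
new = to_seconds("25:19")   # time per block with the test client

time_saved = (old - new) / old   # fraction of time no longer needed per block
speed_gain = old / new - 1       # fractional increase in blocks per unit time

print(f"time saved: {time_saved:.3%}")
print(f"speed gain: {speed_gain:.3%}")
```

The speed gain is always slightly larger than the time saved, because it divides the same difference by the smaller (new) block time.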

ServerStrike

04-28-2005, 04:34 PM

Originally posted by [DPC]Tweakert
AMD64 3000+ from 1,4M with the old client
to 1,65M with the new test-client. :D

Same for me :cheers:

Its working fine here :rotfl:

Theadalus

04-28-2005, 08:31 PM

I've tested the new client on a P4 3.06GHz @ 3.3GHz and noticed a performance drop from 3.31M => 3.07M :confused:

On an AMD Athlon 64 3400+ @ 2.4GHz a performance gain from 1.77M => 2.11M :thumbs:

[DPC]Tweakert

05-03-2005, 08:50 AM

I've benchmarked some more and I have to say that on average the speed of the new client would be around 1,8M so that's even better. :)

ShoeLace

05-03-2005, 08:57 PM

i have noticed an increase in speed also using the 2.4.0 TEST client.. [email protected]

Polski Radon

05-04-2005, 02:31 PM

Theadalus, was that for the same WU?

My Prescott 3.0 @3.75 is giving me 3.4M with two 898xxxx WU's.

However for my s754 A64 I had to reduce the HTT by 10MHz to make it stable under a 896xxxx WU. 250*9 is only doing 1.5M.

Theadalus

05-04-2005, 04:16 PM

Originally posted by Polski Radon
Theadalus, was that for the same WU?
Yes, I let it run for approx. 1.5 hours on the v2.4 client, when I didn't notice any speed increase anymore (maybe not a reliable benchmark?)
Currently I'm running 1 client on this CPU because it has a higher output than 2 clients (HT not optimized with "older" P4s?)

Originally posted by Polski Radon
My Prescott 3.0 @3.75 is giving me 3.4M with two 898xxxx WU's.
I have another P4 3.0GHz (FSB800, no Prescott) @ 3.41GHz, running v2.3 with 2 clients (n=893xxxx and n=895xxxx) gives me a total of 3.79M

Originally posted by Polski Radon
However for my s754 A64 I had to reduce the HTT by 10MHz to make it stable under a 896xxxx WU. 250*9 is only doing 1.5M.

My AMD 64 3400+ (s754, 1MB, 2.2GHz Clawhammer) @ 2.4GHz is currently doing 2.33M with n=895xxxx and k=10233 and v2.4 client.

Obviously lower k's give (much) higher output than higher ones, with almost the same n.

For example (on my P4 3.41GHz):
n=89xxxxx and k=21181 => 1.90M
n=89xxxxx and k=55459 => 1.57M

Mystwalker

05-04-2005, 06:01 PM

Originally posted by Theadalus
Obviously lower k's give (much) higher output than higher ones, with almost the same n.

For example (on my P4 3.41GHz):
n=89xxxxx and k=21181 => 1.90M
n=89xxxxx and k=55459 => 1.57M

That's correct (at least on average).
Lower k values allow higher FFT boundaries, hence it is possible to have a certain FFT size for a k/n pair (e.g. 1024K), but a lower size for another k*/n pair (e.g. 768K), with k* < k.

IIRC, FFT size is proportional to sqrt(k). I'm pretty sure George Woltman ("Prime95") posted a more thorough explanation somewhere in this forum...