
View Full Version : New exponents are crashing the client [FIX: SB v1.1.1]



kenlow
10-29-2003, 11:55 PM
I have a few boxes that returned completed results today. But when the client starts work on the new downloaded exponent, the client will always crash.

I tried removing the service install and the -m, -s switches and starting the client manually, but it still crashed.

I also tried uninstalling the client and reinstalling it in a different directory, but the problem still persisted. It is able to download a new exponent, but it will always crash when it starts working on the exponent.

As a last resort, I copied a work in progress from 1 of my other machines and edited the registry accordingly. Restarted the client and the client started crunching away. So the problem seems to be bad WUs being handed out.

So far I have 4 boxes that have stopped crunching SoB and at least 2 other teammates are experiencing the same problem.

All my boxes are running on WinXP. They have been running the client for the past month with no problems until today.

Anyone else experiencing the same problem? Does anybody know what is going on?

BentRing
10-30-2003, 12:37 AM
Same

eatmadustch
10-30-2003, 12:42 AM
after what n do you get this?

kenlow
10-30-2003, 02:11 AM
Originally posted by eatmadustch
after what n do you get this?

They are all in the 498xxxx range.

Soccer9889
10-30-2003, 02:26 AM
Same problem. Running XP on a 2.4 P4

ceselb
10-30-2003, 03:52 AM
Please state your CPU and OS, that might help to locate the problem.

I'm seeing a few of these reports in the Ars Technica forum as well.

MikeH
10-30-2003, 04:17 AM
This is exactly the same problem we've had with the P-1 factoring client. Sorry to say we didn't have a solution. Just from 498xxxx onwards it works on some machines, but not on others.

As ceselb suggests, if everyone with this problem lists their PC spec this should give Louie some clues.

I'll drop Louie an e-mail in a few minutes.

hc_grove
10-30-2003, 04:17 AM
This is around the same point as Nuri found to be an upper bound (4980670) for p-1 factoring on his P4 running windows (some version).

Yet another reason to drop windows? :D

ceselb
10-30-2003, 04:23 AM
Yeah, I was thinking the same thing, but the SoB client uses different code afaik.
The same code is used by GIMPS, so they *should* have spotted any errors.

Keroberts1
10-30-2003, 04:30 AM
is this just the new version or does this happen in the previous releases too?

ceselb
10-30-2003, 04:46 AM
I tried the 31337 account, it crashes too. :(

SB v1.1 P4 running w2k.

hc_grove
10-30-2003, 06:24 AM
Originally posted by ceselb
I tried the 31337 account, it crashes too. :(

SB v1.1 P4 running w2k.

I just tried on two different machines, it works on both:
SB v1.02 PentiumMMX running Linux.
SB v1.02 Pentium III running Linux.

I have no intentions of completing those 31337 tests, but they don't show up under 'Current pending tests', so I can't expire them. Guess I just have to hope regular prp'ing doesn't catch up with those within the next 10 days. But that would leave me quite :shocked:

ceselb
10-30-2003, 06:47 AM
Tried v1.1 on an old PIII-700, works fine (seems to hang for a minute, but then starts to work).

Anybody got a P4 on windows that does work?

MikeH
10-30-2003, 08:50 AM
A suggestion for anyone who can't crunch units right now.

If you change your username in the client to "secret", you'll be testing numbers that were tested by previous searchers (before SoB), but who haven't provided residues (the proof that the tests have been done).

These numbers are quite small (n=675000), so they will complete quite quickly, and therefore might be best suited to users with permanent internet connections. The chance of finding a prime is remote but not impossible, and it won't do your personal stats any good, but it will help the SoB project.....and it's only until the client's fixed.
:walking:

OberonBob
10-30-2003, 12:42 PM
yeah, that is not a bad suggestion.

Nuri
10-30-2003, 01:46 PM
good idea Mike. :thumbs:

Lagardo
10-30-2003, 02:32 PM
This may be sheer coincidence, but 1.5e6/log10(2) = 4982892

So if there's some limit to windows or to the client at 1.5 million digits, that would show up for n around 498xxxx
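The arithmetic is easy to check. The logarithm in question is base 10, and the same logarithm gives the decimal length of k*2^n+1. What follows is a minimal sketch in Python, not anything taken from the client; the function name is made up for illustration, and it assumes floating-point logs are accurate enough at this size.

```python
import math

# n at which 2^n reaches 1.5 million decimal digits (the figure quoted above).
N_AT_1_5M_DIGITS = 1.5e6 / math.log10(2)


def decimal_digits(k: int, n: int) -> int:
    """Decimal digit count of the Proth number k*2^n + 1 (k >= 1, n >= 1)."""
    # k*2^n is even, so its last digit is never 9 and adding 1 cannot
    # carry into a new digit. The count is floor(log10(k*2^n)) + 1.
    return math.floor(math.log10(k) + n * math.log10(2)) + 1


if __name__ == "__main__":
    print(int(N_AT_1_5M_DIGITS))
    # Two k/n pairs reported in this thread: the first completed normally,
    # the second is from a log posted just before a crash on a P4.
    for k, n in [(33661, 4972896), (33661, 4983408)]:
        print(k, n, decimal_digits(k, n))
```

Because of the factor k, the real crossover to 1.5 million digits sits slightly below 4982892 for the k values in this project, but still inside the 498xxxx range.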

(my 1.0 client on a P4 running linux keeps crashing and hanging, but it has been doing that all along and doesn't seem to need a restart any more often than before:

Mon Oct 20 - 2 times
Tue Oct 21 - 3 times
Wed Oct 22 - 1 time
Thu Oct 23 - 5 times
Fri Oct 24 - 2 times
Sat Oct 25 - 2 times
Sun Oct 26 - 1 time
Mon Oct 27 - 5 times
Tue Oct 28 - 4 times
Wed Oct 29 - 2 times
Thu Oct 30 - 2 times so far

Thanks to sbwrap for the log and the restarts.)

MereMortal
10-30-2003, 05:54 PM
So the problems seem to occur on Opterons and P4 systems. Does this code have SSE2 optimizations? That seems to be the first link that pops out at me.

I have Xeons running just fine in Linux at 499+, so some problem with Windows + SSE2, perhaps?

Mystwalker
10-30-2003, 06:52 PM
The client is heavily SSE2 optimized.
I wouldn't be surprised if the crash limit of SB and SBfactor would be exactly the same. I recall Louie saying that both are "based" on the same code - maybe using the same maths libraries?

jhites
10-30-2003, 07:11 PM
Add another P4/2.6C Windows XP Rig with 2X512 Kingston HyperX PC3500 to the list that will not run SoB client. I had been running just fine up until it started trying to start with n=498xxxx and up.

I have also tried running the client in compatibility mode using Win2K, Win98, etc. Not the best test scenario for testing WinXP for being the problem but it was just a thought I had and tried.

jhites
10-31-2003, 05:58 AM
I think the problem is related to P4 systems.

Just got 2 new proth tests on 2 of my AMD Win2K rigs and they are running fine.
got proth test from server (k=33661, n=4997112)
got proth test from server (k=19249, n=4999082)

Still can not run on my P4/2.6C WinXP rig

Memnoch
10-31-2003, 08:40 AM
I've been able to successfully run using the secret account, on my P4/ XP pro system so it's something to do with the larger Ns...

bburgner
10-31-2003, 12:01 PM
This is my log before it crashed.

[Wed Oct 29 17:14:32 2003] got proth test from server (k=33661, n=4983408)
[Wed Oct 29 17:15:31 2003] got k and n from cache

I'm running Windows 2000 SP 4 on a Mobile Pentium-4 2.00 GHz.

The error appears to be a null pointer error. "instruction at blah referenced memory at 0x00000"

HTH,
Brian

Nuri
10-31-2003, 03:58 PM
k=22699, n=5001382.

P4-1700.

And as expected, crashes.

allio
10-31-2003, 06:42 PM
Have we determined if the crashing is isolated to NT/2k/XP yet? It seems that it's definitely only SSE2 enabled cpus (ie, the p4 and opteron) that are having trouble.

Lagardo
10-31-2003, 07:24 PM
As far as I can tell, my athlons seem to be producing fewer cem/s lately than they used to. I cannot really tell whether this is because the FFT length just made another jump (odd coincidence?) or whether there are in fact "hidden crashes" on athlons -- i.e. the current exponents kill athlons as much as P4's, just not as thoroughly, and thus the client has to be restarted by the service and produces less "long-term average" output.

I just thought I'd mention this to people who have (only) athlons running: is your output constant or have you seen reduction in production (in terms of your actually submitted cem/sec, not in terms of the nonsense the client app displays) in the last couple days?

jhites
10-31-2003, 07:56 PM
Originally posted by allio
Have we determined if the crashing is isolated to NT/2k/XP yet? It seems that it's definitely only SSE2 enabled cpus (ie, the p4 and opteron) that are having trouble.

Well my other P4 2.4C/3.0Ghz rig is off SoB, also.

All of my AMD rigs (5 total) are running just fine and they are
running Win2K Pro and WinXP Pro. This would indicate that this
bug is with P4 processors and not the OS.

All my AMD rigs have completed tests and downloaded new n = 498-500
range proth tests.

Joe O
10-31-2003, 09:40 PM
Originally posted by Lagardo
As far as I can tell, my athlons seem to be producing fewer cem/s lately than they used to. I cannot really tell whether this is because the FFT length just made another jump (odd coincidence?)

While I'm not 100% sure, I think that we just crossed or are crossing an FFT boundary. If you look at the exponent boundaries for GIMPS/Prime95 non SSE2 code, the 2 closest are 4598000 & 5255000. The Pentium 4 SSE2 code has slightly different ranges for each FFT size. You also have to allow for the "K" values. They would force us to switch to a larger FFT a little sooner than GIMPS/Prime95.
We are having similar problems in SB P-1 factoring. If you search through the P-1 Factoring program thread, you will see that Louie fixed a similar problem for SBFactor running under LINUX. That problem was a code alignment problem. The problem was manifested only under Linux not windows, and the fix was only to the Linux not the windows program. Well this time the Linux programs work and the Windows ones don't. If the problem with the PRP client is the same as the SBFactor program, the non SSE2 machines are not affected. At least not yet. I don't know if a similar problem awaits us at the next boundary. We can only wait and see. By the way, the next non SSE2 boundaries for GIMPS/P95 are at 6520000 & 7760000.
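To see where an exponent falls relative to the four non-SSE2 boundaries quoted in this post, a lookup like the following works. It is only a sketch in Python: the function name is invented, the list holds just the numbers given above, and it says nothing about the SSE2 ranges or the shift caused by the k values, neither of which is spelled out in this thread.

```python
from bisect import bisect_right

# Non-SSE2 GIMPS/Prime95 exponent boundaries as quoted in the post above.
NON_SSE2_BOUNDARIES = [4598000, 5255000, 6520000, 7760000]


def boundary_bracket(n, boundaries=NON_SSE2_BOUNDARIES):
    """Return the (lower, upper) quoted boundaries around exponent n.

    None on either side means n lies outside the quoted list there.
    """
    i = bisect_right(boundaries, n)
    lower = boundaries[i - 1] if i > 0 else None
    upper = boundaries[i] if i < len(boundaries) else None
    return lower, upper


if __name__ == "__main__":
    # An exponent from one of the crash reports in this thread.
    print(boundary_bracket(4983408))
```

For n in the 498xxxx range this returns the pair 4598000 and 5255000, so the crash point is not at a non-SSE2 boundary, which fits with only the SSE2 machines being hit.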

Slatz
10-31-2003, 10:20 PM
would be nice to see a reply from Louie or someone with access to the code that they are aware of the problem and are looking at it :confused: :rolleyes:

Slatz

[EGBT]ComOy
11-01-2003, 02:57 AM
I think I may know what the problem is: the program doesn't create the z****** file!

Or at least this is what occurred on my computer. I finished a test with n=4.97 million or so and then dl'ed a new one with n=5 million or so. Using Go Back 3 Deluxe, I was able to confirm that no z***** file was created for my new n! However, the program refuses to give up trying: even though I successfully expired the test (I hadn't read this thread yet and assumed it was just that test at first) via the "preferences" tab on the main page, the client will not get a new test (and remember, there is no z***** file!). As far as I know, if you expire a test and make sure the z***** file does not exist, you should automatically acquire a new test!

Therefore, I'm willing to guess that the limitation is the 1.5 million digits thing, and that this limitation is causing a problem with creating a z****** file for n>4.98 million or so.

Good luck figuring out how to fix this....I'm clueless. I'll just try changing my username to [removed by alien88] for now.

Oh yeah I'm running a P4 2.8C with Kingston HyperX 3200, WinXP Pro, a 10,000 RPM HD and the service install of the client in -o2 mode.

rosebud
11-01-2003, 03:37 AM
Did anyone try to use an older version??

I once tested the 31337 account with a P4 and it worked just fine. That was when v1.1 wasn't out yet.

eatmadustch
11-01-2003, 05:37 AM
my test actually switched off my computer!!
here's the log:

[Sat Nov 01 04:48:03 2003] n.high = 4963519 . 1 blocks left in test
[Sat Nov 01 04:55:41 2003] residue: 7F2449D014FB3C22
[Sat Nov 01 04:55:41 2003] completed proth test(k=33661, n=4972896): result 3 <--damn, no prime ;)
[Sat Nov 01 04:55:41 2003] connecting to server
[Sat Nov 01 04:55:42 2003] logging into server
[Sat Nov 01 04:55:43 2003] requesting a block
[Sat Nov 01 04:55:48 2003] got proth test from server (k=21181, n=5007260)

*** Sat Nov 01 10:00 wake up, realize computer isn't working, switch it on***

[Sat Nov 01 11:18:22 2003] got k and n from cache

*** start client again, crashes again ...

[Sat Nov 01 11:24:34 2003] got k and n from cache

I would really like a fix for this, now I only have my slow athlon!
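With several machines it can be quicker to scan the client logs for assignments above the trouble point than to read them by eye. Below is a small sketch in Python that assumes the log line format shown in the posts in this thread; the function names and the 4,980,000 cutoff are illustrative choices, not something the client defines.

```python
import re

# Matches lines such as:
# [Sat Nov 01 04:55:48 2003] got proth test from server (k=21181, n=5007260)
PROTH_LINE = re.compile(
    r"\[(?P<when>[^\]]+)\] got proth test from server "
    r"\(k=(?P<k>\d+), n=(?P<n>\d+)\)"
)

# Assumed cutoff, taken from the "498xxxx" reports in this thread.
SUSPECT_N = 4_980_000


def assigned_tests(log_text):
    """Yield (timestamp, k, n) for every test the server handed out."""
    for m in PROTH_LINE.finditer(log_text):
        yield m["when"], int(m["k"]), int(m["n"])


def suspect_tests(log_text, cutoff=SUSPECT_N):
    """Return the assignments whose n is above the cutoff."""
    return [t for t in assigned_tests(log_text) if t[2] > cutoff]


if __name__ == "__main__":
    sample = (
        "[Sat Nov 01 04:55:43 2003] requesting a block\n"
        "[Sat Nov 01 04:55:48 2003] got proth test from server "
        "(k=21181, n=5007260)\n"
        "[Sat Nov 01 11:18:22 2003] got k and n from cache\n"
    )
    for when, k, n in suspect_tests(sample):
        print(when, k, n)
```

Lines such as "got k and n from cache" carry no k or n, so they are skipped; only the server assignment lines are reported.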

MikeH
11-01-2003, 06:04 AM
Originally posted by Slatz
would be nice to see a reply from Louie or someone with access to the code that they are aware of the problem and are looking at it

I have heard from Louie. He is on the case. :)

hc_grove
11-01-2003, 06:20 AM
Originally posted by allio
Have we determined if the crashing is isolated to NT/2k/XP yet? It seems that it's definitely only SSE2 enabled cpus (ie, the p4 and opteron) that are having trouble.

The "new" client (v1.10) hasn't been ported to anything but windows yet, and I don't think anyone ever made v1.02 run under Linux on a P4 (it was discussed in several threads in February/March). I once managed to make an even older client (I think it was v1.00) run under Linux on a P4, but I have deleted that, and am now using that machine for P-1 factoring.

But the problem seems very similar to the problem with SBfactor that only occurs under windows.

[EGBT]ComOy
11-01-2003, 03:30 PM
Guys, please try to see if you have .z****** files for these exponents. I had THREE different tests greater than 4.98 million on my computer and NONE of them successfully created a .z***** file. If this is also occurring on anyone else's computer, I'm pretty sure that it is what is causing the problem (or, at least, the problem is occurring before the cache file is created, and as far as I know, the process goes like this: 1) You get a test from the server, 2) You create a cache (.z******) file for the test 3) You actually start testing. Obviously the problem is not in 1), so 2) is the next logical conclusion).

In case you don't know, the cache file for your tests is a .z******* file that is saved in the directory where you installed SB (or, equivalently, sobsvc). The ******* is the exponent (n-value)....for example, if you were testing 5353*2^5009190+1, your .z******* file would be called .z5009190. Therefore, if you were assigned a test with n>4.98 million, PLEASE check whether or not you have a .z498**** file (or a .z499**** file or a .z500**** file) on your computer and then post about it here!
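The check described here can be scripted. The sketch below is in Python and assumes the naming scheme exactly as given in this post (a file called .z followed by the n-value, in the install directory); the function names are made up, and a missing file only tells you the client has not written its cache yet, not why.

```python
import tempfile
from pathlib import Path


def cache_file(install_dir, n):
    """Path of the cache file for exponent n, per the naming in this post."""
    return Path(install_dir) / f".z{n}"


def missing_cache_files(install_dir, exponents):
    """Return the exponents that have no cache file in install_dir."""
    return [n for n in exponents if not cache_file(install_dir, n).exists()]


if __name__ == "__main__":
    # Demo in a throwaway directory; point install_dir at the real
    # SB or sobsvc folder to check an actual installation.
    with tempfile.TemporaryDirectory() as install_dir:
        cache_file(install_dir, 4972896).touch()
        print(missing_cache_files(install_dir, [4972896, 5009190]))
```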

eatmadustch
11-01-2003, 03:34 PM
My client didn't create the z****** files either

hc_grove
11-01-2003, 05:49 PM
The z******* files aren't made until the test has run for around 10 minutes.

[EGBT]ComOy
11-01-2003, 09:55 PM
Wow, I never noticed that before. Well, that means instead of the cache files being the problem, it probably has to do with the communication between the cache and the client (since the tests are obviously cached as they won't go away). But that pretty much leads us right back to square one......

:(

jjjjL
11-02-2003, 01:49 AM
I have been aware of this problem for a couple days.

The issue only affects processors that use SSE2 (ie P4, Opterons(?)) for exponents n > 4980000.

A fix is ready. Download this new version. It will be posted on regular mirrors soon but here it is now:
SB v1.1.1
http://www.seventeenorbust.com/download/

Cheers,
Louie

Alien88
11-02-2003, 03:51 AM
woooo im an idiot and confused two usernames.. maybe its time for sleep :P

Nuri
11-02-2003, 07:28 AM
Installed the client.

First set of feedback:

Good news first. It seems to work.

And a strange behavior. It first grabbed 55459/5017270 k/n pair. I exited the client roughly 20 seconds after the y5017270 file was created. When restarted a couple of minutes later, it grabbed another k/n pair (24737/4823143). May be just a coincidence. No such problems with the second k/n pair with exiting and restarting.

rosebud
11-02-2003, 09:00 AM
Seems to work here too. :thumbs:

jhites
11-02-2003, 09:49 AM
It is working for me on my (2) P4 HT rigs which is good news. The bad news is that with SSE2 turned off in the new v1.1.1 client there is a ~30-40% performance hit. With the -o2 optimization using v1.1.0, I would normally run about 480-490xxx cEMs/sec running 2 clients and now I am only getting about 297xxx cEMs/sec running 2 clients. :(

MikeH
11-02-2003, 11:44 AM
Originally posted by jhites
The bad news is that with SSE2 turned off

This is a short term fix to get all the P4s back up and working again. This gives Louie a bit of breathing space to get the SSE2 fix sorted.

MikeH
11-03-2003, 02:58 PM
Many thanks to everyone that gave their cycles to 'secret'.:cheers: