Server down?



S3NTYN3L
04-21-2016, 10:06 AM
I'm posting here in the hope that someone might be able to contact an SoB project/site admin, in case they're not aware of the issue.
The project/site server seems to be down and has been for the past 48 hours.
As a result, I'm unable to send any results or receive new assignments.

engracio
04-22-2016, 09:44 AM
Did somebody forget to pay the domain name renewal?? I think it's due this year.

AG5BPilot
04-23-2016, 01:17 PM
Hello,

This is Michael Goetz, admin over at PrimeGrid. As you're likely aware, we've been collaborating with the Seventeen or Bust project for the last six years.

I'll let Louie fill you in with the details, but he's in the process of restoring/moving the server. I get the impression that this is not something that was planned. No ETA yet.

Anyone suffering from severe SoB-withdrawal symptoms is, of course, welcome to fill the downtime by crunching SoB tasks over at PrimeGrid. Different software, but the same project. Please do return here when SoB is back online, however.

Hopefully Louie will get everything back up and running soon.

Mike

S3NTYN3L
04-23-2016, 11:32 PM
Different software, but the same project.

Which would mean that we'd still need to connect to the project server for assignments, yes?
As that server is down, we'd be in the same situation we are now.

AG5BPilot
04-24-2016, 07:54 AM
Which would mean that we'd still need to connect to the project server for assignments, yes?
As that server is down, we'd be in the same situation we are now.

No. PrimeGrid's servers work completely independently of Seventeen or Bust's servers. All coordination between the projects is done on a purely human level between the admins, not electronically.

Both servers have different lists of candidates they're testing, so we don't duplicate the same work. But we're both working on solving the Sierpinski problem. At the moment we're sending out work for k=10223 and k=67607 with 31M<n<32M.

S3NTYN3L
04-24-2016, 12:27 PM
Both servers have different lists of candidates they're testing, so we don't duplicate the same work. But we're both working on solving the Sierpinski problem. At the moment we're sending out work for k=10223 and k=67607 with 31M<n<32M.

Understood.
Any work assigned from one site wouldn't be credited on the other.

Sorry, but that is a problem for me.
(I've OCD issues).

engracio
04-25-2016, 09:48 AM
Good to know. Thanks for the update Mike. :)

engracio
04-29-2016, 03:22 PM
Any update???

AG5BPilot
04-30-2016, 07:39 AM
Any update???

I have no additional information.

AG5BPilot
05-08-2016, 09:22 PM
We just heard from Louie and the news is not good. It looks like the old server, and everything on it, is toast. From what I understand the software that runs the server was lost too. There were backups, but unfortunately the entire datacenter was lost, taking the backups with it.

No decision has been made about what happens next -- in fact I just got this information and there's been no discussion yet about how or if to move forward.

We will keep you informed as new information becomes available.

For the sake of those wondering what they should do, I'll give you a guess as to what I think will happen. Everything from this point on is pure speculation on my part and may very well be incorrect. Take the rest of this message with a huge grain of salt.

It sounds like none of the software or data can be recovered from the datacenter, and if Louie had off-site backups somewhere he would have already found them. Everything on the SoB server is probably gone forever. The records of which numbers were tested, the residues, the user records, everything. Louie has indicated that he probably won't be relaunching SoB unless he's able to recover at least the software.

The only information that still exists is what PrimeGrid has. So what does PrimeGrid have? Well, for one thing, we have almost all of the sieve data, so that's not lost. Plus, we have complete records of all the work done on PrimeGrid. We also have complete records of all the work that was done on SoB for the two k's that were permanently transferred to PrimeGrid last year. Everything on the other 4 k's might be lost, however.

If SoB goes under, PrimeGrid will continue to work on our 2 k's. I suspect we'll pick up the other 4 k's as well if SoB can't be restarted.

S3NTYN3L, unfortunately, it looks like there's no way to recover any of the user data, so all credits and other information relating to SoB appear to be gone for good.

tim
05-08-2016, 11:15 PM
What the actual ****ing hell?


My log files can be used to re-create some of the data submitted lately.

AG5BPilot
05-09-2016, 03:55 AM
What the actual ****ing hell?


My log files can be used to re-create some of the data submitted lately.

Log files would be helpful. I should have thought to mention that.

If Louie and/or PrimeGrid need to try to put the pieces back together again from scratch, every little bit will help.

engracio
05-09-2016, 10:43 AM
I have all of my residues for at least 5 years. Also, we have done double checks all the way up to 30M, just saying.

AG5BPilot
05-10-2016, 08:27 AM
I have all of my residues for at least 5 years. Also, we have done double checks all the way up to 30M, just saying.

Feel free to send any logs that you have to me at mgoetz [at] primegrid [dot] com. I'll make sure Louie gets those, along with anything we have that can help. If SoB doesn't get going again, then those would also be helpful for restarting the full SoB project at PrimeGrid if that's the direction we go.

engracio
05-10-2016, 12:16 PM
Feel free to send any logs that you have to me at moetz [at] primegrid [dot] com. I'll make sure Louie gets those, along with anything we have that can help. If SoB doesn't get going again, then those would also be helpful for restarting the full SoB project at PrimeGrid if that's the direction we go.

Thanks Mike, I will. I still have a few WUs to finish; I will send everything as soon as they complete. I wanted to make sure the WUs I had remaining were done. About a week or so.

tim
05-11-2016, 12:22 AM
I'll gather my log files too, when current work units are done. Kills me to think of the wasted clock cycles.

S3NTYN3L
05-11-2016, 10:36 AM
Wow.
Fourteen years of work gone in an instant. :(
What an extreme example of the need for OFF SITE BACKUPS.

Can the wayback machine (http://archive.org/web/) possibly help to get some of the code back?

I'll allow my current assignments to finish. What are you wanting when they're done? My results.txt file? Something else?

Regardless, I'll find some other diversion for my OCD.
This is just too painful of a mind**** to process at the moment...

shifted
05-11-2016, 12:03 PM
I don't keep all my logs, but I'll see what I have. I was one of the top contributors.

shifted
05-11-2016, 03:25 PM
Feel free to send any logs that you have to me at moetz [at] primegrid [dot] com. I'll make sure Louie gets those, along with anything we have that can help. If SoB doesn't get going again, then those would also be helpful for restarting the full SoB project at PrimeGrid if that's the direction we go.

Should be mgoetz [at] primegrid [dot] com :)

AG5BPilot
05-11-2016, 04:12 PM
Should be mgoetz [at] primegrid [dot] com :)

::babbles incoherently::

Yup, spelled my own name wrong. :)

I've corrected the original post; thanks.

engracio
05-18-2016, 11:03 AM
Mike please check your mail. Thanks

AG5BPilot
05-18-2016, 04:46 PM
Got it earlier today, thanks.

endless mike
05-19-2016, 05:40 AM
Are you still looking for log files? I've got a few from just over a year ago, don't know how helpful they would be. Let me know if you could use them, and I'll do some digging to get them to you.

AG5BPilot
05-19-2016, 11:17 AM
Are you still looking for log files? I've got a few from just over a year ago, don't know how helpful they would be. Let me know if you could use them, and I'll do some digging to get them to you.

Yes, and yes. ALL log files will be useful. If everything was indeed lost, all tests for which there are no records will need to be redone. So every result from a log file is one less result that will eventually have to be repeated.

jMcCranie
05-19-2016, 03:44 PM
Yes, and yes. ALL log files will be useful. If everything was indeed lost, all tests for which there are no records will need to be redone. So every result from a log file is one less result that will eventually have to be repeated.

First, I have several log files that I will send after the last one finishes.

Secondly, why is it necessary to redo all of the old ones? We know that all exponents below a certain limit (27,700,000) have been checked - I don't see a need to redo those.

AG5BPilot
05-19-2016, 05:16 PM
First, I have several log files that I will send after the last one finishes.

Secondly, why is it necessary to redo all of the old ones? We know that all exponents below a certain limit (27,700,000) have been checked - I don't see a need to redo those.

We don't know which ones were double checked -- and at PrimeGrid I've got a really good window into the quality -- or lack thereof -- of the computers used in distributed computing. In general, we no longer trust any results unless they're double checked. The problem with not immediately double checking results is that when a computer starts going bad, you have no way of detecting it. So any results that don't have matching residues from different computers are suspect. Unless we get really lucky, except for whatever we can get from log files, we have no residues at all on 4 of the 6 k's.

Calculation errors are proportionally more likely to occur on larger candidates, especially when the error rate is fairly low, but non-zero.

Our position on double checking is especially rigid when it comes to conjectures like SoB. Consider a hypothetical k where the first prime is at n=100,000, and the second prime is at n=100,000,000. If you miss the first prime because of an undetected computation error, many years of unnecessary computing will be wasted searching for the second prime.

It's actually not as horrible as it might seem at first glance. The vast majority of candidates are small and can be rechecked much faster than the original search.
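To make the "matching residues from different computers" rule concrete, here is a minimal sketch in Python. It is not PrimeGrid's or SoB's actual validator; the tuple layout, the host names, and the candidate labels and residues in `sample` are all made up for illustration.

```python
from collections import defaultdict


def validated_candidates(results):
    """Return {candidate: residue} for candidates confirmed by two different computers.

    `results` is an iterable of (candidate, computer_id, residue) tuples.
    This layout is a hypothetical one chosen for the example.
    """
    # candidate -> residue -> set of computers that reported that residue
    seen = defaultdict(lambda: defaultdict(set))
    for candidate, computer_id, residue in results:
        seen[candidate][residue].add(computer_id)

    confirmed = {}
    for candidate, by_residue in seen.items():
        for residue, computers in by_residue.items():
            # Two results from the same computer do not count as a double check.
            if len(computers) >= 2:
                confirmed[candidate] = residue
                break
    return confirmed


# Made-up sample data: only the first candidate has matching residues
# from two different hosts.
sample = [
    ("10223*2^31000001+1", "host-a", "3F2A9C0011223344"),
    ("10223*2^31000001+1", "host-b", "3F2A9C0011223344"),
    ("67607*2^31000003+1", "host-a", "AAAA000011112222"),
    ("67607*2^31000003+1", "host-c", "BBBB000011112222"),
    ("67607*2^31000027+1", "host-a", "1234567890ABCDEF"),
    ("67607*2^31000027+1", "host-a", "1234567890ABCDEF"),
]
print(validated_candidates(sample))
```

Anything that does not come out of a check like this (a single result, mismatched residues, or the same machine twice) would stay on the "suspect" list and need another test.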

endless mike
05-19-2016, 06:21 PM
Yes, and yes. ALL log files will be useful. If everything was indeed lost, all tests for which there are no records will need to be redone. So every result from a log file is one less result that will eventually have to be repeated.

I will get my results files to you; might take me until Monday though. Real life responsibilities come first.

AG5BPilot
05-19-2016, 07:41 PM
I will get my results files to you; might take me until Monday though. Real life responsibilities come first.

Thank you! There's no need to feel rushed. Louie can, of course, take as much time as he needs, and we're not going to rush him. We're not going to take any action until he determines if restarting SoB is possible.

jMcCranie
05-19-2016, 08:54 PM
We don't know which ones were double checked

Unlike the Mersenne prime search, we only need to double check positive results. The time double-checking is better spent checking new numbers. For Mersenne primes, we want a complete list. For 17-or-bust, we only need to find a prime for each coefficient. If we get a false negative, no harm is done if we find a prime for that coefficient.

endless mike
05-20-2016, 03:03 AM
Unlike the Mersenne prime search, we only need to double check positive results. The time double-checking is better spent checking new numbers. For Mersenne primes, we want a complete list. For 17-or-bust, we only need to find a prime for each coefficient. If we get a false negative, no harm is done if we find a prime for that coefficient.

A false negative would mean a missed prime. Potentially wasted years of computing would count as harm in my book. That's the main reason I gave up on SOB and went back to GIMPS.


Consider a hypothetical k where the first prime is at n=100,000, and the second prime is at n=100,000,000. If you miss the first prime because of an undetected computation error, many years of unnecessary computing will be wasted searching for the second prime.

S3NTYN3L
05-20-2016, 04:54 AM
Wow.
Fourteen years of work gone in an instant. :(
What an extreme example of the need for OFF SITE BACKUPS.

Can the wayback machine (https://web.archive.org/web/20160305001541/http://www.seventeenorbust.com/) possibly help to get some of the code back?

I'll allow my current assignments to finish. What are you wanting when they're done? My results.txt file? Something else?

Regardless, I'll find some other diversion for my OCD.
This is just too painful of a mind**** to process at the moment...



Just because it seems to have been missed due to moderation...

AG5BPilot
05-20-2016, 08:26 AM
Wow.
Fourteen years of work gone in an instant. :(
What an extreme example of the need for OFF SITE BACKUPS.

Can the wayback machine (http://archive.org/web/) possibly help to get some of the code back?

I'll allow my current assignments to finish. What are you wanting when they're done? My results.txt file? Something else?

Regardless, I'll find some other diversion for my OCD.
This is just too painful of a mind**** to process at the moment...


Just because it seems to have been missed due to moderation...

I did miss that somehow. Thanks for bringing it up again.

Louie's the one who would be interested in code, so the Wayback machine might possibly be of use to him.

As for off site backups, you've got that right. Stuff happens. I had a 110-story building fall on top of one of my data centers once. Not only did we not lose data, but our business operations didn't miss a beat. Both Jim and I are rather obsessive about making sure the data is backed up, including daily copies to an off-site location. In the event of a real disaster, PrimeGrid may be down for as much as 24 hours, but at worst we'll lose the last day's results.

If you do this stuff long enough, you see this kind of thing happen over and over. If you do it for a living, you're expected to make sure that you're protected against this kind of event because it's a matter of when, not if. The problem is that when you're running things on a shoestring budget that's typical of an operation run as a hobby or a club, it can be really hard to put together the resources just to keep the operation running, let alone to have adequate assets in place to handle disaster recovery scenarios. So let's not judge the powers that be too harshly because it's difficult enough just to keep something like this running.

For me personally, I use a backup system that lets me back up my entire family's computers to the cloud.

As for which files, what I'm interested in is anything that lists test results, meaning the number being tested and the residue produced by the program. I have no idea what Louie might want.

Finally, yes, this is a huge blow. No way to sugarcoat this; it's going to set back this search quite a lot if the data can't be recovered. The only good news in this is that because we decided to split the project between SoB and PG by k's, at least we have all the data for 2 of the 6 remaining k's. Only two thirds of the work is lost rather than all of it.

S3NTYN3L
05-20-2016, 09:44 AM
So let's not judge the powers that be too harshly because it's difficult enough just to keep something like this running.

I'm not judging anyone.
It was simply a statement alluding to my shock and sorrow regarding all the years of work lost.

Hindsight is always 20/20, I know. Lesson learned.


Please DO let Louie know about the Wayback Machine! ;)
It might take a bit of source code browsing but, even with my limited skills, I'm able to get into some of the back-end directory structure, so... :dunno:


Mike, you never answered my questions.
Would you like me to send you my results.txt file?
Is that the only file you'll need?

AG5BPilot
05-20-2016, 12:33 PM
I'm not judging anyone.
It was simply a statement alluding to my shock and sorrow regarding all the years of work lost.

Hindsight is always 20/20, I know. Lesson learned.

Sorry, I wasn't clear! I was stating that MY comments were not intended to be judgemental. I certainly did not mean to imply that you were being judgemental. Everyone's a little shell shocked right now.



Please DO let Louie know about the Wayback Machine! ;)
It might take a bit of source code browsing but, even with my limited skills, I'm able to get into some of the back-end directory structure, so... :dunno:

Will do.


Mike, you never answered my questions.
Would you like me to send you my results.txt file?
Is that the only file you'll need?

Short answer: Yes! Longer answer: I'm not familiar with SoB's client files, so I don't know. Other people have sent that file, and it seems to have what we would need. That being said, I'd rather have too much information than too little. :)

jMcCranie
05-20-2016, 09:30 PM
A false negative would mean a missed prime. Potentially wasted years of computing would count as harm in my book. That's the main reason I gave up on SOB and went back to GIMPS.

Yes, but the purpose of SOB is to try to prove the Sierpinski conjecture. Large primes are very rare and a false positive is extremely rare. Double-checking in SOB cuts the throughput down by half. In other words, double-checking essentially doubles the expected computing that has to be done to prove the conjecture.

endless mike
05-20-2016, 11:22 PM
Yes, but the purpose of SOB is to try to prove the Sierpinski conjecture. Large primes are very rare and a false positive is extremely rare. Double-checking in SOB cuts the throughput down by half. In other words, double-checking essentially doubles the expected computing that has to be done to prove the conjecture.

Consider the highly unlikely possibility that there's only one prime for a given k. A false negative with no double checking means we crunch that k forever and never prove the conjecture. Unlikely to happen that way, but not impossible. People more in the know claim an error rate of about 4% (IIRC) on GIMPS. On the PrimeGrid message board, someone mentioned that a SOB work unit had to be sent out an average of 4.7 times to get a matching doublecheck. That post is three years old, but I can't imagine the situation is much different now. I still think double checking is valuable.

AG5BPilot
05-21-2016, 07:06 AM
On the PrimeGrid message board, someone mentioned that a SOB work unit had to be sent out an average of 4.7 times to get a matching doublecheck. That post is three years old, but I can't imagine the situation is much different now. I still think double checking is valuable.

While I'm solidly in the "double checking is a necessity" camp (and I'm one of the people making the decisions), let me correct that "4.7" statistic. While it's true that some of our sub-projects require a lot of tasks to be sent out in order to get two matching results, it's not because the results don't match. It's because most of the tasks either don't get returned at all (or at least not by the deadline), or have some sort of error that prevents the result from completing. Bad residues are more common than I'd like, but they're not THAT common. Here's some hard data on our SoB tasks currently in the database:

SoB:
Completed workunits: 1089
Average number of tasks per WU (2 matching tasks are required, and 2 are sent out initially): 3.7805 tasks per workunit (4117 tasks total)
Number of tasks successfully returned but eventually proven to be incorrect: 61

As you can see, about 6% of the workunits had tasks that looked like they returned a correct result, but in fact didn't. These are SoB tasks -- the same as you run here. We use LLR, but it uses the same gwnum library as you do here, so the error rates are going to be comparable. LLR has lots of internal consistency checks, so many computation errors are caught and not even returned to us. That's just the ones that slipped through all the checks and made it to the end.

At PrimeGrid we detect the errors, so the user gets an immediate indication that something's wrong. On projects that don't double check, the users never know there's a problem, so the error rate might be higher.

The numbers are worse on GPU calculations. It's much harder to get GPUs to work at all, resulting in many tasks which fail immediately. On our GFN (n=22) tasks, which are GIMPS-sized numbers:

GFN-22:
Completed WUs: 2217
Tasks: 17996 (about 8 tasks per WU)
Completed but incorrect tasks: 85 (about 4%)

Some of those tasks are CPU tasks, but the vast majority are GPU tasks.

So there's your hard data: On the long tasks (SoB on CPU, GFN-22 on GPU), about 6% of workunits had seemingly correct results from CPUs which turned out to be wrong, and about 4% of the workunits had GPU tasks which were wrong.

(Frankly I'm surprised that the CPU error rate is higher than the GPU error rate.)
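For anyone who wants to re-derive the percentages, here is a small Python sketch that recomputes them from the counts quoted in this post. The numbers are the ones given above; the function name `summarize` and the choice to also show a per-task rate are mine, not something from PrimeGrid's database tooling.

```python
def summarize(name, workunits, tasks, bad):
    """Print and return tasks per workunit and bad-result rates.

    `bad` is the number of tasks that were returned as successful but were
    later proven incorrect.
    """
    tasks_per_wu = tasks / workunits
    bad_per_wu = bad / workunits    # the "about 6%" / "about 4%" figures
    bad_per_task = bad / tasks      # same count, measured against all tasks sent
    print(f"{name}: {tasks_per_wu:.4f} tasks/WU, "
          f"{bad_per_wu:.1%} bad per WU, {bad_per_task:.1%} bad per task")
    return tasks_per_wu, bad_per_wu, bad_per_task


# Counts quoted in the post above.
sob = summarize("SoB", workunits=1089, tasks=4117, bad=61)
gfn = summarize("GFN-22", workunits=2217, tasks=17996, bad=85)
```

Note that the quoted percentages are bad results per completed workunit. Measured against every task sent out, the rates come out lower, because many tasks are never returned or fail outright.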

engracio
05-21-2016, 04:18 PM
A while back we ran a double check up to 30M. After most WUs were returned, k=22699 had only 1 WU still to be submitted. The higher k's had fewer than 10 WUs each still to be returned. Unfortunately, I am not sure if the results were matched against the previous results. Mike, any idea??

AG5BPilot
05-21-2016, 04:37 PM
A while back we ran a double check up to 30M. After most WUs were returned, k=22699 had only 1 WU still to be submitted. The higher k's had fewer than 10 WUs each still to be returned. Unfortunately, I am not sure if the results were matched against the previous results. Mike, any idea??

None at all.

jMcCranie
05-21-2016, 09:32 PM
Consider the highly unlikely possibility that there's only one prime for a given k. A false negative with no double checking means we crunch that k forever and never prove the conjecture. Unlikely to happen that way, but not impossible...

OK, so how about: no double checking to try to resolve the conjecture nearly twice as quickly, but if and when it gets down to only one k with unknown status, run a double check on those.

----Added----

I'll make an analogy. Suppose that there are a large number of boxes. A small number of boxes contain a diamond and you want to find diamonds. The first time you look in a specific box, if it contains a diamond, there is a 5% chance that you will not see it.

Should you (1) spend half of your time double-checking boxes you have already opened, or (2) open as many boxes as you can? I would open as many boxes as I can.

jMcCranie
05-21-2016, 09:33 PM
SoB:
Completed workunits: 1089


What is a "workunit"?

AG5BPilot
05-21-2016, 11:50 PM
What is a "workunit"?

For the purposes of this discussion, "a candidate", i.e., a number to be tested, is a reasonable definition.

Joe O
05-23-2016, 07:37 AM
OK, so how about: no double checking to try to resolve the conjecture nearly twice as quickly, but if and when it gets down to only one k with unknown status, run a double check on those.

----Added----

I'll make an analogy. Suppose that there are a large number of boxes. A small number of boxes contain a diamond and you want to find diamonds. The first time you look in a specific box, if it contains a diamond, there is a 5% chance that you will not see it.

Should you (1) spend half of your time double-checking boxes you have already opened, or (2) open as many boxes as you can? I would open as many boxes as I can.

It is important to note that the boxes are numbered, and
1) The lower numbered boxes are more likely to contain a diamond than the higher numbered boxes.
2) The higher numbered boxes are harder to open than the lower numbered boxes.

AG5BPilot
05-23-2016, 10:14 AM
It is important to note that the boxes are numbered, and
1) The lower numbered boxes are more likely to contain a diamond than the higher numbered boxes.
2) The higher numbered boxes are harder to open than the lower numbered boxes.

Taking that a bit further, the difficulty of opening the boxes is proportional to the square of the box number, and the overall chance of finding a diamond (taking into account how hard it is to open the box as well as the likelihood of a given box containing a diamond) is inversely proportional approximately to the cube of the box number times the logarithm of the box number. Diamonds in higher numbered boxes are much harder to find. You really don't want to miss the easy ones, ever.

The allure of progressing twice as fast is obvious, but the penalty for missing a prime is tremendous.
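To put rough numbers on that scaling, here is a short Python sketch. It takes the two proportionalities exactly as stated above (effort grows with n squared, chance of a find per unit of effort falls as 1/(n^3 log n)) and only compares ratios, since the constants are unknown. The function names and the two example values of n (3M and 31M) are my own picks for illustration.

```python
import math


def relative_cost(n, n_ref):
    """Effort to test one candidate at n, relative to one at n_ref (cost ~ n^2)."""
    return (n / n_ref) ** 2


def relative_yield(n, n_ref):
    """Chance of a find per unit of effort at n, relative to n_ref.

    Uses the 1 / (n^3 * log n) proportionality from the post above.
    """
    return (n_ref ** 3 * math.log(n_ref)) / (n ** 3 * math.log(n))


n_small = 3_000_000    # illustrative "low numbered box"
n_large = 31_000_000   # illustrative "high numbered box"

print(f"One test at n={n_large:,} costs about "
      f"{relative_cost(n_large, n_small):.0f}x one test at n={n_small:,}")
print(f"Effort spent at n={n_small:,} is about "
      f"{1 / relative_yield(n_large, n_small):.0f}x more likely to turn up a prime")
```

Under these assumptions the gap between a low box and a high box is three orders of magnitude, which is the argument for never leaving a low box unverified.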

tim
05-29-2016, 12:34 AM
Mike, please check your email for my results.txt files.

chris
05-30-2016, 02:41 PM
Greetings,

about the double check discussion.

I'd like to remind you guys that at least one of the primes was found via secondpass - that means a prime was missed with the firstpass tests, aka we already had a false negative - right in the SoB project. (the one at ~3M)


Chris

engracio
06-01-2016, 10:42 AM
All true. Just saying. Hate to go 10 miles down the road and find out we missed the turn.
Greetings,

about the double check discussion.

I'd like to remind you guys that at least one of the primes was found via secondpass - that means a prime was missed with the firstpass tests, aka we already had a false negative - right in the SoB project. (the one at ~3M)


Chris

jMcCranie
06-01-2016, 07:17 PM
Is there any word on the SeventeenOrBust project?

AG5BPilot
06-01-2016, 08:53 PM
Is there any word on the SeventeenOrBust project?

Nothing new to report.

Rest assured that someone (probably me, unless Louie jumps in) will let you know any information as soon as we know anything.

If I were Louie, I wouldn't give up until every last possibility was tried. And that may take a while.

tqft
06-07-2016, 04:48 PM
Just emailed a results.txt - close to 5 years worth

If I can find the others I will send them too

shifted
06-08-2016, 10:18 AM
I also believe double checks are worth it. Basically, if we have a 5% error rate, then it's even faster to find a prime if we can complete a double check in 1/20th the time of an initial check.
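Here is a back-of-the-envelope version of that 1/20th figure as a Python sketch. The model is an assumption on my part, not project data: a fresh candidate is prime with probability `p`, a test misses a real prime with probability `e`, errors are independent, and the double-checked candidate has the same prior `p` as the fresh one (in practice smaller candidates are both cheaper and more likely to be prime, which only helps the double check).

```python
import math


def first_check_yield(p, e, cost):
    """Expected primes found per unit of effort when testing a fresh candidate."""
    return p * (1 - e) / cost


def double_check_yield(p, e, cost):
    """Expected primes found per unit of effort when retesting a 'not prime' result."""
    # Probability the candidate is really prime given that the first test said no.
    posterior = p * e / (1 - p + p * e)
    return posterior * (1 - e) / cost


def break_even_cost_ratio(p, e):
    """Double-check cost, as a fraction of first-check cost, where both yields match."""
    return e / (1 - p + p * e)


p = 1e-6   # hypothetical chance that a single candidate is prime
e = 0.05   # the 5% error rate from the post

ratio = break_even_cost_ratio(p, e)
print(f"Break-even double-check cost: {ratio:.4f} of a first check")
print(math.isclose(first_check_yield(p, e, 1.0), double_check_yield(p, e, ratio)))
```

With a 5% error rate the break-even point lands at roughly 1/20th of the cost of a first check, so any double check cheaper than that finds primes faster per unit of effort than new first-pass work does.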

A recent change at GIMPS is to send out a double check assignment to everyone when they first join and once every year. This helps the project find bad computers quickly so their work can be double checked immediately. Sending out a double check at the beginning is also good in that new users get to finish something sooner.

AG5BPilot
07-26-2016, 09:54 AM
Nothing new to report.

Rest assured that someone (probably me, unless Louie jumps in) will let you know any information as soon as we know anything.

I did say I'd let you know as soon as I heard anything, so here it is. It's not good news, unfortunately.

There's no hope of recovering the data or software from the SoB server. It's gone. SoB is not coming back.

Louie has asked us to take over the entire SoB search. We intend to do so, but I can't tell you exactly what that means. For now, we are crunching all 6 Ks in the 31M < n < 32M range, and we'll continue with that until we decide how to move forward.

You are all, of course, welcome to come on over to PrimeGrid and help with SoB.

I'd like, at this point, to sincerely thank everyone who sent us their log files. Of all the information we've been able to gather from different sources, I suspect that your log files may end up being the most useful. They contain the most recent information, regarding the largest tasks. Those are most useful.

endless mike
07-26-2016, 11:46 PM
Thanks for the update, bad news is better than not knowing. I've been peeking at this forum and PrimeGrid's forum every few days to see if there's been any updates.

AG5BPilot
11-06-2016, 08:43 AM
Want some good news?

There's now 5 Ks remaining in the Sierpinski Problem.

Death
11-10-2016, 06:10 PM
Missed the whole thing and came here from the PG announcement.

Got mixed feelings, dammit.
Really miss sob.com ((((( fancy stats, once-a-year news ^_^, fffffff,...........

Death
06-06-2017, 03:51 PM
The domain will expire in a few days.

Can someone 1. renew it for a year, maybe? It would be a pity to lose such a valued domain.
2. set up a redirect from it to the appropriate PrimeGrid page?