Server down?

**S3NTYN3L** · 04-21-2016, 10:06 AM

I'm posting here in the hope that someone might be able to contact an SoB project/site admin, in case they're not aware of the issue.
The project/site server seems to be down and has been for the past 48 hours.
That said, I'm unable to send any results nor receive new assignments.

**engracio** · 04-22-2016, 09:44 AM

Did somebody forget pay the domain name renewal?? I think it due this year.

**AG5BPilot** · 04-23-2016, 01:17 PM

Hello,

This is Michael Goetz, admin over at PrimeGrid. As you're likely aware, we've been collaborating with the Seventeen or Bust project for the last six years.

I'll let Louie fill you in with the details, but he's in the process of restoring/moving the server. I get the impression that this is not something that was planned. No ETA yet.

Anyone suffering from severe SoB-withdrawal symptoms is, of course, welcome to fill the downtime by crunching SoB tasks over at PrimeGrid. Different software, but the same project. Please do return here when SoB is back online, however.

Hopefully Louie will get everything back up and running soon.

Mike

**S3NTYN3L** · 04-23-2016, 11:32 PM

Originally Posted by AG5BPilot

Different software, but the same project.

Which would mean that we'd still need to connect to the project server for assignments, yes?
As that server is down, we'd be in the same situation we are now.

**AG5BPilot** · 04-24-2016, 07:54 AM

Originally Posted by S3NTYN3L

Which would mean that we'd still need to connect to the project server for assignments, yes?
As that server is down, we'd be in the same situation we are now.

No. PrimeGrid's servers work completely independently of Seventeen or Bust's servers. All coordination between the projects is done on a purely human level between the admins, not electronically.

Both servers have different lists of candidates they're testing, so we don't duplicate the same work. But we're both working on solving the Sierpinski problem. At the moment we're sending out work for k=10223 and k=67607 with 31M<n<32M.

**S3NTYN3L** · 04-24-2016, 12:27 PM

Originally Posted by AG5BPilot

Both servers have different lists of candidates they're testing, so we don't duplicate the same work. But we're both working on solving the Sierpinski problem. At the moment we're sending out work for k=10223 and k=67607 with 31M<n<32M.

Understood.
Any work assigned from one site wouldn't be credited on the other.

Sorry, but that is a problem for me.
(I've OCD issues).

**engracio** · 04-25-2016, 09:48 AM

Good to know. Tanks for the update Mike.

**engracio** · 04-29-2016, 03:22 PM

Any update???

**AG5BPilot** · 04-30-2016, 07:39 AM

Originally Posted by engracio

Any update???

I have no additional information.

**AG5BPilot** · 05-08-2016, 09:22 PM

We just heard from Louie and the news is not good. It looks like the old server, and everything on it, is toast. From what I understand the software that runs the server was lost too. There were backups, but unfortunately the entire datacenter was lost, taking the backups with it.

No decision has been made about what happens next -- in fact I just got this information and there's been no discussion yet about how or if to move forward.

We will keep you informed as new information becomes available.

For the sake of those wondering what they should do, I'll give you a guess as to what I think will happen. Everything from this point on is pure speculation on my part and may very well be incorrect. Take the rest of this message with a huge grain of salt.

It sounds like none of the software or data can be recovered from the datacenter, and if Louie had off-site backups somewhere he would have already found them. Everything on the SoB server is probably gone forever. The records of which numbers were tested, the residues, the user records, everything. Louie has indicated that he probably won't be relaunching SoB unless he's able to recover at least the software.

The only information that still exists is what PrimeGrid has. So what does PrimeGrid have? Well, for one thing, we have almost all of the sieve data, so that's not lost. Plus, we have complete records of all the work done on PrimeGrid. We also have complete records of all the work that was done on SoB for the two k's that were permanently transferred to PrimeGrid last year. Everything on the other 4 k's might be lost, however.

If SoB goes under, PrimeGrid will continue to work on our 2 k's. I suspect we'll pick up the other 4 k's as well if SoB can't be restarted.

S3NTYN3L, unfortunately, it looks like there's no way to recover any of the user data, so all credits and other information relating to SoB appear to be gone for good.

**tim** · 05-08-2016, 11:15 PM

What the actual ****ing hell?

My log files can be used to re-create some of the data submitted lately.

**AG5BPilot** · 05-09-2016, 03:55 AM

Originally Posted by tim

What the actual ****ing hell?

My log files can be used to re-create some of the data submitted lately.

Log files would be helpful. I should have thought to mention that.

If Louie and/or PrimeGrid need to try to put the pieces back together again from scratch, every little bit will help.

**engracio** · 05-09-2016, 10:43 AM

I have all of my residue for at least 5 years. Also we have done double check all the way up to 30Mil just saying.

**AG5BPilot** · 05-10-2016, 08:27 AM

Originally Posted by engracio

I have all of my residue for at least 5 years. Also we have done double check all the way up to 30Mil just saying.

Feel free to send any logs that you have to me at mgoetz [at] primegrid [dot] com. I'll make sure Louie gets those, along with anything we have that can help. If SoB doesn't get going again, then those would also be helpful for restarting the full SoB project at PrimeGrid if that's the direction we go.

**engracio** · 05-10-2016, 12:16 PM

Originally Posted by AG5BPilot

Feel free to send any logs that you have to me at moetz [at] primegrid [dot] com. I'll make sure Louie gets those, along with anything we have that can help. If SoB doesn't get going again, then those would also be helpful for restarting the full SoB project at PrimeGrid if that's the direction we go.

Thanks Mike I will. Still have a few wu to finish, I will sent it as soon as it complete. I wanted to make sure the wu i had remaining were done. About a week or so.

**tim** · 05-11-2016, 12:22 AM

I'll gather my log files too, when current work units are done. Kills me to think of the wasted clock cycles.

**S3NTYN3L** · 05-11-2016, 10:36 AM

Wow.
Fourteen years of work gone in an instant.

What an extreme example of the need for OFF SITE BACKUPS.

Can the wayback machine possibly help to get some of the code back?

I'll allow my current assignments to finish. What are you wanting when they're done? My results.txt file? Something else?

Regardless, I'll find some other diversion for my OCD.
This is just too painful of a mind**** to process at the moment...

**shifted** · 05-11-2016, 12:03 PM

I don't keep all my logs, but I'll see what I have. I was one of the top contributors.

**shifted** · 05-11-2016, 03:25 PM

Originally Posted by AG5BPilot

Feel free to send any logs that you have to me at moetz [at] primegrid [dot] com. I'll make sure Louie gets those, along with anything we have that can help. If SoB doesn't get going again, then those would also be helpful for restarting the full SoB project at PrimeGrid if that's the direction we go.

Should be mgoetz [at] primegrid [dot] com

**AG5BPilot** · 05-11-2016, 04:12 PM

Originally Posted by shifted

Should be mgoetz [at] primegrid [dot] com

::babbles incoherently::

Yup, spelled my own name wrong.

I've corrected the original post; thanks.

**engracio** · 05-18-2016, 11:03 AM

Mike please check your mail. Thanks

**AG5BPilot** · 05-18-2016, 04:46 PM

Got it earlier today, thanks.

**endless mike** · 05-19-2016, 05:40 AM

Are you still looking for log files? I've got a few from just over a year ago, don't know how helpful they would be. Let me know if you could use them, and I'll do some digging to get them you.

**AG5BPilot** · 05-19-2016, 11:17 AM

Originally Posted by endless mike

Are you still looking for log files? I've got a few from just over a year ago, don't know how helpful they would be. Let me know if you could use them, and I'll do some digging to get them you.

Yes, and yes. ALL log files will be useful. If everything was indeed lost, all tests for which there are no records will need to be redone. So every result from a log file is one less result that will eventually have to be repeated.

**jMcCranie** · 05-19-2016, 03:44 PM

Originally Posted by AG5BPilot

Yes, and yes. ALL log files will be useful. If everything was indeed lost, all tests for which there are no records will need to be redone. So every result from a log file is one less result that will eventually have to be repeated.

First, I have several log files that I will send after the last one finishes.

Secondly, why is it necessary to redo all of the old ones? We know that all exponents below a certain limit (27,700,000) have been checked - I don't see a need to redo those.

**AG5BPilot** · 05-19-2016, 05:16 PM

Originally Posted by jMcCranie

First, I have several log files that I will send after the last one finishes.

Secondly, why is it necessary to redo all of the old ones? We know that all exponents below a certain limit (27,700,000) have been checked - I don't see a need to redo those.

We don't know which ones were double checked -- and at PrimeGrid I've got a really good window into the quality -- or lack thereof -- of the computers used in distributed computing. In general, we no longer trust any results unless they're double checked. The problem with not immediately double checking results is that when a computer starts going bad, you have no way of detecting it. So any results that don't have matching residues from different computers are suspect. Unless we get really lucky, except for whatever we can get from log files, we have no residues at all on 4 of the 6 k's.

Calculation errors are proportionally more likely to occur on larger candidates, especially when the error rate is fairly low, but non-zero.

Our position on double checking is especially rigid when it comes to conjectures like SoB. Consider a hypothetical k where the first prime is at n=100,000, and the second prime is at n=100,000,000. If you miss the first prime because of an undetected computation error, many years of unnecessary computing will be wasted searching for the second prime.

It's actually not as horrible as it might seem at first glance. The vast majority of candidates are small and can be rechecked much faster than the original search.

**endless mike** · 05-19-2016, 06:21 PM

Originally Posted by AG5BPilot

Yes, and yes. ALL log files will be useful. If everything was indeed lost, all tests for which there are no records will need to be redone. So every result from a log file is one less result that will eventually have to be repeated.

I will get my results files to you; might take me until Monday though. Real life responsibilities come first.

**AG5BPilot** · 05-19-2016, 07:41 PM

Originally Posted by endless mike

I will get my results files to you; might take me until Monday though. Real life responsibilities come first.

Thank you! There's no need to feel rushed. Louie can, of course, take as much time as he needs, and we're not going to rush him. We're not going to take any action until he determines if restarting SoB is possible.

**jMcCranie** · 05-19-2016, 08:54 PM

Originally Posted by AG5BPilot

We don't know which ones were double checked

Unlike the Mersenne prime search, we only need to double check positive results. The time double-checking is better spent checking new numbers. For Mersenne primes, we want a complete list. For 17-or-bust, we only need to find a prime for each coefficient. If we get a false negative, no harm is done if we find a prime for that coefficient.

**endless mike** · 05-20-2016, 03:03 AM

Originally Posted by jMcCranie

Unlike the Mersenne prime search, we only need to double check positive results. The time double-checking is better spent checking new numbers. For Mersenne primes, we want a complete list. For 17-or-bust, we only need to find a prime for each coefficient. If we get a false negative, no harm is done if we find a prime for that coefficient.

A false negative would mean a missed prime. Potentially wasted years of computing would count as harm in my book. That's the main reason I gave up on SOB and went back to GIMPS.

Originally Posted by AG5BPilot

Consider a hypothetical k where the first prime is at n=100,000, and the second prime is at n=100,000,000. If you miss the first prime because of an undetected computation error, many years of unnecessary computing will be wasted searching for the second prime.

**S3NTYN3L** · 05-20-2016, 04:54 AM

Originally Posted by S3NTYN3L

Wow.
Fourteen years of work gone in an instant.

What an extreme example of the need for OFF SITE BACKUPS.

Can the wayback machine possibly help to get some of the code back?

I'll allow my current assignments to finish. What are you wanting when they're done? My results.txt file? Something else?

Regardless, I'll find some other diversion for my OCD.
This is just too painful of a mind**** to process at the moment...

Just because it seems to have been missed due to moderation...

**AG5BPilot** · 05-20-2016, 08:26 AM

Originally Posted by S3NTYN3L

Wow.
Fourteen years of work gone in an instant.

What an extreme example of the need for OFF SITE BACKUPS.

Can the wayback machine possibly help to get some of the code back?

I'll allow my current assignments to finish. What are you wanting when they're done? My results.txt file? Something else?

Regardless, I'll find some other diversion for my OCD.
This is just too painful of a mind**** to process at the moment...

Originally Posted by S3NTYN3L

Just because it seems to have been missed due to moderation...

I did miss that somehow. Thanks for bringing it up again.

Louie's the one who would be interested in code, so the Wayback machine might possibly be of use to him.

As for off site backups, you've got that right. Stuff happens. I had a 110-story building fall on top of one of my data centers once, Not only did we not lose data, but our business operations didn't miss a beat. Both Jim and myself are rather obsessive about making sure the data is backed up, including daily copies to an off-site location. In the event of a real disaster, PrimeGrid may be down for as much as 24 hours, but at worst we'll lose the last day's results. If you do this stuff long enough, you see this kind of thing happen over and over. If you do it for a living, you're expected to make sure that you're protected against this kind of event because it's a matter of when, not if. The problem is that when you're running things on a shoestring budget that's typical of an operation run as a hobby or a club, it can be really hard to put together the resources just to keep the operation running, let alone to have adequate assets in place to handle disaster recovery scenarios. So let's not judge the powers that be too harshly because it's difficult enough just to keep something like this running.

For me personally, I use a backup system that lets me back up my entire family's computers to the cloud.

As for which files, what I'm interested in is anything that lists test results, meaning the number being tested and the residue produced by the program. I have no idea what Louie might want.

Finally, yes, this is a huge blow. No way to sugarcoat this; it's going to set back this search quite a lot if the data can't be recovered. The only good news in this is that because we decided to split the project between SoB and PG by k's, at least we have all the data for 2 of the 6 remaining k's. Only two thirds of the work is lost rather than all of it.

**S3NTYN3L** · 05-20-2016, 09:44 AM

Originally Posted by AG5BPilot

So let's not judge the powers that be too harshly because it's difficult enough just to keep something like this running.

I'm not judging anyone.
It was simply a statement alluding to my shock and sorrow regarding all the years of work lost.

Hindsight is always 20/20, I know. Lesson learned.

Please DO let Louie know about the Wayback Machine!

It might take a bit of source code browsing but, even with my limited skills, I'm able to get into some of the back-end directory structure, so...

Mike, you never answered my questions.
Would you like me to send you my results.txt file?
Is that the only file you'll need?

**AG5BPilot** · 05-20-2016, 12:33 PM

Originally Posted by S3NTYN3L

I'm not judging anyone.
It was simply a statement alluding to my shock and sorrow regarding all the years of work lost.

Hindsight is always 20/20, I know. Lesson learned.

Sorry, I wasn't clear! I was stating that MY comments were not intended to be judgemental. I certainly did not mean to imply that you were being judgemental. Everyone's a little shell shocked right now.

Please DO let Louie know about the Wayback Machine!

It might take a bit of source code browsing but, even with my limited skills, I'm able to get into some of the back-end directory structure, so...

Will do.

Mike, you never answered my questions.
Would you like me to send you my results.txt file?
Is that the only file you'll need?

Short answer: Yes! Longer answer: I'm not familiar with SoB's client files, so I don't know. Other people have sent that file, and it seems to have what we would need. That being said, I'd rather have too much information than too little.

**jMcCranie** · 05-20-2016, 09:30 PM

Originally Posted by endless mike

A false negative would mean a missed prime. Potentially wasted years of computing would count as harm in my book. That's the main reason I gave up on SOB and went back to GIMPS.

Yes, but the purpose of SOB is to try to prove the Sierpenski conjecture. Large primes are very rare and a false positive is extremely rare. Double-checking in SOB cuts the throughput down by half. In other words, double-checking essentially doubles the expected computing that has to be done to prove the conjecture.

**endless mike** · 05-20-2016, 11:22 PM

Originally Posted by jMcCranie

Yes, but the purpose of SOB is to try to prove the Sierpenski conjecture. Large primes are very rare and a false positive is extremely rare. Double-checking in SOB cuts the throughput down by half. In other words, double-checking essentially doubles the expected computing that has to be done to prove the conjecture.

Consider the highly unlikely possibility that there's only one prime for a given k. A false negative with no double checking means we crunch that k forever and never prove the conjecture. Unlikely to happen that way, but not impossible. People more in the know claim an error rate of about 4% (IIRC) on GIMPS. On the PrimeGrid message board, someone mentioned that a SOB work unit had to be sent out on average of 4.7 times to get a matching doublecheck. That post is three years old, but I can't image the situation is much different now. I still think double checking is valuable.

**AG5BPilot** · 05-21-2016, 07:06 AM

Originally Posted by endless mike

On the PrimeGrid message board, someone mentioned that a SOB work unit had to be sent out on average of 4.7 times to get a matching doublecheck. That post is three years old, but I can't image the situation is much different now. I still think double checking is valuable.

While I'm solidly in the "double checking is a necessity camp" (and I'm one of the people making the decisions), let me correct that "4.7" statistic. While it's true that some of our sub-projects require a lot of tasks to be sent out in order to get two matching results, it's not because the results don't match. It's because most of the tasks either don't get returned at all (or at least not by the deadline), or have some sort of error that prevents the result from completing. Bad residues are more common than I'd like, but they're not THAT common. Here's some hard data on our SoB tasks currently in the database:

SoB:
Completed workunits: 1089
Average number of tasks per WU (2 matching tasks are required, and 2 are sent out initially): 3.7805 tasks per workunit (4117 tasks total)
Number of tasks successfully returned but eventually proven to be incorrect: 61

As you can see, about 6% of the workunits had tasks that looked like they returned a correct result, but in fact didn't. These are SoB tasks -- the same as you run here. We use LLR, but it uses the same gwnum library as you do here, so the error rates are going to be comparable. LLR has lots of internal consistency checks, so many computation errors are caught and not even returned to us. That's just the ones that slipped through all the checks and made it to the end.

At PrimeGrid we detect the errors, so the user gets an immediate indication that's something's wrong. On projects that don't double check, the users never know there's a problem, so the error rate might be higher.

The numbers are worse on GPU calculations. It's much harder to get GPUs to work at all, resulting in many tasks which fail immediately. On our GFN (n=22) tasks, which are GIMPS-sized numbers:

GFN-22:
Completed WUs: 2217
Tasks: 17996 (about 8 tasks per WU)
Completed but incorrect tasks: 85 (about 4%)

Some of those tasks are CPU tasks, but the vast majority are GPU tasks.

So there's your hard data: On the long tasks (SoB on CPU, GFN-22 on GPU), about 6% of workunits had seemingly correct results from CPUs which turned out to be wrong, and about 4% of the workunits had GPU tasks which were wrong.

(Frankly I'm surprised that the CPU error rate is higher than the GPU error rate.)

**engracio** · 05-21-2016, 04:18 PM

A while back we ran a double check up to 30M. After most wu units were returned 22699K only had 1 wu needed to be submitted. The higher k had less than 10 wu each to be return, Unfortunately I am not sure if the results were matched with the previous results. Mike any Idea??

**AG5BPilot** · 05-21-2016, 04:37 PM

Originally Posted by engracio

A while back we ran a double check up to 30M. After most wu units were returned 22699K only had 1 wu needed to be submitted. The higher k had less than 10 wu each to be return, Unfortunately I am not sure if the results were matched with the previous results. Mike any Idea??

None at all.

**jMcCranie** · 05-21-2016, 09:32 PM

Originally Posted by endless mike

Consider the highly unlikely possibility that there's only one prime for a given k. A false negative with no double checking means we crunch that k forever and never prove the conjecture. Unlikely to happen that way, but not impossible...

OK, so how about: no double checking to try to resolve the conjecture nearly twice as quickly, but if and when it gets down to only one k with unknown status, run a double check on those.

----Added----

I'll make an analogy. Suppose that there are a large number of boxes. A small number of boxes contain a diamond and you want to find diamonds. The first time you look in a specific box, if it contains a diamond, there is a 5% chance that you will not see it.

Should you (1) spend half of your time double-checking boxes you have already opened, or (2) open as many boxes as you can? I would open as many boxes as I can.

Thread: Server down?

Thread Tools

Rate This Thread

Display

Server down?

Posting Permissions