Thread: Server down?

  1. #1
    Quote Originally Posted by jMcCranie View Post
    First, I have several log files that I will send after the last one finishes.

    Secondly, why is it necessary to redo all of the old ones? We know that all exponents below a certain limit (27,700,000) have been checked - I don't see a need to redo those.
    We don't know which ones were double checked -- and at PrimeGrid I've got a really good window into the quality -- or lack thereof -- of the computers used in distributed computing. In general, we no longer trust any results unless they're double checked. The problem with not immediately double checking results is that when a computer starts going bad, you have no way of detecting it. So any results that don't have matching residues from different computers are suspect. Unless we get really lucky, except for whatever we can get from log files, we have no residues at all on 4 of the 6 k's.

    Calculation errors are proportionally more likely to occur on larger candidates: even a fairly low but non-zero error rate gets many more chances to strike over a longer computation.

    Our position on double checking is especially rigid when it comes to conjectures like SoB. Consider a hypothetical k where the first prime is at n=100,000, and the second prime is at n=100,000,000. If you miss the first prime because of an undetected computation error, many years of unnecessary computing will be wasted searching for the second prime.

    It's actually not as horrible as it might seem at first glance. The vast majority of candidates are small and can be rechecked much faster than the original search.
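
    To put rough numbers on that hypothetical (a back-of-envelope sketch, not project data: it assumes the cost of one test grows roughly as n^2, which is approximately right for LLR-style tests, where the work is about n squarings of an n-bit number):

    Code:
        # Rough model (an assumption, not project data): testing k*2^n+1
        # costs about n^2 units of work -- roughly n squarings, each an
        # FFT multiply whose cost grows roughly linearly with n.
        def test_cost(n):
            return n ** 2

        # One test at the hypothetical second prime vs. the first:
        first, second = 100_000, 100_000_000
        print(test_cost(second) / test_cost(first))  # ~1e6 times the work

        # Rechecking everything below a frontier N is cheap next to
        # pushing the frontier: cumulative cost grows like N^3, so the
        # work already done is a sliver of the work still ahead.
        def total_cost(N, step=1_000):   # crude sum over candidates
            return sum(test_cost(n) for n in range(step, N + 1, step))
        print(total_cost(2_000_000) / total_cost(1_000_000))  # ~8x

    In other words, missing a cheap prime at n=100,000 and having to grind on toward n=100,000,000 dwarfs the cost of rechecking everything done so far.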

  2. #2
    Quote Originally Posted by AG5BPilot View Post
    We don't know which ones were double checked
    Unlike the Mersenne prime search, we only need to double check positive results; the time spent double-checking negatives is better spent testing new numbers. For Mersenne primes, we want a complete list. For 17-or-bust, we only need to find a prime for each coefficient. If we get a false negative, no harm is done as long as we find a prime for that coefficient.

  3. #3
    Quote Originally Posted by jMcCranie View Post
    Unlike the Mersenne prime search, we only need to double check positive results; the time spent double-checking negatives is better spent testing new numbers. For Mersenne primes, we want a complete list. For 17-or-bust, we only need to find a prime for each coefficient. If we get a false negative, no harm is done as long as we find a prime for that coefficient.
    A false negative would mean a missed prime. Potentially wasted years of computing would count as harm in my book. That's the main reason I gave up on SOB and went back to GIMPS.

  4. #4
    Quote Originally Posted by endless mike View Post
    A false negative would mean a missed prime. Potentially wasted years of computing would count as harm in my book. That's the main reason I gave up on SOB and went back to GIMPS.
    Yes, but the purpose of SOB is to try to prove the Sierpiński conjecture. Large primes are very rare and computation errors are rare, so a false negative that hides a prime is extremely rare. Double-checking in SOB cuts the throughput down by half; in other words, double-checking essentially doubles the expected computing that has to be done to prove the conjecture.

  5. #5
    Quote Originally Posted by jMcCranie View Post
    Yes, but the purpose of SOB is to try to prove the Sierpiński conjecture. Large primes are very rare and computation errors are rare, so a false negative that hides a prime is extremely rare. Double-checking in SOB cuts the throughput down by half; in other words, double-checking essentially doubles the expected computing that has to be done to prove the conjecture.
    Consider the highly unlikely possibility that there's only one prime for a given k. A false negative with no double checking means we crunch that k forever and never prove the conjecture. Unlikely to happen that way, but not impossible. People more in the know claim an error rate of about 4% (IIRC) on GIMPS. On the PrimeGrid message board, someone mentioned that a SOB work unit had to be sent out an average of 4.7 times to get a matching doublecheck. That post is three years old, but I can't imagine the situation is much different now. I still think double checking is valuable.

  6. #6
    Quote Originally Posted by endless mike View Post
    On the PrimeGrid message board, someone mentioned that a SOB work unit had to be sent out an average of 4.7 times to get a matching doublecheck. That post is three years old, but I can't imagine the situation is much different now. I still think double checking is valuable.
    While I'm solidly in the "double checking is a necessity" camp (and I'm one of the people making the decisions), let me correct that "4.7" statistic. While it's true that some of our sub-projects require a lot of tasks to be sent out in order to get two matching results, it's not because the results don't match. It's because most of the tasks either don't get returned at all (or at least not by the deadline), or have some sort of error that prevents the result from completing. Bad residues are more common than I'd like, but they're not THAT common. Here's some hard data on our SoB tasks currently in the database:

    SoB:
    Completed workunits: 1089
    Average number of tasks per WU (2 matching tasks are required, and 2 are sent out initially): 3.7805 tasks per workunit (4117 tasks total)
    Number of tasks successfully returned but eventually proven to be incorrect: 61

    As you can see, about 6% of the workunits had tasks that looked like they returned a correct result but in fact didn't. These are SoB tasks -- the same ones you run here. We use LLR, but it uses the same gwnum library as you do here, so the error rates are going to be comparable. LLR has lots of internal consistency checks, so many computation errors are caught and never even returned to us. Those 61 are just the ones that slipped through all the checks and made it to the end.
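
    For what it's worth, here's that arithmetic spelled out. Note the per-task error rate is lower than the per-workunit figure, since each WU involves several tasks (and the first number assumes at most one bad task per WU, so it's really an upper bound):

    Code:
        # Figures from this post:
        workunits = 1089
        tasks     = 4117   # ~3.78 tasks per workunit
        bad_tasks = 61     # returned "successfully" but later proven wrong

        print(bad_tasks / workunits)  # ~0.056 -> the ~6% of WUs quoted above
        print(bad_tasks / tasks)      # ~0.015 -> ~1.5% error rate per task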

    At PrimeGrid we detect the errors, so the user gets an immediate indication that something's wrong. On projects that don't double check, the users never know there's a problem, so the error rate might be higher.

    The numbers are worse on GPU calculations. It's much harder to get GPUs to work at all, resulting in many tasks which fail immediately. On our GFN (n=22) tasks, which are GIMPS-sized numbers:

    GFN-22:
    Completed WUs: 2217
    Tasks: 17996 (about 8 tasks per WU)
    Completed but incorrect tasks: 85 (about 4%)

    Some of those tasks are CPU tasks, but the vast majority are GPU tasks.

    So there's your hard data: On the long tasks (SoB on CPU, GFN-22 on GPU), about 6% of workunits had seemingly correct results from CPUs which turned out to be wrong, and about 4% of the workunits had GPU tasks which were wrong.

    (Frankly I'm surprised that the CPU error rate is higher than the GPU error rate.)
    Last edited by AG5BPilot; 05-21-2016 at 01:51 PM.

  7. #7
    Senior Member engracio · Joined Jun 2004 · Illinois · 237 posts
    A while back we ran a double check up to 30M. After most work units were returned, k=22699 had only one WU left to be submitted. The higher k's each had fewer than 10 WUs left to return. Unfortunately, I'm not sure whether those results were matched against the previous results. Mike, any idea?

  8. #8
    Quote Originally Posted by AG5BPilot View Post

    SoB:
    Completed workunits: 1089
    What is a "workunit"?

  9. #9
    Quote Originally Posted by endless mike View Post
    Consider the highly unlikely possibility that there's only one prime for a given k. A false negative with no double checking means we crunch that k forever and never prove the conjecture. Unlikely to happen that way, but not impossible...
    OK, so how about: no double checking, so that the conjecture gets resolved nearly twice as quickly, but if and when it gets down to only one k with unknown status, run a double check on that k.

    ----Added----

    I'll make an analogy. Suppose that there are a large number of boxes. A small number of boxes contain a diamond and you want to find diamonds. The first time you look in a specific box, if it contains a diamond, there is a 5% chance that you will not see it.

    Should you (1) spend half of your time double-checking boxes you have already opened, or (2) open as many boxes as you can? I would open as many boxes as I can.
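
    To make the analogy concrete, here's a minimal sketch of the expected yield of the two strategies (hypothetical numbers throughout: the 5% per-look miss rate from above, independent looks, and a made-up diamond density):

    Code:
        # Expected diamonds found for a fixed budget of box-openings.
        miss   = 0.05     # per-look chance of overlooking a diamond
        p      = 0.01     # hypothetical fraction of boxes with a diamond
        budget = 10_000   # total box-openings available

        # Strategy 1: open budget/2 distinct boxes, looking at each twice.
        found_double = (budget / 2) * p * (1 - miss ** 2)  # ~49.9

        # Strategy 2: open budget distinct boxes, looking at each once.
        found_single = budget * p * (1 - miss)             # ~95.0

        print(found_single / found_double)                 # ~1.9x

    By raw expected yield, single checking wins by nearly a factor of two, which is the point here; the dispute upthread is over what the rare overlooked diamond ends up costing.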
    Last edited by jMcCranie; 05-22-2016 at 08:37 PM.

  10. #10
    Moderator Joe O · Joined Jul 2002 · West Milford, NJ · 643 posts
    Quote Originally Posted by jMcCranie View Post
    OK, so how about: no double checking, so that the conjecture gets resolved nearly twice as quickly, but if and when it gets down to only one k with unknown status, run a double check on that k.

    ----Added----

    I'll make an analogy. Suppose that there are a large number of boxes. A small number of boxes contain a diamond and you want to find diamonds. The first time you look in a specific box, if it contains a diamond, there is a 5% chance that you will not see it.

    Should you (1) spend half of your time double-checking boxes you have already opened, or (2) open as many boxes as you can? I would open as many boxes as I can.
    It is important to note that the boxes are numbered, and
    1) The lower numbered boxes are more likely to contain a diamond than the higher numbered boxes.
    2) The higher numbered boxes are harder to open than the lower numbered boxes.
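
    Extending the sketch above with those two caveats (made-up numbers again, with opening cost rising roughly as the square of the box number), the overlooked low-numbered diamond is exactly the expensive case:

    Code:
        # Joe O's caveats: lower-numbered boxes are likelier to hold a
        # diamond, and higher-numbered boxes are harder to open. Model
        # the second with a cost that grows as the square of the number.
        def open_cost(n):
            return n ** 2

        # Suppose the only diamond below box 1000 sits in box 50 and the
        # single look misses it: you open boxes 51..1000 in vain, versus
        # the one cheap recheck that would have caught it.
        wasted  = sum(open_cost(n) for n in range(51, 1001))
        recheck = open_cost(50)
        print(round(wasted / recheck))  # ~133,000x the cost of a recheck
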
    Joe O

  11. #11
    Is there any word on the SeventeenOrBust project?
