
View Full Version : Supersecret done to 800K soon



vjs
05-31-2004, 04:25 PM
Information for those who don't know, please correct the following if there are any errors:

Supersecret is a teamless username which double-checks all values of n for the remaining k's below 800K. It's important to note that the chances of finding a prime here are next to zero, if not zero. These numbers have already been tested by SoB and by another effort; the other effort didn't record the residues, but the SoB effort did. So these are really triple checks.

The original goal of supersecret was to recheck all n<800K for all k's without primes, to see if something was missed and, more importantly, to find an error rate.

By my calculations, supersecret should be done testing all values of n<800K in about two weeks' time.

Question: what direction will supersecret take?

Once every k reaches 800K for supersecret which k/n's will this account pull?

Is it useful to move the 800K limit to 1.5M?

Have supersecret start testing between 800K and 2M (sort of the same as above)?

Test the n values for k's with known primes to make sure we have found the smallest prime? (I think this has some other merit I read about.)

Pick one k, say 4847 or 69109 (the most desirable, or the one with the fewest tests left), and test it between n=7M and 15M, for example?

Assign the smallest expired tests to this account?

Just curious when we will know the error rate.

Troodon
05-31-2004, 04:51 PM
http://www.free-dc.org/forum/showthread.php?s=&postid=50555#post50555

vjs
05-31-2004, 06:03 PM
- prothsearch.net tested all candidates with n<800K but didn't save the residues
- supersecret will complete all n<800K, and an error rate will be derived from this data since we are saving residues

Based on this error rate we will evaluate the need for double checking.

http://www.seventeenorbust.com/secret/

- supersecret gets the smallest pseudo-triple-check tests, and these will soon reach n=800K; the remaining-test count is the number left before every k reaches n=800K
- secret is also doing double checks from n=2M to 3M; its remaining-test count is the number of tests required to double-check all values between n=2M and 3M

So what will happen??

According to the above once supersecret reaches n=800k it will continue on with n>800K although the stats page will not display the number of remaining tests correctly.

Will we simply move the n value to a larger n?

When will we know the error rate?

Keroberts1
05-31-2004, 06:21 PM
As far as I can tell there haven't been any errors yet. This is because with tests this small the chance of an error occurring is extremely small. However, as the tests get larger the error rate grows, and the reason we are testing in the 2-3 million range is because that is where we predicted we might start finding errors. Do any of the admins know if there have been any mismatched residues in this range yet? I believe someone had said there was one, but it was a suspicious-looking test because of the amount of time required for it to be returned the first time. Was a third check done immediately? Are there more suspicious-looking tests out there?

vjs
05-31-2004, 06:48 PM
Ken,

Isn't comparing the residues the whole idea behind the error rate? We don't have to miss a prime and then find it with a double check to have an error.

Yes, there was a suspicious-looking test because of the amount of time it took to complete. It has been tested at least three times: the original test, a retest by someone on the board who had commented on it, and I ran it myself as well and posted the result. I think it was near one of those boundaries we always hear about, but not a prime.

You're right that the error rate will have to grow as n increases, but only because the time required to process each n also increases. A decent error rate can be calculated from the supersecret residues and then projected to higher n. For example, if n were an extremely large number (infinite) and we had any error rate greater than zero, we could never calculate the residue correctly.

Keroberts1
05-31-2004, 09:23 PM
I understand the idea of using residues to confirm compositeness. What I was trying to say is that we have not gotten any non-matching residues, which would imply the tests had returned different results (most likely both composite). Also, I thought no residue was produced if the number was prime? I don't fully understand this part of the formula. I haven't found any primes yet.

vjs
06-01-2004, 09:44 AM
ken,

I'm pretty sure you understand the composites as well as I do, and as far as I know there would be no residue if it were prime. In addition, the result would be stated as a result 3... I'm assuming a prime would be result 1 or 2 in the log file???

I also thought they did a quick check for residue differences and found 1 in a large number of tests, making the error rate low but noticeable.

I'm very curious about the error rate for the super secret tests.

kugano
06-01-2004, 12:28 PM
I just ran some quick-and-dirty queries about the error rate.

There are 24,030 k/n pairs with n less than 10^6 and for which we have double-checks. Of these, mismatched residues had been reported for 105. This gives a crude error rate of about 0.4%.

HOWEVER, I really do mean that's a "crude" error rate. There are other factors that aren't represented well in a simple query like this. I'm sure Louie can give you a much better description of why, but here are a few basic reasons:

- As was already pointed out, tests of higher n take longer. Longer tests are more likely to exhibit errors, and the numbers here are only for n less than 10^6.

- Some of our "early" residue data is less than 100% reliable. Previous client versions had problems reporting residues, and much of this data has been reconstructed by hand. Also, a lot of this early residue data was actually provided by external (non-SB) contributors who had the data lying around from their own searches.

In the past Louie and I have done queries of the entire n range (not just below 10^6). Some of these queries gave much higher error rates, anywhere from five to ten percent. So clearly some more research needs to be done to arrive at a reliable number.

vjs
06-01-2004, 03:09 PM
Then is it worth making supersecret double-check all k's, even those with known primes? Or creating another account, "prime", to double-check those n's lower than the prime to verify residues?

It's my understanding that this should be done to make sure our k/n pair is the smallest prime for that k.

Were the 105 mismatched residues re-checked?

wblipp
06-01-2004, 04:00 PM
Originally posted by kugano
Longer tests are more likely to exhibit errors, and the numbers here are only for n less than 10^6.
...
In the past Louie and I have done queries of the entire n range (not just below 10^6). Some of these queries gave much higher error rates

There are two leading theories about errors. The "cosmic ray" theory says they are random events that happen uniformly through time. An "n" value that is twice as large takes 4-5 times as long, so it should have 4-5 times the error rate. The "faulty machine" theory says that errors are the results of marginal machines - perhaps too aggressively overclocked or poor quality components that are not up to the continuous 100% busy of distributed computing. GIMPS has found that many of their errors are from faulty machines. It was once common for overclockers to use the highest setting that would pass the Prime95 Torture Test - it turns out that working on that edge was not reliable.

It would be interesting to know if SoB errors show either pattern.
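To put the "cosmic ray" arithmetic in symbols (my own sketch, not an established result): if random errors arrive at a constant rate $\lambda$ per unit of CPU time, a test that runs for time $t(n)$ fails with probability

$P(\text{error}) = 1 - e^{-\lambda\, t(n)} \approx \lambda\, t(n)$ for small $\lambda\, t(n)$,

so under that theory the error rate simply tracks the running time: doubling n, which multiplies t(n) by 4-5, should multiply the error rate by 4-5 as well. The "faulty machine" theory instead predicts a small set of hosts producing bad results at every n.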

Joe O
06-01-2004, 04:34 PM
Originally posted by vjs
Then is it worth making supersecret double-check all k's, even those with known primes? Or creating another account, "prime", to double-check those n's lower than the prime to verify residues?

It's my understanding that this should be done to make sure our k/n pair is the smallest prime for that k.

Were the 105 mismatched residues re-checked?

Louie once stated in this forum that he was not interested in proving that our primes were the smallest primes for each k; it was enough that we had found a prime for that k. I agree with him, as I'm sure do many others. Louie and Dan have been very careful, and as a result I believe that the primes we have found will eventually prove to be the lowest primes for that k. This is a very big project, and there is enough to do without inventing more work. Let's find 11 more primes first.

kugano
06-01-2004, 05:38 PM
Here are some more detailed error-rate calculations.

                    total   bad
n range             tests   tests   error
----------------    -----   -----   -----
(      0, 1*10^6]   48097    108    0.2 %
(1*10^6, 2*10^6]     1703    123    7.2 %
(2*10^6, 3*10^6]     2373     78    3.3 %
(3*10^6, 4*10^6]     2634    172    6.5 %
(4*10^6, 5*10^6]     2571    149    5.8 %
(5*10^6, 6*10^6]      918     54    5.9 %
(6*10^6, 7*10^6]       58      3    5.2 %
----------------    -----   -----   -----
cumulative          58354    687    1.2 %

Here's how my code decides whether a test is valid or "bad" (incorrect results): Look at all tests for a specific k/n pair. If there is only one test that produced a residue for that k/n, ignore it. Otherwise suppose we have N residue-producing tests. Now choose M, the largest number of tests that share a common residue. Assume that M of the N tests produced the correct residue, and the other N - M (if nonzero) didn't. Therefore we add N to the total number of tests and N - M to the number of tests that exhibited errors. Examples:
             correct   incorrect
residues     tests     tests
X, X            2          0
X, Y            1          1
X, X, X         3          0
X, X, Y         2          1
X, Y, Z         1          2

Nuri
06-01-2004, 06:02 PM
Interesting stats (though higher than I would expect).

I guess it would be fair to assume:

For most of the double-residue k/n pairs between 1M and 2M and for n>3M, there is more than one residue mostly because some users returned their tests very late (so late that a second test was assigned). If this is the case, I think that in most cases the faulty residue is the one that lagged behind.

On the other hand, for the 2M to 3M range, secret has tested roughly 450 k/n pairs (=15500*29500/1000000). So, roughly speaking, for 900 of the double residues, the first one is from a first-time test and the second one is from secret.


Is it possible to check this? (i.e. the error rate for tests where one of the users is secret and n>2m)

Or, are you keeping track of how long it took for each test to be returned? If so, is it possible to pull out the pairs where at least one of the tests was returned more than, say, 30 (60? 90?) days after it was assigned?


PS: On second thought, I decided to throw in some cycles (for the next few weeks) for secret to speed up data gathering on error rates.

kugano
06-02-2004, 11:12 AM
I wouldn't worry too much. I've been poking around more... as might be expected, certain users are more likely to return "suspect" tests than others. (By "suspect" test I mean a test with a residue that conflicts with that from another test of the same k/n.) It's even more pronounced with IP addresses... a couple dozen IP addresses are responsible for about 50% of all suspect tests.

I want to generate a curve of error rate vs. n size... but I think we need more data points (by raising the double-check point)...

A breakdown of suspect tests by version is also sort of interesting... (maybe I'll separate them out by O/S soon, but for now they're aggregate):
                             error     usage
version  # suspect tests     percent   percent
-------  -----------------   -------   -------
0.9.7      11 suspect tests    0.81 %    1.96 %
0.9.9       7 suspect tests    0.51 %    0.68 %
1.0.0     512 suspect tests   37.51 %   45.58 %
1.0.2      70 suspect tests    5.13 %    9.46 %
1.1.0     376 suspect tests   27.55 %   19.21 %
1.1.1      36 suspect tests    2.64 %    2.46 %
1.2.0     122 suspect tests    8.94 %    7.00 %
1.2.1       1 suspect test     0.07 %    0.02 %
1.2.2      20 suspect tests    1.47 %    0.22 %
1.2.3       1 suspect test     0.07 %    0.06 %
1.2.5      35 suspect tests    2.56 %    4.51 %
(The "usage percent" is the percentage of all tests that were reported by clients of that version...) I think I'm happy to see that the error rates correspond almost exactly to usage percent. It means no one version was measurably "worse" than any of the others... which is the way it should be =)

vjs
06-02-2004, 11:39 AM
Is it possible to have a webpage of suspect IP's?

Follow this up with:
- corrupt tests per total tests completed by that particular IP
- a follow-up e-mail, preferably automated

Example

Dear (user name)

It has come to our attention that your machine with IP address (XXX.XXX.XXX.XXX) is submitting faulty data X% of the time. We appreciate your time and dedication; could you please investigate this computer? It may be having issues with overheating, faulty memory, or a faulty CPU, which could potentially hinder the project rather than help it.

Thank you for your contributions,

SOB Team

Please visit norealpage.www.seventeenofbust.com/suspectIPS.htm

Or take action on the server's part for that particular IP and assign double checks?

Mystwalker
06-02-2004, 12:39 PM
Originally posted by kugano
I think I'm happy to see that the error rates correspond almost exactly to usage percent. It means no one version was measurably "worse" than any of the others... which is the way it should be =)

IMHO, newer versions should even be allowed to have a higher error/usage ratio without branding them 'faultier', as the early versions were only in use with smaller n's.

vjs
06-02-2004, 02:46 PM
"I want to generate a curve of error rate vs. n size... but I think we need more data points (by raising the double-check point)..."

If the error rate were hardware-based, then it should increase directly with the length of the test (i.e., with n).

According to your data (ignoring the 3.3%), the error rate seems to be going down as n grows. This is very surprising; it makes me think that the majority of the computers producing bad results are older, low-memory, underpowered machines.

I would almost wager that the probability of a test error increases as the processing rate decreases.

smh
06-02-2004, 02:59 PM
Originally posted by Nuri
Interesting stats (though higher than I would expect).


You gave a few reasons yourself why some ranges have higher error rates than expected.

But the 2M-3M range seems to be in line with results from GIMPS (IIRC).

GP2 did some great data mining around there which is really worth going through if you are interested in error rates etc...


A 3 to 4% error rate (as expected!) gives a significant chance (can someone do the math?) of missing a prime.

Tests from suspected users/computers need to be double checked ASAP to keep the chance of missing a prime low.

Are tests with a faulty residue double checked straight away to determine which user/computer made the errors?

kugano
06-02-2004, 03:20 PM
Is it possible to have a webpage of suspect IP's?

It's possible, sure... does anyone imagine people might get upset seeing their IPs posted publicly, even if they're not associated with usernames? Such a thing wouldn't upset me, but I know some people can be very protective of their IPs... I don't think we want to send automatic emails except in extreme cases, but I'll ask Louie and Mike about it.

Or take action on the server's part for that particular IP and assign double checks?

Appropriate double-checks are definitely done by the server.

kugano
06-02-2004, 03:27 PM
According to your data (ignoring the 3.3%), the error rate seems to be going down as n grows. This is very surprising; it makes me think that the majority of the computers producing bad results are older, low-memory, underpowered machines.

I think it's premature to jump to that conclusion. The fluctuations in rates for the ranges I posted are well within the limits of what I'd expect from randomness and sample error. Remember that we have much less (sometimes no) systematic error checking for the higher ranges. Until we do, I'd rely very little on those numbers. Also keep in mind that at the moment double-checking is voluntary... so the user base doing double-checking (from which error rate data can be deduced) is *very much* different, in terms of typical machine types and usage habits, from the at-large user base. This undoubtedly has an impact.

kugano
06-02-2004, 03:32 PM
But the 2M-3M range seems to be in line with results from GIMPS (IIRC).

Sounds right from what I remember too. Also, for the higher ranges where we haven't started systematic double-checking yet, consider the reason we have any error rate data at all -- since we're not assigning double-checks up there, the only time we ever get a second residue is if a client which had previously dropped the test comes back and reports it. But the very fact that the test got dropped and then picked up later probably has a lot of impact on the error rate. What is it about the machine or user that caused the test to "die" for so long and then come back?

We really can't rely on error rate data for any n range until we've done thorough double-checks on that range.
A 3 to 4% error rate (as expected!) gives a significant chance (can someone do the math?) of missing a prime.

A *very* significant chance! That's why systematic double-checks are so important!

Tests from suspected users/computers need to be double checked ASAP to keep the chance of missing a prime low.

I think the double-check threshold should be set so that the probability of finding a new prime on the first pass is equal to the probability of recovering a 'missed' prime on the second pass. Since an ideal server would always assign work most likely to result in a prime discovery, two tests assigned at the same time should always have very nearly the same probability of success.

kugano
06-02-2004, 03:39 PM
I think the double-check threshold should be set so that the probability of finding a new prime on the first pass is equal to the probability of recovering a 'missed' prime on the second pass. Since an ideal server would always assign work most likely to result in a prime discovery, two tests assigned at the same time should always have very nearly the same probability of success.

Actually I should say "same probability of success per unit time"... handing out a 5-minute test with 1% chance of success is just as good as handing out a 10-minute test with a 2% chance of success. Probability experts feel free to jump in anytime and correct my abysmal terminology and/or reasoning...
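In symbols (my own rough formalization, not something the server literally computes): write $e$ for the first-pass error rate, $p(n)$ for the chance that a candidate of size $n$ turns out prime, and $t(n)$ for the test time. The double-check boundary $n_{dc}$ for a first-pass frontier $n_f$ would then satisfy roughly

$\dfrac{e \cdot p(n_{dc})}{t(n_{dc})} = \dfrac{p(n_f)}{t(n_f)}$,

i.e. equal expected primes per unit of CPU time on both sides. Ignoring the $p$ terms (they change far more slowly than $t$ does) collapses this to the rule of thumb $e \approx t(n_{dc}) / t(n_f)$.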

vjs
06-03-2004, 11:09 AM
You're correct, IP addresses should not be posted.

However, an e-mail could be sent with the IP, or a section could be added under your login area: pending tests, completed tests, erroneous-test computers.

As for the error rates, I've seen geological papers that make broad conclusions with less data.

Proposal for discussion (the 5-minute 1% vs. 10-minute 2% argument):

Assume --> the error rate is 4%
Assume --> the error rate on a prime k/n pair is the same as on a non-prime k/n pair
Known --> CPU time per k/n pair increases with n

Conclusion --> all k/n pairs that require 4% of the processing time of the current k/n pair should be double-checked.

Implications:

For example, if CPU time were linear with n (is it?) and we are at n=6.5M with a 4% error rate, everything below n=260K should be double-checked until we get matching residues.

So if everything up to n=800K is double-checked with matching residues (which will happen shortly), then even with a 10% error rate we shouldn't do any more double-checking until the first pass reaches n=8M.
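To make that arithmetic concrete, here's a quick throwaway script (just a sketch; the scaling exponent is exactly the open question at the end of this post, so it's left as a parameter):

#!/usr/bin/perl
use strict;
use warnings;

# Rule of thumb from above: double-check every n whose test time is at
# most error_rate times the test time at the current first-pass frontier.
# Under a pure power-law runtime model t(n) ~ n^x this gives
#   n_dc = n_frontier * error_rate^(1/x)
sub dc_threshold {
    my ($n_frontier, $error_rate, $x) = @_;
    return $n_frontier * $error_rate ** (1 / $x);
}

# Linear assumption (x = 1), frontier at n=6.5M, 4% error rate:
printf "double-check up to n = %.0f\n", dc_threshold(6.5e6, 0.04, 1);   # 260000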

Does anyone know how n relates to processing time exactly?

Mystwalker
06-03-2004, 11:33 AM
AFAIK, effort increases to the square of n, thus when n doubles, the effort is 4 times as high.

Incorporating this into your model, the DC upper bound should be 20% of the normal upper bound (as 0.2² = 0.04).

We are currently at n=6.25M, which results in a DC upper bound of n=1.25M.

kugano
06-03-2004, 12:14 PM
I like the idea of having a section on the website that shows error rate data for a user's IP address. That way the IPs are kept private and we don't have to send out emails. (I hesitate to send emails only because I don't want to needlessly scare people, or make them think their computers are harming the project, or accuse them of having "bad" machines, etc...)

PRP's asymptotic running time is rather complicated. It's certainly not a simple linear or polynomial relationship. This is exactly the reason we have so many problems with "cEMs." The current formula is:

cEMs := n^3 / 10^9

But, since n^3 is not the real running time for the PRP test, cEMs are inaccurate. A test that takes 2 hours will not necessarily be worth twice as many cEMs as a test that takes 1 hour. And to make things even more complicated, the real running time differs locally quite a bit from the asymptotic running time. Breakpoints in FFT sizes cause "jumps" in a plot of running time vs. n size.

Louie's been (I think) poking around the code trying to figure out what the real running time is, in the hopes of finding a better formula / algorithm and better units to use in future versions. I'll have to ask him if he's made any progress. He really knows a lot more about this than I do.

hc_grove
06-03-2004, 02:30 PM
Originally posted by kugano
Louie's been (I think) poking around the code trying to figure out what the real running time is, in the hopes of finding a better formula / algorithm and better units to use in future versions. I'll have to ask him if he's made any progress. He really knows a lot more about this than I do.

I remember seeing in some other thread that the running time grows approximately like (n*log n)^2.

vjs
06-03-2004, 03:28 PM
I don't think we have to waste too much time on determining the exact relationships between DC and first checks, since we don't really know the error rate at present anyways.

I was thinking the processing time per n would involve some sort of log relationship; (n*log n)^2 seems reasonable, if not exactly correct.

A couple of quick calculations comparing the error rate vs. the double-check n, once the first checks reach n=6.5M, using the above log relationship:

Error rate / double-check n which gives the same probability of returning a prime

1.00% / 753K
5.00% / 1.597M
10.00% / 2.210M

1.84% / 1.00M
8.10% / 2.00M


So in other words, per unit of processing time, if you recompute a 1M n-value instead of a first-time 6.5M n-value, you're more likely to find a prime if you believe the error rate for first-time tests is greater than 1.8%.

But if you believe the error rate is less than 5%, double-checking any n value greater than 1.6M is a waste of time.

Basically this means we're better off doing supersecret tests (~800K) than secret tests (2M) at present, unless your intention is determining error rates at higher n.
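For reference, here's roughly how those numbers fall out (a little script of my own; it just inverts error_rate = t(n_dc)/t(6.5M) with t(n) = (n*log n)^2 and ignores the slowly changing chance that a candidate of a given size is prime):

#!/usr/bin/perl
use strict;
use warnings;

# Break-even error rate for double-checking at n_dc while first-pass
# tests run at n_first, assuming t(n) ~ (n log n)^2.
sub break_even_rate {
    my ($n_dc, $n_first) = @_;
    return (($n_dc * log($n_dc)) / ($n_first * log($n_first))) ** 2;
}

my $n_first = 6.5e6;
for my $n_dc (753_000, 1_000_000, 1_597_000, 2_000_000, 2_210_000) {
    printf "n_dc = %-9d -> %.2f %%\n", $n_dc, 100 * break_even_rate($n_dc, $n_first);
}
# reproduces (to rounding) the 1% / 5% / 10% and 1M / 2M rows above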

Joe O
06-03-2004, 03:30 PM
Dave, I'm very interested in your error algorithm. What would it do in the following cases:

                  correct   incorrect
residues          tests     tests
X, X, Y, Y           ?          ?
X, X, X, Y, Y        ?          ?
X, X, Y, Y, Z        ?          ?


This is from a real-world case that I have. In any case, could you/would you share your algorithm with me? My SB profile has my valid email address.

vjs
06-03-2004, 03:36 PM
An addition to the above:

http://www.seventeenorbust.com/sieve/next.txt

The current primary-test n is 6292951

http://www.seventeenorbust.com/secret

Smallest double check is around 772K

By the equality above, this means we are currently working at roughly the break-even point for a 1.13% error rate.

kugano
06-03-2004, 05:15 PM
Dave, I'm very interested in your error algorithm. What would it do in the following cases:

2 correct, 3 correct and 2 correct, respectively. Of course the first and last situations would certainly require further investigation to figure out which residue is really right. But for purposes of guessing at an error rate it works okay. The algorithm goes like this (Perl version):
use List::Util qw(max);

sub count_correct_residues
{
    my @residues = @_;
    my %residue_classes;
    foreach my $residue (@residues) {
        $residue_classes{$residue} = 0 unless exists $residue_classes{$residue};
        $residue_classes{$residue}++;
    }
    return max(values(%residue_classes));
}

In mathematical terms, this function says "given residues S(0), S(1), ..., S(n), return the size of the largest set P of indices such that S(a) = S(b) for any indices a, b in P."

In English, it says "pick the residue that occurs the most times and return how many times it occurs in the list."
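For example, plugging Joe's three cases from above straight into that sub (just a quick sanity check of my own, pasted under the code; the letters stand in for real residues):

print count_correct_residues(qw(X X Y Y)),   "\n";   # 2 correct (2 incorrect)
print count_correct_residues(qw(X X X Y Y)), "\n";   # 3 correct (2 incorrect)
print count_correct_residues(qw(X X Y Y Z)), "\n";   # 2 correct (3 incorrect)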

Joe O
06-03-2004, 06:10 PM
Dave,
Thank you very much! That's just what I needed. I couldn't quite "get my mind around the problem".
Joe.
ps
Isn't Perl great!
J.

kugano
06-03-2004, 06:58 PM
Isn't Perl great!
My first daughter will likely be named "Pearl." And I'm only just a little bit kidding... it has the advantage of being a beautiful name (I think) even aside from its connection with that sweetest of languages.

wblipp
06-03-2004, 11:06 PM
Originally posted by Mystwalker
AFAIK, effort increases to the square of n, thus when n doubles, the effort is 4 times as high.

When we researched this question in the thread about a Resource Allocation Model, we found two formulas that give similar results. Based on theoretical issues, it should scale as (n log(n))^2. Based on the heuristic observation that numbers 2 times larger take 5 times longer, it should scale as n^2.32.

Discussed Here:
http://www.free-dc.org/forum/showthread.php?s=&threadid=2988&perpage=14&pagenumber=2
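(Side note of mine, in case anyone wonders where the 2.32 comes from: it's just the exponent that turns "twice as large" into "five times longer",

$2^x = 5 \implies x = \log_2 5 \approx 2.32$,

and over the n range we're actually working in the two formulas track each other fairly closely.)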

kugano
06-04-2004, 02:20 AM
Based on the heuristic observation that numbers 2 times larger take 5 times longer, it should scale as n^2.32.
Only if you assume the running time follows a strict power law (n^x for some fixed x). And that'd be an incorrect assumption...

(n log n)^2 may indeed be the correct running time, but I wouldn't count on it. FFTs come in diverse shapes and sizes. Woltman's code does a lot of very clever things, and it's certainly not the same straight algorithm you'd get out of, say, Knuth. (n log n)^2 may be the running time for the algorithm discussed in that article, but don't rashly assume it works for what George has done too. (Although, like I said, it might... I really don't know. And the jumps due to FFT size changes mess up the data enough to make it hard to test empirically.)

vjs
06-04-2004, 04:29 PM
[(n' log n')^2] / [(n log n)^2] = time difference multiple between n's

(n log n)^2 yields a computation-time difference of about 85X for n=800K vs. n=6.3M.

Basically, if a supersecret test takes about an hour, a regular test takes 85X longer, so about 3.5 days. Sounds like a pretty good estimate.


I'd like to personally thank Kugano for taking the time to enter this discussion.

MikeH
06-06-2004, 07:27 AM
Dave,

Any chance you could seed the supersecret account with some more tests? At the current rate it will run out in about 5 days. I assume for the time being we'll just continue to move forward from 800K, so it would seem logical to seed it with all unverified candidates from 800K to 2M.

The error rate information is very interesting. If we are a little worried about offending people by making any 'bad' results public (I myself wouldn't have a problem), how about adding this info to a page shown when a user is logged in (for his own tests)? This would at least give users a chance to figure out for themselves if they have a faulty PC. At the moment users have little or no visibility of anything being wrong.

If you are going to put together a regularly updated page of error rates, it might be worth differentiating

Y, X, X
from
X, Y, X and
X, X, Y

Since we are really interested in the credibility of first-time tests, it is important that we separate out any cases where it was the DC test that was faulty, since those really aren't such a major problem (but this info still needs displaying as well).

vjs
06-06-2004, 10:29 AM
Mike,

I'd suggest simply raising the bar to 1M to begin with.

It gives us a better feeling of accomplishment once we reach a particular goal. Besides, there are several thousand tests between 800K and 1M, and the tests will start to take much longer.

VJS

kugano
06-08-2004, 08:19 PM
I've been doing some (rather extensive) analysis of SB's historic test data. It turns out that the algorithmic running time is either exactly O(n^2 log n) or very, very close. This would mean a single iteration of the FFT takes O(n log n) time. Then there are O(n) FFTs per test, for a product of O(n^2 log n). I may post some graphs later that illustrate this.

Note this isn't the same as O((n log n)^2)...

In light of this, it's very unfortunate that cEMs are computed at O(n^3)!! The next client or website overhaul will definitely involve a total unit-system change!
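To put rough numbers on that (my own back-of-the-envelope comparison, nothing official, using the same 800K vs. 6.3M sizes vjs used above):

#!/usr/bin/perl
use strict;
use warnings;

# How much longer a test at n2 "should" take than one at n1 under three
# scaling models. Only ratios matter, so constant factors are dropped.
my ($n1, $n2) = (800_000, 6_300_000);

my %model = (
    'n^3 (cEMs)'  => sub { $_[0] ** 3 },
    'n^2 log n'   => sub { $_[0] ** 2 * log($_[0]) },
    '(n log n)^2' => sub { ($_[0] * log($_[0])) ** 2 },
);

for my $name (sort keys %model) {
    printf "%-12s  t(6.3M)/t(800K) = %4.0fx\n",
           $name, $model{$name}->($n2) / $model{$name}->($n1);
}
# prints roughly 82x for (n log n)^2, 71x for n^2 log n, 488x for n^3

So relative to the n^2 log n fit, cEMs over-credit the big test by roughly a factor of 6-7, while (n log n)^2 differs from n^2 log n only by the extra log factor (82x vs. 71x here).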

kugano
06-08-2004, 09:42 PM
Here are some graphs, as promised...

The x axis is the n value of the test. The y axis is the time, in seconds, it took to complete it. The scale is logarithmic.

I've also created two graphs with gradient lines plotted first for n^3 (cEMs) and then for n^2 log n. You can see how inaccurate cEMs are by noticing how these gradient lines are much steeper than the data points.

Here are the links:

http://www.seventeenorbust.com/cems/log/data-only-small.png (data only)
http://www.seventeenorbust.com/cems/log/cems-small.png (cEM gradient lines)
http://www.seventeenorbust.com/cems/log/new-units-small.png (new gradient lines)

You can get larger versions by removing the "-small" from the URL.

For comparison, here are some graphs in linear scale rather than logarithmic:

http://www.seventeenorbust.com/cems/new/data-only-small.png (data only)
http://www.seventeenorbust.com/cems/new/cems-small.png (cEM gradient lines)
http://www.seventeenorbust.com/cems/new/new-units-small.png (new gradient lines)

Death
06-09-2004, 03:08 AM
Originally posted by kugano
Here are some graphs, as promised...

The x axis is the n value of the test. The y axis is the time, in seconds, it took to complete it. The scale is logarithmic.
http://www.seventeenorbust.com/cems/log/new-units-small.png (new gradient lines)


I think the gaps at the left of the graph appear because of client progress and increasing calculation speed. If we recomputed the whole project from the beginning with the current client, the graph would be smoother, like a solid cone of color.