a passel of suggestions

**Blackout** · 10-05-2003, 11:33 PM

Hi again,

This continues from the thread http://free-dc.org/forum/showthread....threadid=4251, "questions about energy, generation 0, and a cheeky suggestion".

Prokaryote, I didn't know there are multiple copies of some genes. Interesting. I see what you mean about allowing mutations to occur in one copy of the gene while still having the protein available from the non-mutated gene.

. . . . . . . .

Howard, before I unveil my wee suggestions, let me spell out the main assumption on which they rest.

I think of the high-level strategy of the d.f. algorithm this way: create enough random structures to be confident there is a good one to use as the starting point for the actual folding algorithm. Repeat this cycle about 200,000 times (= about 5 billion structures altogether). The key point is that it is not important to start with a fold with a fantastic energy value, simply a good one.

Here then, is a modified version of my original suggestion to speed up generation 0:

1) Create enough random folds to get a reasonably good energy score.
2) Thereafter, check the energy once at a residue number at some point after the halfway residue.
E.g. if the protein has 100 residues, check at residue 70.
3) If the energy is more than 100% worse than the current best energy, discard this fold, start the next.

An interrupted fold takes 30% less time than a completed one. If half the folds are discarded, generation 0 will complete 15% faster than it currently does (roughly speaking).

I realize that backtracking could occur right after the 70th residue and result in a lower energy, but I am guessing that the "discard energy" can be set high enough, 100% in my example, so that this would happen so seldom as to make the check worthwhile.

Many variations are possible. You could check if the energy is worse than 90% at residue 60 and again if it is worse than 120% at residue 80, etc.
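
A minimal sketch of the checkpoint rule described above. It assumes the energy is a positive score where lower is better, and that a partial energy can be computed partway through a fold; neither is confirmed for the actual d.f. client. The names `CHECKPOINTS`, `should_discard`, and `expected_time_saving` are made up for illustration.

```python
# Hypothetical checkpoints: (residue number, tolerated fraction worse than best).
CHECKPOINTS = [(60, 0.90), (80, 1.20)]

def should_discard(residue, partial_energy, best_energy, checkpoints=CHECKPOINTS):
    """Return True if the fold should be abandoned at this residue.

    Assumes positive energies where lower is better, so "100% worse"
    would mean twice the current best.
    """
    for checkpoint_residue, tolerance in checkpoints:
        if residue == checkpoint_residue:
            return partial_energy > best_energy * (1.0 + tolerance)
    return False  # not a checkpoint residue, keep folding

def expected_time_saving(fraction_discarded, time_saved_per_discard):
    """Rough fraction of gen 0 time saved, ignoring the cost of the check itself."""
    return fraction_discarded * time_saved_per_discard

print(should_discard(60, 250.0, 100.0))
print(expected_time_saving(0.5, 0.3))
```

The second function is just the rough estimate from the post: half the folds discarded, each saving 30% of a fold's time. It leaves out the cost of the energy check itself.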

I am making a number of assumptions here, the principal one being that, well before all the residues have been added, the energy of most bad folds is already significantly higher than that of a good fold.

. . . . . . . .

Here's a second suggestion.

I expect you have a statistical method for determining a "reasonably" good energy for the protein once 10,000 random structures have been created. (Not the best energy, simply one that can determine a good starting fold for gen 1.) Use the statistical information of the protein available after the first cycle is done.

1) Calculate a good energy value based on the first gen 0 to use as a threshold.
2) Pass the threshold to all subsequent gen 0s.
3) If a fold in gen 0 has an energy better than this threshold value, start gen 1.
4) The threshold could be refined as structures are created.
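
Roughly, the four steps above could look like the sketch below. It is not how the d.f. client works: the energy source is a toy random number, lower energy is assumed to be better, and the rule for deriving the threshold (`first_best * 1.5`) is a placeholder, since the post does not say how a "good" energy would be calculated.

```python
import random

def run_gen0(energy_of_random_fold, threshold=None, max_folds=10_000):
    """Create random folds until one beats the threshold or the budget runs out.

    Returns (best_energy, folds_created). Lower energy is assumed better.
    """
    best = float("inf")
    for created in range(1, max_folds + 1):
        best = min(best, energy_of_random_fold())
        if threshold is not None and best < threshold:
            return best, created  # good enough: gen 1 could start here
    return best, max_folds

rng = random.Random(0)  # toy energy source standing in for a real fold

def toy_energy():
    return rng.uniform(0.0, 1000.0)

first_best, _ = run_gen0(toy_energy)           # first cycle: full 10,000 folds
threshold = first_best * 1.5                   # placeholder rule, not from the post
best, used = run_gen0(toy_energy, threshold)   # later cycles may stop early
print(best, used)
```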

Call me an optimist with no knowledge of the inner workings of d.f. (the latter is definitely true), but I imagine this could reduce the number of gen 0 structures significantly. You could also pass this threshold value from the server to the clients after it is calculated the first time.

. . . . . . . .

And at no extra cost, a third suggestion.

It must happen occasionally, that a gen 0 has a good batch of folds and the folds in the next gen 0 are worse than average, and its best fold is only as good as the 9th best from the previous batch. But those better folds have been discarded.

Would it be worthwhile keeping the 10 or 50 best unused folds from each gen 0 and passing them along to the next? That should ensure that each gen 1 always starts with the best fold from all the gen 0s.

I find the idea of trying to shorten gen 0 appealing, because as it stands, 44% of all structures are created in gen 0, and therefore, in a sense, not used.

. . . . . . . .

BTW, you mention that a structure typically takes a second to complete. While I have seen that happen, on average mine take much longer and I am running the text client on a 1.7GHz Pentium 4 machine. At a very rough guess, I would say around 10 seconds per structure.

Two minor questions. If my calculations are correct (what in the name of Hollywood got me to write that?):

22500 folds are created in a complete 251-generation cycle (10000 + 50 * 250)

132083 points are awarded for a complete cycle (50 + sum of floor(sqrt(x) * 50), where x is 1 to 250)
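
For anyone who wants to redo the two sums, here is a small sketch. The points formula is taken as written in the post, not from any official scoring documentation, and the constant names are invented here.

```python
import math

GEN0_FOLDS = 10_000
FOLDS_PER_LATER_GEN = 50
LATER_GENS = 250

total_folds = GEN0_FOLDS + FOLDS_PER_LATER_GEN * LATER_GENS

# floor(sqrt(x) * 50) equals isqrt(2500 * x), which avoids float rounding.
total_points = 50 + sum(math.isqrt(2500 * x) for x in range(1, LATER_GENS + 1))

gen0_share = GEN0_FOLDS / total_folds  # fraction of all folds created in gen 0

print(total_folds, total_points, round(gen0_share, 3))
```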

So does "Total work done so far:4,705,074,436" mean 4.7 billion points or structures?

Is there an explanation on the site of the energy graphs? For starters, I don't understand what the colours mean in the chart. (If you don't have a link handy, please don't waste your time on this. I will hunt around in my textbook/online soon. Nice word, "soon".)

Interested to hear your remarks,
Michael

______________________________________

An addendum. I timed my text client and found:

gen 0
sec/structure 0.67

gen 233 to 238
sec/structure 26.75

Which means generation 0 completes in 1.9 hours
gens 1 to 250 complete in 92.9 hours
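
The arithmetic behind those two figures, as a quick sketch. It assumes the 26.75 sec/structure measured over gens 233 to 238 holds for all of gens 1 to 250, which is only a guess.

```python
GEN0_FOLDS = 10_000
LATER_FOLDS = 50 * 250  # 50 structures in each of gens 1 to 250

gen0_hours = GEN0_FOLDS * 0.67 / 3600
later_hours = LATER_FOLDS * 26.75 / 3600
gen0_share = gen0_hours / (gen0_hours + later_hours)

print(f"{gen0_hours:.1f} h, {later_hours:.1f} h, gen 0 share {gen0_share:.1%}")
# prints: 1.9 h, 92.9 h, gen 0 share 2.0%
```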

Hm. The gain in time by shortening gen 0 would not be what I was expecting. However, might there still be something to gain by collecting more than just the best fold from generation 0?

Michael

**m0ti** · 10-06-2003, 02:02 AM

One comment on your first suggestion (I take it you're not a programmer):

You forgot to take into consideration how much time it takes to evaluate the energy. You have to look at the time that takes in order to determine the optimal point at which to evaluate a protein, before it is done, and consider it for elimination. If the cost is too high, then the fold would have to be completed before evaluating the energy; otherwise it would just be too expensive.

**tpdooley** · 10-06-2003, 03:03 AM

As for one of the later questions - if we're at 5 billion structures or points..
during phase I.. it was structures.
now, during phase II, the score is points.

The Phase II client is doing much better with far fewer folds than the Phase I client. And when we track down a few more bugs that are driving folders nuts.. it'll approach the reliability of the Phase I client. Vast cheering will spring out..

(50 people on the phase II beta folded for about a week, and got a score 2A better than about 2000 folks folding for about 5 weeks).

**Brian the Fist** · 10-06-2003, 11:46 AM

You figured out the answer about speeding up phase I yourself already. Phase II stats are all in points, not structures.

To help understand the graphs, there is a small link called 'details' above each graph - click there for an explanation and return here if you still have questions after reading that.

**Blackout** · 10-06-2003, 03:28 PM

Hi Howard,

There are no Details links in the pages with energy graphs that I'm looking at, for example, http://www.distributedfolding.org/1viiEnergy.html

Are there other energy graphs?

Which change sped up Phase I?

Thanks and good luck with the update,
Michael

**tpdooley** · 10-06-2003, 05:01 PM

http://www.distributedfolding.org/login.html
take a look at the 10 best structures produced to date.. and there's a "details" button on the right side of each of the 10.

---
As for the Phase I to Phase II speedup question - take a quick look at the Phase II FAQ at http://www.distributedfolding.org/about.html (link on the left).

The two major speed improvements in my time on DF have been the "use extra ram" option (the -rt switch) and the switch from Phase I to Phase II. The -rt switch almost doubled the speed by keeping everything in RAM; you need about 256 MB to use it with larger proteins on Win98SE.

The "speedup" by switching from Phase I to Phase II was the use of a different approach. With Phase I - we randomly created structures the exact same way that the current Gen 0 is created. If with a particular protein, we could get to 7.11A with 1 billion random structures, we'd need roughly 10 billion to get near 6.11A, and 100 billion to get near 5.11A. (During the beta test, we took a protein that around 2000 of us spent 5 weeks on and created 10 billion structures and got a low score of 7.11A. 50 of us donated a machine or two to the Beta project and spent just a week on each of the beta tests - and we had several beta candidates that got under 5.11A). By the way - what was the low score on the client that was actually chosen?
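
The rough Phase I scaling quoted above (1 billion structures for 7.11A, 10 billion for 6.11A, 100 billion for 5.11A) amounts to ten times the work per 1A of improvement. A tiny sketch of that rule of thumb follows; the function name is made up and the numbers are only the illustrative ones from this post.

```python
def structures_needed(target_rmsd, ref_rmsd=7.11, ref_structures=1e9):
    """Random structures needed to reach target_rmsd, assuming 10x work per 1A gained."""
    return ref_structures * 10 ** (ref_rmsd - target_rmsd)

for target in (7.11, 6.11, 5.11):
    print(target, f"{structures_needed(target):.3g}")
```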

With Phase II - we start off with a pool of 10,000 random structures in Gen 0 using the Phase I techniques. For later gens (1-250) the best structure from the previous generation is selected. It's then played with, mutated, changed a little bit.. 50 times. The best resulting structure is then passed on to the next generation.
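
A toy sketch of the cycle as described in the paragraph above. It is my reading of the description and not the real client: a "structure" here is just a number, the mutation and score functions are stand-ins, and lower scores are assumed to be better.

```python
import random

def phase2_cycle(random_structure, mutate, score, gen0_size=10_000,
                 generations=250, children_per_gen=50):
    """One cycle: random gen 0, then best-of-50 mutations for each later gen.

    Returns the best score seen in each generation (gen 0 first).
    """
    best = min((random_structure() for _ in range(gen0_size)), key=score)
    history = [score(best)]
    for _ in range(generations):
        children = [mutate(best) for _ in range(children_per_gen)]
        best = min(children, key=score)  # the parent itself is not carried over
        history.append(score(best))
    return history

rng = random.Random(1)
history = phase2_cycle(
    random_structure=lambda: rng.uniform(-100.0, 100.0),  # toy structure: one number
    mutate=lambda s: s + rng.gauss(0.0, 1.0),              # small random change
    score=abs,                                             # distance from a toy optimum at 0
)
print(len(history), history[0], min(history))
```

Because only the best of the 50 children is passed on, a generation can end up worse than the one before it.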

Taking a look at RMS over time graphs in the 10 Best Details pages, you'll see it start out high.. drop down to a valley.. and then climb up a hill and down into valleys a few more times before it hits Gen 250. So the changes over each generation do not always lead to a smaller structure each generation. When I was watching the details charts, I noticed a few of the best structures happening in the Generation 230 area which is why we keep folding out to gen 250.

------------
Using this technique, we get a much better result with much less work involved. (How soon till the next CASP challenge to see how we compare to the other challengers?)

**bwkaz** · 10-06-2003, 06:28 PM

Originally posted by Blackout
1) Create enough random folds to get a reasonably good energy score.
2) Thereafter, check the energy once at a residue number at some point after the halfway residue.
E.g. if the protein has 100 residues, check at residue 70.
3) If the energy is more than 100% worse than the current best energy, discard this fold, start the next.

I don't know for sure, but I think I remember someone posting that it was impossible to calculate the energy before the protein was done.

If that is true, then this suggestion cannot work.

It is definitely impossible to calculate the RMSD before the structure is done, because RMSD is calculated by taking all the distances that every amino acid was put away from its real location, squaring all the numbers, finding the average of them, and taking the square root of that average (RMSD is root-mean-square-deviation, AKA the root of the mean of the squares of the deviations of each amino acid). Since not all the AA's are placed before the protein is done, it's impossible to calculate the RMSD before the protein is done.
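
The calculation described above, as a minimal sketch. It assumes the two structures are already superimposed and are given as equal-length lists of (x, y, z) coordinates, one per amino acid; the function name and input format are illustrative only.

```python
import math

def rmsd(predicted, actual):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_distances = [
        sum((p - a) ** 2 for p, a in zip(p_xyz, a_xyz))
        for p_xyz, a_xyz in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_distances) / len(squared_distances))

# Two residues: one is 5 units off, the other is exactly in place.
print(rmsd([(0, 0, 0), (1, 1, 1)], [(3, 4, 0), (1, 1, 1)]))
```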

RMSD is not being used to score gen 0, but it is being used for the other gens (at least, that was my understanding of why gen 1 always has a "score" of one third the score of gen 0, or thereabouts -- gen 0 is the energy, while gen 1 is the RMSD). So it still might work -- but not for gen 1 through 250, though I don't think you were suggesting that.

**Blackout** · 10-06-2003, 07:32 PM

Hi,

tpdooley, thanks for the link to the "10 best" and the details. I hadn't seen that before.

Howard, if the description of the new algorithm in the Phase II FAQ is what you meant by "you figured out the answer about speeding up phase I" then my suggestion was (again) not clear.

My 3rd suggestion is to keep the best 10 (or 20) folds from gen 0 which are not used, and add them to the next gen 0. This would avoid the following:

first gen 0 - best RMSDs:
13.0, 13.1, 13.3

second gen 0 - best RMSDs:
14.0, 14.2, 14.9

Currently, the second gen 1 would begin with an RMSD of 14.0, but by carrying the best 10 folds along, it would start with an RMSD of 13.1.
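
The same example as a small sketch, assuming each fold is represented only by its score and lower is better. `best_start` is a made-up helper, not anything from the client.

```python
import heapq

def best_start(current_batch, carried_over, keep=10):
    """Return (score used to start gen 1, leftover scores to carry forward)."""
    pool = heapq.nsmallest(keep + 1, list(current_batch) + list(carried_over))
    return pool[0], pool[1:]

start1, carry = best_start([13.0, 13.1, 13.3], [])
start2, carry = best_start([14.0, 14.2, 14.9], carry)
print(start1, start2)  # prints: 13.0 13.1
```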

bwkaz, I expect it is possible to calculate RMSD or energy at any time. Of course, if it's done before all the residues are present, it won't be the final RMSD or energy of the protein, but an intermediate value could be useful in some circumstances.

However, since gen 0 takes a small percentage of the time to go through all 251 gens, I think my first two suggestions are of limited interest.

Michael

**tpdooley** · 10-06-2003, 09:55 PM

After someone complained about getting a score of 40? for their best gen 0 structure - Howard mentioned that the Gen 0 structures aren't being scored by RMSD - or at least the score isn't the RMSD value.

With enough creative searching, perhaps you can find his explanation.

**Brian the Fist** · 10-07-2003, 10:49 AM

Originally posted by Blackout

However, since gen 0 takes a small percentage of the time to go through all 251 gens, I think my first two suggestions are of limited interest.

Michael

That was my main point.
Also, the situation you describe is unlikely. With 10,000 samples, the best 'score' we expect will generally be very similar in separate batches of 10,000. It's not worth the extra trouble and coding to save folds from previous gen 0's, although I understand your idea and it's perfectly valid of course.
