Hi again,

This continues from my first thread, http://free-dc.org/forum/showthread....threadid=4251.

Prokaryote, I didn't know there are multiple copies of some genes.
Interesting. I see what you mean about allowing mutations to occur in one copy of the gene while still having the protein available from the non-mutated gene.

. . . . . . . .

Howard, before I unveil my wee suggestions, let me spell out the main assumption on which they rest.

I think of the high-level strategy of the d.f. algorithm this way: create enough random structures to be confident there is a good one to use as the starting point for the actual folding algorithm. Repeat this cycle about 200,000 times (= about 5 billion structures altogether). The key point is that it is not important to start with a fold that has a fantastic energy value, simply a good one.

Here then, is a modified version of my original suggestion to speed up generation 0:

1) Create enough random folds to get a reasonably good energy score.
2) Thereafter, check the energy of each new fold once, at some residue past the halfway point.
E.g. if the protein has 100 residues, check at residue 70.
3) If the energy is more than 100% worse than the current best energy, discard this fold and start the next.

An interrupted fold takes 30% less time than a completed one. If half the folds are discarded, generation 0 will complete 15% faster than it currently does (roughly speaking).
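The rough estimate above can be written out as a small calculation. This is only a sketch of the arithmetic: the function name and its parameters are made up for illustration, and it assumes the time to build a fold grows in proportion to the number of residues placed.

```python
def gen0_time_saved(check_fraction, discard_fraction):
    """Fraction of gen 0 time saved by abandoning some folds early.

    check_fraction:   how far through the fold the check happens
                      (0.7 = residue 70 of 100)
    discard_fraction: share of folds that fail the check and are abandoned
    Assumption: build time is proportional to residues placed.
    """
    saved_per_discarded_fold = 1.0 - check_fraction
    return discard_fraction * saved_per_discarded_fold


# Check at residue 70 of 100, half of all folds discarded.
print(f"{gen0_time_saved(0.7, 0.5):.0%}")  # prints 15%
```

Checking earlier or discarding more folds raises the saving, but an earlier check also risks throwing away folds that would have recovered.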

I realize that backtracking could occur right after the 70th residue and lower the energy, but I am guessing that the "discard energy" can be set high enough, 100% in my example, so that this would happen so seldom as to make the check worthwhile.

Many variations are possible. You could check if the energy is worse than 90% at residue 60 and again if it is worse than 120% at residue 80, etc.
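To make the checkpoint idea concrete, here is a minimal sketch. Everything in it is hypothetical: I don't know how d.f. represents folds or energies, so it assumes lower energy is better and reads "X% worse" as "higher than the best energy by X% of its magnitude".

```python
# Hypothetical checkpoints: residue number -> how much worse (as a fraction
# of the best energy's magnitude) a partial fold may be before it is abandoned.
CHECKPOINTS = {60: 0.90, 80: 1.20}


def should_discard(residues_placed, partial_energy, best_energy,
                   checkpoints=CHECKPOINTS):
    """Return True if a partly built fold should be abandoned.

    Assumes lower energy is better. The fold is only tested when
    residues_placed is one of the checkpoint residues.
    """
    tolerance = checkpoints.get(residues_placed)
    if tolerance is None:
        return False  # not a checkpoint residue, keep building
    # Worst energy still accepted at this checkpoint.
    limit = best_energy + tolerance * abs(best_energy)
    return partial_energy > limit


# With a best energy of -100, the limit is -10 at residue 60 and +20 at residue 80.
print(should_discard(60, -5.0, -100.0))   # True: worse than -10
print(should_discard(60, -50.0, -100.0))  # False: still within the limit
```

The single check at residue 70 from the list above is the same thing with `checkpoints={70: 1.0}`.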

I am making a number of assumptions here, the principal one being that, well before all the residues have been added, the energy of most bad folds is already significantly higher than that of a good fold.

. . . . . . . .

Here's a second suggestion.

I expect you have a statistical method for determining a "reasonably" good energy for the protein once 10,000 random structures have been created. (Not the best energy, simply one that can determine a good starting fold for gen 1.) Use the statistical information of the protein available after the first cycle is done.

1) Calculate a good energy value based on the first gen 0 to use as a threshold.
2) Pass the threshold to all subsequent gen 0s.
3) If a fold in gen 0 has an energy better than this threshold value, start gen 1.
4) The threshold could be refined as structures are created.
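Steps 1 to 3 could look something like the sketch below. The names are invented, lower energy is assumed to be better, and the "statistical method" is replaced by a stand-in rule (take the energy of the 10th best fold from the first gen 0) purely so the example runs. Step 4, refining the threshold, is left out.

```python
def threshold_from_first_gen0(energies, rank=10):
    """Stand-in rule (an assumption): the rank-th best energy of the first gen 0."""
    return sorted(energies)[rank - 1]


def run_gen0(fold_energies, threshold, max_structures=10000):
    """Build random folds until one is at least as good as the threshold.

    fold_energies: iterable yielding the energy of each new random fold.
    Returns (best_energy, structures_built). Gen 1 would start from the
    fold with best_energy.
    """
    best = None
    built = 0
    for energy in fold_energies:
        built += 1
        if best is None or energy < best:
            best = energy
        # Stop early on a good enough fold, or at the usual gen 0 size.
        if best <= threshold or built >= max_structures:
            break
    return best, built


threshold = threshold_from_first_gen0([-5.0, -1.0, -9.0, -3.0], rank=2)
print(threshold)                                        # -5.0
print(run_gen0([-1.0, -2.0, -6.0, -9.0], threshold))    # (-6.0, 3)
```

In the second call gen 0 stops after three structures instead of four, because the third fold already beats the threshold.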

Call me an optimist with no knowledge of the inner workings of d.f. (the latter is definitely true), but I imagine this could reduce the number of gen 0 structures significantly. You could also pass this threshold value from the server to the clients after it is calculated the first time.

. . . . . . . .

And at no extra cost, a third suggestion.

It must happen occasionally that one gen 0 produces a good batch of folds while the next gen 0 is worse than average, so that its best fold is only as good as, say, the 9th best from the previous batch. By then those better folds have been discarded.

Would it be worthwhile keeping the 10 or 50 best unused folds from each gen 0 and passing them along to the next? That should ensure that each gen 1 always starts with the best fold from all the gen 0s.
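A small sketch of what I mean by passing the best unused folds along. Plain energies stand in for whole folds here, lower is assumed to be better, and the function name is made up.

```python
import heapq


def pick_start_and_carry(carried, new_energies, keep=10):
    """Choose the starting fold for gen 1 and the leftovers to pass along.

    carried:      energies of unused folds kept from earlier gen 0s
    new_energies: energies of the folds from the gen 0 just finished
    Returns (best_energy, next_carried). Assumes at least one fold exists.
    """
    candidates = [*carried, *new_energies]
    ranked = heapq.nsmallest(keep + 1, candidates)
    # The best fold is used now; the next `keep` best are carried forward.
    return ranked[0], ranked[1:]


carried = []
start, carried = pick_start_and_carry(carried, [-9.0, -8.0, -7.0, -1.0], keep=2)
# A poor second batch: the carried-over -8.0 fold is used instead of -3.0.
start2, carried = pick_start_and_carry(carried, [-3.0, -2.0], keep=2)
print(start, start2, carried)  # -9.0 -8.0 [-7.0, -3.0]
```

The cost is storing `keep` extra folds per client, which should be small next to the 10,000 built in each gen 0.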

I find the idea of trying to shorten gen 0 appealing, because as it stands, 44% of all structures are created in gen 0, and therefore, in a sense, not used.

. . . . . . . .

BTW, you mention that a structure typically takes a second to complete. While I have seen that happen, on average mine take much longer, and I am running the text client on a 1.7 GHz Pentium 4 machine. At a very rough guess, I would say around 10 seconds per structure.

Two minor questions. If my calculations are correct (what in the name of Hollywood got me to write that?):

22500 folds are created in a complete 251-generation cycle (10000 + 50 * 250)

132083 points are awarded for a complete cycle (50 + the sum of floor(sqrt(x) * 50) for x from 1 to 250)
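For anyone who wants to redo these sums, here is a short script. It follows the formulas exactly as I wrote them above; whether those formulas match what the server really does is my assumption. The constant names are my own.

```python
import math

GEN0_FOLDS = 10_000          # structures built in generation 0
FOLDS_PER_LATER_GEN = 50     # structures per generation in gens 1 to 250
LATER_GENS = 250

total_folds = GEN0_FOLDS + FOLDS_PER_LATER_GEN * LATER_GENS

# floor(sqrt(x) * 50) equals isqrt(2500 * x), which avoids floating-point rounding.
total_points = 50 + sum(math.isqrt(2500 * x) for x in range(1, LATER_GENS + 1))

# Share of all structures that are built in gen 0.
gen0_share = GEN0_FOLDS / total_folds

print(total_folds, total_points, f"{gen0_share:.0%}")
```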

So does "Total work done so far:4,705,074,436" mean 4.7 billion points or structures?

Is there an explanation on the site of the energy graphs? For starters, I don't understand what the colours mean in the chart. (If you don't have a link handy, please don't waste your time on this. I will hunt around in my textbook/online soon. Nice word that, "soon".)

Interested to hear your remarks,


An addendum. I timed my text client and found:

gen 0: 0.67 sec/structure
gens 233 to 238: 26.75 sec/structure

Which means generation 0 completes in about 1.9 hours, and gens 1 to 250 complete in about 92.9 hours (assuming the gen 233 to 238 rate holds for all of them).
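The hours come from the following arithmetic. It assumes 10,000 structures in gen 0, 50 in each later generation, and that the rate I measured for gens 233 to 238 applies to all of gens 1 to 250, which probably overstates the early generations.

```python
SECONDS_PER_HOUR = 3600

GEN0_SEC_PER_STRUCTURE = 0.67     # measured in gen 0
LATER_SEC_PER_STRUCTURE = 26.75   # measured over gens 233 to 238

gen0_hours = 10_000 * GEN0_SEC_PER_STRUCTURE / SECONDS_PER_HOUR
later_hours = 50 * 250 * LATER_SEC_PER_STRUCTURE / SECONDS_PER_HOUR

# Share of the whole cycle's run time spent in gen 0.
gen0_time_share = gen0_hours / (gen0_hours + later_hours)

print(f"gen 0: {gen0_hours:.1f} h, gens 1 to 250: {later_hours:.1f} h")
print(f"gen 0 share of run time: {gen0_time_share:.1%}")
```

On these numbers gen 0 is roughly 2% of the run time, even though it accounts for 44% of the structures.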

Hm. The gain in time by shortening gen 0 would not be what I was expecting. However, might there still be something to gain by collecting more than just the best fold from generation 0?


"questions about energy, generation 0, and a cheeky suggestion"