questions about energy, generation 0, and a cheeky suggestion
Hello,
I have a few questions about how energy and best structure are determined.
The Phase II FAQ states, "Only the best structure for each generation is uploaded to the server." Does that apply to generation 0 as well?
How is the best structure from gen. 0 currently selected? Is it based on the lowest RMSD or best energy?
How is the energy calculated and how important is the accuracy?
And is it possible to begin calculating or estimating the energy of the protein as it is being assembled? How different is the energy of a protein between its native state and a 'typical' unfolded state?
If the energies are very different, and if energy can be calculated during assembly (and if not all structures from gen. 0 are uploaded), might it be possible to speed up generation 0 by doing the following?
First, an assumption: I assume that for each of the 10,000 iterations, after the protein fold is completed, the energy is calculated and compared to the current best energy, which is updated if the new fold is better. After all 10,000 are done, the best fold is used as the seed to start generation 1.
The suggestion: after the first 200 (or 500) folds are complete, start calculating the energy while each protein is being built up. If it exceeds the current best energy, the rest of that fold is skipped. If calculating the energy after each new residue is too expensive, perhaps it could be done starting after the 20th residue has been added and then again after every 5th, or some similar scheme. I suggest starting this only after 200 (or however many) folds because I'm guessing the energy is expensive to calculate, so it would be better to quickly build up enough complete folds to have a reasonably good energy to serve as the basis for comparison.
I imagine this would speed things up because less time would be wasted resolving atomic clashes, which seem (based on staring at the screensaver) to occur more frequently after the first few dozen residues have been added, and of course many iterations would be interrupted, so generation 0 would complete more quickly than it does now.
If RMSD is used to choose the best structure, it seems to me that this approach of interrupting structures in gen. 0 should be even easier to implement.
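To make the suggestion concrete, here is a toy sketch in Python of the loop I'm imagining. The per-residue 'energy' is just random noise standing in for the real term, and all the numbers are guesses; I obviously don't know the client's internals.

Code:

import random

WARMUP = 200       # folds to finish normally before aborting kicks in
CHECK_START = 20   # first residue at which partial energy is checked
CHECK_EVERY = 5    # re-check interval after that

def generation_zero(n_residues=100, iterations=10000):
    best_energy = float("inf")
    completed = aborted = 0
    for i in range(iterations):
        energy = 0.0
        for n in range(1, n_residues + 1):
            energy += random.gauss(0.0, 1.0)   # stand-in for one residue's term
            if (i >= WARMUP and n >= CHECK_START
                    and (n - CHECK_START) % CHECK_EVERY == 0
                    and energy > best_energy):
                aborted += 1
                break                          # skip the rest of this fold
        else:
            completed += 1
            best_energy = min(best_energy, energy)
    return best_energy, completed, aborted     # the best fold would seed gen. 1

print(generation_zero())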
So many questions...
Michael Matisko
= = = = = = = = = = = =
http://www.distributedfolding.org/details.html
I made the most structures. Do I win?
It is not quantity, but quality that matters with our project. The 'lowest RMSD' is what you should get excited about as this is where the real science is. Generating large numbers of poor structures doesn't achieve anything. Of course, it is all random, you have no control over the generated structures, that's the whole point of creating this massively distributed task. To explain this briefly, we are sampling from the many possible conformations of a protein, and testing to see how much sampling is required to get something that looks like the true structure. ... There are literally trillions of possible shapes such a protein could take, hence the need for massive sampling.
http://www.distributedfolding.org/phaseiifaq.html
... Generation zero will remain entirely random, producing 10000 structures. The best structure from this generation is chosen to serve as the basis for the next generation. Note that the best structure is selected based on either RMSD to native structure or crease energy (will vary from version to version as we test the algorithm more). Taking the best structure and generating 50 near-neighbour structures produces generation one. ... Only the best structure for each generation is uploaded to the server.
Re: questions about energy, generation 0, and a cheeky suggestion
Quote:
Originally posted by Blackout
Hello,
I have a few questions about how energy and best structure are determined.
...
Just a 'few' questions, huh? Sounds like you need a biochemistry textbook, or perhaps a read of Chapter 1 of my thesis is in order (available in the About->Stuff section of the web site). I'm not going to give you a biology lesson here, but I'll try to briefly address some of your questions. See the educational section in this forum as well (make sure you set it to view ALL messages).
The best structure in gen. 0 is the one with the best 'score' - a combination of compactness, agreement of the secondary structure with the prediction, and pseudo-energy. The energy is a contact energy, similar to many commonly used in the protein folding field, and is quick to compute: it comes from pairs of residues that are close together in space interacting with each other.
The pseudo-energy cannot be computed until the protein is completely built. See the Results section of the website for sample energy vs RMSD plots. RMSD 0 = folded, RMSD 20+ = unfolded.
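As an illustration only (the cutoff, sequence-separation rule, and scoring table below are invented for the example, not our actual parameters), a contact energy of this general sort looks like:

Code:

import numpy as np

def contact_energy(ca_coords, residue_types, pair_table,
                   cutoff=8.0, min_seq_sep=3):
    # Toy contact pseudo-energy: sum a pair score over residue pairs
    # whose C-alpha atoms lie within `cutoff` angstroms and that are
    # at least `min_seq_sep` apart in sequence. `pair_table` would be
    # a 20x20 lookup of residue-type interaction scores.
    total = 0.0
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_seq_sep, n):
            if np.linalg.norm(ca_coords[i] - ca_coords[j]) < cutoff:
                total += pair_table[residue_types[i]][residue_types[j]]
    return total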
I am not sure whether you understand what RMSD is, so you might need clarification on this as well. We are currently folding proteins whose true structure is already known, but for true prediction you cannot compute RMSD (the distance to the true structure).
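In case it helps: RMSD is the root-mean-square distance between matched atoms of two superposed structures. A minimal sketch, assuming the two structures are already optimally superposed (a real comparison first aligns them, e.g. with the Kabsch algorithm):

Code:

import numpy as np

def rmsd(coords_a, coords_b):
    # RMSD between two conformations of the same protein, given as
    # (N, 3) arrays of matched atom coordinates, assumed superposed.
    return float(np.sqrt(np.mean(np.sum((coords_a - coords_b) ** 2, axis=1))))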
I also think you misunderstand the point of how the present algorithm works. Gen. 0 acts to seed the simulation - to find a suitable starting point for the rest of the run. Making it shorter would be foolish: it would lessen our chances of finding a good starting structure, and a poorly chosen one would ruin the whole 250 generations. On the other hand, we can almost guarantee a 'decent' starting structure will be found in 10000 samples. We do not know beforehand what a 'decent' score will be, though, so we must just make our 10000 and take the best. Energy scores can vary widely in magnitude from protein to protein.
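To put a rough number on 'almost guarantee' (assuming, purely for illustration, that 'decent' means the best 0.1% of random folds):

Code:

q = 0.001   # assumed: 'decent' = the best 0.1% of random folds
n = 10000
print(1 - (1 - q) ** n)   # chance of at least one decent fold: ~0.99995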
Hope this clears things up a bit and points you in the right direction for further reading.
Re: deep discount questions
Quote:
Originally posted by Blackout
.
.
.
A couple of final, unrelated, questions (there was a sale on questions at the department store and I can't resist a bargain): I've never read any mention that there might be "useless proteins". I was just thinking, creatures evolve and are therefore in a state of flux. Is it fair to say that proteins go through gradual stages on their way to becoming very efficient? Or is it more likely that there are dramatic, random mutations, and presto, there's a nifty new protein in the species. And is it the case that some proteins that were needed by ancestors who are very different from the current creature are still produced, even though their function is no longer needed?
Yours in folding,
Michael Matisko
Hi Michael,
I like the last set of questions! My take on it, after reading up on evolutionary theory, is that this is part of the explanation for why we have so many copies of single genes interspersed within and amongst chromosomes (as well as more than one chromosome to boot). Having extra copies allows some to undergo mutation while sufficient quantities of the unmutated, needed protein remain around. The same can be said about having an entire extra copy of a chromosome. Then there's the fact that the mutation rate along a chromosome isn't evenly distributed: genes get translocated, and via natural selection the genes that most affect an organism's survival are unlikely to end up in the high-mutation regions. Of course, the splitting of genes into exons (separated by introns) will also lessen the effects of mutations, since it spreads a gene's coding sequence out along the chromosome.

Another thought: folding doesn't necessarily occur in isolation. Sometimes there are chaperone proteins that can affect which state a protein ends up in (as can things like pH, temperature, and the presence and type of sugars). Isn't one of the reasons prions are so nasty that they act like catalysts for misfolding? Hopefully, algorithms and models can be developed in the future to investigate these other properties as well.
Anyway, good luck in your reading.
prok
Re: deep discount questions
Quote:
Originally posted by Blackout
First, thanks for the reply Howard.
Might there still be a need for ab initio for designing brand new proteins?
"improving" generations. Rather than reduce the 10,000 iterations in gen 0, I was hoping it would be possible to monitor the quality of each protein fold as it is being constructed, and if it becomes worse than the current best fold, then there would be no need to complete it. In other words, I meant interrupting individual folds, not the whole batch of 10,000.
A couple of final, unrelated, questions (there was a sale on questions at the department store and I can't resist a bargain): I've never read any mention that there might be "useless proteins". I was just thinking, creatures evolve and are therefore in a state of flux. Is it fair to say that proteins go through gradual stages on their way to becoming very efficient? Or is it more likely that there are dramatic, random mutations, and presto, there's a nifty new protein in the species. And is it the case that some proteins that were needed by ancestors who are very different from the current creature are still produced, even though their function is no longer needed?
Yours in folding,
Michael Matisko
Good questions.
Yes, ab initio COULD always be useful for protein design; however, design is usually done the other way around, by choosing a fold first and then exploring 'sequence space', which is a bit more efficient. Ab initio could be used to predict the effects of mutations, though, if it were accurate enough.
I see what you meant for gen 0 now. Since each structure can backtrack while it is being built, we cannot 'end early' just because it has a poor score partway through. More to the point, each structure usually takes less than a second to complete, so computing the energy every 10 (or whatever) residues would likely slow things down, not speed them up, even if it led to aborted structures. This general issue has come up before: it is better to generate and keep everything, then toss out the garbage, than to try to generate only 'good' structures.
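To put a rough number on the check overhead (invented values, not measurements from the client): a naive contact energy is O(n^2) in the residues placed so far, so recomputing it at every checkpoint costs several times more pair evaluations than scoring once at the end. An incremental update could be cheaper, but backtracking would invalidate it anyway.

Code:

def pair_evaluations(n_residues=100, check_every=10):
    # Residue-pair evaluations for a naive O(n^2) contact energy:
    # once at the end vs. recomputed from scratch at each check point.
    final_only = n_residues * (n_residues - 1) // 2
    with_checks = sum(n * (n - 1) // 2
                      for n in range(check_every, n_residues + 1, check_every))
    return final_only, with_checks

print(pair_evaluations())   # (4950, 18975): checking costs ~3.8x the final scoring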
Only about 30-40% (don't quote me on that) of proteins have known or purported functions. The remaining ones may do nothing useful, or we may simply not have found out what they do yet. Obviously there's no way to know which! It is very likely that at least some proteins are left over from evolutionary ancestors and are no longer needed. However, many of these will mutate over time, and assuming the mutation isn't harmful (which is unlikely if the protein is useless), the protein will eventually mutate into a non-coding piece of DNA and cease to be transcribed. At this point it will effectively be 'dead'. So probably most transcribed proteins DO have some function.
Generally mutations are very slow and gradual, but I believe there are cases where evolution has been observed to 'accelerate' for no obvious reason, such as in the development of eyes (but again, check a textbook, don't quote me on this).
Hope that clears things up some more, and feel free to share any other ideas you have.
Re: Re: deep discount questions
Quote:
Originally posted by Brian the Fist
Good questions.
.
.
.
Only about 30-40% (don't quote me on that) of proteins have known or purported functions. The remaining ones may do nothing useful, or we may simply not have found out what they do yet. Obviously there's no way to know which! It is very likely that at least some proteins are left over from evolutionary ancestors and are no longer needed. However, many of these will mutate over time, and assuming the mutation isn't harmful (which is unlikely if the protein is useless), the protein will eventually mutate into a non-coding piece of DNA and cease to be transcribed. At this point it will effectively be 'dead'. So probably most transcribed proteins DO have some function.
Generally mutations are very slow and gradual, but I believe there are cases where evolution has been observed to 'accelerate' for no obvious reason, such as in the development of eyes (but again, check a textbook, don't quote me on this).
.
.
.
Sorry, just had to quote... :D
Okay, seriously now. This is somewhat off the original topic, but maybe pertinent to the SGA (simple genetic algorithm) for DF when it is implemented. From hierarchical selection concepts in evolutionary theory, many of the non-coding schemas seen in DNA are probably the result of selective pressures coming back down to the genetic level to prevent mutations from affecting the higher levels of selection.

Consider an organism that has evolved a new schema involving teaming up with other organisms to form a single multi-cellular organism. Once this new organism settles into its new niche and schema (becomes rather fit to the environment), it is also going to be under selective pressure to keep the cell-line level (the selective level under it) from changing so much that it disrupts the new schema. Therefore genetic solutions that minimize the impact of mutations at the cellular level will arise (such as transposons, introns and exons, multiple copies, etc.). This same sort of level-based selection extends from the gene to the cell, the organ, the individual, the group, the species, and finally the clade. Each level exerts selective pressure on the levels below it to minimize changes and their effects on the level above. At least that's my take on hierarchical selection.
So what does this have to do with DF and its implementation of an SGA (when it shows up :) )? Well, level-based selection could be implemented to capture good schemas as they are found and to protect them from mutations that might harm the fitness of the solution. I'm thinking of multiple copies here, and non-uniform mutation-rate zones along the coded chromosome. Good schemas would migrate to the lower-mutation-rate zones, and having multiple copies would allow some exploration by mutation while preserving the good folded structure. Also, to get around needing nearly optimally tweaked SGA parameters, you may want to implement an island or punctuated-equilibrium process in your SGA (see the DHEP project). Each island could have slightly different mutation-rate zones and crossover rates, and could even encode these (say, on another chromosome) so that they co-evolve along with the structure schemas.
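Here's a minimal island-model sketch in Python of the kind of thing I mean. Everything in it is invented for illustration - the toy fitness, the per-position mutation rates standing in for 'zones', and the ring migration - so it's one reading of the idea, not a claim about how DF would actually do it.

Code:

import random

def island_ga(fitness, genome_len=50, n_islands=4, pop_size=20,
              generations=200, migrate_every=25):
    # Toy island-model GA: each island keeps its own per-position
    # mutation rates (crude 'mutation-rate zones'), and the best
    # individual migrates around a ring every few generations.
    islands = []
    for _ in range(n_islands):
        pop = [[random.random() for _ in range(genome_len)]
               for _ in range(pop_size)]
        rates = [random.uniform(0.005, 0.05) for _ in range(genome_len)]
        islands.append((pop, rates))
    for g in range(1, generations + 1):
        for pop, rates in islands:
            pop.sort(key=fitness, reverse=True)        # elitist: keep the top half
            parents = pop[:pop_size // 2]
            children = []
            while len(children) < pop_size - len(parents):
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, genome_len)  # one-point crossover
                child = a[:cut] + b[cut:]
                for i, r in enumerate(rates):          # zone-dependent mutation
                    if random.random() < r:
                        child[i] = random.random()
                children.append(child)
            pop[pop_size // 2:] = children
        if g % migrate_every == 0:                     # ring migration of the best
            bests = [max(pop, key=fitness) for pop, _ in islands]
            for i, (pop, _) in enumerate(islands):
                pop[random.randrange(pop_size)] = bests[i - 1][:]
    return max((max(pop, key=fitness) for pop, _ in islands), key=fitness)

# Toy run: maximise the sum of the genome.
print(round(sum(island_ga(fitness=sum)), 2))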
So what do you think Howard?
prok