questions about energy, generation 0, and a cheeky suggestion
Hello,
I have a few questions about how energy and best structure are determined.
The Phase II FAQ states, "Only the best structure for each generation is uploaded to the server." Does that apply to generation 0 as well?
How is the best structure from gen. 0 currently selected? Is it based on the lowest RMSD or best energy?
How is the energy calculated and how important is the accuracy?
And is it possible to begin calculating or estimating the energy of the protein as it is being assembled? How different is the energy of a protein between its native state and a 'typical' unfolded state?
If the energies are very different, and if energy can be calculated during assembly (and if not all structures from gen. 0 are uploaded), might it be possible to speed up generation 0 by doing the following?
First, an assumption: I assume that for each of the 10,000 iterations, after the protein fold is completed, the energy is calculated and compared to the current best energy, which is updated if the new fold is better. After all 10,000 are done, the best fold is used as the seed to start generation 1.
The suggestion: after the first 200 (or 500) folds are complete, start calculating the energy while each protein is being built up. If it exceeds the current best energy, the rest of that fold is skipped. If calculating the energy after each new residue is too expensive, perhaps it could be done starting after the 20th residue has been added and then again after every 5th, or some similar scheme. I suggest starting this only after 200 (or however many) folds because I'm guessing the energy is expensive to calculate, so it would be better to quickly build up enough complete folds to have a reasonably good energy to serve as the basis for comparison.
I imagine this would speed things up because less time would be wasted resolving atomic clashes, which seem (based on staring at the screensaver) to occur more frequently after the first few dozen residues have been added, and of course many iterations would be interrupted, so generation 0 would complete more quickly than it does now.
If RMSD is used to choose the best structure, it seems to me that this approach of interrupting structures in gen. 0 should be even easier to implement.
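To make the suggestion concrete, here is a toy sketch in Python of the loop I'm imagining. The per-residue 'energy' is just random noise standing in for the real term, and all the numbers are guesses; I obviously don't know the client's internals.

Code:

import random

WARMUP = 200       # folds to finish normally before aborting kicks in
CHECK_START = 20   # first residue at which partial energy is checked
CHECK_EVERY = 5    # re-check interval after that

def generation_zero(n_residues=100, iterations=10000):
    best_energy = float("inf")
    completed = aborted = 0
    for i in range(iterations):
        energy = 0.0
        for n in range(1, n_residues + 1):
            energy += random.gauss(0.0, 1.0)   # stand-in for one residue's term
            if (i >= WARMUP and n >= CHECK_START
                    and (n - CHECK_START) % CHECK_EVERY == 0
                    and energy > best_energy):
                aborted += 1
                break                          # skip the rest of this fold
        else:
            completed += 1
            best_energy = min(best_energy, energy)
    return best_energy, completed, aborted     # the best fold would seed gen. 1

print(generation_zero())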
So many questions...
Michael Matisko
= = = = = = = = = = = =
http://www.distributedfolding.org/details.html
I made the most structures. Do I win?
It is not quantity, but quality that matters with our project. The 'lowest RMSD' is what you should get excited about as this is where the real science is. Generating large numbers of poor structures doesn't achieve anything. Of course, it is all random, you have no control over the generated structures, that's the whole point of creating this massively distributed task. To explain this briefly, we are sampling from the many possible conformations of a protein, and testing to see how much sampling is required to get something that looks like the true structure. ... There are literally trillions of possible shapes such a protein could take, hence the need for massive sampling.
http://www.distributedfolding.org/phaseiifaq.html
... Generation zero will remain entirely random, producing 10000 structures. The best structure from this generation is chosen to serve as the basis for the next generation. Note that the best structure is selected based on either RMSD to native structure or crease energy (will vary from version to version as we test the algorithm more). Taking the best structure and generating 50 near-neighbour structures produces generation one. ... Only the best structure for each generation is uploaded to the server.
Re: questions about energy, generation 0, and a cheeky suggestion
Quote:
Originally posted by Blackout
Hello,
I have a few questions about how energy and best structure are determined.
...
Just a 'few' questions, huh? Sounds like you need a biochemistry textbook, or perhaps a read of Chapter 1 of my thesis is in order (available in the About->Stuff section of the web site). I'm not going to give you a biology lesson here, but I'll try to briefly address some of your questions. See the educational section in this forum as well (make sure you set it to view ALL messages).
The best structure in gen. 0 is the one with the best 'score' - a combination of compactness, agreement of the secondary structure with the prediction, and pseudo-energy. The energy is a contact energy, similar to many commonly used in the protein folding field, and is quick to compute: it comes from pairs of residues that are close together in space interacting with each other.
The pseudo-energy cannot be computed until the protein is completely built. See the Results section of the website for sample energy vs RMSD plots. RMSD 0 = folded, RMSD 20+ = unfolded.
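As an illustration only (the cutoff, sequence-separation rule, and scoring table below are invented for the example, not our actual parameters), a contact energy of this general sort looks like:

Code:

import numpy as np

def contact_energy(ca_coords, residue_types, pair_table,
                   cutoff=8.0, min_seq_sep=3):
    # Toy contact pseudo-energy: sum a pair score over residue pairs
    # whose C-alpha atoms lie within `cutoff` angstroms and that are
    # at least `min_seq_sep` apart in sequence. `pair_table` would be
    # a 20x20 lookup of residue-type interaction scores.
    total = 0.0
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_seq_sep, n):
            if np.linalg.norm(ca_coords[i] - ca_coords[j]) < cutoff:
                total += pair_table[residue_types[i]][residue_types[j]]
    return total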
I am not sure whether you understand what RMSD is, so you might need clarification on this as well. We are currently folding proteins whose true structure is already known, but for true prediction you cannot compute RMSD (the distance to the true structure).
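In case it helps: RMSD is the root-mean-square distance between matched atoms of two superposed structures. A minimal sketch, assuming the two structures are already optimally superposed (a real comparison first aligns them, e.g. with the Kabsch algorithm):

Code:

import numpy as np

def rmsd(coords_a, coords_b):
    # RMSD between two conformations of the same protein, given as
    # (N, 3) arrays of matched atom coordinates, assumed superposed.
    return float(np.sqrt(np.mean(np.sum((coords_a - coords_b) ** 2, axis=1))))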
I also think you misunderstand the point of how the present algorithm works. Gen. 0 acts to seed the simulation - to find a suitable starting point for the rest of the run. Making it shorter would be foolish: it would lessen our chances of finding a good starting structure, and a poorly chosen one would ruin the whole 250 generations. On the other hand, we can almost guarantee a 'decent' starting structure will be found in 10000 samples. We do not know beforehand what a 'decent' score will be, though, so we must just make our 10000 and take the best. Energy scores can vary widely in magnitude from protein to protein.
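To put a rough number on 'almost guarantee' (assuming, purely for illustration, that 'decent' means the best 0.1% of random folds):

Code:

q = 0.001   # assumed: 'decent' = the best 0.1% of random folds
n = 10000
print(1 - (1 - q) ** n)   # chance of at least one decent fold: ~0.99995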
Hope this clears things up a bit and points you in the right direction for further reading.
Re: deep discount questions
Quote:
Originally posted by Blackout
.
.
.
A couple of final, unrelated, questions (there was a sale on questions at the department store and I can't resist a bargain): I've never read any mention that there might be "useless proteins". I was just thinking, creatures evolve and are therefore in a state of flux. Is it fair to say that proteins go through gradual stages on their way to becoming very efficient? Or is it more likely that there are dramatic, random mutations, and presto, there's a nifty new protein in the species. And is it the case that some proteins that were needed by ancestors who are very different from the current creature are still produced, even though their function is no longer needed?
Yours in folding,
Michael Matisko
Hi Michael,
I like the last set of questions! My take on it, after reading up on evolutionary theory, is that this is part of the explanation for why we have so many copies of single genes interspersed within and amongst chromosomes (as well as more than one chromosome to boot). Having extra copies allows some to undergo mutation while sufficient quantities of the unmutated, needed protein remain around. The same can be said about having an entire extra copy of a chromosome. Then there's the fact that the mutation rate along a chromosome isn't evenly distributed: genes get translocated, and via natural selection the genes that most affect an organism's survival are unlikely to end up in the high-mutation regions. Of course, the splitting of genes into exons (separated by introns) will also lessen the effects of mutations, since it spreads a gene's coding sequence out along the chromosome.

Another thought: folding doesn't necessarily occur in isolation. Sometimes there are chaperone proteins that can affect which state a protein ends up in (as can things like pH, temperature, and the presence and type of sugars). Isn't one of the reasons prions are so nasty that they act like catalysts for misfolding? Hopefully, algorithms and models can be developed in the future to investigate these other properties as well.
Anyway, good luck in your reading.
prok
Re: deep discount questions
Quote:
Originally posted by Blackout
First, thanks for the reply Howard.
Might there still be a need for ab initio for designing brand new proteins?
"improving" generations. Rather than reduce the 10,000 iterations in gen 0, I was hoping it would be possible to monitor the quality of each protein fold as it is being constructed, and if it becomes worse than the current best fold, then there would be no need to complete it. In other words, I meant interrupting individual folds, not the whole batch of 10,000.
A couple of final, unrelated, questions (there was a sale on questions at the department store and I can't resist a bargain): I've never read any mention that there might be "useless proteins". I was just thinking, creatures evolve and are therefore in a state of flux. Is it fair to say that proteins go through gradual stages on their way to becoming very efficient? Or is it more likely that there are dramatic, random mutations, and presto, there's a nifty new protein in the species. And is it the case that some proteins that were needed by ancestors who are very different from the current creature are still produced, even though their function is no longer needed?
Yours in folding,
Michael Matisko
Good questions.
Yes, ab initio COULD always be useful for protein design; however, design is usually done the other way around, by choosing a fold first and then exploring 'sequence space', which is a bit more efficient. Ab initio could be used to predict the effects of mutations, though, if it were accurate enough.
I see what you meant for gen 0 now. Since each structure can backtrack while it is being built, we cannot 'end early' just because it has a poor score partway through. More to the point, each structure usually takes less than a second to complete, so computing the energy every 10 (or whatever) residues would likely slow things down, not speed them up, even if it led to aborted structures. This general issue has come up before: it is better to generate and keep everything, then toss out the garbage, than to try to generate only 'good' structures.
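To put a rough number on the check overhead (invented values, not measurements from the client): a naive contact energy is O(n^2) in the residues placed so far, so recomputing it at every checkpoint costs several times more pair evaluations than scoring once at the end. An incremental update could be cheaper, but backtracking would invalidate it anyway.

Code:

def pair_evaluations(n_residues=100, check_every=10):
    # Residue-pair evaluations for a naive O(n^2) contact energy:
    # once at the end vs. recomputed from scratch at each check point.
    final_only = n_residues * (n_residues - 1) // 2
    with_checks = sum(n * (n - 1) // 2
                      for n in range(check_every, n_residues + 1, check_every))
    return final_only, with_checks

print(pair_evaluations())   # (4950, 18975): checking costs ~3.8x the final scoring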
Only about 30-40% (don't quote me on that) of proteins have known or purported functions. The remaining ones may do nothing useful, or we may simply not have found out what they do yet. Obviously there's no way to know which! It is very likely that at least some proteins are left over from evolutionary ancestors and are no longer needed. However, many of these will mutate over time, and assuming the mutation isn't harmful (which is unlikely if the protein is useless), the protein will eventually mutate into a non-coding piece of DNA and cease to be transcribed. At this point it will effectively be 'dead'. So probably most transcribed proteins DO have some function.
Generally mutations are very slow and gradual, but I believe there are cases where evolution has been observed to 'accelerate' for no obvious reason, such as in the development of eyes (but again, check a textbook, don't quote me on this).
Hope that clears things up some more, and feel free to share any other ideas you have.
Re: Re: deep discount questions
Quote:
Originally posted by Brian the Fist
Good questions.
.
.
.
Only about 30-40% (don't quote me on that) of proteins have known or purported functions. The remaining ones may do nothing useful, or we may simply not have found out what they do yet. Obviously there's no way to know which! It is very likely that at least some proteins are left over from evolutionary ancestors and are no longer needed. However, many of these will mutate over time, and assuming the mutation isn't harmful (which is unlikely if the protein is useless), the protein will eventually mutate into a non-coding piece of DNA and cease to be transcribed. At this point it will effectively be 'dead'. So probably most transcribed proteins DO have some function.
Generally mutations are very slow and gradual, but I believe there are cases where evolution has been observed to 'accelerate' for no obvious reason, such as in the development of eyes (but again, check a textbook, don't quote me on this).
.
.
.
Sorry, just had to quote... :D
Okay, seriously now. This is somewhat off the original topic, but maybe pertinent to the SGA (simple genetic algorithm) for DF when it is implemented. From hierarchical selection concepts in evolutionary theory, many of the non-coding schemas seen in DNA are probably the result of selective pressures coming back down to the genetic level to prevent mutations from affecting the higher levels of selection.

Consider an organism that has evolved a new schema involving teaming up with other organisms to form a single multi-cellular organism. Once this new organism settles into its new niche and schema (becomes rather fit to the environment), it is also going to be under selective pressure to keep the cell-line level (the selective level under it) from changing so much that it disrupts the new schema. Therefore genetic solutions that minimize the impact of mutations at the cellular level will arise (such as transposons, introns and exons, multiple copies, etc.). This same sort of level-based selection extends from the gene to the cell, the organ, the individual, the group, the species, and finally the clade. Each level exerts selective pressure on the levels below it to minimize changes and their effects on the level above. At least that's my take on hierarchical selection.
So what does this have to do with DF and its implementation of an SGA (when it shows up :) )? Well, level-based selection could be implemented to capture good schemas as they are found and to protect them from mutations that might harm the fitness of the solution. I'm thinking of multiple copies here, and non-uniform mutation-rate zones along the coded chromosome. Good schemas would migrate to the lower-mutation-rate zones, and having multiple copies would allow some exploration by mutation while preserving the good folded structure. Also, to get around needing nearly optimally tweaked SGA parameters, you may want to implement an island or punctuated-equilibrium process in your SGA (see the DHEP project). Each island could have slightly different mutation-rate zones and crossover rates, and could even encode these (say, on another chromosome) so that they co-evolve along with the structure schemas.
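Here's a minimal island-model sketch in Python of the kind of thing I mean. Everything in it is invented for illustration - the toy fitness, the per-position mutation rates standing in for 'zones', and the ring migration - so it's one reading of the idea, not a claim about how DF would actually do it.

Code:

import random

def island_ga(fitness, genome_len=50, n_islands=4, pop_size=20,
              generations=200, migrate_every=25):
    # Toy island-model GA: each island keeps its own per-position
    # mutation rates (crude 'mutation-rate zones'), and the best
    # individual migrates around a ring every few generations.
    islands = []
    for _ in range(n_islands):
        pop = [[random.random() for _ in range(genome_len)]
               for _ in range(pop_size)]
        rates = [random.uniform(0.005, 0.05) for _ in range(genome_len)]
        islands.append((pop, rates))
    for g in range(1, generations + 1):
        for pop, rates in islands:
            pop.sort(key=fitness, reverse=True)        # elitist: keep the top half
            parents = pop[:pop_size // 2]
            children = []
            while len(children) < pop_size - len(parents):
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, genome_len)  # one-point crossover
                child = a[:cut] + b[cut:]
                for i, r in enumerate(rates):          # zone-dependent mutation
                    if random.random() < r:
                        child[i] = random.random()
                children.append(child)
            pop[pop_size // 2:] = children
        if g % migrate_every == 0:                     # ring migration of the best
            bests = [max(pop, key=fitness) for pop, _ in islands]
            for i, (pop, _) in enumerate(islands):
                pop[random.randrange(pop_size)] = bests[i - 1][:]
    return max((max(pop, key=fitness) for pop, _ in islands), key=fitness)

# Toy run: maximise the sum of the genome.
print(round(sum(island_ga(fitness=sum)), 2))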
So what do you think Howard?
prok