crease energy

**Brian the Roman** · 03-19-2003, 08:23 AM

Howard;

I've been looking at the crease energy (which I understand is our best method of picking the best conformation) and comparing it to the rmsd. When examining these it becomes very clear that we are handling the sampling side of the folding problem much better than we are handling the scoring side.

I say this because the rmsd graphs tend to go down in a nice basically even curve, from sharply vertical to basically horizontal, which is what you would expect based on a consistent scoring method. However, the crease energy over the same set of results is varying wildly. I conclude from this that not only is our yardstick for measuring the elevation of this spot on the world inaccurate (using the find the lowest spot on the earth analogy), it is also highly inconsistent - that is if you measure two spots that are pretty close in elevation, our measuring tool will often give highly different results.

To make matters worse, I'm pretty sure reality is actually worse than the graphs portray simply because we're using the rmsd to eliminate all of the worst results from each generation right now. So when we go ab initio and use crease energy we'll probably be throwing out the best value of each generation. So, I now think I understand why you said earlier in reference to the CASP5 results that it was clear we needed a better scoring algorithm.

It seems to me, however, that our guided sampling method is now so far ahead of the scoring that it is virtually pointless to continue to fine tune the sampling side until the scoring side catches up.

I'm assuming that you have reached similar conclusions, and yet to all appearances you still seem to be fine tuning the sampling method. I conclude from this that you don't know how to improve the scoring functions, but you have effort available so you might as well apply it usefully somewhere and the only place to do so is the sampling side.

I don't mean to be totally discouraging here (but I must admit I'm feeling that way myself), but it seems to me that we either need to find a way to start focussing our efforts on the areas that really matter (scoring) or wait until some bright guy comes up with an idea to do so.

One thing we could do, I suppose, is to apply all of our current scoring methods to every structure we sample of a known protein, record all the results, and then do some analysis to see which algorithms do best and whether we can mix and match approaches from within one algorithm and apply them inside another. Yes, this would vastly decrease our sample size, but we already know we can handle this side of things once we get the scoring worked out.
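A minimal sketch of that kind of analysis, assuming each sampled structure of a known protein has been recorded with its RMSD and with one value per scoring function. It ranks the scoring functions by how well they track RMSD, using a Spearman rank correlation. The function names and the data layout are hypothetical and not part of the project's code.

```python
import math
from typing import Dict, List, Sequence, Tuple


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rank_scoring_functions(
    rmsd: Sequence[float], scores: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Sort scoring functions by how consistently a lower score means lower RMSD."""
    results = [(name, spearman(values, rmsd)) for name, values in scores.items()]
    return sorted(results, key=lambda item: item[1], reverse=True)
```

A correlation near 1 means the score orders structures much as RMSD does; a value near 0 is the "wildly varying" behaviour described above.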

Thoughts?

ms

**Brian the Fist** · 03-19-2003, 10:46 AM

You are indeed correct and I agree with most of what you say. However, we have found that the crease energy can perform better than it may initially appear if used properly. Just for reference, the crease energy of the native structure for the protein in the beta test is -4800, which none of the 5A RMSD structures have gotten close to yet. It is a great fallacy to draw conclusions on a scoring function after testing it on just a single protein.

We have thoroughly tested all our current scoring functions on a set of 17 different proteins of distinct folds to get a better idea of how they perform. Nevertheless crease energy is far from ideal.

However, as you may or may not know, this project is for the most part a chunk of my PhD thesis, which deals mostly with the sampling problem of protein folding. While I look at scoring as well, it being inextricably bound to sampling, it is not my major focus. That said, others in the lab are looking at the scoring problem in more detail, and their findings are incorporated into DFP as our scoring functions are tested and improved.

The beauty of the algorithm is it is trivial to 'plugin' a new scoring function, once we have one to try, without changing anything else in the algorithm. This will allow for rapid testing of new scoring functions on a large sample set after our preliminary testing on smaller sets of various protein folds.
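To illustrate what a pluggable scoring function could look like, here is a minimal sketch. It is not the DFP code: the `Structure` type, the function names and the two toy scores are assumptions made for illustration. The RMSD here skips superposition and assumes the structures are already aligned.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
Structure = Sequence[Point]                # hypothetical: one coordinate per residue
ScoreFn = Callable[[Structure], float]     # lower score means better structure


def radius_of_gyration(structure: Structure) -> float:
    """Toy stand-in for a predictive score: favours compact structures."""
    n = len(structure)
    centre = tuple(sum(p[axis] for p in structure) / n for axis in range(3))
    return math.sqrt(sum(math.dist(p, centre) ** 2 for p in structure) / n)


def make_rmsd_score(native: Structure) -> ScoreFn:
    """Build the RMSD 'cheat' score against a known native structure.

    Simplification: no superposition step, coordinates are compared as given.
    """
    def score(structure: Structure) -> float:
        if len(structure) != len(native):
            raise ValueError("structure and native must have the same length")
        total = sum(math.dist(a, b) ** 2 for a, b in zip(structure, native))
        return math.sqrt(total / len(native))
    return score


def next_generation(
    candidates: Sequence[Structure], score_fn: ScoreFn, keep: int
) -> List[Structure]:
    """Keep the best `keep` candidates; the sampler never looks inside score_fn."""
    return sorted(candidates, key=score_fn)[:keep]
```

Swapping `make_rmsd_score(native)` for `radius_of_gyration`, or for any other function with the same signature, changes the selection without touching `next_generation`.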

**FEEDB0B0** · 03-19-2003, 11:18 AM

I thought I would take a rare opportunity to respond here too, in addition to Howard's comments.

Not long ago, Howard put detailed protein folding data from our beta project up. This data is keenly interesting. The values of protein scoring functions are plotted across the protein folding generation axis (the folding axis). If you look at that data, you can immediately appreciate the difficulty with scoring functions for protein folding. We are all seeing, in real time, the evolving inconsistencies in how Crease Energy (and all the other terms too!) behave when RMSD is used as a scoring function. We have known these inconsistencies are there but now it is crystal clear. So far this "beta" is giving us very useful scientific results already, and it is also helping educate volunteers like yourself about the problems of protein folding.

All along, we have divorced the two problems of protein folding - sampling and scoring - from one another. This is not the approach of other projects. We are using a divide-and-conquer approach. Divide protein folding into separable, computable problems - sampling and scoring. Conquer sampling first by making it independent of scoring variables. That is why we use probability methods in our sampling methods. After that we will work on conquering scoring, knowing that it is not affected by the choices we make in sampling. This is tricky to achieve, but rigorous attention to the separation of the two problems is bearing fruit here. Scoring must work independently of Sampling. Sampling must work independently of Scoring. Only then may they be combined to achieve predictive results.

So it is very gratifying to hear you mention that our sampling seems far ahead of the scoring. This makes me very happy to hear, as it validates our approach. However, I do need to argue why we need to continue working on sampling. One problem is that we know that the Crease Energy term works best when the protein fold is under 5 Angstroms RMSD. So far we seem to be hitting a wall in the Beta test code - we aren't really ever going under 5 Angstroms, even with RMSD as a scoring function "cheat". If that is the case, then the sampling is interfering with the scoring. So we must fix sampling to make it work better before we can expect the scoring function (Crease Energy or any other predictive scoring function) to zero in on the best structure.

The 5 Angstrom-ish "sampling wall" may be a property of how we compute near-neighbors in each generation. We may need to sample more members per generation when we have compact structures. This will be addressed in a few more beta roll-outs. There is also a problem of "vanishing" secondary structure that is apparent in the data using the RMSD scoring function. Real proteins form secondary structure early in folding and keep it throughout the folding process. We are losing it! So we must devise a way to retain it, but still allow the protein to move and fold. This problem may cause the 5 Angstrom RMSD "wall" because the secondary structure isn't packing as closely as it needs to. We may try holding the secondary structure "fixed" as selected in the random sampling phase, and letting the protein move only at the other parts (bends) of the chain. That is on Howard's to-do list.
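One way to picture the "hold secondary structure fixed, move only the bends" idea is the sketch below. It is an illustration under stated assumptions, not the planned implementation: it assumes a conformation is a list of (phi, psi) backbone angles in degrees, that each residue carries a secondary structure label ('H' helix, 'E' strand, anything else treated as a flexible bend), and that a move is a uniform random nudge of at most `max_step` degrees. All names are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Angles = Tuple[float, float]    # (phi, psi) in degrees
FIXED_STATES = {"H", "E"}       # assumed labels for helix and strand residues


def wrap(angle: float) -> float:
    """Wrap an angle in degrees into the [-180, 180) range."""
    return ((angle + 180.0) % 360.0) - 180.0


def perturb_flexible_residues(
    angles: Sequence[Angles],
    secondary_structure: str,
    max_step: float = 30.0,
    rng: Optional[random.Random] = None,
) -> List[Angles]:
    """Return a new conformation where only non-helix, non-strand residues move."""
    if len(angles) != len(secondary_structure):
        raise ValueError("need one secondary structure label per residue")
    rng = rng or random.Random()
    moved: List[Angles] = []
    for (phi, psi), state in zip(angles, secondary_structure):
        if state in FIXED_STATES:
            # Secondary structure chosen during random sampling is kept as is.
            moved.append((phi, psi))
        else:
            # Bend residues get a small random change to both angles.
            moved.append((
                wrap(phi + rng.uniform(-max_step, max_step)),
                wrap(psi + rng.uniform(-max_step, max_step)),
            ))
    return moved
```

Because helix and strand residues are copied unchanged, any secondary structure present in a parent conformation survives into every child generated from it.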

So having said all that, Howard and I are still working on exactly what experiments we roll out for Distributed Folding using the new algorithm. We have NOT yet taken a final decision that folding proteins with CREASE energy is what we will do when we roll out the new algorithm. We will clearly articulate to the group what experiments we will roll out with the new algorithm, whether we are using real scoring functions or RMSD, and what proteins we will be targeting, so that everyone knows what we are computing, and how the data will be used and made available. We may be focusing on computing movies that can in turn be used to make new scoring functions by ourselves or other groups who want to collaborate with us.

Anyhow, thank-you for contributing your cycles to helping us solve the protein folding problem - every little bit helps! The beta system is providing a lot of new insight on the problem, and you folks are great beta testers! Do not be discouraged, as we are making great strides here!

Christopher Hogue

(a.k.a. Howard's Boss)
Scientist, Samuel Lunenfeld Research Institute and
Assistant Professor, Dept. of Biochemistry, University of Toronto

**cygnussphere** · 03-19-2003, 03:29 PM

Just jumping in here to re-affirm the warm fuzzies I get from the privilege of helping these and all other Good Humans who have chosen to use the gift of the grey matter between their ears for the benefit of the rest of humanity.
I look forward to the day when we have put our collective "Boot" in the "Ass" of the Protein folding problem and can unleash the "Fruit of Folding" on all of the unsuspecting diseases, genetic disorders and other biological research that need to be taught the lesson of the futility of resisting the motivated brains of the Good Humans!

**Brian the Roman** · 03-19-2003, 05:54 PM

Originally posted by Brian the Fist

> However, as you may or may not know, this project is for the most part a chunk of my PhD thesis, which deals mostly with the sampling problem of protein folding.

No I was not aware of that and it certainly explains the sampling lean.

> The beauty of the algorithm is it is trivial to 'plugin' a new scoring function, once we have one to try, without changing anything else in the algorithm. This will allow for rapid testing of new scoring functions on a large sample set after our preliminary testing on smaller sets of various protein folds.

Yes, I agree that is a real benefit of your approach. As I mentioned earlier you may want to consider a multi-scoring function approach to get some extra guidance.

**Brian the Roman** · 03-19-2003, 06:18 PM

Chris;
First, thanks for your prompt response.

Second, I hope my earlier post does not cause any volunteers to leave the project - that certainly was NOT my intent.

Yes, I agree the beta is contributing useful info. Furthermore I'm confident the sampling process can be stream-lined even more. I have seen a number of ideas tossed around this forum that could fine-tune the sampling side significantly beyond its current abilities. (Like AMD_*'s idea of using some clients to produce the generations quickly by giving up early, and then passing the most promising conformations on for more detailed study by other clients). So we are definitely making progress on the sampling side.

I was not aware of the 5A 'wall' and that does suggest extra sampling is worth it to see what happens once we get below it. The only issue, of course, is how will we know when we get below it once we go ab initio. Not getting below 5A may mean that, relatively speaking, there are extremely few conformations below that value.

I should mention that I don't consider it necessary for a beta to be used to test other scoring functions. If you wanted to plug in a different scoring function during normal processing that would be fine.

Any progress you make on the scoring side would be of great interest to me.

thanks

ms

**AMD_is_logical** · 03-20-2003, 05:21 PM

An important difference between RMSD and crease energy is that RMSD places no value on secondary structure (at larger values of RMSD) while crease energy encourages these low energy structures. My guess is that once we use crease energy as the scoring functions, along with using a hundred or so structures per generation, then secondary structure will form and hang around without any special effort.

**Brian the Fist** · 03-20-2003, 06:23 PM

I just want to comment that I think the graphs that I added recently have really made the (beta) project more interesting for everyone, and it sounds like they've helped everyone get a better grasp on exactly what it is we are doing/trying to do. It has helped get some of you more excited about the protein folding problem (as we are) and I'm glad of that because it really is a very interesting and complex problem. I hope the excitement will carry through to the rest of the users as well when we finish the beta. Then we would have the added benefit of not only making progress on the protein folding problem, but also educating the public on it and sparking more general interest in it. As many of you can see, in principle the problem is very simple and easy to understand. It is in the details and the vastness of conformational space where it quickly becomes difficult.

Anyways, I just wanted to comment on everyone's renewed interest in the science, and don't be shy to give out ideas. Who knows, maybe you will see a pattern in the data that we miss? Although we've looked at a lot of the different scoring functions and related values already, and their performances on a wide variety of protein folds, and we already have a pretty good idea of what works and what doesn't, we certainly haven't tried everything yet and someone out there could very well have a great idea we haven't thought of yet.

Keep up the good work

**Brian the Roman** · 03-21-2003, 06:29 AM

Howard;
as I've indicated before, it's primarily the science that interests me as that is what will solve the protein folding problem in the end.

Since we seem to be agreed that the intractable scoring issue is our biggest concern at present, I'm planning on putting some thought into it rather than the sampling side.

I find, however, that based on my knowledge limitations, the sampling side is actually easier for me to contribute to.

My one idea at present with respect to scoring is basically to apply multiple scoring algorithms in 'layers'. That is, use your fastest scoring algorithm first to quickly identify a sample pool of structures. We already do this as gen 0. Pick out the best of this pool (ideally the pool is from the entire project, not just a single client) and then apply multiple scoring algorithms to them, with a weighting assigned to each algorithm based on our opinion as to its quality. The best conformations based on these results then get fed into the gen 1 clients, which basically start the whole process over again.

This process implies several things:
1) Some clients would need to apply the multiple scoring methods while others are focussed on getting more samples. This could be done on a client by client basis or by having the client do the multiple scoring at the end. The difference here between what the beta does and this approach is that you wouldn't only apply the different algorithms to the single best structure you've found so far but to the best x%.
2) The server will likely have to direct the clients as to what structure to continue sampling around next. Once you've gone to all the trouble of getting high-quality scores for the top x% we don't want to throw most of it away.
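The layered idea above can be sketched roughly as follows. This is only an illustration: the function name, the `top_fraction` and `keep` parameters, and the choice to combine the slower scores by weighted rank (so that scores on different scales can be mixed) are all assumptions, not anything the project has implemented.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # whatever object represents a sampled structure


def layered_select(
    pool: Sequence[T],
    fast_score: Callable[[T], float],
    weighted_scores: Sequence[Tuple[float, Callable[[T], float]]],
    top_fraction: float = 0.1,
    keep: int = 1,
) -> List[T]:
    """Two-layer selection; every score is treated as 'lower is better'.

    Layer 1: the fast score cuts the pool down to its best `top_fraction`.
    Layer 2: each slower score ranks the shortlist, and the ranks are summed
             using the weight assigned to that scoring function.
    """
    if not pool:
        return []
    n_top = max(1, math.ceil(len(pool) * top_fraction))
    shortlist = sorted(pool, key=fast_score)[:n_top]

    combined = [0.0] * len(shortlist)
    for weight, score_fn in weighted_scores:
        values = [score_fn(s) for s in shortlist]
        order = sorted(range(len(shortlist)), key=lambda i: values[i])
        for rank, index in enumerate(order):
            # Ties are broken by shortlist order, which is arbitrary here.
            combined[index] += weight * rank

    best = sorted(range(len(shortlist)), key=lambda i: combined[i])
    return [shortlist[i] for i in best[:keep]]
```

The structures returned would be the ones a server hands back to clients as starting points for the next generation, so the expensive scores computed for the top x% are not thrown away.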

ms

**Brian the Fist** · 03-21-2003, 10:29 AM

What you describe is something like the genetic algorithm we have planned, but this will be for 'phase III', if the method we are currently working on proves to be insufficient (we're not about to abandon the current beta approach just yet of course...)
