The Science of Distributed Folding?
I was just curious to hear more about the science of distributed folding. I'm a grad student at Berkeley, but hang out at Stanford too (my wife is there), and have heard many of the big names in computational biology and protein structure prediction (Michael Levitt, David Baker, etc.) talk about work similar to what's going on here.
From what I can tell, Distributed Folding is creating a large decoy set, much like the original Park/Levitt decoy set. I guess I'm curious
(1) Why do you need distributed computing to do this? The Park/Levitt set (and other sets, e.g. Baker's) were created with far fewer resources. Are yours bigger or better? Based on the RMSDs you quote, it doesn't seem that way. (For the laymen following along, there's a quick sketch below of how RMSD is computed.)
I just want to make sure that we're not wasting our time here on inefficient code.
(2) Why does one need a big decoy set in the first place? Levitt and Baker have each mentioned that decoy discrimination is the problem, and that a bigger decoy set is not going to help.
Again, just curious whether all this makes any sense to be doing.
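For anyone here who hasn't run into RMSD before: it's just the root-mean-square distance between matching atoms after the two structures are optimally superimposed. Here's a minimal sketch in plain NumPy (toy code with made-up coordinates, not anyone's production pipeline):

Code:
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD (same units as the coordinates, usually Angstroms) between
    two (N, 3) coordinate arrays after optimal superposition."""
    P = P - P.mean(axis=0)                # center both structures
    Q = Q - Q.mean(axis=0)
    V, S, Wt = np.linalg.svd(P.T @ Q)     # Kabsch: SVD of the covariance
    d = np.sign(np.linalg.det(V @ Wt))    # guard against a reflection
    R = V @ np.diag([1.0, 1.0, d]) @ Wt   # optimal rotation
    diff = P @ R - Q
    return np.sqrt((diff ** 2).sum() / len(P))

# Toy usage: a decoy is "good" if its RMSD to the native structure is low.
native = np.random.rand(50, 3) * 10
decoy = native + np.random.normal(0, 0.5, native.shape)
print(round(kabsch_rmsd(decoy, native), 2))

A decoy set is then characterized by how low those RMSDs get and how densely the set samples near-native structures.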
I think your idea to use distributed computing for biology (and proteins in particular) is great and I'm glad you're working on this. I just want to understand why. Thanks! :)
Raj
This isn't a run for the Nobel Prize
Howard is a grad student in bioinformatics, not just biology. As a bona fide computer scientist, he works on computer applications for biology.
Fact:
Computer Scientists don't get Nobel Prizes.
Fact:
Computer Scientists get recognition of their work from other computer scientists and professionals for astonishingly clever ways to solve previously impossible problems.
From Wikipedia: "In complexity theory, the complexity class NP-complete is the set of problems that are the hardest problems in NP."
The statistical prediction of protein folds is a problem that many suspect to be NP-complete. In plain English, "the problem is impossible to solve." Howard and Dr. Hogue believe they have proven that it is in fact solvable. They show in their papers that the problem is logarithmic rather than exponential. Their reasoning looks pretty solid to me. Early indications are that they are doing better than they thought they would. Of course, in the world of computer science, it ain't nothing unless you can actually do it.

From my point of view, it's real science... cutting edge, head-and-shoulders-above-your-peers type science. But as is always the computer scientist's fate, the only thanks they will get is from the users. We crunchers are not Howard's users. His users are all of those other scientists waiting for the shapes of all 30,000 proteins so they can get to work on new cures. They don't care how the protein got that way, they just want to know what it is. If Howard is in fact right, they will have it as quickly as they can get people to help crunch.
As a distributed computing effort, many things about DF are appealing to the people who do the crunching. It has a good solid benefit to humanity if it succeeds. No guarantees that it will succeed, but look at SETI: they haven't found any aliens and might never. It runs nicely on a lot of different machines, so there aren't too many bugs. The researcher even spends his Sundays :notworthy answering questions.

I doubt many of Howard's crunchers understand more than the first sentence of his papers. But they know that when the people associated with an effort are smart, friendly people, when the software works, and when the guy in charge is actively engaged, there is a good chance things are going to be fun on the project. That Howard uses mind-bogglingly huge amounts of computer power simply doesn't bother the farmers, since they think everyone should have lots of computer power. :thumbs: For casual crunchers, it is a cool screen saver that a lot of people like watching. All of these reasons have to do with the fact that it is not the computers that get this job done, it is the people running the computers. None of whom are going to be eligible for a Nobel Prize, but they do it anyway.
Sometimes I think a Nobel Prize for CS would be a good thing. But then I realize that the scientists in fields with big prizes sometimes come down with Nobel Fever and forget that it isn't about a shiny medallion, it's about the advancement of humanity.
Re: This isn't a run for the Nobel Prize
plaidfishes wrote:
"I doubt many of Howards crunchers understand more than the first sentance of his papers."
Which is what is great about this thread! I've learned far more from this thread than from anything presented so far in the FAQs on DF's website or elsewhere. Some of that is because the "elsewhere" overflows with techno-babble (the technical papers), and some because it simply isn't addressed.
Now along comes Raj who seems quite knowledgeable about the technical and scientific aspects of this project and raises questions about its value. He's even been nice enough to answer questions from laymen like me. That Howard, Aegion, and other knowledgeable people have calmly and respectfully responded to Raj's questions is fantastic! Again, I'm learning just what on Earth is being done by this project by way of this thread. A lot of this thread should be put in the FAQs ... especially the DF vs. Folding@Home stuff.
To say the least, I look forward to each email notification that a new reply has been added to this thread and eagerly click on the link to read them. To all that are trying to meaningfully contribute to this discussion, thanks!
Re: This isn't a run for the Nobel Prize
Quote:
Originally posted by plaidfishes
The statistical prediction of protein folds is a problem that many suspect to be NP-complete. In plain English, "the problem is impossible to solve."
Well, not impossible, just REALLY, REALLY HARD. If statistical prediction is indeed NP-complete, then it can't be done in better than exponential time by any known algorithm. This of course doesn't mean it can't be done, just that no one knows how. The interesting thing about NP-completeness is that every NP-complete problem can be translated into any other in polynomial time, so if you find a way to do any (truly) NP-complete problem in polynomial time, you will be able to solve every NP-complete problem in polynomial time.
Which means, if statistical prediction is in fact NP-complete, and Howard found a way to do it in logarithmic time (which is actually better than polynomial)...
:notworthy :notworthy :cheers:
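To put rough numbers on why that would be such a big deal, here's a back-of-the-envelope comparison of the growth rates being thrown around in this thread (illustrative operation counts only, nothing to do with DF's actual cost model):

Code:
import math

# Rough operation counts for a problem of size N under the growth
# rates discussed above (illustrative only -- not DF's cost model).
for n in (10, 100, 300):
    print(f"N={n:3d}  log N ~ {math.log2(n):5.1f}   "
          f"N log N ~ {n * math.log2(n):8.0f}   "
          f"N*N = {n * n:6d}   2^N ~ {2.0 ** n:.1e}")

By N = 300 the exponential column is already beyond any conceivable amount of computing, while the N log N column is still trivial.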
On inefficient code, NP-completeness, and the F@H vs. DF issue
Hi folks,
Ah FUD (Fear, Uncertainty, Doubt).
Let me (oh yeah, that Dr. Hogue guy) weigh in.
Scientific research is, by definition, filled with uncertainty. We don't mind FUD here. Happy to sit through it, be patient, and have our project prove our point. Mind you, we haven't yet published our results from 5 billion sampled structures, so most scientists in the folding community have no idea what we are up to.
DF is only one of my major projects. That's why you don't hear very much from me - I have a very active group. We have had two Science papers and one Nature paper come out in Dec and Jan on the topic of molecular assembly information - the Biomolecular Interaction Network Database (www.binddb.org). I have published recently in Cancer Research on bioinformatics discoveries relevant to mechanisms of DNA damage in colorectal cancer. So you folks should know you aren't dealing with any slouches or also-rans here. We are doing biomedically relevant research. By supporting DF and my group, you support everything we are trying to do, not just protein sampling. We may one day turn the DF platform around and use it to simulate molecular assembly. So watch this space.
Are we doing cutting-edge software or just me-too distributed computing? Oh please. Check our paper on the MoBiDiCK infrastructure (oops, we published in the CS literature, I guess it doesn't exist...). You will kindly note that it indicates our intentions and work toward distributed computing applied to protein folding, and it clearly pre-dates the entire F@H project. We weren't first, but we aren't me-too; we've been deliberately staging this for some time now. Judging by our user accolades, I'm confident we did the right thing in waiting until the software was ready before releasing it.
But maybe this was the wrong strategy. Apparently I need to do science with press releases, not scientific publications, because no one reads them.
So, two clarifications:
1) No, we haven't solved an NP-complete problem (yikes!). We just remembered that there are some good O(N log N) solutions to O(N*N) problems, and we came up with a novel twist on one of these. This is what allows our code to scale to the sizes of proteins we are currently doing, and beyond.
A "good" O(NlogN) solution is rather like using a phone book (a sorted ordered list) to look up a phone number in a big city (actually O(logN)). Imagine how long it would take if the book wasn't sorted in alphabetical order and printed on a big roll of paper instead of pages. Try to find a number then... Our method is a "treecode" algorithm - a bit different from the Barnes-Hut algorithm but in the same category.
A lot of existing protein folding code has little O(N*N) coding gotchas in it. You can tell either by reading the source code or by the way it doesn't do large proteins. The F@H code cannot do large proteins, so I assume there are fundamental O(N*N) gotchas in the underlying code. Nothing that a good rewrite wouldn't fix, though. Trouble is, Pande's group didn't write it - someone else did - and as a result they may have a tough time troubleshooting it.
Anyhow "treecode" is why we can hit a better villin with fewer volunteers than F@H - because we can crank out proteins faster on fewer CPUS. We wrote it from the ground up, Howard and myself, to stamp out the O(N*N) dependencies - we didn't solve an NP-complete problem.
2) As to the acceptance of the F@H method as a true protein folding computation. "True" here means that the computation actually nudges the protein all the way, with fine adjustments, from an unfolded squiggle to a perfectly folded protein, as judged by some "energy" computation.
One can question the "truth" of an F@H simulation in light of the paper Pande published in the Journal of Molecular Biology (vol. 313, pp. 151-169). On page 163 there is a little disclaimer that states: "...there were also other structures of the same minimum energy...In other words we could not use the total potential energy as the sole indicator of folding."
So can they really fold anything when their potential energy score doesn't work? Can you trust a "folding" simulation to say how a protein is moving when it isn't truly predictive? Picture a weather simulation that computes the US tornado alley region (Texas, Kansas, Dorothy, Oz) as lying along the east coast (Maryland, NJ, NY). Would one go write a long paper about the apparent destruction "pathway" of the "hypothetical coastal tornadoes"? Apparently that's OK to do in protein folding. The problem is that you can see the true path of a tornado, but not the true folding path of a protein. We have to live with unvalidated simulations in this field. But one should not overstate the "truth" of these simulations until their predictive power is shown with some certainty.
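The scoring problem is easy to demonstrate for yourself. If an energy function were a reliable indicator of folding, the lowest-energy decoy would also be the one closest to native. A toy check with fabricated numbers (real decoy energies and RMSDs would go where the random data is):

Code:
import numpy as np

rng = np.random.default_rng(1)
rmsd = rng.uniform(1.0, 12.0, 1000)               # fake decoy RMSDs to native
energy = 0.3 * rmsd + rng.normal(0.0, 2.0, 1000)  # fake, weakly correlated scores

best = int(np.argmin(energy))                     # the decoy the score would pick
print("RMSD of lowest-energy decoy:", round(float(rmsd[best]), 2))
print("best RMSD actually in the set:", round(float(rmsd.min()), 2))
print("energy/RMSD correlation:", round(float(np.corrcoef(energy, rmsd)[0, 1]), 2))

With a weak correlation like that, "minimum energy" routinely picks a decoy several Angstroms worse than the best one available - which is exactly what the disclaimer quoted above is admitting.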
Finally, if one were using Raj's arguments, one might say that F@H is wasting resources on performing inaccurate, unvalidated folding simulations. This is, of course, silly, because once they beat the O(N*N) problems out of their code, and once we have determined a proper scoring function methodology, F@H may be able to produce scalable, validated protein folding simulations. That's exactly why I am very hopeful for their success.
Science happens in small incremental improvements, and you all happen to be helping push a few more of those onto the stack by helping DF or F@H.
Cheers and many thanks to all of you!
Christopher Hogue