The Science of Distributed Folding?

**FEEDB0B0** · 04-29-2002, 01:52 PM

Hi folks,

Ah FUD (Fear, Uncertainty, Doubt).

Let me (oh yea, that Dr. Hogue guy) weigh in.

Scientific research is, by definition, filled with uncertainty. We don't mind FUD here. We're happy to sit through it, be patient, and have our project prove our point. Mind you, we haven't yet published our results from 5 billion sampled structures, so most scientists in the folding community have no idea what we are up to.

DF is only one of my major projects. That's why you don't hear very much from me - I have a very active group. We have had 2 Science papers and 1 Nature paper come out in Dec and Jan on the topic of molecular assembly information - the Biomolecular Interaction Network Database (www.binddb.org). I have published recently in Cancer Research on bioinformatics discoveries relevant to mechanisms of DNA damage in colorectal cancer. So you folks should know you aren't dealing with any slouches or also-rans here. We are doing biomedically relevant research. By supporting DF and my group, you support everything we are trying to do, not just sampling proteins. We may one day turn around the DF platform and use it to simulate molecular assembly. So watch this space.

Are we doing cutting-edge software or just me-too distributed computing? Oh please. Check our paper on the MoBiDiCK infrastructure (oops, we published in the CS literature, I guess it doesn't exist...). You will kindly note that it indicates our intentions and work towards distributed computing applied to protein folding, and it clearly pre-dates the entire F@H project. We weren't first, but we aren't me-too; we've been deliberately staging this for some time now. Judging by our user accolades, I'm confident that we did the right thing in waiting till it was ready before releasing our software.

But maybe this was the wrong strategy. Apparently I need to do science with press releases, not in scientific publications, because no one reads them.

So, two clarifications:

1) No - we haven't solved the NP-complete problem (yikes!). We just remembered that there are some good O(NlogN) solutions to O(N*N) problems, and we came up with a novel twist on one of these. This is what allows our code to scale to the sizes of proteins we are currently doing and beyond.

A "good" O(NlogN) solution is rather like using a phone book (a sorted list) to look up a phone number in a big city (a single lookup is actually O(logN)). Imagine how long it would take if the book weren't sorted in alphabetical order and were printed on a big roll of paper instead of pages. Try to find a number then... Our method is a "treecode" algorithm - a bit different from the Barnes-Hut algorithm but in the same category.
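The phone book analogy can be sketched in a few lines of Python. This is only an illustration of sorted versus unsorted lookup, not anything from the DF code; the names `make_phone_book`, `linear_lookup`, and `sorted_lookup` and the made-up entries are assumptions introduced for the example.

```python
import random


def make_phone_book(n, seed=0):
    """Build n hypothetical (name, number) entries in shuffled order."""
    entries = [(f"person{i:06d}", f"555-{i:06d}") for i in range(n)]
    random.Random(seed).shuffle(entries)
    return entries


def linear_lookup(entries, name):
    """Scan an unsorted list front to back: O(N) comparisons."""
    steps = 0
    for entry_name, number in entries:
        steps += 1
        if entry_name == name:
            return number, steps
    return None, steps


def sorted_lookup(sorted_entries, name):
    """Binary search a list sorted by name: O(log N) comparisons."""
    lo, hi = 0, len(sorted_entries)
    steps = 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_entries[mid][0] < name:
            lo = mid + 1  # target is in the upper half
        else:
            hi = mid      # target is at mid or in the lower half
    if lo < len(sorted_entries) and sorted_entries[lo][0] == name:
        return sorted_entries[lo][1], steps
    return None, steps


if __name__ == "__main__":
    book = make_phone_book(100_000)
    print(linear_lookup(book, "person054321"))
    print(sorted_lookup(sorted(book), "person054321"))
```

With 100,000 entries the sorted lookup needs at most 17 comparisons, while the unsorted scan may need up to 100,000. Sorting costs O(NlogN) once, and every lookup after that is cheap.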

A lot of existing protein folding code has little bits of O(N*N) coding gotchas. You can tell by either reading the source code, or by the way it doesn't do large proteins. F@H code cannot do large proteins, so I assume there are fundamental O(N*N) gotchas in the underlying code. Nothing that a good rewrite wouldn't fix though. Trouble is Pande's group didn't write it - someone else did - and as a result they may have a tough time troubleshooting it.

Anyhow, "treecode" is why we can hit a better villin with fewer volunteers than F@H - because we can crank out proteins faster on fewer CPUs. We wrote it from the ground up, Howard and myself, to stamp out the O(N*N) dependencies - we didn't solve an NP-complete problem.
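To make the O(N*N) "gotcha" concrete, here is a minimal sketch assuming the task is to find all pairs of points closer than some cutoff. It is not the DF treecode and not Barnes-Hut: it swaps in a plain k-d tree range search, which shares only the general idea of partitioning space hierarchically so that most pairs are never tested. All function names, the leaf size, and the random test points are assumptions made for illustration.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two 3D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2


def brute_force_pairs(points, cutoff):
    """The gotcha: test every pair, N*(N-1)/2 distance checks."""
    c2 = cutoff * cutoff
    found = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist2(points[i], points[j]) <= c2:
                found.add((i, j))
    return found


def build_tree(indices, points, depth=0, leaf_size=8):
    """Recursively split the points at the median, cycling through x, y, z."""
    if len(indices) <= leaf_size:
        return ("leaf", indices)
    axis = depth % 3
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2
    split = points[indices[mid]][axis]
    left = build_tree(indices[:mid], points, depth + 1, leaf_size)
    right = build_tree(indices[mid:], points, depth + 1, leaf_size)
    return ("node", axis, split, left, right)


def query(tree, points, center, cutoff, out):
    """Collect indices within cutoff of center, skipping far-away subtrees."""
    if tree[0] == "leaf":
        c2 = cutoff * cutoff
        for i in tree[1]:
            if dist2(points[i], center) <= c2:
                out.append(i)
        return
    _, axis, split, left, right = tree
    if center[axis] - cutoff <= split:   # search sphere reaches the left side
        query(left, points, center, cutoff, out)
    if center[axis] + cutoff >= split:   # search sphere reaches the right side
        query(right, points, center, cutoff, out)


def tree_pairs(points, cutoff):
    """Same result as brute_force_pairs, found by one tree query per point."""
    tree = build_tree(list(range(len(points))), points)
    found = set()
    for i, p in enumerate(points):
        hits = []
        query(tree, points, p, cutoff, hits)
        for j in hits:
            if i < j:
                found.add((i, j))
    return found


if __name__ == "__main__":
    rng = random.Random(1)
    pts = [(rng.uniform(0, 20), rng.uniform(0, 20), rng.uniform(0, 20))
           for _ in range(300)]
    print(brute_force_pairs(pts, 2.0) == tree_pairs(pts, 2.0))
```

Both functions return the same set of pairs. The difference is in how the work grows: the double loop always does N*(N-1)/2 distance checks, while each tree query visits only the few branches near the query point when the cutoff is small relative to the overall size of the structure.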

2) As to the acceptance of the F@H method as a true protein folding computation: that would mean the computation is actually nudging the protein all the way, with fine adjustments, from an unfolded squiggle to a perfectly folded protein, as judged by some "energy" computation.

One can question the "truth" of an F@H simulation, taking into account the paper that Pande published in the Journal of Molecular Biology (vol 313 151-169). On page 163 there is a little disclaimer that states: "...there were also other structures of the same minimum energy...In other words we could not use the total potential energy as the sole indicator of folding."

So can they really fold anything when their potential energy score doesn't work? Can you trust a "folding" simulation to say how a protein is moving when it isn't truly predictive? Picture a weather simulation that computes the US tornado alley region (Texas, Kansas, Dorothy, Oz) as lying along the east coast (Maryland, NJ, NY). Would one go write a long paper about the apparent destruction "pathway" of the "hypothetical coastal tornadoes"? Apparently that's OK to do in protein folding. Problem is, you can see the true path of a tornado, but not the true folding path of a protein. We have to live with unvalidated simulations in this field. But one should not overspeak of the "truth" of these simulations until their predictive power is shown with some certainty.

Finally, if one were using Raj's arguments, one might say that F@H would be wasting resources on performing inaccurate, unvalidated folding simulations. This is, of course, silly, because once they beat the O(N*N) problems out of their code, and once we have determined a proper scoring function methodology, F@H may be able to produce scalable, validated protein folding simulations. That's exactly why I am very hopeful for their success.

Science happens in small incremental improvements, and you all happen to be helping push a few more of those onto the stack by helping DF or F@H.

Cheers and many thanks to all of you!
Christopher Hogue
