The Science of Distributed Folding?
I was just curious to hear more about the science of distributed folding. I'm a grad student at Berkeley, but hang out at Stanford too (my wife is there), and have heard many of the big names in computational biology and protein structure prediction (Michael Levitt, David Baker, etc.) talk about work similar to what's going on here.
From what I can tell, Distributed Folding is creating a large decoy set, much like the original Park/Levitt decoy set. I guess I'm curious
(1) Why do you need distributed computing to do this? The Park/Levitt set (and other sets, e.g. Baker's) were created with far fewer resources. Are yours bigger or better? Based on the RMSDs you quote, it doesn't seem that way. (For the laymen following along, there's a quick sketch below of how RMSD is computed.)
I just want to make sure that we're not wasting our time here on inefficient code.
(2) Why does one need a big decoy set in the first place? Levitt and Baker have each mentioned that decoy discrimination is the problem, and that a bigger decoy set is not going to help.
Again, just curious whether all this makes any sense to be doing.
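For anyone here who hasn't run into RMSD before: it's just the root-mean-square distance between matching atoms after the two structures are optimally superimposed. Here's a minimal sketch in plain NumPy (toy code with made-up coordinates, not anyone's production pipeline):

Code:
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD (same units as the coordinates, usually Angstroms) between
    two (N, 3) coordinate arrays after optimal superposition."""
    P = P - P.mean(axis=0)                # center both structures
    Q = Q - Q.mean(axis=0)
    V, S, Wt = np.linalg.svd(P.T @ Q)     # Kabsch: SVD of the covariance
    d = np.sign(np.linalg.det(V @ Wt))    # guard against a reflection
    R = V @ np.diag([1.0, 1.0, d]) @ Wt   # optimal rotation
    diff = P @ R - Q
    return np.sqrt((diff ** 2).sum() / len(P))

# Toy usage: a decoy is "good" if its RMSD to the native structure is low.
native = np.random.rand(50, 3) * 10
decoy = native + np.random.normal(0, 0.5, native.shape)
print(round(kabsch_rmsd(decoy, native), 2))

A decoy set is then characterized by how low those RMSDs get and how densely the set samples near-native structures.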
I think your idea to use distributed computing for biology (and proteins in particular) is great and I'm glad you're working on this. I just want to understand why. Thanks! :)
Raj
This isn't a run for the Nobel Prize
Howard is a grad student in bioinformatics, not just biology. As a bona fide computer scientist, he works on computer applications for biology.
Fact:
Computer Scientists don't get Nobel Prizes.
Fact:
Computer Scientists get recognition of their work from other computer scientists and professionals for astonishingly clever ways to solve previously impossible problems.
From Wikipedia: "In complexity theory, the complexity class NP-complete is the set of problems that are the hardest problems in NP."
The statistical prediction of protein folds is a problem that many suspect to be NP-complete. In plain English, "the problem is impossible to solve." Howard and Dr. Hogue believe they have proven that it is in fact solvable. They show in their papers that the problem is logarithmic rather than exponential. Their reasoning looks pretty solid to me. Early indications are that they are doing better than they thought they would. Of course, in the world of computer science, it ain't nothing unless you can actually do it.

From my point of view, it's real science... cutting edge, head-and-shoulders-above-your-peers type science. But as is always the computer scientist's fate, the only thanks they will get is from the users. We crunchers are not Howard's users. His users are all of those other scientists waiting for the shapes of all 30,000 proteins so they can get to work on new cures. They don't care how the protein got that way, they just want to know what it is. If Howard is in fact right, they will have it as quickly as they can get people to help crunch.
As a distributed computing effort, many things about DF are appealing to the people who do the crunching. It has a good solid benefit to humanity if it succeeds. No guarantees that it will succeed, but look at SETI: they haven't found any aliens and might never. It runs nicely on a lot of different machines, so there aren't too many bugs. The researcher even spends his Sundays :notworthy answering questions.

I doubt many of Howard's crunchers understand more than the first sentence of his papers. But they know that when the people associated with an effort are smart, friendly people, when the software works, and when the guy in charge is actively engaged, there is a good chance things are going to be fun on the project. That Howard uses mind-bogglingly huge amounts of computer power simply doesn't bother the farmers, since they think everyone should have lots of computer power. :thumbs: For casual crunchers, it is a cool screen saver that a lot of people like watching. All of these reasons have to do with the fact that it is not the computers that get this job done, it is the people running the computers. None of whom are going to be eligible for a Nobel Prize, but they do it anyway.
Sometimes I think a Nobel Prize for CS would be a good thing. But then I realize that the scientists in fields with big prizes sometimes come down with Nobel Fever and forget that it isn't about a shiny medallion, it's about the advancement of humanity.
Re: This isn't a run for the Nobel Prize
plaidfishes wrote:
"I doubt many of Howards crunchers understand more than the first sentance of his papers."
Which is what is great about this thread! I've learned far more from this thread than from anything presented so far in the FAQs on DF's website or elsewhere. Some of that is because the "elsewhere" overflows with techno-babble (the technical papers), and some because it simply isn't addressed.
Now along comes Raj who seems quite knowledgeable about the technical and scientific aspects of this project and raises questions about its value. He's even been nice enough to answer questions from laymen like me. That Howard, Aegion, and other knowledgeable people have calmly and respectfully responded to Raj's questions is fantastic! Again, I'm learning just what on Earth is being done by this project by way of this thread. A lot of this thread should be put in the FAQs ... especially the DF vs. Folding@Home stuff.
To say the least, I look forward to each email notification that a new reply has been added to this thread and eagerly click on the link to read them. To all that are trying to meaningfully contribute to this discussion, thanks!
Re: This isn't a run for the Nobel Prize
Quote:
Originally posted by plaidfishes
The statistical prediction of protein folds is a problem that many suspect to be NP-complete. In plain English, "the problem is impossible to solve."
Well, not impossible, just REALLY, REALLY HARD. If statistical prediction is indeed NP-complete, then it can't be done in better than exponential time by any known algorithm. This of course doesn't mean it can't be done, just that no one knows how. The interesting thing about NP-completeness is that every NP-complete problem can be translated into any other in polynomial time, so if you find a way to do any (truly) NP-complete problem in polynomial time, you will be able to solve every NP-complete problem in polynomial time.
Which means, if statistical prediction is in fact NP-complete, and Howard found a way to do it in logarithmic time (which is actually better than polynomial)...
:notworthy :notworthy :cheers:
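To put rough numbers on why that would be such a big deal, here's a back-of-the-envelope comparison of the growth rates being thrown around in this thread (illustrative operation counts only, nothing to do with DF's actual cost model):

Code:
import math

# Rough operation counts for a problem of size N under the growth
# rates discussed above (illustrative only -- not DF's cost model).
for n in (10, 100, 300):
    print(f"N={n:3d}  log N ~ {math.log2(n):5.1f}   "
          f"N log N ~ {n * math.log2(n):8.0f}   "
          f"N*N = {n * n:6d}   2^N ~ {2.0 ** n:.1e}")

By N = 300 the exponential column is already beyond any conceivable amount of computing, while the N log N column is still trivial.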
On inefficient code, NP-completeness, and the F@H vs. DF issue
Hi folks,
Ah FUD (Fear, Uncertainty, Doubt).
Let me (oh yeah, that Dr. Hogue guy) weigh in.
Scientific research is, by definition, filled with uncertainty. We don't mind FUD here. Happy to sit through it, be patient, and have our project prove our point. Mind you, we haven't yet published our results from 5 billion sampled structures, so most scientists in the folding community have no idea what we are up to.
DF is only one of my major projects. That's why you don't hear very much from me - I have a very active group. We have had two Science papers and one Nature paper come out in Dec and Jan on the topic of molecular assembly information - the Biomolecular Interaction Network Database (www.binddb.org). I have published recently in Cancer Research on bioinformatics discoveries relevant to mechanisms of DNA damage in colorectal cancer. So you folks should know you aren't dealing with any slouches or also-rans here. We are doing biomedically relevant research. By supporting DF and my group, you support everything we are trying to do, not just protein sampling. We may one day turn the DF platform around and use it to simulate molecular assembly. So watch this space.
Are we doing cutting-edge software or just me-too distributed computing? Oh please. Check our paper on the MoBiDiCK infrastructure (oops, we published in the CS literature, I guess it doesn't exist...). You will kindly note that it indicates our intentions and work toward distributed computing applied to protein folding, and it clearly pre-dates the entire F@H project. We weren't first, but we aren't me-too; we've been deliberately staging this for some time now. Judging by our user accolades, I'm confident we did the right thing in waiting until the software was ready before releasing it.
But maybe this was the wrong strategy. Apparently I need to do science with press releases, not scientific publications, because no one reads them.
So, two clarifications:
1) No, we haven't solved an NP-complete problem (yikes!). We just remembered that there are some good O(N log N) solutions to O(N*N) problems, and we came up with a novel twist on one of these. This is what allows our code to scale to the sizes of proteins we are currently doing, and beyond.
A "good" O(NlogN) solution is rather like using a phone book (a sorted ordered list) to look up a phone number in a big city (actually O(logN)). Imagine how long it would take if the book wasn't sorted in alphabetical order and printed on a big roll of paper instead of pages. Try to find a number then... Our method is a "treecode" algorithm - a bit different from the Barnes-Hut algorithm but in the same category.
A lot of existing protein folding code has little O(N*N) coding gotchas in it. You can tell either by reading the source code or by the way it doesn't do large proteins. The F@H code cannot do large proteins, so I assume there are fundamental O(N*N) gotchas in the underlying code. Nothing that a good rewrite wouldn't fix, though. Trouble is, Pande's group didn't write it - someone else did - and as a result they may have a tough time troubleshooting it.
Anyhow "treecode" is why we can hit a better villin with fewer volunteers than F@H - because we can crank out proteins faster on fewer CPUS. We wrote it from the ground up, Howard and myself, to stamp out the O(N*N) dependencies - we didn't solve an NP-complete problem.
2) As to the acceptance of the F@H method as a true protein folding computation. "True" here means that the computation actually nudges the protein all the way, with fine adjustments, from an unfolded squiggle to a perfectly folded protein, as judged by some "energy" computation.
One can question the "truth" of an F@H simulation in light of the paper Pande published in the Journal of Molecular Biology (vol. 313, pp. 151-169). On page 163 there is a little disclaimer that states: "...there were also other structures of the same minimum energy...In other words we could not use the total potential energy as the sole indicator of folding."
So can they really fold anything when their potential energy score doesn't work? Can you trust a "folding" simulation to say how a protein is moving when it isn't truly predictive? Picture a weather simulation that computes the US tornado alley region (Texas, Kansas, Dorothy, Oz) as lying along the east coast (Maryland, NJ, NY). Would one go write a long paper about the apparent destruction "pathway" of the "hypothetical coastal tornadoes"? Apparently that's OK to do in protein folding. The problem is that you can see the true path of a tornado, but not the true folding path of a protein. We have to live with unvalidated simulations in this field. But one should not overstate the "truth" of these simulations until their predictive power is shown with some certainty.
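The scoring problem is easy to demonstrate for yourself. If an energy function were a reliable indicator of folding, the lowest-energy decoy would also be the one closest to native. A toy check with fabricated numbers (real decoy energies and RMSDs would go where the random data is):

Code:
import numpy as np

rng = np.random.default_rng(1)
rmsd = rng.uniform(1.0, 12.0, 1000)               # fake decoy RMSDs to native
energy = 0.3 * rmsd + rng.normal(0.0, 2.0, 1000)  # fake, weakly correlated scores

best = int(np.argmin(energy))                     # the decoy the score would pick
print("RMSD of lowest-energy decoy:", round(float(rmsd[best]), 2))
print("best RMSD actually in the set:", round(float(rmsd.min()), 2))
print("energy/RMSD correlation:", round(float(np.corrcoef(energy, rmsd)[0, 1]), 2))

With a weak correlation like that, "minimum energy" routinely picks a decoy several Angstroms worse than the best one available - which is exactly what the disclaimer quoted above is admitting.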
Finally, if one were using Raj's arguments, one might say that F@H is wasting resources on performing inaccurate, unvalidated folding simulations. This is, of course, silly, because once they beat the O(N*N) problems out of their code, and once we have determined a proper scoring function methodology, F@H may be able to produce scalable, validated protein folding simulations. That's exactly why I am very hopeful for their success.
Science happens in small incremental improvements, and you all happen to be helping push a few more of those onto the stack by helping DF or F@H.
Cheers and many thanks to all of you!
Christopher Hogue