Hogue team CASP 6 strategy
June 24, 2004

The following is a summary of the approaches the Hogue team will be applying for CASP 6. For those interested, see the CASP 6 web site (http://predictioncenter.llnl.gov/casp6/Casp6.html). We are putting all our available resources into a full-force effort this time around. Note that we have two teams: an automated server team, HOGUE-HOMTRAJ, and a ‘manual’ team, HOGUE-STEIPE.

The HOGUE-HOMTRAJ team, consisting of Howard Feldman, Michel Dumontier, Kevin Snyder, and Christopher Hogue, is using our automated homology modelling server, located at http://homtraj.blueprint.org/, to make predictions for all targets for which suitable templates can be found in the PDB. The server uses a program called SAM, developed by Kevin Karplus at UCSC (http://www.cse.ucsc.edu/research/compbio/sam.html), to find possible template structures in the PDB using a Hidden Markov Model approach. SAM also provides an alignment between the CASP target sequence and the template. Following this, our own method is applied to construct 3D models of the query sequence using information from the template(s), and the results are mailed back to CASP.

Additionally, targets are forwarded to Armadillo, our domain prediction server developed by Michel Dumontier (http://armadillo.blueprint.org/) to participate in the domain prediction portion of CASP.

The ‘manual’ prediction team is a group effort by a number of teams here at Blueprint, in collaboration with Boris Steipe’s lab (http://biochemistry.med.utoronto.ca/steipe/) at the University of Toronto and the Distributed Folding community (http://www.distributedfolding.org/). It also involves a test of a scoring function developed by Dr. Brendan McConkey at the University of Waterloo (http://www.science.uwaterloo.ca/biol.../mcconkey.html). Predictions will be made as follows:

For those targets where HomTraj was able to find templates (structures with similar sequences – the “easy” part of protein structure prediction), we will manually tweak the alignments to see if we can improve upon the automated method. In addition, we may try some different approaches for choosing the best of the structures we generate, including a scoring function developed by Dr. McConkey.

For the remaining targets, which must be predicted with an ab initio approach, we will be incorporating Dr. Steipe’s protein motif library, in the form of protein fragments added to the Trajectory Distributions. Michael Brougham, a summer student, has been working on integrating these fragments with the Foldtraj algorithm. Briefly, these motifs consist of sequence patterns, 3-15 amino acids in length, which have been clustered based on 3D conformation. Building structures from these fragments, matched up by sequence pattern, should produce more protein-like structures. We expect this to improve the results in generation 0 of our method, and the fragments may also be used in future generations with different weights. Alternatively, they may work sufficiently well that further generations are not required (as in Phase I). This remains to be determined after some testing is performed. We are able to build about 50 million structures per day on our own cluster while testing this new approach.
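To give a rough idea of what "matched up by sequence pattern" could look like, here is a minimal Python sketch. It is not the Foldtraj integration or the actual format of Dr. Steipe's library: the motif names, the use of regular expressions as the pattern syntax, and the example patterns are all assumptions made up for illustration. Only the 3-15 residue length range comes from the description above.

```python
import re
from typing import List, NamedTuple


class Motif(NamedTuple):
    name: str     # hypothetical label for a cluster of similar 3D conformations
    pattern: str  # assumed syntax: a regular expression over one-letter residue codes


class Hit(NamedTuple):
    motif: str
    start: int  # 0-based start index in the target sequence
    end: int    # exclusive end index


def find_motif_hits(sequence: str, motifs: List[Motif]) -> List[Hit]:
    """Return every place a motif pattern matches, including overlapping matches."""
    hits = []
    for motif in motifs:
        # A lookahead with a capture group reports matches at every start
        # position, so overlapping fragments are not skipped.
        regex = re.compile(f"(?=({motif.pattern}))")
        for match in regex.finditer(sequence):
            fragment = match.group(1)
            # Keep only fragments in the 3-15 residue range described above.
            if 3 <= len(fragment) <= 15:
                hits.append(Hit(motif.name, match.start(), match.start() + len(fragment)))
    return sorted(hits, key=lambda h: (h.start, h.motif))


if __name__ == "__main__":
    # Made-up patterns and sequence, purely for demonstration.
    library = [Motif("turn_like", "G.G"), Motif("cap_like", "P[ST]..[DE]")]
    for hit in find_motif_hits("MKGAGPSLLDGSG", library):
        print(hit)
```

Each hit marks a stretch of the target whose backbone conformation could then be biased toward the fragment's clustered 3D shape. How that bias is actually weighted in the Trajectory Distributions is not shown here.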

Additionally, Michael Matan of the Seqhound group here, Florence Wu (a summer student), and several of the BIND (http://www.bind.ca/) curators will be assisting us with the new function prediction category of CASP. This entails prediction of binding sites, binding partners, protein function, post-translational modifications, and so on. This may in turn assist our 3D structure prediction efforts. For example, a predicted DNA-binding protein should have a DNA-binding motif somewhere on its surface.

With all this going on, we had considered a number of possibilities for how the Distributed Folding Project (DFP) would play a role. We could:

a) Take the small ab initio targets suitable for prediction with DFP, and run the current DFP algorithm on them (but without the native structure or RMSDs of course, which are unknown now)
b) Revert to the Phase I algorithm, which has done better than Phase II in some cases and was much simpler
c) Implement Steipe’s fragments into the Phase II algorithm and use that
d) Shut down the DF project entirely to focus on manual CASP-6 predictions

After much debate, we have decided to continue with the project, and go with a combination of a) and b), and possibly c). That is, we will keep the present algorithm (which does NOT use RMSD anywhere in it for scoring or driving the generations – it is a true blind test), and run it on the dozen or so targets suitable for DFP (i.e. relatively small, and too hard to predict any other way). However, we will also increase the generation zero size to about 50,000, increasing the initial sampling we do and making it a bit more like Phase I. The fragment integration will probably take too long for us to get working reliably and stably in time for the CASP targets, but if we manage it, the fragments may be added later. Additionally, McConkey’s scoring function may be swapped in for crease energy (which we have been using until now) after we have done more testing with it. Elena Garderman will continue to make all changes to the software as needed.
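For readers wondering what a "blind" generational run with a swappable scoring function means in practice, the toy Python sketch below may help. It is an illustration under stated assumptions, not the DFP algorithm: the structure representation (a list of angles), the toy_energy function, and the rule of seeding each later generation from the best structure so far are all hypothetical. What it is meant to show is that a large random generation zero is followed by smaller generations, that the scoring function is passed in as a parameter (so one energy function can be replaced by another), and that no native structure or RMSD appears anywhere.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Structure = List[float]  # hypothetical stand-in: one angle (degrees) per residue


def toy_energy(structure: Sequence[float]) -> float:
    """Made-up scoring function (lower is better); a real run would plug in an energy term."""
    return sum(angle * angle for angle in structure)


def sample_structure(n_residues: int, rng: random.Random,
                     seed: Optional[Structure] = None, spread: float = 20.0) -> Structure:
    """Sample randomly, or perturb a seed structure if one is given."""
    if seed is None:
        return [rng.uniform(-180.0, 180.0) for _ in range(n_residues)]
    return [max(-180.0, min(180.0, a + rng.gauss(0.0, spread))) for a in seed]


def run_generations(n_residues: int, score: Callable[[Sequence[float]], float],
                    gen0_size: int, later_size: int, n_generations: int,
                    rng: random.Random) -> Tuple[Structure, List[float]]:
    """Generation zero is pure sampling; later generations sample around the best so far."""
    population = [sample_structure(n_residues, rng) for _ in range(gen0_size)]
    best = min(population, key=score)
    history = [score(best)]
    for _ in range(n_generations):
        population = [sample_structure(n_residues, rng, seed=best)
                      for _ in range(later_size)]
        candidate = min(population, key=score)
        if score(candidate) < score(best):
            best = candidate
        history.append(score(best))
    return best, history


if __name__ == "__main__":
    # Small sizes for a quick demo; the post above mentions a generation zero of about 50,000.
    best, history = run_generations(10, toy_energy, gen0_size=500,
                                    later_size=100, n_generations=5,
                                    rng=random.Random(0))
    print(history)
```

Swapping the scoring function amounts to passing a different callable as score; nothing else in the loop changes.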

We will likely end the present protein early, and expect to start work on the first CASP target towards the end of the week of July 5. We will then proceed at a rate of one target per week, with targets ranging in size from about 50-150 residues. We realize that some users may not be used to, and may be unable to keep up with, this fast pace of changeovers, but we intend to make it as painless as possible, and it is unfortunately necessary due to the time constraints imposed by CASP. If you cannot keep up, we suggest you try a different DC project until CASP has ended in early September rather than wasting CPU cycles on structures we have already submitted to CASP.

We are very enthusiastic and excited about this opportunity to test our newest ideas and methods, and hope to at least top our CASP 5 results, if not come out on top overall. We feel that the inclusion of the motif library, and the improvements we have made to our HomTraj server, will significantly improve our performance over CASP 5. We are also excited to see how we do with function prediction, which is included in CASP for the first time this year. With our BIND interaction database and other bioinformatics resources, we feel we have a distinct advantage in this category.