homology modeling [Archive]

Brian the Roman

03-22-2003, 07:18 AM

Howard;
Do you currently use any homology modeling approaches to guide the sampling?

I was thinking a bit about the sampling side. There are only 20 AAs. When put together in various orders we get different proteins. However, any protein with over 20 residues must be reusing some of the AAs. So why not create a database of all proteins with known conformation (PDB?), then find subsets of the sequence of interest in the database and use the conformations of those pieces to guide us? Then we'd only need to randomly sample the unknown sequences.

For example: say our protein contains a sequence ALA, ARG, ASP, CYS, ASN. We look in the db and find another protein whose native fold we know and which also contains this sequence. Then we could use the conformation of the middle three AAs as a starting point and only randomly sample for the rest, and thus effectively reduce the sampling space. Obviously, you could find multiple chains in the db, each of which we could use to guide us. This info could be looked up by the server before the protein is sent to the client, so the client would only need the info about the known sequences, not the entire db.
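To make the lookup idea concrete, here is a minimal sketch in Python. It is only an illustration of the proposal above, not anything the project actually runs. The function names, the window length `k=5`, and the data format (a list of three-letter residue codes paired with one conformation descriptor per residue) are all assumptions made for the example.

```python
from collections import defaultdict


def build_fragment_index(known_structures, k=5):
    """Map every k-residue window to the conformations observed for it.

    known_structures is a list of (sequence, conformations) pairs.
    sequence is a list of three-letter residue codes; conformations is a
    same-length list of per-residue descriptors (hypothetical format,
    e.g. backbone angle pairs).
    """
    index = defaultdict(list)
    for sequence, conformations in known_structures:
        for i in range(len(sequence) - k + 1):
            key = tuple(sequence[i:i + k])
            index[key].append(conformations[i:i + k])
    return index


def lookup_fragments(target, index, k=5):
    """Return {start position: [known conformations]} for each window of
    the target sequence that also occurs in the index."""
    hits = {}
    for i in range(len(target) - k + 1):
        key = tuple(target[i:i + k])
        if key in index:
            hits[i] = index[key]
    return hits
```

A server could run `lookup_fragments` once per target and ship only the resulting `hits` dictionary to the client, which matches the point about the client not needing the entire db. Note that one window can map to several conformations if the same stretch appears in more than one known structure.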

This approach is similar to homology modeling, is it not?

ms

Brian the Fist

03-22-2003, 11:54 AM

You've got it a little bit confused but close. As you may recall, for CASP we submitted 39 homology modelling predictions using a modified version of our algorithm (and not requiring distributed computing). HM entails looking for proteins with high sequence identity to your protein of interest, loading up their structure, and then swapping the sequence to the new sequence. Essentially you keep the entire fold fixed and just change the amino acids on the backbone. Provided the sequence identity is above about 35% and your alignment of the two sequences is good, you end up with a very good structural model for the unknown (typically 2-4 Å RMSD from the true fold).
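As a rough illustration of the identity check mentioned above, the sketch below computes percent identity over an existing pairwise alignment and compares it with the roughly 35% figure. The function names and the `-` gap character are assumptions for the example, and definitions of identity differ in which columns they count; this version counts only columns where neither sequence has a gap. It does not perform the alignment itself.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the gap-free aligned columns.

    Both inputs are one-letter sequences that have already been aligned,
    so they must be the same length. Columns with a gap in either
    sequence are skipped (one of several possible conventions).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned_cols = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        aligned_cols += 1
        if a == b:
            matches += 1
    return 100.0 * matches / aligned_cols if aligned_cols else 0.0


def usable_template(aligned_target, aligned_template, threshold=35.0):
    """True if identity exceeds the threshold (35% as quoted above)."""
    return percent_identity(aligned_target, aligned_template) > threshold
```

Passing the threshold only says the template is worth trying; the quality of the alignment still matters, as noted above.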

What you describe could be called 'motif detection'. One problem with this is that the same sequence does not always fold into the same shape in different proteins - even for a short stretch of say 5 amino acids. This is because of non-local interactions between parts of the protein that are not near each other in sequence but are in 3-D space and thus have an influence on the final conformation. There is in fact provision in our 'trajectory distributions' to hold fragments - fixed stretches of residues like you describe - and to sample them probabilistically. Thus even though the same sequence could have, say, 3 different known conformations, we could sample all 3 of them probabilistically, and even sometimes choose them totally at random (since it is possible they won't fold into any of these three shapes).
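The probabilistic choice described above can be sketched as follows. This is not the actual 'trajectory distribution' code; the function name, the `p_random` parameter, and its default of 0.1 are assumptions made purely for illustration. Returning `None` stands for "ignore the known shapes and sample this stretch at random".

```python
import random


def sample_fragment(known_conformations, weights=None, p_random=0.1, rng=random):
    """Pick a conformation for one fragment.

    With probability p_random (or when nothing is known), return None so
    the caller falls back to fully random sampling for this stretch.
    Otherwise return one of the known conformations, chosen according to
    the optional weights (uniform if weights is None).
    """
    if not known_conformations or rng.random() < p_random:
        return None
    return rng.choices(known_conformations, weights=weights, k=1)[0]
```

For a stretch with 3 known conformations, calling `sample_fragment(confs, weights=[0.5, 0.3, 0.2])` repeatedly would try all 3 in proportion to the weights while still leaving about 10% of attempts unconstrained, which covers the case where the stretch folds into none of the known shapes.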

The main problem is that there aren't any good databases of these 'motifs' that we have been able to make use of yet. David Baker's 'I-sites' library is somewhat like this but is unfortunately in a somewhat unusable file format. We could also make a motif database ourselves, but we simply haven't gotten to it yet (you listening Chris?? :) ). It is definitely a good idea though and could easily be added to our method. It is something that will hopefully be tried in the future.