Optimization idea from way back, though slightly changed.



m0ti
12-14-2002, 06:17 PM
A while back I suggested using a genetic algorithm approach to try to improve upon results. The suggestion was probably impractical since it required too many client updates. From experience, most of the problems people have with their clients happen at protein changes, even when the algorithm stays the same, so keeping the number of updates to a minimum is a necessity. At the time, I also commented that it was more akin to hill-climbing than to a proper GA.

In any case, I now suggest a relatively small change which should be fairly simple to implement and which doesn't require any additional updates.

The following change would be required to the client:

A top 10 list (sorted) of folds is maintained. I don't know what information DF usually saves, but this should include the way the protein was folded (i.e. the individual steps in the fold).

Half of the time, the client folds randomly. If a fold is better than the worst member of the top 10 list, it replaces that member.

The other half of the time the client performs the following:

1) Pick a random fold from the top 10 list.

2) Pick a random spot in the sequence of folding steps (e.g. if a fold consisted of 79 individual steps, it might pick the point after the 33rd step). The steps prior to this point are kept as part of the new fold, and the rest of the fold is completed randomly. If this new fold is better than the top 10 fold chosen (specifically, the one we are trying to improve on), it replaces it.
Note: even if the new fold is better than the worst fold in the top 10 list, it does NOT replace that fold; it only ever replaces the fold it was derived from. This is because we want to maintain distinct search spaces.
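
To make this concrete, here is a minimal Python sketch of the loop. random_fold(), complete_fold(), and score() are hypothetical stand-ins for the client's internals (lower score = better fold):

```python
import random

TOP_N = 10          # size of the best-folds list
EXPLORE_PROB = 0.5  # fraction of time spent on purely random folds

# Hypothetical stand-ins for the client's internals:
#   random_fold()         -> a complete fold, as a list of individual steps
#   complete_fold(prefix) -> randomly completes a fold from a partial step list
#   score(fold)           -> energy-like score; lower is better

top_folds = []  # kept sorted, best (lowest score) first

def try_insert(fold):
    """Add a fold to the top-N list if it beats the worst member."""
    top_folds.append(fold)
    top_folds.sort(key=score)
    del top_folds[TOP_N:]

def one_iteration():
    if len(top_folds) < TOP_N or random.random() < EXPLORE_PROB:
        # Explore: a completely random fold, as the client does today.
        try_insert(random_fold())
    else:
        # Search: unravel a saved fold to a random step and refold from there.
        i = random.randrange(TOP_N)
        parent = top_folds[i]
        cut = random.randrange(len(parent))  # e.g. after the 33rd of 79 steps
        child = complete_fold(parent[:cut])
        # Per the note above, the child only ever replaces its own parent,
        # so the ten search regions stay distinct.
        if score(child) < score(parent):
            top_folds[i] = child
            top_folds.sort(key=score)
```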

Of course, the 50/50 ratio could be changed. The top 10 list could also be made larger (a top 100), or converted to a cluster approach:

10 clusters (i.e. families):

Each cluster contains 1-10 members. These members are sorted according to performance.

Each cluster has a rank given by its best-performing member (or by the average rank of all of its members). The clusters are sorted by rank.


When a fold is found that is better than the worst-ranked cluster in the top 10 clusters, a new cluster ("formed" around the new fold) replaces the worst-ranked cluster.

A cluster is chosen at random, a fold is chosen at random within it, and a new fold is created from it. If the cluster is full, the resultant fold can replace the worst-performing fold within that cluster; otherwise, the resultant fold is simply added to the cluster.
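
A rough sketch of that cluster variant, again with the hypothetical score() helper, and with make_child() standing in for the unravel-and-refold move above:

```python
import random

MAX_CLUSTERS = 10  # number of families kept
MAX_MEMBERS = 10   # folds per family

clusters = []  # each cluster is a list of folds, sorted best-first

def cluster_rank(cluster):
    # Rank by the best member; averaging all members would also work.
    return score(cluster[0])

def found_new_cluster(fold):
    """A fold better than the worst-ranked cluster founds a new cluster."""
    clusters.append([fold])
    clusters.sort(key=cluster_rank)
    del clusters[MAX_CLUSTERS:]

def refine_random_cluster(make_child):
    """Pick a random cluster and member, and derive a new fold from it."""
    cluster = random.choice(clusters)
    parent = random.choice(cluster)
    child = make_child(parent)  # e.g. the unravel-and-refold move above
    if len(cluster) < MAX_MEMBERS:
        cluster.append(child)               # room left: just add it
    elif score(child) < score(cluster[-1]):
        cluster[-1] = child                 # full: replace the worst member
    cluster.sort(key=score)
    clusters.sort(key=cluster_rank)
```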

The basic idea is a trade-off between exploring and searching.

The current algorithm, where every fold is completely random, is a purely explorative approach.

The changes I suggest still maintain the explorative part, but areas of interest are searched more thoroughly, by spending more time in them. Most likely, the majority of the topology is smooth "enough" for such an approach to realize gains.

Since each client can maintain its own list, no extra updates are required.

bwkaz
12-14-2002, 08:09 PM
The only issue with that is: how many local minima are there in the search space? If it's full of them (say, millions), then a 100% exploratory algorithm (as you put it) is actually better, because any searching piece of the algorithm will get misdirected into trying to improve on these local minima, which probably isn't going to happen.

Granted, I don't know a lot of the science behind the process, but my comments are based on what I have been able to gather about it.

Then again, isn't the purpose of the DF project to see how well random searching on thousands of PCs works for this problem? Maybe an alternative algorithm would be appropriate for another research project then...

Scoofy12
12-15-2002, 12:45 AM
Sounds like a good idea; the only problem with it is that it requires an accurate top 10 list, which you don't get when folding "ab initio", i.e. with no prior knowledge of the protein structure.

Of course, if you had a sufficiently good scoring function for determining just how good a fold is, or how likely it is to be "right", then it might still be a good idea.

m0ti
12-15-2002, 03:32 AM
bwkaz: the chances of hitting a local minimum with a purely exploratory approach are very small, especially considering the number of folds the average user makes per protein. It is much more likely that most of the best folds they have found are not local minima. In any case, the approach is certainly reminiscent of simulated annealing: take a good solution, and take a random walk from that solution. In this case, we are in fact taking the good solution, "unravelling" it back a random number of steps and then making the random walk. This approach can even result in a purely exploratory fold if the random step chosen is before the first fold. This approach tends to offer a more concentrated search in the immediate area of a fold, though the search space around each fold is still very, very large, and with fairly high probability there exists a deeper local minimum within that search space, even if the entire fold isn't unravelled.
Thus, we have 10 search spaces which we know contain "good" solutions (the best we've found so far), but, by maintaining the exploratory phase, we are by no means limited to these.

Actually, the entire approach as a whole rather reminds me of the EM algorithm (it only reminds me of it; it obviously doesn't conform to the true form of EM):

E-step: random fold generation, saving the best folds in the top 10 list

M-step: trying to maximize the results produced in the area of the folds in the top 10 list.

Looking at it this way, one could dedicate, say, the first 2500 folds to the E-step, and then the next 2500 folds (250 per member of the top 10 list) to the M-step.
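
As a sketch of that phased schedule, reusing the hypothetical random_fold(), complete_fold(), score(), try_insert(), and top_folds from the first example (the 2500/250 split is just the suggestion above, not a tuned value):

```python
import random

E_FOLDS = 2500  # E-step budget: purely random folds
M_FOLDS = 2500  # M-step budget: 250 refinements per top-10 member

def em_batch():
    # E-step: random fold generation, keeping the best in the top 10 list.
    for _ in range(E_FOLDS):
        try_insert(random_fold())
    # M-step: spend the remaining budget improving each saved fold in place.
    for i in range(len(top_folds)):
        for _ in range(M_FOLDS // len(top_folds)):
            parent = top_folds[i]
            cut = random.randrange(len(parent))
            child = complete_fold(parent[:cut])
            if score(child) < score(parent):
                top_folds[i] = child
```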

Still: IF the M-step is in general less worthwhile than the E-step, it is always possible to adjust the ratio of time spent in each part. Instead of 50/50 it could be 80/20, 90/10, or maybe 95/5.
It is perhaps a better idea to start with these higher percentages, since the vast majority of folds will still be created in the same way as today.


Scoofy12: we are in any case limited by how good our scoring function is. If the scoring function is poor, then a purely explorative approach will STILL give bad results, since the folds are not being graded even approximately properly. This approach should still improve on those cases. Though you are correct: it would be more accurate to speak of the topology created by the scoring function over all the possible folds.

m0ti
12-15-2002, 04:43 AM
It won't let me edit my previous message (since more than 60 minutes have gone by).

What I wanted to add is that, seeing as most users have already found the best fold they are going to get by about 3-4 billion structures, the E-step obviously stops contributing as much as time goes on. This is fairly obvious, since the chances of finding a better fold than the current best through a purely random approach decrease with time. In these cases, however, the M-step is much more likely to be of help, as it concentrates the search in the area around the fold.

When choosing the spot to unfold to during the M-step, it is likely more beneficial to use a distribution biased towards the later steps (i.e. in the majority of cases the protein is not unfolded very far back), though I believe that even a uniform or normal distribution should make the M-step more valuable than the E-step, particularly as the number of folds already produced increases.
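
For instance, one hypothetical way to bias the cut point toward the later steps (the exponent value is an arbitrary assumption; bias=1.0 recovers the plain uniform choice):

```python
import random

def biased_cut(n_steps, bias=2.0):
    """Pick an unfold point biased toward the later steps.

    random.random() ** (1.0 / bias) skews the unit interval toward 1,
    so higher bias values unravel fewer steps on average; bias=1.0
    recovers the uniform choice. The default 2.0 is an arbitrary
    assumption, not a tuned parameter.
    """
    u = random.random() ** (1.0 / bias)
    return int(u * n_steps)  # 0 = fully unravel, n_steps - 1 = barely unravel
```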

bwkaz
12-16-2002, 09:20 AM
But we don't want to find a local minimum -- we want to find the global minimum (or as close as we can get), right?

Granted, if the user isn't at a local minimum, then searching toward one will lower their RMS deviation a bit, but that can only go so far. If the value of that local minimum isn't low enough (~6A, I think, was the number quoted in the past for some proteins -- though this depends on the protein length as well, of course), then they're wasting their time anyway.

And of course, you don't know whether they're wasting their time until you find that minimum...

It's an interesting idea, in any case.

m0ti
12-16-2002, 11:10 AM
I guess I'll have to correct myself once again: the topology is the value of the scoring function over all the possible folds created by foldtraj (for the current protein).

Yes, that's true, we do want to find the global minimum.

However, reaching the global minimum is EXTREMELY difficult. The global minimum is most likely at the native fold of the protein. And yes, most likely the topology is riddled with (perhaps millions of) local minima. Some of course will be closer to the global minimum than others.

You have to remember that what we're doing here isn't anything at all like gradient descent; it's more like educated exploring. In gradient descent we would be trying to reach a local minimum in a straight-line approach (well, actually gradient descent zig-zags, but I'm talking in principle). We aren't making a straight-line attempt to reach the closest local minimum (gradient descent); we are still exploring, though in a new manner:

Instead of searching the whole fold-space (with fairly equal probability), we are searching a sub-space, or more accurately a probabilistic sub-space. To keep it simple, let's say we're using a uniform distribution to pick the step in the fold that we're trying to improve, and let's view the fold space as two-dimensional (for visualization purposes only). In this case the sub-space will be concentric rings of equal width, where the rings closest to the original fold have the highest probability (assuming progress from an unfolded protein is equally likely in all directions). This means we will tend to produce new folds which are close to the centre.

Granted, if there isn't a BETTER solution somewhat close to the centre, we may be "wasting time," and that's only if exploring the whole search space (i.e. the E-step) is more likely to produce a better solution than looking close to an already good solution. If we consider that a user will almost never have found a fold which is already a local minimum (I could be wrong on this; I don't know how foldTraj works!), then there is likely to be a better solution relatively close by, and it could likely be found quicker through the M-step.

The E-step is retained so that new, potentially better sub-spaces can be found, and here it is logical to consider the entire space with equal probability when searching for locations with better energies (i.e. we have no prior information as to where the native fold may lie, so we must assume that all folds are equally likely to be the native fold).

I'll also reiterate that, as time goes on, the chances of finding better and better sub-spaces diminish as we touch on more and more of them, and it becomes more important to search the good sub-spaces already found more thoroughly.

Brian the Fist
12-16-2002, 12:02 PM
Hi all, some good ideas here; I'm very impressed. Just so you know, we presently have two ideas for modifying the sampling. One is still the Genetic Algorithm; I think I have worked out a way to do it with minimal updating, but haven't had the time to code it yet. The other is actually quite similar, at least in spirit, to what m0ti has suggested. By taking low-energy structures, we are able to generate 'nearby' structures by gently wiggling the protein. This can tell us whether we are in a steep or shallow energy well, and in general tells us about the energy surface in the neighbourhood of the protein structure in question. This in turn can help enhance our scoring and weed out false positives. The best-energy-scoring of these nearby structures could then 'replace' the original structure and the process repeated as well. I should have time to implement one or both of these approaches in the near future and we are excited to see how much better (or worse?) they will perform.
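
In pseudo-Python, that wiggle-and-rescore idea is something like the following; perturb() and energy() are just placeholders for the real internals, and n_samples/amplitude are made-up defaults:

```python
def probe_neighbourhood(structure, n_samples=50, amplitude=0.05):
    """Gently wiggle a low-energy structure and rescore the neighbours.

    perturb(structure, amplitude) and energy(structure) are hypothetical
    placeholders for the real internals. The spread of nearby energies
    hints at whether the well is steep or shallow, and the best neighbour
    can 'replace' the original, repeating the process.
    """
    scored = []
    for _ in range(n_samples):
        neighbour = perturb(structure, amplitude)  # small random wiggle
        scored.append((energy(neighbour), neighbour))
    scored.sort(key=lambda pair: pair[0])
    best_energy, best_neighbour = scored[0]
    spread = scored[-1][0] - scored[0][0]  # crude steep-vs-shallow estimate
    return best_neighbour, best_energy, spread
```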

m0ti
12-16-2002, 12:41 PM
Hi Howard!

Nice to see you guys continuing to move forward and putting out better and better clients!

Sounds like what you're doing is quite similar to what I suggested, though it's closer to a gradient-descent-type method than a (more random) sub-space exploration method. Which is pretty good, since there are a lot of well-studied gradient-descent methods out there in use.

I guess then you'll be doing something along the lines of:

1) Make a random fold.
2) If its score is not "close" to the score of the best fold, go back to 1.
3) Wiggle the protein about, try to find the steepest-descent direction, and follow it a short way. Repeat until at a local minimum.
4) If better than the best fold, mark the current fold as the best.
5) Go back to 1.

Of course, you'll probably be supplementing that with more practical methods (Newton's method, etc.) which are a good deal faster.
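
As a sketch, that loop might look like this, reusing the hypothetical random_fold(), energy(), and probe_neighbourhood() helpers from the earlier sketches (the "closeness" margin is an arbitrary assumption):

```python
CLOSE_MARGIN = 5.0  # arbitrary threshold for "close to the best score"

def search(n_restarts=1000):
    best = random_fold()
    for _ in range(n_restarts):
        fold = random_fold()                            # 1) make a random fold
        if energy(fold) > energy(best) + CLOSE_MARGIN:  # 2) not close enough
            continue
        while True:                                     # 3) descend by wiggling
            neighbour, e, _spread = probe_neighbourhood(fold)
            if e >= energy(fold):
                break                                   # roughly a local minimum
            fold = neighbour
        if energy(fold) < energy(best):                 # 4) new best fold
            best = fold
    return best                                         # 5) restart via the loop
```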

I hadn't considered checking the immediate area (for stability/smoothness, I suppose) to see whether the fold found was a false positive, though it does make sense. That should have some benefit for the performance of the scoring system.

bwkaz
12-17-2002, 12:37 PM
Oh, OK, m0ti, I see what you were saying. Finally. :p

m0ti
12-17-2002, 01:17 PM
bwkaz: good! :rotfl:

It's general optimization stuff for real-world problems: no a priori information, and having to try to find good solutions anyway. At least some of my studies were worthwhile! :jester:

Now why don't we sit down to a beer? :cheers: