Optimization idea



m0ti
08-30-2002, 07:07 AM
Hi,

As I understand it, the current method is completely brute force, i.e. using the algorithm (and random numbers) to generate many structures.

I was thinking that you could use structures that have good pseudo-energies to generate more candidates, in a fairly typical optimization (global minimum search) sense.

For example, you could, say, have 10M structures generated and from those select the top 10, and reissue them as part of the problem space (say 1/2 of the people fold the original protein at all times, to keep generating good structures, while the other 1/2 advance already-found structures; or at the beginning 100% generate the original, then it drops to 90%, then 80%, and so on). These top 10 structures are "unfolded" to the point where, say, only 10% of the folds have been made, and those become the starting structures for the reissue. 10M structures later, the top 10 are again selected, taken back to the 20% point, and reissued. (Of course, at the same time 10M more structures of the original protein are completed, and the top 10 of those enter the 10% phase.)

It would basically be a pipeline:

0% --> 10% --> 20% --> 30% --> ... --> 100%

Viewed as a global minimum problem, this should (statistically!) help out quite a bit. After all, a good structure is probably only about average among the possible folds along its path, so there is room to improve on it. This method basically involves refining each stage to produce a more optimal solution.
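To make the pipeline concrete, here's a rough Python sketch. It's a toy: a structure is just a list of fold choices, and extend / pseudo_energy are invented stand-ins for whatever foldtraj actually does; the real sample counts would be the 10M per stage described above.

import heapq
import random

N_FOLDS = 100              # toy: fold steps in a complete structure
BEAM_WIDTH = 10            # keep the top 10 structures per stage
SAMPLES_PER_STAGE = 1000   # toy stand-in for the 10M per stage

def extend(partial, n_steps):
    # Toy stand-in for foldtraj: append n_steps random fold choices.
    return partial + [random.uniform(-180.0, 180.0) for _ in range(n_steps)]

def pseudo_energy(structure):
    # Toy score (lower is better); the real client would use the
    # actual pseudo-energy of the completed structure.
    return sum(abs(angle) for angle in structure)

def pipeline(stage_step=N_FOLDS // 10):
    # Beam-search pipeline 0% --> 10% --> ... --> 100%, reseeding each
    # stage from the best partial structures of the previous one.
    survivors = [[]]                        # stage 0: the unfolded chain
    for done in range(stage_step, N_FOLDS + 1, stage_step):
        scored = []
        for _ in range(SAMPLES_PER_STAGE):
            start = random.choice(survivors)
            partial = extend(start, done - len(start))
            full = extend(partial, N_FOLDS - done)  # one random completion
            scored.append((pseudo_energy(full), partial))
        survivors = [p for _, p in
                     heapq.nsmallest(BEAM_WIDTH, scored, key=lambda s: s[0])]
    return survivors

best_partials = pipeline()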

Logistically, it would probably be a bit more complicated to run things, but it shouldn't be too hard.

Brian the Fist
08-30-2002, 08:44 AM
We are already internally tinkering with a 'genetic algorithm' type approach using the best structures generated, crudely similar to what you describe. We have not yet come up with a good way to parallelize this, though; the important constraints are that we don't want the client to have to contact the server to get work, and we don't want it to have to download updated, refined maps of conformational space every few hours, which is infeasible for most people.

We will hopefully come up with something in the next month or two that will work well in parallel and improve on the current brute force approach.

wirthi
08-30-2002, 01:10 PM
Hi,

The "Distributed Particle Accelerator Design" Project (www.stephenbrooks.org) seems to have similar troubles. Their Clients also uses genetic algorithms AFAIK; the client stores all done workunits in a file and generates new (better) results out of this database. If the user wants to, he can add some (250) of the best results returned to the server up to now to this file, what will significantly improve the quality of the client's results. The file of this "best results" is published every now and then.

Distributedfolding could do the same, and let the user update the client's database every day or so (manually or automatically, as desired). Of course this would cause a lot of bandwidth to be consumed ...
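In client terms, such a daily update could be as simple as the following Python sketch (the URL, file name, and JSON format are invented for illustration; the real DPAD file format is its own):

import json
import urllib.request

BEST_RESULTS_URL = "http://example.org/best_results.json"  # hypothetical
LOCAL_DB = "local_structures.json"                         # hypothetical

def merge_published_best(keep=250):
    # Merge the published "best results" into the client's local
    # database, keeping only the 'keep' lowest-scoring entries.
    with urllib.request.urlopen(BEST_RESULTS_URL) as response:
        published = json.load(response)
    try:
        with open(LOCAL_DB) as f:
            local = json.load(f)
    except FileNotFoundError:
        local = []
    merged = sorted(local + published, key=lambda s: s["score"])[:keep]
    with open(LOCAL_DB, "w") as f:
        json.dump(merged, f)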

m0ti
08-30-2002, 06:02 PM
I agree that the suggested solution definitely has a similarity to genetic algorithms, and indeed a sequence of folds is a very suitable representation for them. Despite this, I tried to keep the approach less GA-like, principally because of the large number of illegal structures that crossover would generate, all of which would have to be detected. I think the overhead of finding the illegal structures could be quite high.
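For illustration, the crossover-plus-legality-check loop would look roughly like this (a toy sketch: parents are lists of fold choices, and is_legal randomly stands in for the expensive clash detection):

import random

def crossover(parent_a, parent_b):
    # One-point crossover on two fold sequences.
    cut = random.randrange(1, min(len(parent_a), len(parent_b)))
    return parent_a[:cut] + parent_b[cut:]

def is_legal(structure):
    # Toy stand-in: a real check would rebuild the 3D coordinates and
    # test for steric clashes, which is the costly detection step.
    return random.random() > 0.5

def breed(population, n_children):
    children = []
    while len(children) < n_children:
        child = crossover(*random.sample(population, 2))
        if is_legal(child):  # many candidates die here: that's the overhead
            children.append(child)
    return children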

Perhaps a somewhat mixed approach could be used. When a client uploads data, it receives back one (or several) proteins which performed well. It could then unfold the protein a random number of steps back and try folding it forward again.
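That per-client step is only a few lines (a sketch, where refold stands in for the client's normal forward-folding routine):

import random

def refine(structure, refold):
    # Unfold a random number of steps back, then fold forward again.
    steps_back = random.randrange(1, len(structure))
    return refold(structure[:-steps_back])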

My original idea, in many ways, is more like simple hill climbing with backtracking. The best guesses so far are greedily chosen, backtracked to an earlier starting point, and run from there, in the expectation that even better results exist.

Of course, any of these solutions could be implemented on both a local and a global level. In other words, as pointed out above, the client would use its own "database" of structures (ones it generated itself) most of the time, but once in a while it would get structures from the global level and use those as well.

There really are innumerable ways of tackling this problem.

It's good to hear that you're investigating them, and I hope that the algorithm can be improved.

Mikus
08-30-2002, 07:06 PM
It seems to me that currently the client is starting (randomly) anew with each structure it builds.

Isn't one technique to start the first of a series of structures randomly, but, upon identifying a "worthwhile" one, to base the following try on the previous structure and try to "improve" it?

Only when the following try builds a "worse" structure would the client jump to a new (random) starting point.

[The advantage is that the client might be "converging" toward a local optimum that is reachable from wherever the original (random) starting point was.]
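A sketch of that loop in Python, where random_structure, improve, and score are hypothetical stand-ins for the client's generate/perturb/score routines:

import random

def search(random_structure, improve, score, n_tries):
    # Keep building on the previous structure while tries keep
    # improving; jump to a new random starting point when a try
    # comes out worse. Lower pseudo-energy = better.
    best = current = random_structure()
    for _ in range(n_tries):
        candidate = improve(current)
        if score(candidate) < score(current):
            current = candidate
        else:
            current = random_structure()  # worse: random restart
        if score(current) < score(best):
            best = current
    return best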

mikus

Brian the Fist
08-31-2002, 12:24 PM
Unfortunately, it is well known that local minima riddle the protein energy landscape, and must be avoided if you ever hope to reach the global minimum.

Anyways, glad to see people are thinking about the science again. I see some good ideas here. Feel free to continue this useful thread.

m0ti
08-31-2002, 02:24 PM
Well, of course local minima riddle the landscape. If they didn't, it would be a much, much simpler problem (greedy hill climbing, or in this case "valley descending?", would do the trick).

In any case, there is no real chance of covering the entire search space.

The foldtraj algorithm is effectively a heuristic that suggests the next step to take in a search attempt.

Things could be improved by applying common techniques for large search spaces. Most of them will probably have fairly similar effectiveness (on average), so I guess it would be best to choose the one that is easiest to implement, in code but more importantly logistically (in terms of how many updates people have to go through, and the amount of work that has to be done behind the scenes).

An interesting step (which should first be implemented server-side) would be to cluster the solutions (K-means, anyone?). One could then build a primitive map of the landscape and perhaps use it as a clue to where good structures may lie.
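Server-side, that could start as small as this sketch using scikit-learn's KMeans (toy data; the real feature vectors, e.g. torsion angles, would come from the stored structures, with their pseudo-energies alongside):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.uniform(-180.0, 180.0, size=(10_000, 50))  # toy structures
energies = rng.random(10_000)                             # toy pseudo-energies

km = KMeans(n_clusters=20, n_init=10).fit(features)

# Mean pseudo-energy per cluster: a primitive map of the landscape,
# hinting at which regions tend to hold good structures.
for c in range(km.n_clusters):
    members = energies[km.labels_ == c]
    print(f"cluster {c}: {len(members)} structures, mean {members.mean():.3f}")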

Hmm, there really are just so many techniques... still, I guess it's good to get 'em out there. It may produce something that's cheap and easy enough to do, or it may help jump-start some really good idea.

runestar
09-05-2002, 11:58 PM
Howard,

I was just curious whether any other protein folding groups have contacted you about possibly working together... i.e. a meeting of minds, so to speak.

Best,

RuneStar½

Scoofy12
09-06-2002, 12:19 AM
Originally posted by runestar½
Howard,

I was just curious whether any other protein folding groups have contacted you about possibly working together... i.e. a meeting of minds, so to speak.

Best,

RuneStar½

hehe... you must not be familiar with academia :) It can be as competitive as any industry, depending on the situation. And of course anyone else participating in CASP would be a competitor :) But if you're referring to the F@H project, it's been pretty well established that its goals are different from DF's (check the threads in the "Educational" forum) and that the two are not well suited to working together.

runestar
09-06-2002, 12:48 AM
Well, I wasn't talking about F@H... that's why I said in general. There are quite a number of folding projects from what I can see... I can't believe they are ALL completely different from each other. It seems to me that if a number of areas overlap, it would be rather silly not to at least consider collaboration that might result in a better project than either one individually...

RS½

tpdooley
09-06-2002, 04:17 AM
If you're trying to keep us all from optimizing dead ends, and instead randomly trying to find the best structures, why not spend 90% of the time creating new structures, and the other 10% working on the best of the randomly generated structures plus the best of the optimized structures based on them?
(always giving it a bit of new blood to experiment with..)
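As a sketch, the scheduling decision is tiny (hypothetical task labels; best_structures would hold the current best random and optimized structures):

import random

def next_task(best_structures, explore_fraction=0.9):
    # 90% of the time: generate a brand-new random structure (new blood);
    # 10% of the time: refine one of the current best.
    if not best_structures or random.random() < explore_fraction:
        return ("generate_new", None)
    return ("refine", random.choice(best_structures))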

jlandgr
09-06-2002, 08:45 AM
Originally posted by Scoofy12
But if you're referring to the F@H project, it's been pretty well established that its goals are different from DF's
Agreed. I would be interested to know, however, if there is more of an overlap with G@H, the sister project of F@H. After all, G@H is working on structure prediction, too, if I understood it correctly.
And, yes, I would be very interested to see what a genetic algorithm could do to refine the most promising structures, as Howard hinted at above. I hope the problems that persist at the moment will be overcome :). Personally (I'm not on broadband, just 64k ISDN), I would find a WU upload & download once per day, not exceeding ~1 MB, acceptable. But I agree that some may not want this, or may even rely on the current "auto-caching", i.e. creating and saving work without an internet connection and simply sneaker-netting it once in a while.
If a genetic algorithm approach were simply used to refine the best structures, could the groundwork of brute-force sampling to find candidates to feed the GA still be done by users unwilling or unable to have lots of internet traffic, much as tpdooley suggested?
Anyway, good luck, Howard, on coming up with a solution,
Jerome

MAD-ness
09-06-2002, 06:23 PM
The guy doing the Distributed Particle Accelerator project is doing something similar to what was suggested here. It isn't a genetic algorithm, I don't think, but he takes the best results, packages them up, and users can (optionally) download them and use them as 'seed' data.

I don't know enough to go into more detail on it though.