
Thread: New client that much better??

  1. #1

    New client that much better??

    In the old days, I remember that each protein we folded had varying attributes: 1) speed, 2) best RMSD.

    So it was nothing to see a protein pass having as little as half the RMSD of the last one.

    Is this the same randomization we are seeing now? An RMSD of 4.89? That's low by version 1 standards. Is it because the new method is that much better?

    Kileran

  2. #2
    Ancient Programmer Paratima's Avatar
    Join Date
    Dec 2001
    Location
    West Central Florida
    Posts
    3,296
    Yes.

  3. #3
    Thanks for that well-thought-out answer, but I asked an A or B question, not a yes or no. Nice to see you read it.

    I'll phrase it more to the point:

    A) Lower number due to the type of protein we're doing

    B) Lower number due to the new method being better

  4. #4
    B

    Webmaster of the Distributed Project Research Group Italy Site http://www.dprgi.it
    Author of the DfGUI Italian translation http://gilchrist.ca/jeff/dfGUI/
    Author of the climateprediction.net's Italian mirror site http://www.climateprediction.net/versions/IT/index.php
    My DF PC's (thanks pfb)


  5. #5
    Not here rsbriggs's Avatar
    Join Date
    Dec 2002
    Location
    Utah
    Posts
    1,400
    To quote your original question

    Is it because the new method is that much better?

    To quote Paratima's original answer:

    Yes
    FreeDC Mercenary


  6. #6
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    As long as we're talking about differences between Phase I and Phase II...



    Let's see if how I think DF is doing this protein structure thing is even close to reality. I'm prepared for flaming, correcting, slapping about, etc., so fire away!



    Let's pretend that the protein we care about is only composed of 3 amino acids, call them Larry, Curly and Moe.

    Since there are only 3 of them, no matter how they are arranged they can fit on a table top (3 points determine a plane).

    Draw a graph coordinate system on the table top (X axis and Y axis)

    Put Moe at the intersection of the two axes (the point (0,0)) and call this point the origin.

    Since we've fixed the location of Moe relative to the X and Y axes we can now specify a position of Larry and Curly relative to Moe.

    Say Larry is 3 units to the right of Moe and 1 unit up (3,1), and Curly is 4 units to the right of Moe and 2 units up (4,2).

    When Larry, Curly and Moe are in this arrangement, there is a certain amount of force necessary to hold them there, or equivalently a certain amount of potential energy within the bonds between the acids. DF is somehow translating this force or bond energy for each amino acid into a distance measurement in Angstroms and finding the standard deviation of all of them (that would be the RMSD, or root mean square deviation, value that is shown). Let's say that they have a value of 10 Angstroms in this arrangement.
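
    (In rough Python terms, the RMSD I'm picturing is just this -- purely to illustrate the idea, not what the client actually computes:)

        import math

        def rmsd(deviations):
            # deviations: one distance per amino acid (in Angstroms) from wherever
            # its "ideal" position is -- however DF actually defines that
            return math.sqrt(sum(d * d for d in deviations) / len(deviations))

        # e.g. rmsd([10.0, 10.0, 10.0]) gives 10.0 for our three stooges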

    What phase I of DF did was pick random placements of Larry, Curly, Moe and about 100 or so of their friends (within some given biochemical constraints, such as Larry can only be next to Moe and Curly (the given sequence), no collisions of atoms allowed, probabilistic bond angles, rotations, energies, etc.) and calculate the RMSD for that arrangement. It did this for each structure that it randomly picked or generated.
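
    (Something like this, in made-up Python, is how I picture the phase I loop -- satisfies_constraints(), score() and protein_sequence are placeholders I invented, not real DF code:)

        import random

        def random_structure(sequence):
            # keep rolling random placements until the biochemical constraints
            # (sequence order, no atom collisions, sane bond angles) are met
            while True:
                candidate = [(random.uniform(-50, 50),
                              random.uniform(-50, 50),
                              random.uniform(-50, 50)) for _ in sequence]
                if satisfies_constraints(candidate):
                    return candidate

        # phase I: lots of independent samples, keep the best score seen
        best = min(score(random_structure(protein_sequence)) for _ in range(10000))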

    Back to Larry, Curly and Moe. This is kind of a hard part. For this example, we're going to create a 6 dimensional surface in a 7 dimensional hypercube that will represent the energy for each possible arrangement of Larry, Curly and Moe.

    I guess the best way to describe this is to use a 3D example. Let's say that each arrangement of the stooges can be mapped to one spot on another table top. In other words, when Moe is at (0,0), Larry is at (3,1) and Curly is at (4,2) on the first table top, we'll arbitrarily assign this configuration to the spot (1,1) on this new graph. But now we're going to add one more dimension, height above the table, that represents the RMSD value. Imagine that there's a sheet of very stretchable plastic suspended above the table and we have a ruler marked in units of RMSD. So at the table top point of (1,1) we push down on the sheet until it is at 10 units of RMSD above the table and stays there. Now do this for every possible arrangement of the stooges.

    That deformed plastic sheet is the energy surface. What DF was trying to do was find the lowest point on the entire surface. This lowest point would be associated with the lowest amount of energy required to hold Larry, Curly and Moe in a particular arrangement, or the lowest amount of potential energy within the inter-amino-acid bonds. I'm guessing that in nature, proteins naturally want to be at this lowest energy state, or at least in a nice deep well on the plastic sheet from above. Therefore, the shape of the protein and other chemical parameters (the arrangement of Larry, Curly and Moe) at this point is most likely going to be the way a protein will end up after it has folded.

    The best or lowest RMSD points for each different protein are what was submitted to the CASP5 contest. DF did okay (about middle of the pack I think).

    Back to the 6 dimensional surface in a 7 dimensional hypercube. In reality, the arrangement of Larry, Curly and Moe can be described by just stringing together all of the coordinate numbers (and/or bond parameters) along with the RMSD number into one long vector with 7 places: (0, 0, 3, 1, 4, 2, 10). The first 6 numbers are the table top coordinates for the stooges and represent a 6 dimensional surface (one weirda$$ table). The 7th number is the energy, or how high above the table this one arrangement of the stooges is located, and is what gives the 6D surface its bumpiness. It really is the same as our 3D example above, just that no one can imagine what a 6 dimensional surface looks like, at least I can't when sober!

    Okay that was DF phase I. A bunch of random points sampled on our bumpy surface.

    For DF II, what I believe is going on is that initially (generation 0), 10,000 points are picked at random and their energy is measured (the same as in DF I).

    Only now, the lowest energy structure(s) are picked. By the use of a simple genetic algorithm, the area around this point is explored (it'd probably end up being similar to a steepest gradient search with some ability to get out of localized minima, due to the way SGAs function what with mutation and cross-over) and a lower energy structure is returned (it would be the structure associated with the lowest energy state found within this region close to the best structure from generation 0). So instead of exploring random points, we're exploring random points and the area near them (or random dimples).
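
    (Again, this is just Python-flavoured guesswork about what "exploring the dimple" could look like, not the real client -- gen0_structures, fitness(), crossover() and mutate() are all placeholders:)

        import random

        POP, GENERATIONS, KEEP = 50, 250, 10

        population = sorted(gen0_structures, key=fitness)[:POP]   # best of the 10,000
        for gen in range(GENERATIONS):
            parents = sorted(population, key=fitness)[:KEEP]      # selection
            children = []
            while len(children) < POP - KEEP:
                a, b = random.sample(parents, 2)
                child = crossover(a, b)
                if random.random() < 0.05:
                    child = mutate(child)
                children.append(child)
            population = parents + children                        # elitism keeps the best
        best = min(population, key=fitness)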

    I think that this method is fundamentally no different than DF phase I, only instead of sampling a point we're now selectively looking at a small "good" area. What I think will happen is that we'll have a very similar distribution of RMSD values, only shifted lower (a good thing), but I don't think that Phase II takes full advantage of SGAs or evolutionary programming.


    What I think could be done is as follows:

    Step one (first generation set of the algorithm): carry out the same calculations as are done in DF phase II.

    Step two: report back the 50 structures from generation 250 for every client.

    Step three: add these 50 structures to a population or database of "good" structures, say the top 1,000,000 or whatever.

    Step four (second and later generation sets): when you work on your next 250 generations, generation 0 is not randomly generated; instead, 50 individuals or structures are randomly chosen from the population of 1,000,000 good structures and you continue evolving these through to generation 250. I believe that this employs a variation of the evolutionary concept called punctuated equilibrium (where, in this case, the perturbation to a genetically stable isolated population is the introduction of new members from another population that has also evolved in isolation).

    Step five: repeat steps two through four until no further appreciable improvement in RMSD is found.

    The benefit of this method is that both the isolated dimples and the space between them get explored (rough sketch below).
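
    (A rough Python-ish sketch of steps two through four -- the server side is imaginary, and random_gen0(), evolve(), best_n(), fetch_from_pool() and submit_to_server() are placeholder names, not anything DF actually exposes:)

        GOOD_POOL_SIZE = 1_000_000

        def one_cycle(first_cycle):
            if first_cycle:
                seeds = random_gen0(10000)              # step one: same as phase II today
            else:
                seeds = fetch_from_pool(50)             # step four: seed from the shared pool
            final_pop = evolve(seeds, generations=250)  # the usual 250 generations
            submit_to_server(best_n(final_pop, 50))     # steps two/three: top 50 join the pool

        # step five: the server keeps only the best GOOD_POOL_SIZE structures and
        # clients repeat cycles until the RMSD stops improving appreciably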

    Or so I think, but I could just be full of @#$%.

    Flame away!

  7. #7
    Administrator Dyyryath's Avatar
    Join Date
    Dec 2001
    Location
    North Carolina
    Posts
    1,850
    Nobody's going to accuse prokaryote of not thinking his answer through before posting.

  8. #8
    Senior Member
    Join Date
    Jul 2002
    Location
    Kodiak, Alaska
    Posts
    432
    When we used this 96 structure protein during the Phase I stage, we got a roughly 7.11A structure. That's with 2000+ users, running however many computers each, on the Phase I client for 4-5 weeks.

    With one of the beta clients, a mere 50 users in a week got a low score of 4.5A (it wasn't held together properly, so we're probably not using that particular beta client).
    With the old client, to get roughly 1A lower scores, Howard mentioned we'd have to have 10 times the production. So our low Beta client scores are all the more impressive for being better than 50? trillion folds from the Phase I client.

  9. #9
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    I don't think anyone would argue that Phase II isn't producing "better" results as measured by RMSD

    (which in and of itself is a whole 'nother ball of wax: how do you determine how good a structure is if you don't know what you're looking for in the first place? You have to go on assumptions about the nature of protein folding and the laws of entropy and physics, assume that your energy surface is an accurate representation of what a protein will see, and hope that you're right ... as well as run test cases, which I'm sure is being done with the protein that we're studying).

    What I'm getting at is that my understanding (and I'm probably way off base) is that the current phase II methodology is fundamentally no different than phase I. That doesn't mean that it won't return better results (it'd better, since we're concentrating on looking at the steepest gradient around selected good points). This is still a random initial search of the energy surface (gen 0) and a localized search around an "interesting" point (gens 1 - 250). By concentrating computational effort around the interesting points (which are not likely to be at a local minimum, hence the SGA used to explore the nearby area), you'd better dang well get better results. But it is still a random search when picking the initial point. Thus I'd expect a shift in the distribution of RMSDs (for the better, or lower in our case), but the shape of the distribution and the convergence rate will probably be the same (instead of converging to some value of, say, 8 or 9, we'll converge to, say, half of that), and the rate of new best values will be approximately the same as the rate of new best values found using phase I.

    What I'm saying is that what each client returns from the end of 250 generations of "evolution" is the same as, say, a population of creatures that are allowed to evolve in isolation to a particular environment (the local best point from gen 0). Without any kind of external perturbation (aside from the random mutation and whatever the outside parameters "laxness", "structure" and "levels" induce), the population of structures quickly reaches a "dynamic" equilibrium point where not much more progress will occur, and will probably just oscillate around the local minimum (for the RMSD values). You'd probably see that for a while a structure's RMSD value gets lower through the generations, then bottoms out and starts bouncing around this value, or probably getting worse as the outside parameters, i.e. "laxness", are played with (which I think is essentially changing the energy surface, or changing the rules in mid-game).

    To me, these outside parameters are an attempt to induce a perturbation to this population in order to expand the search area about the "best" point from gen 0. However, and I'm sure that Howard and company will quickly correct me if I'm wrong, what I think is happening is that the perturbation is not preserving the energy surface and will corrupt the validity of the results as it pertains to the original energy surface.

    Instead, another means of inducing a perturbation to an isolated population that exists in some sort of dynamic equilibrium is to expose it to individuals from another population that also evolved to a dynamic equilibrium. This will allow the best schemas (in the case of protein folding, a schema would be a short segment of amino acids and their local structure that has very low potential energy) to combine with those of the other population, and hopefully you will find a global minimum much quicker than just looking at areas around randomly distributed "good" points. Following a punctuated equilibrium methodology will, I believe, allow you to explore the areas on the energy surface bounded by these randomly distributed "good" points as well. This would be, I think, at least an order of magnitude more efficient.

    Of course, I don't have an understanding of what is doable with respect to the chemical properties of protein bonding in a native medium so like I said, I could just be spinning my wheels and full of it.

    prok

  10. #10
    Ancient Programmer Paratima's Avatar
    Join Date
    Dec 2001
    Location
    West Central Florida
    Posts
    3,296
    Dayum!

  11. #11
    Senior Member
    Join Date
    Mar 2002
    Location
    MI, U.S.
    Posts
    697
    Originally posted by prokaryote
    (which in and of itself is a whole 'nother ball of wax: how do you determine how good a structure is if you don't know what you're looking for in the first place? You have to go on assumptions about the nature of protein folding and the laws of entropy and physics, assume that your energy surface is an accurate representation of what a protein will see, and hope that you're right ... as well as run test cases, which I'm sure is being done with the protein that we're studying).
    Actually, it's pretty simple.

    You do these calculations for quite a while, on a control population of proteins, and you measure both the RMSD (since you know the real structure) and the value of some of your projected scoring functions (crease energy is, last I heard, what they're using now). You then figure the correlation between the two.
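
    (Roughly -- and I'm not claiming this is exactly how the DF folks did it -- you'd compute a correlation coefficient between the two scores over the control set; rmsd_values and crease_energies here are just placeholder lists:)

        def pearson(xs, ys):
            # plain Pearson correlation between two equal-length lists
            n = len(xs)
            mx, my = sum(xs) / n, sum(ys) / n
            cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            vx = sum((x - mx) ** 2 for x in xs)
            vy = sum((y - my) ** 2 for y in ys)
            return cov / (vx * vy) ** 0.5

        # a strongly positive r means the energy score is a usable stand-in for RMSD
        r = pearson(rmsd_values, crease_energies)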

    This measurement has been done already, with the Phase 1 client. Back when we were crunching for CASP 5, the scoring function wasn't RMSD (since for CASP 5, no one knew the real protein save the judges), it was crease energy, or it was crease energy plus something else.

    We're back to RMSD, but that's for a different reason (which, I believe, is explained in the Phase 2 FAQ on the DF website under About).

    Without any kind of external perturbation (aside from the random mutation and whatever the outside parameters "laxness", "structure" and "levels" induce), the population of structures quickly reaches a "dynamic" equilibrium point where not much more progress will occur, and will probably just oscillate around the local minimum (for the RMSD values).
    This is exactly what the protein folding "landscape" looks like -- it's absolutely riddled with local minima. Which is why the client doesn't just pick a point and then go downward ("downward" is fairly easy to calculate -- just take the gradient of the landscape, and head in that vector direction), but rather, it allows some rising of the RMSD (actually crease energy -- the gen 0 structures are scored by RMSD, but energy is used after that, IIRC) before it finds a different local minimum.
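
    (My mental model of "allowing some rising" is a Metropolis-style acceptance rule, something like the Python below -- which may or may not resemble what the client really does; energy(), perturb() and start_structure are placeholders:)

        import math, random

        def maybe_accept(current, candidate, temperature):
            # always take a downhill move; take an uphill move with a probability
            # that shrinks as the climb gets steeper
            delta = energy(candidate) - energy(current)
            if delta <= 0 or random.random() < math.exp(-delta / temperature):
                return candidate
            return current

        structure = start_structure
        for step in range(100000):
            structure = maybe_accept(structure, perturb(structure), temperature=1.0)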

    To me, these outside parameters are an attempt to induce a perturbation to this population in order to expand the search area about the "best" point from gen 0. However, and I'm sure that Howard and company will quickly correct me if I'm wrong, what I think is happening is that the perturbation is not preserving the energy surface and will corrupt the validity of the results as it pertains to the original energy surface.
    If by "energy surface", you mean the landscape (the thing riddled with local minima), then no, I don't believe it's getting changed by messing with laxness or whatever.

    Laxness isn't a parameter that affects how good a given structure ends up being; it's a parameter that affects whether the structure can even be built by the algorithm or not. With a given laxness level, certain arrangements of amino acids are disregarded because atoms get too close to each other. Raise the laxness, though, and that exact same structure might become acceptable.

    Instead, another means of inducing a perturbation to an isolated population that exists in some sort of dynamic equilibrium is to expose it to individuals from another population that also evolved to a dynamic equilibrium.
    Ooooh, nifty idea! Sounds like Phase 3 to me...

    Howard? Listening anymore?

  12. #12
    Member
    Join Date
    Oct 2002
    Location
    southeastern North Carolina
    Posts
    66
    the phase II FAQs do a really good job, especially for us non-biochemical types;

    RMSD usually relates to a variance, not an absolute coordinate system --
    the deviations of each amino acid placement from the ideal
    [usually some function of the square root of a sum of squares]
    " All that's necessary for the forces of evil to win in the world is for enough good men to do nothing."-
    Edmund Burke

    " Crunch Away! But, play nice .."

    --RagingSteveK's mom


  13. #13
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    Hi bwkaz, thanks for the response. It's great stuff to think about, isn't it?

    Okay, here are a couple of counter-points, etc.:

    Originally posted by bwkaz

    This is exactly what the protein folding "landscape" looks like -- it's absolutely riddled with local minima. Which is why the client doesn't just pick a point and then go downward ("downward" is fairly easy to calculate -- just take the gradient of the landscape, and head in that vector direction), but rather, it allows some rising of the RMSD (actually crease energy -- the gen 0 structures are scored by RMSD, but energy is used after that, IIRC) before it finds a different local minimum.
    That's partially what SGAs (Simple Genetic Algorithms) are good at (gradient searches, with the ability to eventually overcome local minima to find the more global minima, or maxima as the case may be), but given that you don't have many schemas to start with (a limited initial population given the expanse of possible configurations for the protein), you can only explore a local area around the "best" point. It depends upon the mutation rate and the cross-over rate of course, but most of those mutants and crosses are nonsensical and would probably be quickly discarded by the fitness function, I would think. Thus you're left with each client exploring only a dimple in the energy surface (really it would be the fitness function surface, whatever is used to measure best fit: energy, distance, etc.).

    Originally posted by bwkaz
    If by "energy surface", you mean the landscape (the thing riddled with local minima), then no, I don't believe it's getting changed by messing with laxness or whatever.

    Laxness isn't a parameter that affects how good a given structure ends up being; it's a parameter that affects whether the structure can even be built by the algorithm or not. With a given laxness level, certain arrangements of amino acids are disregarded because atoms get too close to each other. Raise the laxness, though, and that exact same structure might become acceptable.
    I think that it is, since if a structure originally can't be built, then it doesn't have an associated RMSD (or its RMSD is imaginary). By allowing the particular structure, you've changed/added to the local energy surface something that wasn't there originally. You've changed the environment/rules of the game. The individual should be deemed unfit and booted out of the gene-pool, perhaps? The other way to think about it is that it allows structures that can't occur; perhaps it may thus lead to a false minimum that is normally unattainable and isolated to the protein? Of course, it could be that the structure-building algorithm is not that precise. I don't know.

    I'm still of the opinion that it was a workaround added later in the development of the SGA to get around their (SGAs') tendency to quickly converge to a solution (not necessarily the optimal solution), where the fitness function differences become minuscule but overriding and schema selection disappears nearly entirely. The parameters perhaps provide a nudge, if you will, to get it out of a local well that is too deep. If it truly is the global minimum then it will return to it eventually. Unfortunately, I think that the nudge was applied to the energy surface (see above) instead of to the genome. Hence the idea about using members from other populations. That maintains the originality of the energy surface, maybe.


    Originally posted by bwkaz

    Ooooh, nifty idea! Sounds like Phase 3 to me...

    Howard? Listening anymore?
    Just doing some thinking.

  14. #14
    Target Butt IronBits's Avatar
    Join Date
    Dec 2001
    Location
    Morrisville, NC
    Posts
    8,619
    Originally posted by Dyyryath
    Nobody's going to accuse prokaryote of not thinking his answer through before posting.
    Whatever you do, do not let Larry Loen hear about this forum

  15. #15
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    Hi RaginSteveK,

    Originally posted by RaginSteveK
    the phase II FAQs do a really good job, especially for us non-biochemical types;

    RMSD usually relates to a variance, not an absolute coordinate system --
    the deviations of each amino acid placement from the ideal
    [usually some function of the square root of a sum of squares]

    Not a biochemical type myself; I've just done some work with SGAs, stats, AI, etc. I checked out the FAQs, but not too much was divulged about the methodology used for phase II. You have to deconstruct what may be being used by looking at the results and their behaviour over time. Of course, it's understandable that Howard and company don't want to divulge it all prior to publication, etc.

    RMSD = Root Mean Square Deviation = standard deviation = adjusted average variance, if you will. Distance is a magnitude, and in finding the RMSD you minimize the sum of the squared differences, so it doesn't really matter whether it's a fixed coordinate system or not.
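
    (For what it's worth, the textbook way to make the comparison coordinate-system-independent is a least-squares superposition before taking the RMSD -- the Kabsch method. A rough numpy sketch; I have no idea whether DF aligns structures exactly this way:)

        import numpy as np

        def superposed_rmsd(P, Q):
            # P, Q: (N, 3) arrays of corresponding atom coordinates
            P = P - P.mean(axis=0)                    # remove translation
            Q = Q - Q.mean(axis=0)
            U, S, Vt = np.linalg.svd(P.T @ Q)         # 3x3 covariance, then SVD
            d = np.sign(np.linalg.det(Vt.T @ U.T))
            R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # optimal rotation (no reflection)
            P_rot = P @ R.T                           # superimpose P onto Q
            return np.sqrt(((P_rot - Q) ** 2).sum(axis=1).mean())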

  16. #16
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    Originally posted by IronBits
    Whatever you do, do not let Larry Loen hear about this forum
    Don't get it?

  17. #17

  18. #18
    Minister of Misinformation magicfan241's Avatar
    Join Date
    May 2003
    Location
    Lionville, PA
    Posts
    641
    Let me clarify, IB. Larry Loen is probably the singular person with the HIGHEST average words per post count in the universe. And the majority of his writings mean as much as these do to me (a 10th grade student who hasn't had bio, chem, or physics yet).

    When I run across one of his posts, I have three options:
    1)
    2)
    3)

    or I could do this to IB: OR

    Your choice: what would you have done to the master of butt targets?

    (j/k)
    magicfan241

  19. #19
    Target Butt IronBits's Avatar
    Join Date
    Dec 2001
    Location
    Morrisville, NC
    Posts
    8,619
    Not a bad thing prokaryote, keep up the great posting!
    I was only kidding around.

  20. #20
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    Hey, no probs! I know I can be long-winded and the posts can get lecture-like. Sometimes I fall asleep reading what I've posted

    But seriously, I'd be interested in how the DF team has handled some of these issues, as I'm sure they've already dealt with them to begin with.

    One other question though: why not take the 50 best from the 10,000 initial and use the SGA to get decent schemas from them? Increase the number of cross-over points and you decrease the size of the good schemas, giving more versatility to protein structure construction. Then you could also implement the isolated population thing with the database of 1,000,000 good structures.

  21. #21
    Target Butt IronBits's Avatar
    Join Date
    Dec 2001
    Location
    Morrisville, NC
    Posts
    8,619
    From AMDZONE
    AMD has announced core math library availability.
    ACML is composed of highly optimized numeric functions for mathematical, engineering, scientific, and financial applications. The library, initially available with a FORTRAN interface, is comprised of a full implementation of optimized Level 1, 2 and 3 Basic Linear Algebra Subroutines (BLAS), Linear Algebra Package (LAPACK) as well as Fast Fourier Transforms (FFTs) in single-, double-, single-complex and double-complex data types. The 1.0 release will support 32-bit and 64-bit Linux as well as 32-bit Windows(r).
    http://www.amdzone.com/releaseview.cfm?ReleaseID=1113
    That should help the client go faster

  22. #22
    Social Parasite
    Join Date
    Jul 2002
    Location
    Hill Country
    Posts
    94
    How does Phase II "feel" to me compared to Phase I ?

    To me it seemed that in Phase I *all* structures had an equal probability of being a "good" one. In Phase II this seems to be true only of Generation_0 structures. [I believe that if the best structure found in Generation_0 is "not good enough", then Generations 1-250 will thereby be fated to produce disappointing RMSD values.]

    And I am suspicious of the worth of follow-on generations once a minimum RMSD value has been reached. [I was unlucky enough to have generations 246 thru 250 of one cycle take more than an hour apiece on a fast CPU -- at that point the RMSD was not dropping, and was much higher than its earlier minimum (found already at Generation_60 or thereabouts).]

    mikus

  23. #23
    Ol' retired IT geezer
    Join Date
    Feb 2003
    Location
    Scarborough
    Posts
    92

    Cool Giving Up Too Quickly

    Mikus said:
    And I am suspicious of the worth of follow-on generations once a minimum RMSD value has been reached.
    While I've seen several examples that would back you up, I've seen others where the minimum occurs much later. I think the project would have to go through their results to come up with a meaningful method of "WHEN TO QUIT" examining the "local terrain around this valley".

    One meaningful statistic which argues against you is that 7 of the top 10 current best RMS of A values were found at generation 250.... To me that would suggest that 250 generations is TOO EARLY TO QUIT.... Now the project should examine the results of those runs, determine how many minima/maxima were encountered in the process of getting there, and determine a CRITERION for WHEN TO QUIT versus terminating the run at a fixed number. (Just my $0.02 worth!)

    Ned

  24. #24
    Senior Member
    Join Date
    Mar 2002
    Location
    MI, U.S.
    Posts
    697
    Originally posted by IronBits
    From AMDZONE

    <blah blah, math stuff>

    That should help the client go faster
    Actually, it may not. The vast majority of the client's time is spent chasing pointers, not doing math, which is why CPU-level optimizations like vector math support (and SSE instructions, for that matter) hardly help it at all. Howard has said that a couple of times already.

    Well, actually, it was true in Phase 1. Now that I think about it, I'm not sure whether or not it's still true in Phase 2, though I would guess that it is.

  25. #25
    Senior Member
    Join Date
    Apr 2002
    Location
    Santa Barbara CA
    Posts
    355

    Re: Giving Up Too Quickly

    Originally posted by Ned
    Mikus said:


    ...snip

    One meaningful statistic which argues against you is that 7 of the top 10 current best RMS of A values were found at generation 250.... To me that would suggest that 250 generations is TOO EARLY TO QUIT.... Now the project should examine the results of those runs, determine how many minima/maxima were encountered in the process of getting there, and determine a CRITERION for WHEN TO QUIT versus terminating the run at a fixed number. (Just my $0.02 worth!)

    Ned
    The Generation number in the top ten is how many generations have been completed in the run that output that RMS, not the generation that the low RMS came from. You can use this to see if there is any hope of improvement of a particular number. If you click on the View Details link and scroll down quite a ways you can see the RMS vs generation graph.

  26. #26
    Social Parasite
    Join Date
    Jul 2002
    Location
    Hill Country
    Posts
    94

    Re: Re: Giving Up Too Quickly

    Originally posted by Welnic
    The Generation number in the top ten is how many generations have been completed in the run that output that RMS, not the generation that the low RMS came from. You can use this to see if there is any hope of improvement of a particular number. If you click on the View Details link and scroll down quite a ways you can see the RMS vs generation graph.
    Right. I feel it would be much more meaningful for the top ten listing to indicate the specific generation when the low RMS was found.

    Then one would __not__ have to go through extra steps (ten times !!) to gauge the worth of always doing 250 generations.

    mikus


    p.s. I think comparing the dfGUI charts of time_per_generation and RMS_per_generation is instructive.
    Last edited by Mikus; 07-03-2003 at 12:27 AM.

  27. #27
    Wouldn't it make the most sense to scrutinize the Gen 0 results the most before using them as the basis for 250 generations of work?

    I would love to see a variable gen 0 sampling - i.e. either

    A) allowing a user-set value between 10K and 100K structures (or 100 million, for that matter), then choosing the best structure... or
    B) continuing until a specified RMSD or similar judgement factor is reached. (generate until a sub-20 RMSD is reached, or a crease energy of XYZ, etc)

    I dunno what effect this would have on answering Howard and Co.'s phase-related questions (like how many random samplings are needed, or other such non-protein-specific queries), but 10K seems too little. Most of my gen 0 RMSDs are in the 40s, horrible even for Phase 1.
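
    (Option B, in made-up Python just to be concrete -- random_structure() and score() are placeholders, and I have no idea what thresholds would actually be sensible:)

        def sample_gen0(threshold, max_structures=100_000):
            # option B: keep generating until a good-enough score shows up,
            # with option A's user-set cap as a safety net
            best, best_score = None, float("inf")
            for _ in range(max_structures):
                s = random_structure()
                sc = score(s)                      # RMSD, crease energy, whatever
                if sc < best_score:
                    best, best_score = s, sc
                if best_score < threshold:
                    break
            return best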

    Although I do enjoy being a top 100 team all by myself now that the scores have reset.

  28. #28
    Senior Member
    Join Date
    Jul 2002
    Location
    Kodiak, Alaska
    Posts
    432
    It'd also be interesting to see what kind of structures are most likely to be helped through the Phase II engine and become record breakers. And then see if we can sort through the 10k gen 0 structures for the lowest AA structure with similar attributes to those record-breaking starter structures.

  29. #29
    Originally posted by AMDPHREAK


    I dunno what effect this would have on answering Howard and Co.'s phase-related questions (like how many random samplings are needed, or other such non-protein specific queries), but 10K seems too little. Most of my gen 0 RMSD's are in the 40's, horrible even for Phase 1.
    Technically, we cheated a bit here - the gen. 0 'RMSD' is in fact a fitness score ranging from 0-100 and is NOT RMSD. From gen. 1 onwards it is the true RMSD. Obviously it would be even more confusing to try and explain this in the graphs and such, so we just don't bother mentioning it, but a fitness score of 40 ain't so bad in this case. Sorry for the confusion.
    Howard Feldman

  30. #30
    Prokaryote more or less has the right idea, though I think he may be using RMSD and energy interchangeably in some places, while they are 2 very different entities.

    And to clarify, we are NOT using a genetic algorithm (yet). No crossover or mutation take place.

    We intend to use a GA later on, perhaps first just locally, but eventually mixing and matching structure fragments from all participants (as hinted at by Prok.). First we wanted to try out this simpler approach, and get a generally iterative algorithm working. The main issue with using a true GA is that it will basically alienate all our off-line folders. We may still come up with a clever way for this to work to some extent though. Anyhow, that won't come for a while still until we work out the kinks in the current method and see just what it is capable of. It will be a relatively simple task to switch to a GA now though since the hard part was getting the iterations working well (which they almost are...)
    Howard Feldman

  31. #31
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    Cool. Thanks for the info Dr. Feldman. May want to consider the isolated population / punctuated equilibrium approach for offline crunchers. When they do check back in they can get a new population and contribute what they've evolved.

    Wasn't sure if you and the crew were translating RMSD into energies or not. Both could be used for a fitness function I suppose though.



    prok

  32. #32
    Social Parasite
    Join Date
    Jul 2002
    Location
    Hill Country
    Posts
    94
    For the second time since Phase II started, my single machine ended up doing what I would call a massive amount of "unproductive" work. It took more than 30 hours (on a 1.5 GHz machine) to go from generation 220 to generation 250. And all the time the RMSD stayed more or less level at 8.6 or thereabouts.

    Unless there is information to be gained from beating and beating and beating a dead horse, I suggest that the DF client be shut off (and a new cycle begun from generation 0) whenever 10 generations take more than 10 hours, and the RMSD value does not appear to be dropping -- provided that a substantially better RMSD value was reached earlier in that same cycle.
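
    (In other words, a rule of thumb along these lines -- just Python pseudo-logic to make the suggestion concrete, not actual client code:)

        HOUR = 3600

        def should_restart(gen_seconds, gen_best_rmsd, cycle_best_rmsd):
            # one entry per completed generation: wall-clock seconds and best RMSD
            if len(gen_seconds) < 10:
                return False
            slow = sum(gen_seconds[-10:]) > 10 * HOUR             # last 10 gens took > 10 hours
            not_dropping = min(gen_best_rmsd[-10:]) >= gen_best_rmsd[-10]   # no improvement in the window
            better_earlier = cycle_best_rmsd < min(gen_best_rmsd[-10:])     # a better RMSD was found earlier
            return slow and not_dropping and better_earlier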

    mikus

  33. #33
    Please remember that the RMSD doesn't always go down. That is why we need 10000 folders and can't just do it ourselves. Only by repeating the experiment many times in parallel will a few 'get lucky' and reach a low minimum. You cannot tell beforehand which way it will go. And remember when predicting unknown proteins as we will ultimately do, you cannot compute the RMSD as the structure is unknown, and your whole point is moot. Energy will be used instead, and an increase in energy does not always mean the structure is getting much worse.
    Howard Feldman

  34. #34
    Social Parasite
    Join Date
    Jul 2002
    Location
    Hill Country
    Posts
    94
    Think of the visual image of a square-root sign. It starts with a dip, but then has an extended horizontal tail.

    What I'm saying is that if a lower RMSD has been achieved (the "dip"), but then the RMSD climbs and then stays level from then on (the "tail"), I fail to see what additional information is being contributed by further processing.

    I don't mind having my computer process such a "tail" if the time/generation is "reasonable". What concerns me is putting in "excessive" computation (many, many, many repeats per structure, most of which go to more than 200,000 alternate configurations) -- when the result achieved (no RMSD improvement) seems totally "useless" when expended on the __LAST__ generations of that cycle.

    mikus
    Last edited by Mikus; 07-18-2003 at 06:41 PM.

  35. #35
    I repeat, you do not KNOW what it is going to look like until you have carried out the computation. Do not judge the performance based on the few sets of 250 generations you have seen on your own machines - remember, there are thousands of these and the protein energy landscape is a multidimensional complex surface. It is not as simple as going up and down.
    Howard Feldman

  36. #36
    Social Parasite
    Join Date
    Jul 2002
    Location
    Hill Country
    Posts
    94
    I repeat, you do not KNOW what it is going to look like until you have carried out the computation. Do not judge the performance based on the few sets of 250 generations you have seen on your own machines - remember, there are thousands of these and the protein energy landscape is a multidimensional complex surface. It is not as simple as going up and down.
    Assume the best RMSD calculated for generation 250 ends up being numerically close to the best RMSD calculated for generation 249. And assume that a *significantly* lower RMSD was calculated at least 30 generations previously. [Remember -- 250 is the last generation in the cycle -- NO further client processing will be performed using the output of generation 250.]

    Given that DF would be seeing the results of generation 249, may I ask: Of what __USE__ was it to have a relatively fast CPU spend more than an hour to calculate that final generation 250 ?

    I have difficulty understanding that this hour of CPU time was not "wasted".

    mikus

  37. #37
    Ancient Programmer Paratima's Avatar
    Join Date
    Dec 2001
    Location
    West Central Florida
    Posts
    3,296
    Mikus:

    Are you expecting someone to write you a check for that "wasted" hour on your system?

    Go back and re-read the FAQ and the ABOUT sections on the DF site. This is a research project, not an engineering project. There are many excellent texts available that define the difference. In the very nature of research, a lot of time and effort may be expended without "positive" results. We still learn from the "negative" results, which includes what you refer to as "wasted" time.

    You said earlier:
    p.s. I think comparing the dfGUI charts of time_per_generation and RMS_per_generation is instructive.
    It WILL BE instructive, but not until we've done enough of them to create a decent sample. "Decent" here being determined by Howard and the other project staff, not us, the horsepower.
    Like that Roman dude said to Ben-Hur, "You exist to row. Row well!"

  38. #38
    Senior Member
    Join Date
    Mar 2002
    Location
    MI, U.S.
    Posts
    697
    Originally posted by Mikus
    Assume the best RMSD calculated for generation 250 ends up being numerically close to the best RMSD calculated for generation 249. And assume that a *significantly* lower RMSD was calculated at least 30 generations previously. [Remember -- 250 is the last generation in the cycle -- NO further client processing will be performed using the output of generation 250.]

    Given that DF would be seeing the results of generation 249, may I ask: Of what __USE__ was it to have a relatively fast CPU spend more than an hour to calculate that final generation 250 ?
    So... err... how would you "fix" this? You do not know at the beginning of any generation what the best RMSD for that generation will be -- if you did know this, we wouldn't be running this client, after all.

    In order to not spend those hours on gen 250, the client would have had to know, before gen 250 started, whether or not to do it -- and I'm sure you can see the impossibility of that.

  39. #39
    Social Parasite
    Join Date
    Jul 2002
    Location
    Hill Country
    Posts
    94
    Howard pointed out that the algorithm will eventually be tried on "unknown" proteins - for which a "deviation distance" (e.g., RMSD) cannot be calculated. So eventually a different measure (such as energy) will have to characterize the "goodness" of the calculations.

    But as long as RMSD is being used to gain admittance to the Top 10, I figure that if the best RMSD calculated in a previous generation in this cycle was 2.0 (or more) below the best RMSD calculated in generation 249, then NO WAY will the best calculated RMSD for generation 250 drop by 2.0 (or more) from the generation 249 RMSD. (Yes, I am making a "gut prediction", based on so far not being aware of any "steep" drops of an RMSD value that has risen from a previous "dip".)

    mikus


    p.s. There is one Top 10 entry for which the RMSD was still dropping after generation 240. That case might have produced an even better RMSD if it had kept on going beyond generation 250.

  40. #40
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    Hi Mikus,

    Can't answer for the current algorithm, since Howard has stated that we are not yet using a genetic algorithm. However, when we do, you can't predict with much certainty what the next generation's individuals' fitness will be like (it may improve (likely), or it may not).

    What is usually done is that an integration of the last "x" generations is taken and the delta of the fitness function is examined. If the rate of change is not "large enough", then further evolution or generations are terminated. This is similar to the argument that you are making, right?
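
    (Made concrete, it would look something like this -- just illustrative Python, with the window size and tolerance pulled out of thin air:)

        def converged(best_per_generation, window=20, tolerance=0.01):
            # stop evolving when the best value (RMSD, energy, whatever is being
            # minimized) has improved by less than `tolerance` over the last
            # `window` generations
            if len(best_per_generation) < window + 1:
                return False
            return best_per_generation[-(window + 1)] - min(best_per_generation[-window:]) < tolerance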

    I'm not sure what kind of search is being done with the current algorithm, but I bet that it is similar to a steepest gradient search with some random noise thrown in, maybe even using some sort of annealing procedure. The point is that the RMSD or energy surface for the point selected from generation 0 (whatever people are calling the fitness measure these days) is not a nice smooth valley or dimple. There are bumps within the dimple, and the steepest gradient search may get hung up in a local minimum. You don't know unless you nudge it a bit to get it out of that small pit within the dimple. If it is hung up, you'll see a "flat" line in the energy graph produced by dfGUI. The ups and downs along the flat portion will reflect the random noise confounded with any surface variabilities.

    Since each dimple is unique (they are independent from other dimples and not part of the same population (Howard, correct me if I'm wrong on this please, as there may be some similarity for other dimples that are "nearby")), you can't use a statistical sample to claim knowledge about a particular dimple based upon what other dimples have experienced. The only way to know with any degree of confidence is to execute "x" number of generations or explorations in a pseudo-random (downhill-biased) walk about the current dimple in question.

    Now is 250 the ideal number, or is it something less or more? I don't know; you need to know something about the variance experienced between generations to say something meaningful about a particular dimple. If there's large variability, then more generations will be needed, and if there's smaller variability, then fewer. I'm guessing that "250" is a somewhat arbitrary number, since we humans (with 10 fingers and 10 toes (most of us at least)) are biased towards numbers ending in "0" and "5". It was probably established to catch worst cases.

    I really don't think that a global value can be established. It will depend upon each test case (my thought at least).

    prok (yawn!)

