
Thread: Smallest RMSD structure: The King of the Hill

  1. #1

    Smallest RMSD structure: The King of the Hill

    OK, I've got to admit it. One of the things I always check each day is my smallest RMSD structure. At present, that's 6.30. Oh well, I'm only crunching with a single 500 MHz Dell computer. Anyway, what I also always check is the current title holder.

    Just yesterday (or the day before), Millenko (4.73) dethroned Zaphod (4.97) as the new title holder. Zaphod held it for so long I didn't know if s/he (it?) was ever going to be surpassed.

    What I think would be fun to know is the date when a member of the Top Ten made that list. And possibly how many days the King of the Hill has been on top. Perhaps even when they're dethroned, the "how many days on top" stat could still be given for them.

    Yes, yes, this isn't going to contribute to the science ... directly ... but I would find such fun facts enjoyable to know and I think such things are good for keeping crunchers' interest up ... which is going to be harder and harder to do as the proteins get bigger and bigger and thus take longer and longer to finish. That, and it would be a very cheap little additional pat on the back for those that did get to make the Top Ten ... especially for present and past Kings of the Hill.

    On a related matter...

    I assume that as we crunch through this big protein, new Kings of the Hill will become less and less frequent, since each new King lowers the bar a little more. I wonder if any of our math wizards here can give a good time estimate and odds of a cruncher dethroning Millenko. If some team were to calculate these odds each time a King is dethroned and another one is crowned, please give the URL to its webpage as I'd like to check it when a dethroning takes place. Better yet, if such a fun fact could be sent in an email, I'd love to sign up for that mailing list.

  2. #2
    25/25Mbit is nearly enough :p pointwood's Avatar
    Join Date
    Dec 2001
    Location
    Denmark
    Posts
    831
    I think these are excellent ideas!
    Pointwood
    Jabber ID: pointwood@jabber.shd.dk
    irc.arstechnica.com, #distributed

  3. #3
    Sounds like a fair idea.
    To whet your appetite for now, a very crude estimate (order of magnitude) for the probability of getting a low RMSD structure with <= X angstroms RMSD is:

    exp(-exp((15.5-X)/3.06))

    where exp(x) is 2.71828... raised to the power x. The origins of this formula are explained in our 2002 paper. For example, for X=4.73, the current record, you get 2.1 x 10^-15.
    Since we've only made 1.5 x 10^9 structures so far, it is a crude estimate though.
    In fact, if you sub in this number and solve for X, the EXPECTED best RMSD to date is 6.16 A.
    Anyways, the more mathematically inclined folks should get the idea by now.
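    For anyone who wants to play with the numbers, here is a rough sketch in Python (a toy illustration of the formula above, not project code; the function name is made up) that reproduces both figures: the ~2.1 x 10^-15 chance for X=4.73, and the expected best RMSD after 1.5 x 10^9 structures.

    ```python
    import math

    def p_at_most(x):
        """Crude probability that a single structure has RMSD <= x Angstroms,
        using the extreme-value-style fit quoted above (location 15.5, scale 3.06)."""
        return math.exp(-math.exp((15.5 - x) / 3.06))

    # Chance of any one structure beating the current King of the Hill (4.73 A):
    print(p_at_most(4.73))          # about 2.15e-15, i.e. the 2.1 x 10^-15 quoted above

    # One way to read "expected best RMSD so far": solve N * p_at_most(X) = 1 for X.
    N = 1.5e9                       # structures generated to date
    expected_best = 15.5 - 3.06 * math.log(math.log(N))
    print(expected_best)            # about 6.165, i.e. the ~6.16 A figure quoted above
    ```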
    Howard Feldman

  4. #4
    Howard,

    I didn't know there was already a math formula for this.

    Could you rig it into a four-layer stat for the King which shows the King's RMSD score on top, the date of the crowning, the number of days on top, and the odds of beating it before the cut-off happens? For past Kings, you'd just leave off the last stat, since I think people would just be interested in the odds of beating the current title holder. As for those that make the Top Ten but never earn the title of King -- even for a second -- they'd get just the top two layers of stats.

    Also, I would assume the odds would steadily increase as the cut-off line gets closer and closer. Does that formula take that into account as well?

  5. #5
    Very nice idea !!!
    Team Rechenkraft.de
    www.rechenkraft.de

  6. #6
    dismembered Scoofy12's Avatar
    Join Date
    Apr 2002
    Location
    Between keyboard and chair
    Posts
    608
    Originally posted by Scott Jensen
    Howard,

    Also, I would assume the odds would steadily increase as the cut-off line gets closer and closer. Does that formula take that into effect as well?
    I assume you mean the odds against getting the winning structure? I think what howard means when he says the odds are 2.1 x 10^-15 is that this is the probability, for any given generated structure, of that structure beating the current king. sooo... that means that on average you would have to generate 1 over that amount (4.76 x 10^14) of structures to beat 4.73 angstroms. hence howard mentioning that the amount of structures we actually generated to get that is several orders of magnitude lower (we generated 300,000 times FEWER structures than expected). a crude estimate indeed, but great that we are doing better than expected

  7. #7
    Originally posted by Scoofy12


    I assume you mean the odds against getting the winning structure?
    Yes.


    Originally posted by Scoofy12

    I think what howard means when he says the odds are 2.1 x 10^-15 is that this is the probability, for any given generated structure, of that structure beating the current king. sooo... that means that on average you would have to generate 1 over that amount (4.76 x 10^14) of structures to beat 4.73 angstroms. hence howard mentioning that the amount of structures we actually generated to get that is several orders of magnitude lower (we generated 300,000 times FEWER structures than expected). a crude estimate indeed, but great that we are doing better than expected
    So if the number of structures projected to be needed to beat the current King is greater than the number of structures still to be accepted before the cut-off for this protein, does that mean the odds of beating the King are projected to be zero?

    And if the number given by Howard's formula were less than the number of structures still to be accepted before the cut-off line, would the odds of someone beating the current King be 100%?

    Is this right?

    I was hoping for some kind of odds statement like "50/50 chance", "1 in 23 chance", or "42,943 to one chance", with these odds steadily changing as the cut-off line gets closer and closer. Could Howard's formula be interpreted in a way that gives such odds?
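    (For illustration, an odds statement like that could be computed from Howard's formula. Here is a hedged Python sketch, assuming each structure is an independent draw; the function names and the number of structures still to be accepted are made-up placeholders, not project figures.)

    ```python
    import math

    def p_at_most(x):
        """Per-structure chance of RMSD <= x, from the crude formula earlier in the thread."""
        return math.exp(-math.exp((15.5 - x) / 3.06))

    def chance_of_dethroning(king_rmsd, structures_remaining):
        """Chance that at least one of the structures still to come beats the current King,
        computed as 1 - (1 - p)^n, using a numerically safe form for very small p."""
        p = p_at_most(king_rmsd)
        return -math.expm1(structures_remaining * math.log1p(-p))

    # Purely illustrative inputs:
    p_any = chance_of_dethroning(4.73, 500_000_000)
    print(f"roughly 1 in {1.0 / p_any:,.0f}")   # on the order of "1 in 900,000" for these inputs
    ```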

  8. #8
    I can just make up some numbers if you want
    Anyways, I know what you are asking for (more or less) and will attempt to oblige in the not-too-distant future.
    Howard Feldman

  9. #9
    Junior Member
    Join Date
    Apr 2002
    Location
    Vancouver BC Canada
    Posts
    25

    No equations translation

    "What are the odds of knocking off the current king?" goes right to the heart of the problem as I see it. From my inadequate understanding, Howards best explaination of how this all works suggests that the RMSA's he is getting right now are wildly improbable. It is on the order of winning the lottery type improbable. But he has "won the lottery" several times in a row now. If somebody won the real lottery 5 times in a row, most people would start wondering why it is so easy to win.

    Do the chances of knocking off the current king approach zero? My rule-of-thumb estimate says no. There are still pretty frequent changes to the top ten. Since the numbers returned from theory don't match the current experimental numbers, the theory may be too incomplete to use for such predictions. As Howard says, given the number of structures crunched to date, the best RMSD SHOULD be around 6.16. Personally, mine is 5.59 and I have run only 800,000 structures. If these results are validated, it is a major discovery. In essence, it implies that somebody running a handful of computers could get a good enough structure within a few weeks.

    He has a few things to get done before he celebrates. He has to validate that nobody cheated with the structures sent in to him. He has to make damn sure he isn't getting some seriously goofball result deep in his software that is screwing up the numbers. Next, he must come up with a very convincing explanation as to why the software is 300,000 times better than theoretically expected. This last one may not be easy.

    He has a good solid theoretical basis for believing the DF approach, if given billions of structures, will find a close enough structure (defined as less than 6.0 RMSD). His calculations suggest that the numbers of structures required are roughly:

    Small: 1 Billion
    Medium: 10 Billion
    Large: If it works on small and medium he can get lots of computers.

    Considering the nature of the project and his answers to the "current king" problem, he probably expected the results would be something like 5.5 as the average "winning" RMSD, with a couple of hundred non-winning structures below 6.0. As an experienced researcher, he also probably tossed in a hefty fudge factor.

    Instead of behaving as planned so he could write the dissertation, the experiment has returned some unexpected results.

    Small: Thousands of RMSDs way, way, way below 6.0. Like 2.03.
    Medium: RMSD below 5.0 with less than 20% of structures complete. Thousands of structures at less than 6.0.

    This is the classic raison d'être of experimental science: to find the point where your results are significantly different from theoretical expectations.


    Remember, Howard is a grad student, working on his doctoral thesis. If he can survive the guaranteed, incredibly brutal examination of his answers to the above problems, he gets to be a PhD.

    My bet is that he will be able to show his code isn't whacked. It would seem that cheating is unlikely since his method requires proof of work completed. So the really hairy problem for Howard is explaining why his software is 300,000 times faster than expected. It is not likely to be just his fudge factor. Either there is something fundamental missing in folding theory on the biology side or something fundamental missing in the computer science. Or both. I also bet that Howard will wait for CASP5 before committing himself completely. As a completely blind test, it lets him validate the software. If it does well, he wouldn't need an explanation for the results; he would have independent substantiation.
    When they can make a plaid fish, my job will be done.

  10. #10
    Hmmm..
    I'm not going to go into the math(s)... But it does seem a strange distribution, viewing from the seat of the pants...
    I don't really understand the SPA they are doing here... but I would've expected more changes in the RMSD than what we have seen to date!?

    One has a feeling for these things... being a cenobite and not a math or penguin-computer thingy!

    Bio, Stats, Maths,..and Linux!! eeeeek

  11. #11
    Forgot to say!
    Results don't follow a normal dist..
    I shall check back in a week!!:sleepy:

  12. #12
    Another day

    Any chance of a graph of the RMSDs from all users?

  13. #13
    We have previously found (and published; Proteins 2002) that RMSD follows approximately an Extreme Value Distribution (see Google on that one). If you were to measure the height of the tallest person in each class of a college, the distribution would also match an EVD. It is definitely NOT Gaussian (normal).
    The reason we are already 300,000 times better is due to the approximation - it only approximately fits an EVD. Still, our 'expected' best of 6.16A isn't that far off from the 5.0 that is the 2nd best structure right now. It is great that we can estimate this at all though, as many cannot. This gives us the power to predict how much sampling of a given protein needs to be done to get something good. Another rule of thumb is every time our sampling size increases by a factor of 10 for a given protein, we expect roughly a 1A improvement. Thus if 5A is our best after 1 billion, we expect something close to 4A for 10 billion.

    The number of structures any individual has generated is irrelevant. You could theoretically make just 1 structure and get, say, a 2A structure. It is like winning the lottery. Just because someone who's never played before buys a ticket and wins, that doesn't mean you should stop buying 100 tickets/week or whatever to increase your own chance of winning. Making more structures increases your chance of getting a low RMSD, but the important quantity is the TOTAL number of structures made by everybody altogether. As this grows, the best RMSD will gradually decrease.
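    To put the lottery point in code form (a toy sketch, assuming independent structures and taking the crude formula from earlier at face value; function names are just for illustration): the chance of the project landing at least one structure below a given RMSD depends only on the combined total, however it is split among crunchers.

    ```python
    import math

    def p_at_most(x):
        return math.exp(-math.exp((15.5 - x) / 3.06))

    def chance_anyone_beats(x, total_structures):
        """Chance that at least one of total_structures independent draws lands at or below x."""
        return -math.expm1(total_structures * math.log1p(-p_at_most(x)))

    # One cruncher making 1.5 billion structures on their own...
    print(chance_anyone_beats(4.73, 1_500_000_000))
    # ...gives the same overall chance as 15,000 crunchers making 100,000 each,
    # because only the combined total enters the calculation:
    print(chance_anyone_beats(4.73, 15_000 * 100_000))
    ```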
    Howard Feldman

  14. #14
    Howard,

    Thanks for the reply.

    RAJ, AEGION, BWKAZ, or someone:

    Could you explain that in a "bit" more layman terms? The Google search either produced porn ... ugh, just when you thought you knew what "extreme" could mean ... or technobabble.

    Also, what's "Gaussian" and what does the "A" in say 6A mean? And while I'm asking, could you also explain the "why" of the following: "...but the important quantity is the TOTAL number of structures made by everybody altogether."

    Lastly, can I get a college credit in computational biology for reading this forum?

  15. #15
    dismembered Scoofy12's Avatar
    Join Date
    Apr 2002
    Location
    Between keyboard and chair
    Posts
    608
    Originally posted by Scott Jensen

    Also, what's "Gaussian" and what does the "A" in say 6A mean? And while I'm asking, could you also explain the "why" of the following: "...but the important quantity is the TOTAL number of structures made by everybody altogether."

    Lastly, can I get a college credit in computational biology for reading this forum?
    Ok, hope I can help a bit... first of all, the A. The "A" is for angstroms (it's a very small unit of length - a ten-billionth of a meter). We measure the difference between the protein generated and the "real" one, the structure that appears in nature. When we talk about a structure having a 4A RMSD (root mean square deviation), that means that the generated structure deviates from the real structure by an average of 4 angstroms. Basically, the smaller the RMSD is, the closer the generated structure is to the real one.
    Next, Gaussian. A Gaussian or "normal" probability distribution is a fancy way of saying a bell curve. Remember algebra 2 from high school? Basically you have the mean or average in the middle, where most things are, and it falls off on either side (most things are in the middle near the average, and the farther out you go from the average, the fewer you get). Many random things fall into a Gaussian distribution because that's the kind of curve you tend to get when you have a lot of independent random variables working together and you add them up (it's been proven -- that's the central limit theorem). Probably all the structures generated, taken together, make up a Gaussian distribution.
    An extreme value distribution is the kind of distribution you get when you take, for example, the largest value out of each of a lot of sets. To use howard's example, if you took the tallest person out of each class at a college, or the smallest RMSD out of each set of proteins. The result is a different kind of statistical distribution, which to the untrained eye (like mine, heh) looks vaguely like a gaussian bell curve, but isn't.

    Attached is a picture of some extreme value distributions (where the minimum was taken), from http://www.itl.nist.gov/div898/handb...on1/apr163.htm
    if you are familiar with gaussian bell curves you may notice that unlike the gaussians, these shapes are asymmetric, but like gaussians, can have varying degrees of "spread-out-ness" (standard deviation).
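    (If you want to see that for yourself, here is a little Python toy of mine, not from the paper: the "Gaussian RMSDs around 15.5 A with spread 3 A" and the batch size of 1,000 are made-up numbers purely for illustration. Keep only the minimum of each batch, and the minima pile up well below the original average in a skewed, extreme-value-style shape.)

    ```python
    import random
    import statistics

    random.seed(1)

    # Made-up numbers for illustration only: pretend raw RMSDs are Gaussian
    # around 15.5 A with a spread of 3 A, and each "batch" is 1,000 structures.
    def batch_minimum(batch_size=1000, mean=15.5, sd=3.0):
        return min(random.gauss(mean, sd) for _ in range(batch_size))

    minima = [batch_minimum() for _ in range(5000)]

    # The batch minima land far below 15.5 and their histogram is lopsided
    # (a long tail on one side only) -- the extreme value distribution shape.
    print(round(statistics.mean(minima), 2))        # roughly 5.8 with these toy numbers
    print(round(min(minima), 2), round(max(minima), 2))
    ```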

    As for "...but the important quantity is the TOTAL number of structures made by everybody altogether."
    I think what Howard meant is that it doesn't matter how many any individual person produces; what's important is how much everyone together produces, since they all end up in the same pot anyway. It makes sense to think that since the structures are sorta randomly generated, the more you generate, the better your odds of getting good ones (kinda like having lots of lottery tickets gives you better odds of winning).
    computational biology? nope, you just got a crash course in statistics
    whew. hope that helps
    Attached Images

  16. #16
    Scoofy12,

    Thanks. That helped.

    Just a few questions on your explanation.

    How is the measurement difference measured? Couldn't a number of protein structures have the same score but look totally different and thus perform differently? -- For example, you and I are exactly ten miles from the Statue of Liberty. However, while I'm having a nice cup of espresso in the Bronx, you're sinking to the bottom of the Atlantic Ocean. -- If this analogy is reasonably accurate, how do we know which of the two identically scored protein structures will actually perform closest to the real protein?

    And this is with known proteins. How can you do such scoring/predicting with unknown ones like I guess CASP5 presents? Seems like you're giving us a titanium dart and asking us to hit the Statue of Liberty while being blindfolded ... and spun around ... at an altitude of 100 miles ... while orbiting the Earth every second.

    Hmmm. I guess if you tried it 10 gazillion times, you'd eventually pick off the pigeon pooping on the crown.

    Which raises the question of how will Dr. Hogue and Howard know which ones to submit to CASP5?

  17. #17
    Minister of Propaganda ColinT's Avatar
    Join Date
    Dec 2001
    Location
    San Diego CA
    Posts
    676
    You kids are so smart! Makes me want to live under the porch and not show my face.

    Feeling dumb,

    Colin
    Colin Thompson

  18. #18
    Ancient Programmer Paratima's Avatar
    Join Date
    Dec 2001
    Location
    West Central Florida
    Posts
    3,296
    Move over.

  19. #19
    ColinT and Paratima,

    Go find your own porch! I was under here first.

  20. #20
    dismembered Scoofy12's Avatar
    Join Date
    Apr 2002
    Location
    Between keyboard and chair
    Posts
    608
    Originally posted by Scott Jensen
    Scoofy12,

    Thanks. That helped.

    Just a few questions on your explanation.

    How is the measurement difference measured?
    I'm not sure, exactly. In general, a root mean squared (RMS) value is just sort of a way to take the average value of a function without it coming out to 0. For example, a sine wave spends half its time at positive values and half at negative values; if you just took the average, it would be 0, which wouldn't tell you anything. But the RMS value basically squares the function (which makes it always positive), averages it, and then takes the square root. (The average value of a sine wave is zero, but its RMS value is the amplitude, or max value, divided by the square root of 2.) Presumably <conjecture> they have a function for proteins whose value is the distance in space from a spot on the actual structure to a corresponding spot (maybe the closest, but maybe not) in the generated one, so high values of the function mean large differences between the generated and real structure. Then they take the RMS value (so, for example, it would not make a difference what direction the distance is in, positive or negative X direction for example; you would just get a sort of average value for the raw distance). </conjecture> So a higher RMS value means more difference between the generated and real structure, or a less good approximation.
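    Here is a bare-bones Python sketch of the root-mean-square part (my own toy example with made-up coordinates; the real RMSD calculation also superimposes the two structures optimally before measuring, which is skipped here):

    ```python
    import math

    # Matched pairs of atom coordinates (x, y, z), in Angstroms -- purely illustrative:
    real_structure      = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.2, 0.0)]
    generated_structure = [(0.1, 0.0, 0.0), (1.4, 0.3, 0.1), (2.8, 0.0, 0.4)]

    def rmsd(coords_a, coords_b):
        """Root-mean-square deviation between corresponding atoms:
        square the distances (always positive), average them, then take the square root."""
        squared = [
            (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
            for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
        ]
        return math.sqrt(sum(squared) / len(squared))

    print(round(rmsd(real_structure, generated_structure), 2))   # small number = close match
    ```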
    Couldn't a number of protein structures have the same score but look totally different and thus perform differently? ...
    If this analogy is reasonably accurate, how do we know which of the two identically scored protein structures will actually perform closest to the real protein?
    Yes, they could. I'm not sure whether you can know how they will perform or, if so, how. You'll have to ask Howard.
    And this with known proteins. How can you do such scoring/predicting with unknown ones like I guess CASP5 presents?
    Basically, for unknown structures, none of this RMS stuff applies, because you're right, we have no known structure to compare against and take RMSDs from. That's why they have this "scoring" problem you may have heard Raj and others talk about. Basically, as far as I understand it (and here we go outside of my knowledge of prob and stat and into the realm of biochemistry), they have a way of estimating the total energy (as chemical potential energy? not sure about this) in the molecule, and they take the lowest-energy structures, on the assumption that in real life, things that undergo chemical reactions tend toward the lowest energy state... (entropy and all that). So they guess that low-energy structures are more likely to be close to the real one.
    Seems like you're giving us a titanium dart and asking us to hit the Statue of Liberty while being blindfolded ... and spun around ... at an altitude of a 100 miles ... while orbiting the Earth every second.
    hehe... not quite. They have ways of reducing the sample space, or total possibilities for the structures they generate. For example, when the client stalls and backtracks, it has generated a structure that intersects itself and can't actually exist. I'm sure there are numerous other ways. This is the "sampling" problem that Raj talked about. So maybe this is more like not being spun around, but pointed in a direction that you knew was within 45 degrees or so of the statue... you'd still have to shoot a lot of darts, but not quite so many.
    Which raises the question of how will Dr. Hogue and Howard know which ones to submit to CASP5?
    I don't know if it's only the energy they go by, or if they have other scoring functions. Maybe Howard can help us with that, or maybe I or someone else can go read their papers.

    BTW, if it makes you guys feel any better, i just finished a 3-hour college course in "Probability and Random Variables in Electrical Engineering" which covered a lot of the statistics stuff i just rattled off... and EEs deal with RMS values all the time ... wow, who knew all this stuff was actually useful?
    Last edited by Scoofy12; 05-12-2002 at 11:12 PM.

  21. #21
    dismembered Scoofy12's Avatar
    Join Date
    Apr 2002
    Location
    Between keyboard and chair
    Posts
    608
    Wow, here's a quote from their Proteins 2002 paper about your Statue of Liberty question:
    "Protein structure prediction from sequence alone by "brute force" random methods is a computationally expensive problem. Estimates have suggested that it could take all the computers in the world longer than the age of the universe to compute the structure of a single 200-residue protein."
    (Guess that's because it's an NP complete problem, for you computer-science buffs)
    If you go to http://bioinfo.mshri.on.ca/trades/sampling.htm they have stuff about that. Structures are generated in a "kinetic self-avoiding random walk," which means "In its simplest form, a random walk begins by placing the first two residues, [amino acids in the protein? ... our current protein has 76 residues] and then proceeds to extend the chain one residue at a time, checking for atomic collisions as they are added, until the end of the protein is reached and a conformation [structure] is output."

    Apparently, as far as I can gather, they further reduce the possibilities by using a "trajectory distribution" for each residue: trajectory as in where it goes (like the trajectory of a ball thrown through the air), and distribution as in probability distribution. I think what this means is that they have sort of a map of which directions each residue is most likely to go in, and they pick directions from this map for each residue of the protein. Their algorithm also allows "geometric constraints," which I think means you can tell it that two certain parts of the protein are close together (i.e. if you knew residues 5 and 34 were close together in space, you could tell it that, and that would limit your possible structures)... it also "handles cis-proline residues as well as post-translationally modified amino acids." Whew. It's anyone's guess as to what those are! Maybe some kind of special case of certain amino acids as they appear in proteins. I dunno. This is complicated!
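    To make the random-walk idea concrete, here is a toy Python sketch of mine of a kinetic self-avoiding walk on a flat square lattice. It is a drastic simplification: the real client works in 3D with actual backbone geometry and the trajectory distributions described above, but the grow-one-residue-at-a-time, reject-collisions flavour is the same. All names and numbers here are illustrative only.

    ```python
    import random

    random.seed(42)

    def self_avoiding_walk(n_residues):
        """Grow a chain one residue at a time on a 2D lattice, rejecting any step
        that would land on a residue already placed (a crude stand-in for the
        atomic-collision check described above). Returns None if the chain gets stuck."""
        moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
        chain = [(0, 0)]
        occupied = {(0, 0)}
        while len(chain) < n_residues:
            x, y = chain[-1]
            options = [(x + dx, y + dy) for dx, dy in moves
                       if (x + dx, y + dy) not in occupied]
            if not options:
                # Stuck: the real client would backtrack here; this toy just gives up.
                return None
            step = random.choice(options)
            chain.append(step)
            occupied.add(step)
        return chain

    # Keep trying until a full 76-residue chain comes out without getting stuck:
    conformation = None
    while conformation is None:
        conformation = self_avoiding_walk(76)
    print(len(conformation), "residues placed")
    ```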

  22. #22
    Excellent job Scoofy! Your answers are almost completely on the ball (even your guesses)! Can we hire you?
    Anyways, choosing the best structures from the samples that we generate is the most difficult part of the protein folding problem. We attempt to use approximations of the energy of the structures to choose (low energy = good structure in general) but these approximations are not always accurate enough. We cannot compute the exact energy as it would be too time consuming, even with lots of computers!
    Howard Feldman

  23. #23
    Junior Member
    Join Date
    May 2002
    Location
    London, ON, Canda
    Posts
    3

    Question

    Okay, I was wondering about an equilibrium search. Now I understand why all the protein folding algorithms look for a low energy. As I understand it, the energy actually corresponds more directly with the forces in the molecule anyway.

    I find minimization to be a very fascinating area of computer science, so this whole project is just great to be a part of.

    Too bad there isn't a quick and dirty way to calculate the total potential energy; then the minimization would only require finding a structure where the forces balanced each other.

    On that note, I wonder if that is why some molecular dynamics simulations have found a lot of structures with lower energies than native proteins?

    Just putting thoughts out, trying to understand the problem better. Anybody care to go into detail on the energy calculations that can be done?

    Dan

  24. #24
    I read most of the science page, not that 2002 paper though I don't think, so what scoofy just cut and pasted sounded familiar, but it was mostly in one ear and out the other.

    All that statistics information was intriguing and well explained, Scoofy. If you have a good textbook recommendation on statistics, I would appreciate it. Otherwise I am gonna come calling on you every time I have a statistics problem.

    Hehe.

    Very good work.

  25. #25
    Statistics for the Utterly Confused. By Dr. Lloyd Jaisingh, I believe.

    Multivariate Statistics by Barbara [last name I usually goof up] Tabachnik

    Elementary Statistics by Mario Triola

    Probability and Statistics by Morris H. DeGroot and Mark J. Schervish

    Statistical Methods in Bioinformatics is interesting. Don't remember all the authors; it's at home with most of my other non-management, non-business, non-CS books... Umm, Gregory Grant and Warren Ewens (?) I think...


    For the engineering types - you can't live without -

    Numerical Recipes in C++: The Art of Scientific Computing
    by William H. Press, et al



    There are others that are good. Let me know if that list doesn't work out for you...

    I learned just about enough to get me on past where I was so I could get to wavelet decomp and other fun stuff. [grin]

    Ended up with an advanced degree in the field somehow, for _insert deity of choice_s' sake!

  26. #26
    *cough* *cough*

    So, Howard, when will the above mentioned fun facts be added?

  27. #27
    dismembered Scoofy12's Avatar
    Join Date
    Apr 2002
    Location
    Between keyboard and chair
    Posts
    608
    Is there some kind of user FAQ about the project itself? Maybe all this stuff and other things could be rolled into something like that... are there project FAQs at teams' websites this could be added to?

  28. #28
    Scoofy,

    I'm not sure which additional facts/FAQs you would like to see added (specifically or topic-wise).

    The vast majority of the facts mentioned in the thread can be found either on www.distributedfolding.org, under various sections, or at http://bioinfo.mshri.on.ca/trades/ for more detailed scientific information.

    However, if you feel that this thread lends itself to a new section on the website, or the coverage of additional topics, please let us know
    Elena Garderman

  29. #29
    This thread and "The Science of DF?" thread have been very educational. I'd suggested "simply" making a new section with the "Bug Tracking" and "Technical Support" and title it "Educational". Toss in good informative threads that have been discussed here in the general forum into that one, but if someone posts a reply to a thread in there, have it appear BOTH in there and in this general forum for WIDE discussion. I would not allow new threads to be started up in there. Have it be a moderator-placed-only thing. The threads have to start up in this general discussion forum and if they turn out to be fairly educational, honor them by tossing them into the Educational section.

    I'd also provide a link to that Educational section on the website. These threads are a lot better than standard FAQs because there's give-and-take between askers and answerers, and they are more geared toward the layman since it is us laymen doing the asking. Normal FAQs commonly fail because the ones-with-all-the-answers did both the "asking" and the answering, so they tend to think they have sufficiently answered a question when they haven't, and/or that they have "asked" all the questions people would likely ask. A lot of the time, the ones-with-all-the-answers don't realize when they're getting ahead of the horse or assuming "everyone" knows such-and-such fact, theory, etc. These threads are not perfect, but they have done a LOT better job than normal FAQs.

    As for slimming down the threads that you toss into the Educational section ... hmmm ... that could be a bit touchy. Also, sub-threads can be just as educational. It might be a bit messy, but I wouldn't edit them ... not even editing out this little discussion about educational threads and what to do with them, since in a way it is also educational about how this project is trying to educate crunchers about itself.

    By the way, I'd suggest that "Rate this Thread" be changed to "How educational is this thread?". First, I've never seen (or more likely paid attention to) such ratings before and doubt anyone has. Instead, ask "How educational is this thread?" and then have only two check-off boxes: "This is a good one to add to the Educational section" and "This is a good one to add to the Technical Support section". This would simply be a way for you moderators to know what we crunchers think was educational or good technical support. The voting would really have no power (it could not put a thread into the Educational or Technical Support sections); it would simply raise that thread to your attention for consideration for one section or the other.

  30. #30
    I have to say that I agree with Scott.

    A very minimal amount of energy/resources would go into doing something like what he suggests (I actually like the entire idea, but the core of it is what is most important) and it would (hopefully) increase awareness about the project and its science.

    Also, why do more work when we (the users) can or have done some of it for you? For example, Scoofy's statistical analysis, etc. This thread and a few others here have answered a LOT of my personal questions which I was not able to answer for/by myself as a result of reading the official web site (every single page I could find) and the TraDES site (been about 6 weeks since I went there, but I read the vast majority of it).

  31. #31
    We of course have no control over this forum, so I believe it will be up to Dryyrath (sp?) to implement these wishes.. what do you say D?
    Last edited by Stardragon; 05-24-2002 at 01:10 PM.
    Howard Feldman

  32. #32
    Hrmm. Guess I can't speak for Dyyryath, but I am under the impression he would set up anything reasonable that you requested, time permitting, of course.

  33. #33
    Administrator Dyyryath's Avatar
    Join Date
    Dec 2001
    Location
    North Carolina
    Posts
    1,850
    Hey guys, I missed this thread earlier. Thanks to Pointwood for sending me an email and bringing it to my attention.

    I'd be more than happy to set up another sub-forum (like Tech Support & Bug Tracking) for you guys. We'd like to make this as useful as possible for you guys.

    Brian, if you'd nail down the specifics (i.e. what to call it, who can post there, and who should moderate it), I'll get it set up immediately.

    Dyyryath

  34. #34
    Well D, if you read Scott Jensen's post about 4 messages up in this thread, that about sums it up. It should be a read-only forum called 'Educational', and we will simply move/copy useful, educational threads such as this one, and our previous rant with Raj, into that forum so people can find them if interested. Thanks.
    If possible, it would be good if we moderators had some way to move/copy a thread from another forum into this read-only forum (I have no idea how I would do that; I'd just ask you).
    Howard Feldman

  35. #35
    Administrator Dyyryath's Avatar
    Join Date
    Dec 2001
    Location
    North Carolina
    Posts
    1,850
    OK, you've got a place to put them. It's not *entirely* locked, however. I've currently got it set to allow anyone to post, but only posts that are OK'd by a moderator are actually displayed.

    If you think this will be too much for the moderators to handle, let me know and I'll lock the entire section so that the only means of adding content is to move threads from other locations.

    As always, if there is anything else we can do to make your home here more comfortable, don't hesitate to ask.

  36. #36
    Seriously Dyyryath, anywhere we can send beer or something? =)

  37. #37
    Yeah - a dead serious "ditto".

    And the job offer is still open, btw.

  38. #38
    First, the probability function. Reading carefully shows that the function given at the beginning of this thread is not a probability density function (PDF); it is the probability of obtaining a structure with an RMSD less than or equal to X, i.e. a cumulative distribution function. Therefore, taking the derivative of this function gives the PDF (it's really too complicated to write in text format). The peak (mode) of that PDF sits at 15.5A, which means one would expect the most frequently occurring RMSDs to be around that value. This is obviously not true; the most frequently occurring is 99.9999999. One thing that this PDF doesn't take into account is that structures with an RMSD of more than 100A are rounded down to 100, I think. The function as given says that essentially ALL structures will have an RMSD of 100A or less. In actuality, there is a big jump from the probability of getting a 100 to the probability of getting a 99, which is not reflected in the PDF.
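    For the curious, here is a short Python sketch of that derivative (a toy illustration that treats the thread's formula as an exact CDF and ignores the 100 A cap noted above; variable names are mine):

    ```python
    import math

    MU, BETA = 15.5, 3.06   # location and scale taken from the formula quoted earlier

    def cdf(x):
        """Probability of an RMSD <= x, per the thread's formula."""
        return math.exp(-math.exp((MU - x) / BETA))

    def pdf(x):
        """Derivative of the CDF: the probability density of the RMSD itself."""
        z = math.exp((MU - x) / BETA)
        return (z / BETA) * math.exp(-z)

    # The density peaks at the location parameter (the mode)...
    print(max(range(0, 101), key=pdf))          # 15 or 16, i.e. near the 15.5 A peak
    # ...while the mean of this distribution sits a bit higher, at MU + BETA * gamma
    # (gamma ~ 0.5772 is the Euler-Mascheroni constant):
    print(round(MU + BETA * 0.5772, 2))         # about 17.27 A
    ```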

    Next, there was a question about the energy. All matter in the universe is subject to four basic forces (depending how you count): the strong force, the weak force, the electromagnetic force, and gravity. For intermolecular interactions, really only the electromagnetic force matters: the strong and weak forces act only inside the nuclei of atoms, and gravity is far too weak to matter between molecules. Van der Waals forces are electromagnetic in origin too; they come from weak attractions between the shifting charge distributions of neighbouring molecules. The plainer electrostatic interactions depend on the charges of the particles (a proton having a +1 charge, and an electron having a -1 charge).

    There are two types of energy to be mindful of: kinetic and potential. Potential energy is what a particle (or any massive object) has when forces act on it that can accelerate it. When the particle is allowed to accelerate, the potential energy is changed into kinetic energy. The force, in both gravity and electromagnetism, falls off with an inverse-square law, and for an attractive interaction the closer one particle gets to another, the less potential energy the pair has. (To note, the particles can't accelerate forever, so the kinetic energy, which depends on a particle's velocity, gets changed into other types of energy when two particles get close enough together, such as sound energy, light energy, heat energy, chemical energy, etc.)

    Now, in chemistry, molecules are known to be more stable when they have less potential energy. To apply this to folding, the proteins will jiggle and bobble around until they find a setup which is stable enough to maintain. If it is not stable enough, the external forces from water molecules and such will force the protein to unfold and start over.

    So, when they say that proteins with lower energies are scored better, it is because they are more stable, and more accurately represent a molecule that will occur in nature.
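    As a purely illustrative aside (my own toy example, not the project's actual energy function), here is a sketch of the "lower energy = more stable" idea using a simple Lennard-Jones-style pair potential, one common textbook model of Van der Waals interactions: scan separations and keep the lowest-energy one.

    ```python
    # Toy Lennard-Jones-style pair potential: strongly repulsive up close,
    # weakly attractive at moderate range, fading to zero far away.
    # The epsilon and sigma values are arbitrary illustrative numbers.
    def pair_energy(r, epsilon=1.0, sigma=3.5):
        return 4 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)

    # Scan pair separations (in Angstroms) and pick the most stable (lowest-energy) one:
    separations = [2.5 + 0.05 * i for i in range(80)]
    best = min(separations, key=pair_energy)
    print(round(best, 2), round(pair_energy(best), 3))   # minimum near 2**(1/6) * sigma, about 3.93 A
    ```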

    Furthermore, there was mention of calculated proteins that have lower energies than the actual things. This is because, in the real world, if a protein finds a setup that is stable enough to withstand the external forces, it won't unfold and try to find an even more stable setup, simply because it doesn't need to. Say the external forces are 5 (example units); the protein needs a setup that can withstand 5 units of external force. Now imagine that the most stable setup of the protein that is computationally possible can withstand 7 units. But, in its bobbling and jangling, the protein first finds a setup that can withstand 5.5 units. That setup will stick, because the external forces won't pull it apart. However, it is not the most stable setup possible; it just works.

    I hope this helps at least somebody.

    Ciao
