How about client programs only submit "winning" structures?

Scott Jensen

05-03-2002, 02:46 AM

To dramatically cut down on server load and thus also the number, power, and expense of servers for DF, why not have our client programs only contact the server when they have better (smaller) RMSD structures than the current smallest one?

In this scenario, the client program crunches away as it currently does, but after each structure it constructs, or after a batch of structures (even 5,000), it compares their RMSD ratings against the current smallest RMSD it has on file for that protein. If none of the structures it has created beats that RMSD, it simply trashes all of them, starts creating new ones, and doesn't bother the DF server. If one does beat it, the client contacts the server and reports it, trashes the rest, and goes back to trying to create an even smaller RMSD one. If the reported structure beats the client's on-file RMSD but not the server's current smallest, the server updates the client with the current smallest RMSD structure (or possibly just the RMSD score) so the client doesn't bother it again until it has a better one. If the reported structure does beat the server's current smallest, the volunteer gets credit and their name goes up in lights on the smallest structure board.
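The flow described in this post can be sketched roughly as follows. This is only an illustration of the proposal, not how the DF client or server actually works: the names `Structure`, `Server`, `Client`, `submit`, and `process_batch` are all hypothetical, and the "server" here is just an in-memory object.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Structure:
    structure_id: int
    rmsd: float  # RMSD in angstroms against the known structure


class Server:
    """Stand-in for the project server (hypothetical, in-memory only)."""

    def __init__(self, initial_bar: float):
        self.best_rmsd = initial_bar  # current smallest RMSD for this protein
        self.contacts = 0             # how many times clients contacted us

    def submit(self, s: Structure) -> Tuple[bool, float]:
        """Record a submission; reply with (accepted, current smallest RMSD)."""
        self.contacts += 1
        accepted = s.rmsd < self.best_rmsd
        if accepted:
            self.best_rmsd = s.rmsd
        return accepted, self.best_rmsd


class Client:
    def __init__(self, bar_on_file: float):
        # Smallest RMSD this client currently knows about.
        self.bar_on_file = bar_on_file

    def process_batch(self, batch: Iterable[Structure], server: Server) -> bool:
        """Return True only if this batch produced a new overall best."""
        candidate = min(batch, key=lambda s: s.rmsd, default=None)
        if candidate is None or candidate.rmsd >= self.bar_on_file:
            return False  # trash the batch, no server contact at all
        accepted, current_best = server.submit(candidate)
        # Either way, the server's reply refreshes the bar we keep on file.
        self.bar_on_file = current_best
        return accepted
```

With a starting bar of 5.5, a client whose best structure is 6.1 never contacts the server. A client that submits 5.2 after someone else has already reached 5.0 is told about the 5.0 and stays quiet until it beats that.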

And to cut down on server load even further, when a new protein is selected, simply set the bar low to begin with: some RMSD number that Dr. Hogue is sure will be surpassed, but only after a good deal of time (folds). For example, an RMSD of, say, 5.5 for a protein like the one we're currently working on. This way there's no initial flood of clients submitting larger-RMSD structures that are very likely to be surpassed.

To prevent a client program from never moving on to the next protein because it never beat the smallest RMSD it had on file, have the client program "ping" the server ... hmmm ... once a week? Hmmm. Once every 100,000 folds? If once a week, the client would tell the server at each ping-in how many folds it has done since the last ping-in, so that count could be added to the folder's stats, and would then get the newest version of the client program, if there is one. If once every 100K folds, even less information would need to be given to the server, so less server time and load ... though the client would still receive any updated version of itself available at that ping-in. At these ping-ins, the server would ping back the current smallest RMSD score and, when a protein is getting close to done, instruct the client to ping back in after a shorter time span (or fewer folds) so that when roll-over to the next protein occurs, it isn't still working on the old one too long. Hmmm. Then again, to reduce server traffic, perhaps the server should NOT shorten the ping-in interval when the current protein is nearing its folds goal. The logic being that there really wouldn't be any harm in letting the client continue on: there's a chance it might just come up with a still smaller RMSD structure (which would be great if it did), the volunteer would still get credit for folds done (even though they've exceeded that protein's fold goal), no special instructions would need to be written for the server and client to shorten the ping-ins at some point, and there would be less of a flood of client programs pinging in to get the next protein.
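A minimal sketch of the ping-in check, assuming the two triggers floated above (a week, or 100,000 folds) are combined so that whichever comes first fires. The post leaves the choice between them open, so the combination is an assumption, and `PingPolicy` and `due` are made-up names.

```python
from dataclasses import dataclass

WEEK_SECONDS = 7 * 24 * 60 * 60


@dataclass
class PingPolicy:
    # Both limits are the figures floated in the post, not real DF settings.
    max_folds: int = 100_000
    max_seconds: int = WEEK_SECONDS

    def due(self, folds_since_ping: int, seconds_since_ping: float) -> bool:
        """True when the client should ping in, whichever limit is hit first."""
        return (folds_since_ping >= self.max_folds
                or seconds_since_ping >= self.max_seconds)
```

A server that wanted shorter intervals near roll-over could hand the client a `PingPolicy` with smaller limits. Under the "then again" option it would simply never do so.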

If the above were done, the DF webmaster should probably then replace the "Best structures generated to date" roster with "Milestones in structure creation" (MISC). MISC would chronicle each folder that beat the then-current smallest RMSD structure. MISC would give not only the folder's name, organization name, team name, and "Best RMS in A" score, but also the date when they achieved it. The first score on this milestone chart could be a sort of "Dr. Hogue's RMSD Challenge": the score that ships with each new protein and that a client would need to beat before reporting in a structure.

Now, what I could see killing the above idea is if the RMSD scoring program is a big program that would take up too much space on people's computers. If it isn't, I don't see any other problem with the above idea. Or am I missing something ... again?

muttley

05-03-2002, 03:56 AM

I'm just a newbie to this and read a lot and am good at research.

If I remember right there is a time out limit I would guess for slower computers, and also for computer crashes, lockups, errors, etc.

Preface... source code is sometimes not given out for security reasons. In this case a computer can revalidate that a result turned in is valid. However, what happens if a person deviates and sends in a signal that says they have not had a positive in the last 100,000 when in reality they had some?

Also, I think that having a database of conformers that are close could lead to something useful being discovered. (Hope that layman's answer is understood.)

In a race for the absolute least the c??? 4 and 5. Then all out modification of the program for a race is different.

Next, in my reading I think that someone asked if this was on 48-hours, and a person answered that the present system is being upgraded for 50,000 users or thereabouts, and 100,000 was coming. With SETI, their problem is the bandwidth of the internet connection coming into the school. I doubt the internet connection or the server is being stressed here; I haven't seen any slowness when it connects.

I would imagine from a report of replacing a drive and no information lost is that they don't have hot swap drives and enough RAID interactive backup hard drives. I have my preferences in hardware usage, but I think keeping all the information would take 5 terabytes, which is about the size of your brain or a little larger (4 terabytes, I think). Then toss the useless stuff at the finish of the project.

I may have some of my facts mixed cause of all my reading but I hope this is close.

my 22 cents
mutley/Bruce

bwkaz

05-03-2002, 07:59 AM

I'm fairly sure they don't only care about the smallest RMSD structures. If they're trying to correlate a (collection of) scoring function(s) with RMS deviation, then they want to get as many (score, RMSD) ordered pairs as possible to make the correlation worthwhile.

It's a statistics problem now. They plot all the (score, RMSD) pairs, and fit a line (or parabola, or whatever other function seems to work; I'll assume it's linear for the moment) to the points. If the points are scattered all over the place, then whichever scoring function they're using to calculate the "score" part of the ordered pairs is a bad one because it doesn't "predict" the RMSD of a given structure very well. If there isn't much scattering (say, if all the points are really close to the line y = x/2), then they'll be able to estimate the RMSD of any given structure without knowing the true structure (and this is important, not just for CASP 5, but for future research on unknown proteins). In this case, they just find the score of the candidate as given by that scoring function, divide it by two, and that number will be the expected RMSD.
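To make the idea concrete, here is a small sketch of fitting a line to (score, RMSD) pairs and checking how tightly the points hug it. The data are made-up numbers chosen to match the y = x/2 example above, not real project data, and `fit_line` is just an ordinary least-squares fit written out by hand.

```python
import math
from typing import List, Tuple


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Least-squares fit of rmsd = slope * score + intercept.

    Returns (slope, intercept, r), where r is the Pearson correlation:
    near +1 or -1 means little scatter, near 0 means a poor predictor.
    Needs at least two points with differing scores and differing RMSDs.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Made-up (score, RMSD) pairs lying exactly on y = x/2: a "good" scoring function.
good = [(8.0, 4.0), (12.0, 6.0), (20.0, 10.0), (30.0, 15.0)]
# Made-up pairs scattered all over the place: a "bad" scoring function.
bad = [(8.0, 12.0), (12.0, 5.0), (20.0, 14.0), (30.0, 6.0)]

slope, intercept, r_good = fit_line(good)
_, _, r_bad = fit_line(bad)

# Estimated RMSD for an unseen candidate scoring 16 under the good function.
estimated_rmsd = slope * 16.0 + intercept
```

For the first data set the fit recovers the divide-by-two rule, so a score of 16 maps to an expected RMSD of about 8. For the second, the correlation is weak and the estimate would be close to useless. Four points is far too few for real statistics, which is the point of the next paragraph.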

However, you can't really do statistics like this on only 20 data points, or even 200. You need as many as possible, all the way along the line, from 4 angstrom RMSD structures all the way up to 15 angstrom RMSD ones.

Brian the Fist

05-03-2002, 10:55 AM

As bwkaz has summed up, in the initial phase, we definitely want to record energies and other statistics on all billion structures we generate. Later on, we will be doing novel proteins where the structure is not known, so then we cannot even compute the RMSD. We do, however, intend to collect fewer structures and less data for CASP predictions (only the best or most likely ones, basically). This will reduce network traffic somewhat, although this has not really been a problem with the current number of users.

As for the idea of reporting how many units they've completed when contacting the server, without uploading evidence of the work done, this is just asking for trouble. This would open a huge door for potential cheaters, and duplication of data (of which we already receive a fair bit). We wish to maintain the integrity of our data at all costs. Thanks for the ideas though, and keep 'em coming; I can assure you they will all be read and considered, at the least.

pointwood

05-06-2002, 05:49 AM

Maybe it's not a problem at your end, but for people on dial-up connections, it is ;)

And it would make it easier to scale your setup also.