suggestions for beta client



__k
03-15-2003, 07:53 AM
Some of the generated structures seem
very long, and they get stuck too.
I think even the longest protein known to man, times five, wouldn't be as long as the currently stuck proteins.
Can't you set a size limit on them?
(not by time or by clock calculation)

Also, what do you say about splitting the client into two clients:
one for generating the sample structures at the beginning,
so users with slower machines can do those and see that they contribute.
Maybe even add a completion percentage,
so they can see how close all the structures are to 100% of the goal (say 10 million, or 10 billion as it is now).
Then we take the best structures,
let's say the best 100 structures,
and start the generation-client work:
generation, minimization, and more.
This way we can advance to other proteins before reaching a native-RMSD structure.

What do you think?

bwkaz
03-15-2003, 08:40 AM
On your first question: all the proteins that the client generates are the same size, currently 96 amino acids with the beta client. The reason some take longer to generate than others is that the algorithm repeatedly tries to place an amino acid somewhere it doesn't fit (like in the middle of another AA) and has to back up and restart. That's what is meant by getting "stuck". But when they finish, they're all 96 AAs long.
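If it helps to picture it, here's a toy sketch of that kind of grow-and-check loop. It is only an illustration (a self-avoiding walk on a grid that simply restarts whenever it gets stuck, whereas the real algorithm backs up more selectively), not the actual client code, but it shows why two structures of identical length can take wildly different amounts of time to build:

[code]
import random

CHAIN_LENGTH = 96   # every structure the beta client makes is this long
MOVES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def grow_chain():
    """Grow a self-avoiding chain one 'residue' at a time.

    When the next residue cannot be placed anywhere without overlapping an
    existing one, this toy version abandons the attempt and restarts; that is
    the "stuck" case that makes some structures take much longer than others.
    """
    while True:
        chain = [(0, 0, 0)]
        occupied = {chain[0]}
        while len(chain) < CHAIN_LENGTH:
            x, y, z = chain[-1]
            free = [(x + dx, y + dy, z + dz) for dx, dy, dz in MOVES
                    if (x + dx, y + dy, z + dz) not in occupied]
            if not free:
                break                  # stuck: nowhere to put the next residue
            nxt = random.choice(free)  # random choice among placements that fit
            chain.append(nxt)
            occupied.add(nxt)
        if len(chain) == CHAIN_LENGTH:
            return chain               # exactly 96 positions on every success

print(len(grow_chain()))   # 96, however long the growth took
[/code]

Every successful run returns exactly 96 positions; the time difference comes entirely from how often the growth gets stuck and has to start over.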

On the second question, that would be something for Howard (the project leader guy) to decide, not me. At first glance it doesn't sound like a bad idea, though.

__k
03-15-2003, 09:31 AM
Originally posted by bwkaz
The reason some take longer to generate than others is that the algorithm repeatedly tries to place an amino acid somewhere it doesn't fit (like in the middle of another AA) and has to back up and restart. That's what is meant by getting "stuck". But when they finish, they're all 96 AAs long.

Thanks for clarifying.
Still, since the proteins are currently sampled somewhat randomly in this phase, that would mean there isn't much difference between a protein that takes a long time to build and one that takes a fraction of a second, right?
In that case, why not drop everything that takes more than 3 seconds? (or 1 second, for that matter)
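Something like this toy sketch is what I have in mind. The dummy generator and all the numbers are made up; the point is only that attempts running past the limit get thrown away and replaced by fresh ones:

[code]
import random
import time

TIME_LIMIT = 3.0   # seconds; the cutoff suggested above, purely illustrative

def slow_random_structure(deadline):
    """Stand-in for structure generation: random work that sometimes runs long."""
    structure = []
    for _ in range(96):
        time.sleep(random.uniform(0.0, 0.06))   # pretend placement work
        if time.monotonic() > deadline:
            return None                         # over the limit: give up
        structure.append(random.random())
    return structure

def generate_with_cutoff(generate, time_limit=TIME_LIMIT):
    """Retry generation until an attempt finishes inside the time limit."""
    attempts = 0
    while True:
        attempts += 1
        structure = generate(time.monotonic() + time_limit)
        if structure is not None:
            return structure, attempts          # (attempts - 1) tries were dropped

structure, attempts = generate_with_cutoff(slow_random_structure)
print(f"kept a structure after {attempts} attempt(s)")
[/code]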

As for point 2, I think that when people see they are contributing to the whole,
they would be more willing to spend time on it.
For example, the client could show a percentage (the number of sample proteins generated so far out of the goal),
updated say on every upload. (There's also the added benefit of not running the client on a protein that has already reached the 100% goal.)
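Concretely, the percentage display could be as simple as this sketch. The stats URL and the plain-text response format are invented purely for illustration; the real project would have its own:

[code]
import urllib.request

GOAL = 10_000_000_000   # e.g. 10 billion sample structures for the current protein
# Hypothetical stats endpoint that returns the current global count as plain text.
STATS_URL = "http://example.org/distribfold/gen0_count.txt"

def show_progress():
    """Fetch the global count and print how close the sampling goal is."""
    with urllib.request.urlopen(STATS_URL) as response:
        done = int(response.read().strip())
    pct = 100.0 * done / GOAL
    print(f"sampling progress: {done:,} / {GOAL:,} structures ({pct:.2f}%)")
    return pct >= 100.0   # could be used to skip proteins that already hit the goal

if __name__ == "__main__":
    show_progress()
[/code]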

m0ti
03-15-2003, 10:22 AM
Could be a good idea.

As some people have pointed out in the beta, the time required to produce folds beyond generation 0 can be very high, and producing a full 50 (100, 150, 200, 250, 300, or whatever) generations may be impractical for users on slower machines.

I guess he's talking about breaking it up into a two-step process:

1) Generation of gen 0 proteins for protein 1.

2) Generation of higher gens based on the proteins generated in gen 0 for protein 1.

This would involve some sort of split in the project. Slower machines would very likely move on to stage 1 for protein 2 while stage 2 for protein 1 was completing.

bwkaz
03-15-2003, 10:55 AM
Of course, you'd have to make sure people didn't just move on to gen 1 of the next protein on all their machines. Otherwise we wouldn't actually accomplish anything. ;)

Although maybe the fact that gen 1 proteins don't get you any points might help? Because they didn't, last I remember.


Originally posted by __k
Still, since the proteins are currently sampled somewhat randomly in this phase, that would mean there isn't much difference between a protein that takes a long time to build and one that takes a fraction of a second, right? In that case, why not drop everything that takes more than 3 seconds? (or 1 second, for that matter)

Yes, we are sampling them randomly. We have to, because once we start using the algorithm on proteins that no one knows the structure of, we won't have anything to compare to.

There is a difference between a protein that takes a long time and one that takes a short time, though. The one that takes a short time is generally further away from the real structure. Its energy is higher, and its RMSD (the RMS deviation from the known structure, which can only be calculated if the real structure is known) is also usually higher. The structures that take the longest to complete are usually (though not always) the lowest-RMSD ones.

I think. You'd have to wait for Howard to know for sure, but I'm pretty sure I remember people saying that before.

Oh, and low RMSD is good, because the lower it is, the closer we are to the real structure, and (once this starts getting used to make drugs that attack, for example, certain viruses) the better the drugs that get made will be.
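In case the term itself is unclear, here is a minimal sketch of what an RMSD number is, assuming the two structures are already superimposed on each other (a real calculation would first find the optimal superposition, e.g. with the Kabsch algorithm). The coordinates are made up:

[code]
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z).

    Assumes the two structures are already superimposed; a real calculation
    would first align them (e.g. the Kabsch algorithm).
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of residues")
    total = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
                for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

# made-up 3-residue example: the lower the RMSD, the closer to the known structure
known     = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
predicted = [(0.1, 0.2, 0.0), (1.4, 0.1, 0.1), (3.2, 0.3, 0.0)]
print(f"RMSD = {rmsd(known, predicted):.3f}")
[/code]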

Brian the Roman
03-16-2003, 12:36 AM
Gen 0 structures are purely random. All generations thereafter are 'near neighbours' of the best of the previous gen. The structures in these generations take longer because they start out using very tight limits on the placement of the residues. This often yields better results than the random approach, but it takes longer. Therefore, Howard et al. are trying to find the best compromise between the purely random approach, which is fast but usually yields low quality, and the targeted approach, which is slower but gives better results.
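Schematically, the generation scheme works something like this toy sketch. The scoring function, the pool sizes, and the size of the 'tight limits' are placeholders, not the project's real numbers or energy function:

[code]
import random

POOL_SIZE = 1000   # structures made per generation (made-up number)
KEEP_BEST = 100    # parents carried into the next generation (made-up number)
JITTER    = 0.05   # the "very tight limits" on how far a residue may move

def energy(structure):
    """Placeholder score, lower is better; the real client uses a physical
    energy function that is not reproduced here."""
    return sum(x * x for x in structure)

def random_structure(n=96):
    """Gen 0: purely random."""
    return [random.uniform(-1.0, 1.0) for _ in range(n)]

def near_neighbour(parent):
    """Later gens: perturb each residue only slightly around a good parent."""
    return [x + random.uniform(-JITTER, JITTER) for x in parent]

population = [random_structure() for _ in range(POOL_SIZE)]   # generation 0

for gen in range(1, 5):   # a few generations, just to show the loop
    best = sorted(population, key=energy)[:KEEP_BEST]         # best of previous gen
    population = [near_neighbour(random.choice(best)) for _ in range(POOL_SIZE)]
    print(f"gen {gen}: best energy = {min(map(energy, population)):.3f}")
[/code]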

As for the suggestion that slower machines only do gen 0, the best approach (I believe) is to have each machine do gen 0 and report its best 100 or so results to the server, which then assigns the best of the entire pool to clients to work on. I've already suggested this to Howard.
Having the slow clients only do gen 0 seems counter-productive to me. We aren't trying to make lots of bad structures, but good ones. I'm pretty sure Howard et al. would be far happier with 1000 good structures than a million bad ones. By having slow clients only do gen 0 work, you'd be virtually guaranteeing that slow machines would never be in the top 10 quality ranking (energy or RMSD), which is actually much more important than the number of structures made.
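To make the pool idea concrete, here's a toy sketch of what the server side might do. The class, the pool size, and the made-up energies are all just illustration; this is not the project's actual server code:

[code]
import heapq
import random

POOL_LIMIT = 100   # the "best 100 or so" pool size mentioned above

class Gen0Pool:
    """Toy server-side pool of the best gen 0 results. Lower energy is better."""

    def __init__(self, limit=POOL_LIMIT):
        self.limit = limit
        self._heap = []   # min-heap on negated energy: the root is the worst kept

    def submit(self, client_id, energy, structure_id):
        """A client reports one of its best gen 0 results."""
        entry = (-energy, structure_id, client_id)
        if len(self._heap) < self.limit:
            heapq.heappush(self._heap, entry)
        elif -energy > self._heap[0][0]:   # lower energy than the worst kept one
            heapq.heapreplace(self._heap, entry)

    def assignments(self):
        """Best-first list of (structure_id, energy) to hand out for later gens."""
        return [(sid, -neg) for neg, sid, _ in sorted(self._heap, reverse=True)]

# quick demo with made-up submissions from four pretend clients
pool = Gen0Pool(limit=5)
for i in range(50):
    pool.submit(client_id=f"client-{i % 4}",
                energy=random.uniform(-300.0, 0.0),
                structure_id=i)
print(pool.assignments())   # the five lowest-energy structures, best first
[/code]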

ms

__k
03-16-2003, 08:17 AM
Originally posted by Brian the Roman
Gen 0 structures are purely random. All generations thereafter are 'near neighbours' of the best of the previous gen. The structures in these generations take longer because they start out using very tight limits on the placement of the residues. This often yields better results than the random approach, but it takes longer. Therefore, Howard et al. are trying to find the best compromise between the purely random approach, which is fast but usually yields low quality, and the targeted approach, which is slower but gives better results.

Excellent.
Can you explain why this AMD-is-logical
dude got such great results
with the workaround he programmed to make the timeout much smaller?
It seems to me he made the first gen 0 structures as fast as he could,
in order to spend more time on the later generations.
Also, forcefully timing out complex structures
didn't seem to hurt his results (it may even have improved them).

Did I understand it wrong?
I would appreciate a clarification in that case :))
:crazy:


Originally posted by Brian the Roman
Having the slow clients only do gen 0 seems counter-productive to me. We aren't trying to make lots of bad structures, but good ones.

The point is that since we're doing those anyway, we can have other people do them,
then take the best and use those as the "sample pool", as you called it.
That way we minimize work and save CPU cycles for the better, targeted work.
At least, logically it sounds that way to me;
do correct me if I'm wrong.

Originally posted by Brian the Roman
By having slow clients only do gen 0 work, you'd be virtually guaranteeing that slow machines would never be in the top 10 quality ranking (energy or RMSD), which is actually much more important than the number of structures made.

I personally don't care which machine produced the highest-quality structure, be it the fastest CPU farm out there or a P133.
I think people do these things for science, out of interest and support, out of boredom, or out of the wish for superhumans.
In any case, whatever people are doing this for, we are actually the second logical step after the genome mapping was done,
which is a pretty noble idea IMHO.
This is a really interesting project and I'd love to see it go further, with higher goals, say, computing all known proteins.
Or is that too wishful thinking?

Brian the Roman
03-17-2003, 02:10 PM
Originally posted by __k
Excellent.
Can you explain why this AMD-is-logical
dude got such great results
with the workaround he programmed to make the timeout much smaller?
It seems to me he made the first gen 0 structures as fast as he could,
in order to spend more time on the later generations.
Also, forcefully timing out complex structures
didn't seem to hurt his results (it may even have improved them).

Did I understand it wrong?
I would appreciate a clarification in that case :))
:crazy:


He got better results faster. We still have to see if he got better results overall. He's done 250 gens because he's going 60 times faster than the rest of us. When the rest of us have done 250 gens, we may show his approach to be flawed; it remains to be seen.

His change would have virtually no impact on the gen 0 structures, since those almost never take long anyway. So he reaps all of his speed advantage from gen 1 on.


Originally posted by __k
That way we minimize work and save CPU cycles for the better, targeted work.
At least, logically it sounds that way to me;
do correct me if I'm wrong.


We wouldn't minimize work at all. All that would happen is that the slower PCs would do the quick, low-quality structures (gen 0) and the faster ones would do the harder ones. No more work ends up being done; it's only been shuffled from one place to another.


Originally posted by __k
This is a really interesting project and I'd love to see it go further, with higher goals, say, computing all known proteins.
Or is that too wishful thinking?

All in due time. First we need to devise good algorithms for measuring the results and optimize the process for the sampling problem. Since the sampling problem is so big (over 3^n structures for one protein with n amino acids, I understand), we need to find a process that makes it manageable for a single protein. As an example, when we sampled 10 billion structures we covered less than one billionth of a billionth of a billionth of the possible structures for a single protein.
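For a rough sense of that scale, take the 3^n figure at face value with n = 96 (the length of the current beta protein). This little check is just arithmetic, not project code:

[code]
# Back-of-the-envelope check, taking the 3^n figure above at face value
# and n = 96 (the length of the current beta protein).
n = 96
conformations = 3 ** n     # about 6.4e45 possible structures for one protein
sampled = 10_000_000_000   # the 10 billion structures mentioned above

print(f"3^{n} is about {conformations:.2e} structures")
print(f"10 billion samples cover about {sampled / conformations:.1e} of that space")
# prints roughly 1.6e-36, well under one billionth of a billionth of a
# billionth (1e-27) of the possible structures for a single protein
[/code]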