This is what Howard was referring to when he mentioned how the structures can get "stuck" with the beta. This is supposed to happen and is not a bug.
Originally posted by Guff®
Running in -qf mode, I've noticed that the residue calculation number will "hang" or even fall back in sequence.
i.e., "calculating residue 6x of 96" may fall back to "5x of 96", go back up to "7x of 96", and fall back to "6x of 96". It's currently on generation #7, but #6 was similar. Is this normal?
It's like it gets bored and just doodles all over itself.
WinXP w/AMD 1.4 T-Bird
-df -qf -it -rt switches
A member of TSF http://teamstirfry.net/
Thanks for the clarification. It seems to get stuck on every generation. Back to -qt mode for me!
Last edited by Guff®; 02-23-2003 at 12:51 AM.
Distributed Computing Mercenary
BBR Team Endeavor
I'm currently on set 50, and I've noticed an issue with a couple of the protein structures. What happened is that the client got stuck multiple times at different points while processing the same structure. It ended up taking over 30 minutes to process a single structure on both occasions. I strongly suspect that taking this amount of time is counterproductive, since it could be better spent crunching additional structures. (I'm crunching on an Athlon 2000+ that wasn't doing anything else CPU intensive, so it's not that we're talking about a slow computer here.)
Some sort of routine should be added to the program so that if it takes over a certain amount of time to crunch a single structure, it gives up on that unit and goes to the next one.
A member of TSF http://teamstirfry.net/
I ran the client a number of times to test out the problems I've had with systems being shut down improperly and then having to edit filelist to get them to upload. I started the client after disconnecting the ethernet cable from the machine. (since I've had repeated problems with it losing internet connections and being Improperly Terminated and not uploading until filelist was hand cleaned).
I let it get to group 2 three different times, and stopped the program improperly in three different ways. On restart (connected to the network), it uploaded the packets for group 0 and group 1 with no complaints.
That's a nice improvement.. Thanks.
As we move on to generation 50, does it get progressively longer per generation? (such that after a certain generation number, keeping waypoints in the current generation starts making sense?)
From my observation, it does get slower over time, probably because the structures have a lower A and the probability that generated structures have overlapping atoms (or whatever makes a structure invalid and get 'stuck') is higher? Just a guess. As for waypoints: during the tests I did, exiting the client with Q, it started over after the last structure in a generation, not at the beginning of a generation (if you exited after some structures were done).
Do you mean something different by 'waypoints'?
Jérôme
Actually, it already does exactly what you say. The timeout is about 3 minutes. Guess you haven't been watching THAT closely.
Originally posted by Aegion
Some sort of routine should be added to the program so that if it takes over a certain amount of time to crunch a single structure, it gives up on that unit and goes to the next one.
I DID already explain this, but if it keeps getting stuck it will keep relaxing the constraints more and more until it eventually gets unstuck.
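The retry-with-relaxation behaviour described here can be sketched roughly as follows. This is only an illustration, not the actual client code: `attempt_fold` and `relax` are hypothetical stand-ins for the real folding routines, and the 180-second window matches the "about 3 minutes" timeout mentioned above.

```python
import time

def build_structure(attempt_fold, relax, tolerance=1.0,
                    timeout=180.0, clock=time.time):
    """Retry a stuck structure, loosening constraints on each timeout.

    attempt_fold(tolerance) returns a structure, or None if stuck;
    relax(tolerance) returns a looser tolerance. Both are hypothetical
    stand-ins for the real folding routines.
    """
    while True:
        deadline = clock() + timeout
        while clock() < deadline:
            structure = attempt_fold(tolerance)
            if structure is not None:
                return structure, tolerance
        # Timed out: relax the constraints and retry the same structure
        tolerance = relax(tolerance)
```

Passing `clock` in makes the loop testable without waiting out real timeouts; the real client presumably just reads the wall clock (which is exactly the complaint raised later in this thread).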
Last edited by Brian the Fist; 02-23-2003 at 10:50 AM.
Howard Feldman
But what if it keeps getting stuck on the other retries as well?
Originally posted by Brian the Fist
Actually, it already does exactly what you say. The timeout is about 3 minutes. Guess you haven't been watching THAT closely
Ok, thanks for the clarification. The main fallacy here is this, though. Remember, when we are predicting novel, unknown folds, we do not know which are 'good' or 'bad' samples. All we have is a pseudo-energy score which in some cases tells us which samples may possibly be somewhat decent. Thus we do not want to place too much reliance on this energy value (as we learned from CASP, it's just not good enough yet to pick out the best structures). Thus even if a CPU generates 3 or 4 excellent structures in terms of RMSD, when we choose the top 5 energies there is no guarantee they will be in there. Anyhow, when we switch to trying a true genetic algorithm, the server will indeed keep pieces of the good-scoring samples and redistribute them to clients so that they will get used; that can be thought of as 'phase 3'...
Originally posted by Brian the Roman
To be sure I've been clear, I'll do this as pseudocode. (Sorry if you already understand.)
1) client A crunches 5000 and reports the best 100 structures to the server.
2) the server assigns the best structure of the 100 to client A to drill down on and marks the structure as 'taken'.
3) client B crunches 5000 and reports the best 100 to the server.
4) the server assigns the best structure out of the 200, excluding those already taken, to client B to drill down on and marks that structure as taken.
5) and so on for the rest of the clients
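The scheme above can be sketched in a few lines. This is just an illustration of the pseudocode (assuming lower pseudo-energy scores are better; all names here are made up):

```python
def assign_structures(reports):
    """Each client reports (score, structure_id) pairs for its best
    structures; the server hands each client, in turn, the best-scoring
    structure not yet taken. Lower score = better (an assumption)."""
    pool = []          # global pool of (score, structure_id) from all clients
    taken = set()
    assignments = {}
    for client, best in reports:   # clients processed in reporting order
        pool.extend(best)
        for score, sid in sorted(pool):
            if sid not in taken:   # skip structures already being drilled down on
                taken.add(sid)
                assignments[client] = sid
                break
    return assignments
```

For example, if client A reports scores {1.0, 2.0} and client B later reports {0.5, 3.0}, A is assigned its own best structure (the only ones in the pool at that point) and B is assigned its 0.5-scoring one.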
ms
Howard Feldman
It seems to me that it tries for 3 minutes and then starts over on the same one instead of going to the next one.
Originally posted by Brian the Fist
Actually, it already does exactly what you say. The timeout is about 3 minutes. Guess you haven't been watching THAT closely
How does the scoring actually work? I think now that if you just kept running the first generation over and over you would generate way better stats.
Yup, I think I since changed it so it will continue at the start of the current generation if killed improperly. Is this sufficient? Should it be checkpointed, say, every 10 structures instead? It's not a big deal; checkpointing just involves updating filelist.txt correctly on disk (but does require disk activity, for those who are paranoid). Perhaps I could sync it with the progress.txt update interval, would that be a good idea?? Sounds like a good one to me, actually. Actually, that might not be true, it might have to start at the beginning of the generation, because if you kill it, stuff won't get written out properly, but I will double-check on that.
Originally posted by jlandgr
From my observation, it does get slower over time, probably because the structures have a lower A and the probability of generated structures having overlapping atoms/whatever makes a structure invalid and get 'stuck', is higher? Just a guess. As for waypoints: During the tests I did, exiting the client with Q, it started over after the last structure in a generation, not at the beginning of a generation (if you exited after some structures were done).
Do you mean something different by 'waypoints'?
Jérôme
So anyone have a handle on how long 50 generations is taking for them on 1 CPU? This would be useful info for us (we will have to multiply by 10 of course for the real plan) Thanks.
Howard Feldman
I have it running on two machines, a 2000MP running Linux and a ~1400 MHz P4 running XP. After about 28 hours the P4 was on generation 50 and the AMD was on about 40. The AMD is normally faster, and after 8 hours it had a pretty good lead, but I guess it ran into some gnarly structures.
Originally posted by Brian the Fist
...
So anyone have a handle on how long 50 generations is taking for them on 1 CPU? This would be useful info for us (we will have to multiply by 10 of course for the real plan) Thanks.
'Gnarly structures'... got to remember that one.
Originally posted by Welnic
gnarly structures.
You are not getting what I'm saying. It gets unstuck, but then it gets stuck again later on with the same structure at a later point. It also looks like it backs up from time to time, often getting stuck again at the same point it was at earlier. I was watching a clock as I timed the structure; it took over 30 minutes to process a single structure in both instances.
Originally posted by Brian the Fist
Actually, it already does exactly what you say. The timeout is about 3 minutes. Guess you haven't been watching THAT closely
I DID already explain this, but if it keeps getting stuck it will keep relaxing the constraints more and more until it eventually gets unstuck.
A member of TSF http://teamstirfry.net/
I posted about this problem and you said it's not a bug. So what turned it into one?
Originally posted by Aegion
You are not getting what I'm saying. It gets unstuck, but then it gets stuck again later on with the same structure at a later point. It also looks like it backs up from time to time, often getting stuck again at the same point it was at earlier. I was watching a clock as I timed the structure, it took over 30 minutes to process a single structure in both instances.
Distributed Computing Mercenary
BBR Team Endeavor
Once the number is increased to 200 structures, there should definitely be checkpointing. Power failures happen, and many people play unstable games on their systems. If a 200 structure generation takes many hours to produce, that is way too much work to lose.
Originally posted by Brian the Fist
Yup, I think I since changed it so it will continue at the start of the current generation if killed improperly. Is this sufficient? Should it be checkpointed say every 10 structures instead?
Just in case you're referring to my complaint about disk activity in a previous post, let me clarify. I am getting about 1000 packets per second each way between the node running the beta and the server. Compared to that insane amount of traffic, the amount required for a checkpoint would be insignificant.
It's not a big deal, checkpointing just involves updating filelist.txt correctly on disk (but does require disk activity for those who are paranoid).
BTW, I noticed that someone was complaining about the beta DF client seriously hurting their performance when loading programs and such under Win9x. I don't have Win9x so I'm just speculating, but I can't help but wonder if the huge number of disk requests from the client is acting as a sort of denial-of-service attack on the Win9x disk subsystem.
Sounds good to me. That way, we have a switch that will let us select the frequency of checkpointing.
Perhaps I could sync it with the progress.txt update interval, would that be a good idea?? Sounds like a good one to me actually..
It depends, and it can vary. On one node running a single copy of the client I get roughly 24 hours or so.
So anyone have a handle on how long 50 generations is taking for them on 1 CPU? This would be useful info for us (we will have to multiply by 10 of course for the real plan) Thanks.
On another computer I'm running four copies of the client at once, as well as other stuff. Each client is getting about 1/8 of the CPU. Here it takes only about 6 hours of CPU for a client to do 50 generations.
And this computer has a slower CPU than the node.
So if you put a gigabyte of memory on your computer and run 8 copies of the client at once, you will have about 4 times the production compared to what your computer would have with only one copy.
This seems to be due to the 3 minute (real time) timer. No matter how fast or slow your CPU, it will sit there until enough 3 minute timeouts have occurred to loosen the constraints enough for that CPU to do a structure in under 3 minutes; then it will generate structures at the same rate no matter what the CPU speed was.
There are several problems with this. First, it is blatantly unfair to people with fast machines, and that kind of unfairness can turn people away from a project. Second, it invites people to do weird things like run many copies of the client at once, or to rig their real-time clock to run 16x normal speed. Third, it can't possibly be an efficient way for the client to use CPU cycles.
I can often see the client getting stuck and repeatedly backing off about 5 units and running forward again. It's just sitting there wasting CPU cycles and not getting anywhere. I think the client should be much more aggressive about recognizing this (based on number of tries, not real time) and loosening the constraints. It can tighten the constraints back up once it's past the sticking point.
So here is a summary of my wish list:
1) The rate the client runs should be based on CPU cycles, not on real-time.
2) The huge amount of disk requests should be scaled way back. If this activity is due to checking the foldtrajlite.lock file, then perhaps there can be a switch to control the rate.
3) Add checkpointing.
4) Make the random number seeding cluster-friendly (if you haven't already). If you haven't found a good way to use the MAC, perhaps you could add that switch we talked about, so that an integer could be given to the client for combining with the time and pid to make the seed.
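Item 4 could look something like this. The `node_id` integer is the hypothetical per-node switch value suggested above, and the bit-mixing scheme is just one illustration, not the client's actual seeding code:

```python
import os
import time

def make_seed(node_id=0, now=None, pid=None):
    """Combine time, pid, and a per-node integer into one 31-bit seed,
    so cluster nodes started in the same second don't collide.

    now/pid default to the real clock and process id; they are
    parameters only so the mixing can be tested deterministically."""
    now = int(time.time()) if now is None else now
    pid = os.getpid() if pid is None else pid
    # Shift each input into a different bit range so each one perturbs the seed
    return (now ^ (pid << 12) ^ (node_id << 24)) % (2**31 - 1)
```

Identical cluster nodes booted from the same image often launch the client at the same second with the same pid, which is why time + pid alone isn't cluster-friendly; the extra integer breaks the tie.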
I agree... why in the world would real-time be used when there is such a large difference in what different PCs can do in that time?
1) The rate the client runs should be based on CPU cycles, not on real-time.
(I'm wasting millions of cycles more than a putz machine)
~~~~ Just Passin' Through ~~~~
Please ignore! Sorry!
Last edited by m0ti; 02-23-2003 at 03:05 PM.
Team Anandtech DF!
I think this has been mentioned before, but after exiting the client while it is minimizing energy, it loses all progress made in minimizing energy. I realize it doesn't take so long to perform, but on my machine (AXP 1600+@1900+) it takes around 1 minute or so, and it seems a shame to lose the work.
Same goes for Trajectory Distribution.
Team Anandtech DF!
Getting stuck is not a bug. However, I'm noticing a protein structure get stuck multiple times during the same sequence, taking over 30 minutes to complete a single protein structure. I'm questioning whether the benefits of completing that structure outweigh the lost time which could be spent completing multiple other protein structures. While the software may be functioning as intended, I've noticed a flaw in the manner it is currently functioning in.
Originally posted by Guff®
I posted about this problem and you said it's not a bug. So what turned it into one?
A member of TSF http://teamstirfry.net/
Thanks for verifying/explaining what I was seeing. I didn't think that should be normal, hence the original question.
Originally posted by Aegion
Getting stuck is not a bug. However, I'm noticing a protein structure get stuck multiple times during the same sequence, taking over 30 minutes to complete a single protein structure. I'm questioning whether the benefits of completing that structure outweigh the lost time which could be spent completing multiple other protein structures. While the software may be functioning as intended, I've noticed a flaw in the manner it is currently functioning in.
Distributed Computing Mercenary
BBR Team Endeavor
This is just a guess, but try using the -g option to reduce the frequency it updates the progress.txt file. I believe if you don't specify it, the client updates the file for every structure. So try something like -g 10 so it updates the file every 10 structures.
Originally posted by AMD_is_logical
I'm trying the beta on one of my cluster nodes. These are diskless nodes that use an NFS server. The network activity on that node is *far* higher than with the old client. I estimate it's about 30 times higher (in terms of bytes transferred) than the old client.
This is also the case with me - running the Windows version under w2k.
Originally posted by Brian the Fist
I assume you mean during energy minimization, the progress bar?? This should NOT be the case. I'll fix that if it's true. It is important that quiet mode has NO output (it can cause problems depending on how people use it).
It writes a series of []============== for every time the "minimizing energy" or "calculating gen. x trajectory" messages would have appeared.
To my way of thinking, the low quality algorithm suggests that you should be assigning the structures from the server instead of the clients. That way you can modify the algorithm used to select the 'best' structure dynamically on the server side. You could try using two or more different algorithms simultaneously until you determine which is the better one without any impact to the clients.
Originally posted by Brian the Fist
Ok, thanks for the clarification. The main fallacy here is this, though. Remember, when we are predicting novel, unknown folds, we do not know which are 'good' or 'bad' samples. All we have is a pseudo-energy score which in some cases tells us which samples may possibly be somewhat decent. Thus we do not want to place too much reliance on this energy value (as we learned from CASP, it's just not good enough yet to pick out the best structures). Thus even if a CPU generates 3 or 4 excellent structures in terms of RMSD, when we choose the top 5 energies there is no guarantee they will be in there. Anyhow, when we switch to trying a true genetic algorithm, the server will indeed keep pieces of the good-scoring samples and redistribute them to clients so that they will get used; that can be thought of as 'phase 3'...
But if phase 3 will handle this then the point is probably moot. When do you anticipate phase 3 will begin?
ms
That is in fact a very valid point, even if it is based on CPU cycles.
Originally posted by AMD_is_logical
Once the number is increased to 200 structures, there should definitely be checkpointing. Power failures happen, and many people play unstable games on their systems. If a 200 structure generation takes many hours to produce, that is way too much work to lose. Just in case you're referring to my complaint about disk activity in a previous post, let me clarify. I am getting about 1000 packets per second each way between the node running the beta and the server. Compared to that insane amount of traffic, the amount required for a checkpoint would be insignificant.
BTW, I noticed that someone was complaining about the beta DF client seriously hurting their performance when loading programs and such under Win9x. I don't have Win9x so I'm just speculating, but I can't help but wonder if the huge number of disk requests from the client is acting as a sort of denial-of-service attack on the Win9x disk subsystem. Sounds good to me. That way, we have a switch that will let us select the frequency of checkpointing. It depends, and it can vary. On one node running a single copy of the client I get roughly 24 hours or so.
On another computer I'm running four copies of the client at once, as well as other stuff. Each client is getting about 1/8 of the CPU. Here it takes only about 6 hours of CPU for a client to do 50 generations.
And this computer has a slower CPU than the node.
So if you put a gigabyte of memory on your computer and run 8 copies of the client at once, you will have about 4 times the production compared to what your computer would have with only one copy.
This seems to be due to the 3 minute (real time) timer. No matter how fast or slow your CPU, it will sit there until enough 3 minute timeouts have occurred to loosen the constraints enough for that CPU to do a structure in under 3 minutes; then it will generate structures at the same rate no matter what the CPU speed was.
There are several problems with this. First, it is blatantly unfair to people with fast machines, and that kind of unfairness can turn people away from a project. Second, it invites people to do weird things like run many copies of the client at once, or to rig their real-time clock to run 16x normal speed. Third, it can't possibly be an efficient way for the client to use CPU cycles.
I can often see the client getting stuck and repeatedly backing off about 5 units and running forward again. It's just sitting there wasting CPU cycles and not getting anywhere. I think the client should be much more aggressive about recognizing this (based on number of tries, not real time) and loosening the constraints. It can tighten the constraints back up once it's past the sticking point.
So here is a summary of my wish list:
1) The rate the client runs should be based on CPU cycles, not on real-time.
2) The huge amount of disk requests should be scaled way back. If this activity is due to checking the foldtrajlite.lock file, then perhaps there can be a switch to control the rate.
3) Add checkpointing.
4) Make the random number seeding cluster-friendly (if you haven't already). If you haven't found a good way to use the MAC, perhaps you could add that switch we talked about, so that an integer could be given to the client for combining with the time and pid to make the seed.
If a large amount of time is spent on trying to complete a fold, then, effectively, most of that time is wasted, and one will increase overall production by running an extra client. Of course, if they both get stuck on a particular fold, then it becomes worth it to have another client running, and so on and so on. The loss of productivity due to context switches will be more than made up for in fold production.
This leads to some difficulties: the current fold should be abandoned after some period of time (or the constraints relaxed enough). If this is done too soon, then a good fold may be lost. If this is done too late, then productivity drops.
Team Anandtech DF!
I must have been high. Checking again, I do not see any problem with it running as a service, just when it is in the dos box. And with some checking with a demo version of the program I did not see the delay. I only see the delay in a beta version that I have, which I would imagine has debugging turned on.
Originally posted by Welnic
So I am seeing a big delay doing certain things in Windows XP. I timed the main application that I run all the time for opening and closing times. I had the client running in a dos box with the -rt switch.
Open with regular client: 3 seconds
Open with beta client: 33 seconds
Close with regular client: 5 seconds
Close with beta client: 70 seconds
...
I normally run as a service and that is where I first noticed this. I was just running in the dos box because it was faster to set up and I wanted to make sure it was just doing the normal folding part instead of the energy minimization part.
I understand your requirement to have a quiet mode, but I'd like an additional switch that would produce minimal status messages from the text client instead of the current verbose output.
These would include:
- current level number started
- 25 percent completion increments with best numbers found.
- level completion with best numbers.
- connections to server documented
- all above timestamped
- one line per status, concise.
Consider it a kind of status at a glance!
Thanks for bandwidth... Ned
WinXP Pro
Running the client normally (no switches).
While Minimizing Energy or calculating the Trajectory Distribution the rest of the computer freezes up, completely. ALL system resources are given to DF even though it's running at low priority.
Team Anandtech DF!
On my Athlon XP 2100+ (running Windows 2000) it completed 50 generations in a little under 17 hours, with an average of about 20 minutes per generation.
Originally posted by Brian the Fist
So anyone have a handle on how long 50 generations is taking for them on 1 CPU? This would be useful info for us (we will have to multiply by 10 of course for the real plan) Thanks.
That means over 3 hours (on my machine) per generation when we make it out of beta.
I will try to make more measurements to see if it varies a lot.
WinXP Pro
CLI Client (default switches)
got the following error during Trajectory Distribution for Gen 44: FATAL ERROR: CoreLib [002.005] {ncbifile.c, line 697} File write error
When I restarted it started afresh from Gen 0.
Team Anandtech DF!
I am running 7 dual cpu systems:
model name : Intel(R) Pentium(R) III CPU family 1400MHz
cpu MHz : 1396.449
cache size : 512 KB
They take about 18 1/2 hours to complete 50 generations. I have not had any problems since I fired them up about 5 days ago. All running on SunLinux, which is basically Red Hat 7.2.
** Would be nice if there was a way for the client to use both CPUs in some clever way to process the data since speed really makes the difference in this version.
Has anyone else noticed problems with getting through a proxy with the beta client? I've recently tried running it on my work computer (which normally runs the 'old' client just fine) and cannot get through to the server... Thinking I had done something wrong, I double-checked the proxy.cfg file and also copied it from my original working client's directory, but still no dice...
It will stay on "checking for new versions" for a long time (probably some timeout) and then resume crunching, but never uploads any results.
After completing a generation there is also a long wait before anything happens (another timeout?), and I have lots of
ERROR: [000.000] {foldtrajlite.c, line 4721} Error during upload: NO RESPONSE FROM SERVER - WILL TRY AGAIN LATER
lines in my error.log file.
Is this supposed to work at all with the beta, or am I missing something?
Just got this same error again, but for Trajectory Distribution for Gen 33.
Originally posted by m0ti
WinXP Pro
CLI Client (default switches)
got the following error during Trajectory Distribution for Gen 44: FATAL ERROR: CoreLib [002.005] {ncbifile.c, line 697} File write error
When I restarted it started afresh from Gen 0.
Team Anandtech DF!
That's why higher generations will be worth significantly more 'points' - to encourage you to go all the way, and not do what you just described...
Originally posted by m0ti
That is in fact a very valid point, even if it is based on CPU cycles.
If a large amount of time is spent on trying to complete a fold, then, effectively, most of that time is wasted, and one will increase overall production by running an extra client. Of course, if they both get stuck on a particular fold, then it becomes worth it to have another client running, and so on and so on. The loss of productivity due to context switches will be more than made up for in fold production.
This leads to some difficulties: the current fold should be abandoned after some period of time (or the constraints relaxed enough). If this is done too soon, then a good fold may be lost. If this is done too late, then productivity drops.
Howard Feldman
Sounds like a full disk (or no permission to write). It cannot recover from that sort of error, for obvious reasons (probably your /tmp partition). This error is directly from a failed fwrite (i.e. not all elements were written) and so is pretty straightforward.
Originally posted by m0ti
WinXP Pro
CLI Client (default switches)
got the following error during Trajectory Distribution for Gen 44: FATAL ERROR: CoreLib [002.005] {ncbifile.c, line 697} File write error
When I restarted it started afresh from Gen 0.
Howard Feldman
So right now the later generations are not worth more points?
Originally posted by Brian the Fist
That's why higher generations will be worth significantly more 'points' - to encourage you to go all the way, and not do what you just described...
Also I have the beta running as a service on an XP box and the monitors never power off like they normally do.
I don't believe the points system is really implemented yet in the beta. The point of the beta testing is to participate and locate bugs, not to obtain top place in the stats rankings.
Originally posted by Welnic
So right now the later generations are not worth more points?
Also I have the beta running as a service on an XP box and the monitors never power off like they normally do.
A member of TSF http://teamstirfry.net/
I vote for pointless beta testing... I do that at work all the time! Are the structures still valid, though?
The point system SHOULD be in place right now. Please test this for us too. You should, I believe, get 5000 points for gen 0 (but this may be changed to 200), and for gen x you should get 200*sqrt(x) points (ok, whip out those calculators). If this is NOT the case, please let me know and I'll check it out.
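For the calculator-averse, the stated formula works out like this (assuming the gen 0 award stays at the flat 5000):

```python
import math

def generation_points(gen, gen0_points=5000):
    """Points per the formula above: a flat award for generation 0,
    then 200 * sqrt(gen) for generation gen."""
    return gen0_points if gen == 0 else 200 * math.sqrt(gen)

# e.g. gen 1 -> 200, gen 25 -> 1000, gen 49 -> 1400
```

So by generation 49 a structure set is worth seven times a gen 1 set, which is the intended incentive to go all the way rather than restarting at gen 0.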
Aside from this, I've gone through the whole thread and identified 7 bugs and 7 features (including stuff Chris and I have decided to add) which I will now fix and/or shove into the next beta, which I should hopefully have ready later this week. Any further betas after this will likely be to play with parameters like size of and number of generations to optimize those a bit more but I think you've all done a really great job at nailing all the bugs and even potential bugs. You found some things I really didn't expect with such a relatively small testing group (under 100 of you anyways).
Unless you find another new bug or have an important suggestion/feature to add which hasn't already been approximately mentioned in this thread, let's pause it here for now and I will get these changes done ASAP. With the next beta I may also release the screensaver, and hopefully a few of you will be willing to test that out as well, just to make sure there's nothing quirky specific to it (but remember, it's all really the same code, so most things should work the same in the screensaver as the text client in general).
Thanks All!
Howard Feldman
I would change the points for gen 0 to 200, or some people will just reset it after gen 0 and just do those. But if you don't reset the overall stats I would leave gen 0 at 5000 and also change the rest to 5000. Otherwise the scores already done will have too much weight.
Originally posted by Brian the Fist
The point system SHOULD be in place right now. Please test this for us too You should, I believe, get 5000 points for gen 0 (but this may be changed to 200), and for gen. x you should get 200*sqrt(x) points (ok, whip out those calculators). If this is NOT the case please let me know and I'll check it out.
...
If you don't make them equal I'll demo the gen 0 only advantage when I get back from vacation next week.
Alright I'm in as well.
XP1700+ @ 2400+ spec (10.5*192)
512MB RAM
Window XP Pro
Running DOS text client.
The client is going crazy generating a huge ASCII diagram. :shocked: