Originally posted by AMD_is_logical
Once the number is increased to 200 structures, there should definitely be checkpointing. Power failures happen, and many people play unstable games on their systems. If a 200-structure generation takes many hours to produce, that is far too much work to lose. Just in case you're referring to my complaint about disk activity in a previous post, let me clarify. I am getting about 1000 packets per second each way between the node running the beta and the server. Compared to that insane amount of traffic, the amount required for a checkpoint would be insignificant.
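For what it's worth, checkpointing doesn't have to be elaborate. Here is a minimal sketch in Python (the file name and state fields are my inventions; the real client would serialize its own data): write the state to a temporary file, flush it to disk, and atomically rename it over the old checkpoint, so that a power failure mid-write can never corrupt the saved state.

[code]
import json
import os

CHECKPOINT = "foldtraj_checkpoint.dat"   # hypothetical file name

def save_checkpoint(state: dict) -> None:
    """Atomically persist progress so a crash loses at most the
    work done since the last checkpoint."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())        # force the bytes to disk
    os.replace(tmp, CHECKPOINT)     # atomic rename over the old file

# e.g. once per completed structure:
# save_checkpoint({"generation": 3, "structure": 147, "rng_state": 12345})
[/code]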

BTW, I noticed that someone was complaining about the beta DF client seriously hurting their performance when loading programs and such under Win9x. I don't have Win9x so I'm just speculating, but I can't help but wonder if the huge number of disk requests from the client is acting as a sort of denial-of-service attack on the Win9x disk subsystem.

Sounds good to me. That way, we have a switch that will let us select the frequency of checkpointing.

It depends, and it can vary. On one node running a single copy of the client, 50 generations take roughly 24 hours or so.

On another computer I'm running four copies of the client at once, as well as other stuff. Each client is getting about 1/8 of the CPU. Here it takes only about 6 hours of CPU for a client to do 50 generations.

And this computer has a slower CPU than the node.

So if you put a gigabyte of memory on your computer and run 8 copies of the client at once, you will have about 4 times the production compared to what your computer would have with only one copy: each copy produces at roughly half the single-copy rate, but there are eight of them.
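A quick back-of-envelope check, using the 24-hour and 6-CPU-hour figures above (and assuming the hypothetical 8 copies each get 1/8 of the CPU):

[code]
# 50 generations per 24 wall-clock hours with one copy
single_rate = 50 / 24                 # ~2.1 generations/hour

# one of 8 copies: 6 CPU-hours of work at 1/8 of the CPU
# = 48 wall-clock hours per 50 generations
copy_rate = 50 / (6 * 8)              # ~1.0 generations/hour each

print(copy_rate * 8 / single_rate)    # -> 4.0, i.e. ~4x production
[/code]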

This seems to be due to the 3-minute (real-time) timer. No matter how fast or slow your CPU is, the client will sit there until enough 3-minute timeouts have occurred to loosen the constraints enough for that CPU to do a structure in under 3 minutes, and then it generates structures at roughly the same rate no matter what the CPU speed is.
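To see why, here's a toy model (all the numbers are invented, and the real constraint schedule is surely different): each structure starts under tight constraints, and every 3-minute timeout loosens them one notch, making the structure cheaper to build.

[code]
TIMEOUT = 180.0                        # the 3-minute real-time timer
WORK = (10000.0, 2000.0, 400.0, 80.0)  # CPU-seconds per structure at
                                       # each looseness notch (invented)

def structures_per_hour(speed):
    """Structures per hour on a CPU `speed` times the baseline."""
    t, done = 0.0, 0
    while t < 3600.0:
        notch = 0
        # sit through timeouts until a structure fits in one window
        while notch < len(WORK) - 1 and WORK[notch] / speed > TIMEOUT:
            notch += 1
            t += TIMEOUT
        t += WORK[notch] / speed       # actually build the structure
        done += 1
    return done

for speed in (1, 2, 4, 8):
    print(f"{speed}x CPU: {structures_per_hour(speed)} structures/hour")
[/code]

In this model an 8x faster CPU produces only about 1.5x the structures, because most of each structure's wall-clock time is spent waiting out timeouts rather than computing.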

There are several problems with this. First, it is blatantly unfair to people with fast machines, and that kind of unfairness can turn people away from a project. Second, it invites people to do weird things like running many copies of the client at once or rigging their real-time clock to run at 16x normal speed. Third, it can't possibly be an efficient way for the client to use CPU cycles.

I can often see the client getting stuck, repeatedly backing off about 5 units and running forward again. It's just sitting there wasting CPU cycles and not getting anywhere. I think the client should be much more aggressive about recognizing this (based on the number of tries, not real time) and loosening the constraints. It can tighten the constraints back up once it's past the sticking point.
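Something along these lines, counting attempts instead of watching the clock (the threshold and probabilities are made up; this is a sketch of the idea, not the client's actual logic):

[code]
import random

MAX_TRIES = 200        # hypothetical give-up threshold per position

def place_residue(success_prob_at_notch):
    """Try to place one residue, loosening after repeated failures.

    Returns the notch it finally succeeded at; the caller tightens
    the constraints back to notch 0 for the next residue."""
    notch, tries = 0, 0
    while True:
        if random.random() < success_prob_at_notch(notch):
            return notch
        tries += 1
        if tries >= MAX_TRIES:         # stuck: relax one notch
            notch, tries = notch + 1, 0

# a sticking point where tight constraints almost never succeed:
print(place_residue(lambda n: 0.0001 * 10 ** n))
[/code]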


So here is a summary of my wish list:

1) The rate the client runs at should be based on CPU cycles, not on real time.

2) The huge number of disk requests should be scaled way back. If this activity is due to checking the foldtrajlite.lock file, then perhaps there can be a switch to control the rate (see the first sketch after this list).

3) Add checkpointing.

4) Make the random number seeding cluster-friendly (if you haven't already). If you haven't found a good way to use the MAC address, perhaps you could add that switch we talked about, so that an integer could be given to the client for combining with the time and pid to make the seed (see the second sketch after this list).
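On item 2, here is what such a switch could do (the lock file name is the real one; the interval, switch name, and function are my inventions): stat the lock file at most every few seconds instead of on every pass through the inner loop, cutting the disk-request rate from hundreds per second to a handful per minute.

[code]
import os
import time

LOCK_FILE = "foldtrajlite.lock"
CHECK_INTERVAL = 5.0       # seconds; imagine a -lockrate switch

_last_check = 0.0

def lock_still_held() -> bool:
    """Rate-limited check of the lock file."""
    global _last_check
    now = time.monotonic()
    if now - _last_check < CHECK_INTERVAL:
        return True        # assume nothing changed between checks
    _last_check = now
    return os.path.exists(LOCK_FILE)
[/code]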
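And on item 4, the kind of seeding I have in mind (the switch name and mixing constants are arbitrary): a per-node integer from the command line gets mixed with the time and pid, so two nodes that start in the same second with recycled pids still get different seeds.

[code]
import os
import time

def make_seed(node_id: int) -> int:
    """Combine a user-supplied node id with the time and pid."""
    t = int(time.time())
    pid = os.getpid()
    # simple multiplicative mixing; any decent hash would do
    return (node_id * 2654435761 ^ t * 40503 ^ pid) & 0xFFFFFFFF

# e.g.  foldtrajlite -seed 17   ->   seed the RNG with make_seed(17)
[/code]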

That is in fact a very valid point, even if the rate is based on CPU cycles rather than real time.

If a large amount of time is spent trying to complete a fold, then most of that time is effectively wasted, and one can increase overall production by running an extra client. Of course, if both clients get stuck on a particular fold, then it becomes worth it to have a third client running, and so on. The loss of productivity due to context switches will be more than made up for by the gain in fold production.

This leads to some difficulties: the current fold should be abandoned after some period of time (or the constraints relaxed enough), and that threshold is hard to pick. If it is done too soon, a good fold may be lost. If it is done too late, productivity drops.
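One way to get a feel for the productivity side of that tradeoff is a toy renewal model (all numbers invented): most folds finish quickly, but some fraction hit a sticking point and take an exponentially distributed long time, and abandoning after a cutoff restarts the fold from scratch. Note this only measures throughput; the scientific cost of throwing away a fold that might have been a good one isn't in the numbers.

[code]
import math

FAST, SLOW, P_STUCK = 60.0, 3600.0, 0.2   # seconds; all invented

def folds_per_hour(cutoff: float) -> float:
    """Completed folds per hour if stuck folds are abandoned after
    `cutoff` seconds (assumes cutoff > FAST)."""
    p_escape = 1 - math.exp(-cutoff / SLOW)   # stuck fold finishes
    p_done = (1 - P_STUCK) + P_STUCK * p_escape
    # E[min(Exp(SLOW), cutoff)] = SLOW * (1 - exp(-cutoff / SLOW))
    e_attempt = (1 - P_STUCK) * FAST + P_STUCK * SLOW * p_escape
    return 3600 * p_done / e_attempt

for cutoff in (120, 300, 900, 3600):
    print(f"abandon after {cutoff:>4}s: {folds_per_hour(cutoff):.1f} folds/hour")
[/code]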