View Full Version : Beta 2 now available - beta testers please update



Brian the Fist
02-21-2003, 02:07 PM
Ok, I have released an updated beta with all the bugs that have been found so far fixed. If you are helping to beta test, please get the update from the same place as before:

ftp://ftp.mshri.on.ca/pub/distribfold/download/distribfold-beta-linux-i386.tar.gz
for Linux and
ftp://ftp.mshri.on.ca/pub/distribfold/download/distribfold-beta-win9x.zip

To install, just overwrite the previous beta or a fresh install with the files in this archive; no other files have changed.

Please delete filelist.txt as well, if present, to force it to restart at generation zero.

If you downloaded it recently and are not sure which version you got, just look at readme.txt, which should say 'Beta 2' on the first line if you've got it.

Please post any further bugs/problems with Beta 2 in THIS thread; I will ignore any further reports in the previous thread and assume those are from the old version.

Thanks for your assistance.

MavericK
02-21-2003, 02:55 PM
Okay, this one seems to be working correctly. No more data file checksum error, and it doesn't seem to get stuck for so long either.
Good job.:thumbs:

FoBoT
02-21-2003, 04:17 PM
the new beta2 for windows doesn't ask about the # of files to buffer, the foldit.bat on both of my beta boxes skipped that question

Pascal
02-21-2003, 05:14 PM
I've tested some (ca. 6) different options under SuSE Linux 8.1 - no errors have occurred so far.

Great work!

If an error occurs, I'll post something about it ;)

Brian the Fist
02-21-2003, 05:38 PM
Originally posted by FoBoT
the new beta2 for windows doesn't ask about the # of files to buffer, the foldit.bat on both of my beta boxes skipped that question

You mean pertaining to the -df option right? As stated in the 'readme.txt', the -df option is now always enabled. This is because it is not needed anymore. The files are a lot smaller than they used to be and the program should never take up TOO much disk space (until someone complains that it does of course :rolleyes: ).

Ok, please try to break this version as much as you can, meanwhile I still have a few small things to add to it. Since it is probably relatively stable now, I will take feature requests and suggestions at this time too - but please keep them simple and useful to many people (not just yourself). :) As always they will be judged fairly and if deemed worthy, included in the next (and hopefully final) beta.

P.S. when I change to the next beta, something will change on the server too meaning data from the old beta will not be accepted any longer. In fact that might be a good way to tell when the next beta is ready, if you don't check here often :) No big deal otherwise.

Pascal
02-21-2003, 06:16 PM
Originally posted by Brian the Fist
... Since it is probably relatively stable now, I will take feature requests and suggestions at this time too - but please keep them simple and useful to many people (not just yourself). :) As always they will be judged fairly and if deemed worthy, included in the next (and hopefully final) beta.


Perhaps you could add a display of the number of structures produced and the best RMSD for this sequence of generations.
The best RMSD could be updated after each generation.
Although it might be a bit difficult, it would be a good step for monitoring.

Insidious
02-21-2003, 06:31 PM
Running WinXP ASCII client at the moment.

I haven't been able to break it wrt internet disconnects so far :)

running other apps. and games doesn't seem to bother it a bit!

When I shut it down improperly (Windows restart without stopping the client first), the next time I ran it, it started over and
data was apparently lost (I think you said this is normal). After that, everything seemed to progress normally!

KUDOS!

Will experiment with service install next.

update: it's running fine as a service for me as well :)


For the shallow among us, it would be fun if the progress.txt file
indicated how many "points" were uploaded since the client last started. (I don't understand how to correlate the scores on the
stats site with what I see on my client display/progress.txt.)

-Sid

hallmar
02-21-2003, 06:32 PM
Currently been running for 2hrs with no errors so far :thumbs:

Nicely done :cheers:

mighty
02-21-2003, 07:34 PM
Will the client be hardcoded to use 5000 for generation 0 and 200 for subsequent generations, or will it be possible for the user to change those values?
One might want to start out with 10000 to get the best possible structure to start out with.

bwkaz
02-21-2003, 09:13 PM
Are there any plans to change the format of progress.txt between the beta and the release of the new client?

If the format does change, then updating dfGUI would be pointless, but if it doesn't, then starting with the update now would probably be a good idea.

I do see that Insidious has asked for:


it would be fun if the progress.txt file indicated how many "points" were uploaded since client last started.

which would be a format change in progress.txt, so perhaps I should wait to start modifications of the Linux port (of dfGUI, that is) until I hear whether that's going to happen.

Anonymous
02-21-2003, 09:19 PM
P4-1.8A,
256 DDR 333 x2,
ASUS P4S533,
gforce4ti-4200 64MB...

Windows XP Pro (Hong Kong Version)
__________________________________

When I shut it down (pressing "Q"), it can resume next time. There seemed to be no problem until finishing the 500th structure.

When making Gen. 3 Trajectory Distribution, an error occurs.
[ ][NULL_Caption] FATAL ERROR:
[012.002] Attempt to insert duplicate residue number into database
Hit Return

After pressing "Enter", it displays Abrupt: code = 12

In error.log,
FATAL ERROR: [012.002] {trajtools.c, line 2377} Attempt to insert duplicate residue number into database

When it started over again, the error was still there.
I have no idea......:confused:

robi2106
02-21-2003, 10:31 PM
Software Setup:
WinXP Pro version 2002

Hardware:
Motherboard ASUS A7M-266-D (dual Athlon MP1800+)

Environment:
No other instances of the beta2 are running.
Several IE windows are open.
Only Winzip8.1 and various hardware drivers are installed.

Description: Fatal error on every run of foldit.bat (non-service install).


Steps to Reproduce:
1)Unzip latest full version of win9x command line client.
2)Unzip beta (version2) into existing dir overwriting files.
3)Change the foldtrajlite arguments to the following:
.\foldtrajlite -f protein -n native -s 10000 -df -p 20

4)run "foldit.bat"
5)Error occurs immediately

Error Description:
When running foldit.bat for the first time, the following error message is displayed in the command window:

[NULL_Caption] FATAL ERROR: [000.000] Missing/Invalid arguments. For usage, run program, with no arguments.
Hit Return

After hitting return the following windows pops up:
Title: foldtrajlite.exe - Fatal Application Exit
Message:Abrupt: code = 0

Anonymous
02-21-2003, 10:32 PM
One point I have not mentioned.
The parameter of client I ran is -qf -it -rt.

Without the -qf parameter, there seemed to be no problems (it ran continuously for an hour)
:)

When minimizing the energy of the best structure found, my PC cannot run smoothly, even though the client's CPU priority is the lowest. (I do not know if it is a problem or not.) I found it a bit annoying.
:(

TheWeatherMan
02-21-2003, 10:44 PM
Beta seems to run smoothly except for one thing...
As long as the client window has focus it runs like lightning...
If I click on some other window, it nearly stops and proceeds at a snail's pace. That isn't the case with the current release.

OS Win 95 SE
AMD Athlon 1.466 GHz

Nothing was changed... unzipped the new stuff right over the old.

AMD_is_logical
02-21-2003, 11:39 PM
Originally posted by robi2106
.\foldtrajlite -f protein -n native -s 10000 -df -p 20

There is no -s switch. The client is exiting with a fatal error because you are giving it an invalid switch.

Run foldtrajlite without any arguments to see a list of valid switches.

Aegion
02-22-2003, 02:02 AM
So far with both versions of the beta, the client has been working great for me! The improvement in the accuracy of the protein structure predictor is really amazing. I can actually watch my best protein structure progressively improve as I upload additional structures. Instead of merely randomly going down in numbers occasionally, I've actually seen it drop by .01 two uploaded sets in a row. The results definitely look encouraging so far. The really interesting question will be how much better the best protein structure prediction will be compared to the previous client with a large enough sample size.


robi2106
02-22-2003, 03:14 AM
Originally posted by AMD_is_logical
There is no -s switch. The client is exiting with a fatal error because you are giving it an invalid switch.

Run foldtrajlite without any arguments to see a list of valid switches.

I included that argument because it is listed in the readmefirst.txt that comes with the current full version of the beta (upload frequency control).

Welnic
02-22-2003, 03:37 AM
Originally posted by robi2106
I included that argument because it is listed in the readmefirst.txt that comes with the current full version of the beta (upload frequency control).

That is a valid argument with the regular version. With it, it didn't really matter whether you uploaded 1000 units 10 times or 10000 units once. But with the beta, the first generation is just to get a good fold to start with for the succeeding generations, so you no longer have the option to choose how many this is going to be.

Akkermans
02-22-2003, 04:40 AM
System used: PIII 600, 128MB, Win NT4 Sp6.
Tested the options -qt -rt + removing .lock file etc without a problem so far.

Small remark: you get '[]====[]===' output on your screen even when you use the -qt option. This doesn't cause any problems, however.

Brian the Roman
02-22-2003, 07:57 AM
1) Client now says what generation it's currently working on which is nice. Better still to list all generations done so far and the best rmsd or energy for each generation. This could replace the ASCII art.

2) Instead of having each client do 5000 random selections first and then choose the best of their own 5000 to 'drill down on', why not have each client that is connected to the net report its 5000 to your server and have your server assign the structure to explore more?
That way you could leverage the best of the entire collection as opposed to just the best out of each 5000. Furthermore, this would set up a structure where you could apply some intelligent guidance to the selection process based upon whatever criteria you deem best in the future.

3) The 'DATA UPLOADED TO SERVER - CONTINUING' message stays on too long in generations after the first.

Of these, only #2 is really substantive and is also the most difficult to implement. Maybe in a future release...

ms

grobinette
02-22-2003, 07:59 AM
When minimizing energy of best structure found, my PC cannot run smoothly although the priority of CPU usage of the client is the lowest. (I do not know if it is a problem or not)

I also had this problem with the client, running on Win 98SE, 650 MHz processor, 256 MB RAM, with the -qt -rt switches, although I had the client set on high priority. Glad to see it was not just in the low priority setting that this happened.

This is a problem for people who will run this on boxes that are not dedicated crunchers. There were instances of not being able to open or close applications and severe keyboard lag.

Mpemba Effect
02-22-2003, 08:19 AM
This client seems to be very good; I haven't been able to break it so far :)

One thing, though: I do like the funky ASCII art, but with this client generating huge ASCII output all the structures seem to look the same, i.e. writing over itself and filling the entire screen. So it'd be nice to change this to something else. Any chance of a really cut-down version of the client without any graphical output at all? The majority of us probably run it as a service or quietly in the background in Linux anyway. I know you can run it in quiet mode, but it's got to save on some speed over the lengths of time we run the client.

Starfish
02-22-2003, 10:41 AM
Originally posted by Brian the Roman


2) Instead of having each client do 5000 random selections first and then choose the best of their own 5000 to 'drill down on', why not have each client that is connected to the net report its 5000 to your server and have your server assign the structure to explore more?
That way you could leverage the best of the entire collection as opposed to just the best out of each 5000. Furthermore, this would set up a structure where you could apply some intelligent guidance to the selection process based upon whatever criteria you deem best in the future.


That's funny, I was just about to post the same thing :)

The 'Distributed Particle Accelerator Project' uses a similar strategy. You can download a file with the best 250 results so that your own client can directly do computations based on those good results.

Best 250 FAQ item (http://stephenbrooks.org/muon1/faq.html#best250)

I have the feeling that a similar strategy could be very beneficial for DF as well.. :cool:

TheWeatherMan
02-22-2003, 11:19 AM
Deleted...
went back and read about new stuff in 1st beta thread...

Brian the Fist
02-22-2003, 11:51 AM
Originally posted by Akkermans
System used: PIII 600, 128MB, Win NT4 Sp6.
Tested the options -qt -rt + removing .lock file etc without a problem so far.

Small remark, you get '[]====[]===' output on your screen even when you use the -qt option. This doesn't cause any problem however .

I assume you mean the progress bar during energy minimization?? This should NOT be the case. I'll fix that if it's true. It is important that quiet mode has NO output (it can cause problems depending on how people use it).

Brian the Roman
02-22-2003, 11:51 AM
Originally posted by Starfish
That's funny, I was just about to post the same thing :)

The 'Distributed Particle Accelerator Project' uses a similar strategy. You can download a file with the best 250 results so that your own client can directly do computations based on those good results.

Best 250 FAQ item (http://stephenbrooks.org/muon1/faq.html#best250)

I have the feeling that a similar strategy could be very beneficial for DF as well.. :cool:

Glad someone agrees. Actually I was thinking about this a bit more and you probably wouldn't need the full 5000 from everyone, just the best 100 or so since they're the ones we're really interested in anyway.

ms

Brian the Fist
02-22-2003, 11:57 AM
Originally posted by Brian the Roman
1) Client now says what generation it's currently working on which is nice. Better still to list all generations done so far and the best rmsd or energy for each generation. This could replace the ASCII art.

2) Instead of having each client do 5000 random selections first and then choose the best of their own 5000 to 'drill down on', why not have each client that is connected to the net report its 5000 to your server and have your server assign the structure to explore more?
That way you could leverage the best of the entire collection as opposed to just the best out of each 5000. Furthermore, this would set up a structure where you could apply some intelligent guidance to the selection process based upon whatever criteria you deem best in the future.

3) The 'DATA UPLOADED TO SERVER - CONTINUING' message stays on too long in generations after the first.

Of these, only #2 is really substantive and is also the most difficult to implement. Maybe in a future release...

ms

2: I'm not sure I understand. Why does the server have to pick out of the 5000 instead of the client? After we've tested the present sampling method on a few proteins, we may change it to be more like a genetic algorithm. I won't go into detail now, but I think it'll be a bit similar to your suggestion with some added complexity and randomness.

3: Can you be more specific? Is it a bug? The length of time it stays on is measured in structures, not real-time, could this be why?

Brian the Fist
02-22-2003, 12:05 PM
Originally posted by grobinette
I also had this problem with the client. Running on Win 98se, 650 processor, 256mg ram, with the -qt -rt switches, although I had the client set on high priority. Glad to see it was not just in the low priority setting that this happened.

This is a problem for people who will run this on boxes that are not dedicated crunchers. There were instances of not being able to open or close applications and severe keyboard lag.

I will verify this, but I'm not sure if there is anything I can do about it, as it is a Windows thing. The only difference I can think of is that this part of the code is multithreaded (2 threads, one to do the work, one to update the progress bar) - maybe Windows doesn't handle low-priority multithreading well? Any Microsoft experts out there? Actually, one idea comes to mind: perhaps I need to put a sleep() in the progress bar updating loop, or else it might be thrashing the CPU as it basically loops endlessly waiting for the real work to progress. I'll try it out...
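The busy-wait idea mentioned above can be illustrated with a small sketch. This is not the actual foldtrajlite code (names and structure are invented); it just shows why a sleep() in the monitor loop stops the progress thread from spinning:

```python
import threading
import time

def run_with_progress(n_steps=10, poll_interval=0.02):
    """Illustrative sketch: a worker thread does the computation while a
    monitor thread polls the shared progress value (where a real client
    would redraw the progress bar).  The sleep() in the monitor loop is
    the proposed fix -- without it, the loop spins and thrashes the CPU."""
    state = {"done": False, "progress": 0}
    samples = []

    def worker():
        for i in range(n_steps):
            time.sleep(0.01)            # stand-in for real computation
            state["progress"] = i + 1
        state["done"] = True

    def monitor():
        while not state["done"]:
            samples.append(state["progress"])  # redraw the bar here
            time.sleep(poll_interval)          # yield the CPU between updates

    w = threading.Thread(target=worker)
    m = threading.Thread(target=monitor)
    w.start(); m.start()
    w.join(); m.join()
    return samples
```

With the sleep() in place the monitor wakes only a few dozen times per second instead of millions, which should make it essentially free even at low process priority.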

Starfish
02-22-2003, 12:14 PM
Originally posted by Brian the Fist
2: I'm not sure I understand. Why does the server have to pick out of the 5000 instead of the client? After we've tested the present sampling method on a few proteins, we may change it to be more like a genetic algorithm. I won't go into detail now, but I think it'll be a bit similar to your suggestion with some added complexity and randomness.


It's more like:

1) The client gives its best result(s) to the server
2) The server does know what the few best results are from ALL work returned by all clients (best results of the entire project phase)

3) The server gives the best results back as a few extra 'random seeds' (or something like that) to ALL the clients again, so that All clients may benefit from the few best results of a few clients. (giving them a headstart/foundation to work on)

But I must confess that I don't know if this could be implemented, because I don't know how the 'intelligence' in the beta client works.

edit: it's more or less a way to 'share intelligence' between clients so that the avg. 'not so smart' client benefits from the 'few bright clients' who returned a great result :)

AMD_is_logical
02-22-2003, 02:29 PM
I'm trying the beta on one of my cluster nodes. These are diskless nodes that use an NFS server. The network activity on that node is *far* higher than with the old client. I estimate it's about 30 times higher (in terms of bytes transferred) than the old client.

What is all this disk activity? I know it isn't due to the /tmp stuff, as that is on the client's ramdisk. Could it be eliminated by having the client cache whatever it's reading in its RAM? (I'm already using the -rt switch.) The client wasn't doing anything unusual at the time. It was just working on structure 1 of a later generation waiting for enough 3 minute timeouts so that it could proceed.

I don't think my server can handle such a huge amount of activity on all nodes at once. The best thing would be to fix the client so that it doesn't have so much needless disk activity. (I don't see the need for any disk activity while a structure is being worked on, except for checking the lock file, and if this is from checking the lock file then it is being done several orders of magnitude too often.) If that isn't done, I will need to know what file(s) are involved, then find a way to put them on the client's ramdisk.

grobinette
02-22-2003, 03:48 PM
Originally posted by Brian the Fist
I will verify this, but I'm not sure if there is anything I can do about it, as it is a Windows thing. The only difference I can think of is that this part of the code is multithreaded (2 threads, one to do the work, one to update the progress bar) - maybe Windows doesn't handle low-priority multithreading well? Any Microsoft experts out there? Actually, one idea comes to mind: perhaps I need to put a sleep() in the progress bar updating loop, or else it might be thrashing the CPU as it basically loops endlessly waiting for the real work to progress. I'll try it out...

Some additional information if you are interested: I was running standard office applications at the times it caused the keyboard lags and/or freezes - Outlook, Norton AV, and Word. Nothing out of the ordinary. System resources were still at 76% once everything returned to a "normal" state. The problems were intermittent throughout the folding process, but I could not even Ctrl-Alt-Delete to stop any processes once it froze the apps.

Akkermans
02-22-2003, 05:06 PM
Originally posted by Brian the Fist
I assume you mean the progress bar during energy minimization?? This should NOT be the case. I'll fix that if it's true. It is important that quiet mode has NO output (it can cause problems depending on how people use it).

This happens indeed with the -qt option, during both the energy minimization phase and Generating Trajectory Distribution.
The length is more than the standard energy minimization window size; it can go up to 4 or 5 lines in the DOS box (so
[]====================[]====================[]======... etc.)

So you get far more = characters than without the -qt option.

Welnic
02-22-2003, 06:22 PM
So I am seeing a big delay doing certain things in Windows XP. I timed the main application that I run all the time for opening and closing times. I had the client running in a dos box with the -rt switch.

Open with regular client: 3 seconds
Open with beta client: 33 seconds
Close with regular client: 5 seconds
Close with beta client: 70 seconds

I also saw this over a year ago when I was trying to run the CLI version of SETI. It seems like the beta client is willing to give up the CPU, but the application never really asks for it. It was the first thing that I checked when I started on this project. Once the application is open, it seems to be able to grab the CPU time that it needs.

I normally run as a service and that is where I first noticed this. I was just running in the dos box because it was faster to set up and I wanted to make sure it was just doing the normal folding part instead of the energy minimization part.

TheWeatherMan
02-22-2003, 06:46 PM
During the energy minimization routine, a fatal error occurred.

Windows message:
FOLDTRAJLITE caused an invalid page fault in
module <unknown> at 00de:7c004407.
Registers:
EAX=00000000 CS=017f EIP=7c004407 EFLGS=00010212
EBX=a0000000 SS=0187 ESP=08dffe55 EBP=0108dffe
ECX=00000001 DS=0187 ESI=fc08b4b3 FS=392f
EDX=089e1e50 ES=0187 EDI=bc08ba97 GS=0000
Bytes at CS:EIP:

Stack dump:
bc08dffe 0008b4b3 b0000000 2008b4b3 0108ba99 60000000 6008ba97 0108ba98 00089100 6a000000 42c19760 f5416a1e 00bf87d3 3fe80000 e0000000 c69ad33c

OS: Win 98SE
CPU: AMD 1.466 GHz
RAM: 500 Megs PC 133
Flags used: .\foldtrajlite -f protein -n native -qf -df -it -rt

Update:
I restarted the client changing nothing...
The client started over and promptly crashed after finishing the first part, while the minimizing routines were running.

I shut down everything, rebooted Windows, and restarted the client.
At exactly the same point it crashed again.

Insidious
02-22-2003, 07:38 PM
That one sounds like the errors I get if I am OC'd more than DF likes.

DF is the most sensitive app I run to memory errors, and is usually the first to indicate I am pushing things beyond 100% stable.

-Sid

TheWeatherMan
02-22-2003, 08:54 PM
Originally posted by Insidious
That one sounds like the errors I get if I am OC'd more than DF likes.

DF is the most sensitive app I run to memory errors, and is usually the first to indicate I am pushing things beyond 100% stable.

-Sid

Well, I might agree with you if the system wasn't stable. It's been running the non-beta version now for nearly 2 weeks... no problem. This version has run fine since yesterday. Also, this machine has run the Stanford Gromacs core, which is sensitive to overclocking in the extreme, without a problem. I put it back to stock speed and it did exactly the same thing. It's now running a Stanford work unit. I'll wait and see what happens. Thanks for the suggestion.

Brian the Roman
02-22-2003, 11:19 PM
Originally posted by Brian the Fist
2: I'm not sure I understand. Why does the server have to pick out of the 5000 instead of the client? After we've tested the present sampling method on a few proteins, we may change it to be more like a genetic algorithm. I won't go into detail now, but I think it'll be a bit similar to your suggestion with some added complexity and randomness.


The process as I understand it now is that the client will generate 5000 structures, pick the best one to drill down on in the next generation, and then pick the best one of that generation, etc until 50 generations have been done.
The weakness in this approach is that each client is limited in its choice of structures to the 5000 it produced at the beginning. If you have 100 people crunching it's very unlikely that the 100 best structures overall will be distributed evenly across the 100 sets of 5000. Rather, some sets of 5000 will actually contain several very good structures while some others will have none. The guy who ends up drilling down on the set that had no good structures to begin with is wasting his time compared to the guy who got lucky and got a good structure. The way to avoid this is to pool the best of the structures crunched by the clients on your server. Then, the server assigns the best structure that has not already been assigned to each client who asks. That way all of the best structures are pooled on the server and every client is drilling down on the best structure available out of the entire pool instead of just its own 5000.

To be sure I've been clear I'll do this as pseudo code. (sorry if you already understand)

1) client A crunches 5000 and reports the best 100 structures to the server.
2) the server assigns the best structure of the 100 to client A to drill down on and marks the structure as 'taken'.
3) client B crunches 5000 and reports the best 100 to the server.
4) the server assigns the best structure out of the 200, excluding those already taken, to client B to drill down on and marks that structure as taken.
5) and so on for the rest of the clients

The advantage of the above approach is that each client is always crunching the best unique structure available. With the current method some of the best structures found overall are ignored completely simply because they happen to be second best out of 5000.

Looking at this from another perspective: you're not leveraging all of the sampling work done by all of the clients when choosing the best to drill down on. Everyone's on their own, which will almost certainly be less effective overall.

ms
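The numbered steps above could be sketched as server-side logic. The following is a hypothetical illustration (class and method names are invented, and structures are reduced to (energy, id) pairs; this is not the actual DF server):

```python
import heapq

class StructurePool:
    """Hypothetical server-side pool for the scheme described above:
    clients report their best structures (lower energy = better), and
    each asking client is assigned the best structure not yet taken."""

    def __init__(self):
        self._heap = []     # min-heap of (energy, structure_id)
        self._taken = set() # ids already assigned to some client

    def report(self, structures):
        """Accept an iterable of (energy, structure_id) pairs from a client."""
        for energy, sid in structures:
            heapq.heappush(self._heap, (energy, sid))

    def assign(self):
        """Pop and return the lowest-energy untaken structure id,
        marking it taken; return None if the pool is exhausted."""
        while self._heap:
            energy, sid = heapq.heappop(self._heap)
            if sid not in self._taken:
                self._taken.add(sid)
                return sid
        return None
```

A real implementation would also need per-client bookkeeping and timeouts for abandoned assignments; this only shows the pool-and-mark-taken idea.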

Brian the Roman
02-22-2003, 11:23 PM
Originally posted by Brian the Fist

3: Can you be more specific? Is it a bug? The length of time it stays on is measured in structures, not real-time, could this be why?

No, it's not a bug. It's just that now that the structures take a minute or so each instead of a second or two, the message stays up about 60 times longer than it used to. It's a minor point.

ms

Guff®
02-22-2003, 11:46 PM
Running in -qf mode, I've noticed that the residue calculation number will "hang" or even fall back in sequence.

i.e., "calculating residue 6x of 96" may fall back to "5x of 96", go back up to "7x of 96", and fall back to "6x of 96". It's currently on generation #7, but #6 was similar. Is this normal?
It's like it gets bored and just doodles all over itself.

WinXP w/AMD 1.4 T-Bird
-df -qf -it -rt switches

Aegion
02-22-2003, 11:52 PM
Originally posted by Guff®
Running in -qf mode, I've noticed that the residue calculation number will "hang" or even fall back in sequence.

ie., "calculating residue 6x of 96", may fall back to "5x of 96", go back up to "7x of 96" and fall back to "6x of 96". It's currently on #7 generation, but #6 was similar. Is this normal?
It's like it gets bored and just doodles all over itself.

WinXP w/AMD 1.4 T-Bird
-df -qf -it -rt switches
This is what Howard was referring to when he mentioned how the structures can get "stuck" with the beta. This is supposed to happen and is not a bug.

Guff®
02-23-2003, 12:05 AM
Thanks for the clarification. It seems to get stuck on every generation. Back to -qt mode for me!

Aegion
02-23-2003, 02:17 AM
I'm currently on set 50, and I've noticed an issue with a couple of the protein structures. What happened is that the client got stuck multiple times at different points while processing the same structure. It ended up taking over 30 minutes to process a single structure on both occasions. I strongly suspect that taking this amount of time is counterproductive, since it could be better spent crunching additional structures. (I'm crunching on an Athlon 2000+ that wasn't doing anything else CPU intensive, so it's not that we're talking about a slow computer here.)

Some sort of routine should be added to the program so that if it takes over a certain amount of time to crunch a single structure, it gives up on that unit and goes to the next one.

tpdooley
02-23-2003, 02:53 AM
I ran the client a number of times to test out the problems I've had with systems being shut down improperly and then having to edit filelist to get them to upload. I started the client after disconnecting the ethernet cable from the machine (since I've had repeated problems with it losing internet connections, being improperly terminated, and not uploading until filelist was hand-cleaned).
I let it get to group 2 three different times, and stopped the program improperly three different ways. On restart (connected to the network), it uploaded the packets for group 0 and group 1 with no complaints.

That's a nice improvement.. Thanks. :)

As we move on to generation 50, does it get progressively longer per generation? (such that after a certain generation number, keeping waypoints in the current generation starts making sense?)

jlandgr
02-23-2003, 04:55 AM
As we move on to generation 50, does it get progressively longer per generation? (such that after a certain generation number, keeping waypoints in the current generation starts making sense?)
From my observation, it does get slower over time, probably because the structures have a lower A and the probability of generated structures having overlapping atoms (or whatever makes a structure invalid and get 'stuck') is higher? Just a guess. As for waypoints: during the tests I did, exiting the client with Q, it started over after the last structure in a generation, not at the beginning of a generation (if you exited after some structures were done).
Do you mean something different by 'waypoints'?
Jérôme

Brian the Fist
02-23-2003, 10:45 AM
Originally posted by Aegion

Some sort of routine should be added to the program so that if it takes over a certain amount of time to crunch a single structure, it gives up on that unit and goes to the next one.

Actually, it already does exactly what you say. The timeout is about 3 minutes. Guess you haven't been watching THAT closely :)

I DID already explain this, but if it keeps getting stuck, it will keep relaxing the constraints more and more until it eventually gets unstuck.

Starfish
02-23-2003, 10:48 AM
Originally posted by Brian the Fist
Actually, it already does exactly what you say. The timeout is about 3 minutes. Guess you haven't been watching THAT closely :)

But what if it keeps getting stuck on the other retries as well?

;)

Brian the Fist
02-23-2003, 10:49 AM
Originally posted by Brian the Roman

To be sure I've been clear, I'll do this as pseudocode. (Sorry if you already understand.)

1) client A crunches 5000 and reports the best 100 structures to the server.
2) the server assigns the best structure of the 100 to client A to drill down on and marks the structure as 'taken'.
3) client B crunches 5000 and reports the best 100 to the server.
4) the server assigns the best structure out of the 200, excluding those already taken, to client B to drill down on and marks that structure as taken.
5) and so on for the rest of the clients

ms

Ok, thanks for the clarification. The main fallacy here is this, though. Remember, when we are predicting novel, unknown folds, we do not know which are 'good' or 'bad' samples. All we have is a pseudo-energy score which in some cases tells us which samples may possibly be somewhat decent. Thus we do not want to place too much reliance on this energy value (as we learned from CASP, it's just not good enough yet to pick out the best structures). Thus even if a CPU generates 3 or 4 excellent structures in terms of RMSD, when we choose the top 5 energies there is no guarantee they will be in there. Anyhow, when we switch to trying a true genetic algorithm, the server will indeed keep pieces of the good-scoring samples and redistribute them to clients so that they will get used; that can be thought of as 'phase 3'...
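A toy numerical illustration of that last point (everything here is invented; it only shows that ranking by a noisy score need not recover the truly best structures):

```python
import random

def top_k_by_score(true_quality, noise_sigma, k=5, seed=1):
    """Score each structure as its true quality plus Gaussian noise (a stand-in
    for an imperfect pseudo-energy; lower = better) and return the indices of
    the k best-scoring structures."""
    rng = random.Random(seed)
    scored = sorted(
        (q + rng.gauss(0.0, noise_sigma), i) for i, q in enumerate(true_quality)
    )
    return {i for _, i in scored[:k]}
```

With noise_sigma=0 this picks exactly the k truly best structures; crank the noise up and the truly best ones start dropping out of the top 5, which is the CASP lesson described above.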

Welnic
02-23-2003, 10:50 AM
Originally posted by Brian the Fist
Actually, it already does exactly what you say. The timeout is about 3 minutes. Guess you haven't been watching THAT closely :)

It seems to me that it tries for 3 minutes and then starts over on the same one instead of going to the next one.

How does the scoring actually work? I think now that if you just kept running the first generation over and over you would generate way better stats.

Brian the Fist
02-23-2003, 10:53 AM
Originally posted by jlandgr
From my observation, it does get slower over time, probably because the structures have a lower A and the probability of generated structures having overlapping atoms/whatever makes a structure invalid and get 'stuck', is higher? Just a guess. As for waypoints: During the tests I did, exiting the client with Q, it started over after the last structure in a generation, not at the beginning of a generation (if you exited after some structures were done).
Do you mean something different by 'waypoints'?
Jérôme
Yup, I think I since changed it so it will continue at the start of the current generation if killed improperly. Is this sufficient? Should it be checkpointed, say, every 10 structures instead? It's not a big deal; checkpointing just involves updating filelist.txt correctly on disk (but it does require disk activity, for those who are paranoid). Perhaps I could sync it with the progress.txt update interval; would that be a good idea? Sounds like a good one to me, actually. Then again, that might not work: it might have to start at the beginning of the generation, because if you kill it, stuff won't get written out properly. I will double-check on that.
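For reference, the checkpointing idea under discussion might be sketched like this (names invented; in the real client a "checkpoint" would presumably mean updating filelist.txt on disk):

```python
def run_generation(structures, do_structure, checkpoint, interval=10):
    """Process one generation, checkpointing every `interval` structures so
    that an improper kill only loses the work done since the last checkpoint
    (instead of the whole generation)."""
    for i, s in enumerate(structures, start=1):
        do_structure(s)
        if i % interval == 0:
            checkpoint(i)  # e.g. rewrite filelist.txt up to structure i
    checkpoint(len(structures))  # mark the generation complete
```

Tying `interval` to the progress.txt update switch would give users control over how much disk activity this causes.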

So anyone have a handle on how long 50 generations is taking for them on 1 CPU? This would be useful info for us (we will have to multiply by 10 of course for the real plan) Thanks.

Welnic
02-23-2003, 11:37 AM
Originally posted by Brian the Fist
...
So anyone have a handle on how long 50 generations is taking for them on 1 CPU? This would be useful info for us (we will have to multiply by 10 of course for the real plan) Thanks.

I have it running on two machines, a 2000MP running Linux and a ~1400 MHz P4 running XP. After about 28 hours the P4 was on generation 50 and the AMD was on about 40. The AMD is normally faster and after 8 hours had a pretty good lead, but I guess it ran into some gnarly structures.

Starfish
02-23-2003, 11:41 AM
Originally posted by Welnic
gnarly structures.

'Gnarly structures'... got to remember that one :D

Aegion
02-23-2003, 01:43 PM
Originally posted by Brian the Fist
Actually, it already does exactly what you say. The timeout is about 3 minutes. Guess you haven't been watching THAT closely :)

I DID already explain this, but it will keep relaxing the constraints more and more until it eventually gets unstuck, if it keeps getting stuck.
You are not getting what I'm saying. It gets unstuck, but then gets stuck again with the same structure at a later point. It also looks like it backs up from time to time, often getting stuck again at the same point it was at earlier. I was watching a clock as I timed it; it took over 30 minutes to process a single structure in both instances.

Guff®
02-23-2003, 01:53 PM
Originally posted by Aegion
You are not getting what I'm saying. It gets unstuck, but then gets stuck again with the same structure at a later point. It also looks like it backs up from time to time, often getting stuck again at the same point it was at earlier. I was watching a clock as I timed it; it took over 30 minutes to process a single structure in both instances.
I posted about this problem and you said it's not a bug. So what turned it into one?

AMD_is_logical
02-23-2003, 01:53 PM
Originally posted by Brian the Fist
Yup, I think I since changed it so it will continue at the start of the current generation if killed improperly. Is this sufficient? Should it be checkpointed say every 10 structures instead?
Once the number is increased to 200 structures, there should definitely be checkpointing. Power failures happen, and many people play unstable games on their systems. If a 200-structure generation takes many hours to produce, that is way too much work to lose.
It's not a big deal, checkpointing just involves updating filelist.txt correctly on disk (but does require disk activity for those who are paranoid).
Just in case you're referring to my complaint about disk activity in a previous post, let me clarify. I am getting about 1000 packets per second each way between the node running the beta and the server. Compared to that insane amount of traffic, the amount required for a checkpoint would be insignificant.

BTW, I noticed that someone was complaining about the beta DF client seriously hurting their performance when loading programs and such under Win9x. I don't have Win9x so I'm just speculating, but I can't help but wonder if the huge number of disk requests from the client is acting as a sort of denial-of-service attack on the Win9x disk subsystem.
Perhaps I could sync it with the progress.txt update interval, would that be a good idea?? Sounds like a good one to me actually..
Sounds good to me. That way, we have a switch that will let us select the frequency of checkpointing. :D
So anyone have a handle on how long 50 generations is taking for them on 1 CPU? This would be useful info for us (we will have to multiply by 10 of course for the real plan) Thanks.
It depends, and it can vary. On one node running a single copy of the client I get roughly 24 hours or so.

On another computer I'm running four copies of the client at once, as well as other stuff. Each client is getting about 1/8 of the CPU. Here it takes only about 6 hours of CPU for a client to do 50 generations.

And this computer has a slower CPU than the node.

So if you put a gigabyte of memory on your computer and run 8 copies of the client at once, you will have about 4 times the production compared to what your computer would have with only one copy.

This seems to be due to the 3-minute (real-time) timer. No matter how fast or slow your CPU, it will sit there until enough 3-minute timeouts have occurred to loosen the constraints enough for that CPU to do a structure in under 3 minutes; then it will generate structures at the same rate no matter what the CPU speed was.

There are several problems with this. First, it is blatantly unfair to people with fast machines, and that kind of unfairness can turn people away from a project. Second, it invites people to do weird things like run many copies of the client at once, or to rig their real-time clock to run 16x normal speed. Third, it can't possibly be an efficient way for the client to use CPU cycles.

I can often see the client getting stuck, repeatedly backing off about 5 units and running forward again. It's just sitting there wasting CPU cycles and not getting anywhere. I think the client should be much more aggressive about recognizing this (based on number of tries, not real time) and loosening the constraints. It can tighten the constraints back up once it's past the sticking point.
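The try-count-based loosening suggested here could look roughly like this (a sketch; every name and number is invented):

```python
def place_next_residue(attempt_move, tries_per_level=50):
    """Attempt a move repeatedly, loosening the constraints after every
    `tries_per_level` consecutive failures (counting attempts rather than
    wall-clock time).  attempt_move(tolerance) -> bool.  Returns the
    tolerance that finally worked; the caller can tighten back to 0 once
    past the sticking point."""
    tolerance = 0.0
    tries = 0
    while not attempt_move(tolerance):
        tries += 1
        if tries % tries_per_level == 0:
            tolerance += 0.1  # loosen after repeated failures
    return tolerance
```

Counting tries instead of seconds makes the behaviour identical on fast and slow CPUs.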


So here is a summary of my wish list:

1) The rate the client runs should be based on CPU cycles, not on real-time.

2) The huge amount of disk requests should be scaled way back. If this activity is due to checking the foldtrajlite.lock file, then perhaps there can be a switch to control the rate.

3) Add checkpointing.

4) Make the random number seeding cluster-friendly (if you haven't already). If you haven't found a good way to use the MAC, perhaps you could add that switch we talked about, so that an integer could be given to the client for combining with the time and pid to make the seed.
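For item 4, the seed mixing could be as simple as the following (hypothetical; the node_id parameter stands in for the proposed switch, and the exact mixing scheme is made up, not anything the client actually does):

```python
import os
import time

def make_seed(node_id=0, pid=None, now=None):
    """Mix the start time, the process id and an optional per-node integer
    (the proposed command-line switch) so that clients launched at the same
    second on a cluster still get distinct random seeds."""
    pid = os.getpid() if pid is None else pid
    now = int(time.time()) if now is None else now
    return (now ^ (pid << 12) ^ (node_id << 24)) & 0x7FFFFFFF
```

Any mixing that keeps the node integer in bits the time and pid rarely touch would do; the mask just keeps the result a positive 31-bit value.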

Insidious
02-23-2003, 02:21 PM
1) The rate the client runs should be based on CPU cycles, not on real-time.


I agree... why in the world would real-time be used when there is
such a large difference in what different PCs can do in that time?

(I'm wasting millions of cycles more than a putz machine)

m0ti
02-23-2003, 02:54 PM
Please ignore! Sorry!

m0ti
02-23-2003, 02:58 PM
I think this has been mentioned before, but after exiting the client while it is minimizing energy, it loses all the progress made in minimizing energy. I realize it doesn't take that long to perform, but on my machine (AXP 1600+ @ 1900+) it takes around 1 minute or so, and it seems a shame to lose the work.

Same goes for Trajectory Distribution.

Aegion
02-23-2003, 04:16 PM
Originally posted by Guff®
I post about this problem and you said it's not a bug. So what turned it into one?
Getting stuck is not a bug. However, I'm noticing a protein structure get stuck multiple times during the same sequence, taking over 30 minutes to complete a single structure. I'm questioning whether the benefit of completing that structure outweighs the lost time, which could be spent completing multiple other structures. While the software may be functioning as intended, I've noticed a flaw in the manner in which it is currently functioning.

Guff®
02-23-2003, 04:42 PM
Originally posted by Aegion
Getting stuck is not a bug. However, I'm noticing a protein structure get stuck multiple times during the same sequence, taking over 30 minutes to complete a single structure. I'm questioning whether the benefit of completing that structure outweighs the lost time, which could be spent completing multiple other structures. While the software may be functioning as intended, I've noticed a flaw in the manner in which it is currently functioning.
Thanks for verifying/explaining what I was seeing. I didn't think that should be normal, hence the original question.

shortfinal
02-23-2003, 06:15 PM
Originally posted by AMD_is_logical
I'm trying the beta on one of my cluster nodes. These are diskless nodes that use an NFS server. The network activity on that node is *far* higher than with the old client. I estimate it's about 30 times higher (in terms of bytes transferred) than with the old client.

This is just a guess, but try using the -g option to reduce how often it updates the progress.txt file. I believe if you don't specify it, the client updates the file for every structure. So try something like -g 10 so it updates the file every 10 structures.

mighty
02-23-2003, 07:27 PM
Originally posted by Brian the Fist
I assume you mean during energy minimization, the progress bar?? This should NOT be the case. I'll fix that if it's true. It is important that quiet mode has NO output (it can cause problems depending on how people use it).

This is also the case with me - running the Windows version under w2k.

It writes a series of []============== every time the "minimizing energy" or "calculating gen. x trajectory" messages would have appeared.

Brian the Roman
02-23-2003, 10:43 PM
Originally posted by Brian the Fist
Ok thanks for the clarification. The main fallacy here is this though. Remember when we are predicting novel, unknown folds, we do not know which are 'good' or 'bad' samples. All we have is a pseudo-energy score which in some cases tells us which samples may possibly be somewhat decent. Thus we do not want to place to much reliance on this energy value (as we learned from CASP, its just not good enough yet to pick out the best structures). Thus even if a CPU generates 3 or 4 excellent structures in terms of RMSD, when we choose the top5 energies there is no guarantee they wil be in there. Anyhow, when we switch to trying a true genetic algorithm, the server will indeed keep pieces of the good-scoring samples and redistribute them to clients so that they will get used, that can be thought of as 'phase 3'...

To my way of thinking, the low quality of the scoring suggests that you should be assigning the structures from the server instead of from the clients. That way you can modify the algorithm used to select the 'best' structure dynamically on the server side. You could try using two or more different algorithms simultaneously until you determine which is better, without any impact to the clients.

But if phase 3 will handle this then the point is probably moot. When do you anticipate phase 3 will begin?

ms
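For concreteness, a server-side version of the assignment scheme quoted above might look like this (invented data shapes; lower score = better):

```python
def assign_best(reports, taken):
    """reports arrive one client at a time as (client_id, [(score, structure_id), ...]),
    the top structures each client uploaded.  For each client, assign the
    best-scoring structure (across everything reported so far) that is not
    yet taken, and mark it taken."""
    pool = []          # all (score, structure_id) pairs seen so far
    assignments = {}
    for client_id, structures in reports:
        pool.extend(structures)
        pool.sort()    # best (lowest) pseudo-energy first
        for score, sid in pool:
            if sid not in taken:
                taken.add(sid)
                assignments[client_id] = sid
                break
    return assignments
```

Because the selection runs on the server, the ranking rule could be swapped out (or A/B tested) without touching the clients, which is the point being made here.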

m0ti
02-24-2003, 03:51 AM
Originally posted by AMD_is_logical
Once the number is increased to 200 structures, there should definitely be checkpointing. Power failures happen, and many people play unstable games on their systems. If a 200-structure generation takes many hours to produce, that is way too much work to lose.
Just in case you're referring to my complaint about disk activity in a previous post, let me clarify. I am getting about 1000 packets per second each way between the node running the beta and the server. Compared to that insane amount of traffic, the amount required for a checkpoint would be insignificant.

BTW, I noticed that someone was complaining about the beta DF client seriously hurting their performance when loading programs and such under Win9x. I don't have Win9x so I'm just speculating, but I can't help but wonder if the huge number of disk requests from the client is acting as a sort of denial-of-service attack on the Win9x disk subsystem.
Sounds good to me. That way, we have a switch that will let us select the frequency of checkpointing. :D
It depends, and it can vary. On one node running a single copy of the client I get roughly 24 hours or so.

On another computer I'm running four copies of the client at once, as well as other stuff. Each client is getting about 1/8 of the CPU. Here it takes only about 6 hours of CPU for a client to do 50 generations.

And this computer has a slower CPU than the node.

So if you put a gigabyte of memory on your computer and run 8 copies of the client at once, you will have about 4 times the production compared to what your computer would have with only one copy.

This seems to be due to the 3-minute (real-time) timer. No matter how fast or slow your CPU, it will sit there until enough 3-minute timeouts have occurred to loosen the constraints enough for that CPU to do a structure in under 3 minutes; then it will generate structures at the same rate no matter what the CPU speed was.

There are several problems with this. First, it is blatantly unfair to people with fast machines, and that kind of unfairness can turn people away from a project. Second, it invites people to do weird things like run many copies of the client at once, or to rig their real-time clock to run 16x normal speed. Third, it can't possibly be an efficient way for the client to use CPU cycles.

I can often see the client getting stuck, repeatedly backing off about 5 units and running forward again. It's just sitting there wasting CPU cycles and not getting anywhere. I think the client should be much more aggressive about recognizing this (based on number of tries, not real time) and loosening the constraints. It can tighten the constraints back up once it's past the sticking point.


So here is a summary of my wish list:

1) The rate the client runs should be based on CPU cycles, not on real-time.

2) The huge amount of disk requests should be scaled way back. If this activity is due to checking the foldtrajlite.lock file, then perhaps there can be a switch to control the rate.

3) Add checkpointing.

4) Make the random number seeding cluster-friendly (if you haven't already). If you haven't found a good way to use the MAC, perhaps you could add that switch we talked about, so that an integer could be given to the client for combining with the time and pid to make the seed.

That is in fact a very valid point, even if it is based on CPU cycles.

If a large amount of time is spent trying to complete a fold, then effectively most of that time is wasted, and one will increase overall production by running an extra client. Of course, if they both get stuck on a particular fold, then it becomes worth it to have yet another client running, and so on. The loss of productivity due to context switches will be more than made up for in fold production.

This leads to some difficulties: the current fold should be abandoned after some period of time (or have its constraints relaxed enough). If this is done too soon, then a good fold may be lost. If this is done too late, then productivity drops.

Welnic
02-24-2003, 09:19 AM
Originally posted by Welnic
So I am seeing a big delay doing certain things in Windows XP. I timed the main application that I run all the time for opening and closing times. I had the client running in a DOS box with the -rt switch.

Open with regular client: 3 seconds
Open with beta client: 33 seconds
Close with regular client: 5 seconds
Close with beta client: 70 seconds

...
I normally run as a service and that is where I first noticed this. I was just running in the DOS box because it was faster to set up and I wanted to make sure it was just doing the normal folding part instead of the energy minimization part.

I must have been high. Checking again, I do not see any problem with it running as a service, just when it is in the DOS box. And with some checking with a demo version of the program, I did not see the delay. I only see the delay in a beta version that I have, which I would imagine has debugging turned on.

Ned
02-24-2003, 10:31 AM
I understand your requirement to have a quiet mode, but I'd like an additional switch that would have the text client produce minimal status messages instead of the current verbose output.

These would include:
- current level number started
- 25 percent completion increments, with best numbers found
- level completion, with best numbers
- connections to server documented
- all of the above timestamped
- one line per status, concise

Consider it a kind of status at a glance!

Thanks for bandwidth... Ned

m0ti
02-24-2003, 12:09 PM
WinXP Pro
Running the client normally (no switches).

While Minimizing Energy or calculating the Trajectory Distribution, the rest of the computer freezes up completely. ALL system resources are given to DF even though it's running at low priority.

mighty
02-24-2003, 02:55 PM
Originally posted by Brian the Fist
So anyone have a handle on how long 50 generations is taking for them on 1 CPU? This would be useful info for us (we will have to multiply by 10 of course for the real plan) Thanks.

On my Athlon XP 2100+ (running Windows 2000) it completed 50 generations in a little under 17 hours, with an average of about 20 minutes per generation.

That means over 3 hours (on my machine) per generation when we make it out of beta.

I will try to make more measurements to see if it varies a lot.

m0ti
02-24-2003, 04:44 PM
WinXP Pro
CLI Client (default switches)

got the following error during Trajectory Distribution for Gen 44: FATAL ERROR: CoreLib [002.005] {ncbifile.c, line 697} File write error

When I restarted it started afresh from Gen 0.

lemonsqzz
02-24-2003, 08:04 PM
I am running 7 dual cpu systems:

model name : Intel(R) Pentium(R) III CPU family 1400MHz
cpu MHz : 1396.449
cache size : 512 KB

They take about 18 1/2 hours to complete 50 generations. I have not had any problems since I fired them up about 5 days ago. All running on Sun Linux, which is basically Red Hat 7.2.

** Would be nice if there was a way for the client to use both CPUs in some clever way to process the data, since speed really makes the difference in this version.

PackSwede
02-25-2003, 02:35 AM
Has anyone else noticed problems getting through a proxy with the beta client? I've recently tried running it on my work computer (which normally runs the 'old' client just fine) and cannot get through to the server. Thinking I had done something wrong, I double-checked the proxy.cfg file and also copied it over from my original working client's directory, but still no dice...

It will stay on "checking for new versions" for a long time (probably some timeout) and then resume crunching, but it never uploads any results.

After completing a generation there is also a long wait before anything happens (another timeout?), and I have lots of

ERROR: [000.000] {foldtrajlite.c, line 4721} Error during upload: NO RESPONSE FROM SERVER - WILL TRY AGAIN LATER

lines in my error.log file.

Is this supposed to work at all with the beta, or am I missing something?

m0ti
02-25-2003, 02:39 PM
Originally posted by m0ti
WinXP Pro
CLI Client (default switches)

got the following error during Trajectory Distribution for Gen 44: FATAL ERROR: CoreLib [002.005] {ncbifile.c, line 697} File write error

When I restarted it started afresh from Gen 0.

Just got this same error again, but for Trajectory Distribution for Gen 33.

Brian the Fist
02-25-2003, 04:26 PM
Originally posted by m0ti
That is in fact a very valid point, even if it is based on CPU cycles.

If a large amount of time is spent on trying to complete a fold, then, effectively, most of that time is wasted, and one will increase overall production by running an extra client. Of course, if they both get stuck on a particular fold, then it becomes worth it to have another client running, and so on and so on. The loss of productivity due to context switches will be more than made up for in fold production.

This leads to some difficulties: the current fold should be abandoned after some period of time (or the constraints relaxed enough). If this is done too soon, then a good fold may be lost. If this is done too late, then productivity drops.

That's why higher generations will be worth significantly more 'points' - to encourage you to go all the way, and not do what you just described...

Brian the Fist
02-25-2003, 04:29 PM
Originally posted by m0ti
WinXP Pro
CLI Client (default switches)

got the following error during Trajectory Distribution for Gen 44: FATAL ERROR: CoreLib [002.005] {ncbifile.c, line 697} File write error

When I restarted it started afresh from Gen 0.

Sounds like a full disk (or no permission to write). It cannot recover from that sort of error, for obvious reasons (probably your /tmp partition). This error comes directly from a failed fwrite (i.e. not all elements were written) and so is pretty straightforward.

Welnic
02-25-2003, 04:50 PM
Originally posted by Brian the Fist
That's why higher generations will be worth significantly more 'points' - to encourage you to go all the way, and not do what you just described...

So right now the later generations are not worth more points?

Also I have the beta running as a service on an XP box and the monitors never power off like they normally do.

Aegion
02-25-2003, 05:00 PM
Originally posted by Welnic
So right now the later generations are not worth more points?

Also I have the beta running as a service on an XP box and the monitors never power off like they normally do.
I don't believe the points system is really implemented yet in the beta. The point of the beta testing is to participate and locate bugs, not to obtain top place in the stats rankings.

lemonsqzz
02-25-2003, 05:19 PM
I vote for pointless beta testing... I do that at work all the time! Are the structures still valid though ??? :jester:

Brian the Fist
02-25-2003, 05:28 PM
The point system SHOULD be in place right now. Please test this for us too ;) You should, I believe, get 5000 points for gen 0 (but this may be changed to 200), and for gen x you should get 200*sqrt(x) points (OK, whip out those calculators). If this is NOT the case, please let me know and I'll check it out.
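In code form, the stated formula would be as follows (whether the server rounds, truncates, or keeps fractions is a guess):

```python
import math

def gen_points(x):
    """Credit per the stated formula: 5000 points for generation 0 (possibly
    changing to 200), and 200*sqrt(x) points for generation x > 0.
    Rounding to the nearest integer is an assumption."""
    if x == 0:
        return 5000
    return int(round(200 * math.sqrt(x)))
```

For example, gen 1 = 200, gen 4 = 400, gen 16 = 800 and gen 50 = 1414, which is handy for checking against your stats page.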


Aside from this, I've gone through the whole thread and identified 7 bugs and 7 features (including stuff Chris and I have decided to add) which I will now fix and/or shove into the next beta, which I should hopefully have ready later this week. Any further betas after this will likely be to play with parameters like the size and number of generations, to optimize those a bit more, but I think you've all done a really great job of nailing all the bugs and even potential bugs. You found some things I really didn't expect with such a relatively small testing group (under 100 of you, anyways).

Unless you find another new bug or have an important suggestion/feature to add which hasn't already been mentioned in this thread, let's pause here for now and I will get these changes done ASAP. With the next beta I may also release the screensaver, and hopefully a few of you will be willing to test that out as well, just to make sure there's nothing quirky specific to it (but remember, it's all really the same code, so most things should work the same in the screensaver as in the text client).

Thanks All!

Welnic
02-25-2003, 06:01 PM
Originally posted by Brian the Fist
The point system SHOULD be in place right now. Please test this for us too ;) You should, I believe, get 5000 points for gen 0 (but this may be changed to 200), and for gen. x you should get 200*sqrt(x) points (ok, whip out those calculators). If this is NOT the case please let me know and I'll check it out.

...

I would change the points for gen 0 to 200, or some people will just reset after gen 0 and do only those. But if you don't reset the overall stats, I would leave gen 0 at 5000 and change the rest to 5000 as well. Otherwise the scores already done will have too much weight.

If you don't make them equal, I'll demo the gen-0-only advantage when I get back from vacation next week. :D

Tawcan
02-25-2003, 10:19 PM
Alright I'm in as well.

XP1700+ @ 2400+ spec (10.5*192)
512MB RAM
Window XP Pro

Running DOS text client.

The client is going crazy generating a huge ASCII diagram. :shocked:

m0ti
02-26-2003, 12:20 AM
WinXP Pro,
CLI Client default switches:

ERROR: [777.000] {ncbi_http_connector.c, line 217} [HTTP] Error writing body at offset 8192

ERROR: [777.000] {ncbi_http_connector.c, line 117} [HTTP] Retry attempt(s) exhausted, giving up
ERROR: [000.000] {foldtrajlite2.c, line 3618} Error during upload: NO RESPONSE FROM SERVER - WILL TRY AGAIN LATER
ERROR: [777.000] {ncbi_http_connector.c, line 217} [HTTP] Error writing body at offset 8192
ERROR: [777.000] {ncbi_http_connector.c, line 117} [HTTP] Retry attempt(s) exhausted, giving up
ERROR: [000.000] {foldtrajlite2.c, line 3618} Error during upload: NO RESPONSE FROM SERVER - WILL TRY AGAIN LATER
FATAL ERROR: [012.002] {trajtools.c, line 2377} Attempt to insert duplicate residue number into database


The fatal error occurred during Trajectory Distribution.

m0ti
02-26-2003, 12:23 AM
Originally posted by Brian the Fist
Sounds like a full disk (or no permission to write). It cannot recover from that sort of error for obvious reasons. (Probably your /tmp partition). This error is directly from a failed fwrite (i.e. not all elements were written) and so is pretty straight-forward.

The disk is not full (all partitions have free space) and all permissions are there. This happened after leaving the client on the entire day.

Perhaps it couldn't upload to the server for a while, so the buffer on disk got full?

Or is the buffer now infinitely large?

m0ti
02-26-2003, 12:52 AM
Originally posted by Brian the Fist
The point system SHOULD be in place right now. Please test this for us too ;) You should, I believe, get 5000 points for gen 0 (but this may be changed to 200), and for gen. x you should get 200*sqrt(x) points (ok, whip out those calculators). If this is NOT the case please let me know and I'll check it out.



I think there's a definite advantage to the gen 0 group for sure. Gen 0 certainly doesn't take more time than any of the other generations (if anything it takes less), but even gen 50 is only worth 200 * sqrt(50) = 1414.

Perhaps gen 0 should be aggressively scaled down? I'm thinking something like a 100:1 ratio; make it worth 50 points. Basically, ensure that the higher generations are attractive enough that people don't want to stick around at gen 0.
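To put rough numbers on the incentive, using the formula as posted (with guessed rounding, and generously assuming every generation takes about the same time):

```python
import math

def full_run_points(last_gen=50, gen0=5000):
    """Total credit for one honest run from generation 0 through last_gen,
    with gen 0 worth `gen0` and gen x worth 200*sqrt(x)."""
    return gen0 + sum(round(200 * math.sqrt(x)) for x in range(1, last_gen + 1))

def gen0_farming_points(runs, gen0=5000):
    """Total credit for someone who restarts after every generation 0."""
    return runs * gen0
```

One full 51-generation run earns on the order of 50,000 points, while 51 generation-0 restarts in the same time would earn 255,000, which is exactly the exploit being worried about here.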

tpdooley
02-26-2003, 06:39 AM
For the final release, it'll be gen 0 = 5k, and later generations will be worth 5k*sqrt(gen)?

FoBoT
02-26-2003, 10:54 AM
this is why starting with new stats for phase II would be an advantage

if you continue to add to the old stats, you have two problems to deal with

1- making the stats of phase II comparable to phase I
2- structuring the inter-generational stats of phase II in a manner that encourages people to run through a complete sequence, which most benefits the science, vs. manually manipulating the process to squeeze out a few extra points (so that the science suffers from the manipulation)


doing both of these simultaneously will be challenging.
if you eliminate #1, you can concentrate on #2, which I think is more relevant to the current active participants


as m0ti points out, the points awarded for the generations need to be weighted to give incentive to letting the process run through its "natural" course; if this isn't the case, people will come up with ways (scripts, .bat files, 3rd party apps, etc.) to start/stop the client in a manner that is advantageous for gaining more points, not more scientific data

have a nice day! :)

AMD_is_logical
02-26-2003, 12:14 PM
Originally posted by FoBoT
this is why starting with new stats for phase II would be an advantage

if you continue to add to the old stats, you have two problems to deal with

1- making the stats of phase II comparable to phase I
2- structuring the inter-generational stats of phase II to encourage people to run a complete sequence that most benefits the science, rather than manually manipulating the process to squeeze out a few extra points (so that the science suffers from this manipulation)


doing both of these simultaneously will be challenging.

I disagree. I don't see any problem here. First adjust the relative scores for the various generations to accomplish (2), then scale everything to accomplish (1) using the assumption that most people will be running through all 50 generations.

For each new protein the overall scaling could be adjusted to give comparable credit for a given amount of CPU time.

Brian the Fist
02-26-2003, 05:26 PM
Let's not start arguing about stats again now. Somebody please just confirm whether you are being credited the proper amounts, as stated in the formula I gave earlier. This formula may change once the beta becomes non-beta.

mighty
02-26-2003, 06:59 PM
Originally posted by Brian the Fist
Let's not start arguing about stats again now. Somebody please just confirm whether you are being credited the proper amounts, as stated in the formula I gave earlier. This formula may change once the beta becomes non-beta.

I don't think the points are awarded as described. I just uploaded 87 generations and tried to update my stats page multiple times during the upload. I kept getting points in rather round and neat intervals, like 200, 400 or 800 per generation, but if it's 200*sqrt(X) then there should be some not-so-round numbers in between; gen 50, for example, should be 1414.

Of course this could easily be explained if you're doing some rounding up or down...

AMD_is_logical
02-26-2003, 08:08 PM
Originally posted by Brian the Fist
Let's not start arguing about stats again now. Somebody please just confirm whether you are being credited the proper amounts, as stated in the formula I gave earlier. This formula may change once the beta becomes non-beta.

I made a new account, and got the following:

gen 0 - 5000
gen 1 - 5200
gen 2 - 5400
gen 3 - 5600

So the 0'th gen gives 5000, and the rest give 200 each, with no sign of a sqrt(x).

Also, all numbers on the stats page seem to be multiples of 200.

Insidious
02-26-2003, 08:17 PM
Wouldn't it make more sense to make it 200 * x instead of using the square root of x?

I mean, the square root of 50 is only about 7 or so.
Stat whores will VERY quickly realize 7 gen 0 calculations take
MUCH less time than 50 generations.

-Sid

Scotttheking
02-26-2003, 08:37 PM
When is the OSX Beta coming?

Aegion
02-26-2003, 08:39 PM
Ok, I believe I have now discovered another bug, and I'm trying to figure out which type it is. Does anyone know if a .val file should be produced when quitting halfway through the first structure generated in a set? I quit in this situation, and when I restarted, the client got stuck on the first structure of the generation set for over an hour. When I checked the files, the .val file listed in the filelist was missing.

AMD_is_logical
02-26-2003, 09:41 PM
Originally posted by Insidious
wouldn't it make more sense to let it 200 * x instead of using the square root of x?

I mean, the square-root of 50 is only about 7 or so.
Stat whores will VERY quickly realize 7 gen 0 calculations takes
MUCH less time than 50 generations.

-Sid

You're overlooking two things. First, the amount you get for the 0'th generation will be reduced until it gives fewer stat points per CPU cycle than other generations.

Second, the 200 * sqrt(50) is for generation 50, NOT for all 50 generations. By the time you reach generation 50 you will have already gotten points for each and every generation from 0 to 49.
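To illustrate the point with the same assumed formula (200 * sqrt(gen) per generation and a flat 5000 for gen 0 — beta numbers, not final), the credit is cumulative across a sequence:

```python
import math

def total_points(max_gen, gen0=5000, base=200):
    """Cumulative credit for running one sequence from gen 0 through max_gen,
    assuming the beta formula discussed in this thread (not final)."""
    return gen0 + sum(base * math.sqrt(g) for g in range(1, max_gen + 1))

# A complete 0..50 run earns every generation's award along the way,
# not just the 200 * sqrt(50) ~= 1414 credited at generation 50 itself.
print(round(total_points(50)))  # ~= 52807
```

So restarting at gen 0 only pays off if the flat gen-0 award is left high enough to beat the later generations per CPU cycle, which is exactly what the rescaling is meant to prevent.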

Brian the Fist
02-26-2003, 09:47 PM
Thanks for pointing out the scoring bug, I suspected it was in error.

Any other comments or questions about scoring for the beta will be officially ignored. You will just have to trust us to make it fair. :smoking:

Brian the Fist
02-26-2003, 09:48 PM
Originally posted by Aegion
Ok, I believe I have now discovered another bug, and I'm trying to figure out which type it is. Does anyone know if a .val file should be produced when quitting halfway through the first structure generated in a set? I quit in this situation, and when I restarted, the client got stuck on the first structure of the generation set for over an hour. When I checked the files, the .val file listed in the filelist was missing.

If it has not completed one structure in a generation, there should be no .val file. The .val IS the structure, after all. No structure, no .val. Partial structures are not (and cannot be) saved.

Aegion
02-26-2003, 09:51 PM
Originally posted by Brian the Fist
If it has not completed one structure in a generation, there should be no .val file. The .val IS the structure, after all. No structure, no .val. Partial structures are not (and cannot be) saved.
In that case, I've definitely found an instance I can duplicate where, during the crunching of a single structure, it gets stuck at the same point for over an hour. Should I email the pertinent files for you to examine?

Brian the Fist
02-26-2003, 10:54 PM
Originally posted by Aegion
In that case, I've definitely found an instance I can duplicate where, during the crunching of a single structure, it gets stuck at the same point for over an hour. Should I email the pertinent files for you to examine?

No, but please clarify exactly what you are talking about. What options (flags) did you run with, and exactly what did you observe that appears to be wrong? How do you know it gets stuck at the same point? What generation is it at? And getting stuck is not a bug, remember? What is it that you think is wrong here, exactly?

Aegion
02-26-2003, 11:14 PM
Originally posted by Brian the Fist
No, but please clarify exactly what you are talking about. What options (flags) did you run with, and exactly what did you observe that appears to be wrong? How do you know it gets stuck at the same point? What generation is it at? And getting stuck is not a bug, remember? What is it that you think is wrong here, exactly?
I'm watching it, for starters. Just to be clear, it is getting stuck at 72-73 on #1, generation 28 of my set. When I let it run, it definitely does not eventually move forward, but stays stuck perpetually in the exact same place for over an hour. (It does occasionally vary the number slightly but always remains stuck; it appears it might get stuck in the high 60's instead of the 70's sometimes.) I'm running an Athlon 2000+ system without anything else CPU-intensive running, so speed is not the issue here. I have been able to verify it does NOT eventually move to #2, generation 28 after running the structure for over an hour in each case. It stays stuck at exactly the same position. It's possible that after several hours it might move to the next structure, but it definitely takes a vastly longer time than it should. The same behavior occurs when I use the q option to close the client and then reload it.

I'm running it on a Windows XP system. My settings are .\foldtrajlite -f protein -n native -qf -df -it -rt

My filelist displays the following:
.\fold_1_7vshcwgg_0_7vshcwgg_protein_27.log.bz2
.\7vshcwgg_1_7vshcwgg_protein_27_0000005.val
CurrentStruc 1 1 123 28 1 0 10000000.000

edit: The structure does try outright resetting from time to time, but it always gets stuck at the same point.

update: The structure did finally move on to the next one after a couple of hours. I can still send you the file with the structure so that you can examine its behavior, since I backed it up in a separate location. Unfortunately, the current structures are also exhibiting similar behavior.

Brian the Fist
02-27-2003, 01:03 PM
This is normal behaviour Aegion, nothing wrong here. You may not like it getting stuck for so long, but it can happen. I may fiddle with this a bit still before the final release but it will never go away completely. In the long run everything will balance out though.

Aegion
02-27-2003, 01:09 PM
Originally posted by Brian the Fist
This is normal behaviour Aegion, nothing wrong here. You may not like it getting stuck for so long, but it can happen. I may fiddle with this a bit still before the final release but it will never go away completely. In the long run everything will balance out though.
Ok, I do have to wonder if it is somewhat counterproductive to allow the client to expend so much time on a single structure.

Brian the Fist
02-27-2003, 06:07 PM
Originally posted by Aegion
Ok, I do have to wonder if it is somewhat counterproductive to allow the client to expend so much time on a single structure.

That is a matter for study and research, which we have performed, and not something which can just be decided on a whim, or even intuition, unfortunately. You'll just have to trust that we know what we are doing :crazy:

m0ti
02-27-2003, 06:42 PM
Originally posted by Brian the Fist
That is a matter for study and research, which we have performed, and not something which can just be decided on a whim, or even intuition, unfortunately. You'll just have to trust that we know what we are doing :crazy:

I think the results the beta has produced so far justify the time spent per fold. There may be more efficient ways of balancing things, but I'm sure that Howard and Dr. Hogue have taken a good look at it; after all, we're after top-notch structures in possibly narrow valleys, which are very compact... it can take a lot of folding time to get to them.

Just to point out how good a job the new algorithm has done:

we've got less than 100 users and we've done some 15 Million folds in about a week. The entire top 10 folds are better than the best fold we found in 11 Billion folds under the old algorithm in 3 weeks. Yes, the new algorithm is slower, but it produces much higher quality folds. I'm eagerly awaiting the release of the final beta and the algorithm then going into general use.

Aegion
02-27-2003, 06:49 PM
Originally posted by m0ti
I think that looking at the results the beta has produced so far justifies the time spent per fold. There may be more efficient ways of balancing things, but I'm sure that Howard and Dr. Hogue have taken a good look at it; after all we're after top-notch structures in possible narrow valleys, which are very compact... it can take a lot of folding time to get to them.

Just to point out how good a job the new algorithm has done:

we've got less than 100 users and we've done some 15 Million folds in about a week. The entire top 10 folds are better than the best fold we found in 11 Billion folds under the old algorithm in 3 weeks. Yes, the new algorithm is slower, but it produces much higher quality folds. I'm eagerly awaiting the release of the final beta and the algorithm then going into general use.
I certainly was not questioning the new algorithm in general, just a specific aspect of it. I definitely recognize its overall potential and how it has improved over the old one. If Howard has carefully researched my issue and determined that it should remain as is, I'll take his word for it. I was simply bringing up a possible area of concern.

m0ti
02-28-2003, 10:54 AM
I got that write error again (this time at generation 43, again during trajectory distribution). This is highly annoying since the current line is completely lost. I had produced a very good fold (6.24 RMS) and would have liked to continue to generation 50 with it.

Any chance of doing a backup of the needed files before doing energy minimization and trajectory distribution? That way, in case of an error, it can resume by trying the energy min and traj distribution again instead of resetting to gen 0.

Again: this is not a write-permission problem, and this is not a disk-space problem. Interestingly enough, this ONLY occurs during trajectory distribution.

Brian the Fist
02-28-2003, 01:09 PM
Originally posted by m0ti

we've got less than 100 users and we've done some 15 Million folds in about a week. The entire top 10 folds are better than the best fold we found in 11 Billion folds under the old algorithm in 3 weeks. Yes, the new algorithm is slower, but it produces much higher quality folds. I'm eagerly awaiting the release of the final beta and the algorithm then going into general use.

Actually, remember the server is counting 5000 for gen 0 and 200 for gen 1+, when in reality you are submitting only 500 and 20 respectively. Thus although it says we've made 15 million or whatever, it is actually 1/10th of that. So we're another 10 times better than you thought :)

Brian the Fist
02-28-2003, 01:11 PM
Originally posted by m0ti
I got that write error again (this time at generation 43 - again during Trajectory distribution). This is highly annoying since the current line is completely lost. I had produced a very good fold (6.24 RMS) and would have liked to continue to generation 50 with it.

Any chance of doing a backup of the needed files before doing energy minimization and trajectory distribution? That way, in case of an error, it can resume by trying the energy min and traj distribution again instead of resetting to gen 0.

Again; this is not a write permission problem, and this is not a disk-space problem. Interestingly enough, this ONLY occurs during Trajectory Distribution.

Has ANY other beta tester received this same error (File Write Error)?? I am still not convinced it is a bug; I am 99.999% certain it is a problem with writing to your TEMP dir/partition. Maybe it is NFS-mounted or something weird like that, if it is not full. If the problem persists, please give me a detailed description of your OS, your partition/drive scheme, which filesystems are remote, the value of your TEMP environment variable or any other temp-directory indicators, and the free space on each of them.

TEN-Catdaddy63
02-28-2003, 01:50 PM
I've been running the beta as a service on a C1200 machine with 256 MB of PC133 and Windows 2000. I installed it last Friday evening; it has completed 2 full sequences and is working on gen 19 of number three. I have this machine set to useram=0 and have had absolutely no problems. I may try useram=1 over the weekend and see if that causes any issues. Nice job so far Howard, looking very good!

KWSN_Millennium2001Guy
02-28-2003, 05:28 PM
I was experiencing the same write error on one machine. It turned out that there were only 4 or 5 megabytes free on the drive where the temp directory was located, because the user had set IE to cache 10 gigs of history.

I deleted the IE file cache and the machine began running again.

Ni!:rolleyes:

Brian the Fist
02-28-2003, 05:57 PM
Beta 2 has now been 'terminated'. Please see the thread on Beta 3 to continue beta testing. Thanks.