Beta 4 now available [Archive]

Brian the Fist

03-11-2003, 03:58 PM

Again, files are at:

ftp://ftp.mshri.on.ca/pub/distribfold/download/distribfold-beta-linux-i386.tar.gz

for Linux

ftp://ftp.mshri.on.ca/pub/distribfold/download/distribfold-beta-win9x.zip

for Windows text client

ftp://ftp.mshri.on.ca/pub/distribfold/download/distribfoldss-beta-win9x.zip

for Windows screensaver

Please see the enclosed readme for what has changed. Beta 3 will no longer work (sorry folks, but this IS a beta test). For installation instructions for the screensaver, see the beta 3 thread.

For any of these betas, either download a clean copy from the website again and overwrite the files with the contents of the above archives, or at a minimum delete your filelist.txt and overwrite an existing beta 3 installation with the contents of the above archive.

At this point we are optimizing the algorithm more than the software itself, but please do report in this thread any more bugs that you find (or think you have found).

Thanks.

(the links above were broken before - should be right now)

m0ti

03-11-2003, 04:48 PM

You mentioned that we're using RMSD again for scoring.

Does this mean that the client is no longer working ab initio for this protein?

Brian the Fist

03-11-2003, 04:56 PM

Originally posted by m0ti
You mentioned that we're using RMSD again for scoring.

Does this mean that the client is no longer working ab initio for this protein?

That is correct - we want to look at the 'folding graphs' with an ideal scoring function now, and see a) how folded it gets in 250 generations, b) how much the energies go up and down along the folding pathway, c) at what point it is useless to continue to more generations (hopefully fewer than 250), and so on.

This will help us decide how to parameterize it for the final release (which may be next if this is the last beta).

Akkermans

03-11-2003, 05:14 PM

Howard,

Does this mean that we should let our PCs run for a long period to generate a large number of structures/generations?

Cheers,
Akkermans

Brian the Fist

03-11-2003, 05:26 PM

Try to get to 250 generations please, yup. Then it will restart at 1 again (like it did at generation 50 until now). And like I said, if you see any bugs on the way, let us know, but we don't expect any (except that weird one from AMD_is_logical which I cannot fix yet).

pfb

03-11-2003, 06:01 PM

Something I am seeing under XP with Beta 4 is that CSRSS.EXE is taking a fair chunk of CPU time (anywhere from 10 to 25% - :cry: - via Task Manager - see below) when run as a normal console (command window) application and only goes away if I run in quiet mode (via dfGUI v2.2)...I will try as a service and see if the same problem occurs but this hasn't been observed with previous betas and I had restarted the PC before running Beta 4...

According to MS's site (http://support.microsoft.com/default.aspx?scid=kb;en-us;175788) this may be due to 'A command window will cause the CSRSS process to use more resources on a computer running Windows NT depending on the properties that are selected for the CMD window.' but I can't see any XP options which could cause this...

System:

Pentium 4 @ 2.7GHz, 512MB RAM
Radeon 8500 with Catalyst 3.1 drivers
Clean install of Beta4

Betas 1 to 3 and standard DF install don't cause this - only Beta 4 does...anyone else experiencing similar problems with CSRSS?

Screen shot of usage:

http://www.brownma.pwp.blueyonder.co.uk/df_csrss.jpg

edit - service is the same as quiet mode...and after another reboot the problem still exists when run as a normal console application.

mighty

03-11-2003, 06:32 PM

Originally posted by pfb
Betas 1 to 3 and standard DF install don't cause this - only Beta 4 does...anyone else experiencing similar problems with CSRSS?

On my system it has been that way ever since the first version I ever installed. Through all the previous proteins and through all the betas. They all take up to 10-15% CPU on CSRSS.EXE whenever I run it in ASCII mode.
The only way to get rid of that is to run it in quiet mode.

For Howard:
You say 250 generations, but in the readme from beta4 it says 300 generations. Which one is it? :confused:

Regards
Ole

PS: Running on w2k

bwkaz

03-11-2003, 07:07 PM

Segfaults! Fun, fun! :(

I've just grabbed beta 4, decompressed a fresh (latest non-beta, that is) normal tarball, then decompressed the beta 4 tarball over top of it. When running the client, it has crashed at structs 5, 9, 23, 67, 101, 120, 119 (which was after 120 -- it must not have made it back from the backtrack point), 124, and 127, each time with the same message in error.log:

ERROR: [001.001] {foldtrajlite2.c, line 1351} Caught sig 11

I am not doing anything strange with it that I know of.

Switches are -it -rt

I realize this probably isn't a great bugreport, but I'm not sure of what to try to reproduce it (other than just run the client).

If it helps, system details:

Linux -- LFS 3.3, with some updated packages
gcc 3.2, with all packages compiled using it, so no compatibility libraries exist.
glibc 2.2.5+linuxthreads (a pthreads implementation)
kernel 2.4.20+preempt
ncurses 5.2, with the required (I think...) libncurses.so.4 symlink

The normal (non-beta) client was running the first few times it crashed, so I thought that might have had something to do with it, but stopping the normal client didn't help anything. The first four crashes were with the normal client running, the fifth through ninth were with nothing else running.

I was running the foldtrajlite executable under gdb on the last crash. A stack backtrace produced:

#0 0x08051801 in strcpy ()
#1 0x096447c8 in ?? ()
#2 0x08052956 in strcpy ()
#3 0x080549f9 in strcpy ()
#4 0x08054a10 in strcpy ()
#5 0x0837e475 in error ()
#6 0x400c6102 in __libc_start_main () from /lib/libc.so.6

So I don't think that helps much. :(

My guesses -- it might be something with backtracking or with loosening the "laxness", since it seems to not happen consistently. Or it might be something with the gcc version. Or it might be something with the ncurses 4 symlink (was that required or not? the normal client -- icc version -- doesn't need it, but the beta complained about a missing lib until I added it). Obviously I have no real idea, though, since my executable doesn't have debug symbols in it, so the backtrace appears to be nonsense.

I'm willing to try anything you think would help.

Digital Parasite

03-11-2003, 08:44 PM

I am now running beta4 on my WinXP box (-rt -g 1).

As soon as I started the client it said 1 gen. buffered even though it started on generation 0 doing the 10000 initial structures. Shouldn't this still read 0 gen. buffered until it has finished the generation 0 and started generation 1?

Also, you said we were switching to RMSD which appears to be the case since I am now reporting 13.762 but the progress.txt file still reads: "Best Energy so far: 13.762"

For the eventual release of the new algorithm are you going to stick with RMSD or did you just switch to that for beta4 and we will be going back to best energy?

Jeff.

Georgina

03-11-2003, 10:10 PM

Originally posted by pfb

Betas 1 to 3 and standard DF install don't cause this - only Beta 4 does...anyone else experiencing similar problems with CSRSS?

I am currently running on W2k server. I don't have a screen shot but I watched Task Manager for a minute or so and CSRSS used between 00 and 03 CPU.

G

Aegion

03-11-2003, 11:14 PM

One issue I'm still noticing with the beta is that the beta website still labels the point score we receive as "# of structures generated". This is no longer accurate, since the beta now tabulates your score from the units you process. This issue will also lead to a huge number of "bug reports" from individuals noticing disparities in their structures being counted if released to the non-beta-testing public in its current form. This needs to be addressed before the beta is finished. The current "# of structures generated" label should be replaced with "total points" or some other catchy-sounding name that makes it clear that the total number of structures generated is no longer what is being counted.

Brian the Fist

03-12-2003, 11:18 AM

Originally posted by Digital Parasite
I am now running beta4 on my WinXP box (-rt -g 1).

For the eventual release of the new algorithm are you going to stick with RMSD or did you just switch to that for beta4 and we will be going back to best energy?

Jeff.

We'll be going back to best energy, don't worry about that.

Brian the Fist

03-12-2003, 11:19 AM

Originally posted by Aegion
One issue I'm still noticing with the beta is that the beta website still labels the point score we receive as "# of structures generated". This is no longer accurate, since the beta now tabulates your score from the units you process. This issue will also lead to a huge number of "bug reports" from individuals noticing disparities in their structures being counted if released to the non-beta-testing public in its current form. This needs to be addressed before the beta is finished. The current "# of structures generated" label should be replaced with "total points" or some other catchy-sounding name that makes it clear that the total number of structures generated is no longer what is being counted.

Will do.

Brian the Fist

03-12-2003, 11:21 AM

Originally posted by mighty
On my system it has been that way ever since the first version I ever installed. Through all the previous proteins and through all the betas. They all take up to 10-15% CPU on CSRSS.EXE whenever I run it in ASCII mode.
The only way to get rid of that is to run it in quiet mode.

For Howard:
You say 250 generations, but in the readme from beta4 it says 300 generations. Which one is it? :confused:

Regards
Ole

PS: Running on w2k

250 generations.
readme is wrong
If csrss hogs resources when running in ASCII mode, then run in quiet mode (when not beta-testing) or use Linux :D . I have no control over how Windows manages its resources...

Brian the Fist

03-12-2003, 11:22 AM

Originally posted by bwkaz
Segfaults! Fun, fun! :(

My guesses -- it might be something with backtracking or with loosening the "laxness", since it seems to not happen consistently. Or it might be something with the gcc version. Or it might be something with the ncurses 4 symlink (was that required or not? the normal client -- icc version -- doesn't need it, but the beta complained about a missing lib until I added it). Obviously I have no real idea, though, since my executable doesn't have debug symbols in it, so the backtrace appears to be nonsense.

I'm willing to try anything you think would help.

I'll try it on my box first, and if that doesn't crash I'll send you a copy with symbols not stripped out. Send an e-mail to [email protected] so I have your e-mail address and then I can send it to you if need be. Thanks.

Welnic

03-12-2003, 02:16 PM

I think that the behavior of progress.txt needs to change a little. With the normal client even if you had the -g switch set to 100 it was still only a couple of minutes before the file was first written. Now even at the default setting, (which seems to be 5), it can be a long time after you restart in the later generations before the file gets written. Some people look at this file after foldtrajlite.lock is removed to check if the client is truly finished.

I think that when you first restart the program it should make the file right away listing the last fold that was done. I also think that it should be updated when a generation is finished.

Brian the Fist

03-12-2003, 02:57 PM

That's easy enough to change - I'll do it for the next version.

Welnic

03-12-2003, 04:56 PM

Originally posted by Brian the Fist (from another thread)
We're a step ahead of you - the 3 minutes is not the limit anymore - it now 'times out' after 3 minutes or about 250000 tries (to place a residue), whichever comes first. On a 2GHz machine, these turn out to be about the same. The 250000 tries was carefully chosen based on previous observations and is not arbitrary. It will not be lowered for any reason as doing so will damage potential results.

So if you want it to try 250000 times why have the 3 minute cutoff? I admit it would be a bummer to watch it stuck for 12 minutes on a 500MHz machine, but it is also a bummer to have a 500MHz machine for folding. People just shouldn't watch.

It does seem to me that I can get more structures done by running 4 instances of the client so that it only does 62500 tries and gives up after 3 minutes. And AMD_is_logical has demonstrated that changing the clock rate works wonders for score production. I have 9 dedicated processors in my farm that I can change the clock rate on (I would imagine that I would eventually figure it out), but if you want it to try 250000 times I would rather do that as long as everyone did.
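As a rough illustration of the "3 minutes or about 250000 tries, whichever comes first" rule from the quoted post, here is a minimal Python sketch. It is not the client's code: `place_residue`, `try_once` and the injectable `clock` argument are hypothetical names used only for this example. It also shows why a sped-up clock makes the loop give up after far fewer tries.

```python
import time

MAX_SECONDS = 180.0   # the 3-minute limit mentioned in the quoted post
MAX_TRIES = 250_000   # the try limit mentioned in the quoted post


def place_residue(try_once, max_seconds=MAX_SECONDS, max_tries=MAX_TRIES,
                  clock=time.monotonic):
    """Call try_once() until it succeeds or either limit is reached.

    Returns (succeeded, tries_used). try_once and clock are hypothetical
    stand-ins; the real client's internals are not known here.
    """
    start = clock()
    tries = 0
    while tries < max_tries:
        # Time limit: give up once the clock says max_seconds have passed.
        if clock() - start >= max_seconds:
            return False, tries
        tries += 1
        if try_once():
            return True, tries
    # Try limit reached first.
    return False, tries


# A clock that jumps 60 "seconds" per reading hits the time limit
# after only two attempts.
ticks = iter(range(0, 100_000, 60))
fast_result = place_residue(lambda: False, clock=lambda: next(ticks))
```

With a normal clock and a `try_once` that never succeeds, the same function stops at whichever limit is hit first on that machine.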

Brian the Fist

03-12-2003, 06:32 PM

Yes, AMD_is_logical has certainly discovered another way to 'cheat', in terms of pushing up the score by speeding up the clock. We will definitely make sure that doesn't stay for the final version. I am interested, however, to see how his results compare to everyone else's. If you look at the graphs for his movie, the RMSD initially decreases quickly to about 6, but has not shown improvement since then. Once everyone else catches up to him, so to speak, in a couple of days, it will be interesting to see if they do any better. This will help us decide what to use as the limit cutoff. It seems that speeding up the clock to make it give up early may get more structures made, but may hurt the results in terms of the quality of the structure. In the end, the timeout will likely depend on total number of tries only (not on real time) and will likely depend on several factors: the length of the protein, the generation, and how far along the protein chain it is getting stuck. We shall see though, gotta analyze more data first.

bwkaz

03-12-2003, 06:45 PM

All right, my segfault problem appears to have been random. The problem was that it failed to write any filenames to filelist.txt after struct 5 on the first run; every crash after that was because the program was expecting filenames to be there but there were none.

I suspected at first that the reason nothing was written was that I was LD_PRELOAD'ing the libhpi.so file from Java 1.4.1 (so that Java itself would load and work), because when I got rid of that and deleted filelist, it started working. But redeleting filelist and setting LD_PRELOAD also makes it work just fine, so it wasn't just LD_PRELOAD.

I wish I knew what caused it, but at least it works now. If anyone is seeing anything like that in the future with beta 4 (this specific problem will be handled with a more specific error message than "caught sig 11" in later client versions), check whether filelist.txt contains any filenames, or if it's just a CurrentStruc line and a digital signature (bunch of gobbledygook characters). If it's just a CurrentStruc line and a signature, then try deleting it.

I've also been experimenting a little with quitting at strange times, like before the first struct was created, and I can't force the program to create a bad filelist anymore. I know I didn't just delete the two lines referencing the files, because I tried that again and it (that is, filelist.txt) failed the digital signature validation.

The one interesting thing was that when I hit ctrl-c at some point during the generation of struct 5, I was able to reproduce something I had forgotten about the first time through -- the ncurses stuff got deinitialized, but the foldtrajlite process never exited (until I sent it a TERM signal from another shell). But even that didn't corrupt the filelist.txt, so it's not like it was a problem, even though subsequent ctrl-c's weren't accepted.

Wish I knew what caused the first problem... oh well, though, it works now. :thumbs:

Digital Parasite

03-13-2003, 06:55 AM

I have been running beta4 with -rt on my P3-800 WinXP box for about a day and a half now.

FYI, It seems to be taking me about 10 hours to do 5 generations, so about 2 hours/generation which means that I will finish all 250 generations in about 21 days total. :eek: Time for a faster computer...

Jeff.

Brian the Fist

03-13-2003, 01:22 PM

AMD_is_logical - can you explain to me exactly what you are doing to get the accelerated performance here (I know speeding up your clock, but by how much, and anything else)? This will be useful to know when we compare your results to those of the others and help us make a final decision on how the timeout really should work.

mighty

03-13-2003, 02:38 PM

Running on w2k, switches -rt. Running regular client, not as a service.
Running on an Athlon XP 2100+ (not overclocked!)

I don't know if this is a bug, but at the end of generation 64 the client crashes every time. It writes the following on-screen and in error.log:
"FATAL ERROR: [023.024] {trajtools.c, line 2492} RLEUnPack failed, size=2, should=400*0 - likely this is caused by overclocked or faulty RAM chips, please test your RAM"

I know the message suggests faulty equipment on my part, and of course that is a possibility, but I have been running the "real" client for over a year and all the betas without any problems whatsoever.
The other thing that puzzles me about the error is the fact that after completing structure 50 of generation 64 the client goes straight to the "generation trajectory for gen 65" message. It skips the "minimizing energy" part.

I have saved all the files at a point right before the crash, so whenever I restart, it makes structure 50 and then crashes. This can be repeated over and over again (I have done it about 5 times so far).

I am currently running "MemTest" to see if I have developed faulty RAM, but so far no errors.

Brian the Fist

03-13-2003, 03:24 PM

When exactly does it crash, at what point - at the end of making the trajectory distribution or the beginning? or the middle?

It is possible that there is a problem with the structure from the previous generation. If you like you can e-mail me the *.val , *.log.bz2, filelist.txt and the *.trj file with the highest number in it (send to [email protected]) and I can look and see if anything is amiss (or at least verify if I get the same error).

tpdooley

03-13-2003, 03:57 PM

When I log onto the Beta High Flyer's page, it informs me that I'm in 10th place on the RMS race. And that I've contributed 2686 structures. And my team has contributed 2474 structures.
Which one of my teammembers in The Genome Collective uploaded -212 structures? ;)

Ah.. refreshed the screen again, and my score now matches the team score.

mighty

03-13-2003, 04:08 PM

Originally posted by Brian the Fist
When exactly does it crash, at what point - at the end of making the trajectory distribution or the beginning? or the middle?

It is possible that there is a problem with the structure from the previous generation. If you like you can e-mail me the *.val , *.log.bz2, filelist.txt and the *.trj file with the highest number in it (send to [email protected]) and I can look and see if anything is amiss (or at least verify if I get the same error).

It crashes at the start of making the trajectory distribution. Maybe 2 or 3 seconds into it. I'll mail you the files you mention.

I also tried copying the entire client directory to another computer (866MHz Intel P3 laptop running w2k) and it made the exact same error at the exact same time, so I am almost positive it's not due to bad RAM or otherwise faulty hardware.

Aegion

03-13-2003, 04:09 PM

Originally posted by tpdooley
When I log onto the Beta High Flyer's page, it informs me that I'm in 10th place on the RMS race. And that I've contributed 2686 structures. And my team has contributed 2474 structures.
Which one of my teammembers in The Genome Collective uploaded -212 structures? ;)

Ah.. refreshed the screen again, and my score now matches the team score.

There is a delay in updating the general stats, while your personal stats are updated immediately. This is the reason for the discrepancy.

Welnic

03-13-2003, 04:15 PM

Originally posted by mighty
It crashes at the start of making the trajectory distribution. Maybe 2 or 3 seconds into it. I'll mail you the files you mention.

I also tried copying the entire client directory to another computer (866MHz Intel P3 laptop running w2k) and it made the exact same error at the exact same time, so I am almost positive it's not due to bad RAM or otherwise faulty hardware.

It really doesn't sound like it's bad RAM from things you said in your earlier post. But on the last test that you did bad RAM could have messed up one of the files that was saved, so at that point it would crash any computer.

AMD_is_logical

03-13-2003, 04:27 PM

Originally posted by Brian the Fist
AMD_is_logical - can you explain to me exactly what you are doing to get the accelerated performance here (I know speeding up your clock, but by how much, and anything else)? This will be useful to know when we compare your results to those of the others and help us make a final decision on how the timeout really should work.

I wrote a little program that gets the time (as an integer, seconds since 1970 or some such thing) and puts it in a local variable. It then goes into an infinite loop where it sleeps for one second, adds 60 to the variable, then sets the time to that value. This runs in the background on the node. The structures seem to time out after 4 seconds.
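The loop described above (sleep one second, add 60, set the clock) can be simulated without touching the system clock. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, the 180-second figure is the 3-minute timeout mentioned elsewhere in this thread, and the real program's language and system calls are not given in the post.

```python
import math


def accelerated_times(start, jump=60, steps=3):
    """Return the values the loop would write to the clock, one per real second."""
    t = start
    values = []
    for _ in range(steps):
        # The real program would sleep for one second here...
        t += jump
        # ...and then set the system clock to t (deliberately not done here).
        values.append(t)
    return values


def real_seconds_to_timeout(timeout_s=180, jump=60):
    """Idealized real seconds before the clock appears to advance by timeout_s."""
    return math.ceil(timeout_s / jump)


clock_values = accelerated_times(1000)
ideal_wait = real_seconds_to_timeout()
```

This idealized count ignores sleep overhead and the moment within a second at which a structure starts, so the observed figure of about 4 seconds reported in the post is somewhat higher than the estimate.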

I have two nodes on the AMD_is_logical account. They use XP1800 CPUs and run linux. They use -rt -if -qt -p0 -g0 for flags. (The -g0 seems to protect them from that bug which we couldn't find in beta3. I am working on that bug on another node.)

I plan on being consistent with that account. If I decide to try something else, I'll do it to a different account.

Welnic

03-13-2003, 05:30 PM

I like having the Best RMS (Energy) so far in progress.txt. It is nice to see that decrease over the generations.

The normal client cranks through the folds at a predictable rate, but it is a total crapshoot on the results. The beta is random on how long it takes to do the folds, but gets consistently better results as the generations build. The clients obey Fredstien's Third Law of the Conservation of Randomness.

tpdooley

03-14-2003, 04:37 AM

While looking at the (details) results of the top ten beta folders, I noticed that I couldn't find a way of seeing my details after dropping off the top ten list. I'd plummeted until about generation 15, and then started climbing. So I looked at everyone else's.
We seem (with a very small sample to base this opinion on) to hit a valley between generations 15 and 40, then climb for a while, hit a peak, and plummet again.
To speed the return to a decline, would the following test work? A. if the last generation was not less than the average of the last 10 generations (to allow short spikes up), or B. if the last generation was not less than either of the 2 preceding generations (to return to normal after we're beyond the peak), then dramatically reduce the amount of time/number of tries spent testing for possibilities.
Once the client sees that it's in a decline again, return to normal.
(Just a thought about how to use AMD_is_logical's approach to help speed up the wasted time climbing up to a peak.)
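One possible reading of the A/B test suggested above, as a small Python sketch. The interpretation is an assumption: "the last 10 generations" is taken to mean the up-to-10 generations before the latest one, and rule B is taken to mean the latest score is no better than both of the two generations before it. `should_cut_tries` is a hypothetical name, and lower scores (RMSD) are treated as better.

```python
def should_cut_tries(scores, window=10):
    """Decide whether to reduce the time/tries budget for the next generation.

    scores: best RMSD per generation, oldest first (lower is better).
    """
    if len(scores) < 3:
        return False  # not enough history to apply rule B
    last = scores[-1]
    previous = scores[-(window + 1):-1]  # up to `window` generations before the last
    # Rule A: the latest generation is no better than the recent average.
    rule_a = last >= sum(previous) / len(previous)
    # Rule B: the latest generation is no better than either of the two before it.
    rule_b = last >= scores[-2] and last >= scores[-3]
    return rule_a or rule_b


improving = should_cut_tries([10.0, 9.0, 8.0])
climbing = should_cut_tries([8.0, 9.0, 10.0])
```

Once `should_cut_tries` returns False again, the client would go back to its normal budget, as the post proposes.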

AMD_is_logical

03-14-2003, 06:08 AM

I just added two nodes (also XP1800) to my "AMD beta test account". The difference between these nodes and the ones on my other account is that here I'm using +185 in my realtime program, which gives a timeout of 2 seconds. I'm hoping that comparing results from the two accounts will give an idea of what the effect of changing the timeout is (when using a short timeout).

With this timeout, the nodes spend more time doing the minimize/trajectory stuff than actually crunching the 50 structures.

Brian the Roman

03-14-2003, 07:53 AM

Howard;
once you've decided on things like initial sample size, sample size per generation, # of generations etc., will you return to a pure ab initio approach?

ms

Brian the Fist

03-14-2003, 10:32 AM

Originally posted by Brian the Roman
Howard;
once you've decided on things like initial sample size, sample size per generation, # of generations etc., will you return to a pure ab initio approach?

ms

We will likely use 'crease energy' as the scoring function instead of RMSD, if that is what you mean. We can compute the crease energy with no knowledge of the correct folded structure, so it is 'ab initio'.

Brian the Fist

03-14-2003, 10:34 AM

Originally posted by AMD_is_logical
I just added two nodes (also XP1800) to my "AMD beta test account". The difference between these nodes and the ones on my other account is that here I'm using +185 in my realtime program, which gives a timeout of 2 seconds. I'm hoping that comparing results from the two accounts will give an idea of what the effect of changing the timeout is (when using a short timeout).

With this timeout, the nodes spend more time doing the minimize/trajectory stuff than actually crunching the 50 structures.

Can you do me one more small favor? From the node which generated your 5.5A movie (that is sped up, from what I understand, by a factor of 60), can you post/e-mail me the filelist.txt from it when it is on roughly structure #20 in generation 50 or higher (I assume now that it reached 250 it has reset and is going again on a new set)? I want to see the numbers in there which tell me how 'lax' the structures you are producing are. Thanks.

Welnic

03-14-2003, 03:46 PM

With the new way of folding I would like to have a results.txt file. This would have the best score from each generation listed so we could make our own RMS improvement graphs for each processor.

run:1
gen0:14.267
gen1:11.398
gen2:11.243
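If a results.txt file in the format proposed above existed, reading it back for graphing could look like the Python sketch below. This is only an illustration of the suggested layout; the file is a proposal in this thread, not an existing client feature, and `parse_results` is a hypothetical name.

```python
def parse_results(text):
    """Parse 'run:N' / 'genK:score' lines into {run: [(generation, score), ...]}."""
    runs = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        if key == "run":
            current = int(value)
            runs[current] = []
        elif key.startswith("gen") and current is not None:
            # 'gen12' -> generation 12, value is the best score for it
            runs[current].append((int(key[3:]), float(value)))
    return runs


sample = "run:1\ngen0:14.267\ngen1:11.398\ngen2:11.243"
parsed = parse_results(sample)
```

The per-run list of (generation, score) pairs can then be fed to any plotting tool to draw an improvement graph for each processor.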

Brian the Fist

03-14-2003, 04:14 PM

Perhaps it would be better if I distributed the software we have made to generate all those graphs (since you can already download your best movie, and that is all that is required for input). I'll discuss it with the boss and maybe that'd be a better and more interesting solution.

AMD_is_logical

03-14-2003, 04:23 PM

Originally posted by Brian the Fist
Can you do me one more small favor? From the node which generated your 5.5A movie (that is sped up, from what I understand, by a factor of 60), can you post/e-mail me the filelist.txt from it when it is on roughly structure #20 in generation 50 or higher (I assume now that it reached 250 it has reset and is going again on a new set)? I want to see the numbers in there which tell me how 'lax' the structures you are producing are. Thanks.

I ran the client with quiet mode off (but otherwise with the same switches). I stopped it around structure number 20 and logged the contents of the filelist.txt file. I repeated for several generations, and did the same for two other nodes as well.

I assume you just want the CurrentStruc line. (Let me know if you need something else. I still have the entire contents logged.)

Here's the 60x (4 sec timeout) node that produced that structure.
It was doing generation 168.

CurrentStruc 1 20 123 168 1 6 7.291 -272.471 1497.552 207.135 11748199.000 1.400 2.600 1163.098
CurrentStruc 1 21 123 169 1 19 7.207 -284.427 978.283 241.184 9940255.000 1.400 2.600 1163.098
CurrentStruc 1 20 123 170 1 11 7.249 -469.606 1236.111 185.136 8855551.000 1.300 2.400 879.469
CurrentStruc 1 21 123 172 1 14 7.214 -100.283 2014.440 299.015 18499776.000 1.300 2.400 879.469

Here's the other 60x node. (Although it runs at the same speed, it was taking longer to do these generations.)
It was doing generation 215.

CurrentStruc 1 21 123 215 1 16 6.389 -2022.274 -804.780 -585.711 45021196.000 2.050 3.900 7156.295
CurrentStruc 1 21 123 216 1 5 6.324 -2227.056 -1304.998 -736.268 69396480.000 2.150 4.100 9464.200
CurrentStruc 1 20 123 217 1 17 6.340 -2311.714 -1109.748 -666.149 60635824.000 2.150 4.100 9464.200
CurrentStruc 1 20 123 218 1 19 6.314 -1784.907 -452.469 -521.274 37506988.000 2.150 4.100 9464.200

Here's one of the 185x (2 sec timeout) nodes that I have crunching to my other account.
It was doing gen 151.

CurrentStruc 0 21 123 151 1 5 6.055 -2247.021 -533.654 -518.490 36841944.000 1.550 2.900 1768.926
CurrentStruc 0 21 123 152 1 18 6.024 -2301.021 -432.239 -486.721 35569452.000 1.550 2.900 1768.927
CurrentStruc 0 21 123 153 1 19 5.911 -3010.360 -748.922 -651.610 63574196.000 1.500 2.800 1538.197
CurrentStruc 0 20 123 154 1 13 5.841 -2391.077 -862.600 -551.492 43018768.000 1.400 2.600 1163.098

Brian the Fist

03-14-2003, 05:15 PM

Ok, that's fantastic, thanks. I was mainly interested in the last 3 numbers, which let me see how 'loose' it was letting the structures get (and thus how poor quality they will be). The starting values are 0.85, 1.5 and 500. When the first number is about 1.5 or less the structures shouldn't be too bad, but when it gets above 2.0 it's probably starting to suck.

Also, it appears these numbers are not changing much from generation to generation, so it may not be wise to reset them to (0.85,1.5,500) at the start of each new generation; it may be best to just keep them from the previous generation. My main concern is I do not want the protein 'cutting through itself', which could happen if the structures get sloppy enough. Anyhow, I'll take this all into consideration as I decide exactly how the final version will work. The last piece of data I will need is to see how everyone else does compared to AMD_is_logical once you've all completed your 60x slower 250 generations.
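To pull those last three numbers out of a logged CurrentStruc line and apply the rough thresholds mentioned above, something like the following Python sketch would do. The function names and the "borderline" band between 1.5 and 2.0 are assumptions for illustration; the meaning of the other fields on the line is not explained in this thread, so the sketch ignores them.

```python
def laxness(line):
    """Return the last three numbers of a CurrentStruc line as floats."""
    fields = line.split()
    if not fields or fields[0] != "CurrentStruc":
        raise ValueError("not a CurrentStruc line")
    first, second, third = (float(x) for x in fields[-3:])
    return first, second, third


def rate(first):
    """Rough quality label based on the first of the three laxness numbers."""
    if first <= 1.5:
        return "ok"          # 'about 1.5 or less' per the post
    if first > 2.0:
        return "poor"        # 'above 2.0' per the post
    return "borderline"      # assumed label for the range in between


tight = laxness("CurrentStruc 1 20 123 168 1 6 7.291 -272.471 1497.552 "
                "207.135 11748199.000 1.400 2.600 1163.098")
loose = laxness("CurrentStruc 1 21 123 216 1 5 6.324 -2227.056 -1304.998 "
                "-736.268 69396480.000 2.150 4.100 9464.200")
```

Applied to the logged lines above, the first 60x node stays in the "ok" range while the second 60x node falls in the "poor" range.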

You have all been most helpful so far, thanks.

m0ti

03-15-2003, 02:16 AM

Originally posted by Brian the Fist
Ok, that's fantastic, thanks. I was mainly interested in the last 3 numbers, which let me see how 'loose' it was letting the structures get (and thus how poor quality they will be). The starting values are 0.85, 1.5 and 500. When the first number is about 1.5 or less the structures shouldn't be too bad, but when it gets above 2.0 it's probably starting to suck.

Also, it appears these numbers are not changing much from generation to generation, so it may not be wise to reset them to (0.85,1.5,500) at the start of each new generation; it may be best to just keep them from the previous generation. My main concern is I do not want the protein 'cutting through itself', which could happen if the structures get sloppy enough. Anyhow, I'll take this all into consideration as I decide exactly how the final version will work. The last piece of data I will need is to see how everyone else does compared to AMD_is_logical once you've all completed your 60x slower 250 generations.

You have all been most helpful so far, thanks.

Just started Gen 75.

AMD_is_logical

03-15-2003, 11:53 AM

I have been working on the rare but serious bug that we couldn't track down in beta3. I can only reproduce it sporadically.

I can sometimes get it by doing the following. I crunch with -rt -if -g1 for a while, then remove the f*.lock file while it is working on structure 1 of a new generation. I can only occasionally stop it at just the right time. I get the impression that I can't stop it too quickly after it starts structure 1, but that I must remove the f*.lock file before I see it working on structure 2.

I will then have a fold_*.bz2 file with '1' as the middle number (although having such a file does not ensure that I got it right).

I then crunch with -rt -if -g0 (note the different -g switch) until it finishes the current generation and starts working on the next one.

If the bug occurred, then the fold_*.bz2 file with a '1' in the middle will now be gone, but it will still be in the filelist.txt file. There is no fold_*.bz2 file for that generation.

I will try to PM you a URL where you can get the work files. The bug1 directory is after successfully stopping the -g1 crunch at the right place, and bug2 is after the -g0 crunch. Let me know if there are any problems.

AMD_is_logical

03-15-2003, 12:51 PM

More Bugs. I woke up to find one of my nodes not crunching, and the following in the error.log :

FATAL ERROR: [023.024] {trajtools.c, line 2492} RLEUnPack failed, size=2, should=400*0 - likely this is caused by overclocked or faulty RAM chips, please test your RAM

Note that none of my nodes are overclocked, and all have passed memtest86.

The switches were: -rt -if -qt -p0 -g0

This node was using the +=185 realtime clock acceleration (for a 2 second timeout).
(During beta3 I saw a similar error on a different node, also using the +=185 acceleration. As I recall, that previous time it occurred during the minimize.)

Sometime after the bug hit, my automatic upload script woke up. It removes the f*.lock file and waits for the client on the node to exit. In the morning it was still waiting. I suppose the node had put an error message on the screen and was waiting for keyboard input, but it had no monitor or keyboard so I couldn't tell. The easiest thing to do was to just reset the node.

Then I got hit by a second bug. The error.log contained:

ERROR: [001.001] {trajtools.c, line 3465} Unable to open trajectory distribution file <handle>_protein_239.trj
FATAL ERROR: [002.003] {foldtrajlite2.c, line 4237} Unable to read trajectory distribution, please create a new one

Indeed that file wasn't there. (How do you create a new one?) There was a file <handle>_protein_240.trj though.

Apparently the reset left the work files in an inconsistent state, and the client was unable to recover. I consider that a bug.

I made a backup of the directory, then I renamed <handle>_protein_240.trj to <handle>_protein_239.trj and the client was happy.

BTW, the structure being crunched was the 5.18 one on my "AMD beta test account".

bwkaz

03-15-2003, 04:47 PM

Have fun guys (if anyone that uses dfGUI for Linux is even using the beta client, that is... ;)).

It's hosted in the same place as before, http://3dguios.resnet.mtu.edu/dfGUI-linux/#newversion

EDIT: Wait a minute, there's a pretty big bug in the sucker. When a generation finishes, the benchmark data goes completely nuts, because the starting struct number needs to be reset. Hmmm... how to detect that a generation has been finished... Well, I'll see if I can't fix that here.

EDIT AGAIN: OK, it's fixed and uploaded, but I've been seeing some strange behavior where it appears that the "stop client" button has gotten pressed spontaneously. Hopefully that's just related to the fact that I'm running the non-beta in a different directory on the same machine. Hopefully. If anyone else sees it, let me know.

Brian the Fist

03-15-2003, 05:36 PM

Ole has already identified this bug:

FATAL ERROR: [023.024] {trajtools.c, line 2492} RLEUnPack failed, size=2, should=400*0 - likely this is caused by overclocked or faulty RAM chips, please test your RAM

which is kind of interesting. It is caused because part of the protein wandered off the edge of conformational space (!). (In this case, conformational space is represented as the surface of a sphere, and it has hit the north pole, which causes some strange problems.) I had generally thought it impossible until now, but guess not. Anyways, I have fixed this (for the next version) so it cannot happen anymore, but for now it is, unfortunately, unrecoverable :swear: (so just delete filelist.txt and restart from gen. 0 if this occurs). I'm not sure if the:

ERROR: [001.001] {trajtools.c, line 3465} Unable to open trajectory distribution file <handle>_protein_239.trj
FATAL ERROR: [002.003] {foldtrajlite2.c, line 4237} Unable to read trajectory distribution, please create a new one

occurred on the same node afterwards?? If so, it's probably a direct result of the first error, but if it was on a different node and needs to be looked at, let me know.

I'll try to reproduce the tricky bug again that you mentioned above - so far I think you're the only one who's seen it, so it must have something to do with the switches you are using. I am still not clear at what point you see the first error message in that case - is it when you do the '-ut' option?

AMD_is_logical

03-16-2003, 08:23 AM

Originally posted by Brian the Fist
... Anyways, I have fixed this (for the next version) so it cannot happen anymore, but for now it is, unfortunately, unrecoverable :swear: (so just delete filelist.txt and restart from gen. 0 if this occurs).

Actually, I did recover. After the f*.lock file was deleted the client refused to exit, presumably because it had printed an error message and was waiting for a keypress (despite the -qt switch). I didn't have a monitor or keyboard on that node, so I terminated the client by resetting the node (by means of its power cord). I had been using the -g0 switch (to avoid that other bug) so it hadn't been checkpointing, and as a result it "forgot" it had done that generation. The only problem was that the client apparently had already removed the .trj file for that generation and made one for the next generation. Once I renamed the .trj file the client was happy, and all 250 generations have now been successfully crunched and uploaded for that structure.

Originally posted by Brian the Fist
I'm not sure if the:

ERROR: [001.001] {trajtools.c, line 3465} Unable to open trajectory distribution file <handle>_protein_239.trj
FATAL ERROR: [002.003] {foldtrajlite2.c, line 4237} Unable to read trajectory distribution, please create a new one

occurred on the same node afterwards??

Yes, it happened when I tried to start the client after the reset described above.

Originally posted by Brian the Fist
I'll try to reproduce the tricky bug again that you mentioned above

Did you get the work files? With the work files in the bug1 directory the critical step has already been done, so if you start there and just do the second step it should be no problem. You just need to crunch with -rt -if -g0 until you've completed the generation and started the next one. Unlike the first step, it doesn't have to be stopped at any particular time, so you can leave it running and check on it when it's convenient. This step also has a high probability of working. Starting from the files in the bug1 directory, it worked all 4 times I tried.

Originally posted by Brian the Fist
so far I think you're the only one who's seen it, so it must have something to do with the switches you are using.

I think it needs the -if switch. I was hit several times when crunching without the -g switch. (I use -g1 and -g0 when trying to reproduce the bug because it makes the bug much more likely to happen.) I used the -rt switch because I always do. I never tried it without that switch. I suspect no one else in the beta has seen this bug because not many testers are doing a lot of crunching with the -if switch and regularly stopping their client with a script that removes the f*.lock file.

Originally posted by Brian the Fist
I am still not clear at what point you see the first error message in that case - is it when you do the '-ut' option?

The client doesn't produce any error message when the bug actually happens. A fold_*.bz2 file for one of the generations quietly disappears from the directory (although it is still in filelist.txt). When the client running with -if is shut down and the results are uploaded by running with -ut, then the error message occurs because the client can't find the missing fold_*.bz2 file.
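
Since the bug produces no error message until upload time, one way to spot it earlier is to compare filelist.txt against the directory. The following is only a minimal sketch: the helper name `missing_fold_files` is hypothetical, and it assumes filelist.txt holds one relative path per line (as in the filelist.txt excerpts posted in this thread) plus a `CurrentStruc` line.

```python
import os


def missing_fold_files(work_dir, filelist_name="filelist.txt"):
    """Return fold_*.bz2 names listed in filelist.txt but absent from work_dir."""
    missing = []
    with open(os.path.join(work_dir, filelist_name)) as fh:
        for raw in fh:
            entry = raw.strip()
            # Skip blank lines and the CurrentStruc status line.
            if not entry or entry.startswith("CurrentStruc"):
                continue
            # Entries may look like ".\fold_..." (Windows) or "./fold_..." (Linux);
            # keep only the file name itself.
            name = entry.replace("\\", "/").split("/")[-1]
            if name.startswith("fold_") and name.endswith(".bz2"):
                if not os.path.exists(os.path.join(work_dir, name)):
                    missing.append(name)
    return missing


if __name__ == "__main__":
    # Check the current directory only if it actually holds a filelist.txt.
    if os.path.exists("filelist.txt"):
        for name in missing_fold_files("."):
            print("listed but missing:", name)
```

Running something like this from the upload script before switching to -ut would show whether a generation's file has quietly disappeared.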

Digital Parasite

03-17-2003, 07:16 AM

Originally posted by Brian the Fist
The last piece of data I will need is to see how everyone else does compared to AMD_is_logical once you've all completed your 60x slower 250 generations.

I have been running beta4 since you released it but I am still only at generation 40 at the moment. :(

At generation 37, structure 24 my laxness values were:
1.400 2.600 1163.101

At generation 40, structure 3 my laxness values are:
1.600 3.000 2034.271

This is on a P3-800 running -rt -g 1

Jeff.

Brian the Roman

03-17-2003, 08:20 AM

Howard;
it seems to me that with respect to # of generations, original sample size, # of retries on placing a residue etc., the 'try it and see' approach seems to be necessary. My suggestion is that ALL of these parameters that we may adjust with experience should be set by the server on a dynamic basis. That way each time a client starts up it can get the latest values.

You can make the method of setting the values on the server side as simple or complex as you'd like. At the beginning I'd expect you'd simply set them manually and then re-set them periodically to try to determine the optimal values. After a while you could create a method of differentiating the results of clients using different settings and then have sets of clients working with different values simultaneously.

By adopting this approach you could roll out the beta to the general population without having to first spend all of the time trying to determine the optimal values with very few clients working at it.

You've probably noticed by now that this is one of my coding philosophies - set everything dynamically using parameters. :D

ms

Digital Parasite

03-17-2003, 08:55 AM

At generation 40, structure 50 my laxness values are:
1.800 3.400 3557.954

So they increased quite a bit in that generation.

Jeff.

Brian the Fist

03-17-2003, 10:54 AM

Originally posted by Digital Parasite
I have been running beta4 since you released it but I am still only at generation 40 at the moment. :(

At generation 37, structure 24 my laxness values were:
1.400 2.600 1163.101

At generation 40, structure 3 my laxness values are:
1.600 3.000 2034.271

This is on a P3-800 running -rt -g 1

Jeff.

Actually, it would be useful to me if a few people (who have the time) could post the last line of their filelist.txt (the line starting with 'CurrentStruc') at the point when they are at structure 20-40 (whichever) within a generation, and for several generations after about 50 or up. Make sure it is at structure 20-40 in the generation though, this is important.

For example, do it for gen 55, struc 25, gen. 56 struc 30, gen. 57 struc 27 (or something like that).

This is just so I can compare with AMD_is_logical (who already posted this info if you scroll up a few messages). Thanks!
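
For anyone who would rather not read these lines by eye, here is a minimal sketch of pulling out the interesting numbers. Only the meaning of the last three values (the laxness numbers) is stated in this thread; treating the third field as the structure counter and the fifth as the generation is an assumption inferred from the posted examples, and the function names are hypothetical.

```python
def parse_current_struc(line):
    """Split a 'CurrentStruc' line from filelist.txt into a few named values."""
    fields = line.split()
    if len(fields) < 8 or fields[0] != "CurrentStruc":
        raise ValueError("not a CurrentStruc line")
    return {
        # Assumed from the posted examples, not documented anywhere here:
        # third field follows the structure counter, fifth is the generation.
        "struct_field": int(fields[2]),
        "generation": int(fields[4]),
        # Stated in the thread: the last three numbers are the laxness values.
        "laxness": tuple(float(x) for x in fields[-3:]),
    }


def laxness_quality(first_laxness):
    """Rough label using the thresholds mentioned in the thread (1.5 and 2.0)."""
    if first_laxness <= 1.5:
        return "not too bad"
    if first_laxness > 2.0:
        return "probably poor"
    return "in between"


# A line of the kind posted in this thread.
line = ("CurrentStruc 0 24 123 84 1 21 6.695 -807.737 649.858 "
        "-65.892 3514353.250 1.050 1.900 437.252")
info = parse_current_struc(line)
print(info["generation"], info["laxness"], laxness_quality(info["laxness"][0]))
```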

Aegion

03-17-2003, 11:56 AM

My current values at generation 102 are the following.
CurrentStruc 0 28 123 102 1 4 7.411 -2572.960 -994.298 -1013.360 99461552.000 1.050 1.900 437.252
(If at some point you would feel that I would be better served by simply resetting to zero again I would be happy to do so.)

KWSN_Millennium2001Guy

03-17-2003, 01:03 PM

At generation 89

Filelist.txt
.\fold_0_###_0_###_protein_88.log.bz2
.\###_0_###_protein_88_0000011.val
CurrentStruc 0 1 123 89 1 0 10000000.000 10000000.000 -10000000.000 0.000 0.000 0.850 1.500 250.000

And Progress.txt
Building structure 1 generation 89
49 until next generation
0 generations buffered
Best Energy so far: 10000000.000

I think it has taken 2 days to get from gen 80 to gen 89

I have only one machine running the beta, a 1.8 Ghz P4 w/512 megs DDR RAM running XP pro and nothing other than DF beta.

running as a service and the useram = 1 flag in service.cfg

Brian the Fist

03-17-2003, 01:36 PM

AMD_,

I got the bug1 and bug2 dirs from you but I'm still very confused by your previous posts. Can you please give me clear, concise, step-by-step instructions for what you want me to do once I unzip the bug1 and bug2 dirs? Specifically, how many times I need to start and stop the program, and what flags it should use each time? I tried running in the bug1 dir to the end of the generation and it seemed fine...

Welnic

03-17-2003, 02:25 PM

CurrentStruc 0 21 123 84 1 2 6.721 -556.506 873.475 110.176 4166943.500 1.300 2.400 879.471

I can see that this could quickly get out of hand, so I am going to wait until I have several generations and put them all in the same post.

AMD_is_logical

03-17-2003, 03:23 PM

Originally posted by Brian the Fist
AMD_,

I got the bug1 and bug2 dirs from you but I'm still very confused by your previous posts. Can you please give me clear, concise, step-by-step instructions for what you want me to do once I unzip the bug1 and bug2 dirs? Specifically, how many times I need to start and stop the program, and what flags it should use each time? I tried running in the bug1 dir to the end of the generation and it seemed fine...

Start with the work files in the bug1 directory. Notice that there is a file fold_0*1*1.log.bz2 in both the directory and filelist.txt. As far as I can tell, all is as it should be.

Now crunch with the switches -rt -if -g0

Notice that it is crunching near the start of gen 1. Let it continue uninterrupted until it is crunching gen 2. (You can let it go as long after that as you want. Just be sure it has started the ASCII graphics of gen 2 before you stop the client.)

The contents of bug2 is what I had after doing the above. Notice that the fold_0*1*1.log.bz2 file is no longer there. It is still in filelist.txt, but not in the directory. There is no fold_*.bz2 file for gen 1 in the directory.

Georgina

03-17-2003, 04:54 PM

structure #25 of gen. 84:
CurrentStruc 0 24 123 84 1 21 6.695 -807.737 649.858 -65.892 3514353.250 1.050 1.900 437.252

structure #28 of gen 85:
CurrentStruc 0 27 123 85 1 9 6.697 -545.162 589.686 -125.918 3512862.750 1.200 2.200 665.006

structure #21 of gen. 87:
CurrentStruc 0 20 123 87 1 10 6.533 -727.888 419.502 -73.535 2480975.500 1.700 3.200 2690.320

structure #23 of gen. 90:
CurrentStruc 0 22 123 90 1 13 6.419 -1213.603 327.414 -101.658 4267804.500 1.100 2.000 502.840

structure 26 of gen 93:
CurrentStruc 0 25 123 93 1 7 6.392 -631.807 1099.677 -36.079 3397420.000 1.200 2.200 665.006

All from w2k server

G

mighty

03-17-2003, 05:05 PM

CurrentStruc 0 34 123 75 1 31 7.909 -1847.480 -88.338 -558.075 28772618.000 1.400 2.600 1163.101

Structure 36 in gen. 75

Brian the Fist

03-17-2003, 05:51 PM

Please post several generations' CurrentStruc lines in one post. I need to see 3-4 of them from the same user together, and maybe from 3-4 users, and that's it; then you can stop. Thanks.

Welnic

03-17-2003, 05:52 PM

If you want to generate a file with the info that Howard needs, make a file like this and have cron run it every 5 minutes. It will grab all of the structures in the 20s (and also structure 2) and append them to the file. Afterwards you can edit the file and you're ready. Depending on the machine you may have to adjust how often it runs.

#!/bin/sh

cat temp | grep CurrentStruc\ 0\ 2 >> output.txt
cat temp | grep CurrentStruc\ 1\ 2 >> output.txt

Welnic

03-17-2003, 08:42 PM

CurrentStruc 0 21 123 84 1 2 6.721 -556.506 873.475 110.176 4166943.500 1.300 2.400 879.471
CurrentStruc 0 21 123 85 1 17 6.729 -705.449 1129.461 113.307 4769018.000 1.400 2.600 1163.099
CurrentStruc 0 22 123 86 1 1 7.509 -543.301 -543.301 -10.866 295175.875 1.100 2.000 502.840
CurrentStruc 0 21 123 87 1 10 6.853 -878.808 1055.511 75.562 5356430.500 1.400 2.600 1163.101
CurrentStruc 0 22 123 88 1 8 6.849 -1004.899 1278.727 35.834 6161865.000 1.100 2.000 502.840

dano

03-18-2003, 05:08 AM

XP 1800+
CurrentStruc 0 26 123 104 1 16 6.438 -913.039 1202.665 -30.714 5906088.500 1.050 1.900 437.252
CurrentStruc 0 2 123 107 1 1 6.990 286.935 286.935 5.739 82331.695 1.050 1.900 437.252
CurrentStruc 0 44 123 117 1 5 6.547 -414.151 1366.844 361.686 15450094.000 1.050 1.900 437.252
XP 1700+
CurrentStruc 0 1 123 108 1 0 10000000.000 10000000.000 -10000000.000 0.000 0.000 0.900 1.600 287.500
CurrentStruc 0 40 123 109 1 37 5.571 -1291.189 1077.074 -341.397 18956416.000 1.250 2.300 764.757
CurrentStruc 0 16 123 112 1 5 5.706 -1223.127 394.204 -120.809 6028414.500 1.400 2.600 1163.099
Duron 1.3G
CurrentStruc 0 12 123 85 1 6 7.304 -1376.449 -311.703 -186.313 9276916.000 1.250 2.300 764.757
CurrentStruc 0 23 123 86 1 17 7.232 -1602.302 -384.244 -456.128 27032274.000 1.100 2.000 502.840
CurrentStruc 0 1 123 89 1 0 10000000.000 10000000.000 -10000000.000 0.000 0.000 1.400 2.600 1163.100

Welnic

03-18-2003, 11:09 AM

Same box as the last post, just more of it.

CurrentStruc 0 21 123 84 1 2 6.721 -556.506 873.475 110.176 4166943.500 1.300 2.400 879.471
CurrentStruc 0 21 123 85 1 17 6.729 -705.449 1129.461 113.307 4769018.000 1.400 2.600 1163.099
CurrentStruc 0 22 123 86 1 1 7.509 -543.301 -543.301 -10.866 295175.875 1.100 2.000 502.840
CurrentStruc 0 21 123 87 1 10 6.853 -878.808 1055.511 75.562 5356430.500 1.400 2.600 1163.101
CurrentStruc 0 22 123 88 1 8 6.849 -1004.899 1278.727 35.834 6161865.000 1.100 2.000 502.840
CurrentStruc 0 20 123 89 1 1 6.904 -841.903 252.314 -116.810 3201866.000 1.050 1.900 437.252
CurrentStruc 0 24 123 90 1 8 6.995 -1010.049 427.602 -156.914 6935936.500 1.100 2.000 502.840
CurrentStruc 0 20 123 91 1 16 6.819 -533.324 1412.476 49.499 4960647.500 1.150 2.100 578.266
CurrentStruc 0 20 123 92 1 3 6.945 5.095 1382.624 184.075 7352760.500 1.050 1.900 437.252
CurrentStruc 0 23 123 93 1 2 6.933 -690.115 1412.229 167.838 8959461.000 1.100 2.000 502.840
CurrentStruc 0 24 123 94 1 19 6.888 -373.639 1204.535 98.637 4798321.000 1.050 1.900 437.252
CurrentStruc 0 20 123 95 1 12 6.730 -554.628 1477.573 262.763 13013899.000 1.100 2.000 502.840
CurrentStruc 0 22 123 96 1 8 6.785 -528.053 1453.238 187.811 12790938.000 1.200 2.200 665.006
CurrentStruc 0 22 123 97 1 10 6.854 -402.322 1202.075 215.828 10253236.000 1.200 2.200 665.006
CurrentStruc 0 20 123 98 1 12 6.897 -222.747 1761.928 246.122 13030260.000 1.200 2.200 665.006
CurrentStruc 0 20 123 99 1 8 6.806 -371.432 1327.447 196.520 8917435.000 1.350 2.500 1011.391

Brian the Fist

03-18-2003, 02:56 PM

Bug from AMD_is_logical fixed

Ok, I have finally found the bug AMD_* had identified way back. This problem is related to the checkpointing feature that was recently added. In short, the checkpointing creates the fold*.log.bz2 file and the middle number of the file was one higher than it should have been. This creates all sorts of potentially weird anomalies including those described by AMD_*. I have made some changes to fix this but may have inadvertently fouled one of the other filenames up. I don't think so though, it looks like everything is right now.

I will update the download files on the FTP site to incorporate this fix; I'll post a message when it's been updated.

Thanks to AMD for finding this bug and when I put the new one up, please download it and see if you can still break it at all..

Digital Parasite

03-18-2003, 03:12 PM

Howard, are there any other changes you have made in the file you are going to post? ie: should we all download it to make sure everything is running smoothly?

PS: I just got a new Dual AMD MP-2600+ machine last night so I will try unleashing the new beta on it. That will get me to the end of 250 generations much faster. :D

Jeff.

tpdooley

03-18-2003, 03:52 PM

Howard - after mentioning what a better job the Beta program is doing than the normal client, a teammate pointed out that the Beta is folding a different protein than the normal client. What's the difference between the two? (is it just a smaller protein that should have been faster to get to generation 250?)

eshell

03-18-2003, 04:11 PM

It seems to me as though the amount of secondary structure (SS) is slowly degrading generation after generation. It is degrading in the sense that alpha helices are being replaced by 3/10 helices, and eventually by turns and coil. This is logical given that each generation is inheriting trajectory graphs that are not filtered by the SS prediction, and thus, the amino acids are more likely to be placed in conformations that are not as conducive to good SS formation. If this trend continues, even for the beta, it might be wise to reapply the predicted SS filter. On the other hand, SS prediction is known to be only about 80% accurate, and strictly keeping to the predicted secondary structure might prevent us from ever getting close to the native structure... Nevertheless, if after 250 generations, the structures have no worthy SS, then Howard will clearly need to take this into consideration.

just my $0.02 :thumbs:

-=Michel=-

Brian the Fist

03-18-2003, 05:42 PM

The Beta protein is the previous one we worked on on the live server (1CDZ, from 1/28-2/25, best struc was 7.10 A in 10 billion).

Anyhow, I've updated the beta files now; they are in the same place as always. It is up to you if you want to update it, you don't have to. If you do, though, please try to test the -g option for me thoroughly. Try it both with and without -if and try different -g values (especially -g0, -g1, -g2 and the default of -g5). To remind you, this affects not only progress.txt update frequency but now filelist.txt update frequency. In case of a crash/hard kill it should restart from the last 'checkpoint' it made.

Please note if you kill it at a particularly bad time, such as when it is writing out or compressing a .log or .val file for example, you will still be stuck in an unrecoverable state. Thus it is a help but doesn't mean you should now start killing the client improperly :D

Watch carefully to see the behaviour when you stop it (either properly or improperly) and see if it starts off EXACTLY where it left off (in terms of structure number). For example with -g3, if you kill it while building structure #8, it should restart at #7 (since the last checkpoint was at the end of struc 6, the nearest multiple of 3). On the other hand, if it is stopped properly (by removing the .lock file or hitting Q) while building structure #8, it should then restart at structure #8.

If killed/quit when minimizing or creating a trajectory distribution, it should restart at that step again.
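
The expected restart point for a given -g value can be worked out ahead of time when testing. This is only a sketch of the rule as described above, not the client's actual code; the function name is hypothetical, and it covers -g values of 1 or more only (how -g0 behaves is not spelled out here).

```python
def expected_restart_structure(building, g, clean_stop):
    """Structure number the client should resume at, per the rule described above.

    building   -- structure number being built when the client stopped (1-based)
    g          -- the -g value; a checkpoint is assumed after every g-th structure
    clean_stop -- True if stopped properly (.lock file removed or Q pressed)
    """
    if g < 1 or building < 1:
        raise ValueError("this sketch only covers g >= 1 and building >= 1")
    if clean_stop:
        # A proper stop resumes the structure that was in progress.
        return building
    # A kill loses everything after the last checkpoint, which sits at the
    # largest multiple of g not exceeding the number of finished structures.
    finished = building - 1
    last_checkpoint = (finished // g) * g
    return last_checkpoint + 1


# The -g3 example from the post: killed while building #8, properly stopped at #8.
print(expected_restart_structure(8, 3, clean_stop=False))
print(expected_restart_structure(8, 3, clean_stop=True))
```

If the client resumes anywhere other than the number this gives, that is worth reporting along with the switches used.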

Hopefully this is clear to everyone, if not just ask.

Digital Parasite

03-19-2003, 05:42 PM

Just to let you guys know, I have now switched my gen 50 beta client from my old P3-800 to my new MP-2600+ running on 1 CPU for now. Instead of getting an average of 3 hours 30 minutes per generation it is getting about 1 hour 13 minutes and has already done about 10 generations since this morning.

I will get to 250 much faster now... ;)

Jeff.

m0ti

03-19-2003, 08:50 PM

Hi,

Got the following error:

========================[ Mar 20, 2003 3:43 AM ]========================
ERROR: [000.000] {foldtrajlite2.c, line 3863} Error during upload: STATUS 906 STRUCTURE COMPRESSION ERROR

I've got some 21 gens backlogged because of this.

switches: .\foldtrajlite -f protein -n native -g 1 -rt -df

WinXP Pro (non-service)

Aegion

03-19-2003, 08:52 PM

Originally posted by m0ti
Hi,

Got the following error:

========================[ Mar 20, 2003 3:43 AM ]========================
ERROR: [000.000] {foldtrajlite2.c, line 3863} Error during upload: STATUS 906 STRUCTURE COMPRESSION ERROR

I've got some 21 gens backlogged because of this.

switches: .\foldtrajlite -f protein -n native -g 1 -rt -df

WinXP Pro (non-service)
You should probably clarify if you are running the newest version of the beta 4 software or the older one.

Edit: I'm also having trouble uploading structures, I suspect that the culprit is server problems for Distributed Folding. I probably should try the regular client and see if I can upload normally.

Update: The server for the regular client appears to be working properly and I can upload units.

AMD_is_logical

03-19-2003, 10:07 PM

Originally posted by m0ti
Hi,

Got the following error:

========================[ Mar 20, 2003 3:43 AM ]========================
ERROR: [000.000] {foldtrajlite2.c, line 3863} Error during upload: STATUS 906 STRUCTURE COMPRESSION ERROR

I've got some 21 gens backlogged because of this.

switches: .\foldtrajlite -f protein -n native -g 1 -rt -df

WinXP Pro (non-service)

I'm crunching with the new beta4a (since last night) and all was fine for a while. About 2 hours ago my upload script ran and all four nodes got the same structure compression error as what m0ti got above. I'm using linux with -rt -if -g0 -p0.

Trying to upload these results with the old beta4 client gave the same error message (except line 3862).

And I had a sub 5A structure in progress. :cry:

Should I switch back to the old beta4 client?

Do I have to clean out the directories, or can this be fixed on the server side?

Aegion

03-19-2003, 10:16 PM

Originally posted by AMD_is_logical
I'm crunching with the new beta4a (since last night) and all was fine for a while. About 2 hours ago my upload script ran and all four nodes got the same structure compression error as what m0ti got above. I'm using linux with -rt -if -g0 -p0.

Trying to upload these results with the old beta4 client gave the same error message (except line 3862).

And I had a sub 5A structure in progress. :cry:

Should I switch back to the old beta4 client?

Do I have to clean out the directories, or can this be fixed on the server side?
It's almost certainly the servers themselves and has nothing to do with the client.

Georgina

03-19-2003, 10:18 PM

========================[ Mar 19, 2003 8:41 PM ]========================
ERROR: [000.000] {foldtrajlite2.c, line 3862} Error during upload: STATUS 906 STRUCTURE COMPRESSION ERROR

I also just noticed the same error. This is on the same box that I had previously reported results from filelist.txt

G

AMD_is_logical

03-19-2003, 10:20 PM

Originally posted by Aegion
It's almost certainly the servers themselves and has nothing to do with the client.

Ok, I'll disable my upload script and just crunch offline for now.

Brian the Fist

03-19-2003, 10:46 PM

I changed some stuff with the beta CGI on the server before I went home today, so I probably broke it :D so just crunch offline until tomorrow when I fix it. If you will notice, the top 10 structures now shows the actual top 10 (before, it showed at most 1 per user, so that is why AMD_is_logical is now dominating the top 10). It is important that we now see the true top 10 movies and not just the top 1 from the top 10 users. Actually, if it's a minor error I might be able to fix it now, I have 10 minutes... if not, tomorrow morning I'll get the server going again. ;)

update
Ok, fixed it (I think). Sorry 'bout that. Hopefully it won't disrupt your in-progress simulations.

update 2
Ok, now I think it is really fixed (gimme a break, it's midnight after all..) If you find it is no longer continuing your movie (i.e. if AMD's #1 movie, which is currently at gen 110, doesn't proceed to gen 111 now) do NOT delete any files yet. Check if the files are still on your machine (and it keeps trying to upload them each time) or if they are gone forever (in which case, well, it's gone). If the files are still on your machine I can probably still fix it. It's too late at night for me to figure out what the server will do with the stuff it got the last few hours so we'll see tomorrow. nighty-night all.:sleepy:

AMD_is_logical

03-20-2003, 03:19 AM

Ok, I turned my upload script back on and the upload seems to have worked. :thumbs:
I got credit and finished the 4.99A structure. (Alas, its RMS went the wrong way. :( I guess 50 structures per generation just isn't enough.)

I'm glad it wasn't the new beta client. It crunches faster, and both my accounts have gotten their current best structure since I installed it. :thumbs:

m0ti

03-20-2003, 05:58 AM

Ok, looks like you fixed it!:thumbs:

Thanks!

Now I've just got to try and get some better RMS values than the crap I've been getting! :bang:

tpdooley

03-20-2003, 07:02 AM

Since it takes so long on slower machines to get to generation 100 (I moved to a faster machine a few days ago), can we have an option to have a larger starting pool to select the structure to work on from? Instead of the 5 or 10k we currently have, the option to increase the pool up to .. say 100k, so the slow machines have a much better chance of finding a great structure to work on?

Digital Parasite

03-20-2003, 08:17 AM

Hey everyone, I have had a few spare moments to add a couple of features to the dfGUI beta client so a new version is available for download (same link as before).

Download new beta at:
http://gilchrist.ca/jeff/dfGUI/dfGUIv22beta.zip

v2.2beta2 (Mar. 19, 2003)
- # generations is now configurable in Config window.

- Config window now appears in centre of dfGUI window so people using low resolutions will still be able to see and access the window.

- Added display to keep track of best energy seen since you started the GUI.

- Added display to indicate time it took to complete the previous generation

- Added display to indicate average time it has taken to complete each generation

- Removed structures per second and minute since the new beta works more slowly these values no longer make sense

- Modified the restart the inactive client code to reduce the chance of it restarting multiple copies at the same time.

If you find any problem, please let me know.
Jeff.

Brian the Roman

03-20-2003, 08:18 AM

The size of the starting pool is not set by whim. It's a balance between quantity and quality that Howard's trying to maintain. Doing this type of thing would be counter-productive, IMHO.

ms

m0ti

03-20-2003, 08:32 AM

Just a small request:

For the details on the top 10 folds could you also provide text info?

Specifically:

the values of the various potentials for each generation and the value of the RMSD for each generation.

This could be provided as a static text file or something.

Thanks!

m0ti

03-20-2003, 08:37 AM

Jeff,

Nice new features in dfGui. Plus it no longer gives me the invalid int error!:D

Brian the Fist

03-20-2003, 11:01 AM

Originally posted by m0ti
Just a small request:

For the details on the top 10 folds could you also provide text info?

Specifically:

the values of the various potentials for each generation and the value of the RMSD for each generation.

This could be provided as a static text file or something.

Thanks!

Do you mean just make the quantities which are graphed available in text format as well? They already are, there's just no link actually :cool: Why would you want to see the ugly raw numbers?

AMD_is_logical

03-20-2003, 11:57 AM

I just looked at the top ten structure list, and I don't see Dano's 5.19A structure. What happened to it? :confused:

m0ti

03-20-2003, 12:07 PM

Originally posted by Brian the Fist
Do you mean just make the quantities which are graphed available in text format as well? They already are, there's just no link actually :cool: Why would you want to see the ugly raw numbers?

Yup, the ugly raw numbers.

As I posted earlier some weighting of the various potentials should be more accurate than any single potential by itself (i.e. combining crease-energy with the other methods).

With the raw data it would be fairly easy to code something that finds optimal weightings for the top 10. It should be interesting to see how the graph of this revised function compares to RMSD.
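
A minimal sketch of the comparison step, assuming the raw data arrives as one row of potential values per generation alongside the matching RMSD values. The function names, the weights and the numbers in the usage lines are all made up for illustration; actually searching for good weights is left out.

```python
from math import sqrt


def combined_score(potentials, weights):
    """Weighted sum of the potentials for each generation (one row per generation)."""
    return [sum(w * p for w, p in zip(weights, row)) for row in potentials]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return sxy / sqrt(sxx * syy)


# Made-up numbers purely to show the call pattern; not real client output.
toy_potentials = [[-500.0, 900.0], [-700.0, 600.0], [-1200.0, 300.0]]
toy_rmsd = [7.0, 6.7, 6.4]
score = combined_score(toy_potentials, [1.0, 0.5])
print(pearson(score, toy_rmsd))
```

A correlation near 1 would mean the combined score tracks RMSD closely; trying several weight vectors and keeping the best one would be the crude version of the fit described above.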

Brian the Fist

03-20-2003, 12:17 PM

Originally posted by AMD_is_logical
I just looked at the top ten structure list, and I don't see Dano's 5.19A structure. What happened to it? :confused:

That's a very good point. After doing some thorough detective work, I have found that in that movie, it was at generation 83-85 where he reached 5.19 A. However, the database also thinks he has only reached generation 70 in that movie. Thus whatever generation he was really at got overwritten with 70, which effectively killed it because he was already at gen 110 or so (so when he uploads 111 and it sees the last upload was gen 70 it will reject it, etc).

To clarify further, Dano's top structure is still recorded as 5.19 - this is what it used to print in the top 10. However, when I look at the actual movie records, his corresponding movie has a best rms of 7.8 (at gen. 70). Since it now gets the top 10 movies (instead of users) he no longer shows up. I'm 99% sure this was buggered by my mistake yesterday, but please keep an eye out for future anomalies like this, which could indicate another bug somewhere.
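
To illustrate why the movie is effectively dead, here is a sketch of the kind of check described above. The exact server rule is not given in this thread; the assumption here is simply that an upload is accepted only when it is the generation right after the last one recorded, and the function name is hypothetical.

```python
def accept_upload(last_recorded_gen, uploaded_gen):
    """Assumed rule: only the generation directly after the recorded one is accepted."""
    return uploaded_gen == last_recorded_gen + 1


# Normal case: database says 110, client uploads 111.
print(accept_upload(110, 111))
# The broken case above: database was overwritten with 70, client uploads 111.
print(accept_upload(70, 111))
```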

I'll keep an eye on things too.. (my apologies to Dano :D )

Brian the Fist

03-20-2003, 12:32 PM

Originally posted by m0ti
Yup, the ugly raw numbers.

As I posted earlier some weighting of the various potentials should be more accurate than any single potential by itself (i.e. combining crease-energy with the other methods).

With the raw data it would be fairly easy to code something that finds optimal weightings for the top 10. It should be interesting to see how the graph of this revised function compares to RMSD.

Unfortunately, it is more complicated than you may think. Scientists have been working on the 'scoring problem' for well over 20 years now, and still cannot find anything resembling an 'ideal' scoring function (which would have a correlation of 0.99 with RMSD on all known proteins - or ANY known protein for that matter...). On top of this, a scoring function must be quick to compute since it often must be evaluated millions of times or more.

So rest assured we are working on it but it's not something you can pull off overnight on your laptop (and if you DO, you definitely deserve a Nobel prize :p ).

Now that I am through rambling, yes I can put links to the text versions of the graphs, no problem, look for it later today hopefully.

m0ti

03-20-2003, 03:02 PM

Originally posted by Brian the Fist
Unfortunately, it is more complicated than you may think. Scientists have been working on the 'scoring problem' for well over 20 years now, and still cannot find anything resembling an 'ideal' scoring function (which would have a correlation of 0.99 with RMSD on all known proteins - or ANY known protein for that matter...). On top of this, a scoring function must be quick to compute since it often must be evaluated millions of times or more.

So rest assured we are working on it but it's not something you can pull off overnight on your laptop (and if you DO, you definitely deserve a Nobel prize :p ).

Now that I am through rambling, yes I can put links to the text versions of the graphs, no problem, look for it later today hopefully.

I know it won't be anywhere near the actual RMSD; if it were that simple everyone would have been doing this long ago and there wouldn't still be a scoring problem. It should still be interesting to see the contribution that a very primitive combination of scoring functions can achieve.

tpdooley

03-20-2003, 04:32 PM

Originally posted by Brian the Roman
The size of the starting pool is not set by whim. It's a balance between quantity and quality that Howard's trying to maintain. Doing this type of thing would be counter-productive, IMHO.

ms

If the client stays as it is, an Athlon 600 won't finish generation 100 in under a week (I had to switch to an aXP 1700+ to make sure it got to the 250th generation in a timely manner). During CASP trials - that means 1 shot. I'd like the option of spending a whole day generating a pool to choose from so that I'd have a better chance during the remaining 6 days of getting the lowest score.

Brian the Fist

03-20-2003, 05:18 PM

The algorithm will NOT stay the same as it is now, that is why this is a beta test. Do not make any assumptions about what will or will not be in the final version. We will not allow you to adjust sample sizes however, as this is a critical part of the algorithm.

Just trust us and we can virtually guarantee you that the minimum requirements for participating will not increase above a P3-450 (since we have a couple dozen of those lying around here... :) ).

m0ti

03-20-2003, 05:41 PM

Well, after using a very rudimentary method, I've found that crease energy is much, much better than the B-L or Z-D potentials; so much so that when assigning weights to minimize errors, the weights come to (1,0,0) for pretty much any degree of accuracy. This is under the requirement that the weights sum to 1, something that may cause problems due to scaling.
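
A rough sketch of this kind of weight search, for anyone who wants to try it on the raw numbers. It assumes each structure comes with three scores (crease energy, B-L, Z-D) plus its RMSD, and it simply grid-searches weights that sum to 1 for the smallest squared error. The data layout, the function name and the numbers in the usage lines are made up for illustration; this is not the method actually used in the post.

```python
from itertools import product


def best_weights(scores, rmsd, steps=10):
    """Grid-search weights (non-negative, summing to 1) minimizing squared error vs. RMSD.

    scores: list of (crease, bl, zd) tuples, one per structure (hypothetical layout).
    rmsd:   list of RMSD values for the same structures.
    """
    best = None
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue  # the third weight would be negative
        w = (i / steps, j / steps, (steps - i - j) / steps)
        err = sum(
            (sum(wk * sk for wk, sk in zip(w, s)) - r) ** 2
            for s, r in zip(scores, rmsd)
        )
        if best is None or err < best[0]:
            best = (err, w)
    return best[1]


# Made-up data in which only the first score tracks RMSD.
scores = [(5.0, 40.0, -3.0), (6.0, 10.0, 8.0), (7.0, 55.0, 1.0), (8.0, 20.0, -6.0)]
rmsd = [5.0, 6.0, 7.0, 8.0]
print(best_weights(scores, rmsd))
```

Because the weights are forced to sum to 1, the search cannot rescale a potential that lives on a different numeric range, which is the scaling problem mentioned above.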

Brian the Roman

03-21-2003, 06:32 AM

Another request for the dfgui beta. Give me somewhere I can go and look at the best energy value for each generation. Right now I can only see best so far and best this gen.

See, there is no end to the number of extra features us users will ask for! :D

ms

pointwood

03-21-2003, 07:19 AM

I don't think this is the right place to ask for new features for dfGUI ;)

AMD_is_logical

03-22-2003, 04:59 AM

Sometimes I'll go to the beta stats page, click on "view details" for one of the top ten structures (#5, for instance), and get the details page (saying it's for #5) -- but it is for a different structure (possibly the one that is now #6). This seems to be due to the details being updated long after the stats page is.

Perhaps the username of the person who made the structure could be added to the details page. That would let people know which structure they are looking at.

Brian the Fist

03-22-2003, 11:56 AM

Yes you are correct. This won't be a problem when it is no longer a beta though. It is only because all the stats and graphs are being generated on the same box as the web server while we beta test so I am not going to bother fixing it. If it helps, the stats are being updated every hour on the hour, and the plots are being computed throughout the entire hour (right now) but unless you just changed rank in the top 10 they will be accurate.

Brian the Roman

03-23-2003, 09:55 AM

Howard;
I'm wondering how the beta algorithm works... When a client completes gen x and gets a best score of, say 7, and then completes gen x+1 and gets a best score of 7.3, which structure does it use as its starting point for gen x+2? Does it continue sampling around the best overall or the best of the previous generation notwithstanding the quality differences?

ms

Brian the Fist

03-23-2003, 11:15 AM

It will always proceed from the best struc of gen. x for gen. x+1, even if gen. x-1 was better. In this way it simulates dynamics to some extent. gen. x is always a slight displacement from gen. x-1.
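
To make the two options in the question concrete, here is a small sketch. It takes a list of the best score found in each generation (lower is better) and returns the index of the generation whose best structure would seed the next one. Both function names are illustrative only; the first follows the rule described in this post, the second is the alternative the question asked about.

```python
def next_start_latest(best_per_gen):
    """Rule described above: continue from the best structure of the latest generation."""
    return len(best_per_gen) - 1


def next_start_best_overall(best_per_gen):
    """Alternative from the question: continue from the lowest score seen so far."""
    return min(range(len(best_per_gen)), key=lambda g: best_per_gen[g])


# Gen x scored 7.0 and gen x+1 scored 7.3, as in the question.
history = [7.5, 7.0, 7.3]
print(next_start_latest(history))        # index of gen x+1
print(next_start_best_overall(history))  # index of gen x
```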

m0ti

03-23-2003, 11:16 AM

Excellent question Brian (the Roman)!

In any case, it might be a good idea, instead of just doing energy-min and traj distribution for 1 fold, to do it for the top 3-5 folds so far; let 'em crunch for a while across them before you toss anything out.

Seeing as this is an optimization problem we may want to shift to a little bit of a less greedy approach (though a well thought out greedy algorithm can generally produce pretty good results very quickly).
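
As a toy illustration of the difference, the sketch below runs a generational search that keeps the `keep` best candidates each round. With keep=1 it behaves like continuing from a single best structure; keep=3 to 5 is the less greedy variant suggested here. The objective, the perturbation and all names are stand-ins chosen for the example and have nothing to do with the real folding code.

```python
import random


def evolve(starts, score, perturb, keep=1, samples=20, generations=30, seed=0):
    """Toy generational search keeping the `keep` lowest-scoring candidates per generation."""
    rng = random.Random(seed)
    pool = list(starts)
    for _ in range(generations):
        # Every survivor spawns `samples` perturbed children; parents are not carried over,
        # so each generation is a displacement of the previous one.
        children = [perturb(p, rng) for p in pool for _ in range(samples)]
        pool = sorted(children, key=score)[:keep]
    return pool


def toy_score(x):
    """Stand-in for an energy or RMSD score: distance from an arbitrary target."""
    return abs(x - 3.0)


def toy_step(x, rng):
    """Stand-in for a small structural displacement."""
    return x + rng.uniform(-0.5, 0.5)


print(evolve([0.0], toy_score, toy_step, keep=1))
print(evolve([0.0], toy_score, toy_step, keep=3))
```

The cost of the less greedy variant is visible in the loop: with keep=3 each generation evaluates three times as many children.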

arjanscholl

03-23-2003, 04:41 PM

Hello there, I have a little question: could it be that the latest beta client (I haven't tried it with older versions) runs at a different priority while it's 'minimizing energy' than during the first part? I see some programs slowing down when it starts minimizing energy and calculating trajectories... I hope you understand what I mean ;)

arjanscholl

03-23-2003, 04:58 PM

Additionally, I've found another thing, but I don't really know if it's a problem or not.

When the client is running normally it sometimes says 'Tight spot, trying alternate conformation #xxxxxx' in the top right of the display.

But if it's really busy, let's say up to 200000 conformations, you can simply press 'q' and restart the client, without having to continue to find an 'alternate conformation'.

If you are really bored one day, you can simply keep pressing the 'q' key and restarting the client to gain a lot of points. (Am I right?)

Brian the Roman

03-23-2003, 09:26 PM

When I go to the high flyers page it says my best rmsd is 5.40, but in the top 10 list it has me as 5.21. I thought it might be an update delay but this has been the case pretty well all day now.

ms

Brian the Roman

03-24-2003, 12:08 AM

Now when I go to the page it says the following:

Your smallest RMSD structure 4.95

Overall smallest RMSD structure 4.99

The top 10 list has me on it twice, once at 5.00 and once at 5.21. It has the best structure listed as 4.99.

:confused:

ms

tpdooley

03-24-2003, 02:02 AM

Howard mentioned that a lot of the top ten stuff only gets updated every hour. You're now showing up on the top ten list as the top of the heap at 4.95 for me..
Congratulations on dethroning AMD_is_Logical. ;)

Howard - with a fair number of us at or beyond the 100 generation mark - can you tell if the lowest initial score at generation 0 has a direct correlation with the lowest score on the top ten list? (i.e. did Brian have the lowest generation 0 score, followed by Dano, then Scoofy and Michel?) Or do some of the sloppier initial scores end up being much better at aligning themselves than the lowest initial Gen0 scores?
(granted, this is an awfully small group to base such an assumption on.)

Brian the Fist

03-24-2003, 10:53 AM

Originally posted by arjanscholl
Hello there, I have a little question: could it be that the latest beta client (I haven't tried it with older versions) runs at a different priority while it's 'minimizing energy' than during the first part? I see some programs slowing down when it starts minimizing energy and calculating trajectories... I hope you understand what I mean ;)

Others have mentioned this as well although there is no good reason why it should be the case. What OS are you running and what does the task manager show when this occurs, can you look?

Brian the Fist

03-24-2003, 10:55 AM

Originally posted by tpdooley
Howard mentioned that a lot of the top ten stuff only gets updated every hour. You're now showing up on the top ten list as the top of the heap at 4.95 for me..
Congratulations on dethroning AMD_is_Logical. ;)

Howard - with a fair number of us at or beyond the 100 generation mark - can you tell if the lowest initial score at generation 0 has a direct correlation with the lowest score on the top ten list? (i.e. did Brian have the lowest generation 0 score, followed by Dano, then Scoofy and Michel?) Or do some of the sloppier initial scores end up being much better at aligning themselves than the lowest initial Gen0 scores?
(granted, this is an awfully small group to base such an assumption on.)

Yes, funny things may appear since stats are being updated only once an hour, plus because of that boo-boo I made a few days ago. Everything will be wiped shortly for the next test so then it should all be consistent again (if not tell me..).

We do not expect the Gen.0 RMS to correlate with the Gen. 100 RMS at all, but you can look yourself in the plots of RMS vs. time for the top 10 structures.

Brian the Roman

03-24-2003, 11:11 AM

One of my clients is running MUCH slower than the other two, notwithstanding it's the faster machine. This didn't use to be the case, even within the betas. I don't really recall exactly, but I believe when beta '4a' became available I updated the two clients that are now the fastest from scratch, but only updated the .exe for the slower one. I did this to allow it to continue digging deeper into generations. Could this explain the difference?

ms

arjanscholl

03-24-2003, 11:54 AM

Originally posted by Brian the Fist
Others have mentioned this as well although there is no good reason why it should be the case. What OS are you running and what does the task manager show when this occurs, can you look?

I run Distributed Folding together with Chessbrain (a distributed chess program). Normally DF takes up 100% of the CPU's power, except when Chessbrain gets a task (it doesn't run continuously); then Chessbrain gets 100% CPU power and DF gets nothing (this is the way I want it anyway).

BUT... when DF is calculating the minimal energies and trajectories of the next generation and Chessbrain is working at the same time, they SHARE the CPU's power 50/50. That's why I think the last thread runs at a higher priority.

Hope it makes sense ;)

shortfinal

03-24-2003, 12:24 PM

I have a utility on my WNT V4.0 SP6a system called TaskInfo2000. It allows me to see threads for each process and their priorities. This is what I see:

When generating structures:
Process            % CPU   LT % CPU  Time      Sw/s  InMem KB  Total KB  Th  Pri
- foldtrajlite.ex  94.49%  92.45%    3:42:06   4592  52,712    128,948   1   Idle
--Thread           94.49%  92.45%    2:49:36   4592                          1/1

When minimizing/making trajectory distribution:
Process            % CPU   LT % CPU  Time      Sw/s  InMem KB  Total KB  Th  Pri
- foldtrajlite.ex  97.49%  73.36%    13:19:30  129   13,052    74,504    2   Idle  4.0  C:\Test\df\foldtrajlite.exe
--Thread                             11:19:03  0                             1/1
--Thread           97.49%  73.36%    0:14      129                           4/4  <<*************

Sorry about the messed up spacing. The last column shows the 2nd thread has a higher priority of 4/4. Shouldn't it be priority 1/1 like the 1st thread? Could this be the problem?

I'm using switches -rt -g 10 -df -i f

Shortfinal

AMD_is_logical

03-24-2003, 12:54 PM

Originally posted by arjanscholl
I run Distributed Folding together with Chessbrain (a distributed chess program). Normally DF takes up 100% of the CPU's power, except when Chessbrain gets a task (it doesn't run continuously); then Chessbrain gets 100% CPU power and DF gets nothing (this is the way I want it anyway).

BUT... when DF is calculating the minimal energies and trajectories of the next generation and Chessbrain is working at the same time, they SHARE the CPU's power 50/50. That's why I think the last thread runs at a higher priority.

Hope it makes sense ;)

The following is just an observation.

When I first switched from genome to DF, I had both clients crunching on the same computer at the same low priority on a Win2k machine. The genome client got nearly all of the CPU.

Early in the beta, I had several beta clients and one non-beta client running at the same low priority on that Win2k machine. They usually shared the CPU (except the non-beta client got more than its share). Whenever a minimise started, however, that client would get essentially all the cpu, even though it was still running at the same low priority (according to Task Manager) as the other clients.

m0ti

03-24-2003, 02:31 PM

Yup, same here. Caused freezing quite often during energy minimization.

jlandgr

03-25-2003, 04:59 AM

Whenever a minimise started, however, that client would get essentially all the cpu, even though it was still running at the same low priority (according to Task Manager) as the other clients.

The process priority might still be low, but the thread priority of the calculation thread most probably has changed.
Tools that let you view and manipulate both process and thread priorities include ATM (Another Task Manager) for Win9x/Me and BVSLICE for NT/W2k/XP.
Jérôme

Digital Parasite

03-25-2003, 07:20 AM

I'm not sure how the energy minimization was coded, but what I have found with my own programming is that when you launch a new thread or child process from a running process, it seems to inherit just the process priority and not the thread priority, so I had to run SetThreadPriority() again.

Jeff.

m0ti

03-25-2003, 09:00 AM

Originally posted by Digital Parasite
I'm not sure how the energy minimization was coded, but what I have found with my own programming is that when you launch a new thread or child process from a running process, it seems to inherit just the process priority and not the thread priority, so I had to run SetThreadPriority() again.

Jeff.

I've found that too in Win32.

Brian the Fist

03-25-2003, 10:54 AM

Sorry about the messed up spacing. The last column shows the 2nd thread has a higher priority of 4/4. Shouldn't it be priority 1/1 like the 1st thread? Could this be the problem?

I'm using switches -rt -g 10 -df -i f

Shortfinal

Ok now I finally get it. You see when it does the energy minimization etc., it spawns a few extra threads. I never set the priority of these so they get the default. I should be setting the priority of these according to the -p foldtrajlite option. I will look into this for the next release then, as soon as I can find a program to properly display thread priorities (Windows task manager only displays PROCESS priorities, which is different). I'll check what happens on UNIX as well if I can find a program to view the individual threads - if someone already knows a good program let me know.
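
A minimal sketch of the kind of fix being discussed, written in Python with ctypes rather than in the client's own code. On Windows a new thread starts at normal priority relative to its process class, which in the Idle class is base priority 4, while an idle-priority thread is base priority 1; that is consistent with the 4/4 versus 1/1 figures reported earlier in the thread. Here each worker thread lowers its own priority with the Win32 SetThreadPriority() call before doing any work. The helper names are hypothetical, and on non-Windows systems the priority step is simply skipped.

```python
import ctypes
import sys
import threading

THREAD_PRIORITY_IDLE = -15  # value of the Win32 constant of the same name


def set_current_thread_idle():
    """Lower the calling thread's priority on Windows; do nothing elsewhere.

    Returns True if the Win32 call reported success, False otherwise.
    """
    if sys.platform != "win32":
        return False
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentThread.restype = wintypes.HANDLE
    kernel32.SetThreadPriority.argtypes = [wintypes.HANDLE, ctypes.c_int]
    kernel32.SetThreadPriority.restype = wintypes.BOOL
    handle = kernel32.GetCurrentThread()
    return bool(kernel32.SetThreadPriority(handle, THREAD_PRIORITY_IDLE))


def run_worker(task):
    """Run `task` in a new thread that first sets its own priority."""
    result = {}

    def body():
        result["lowered"] = set_current_thread_idle()
        result["value"] = task()

    thread = threading.Thread(target=body)
    thread.start()
    thread.join()
    return result


print(run_worker(lambda: sum(range(1000))))
```

A real client would presumably map its own priority option to one of the Win32 thread priority constants instead of always using the idle value.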

Digital Parasite

03-25-2003, 12:12 PM

Originally posted by Brian the Fist
I'll check what happens on UNIX as well if I can find a program to view the individual threads - if someone already knows a good program let me know.

In Unix, once you set the priority of a process, any child process or thread also inherits the same priority so you don't need to do anything. Re-setting the priority won't hurt it though. This is true for Linux and FreeBSD anyway.
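
A quick way to check this on a Linux or BSD box is sketched below. It raises the nice value of the current process, starts a child process, and compares the nice value the child reports with the parent's. The function names are made up for the example, and it only checks child processes, not threads.

```python
import os
import subprocess
import sys


def child_niceness():
    """Spawn a child Python process and return the nice value it reports for itself."""
    out = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getpriority(os.PRIO_PROCESS, 0))"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(out.stdout)


def lower_priority_then_spawn(increment=5):
    """Raise this process's nice value, then return (parent_nice, child_nice)."""
    parent = os.nice(increment)  # returns the new nice value of this process
    return parent, child_niceness()


parent, child = lower_priority_then_spawn()
print("parent nice:", parent, "child nice:", child)
```

Note that an unprivileged process can raise its nice value but not lower it again, so running this in a shell you care about will leave that Python process at the lower priority.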

Jeff.

Paratima

03-25-2003, 12:27 PM

I will look into this for the next release then, as soon as I can find a program to properly display thread priorities (Windows task manager only displays PROCESS priorities, which is different).

Howard, snag a copy of TaskInfo from Igor's place (http://www.iarsn.com/index.html)!

Double-clicking on the task name shows the individual threads and their priorities. :D

PS. It shows this action happening on the current Winders client, too.

shortfinal

03-25-2003, 12:53 PM

Paratima beat me to the punch about downloading an evaluation copy of TaskInfo2003. :D

As for UNIX, at least Tru64 UNIX, you can view the priority of threads with this command:

# ps -eml

Shortfinal

bwkaz

03-25-2003, 01:38 PM

ps -eml works on Linux as well, at least according to the manpage. The -m option is the one that's needed; it displays all threads. The -e makes all processes show up (instead of just the ones from the current shell session), and -l makes it output in a long format, which has the priority as a column.

Digital Parasite

03-25-2003, 03:49 PM

Originally posted by bwkaz
ps -eml works on Linux as well, at least according to the manpage.

I am getting:
"ps: error: Thread display not implemented."

On my RedHat 7.3 system. :(

Jeff.

bwkaz

03-25-2003, 04:00 PM

Odd... I'm using the "ps" binary from the procps package, version 2.0.10 (installed from source). Is your ps maybe coming from an older version?

Or is it like "hostname" and there are three or four packages that have it, all with slightly different implementations, maybe?

The "hostname" binary that comes with net-tools is much, much better than the one that comes with sh-utils -- net-tools' binary takes the -f option so that it'll print the FQDN, where sh-utils' doesn't, for one, and there are probably other differences.

Edit: Just realized, the version of ps that I used to use probably did the same thing, because it did print out all the threads with options "aux". So it probably just didn't differentiate between a thread and a process (after all, the Linux kernel doesn't). To test this, do you have Mozilla? If so, when you have it running, do you see one mozilla-bin process, or five to seven? If one, then this isn't it. If 5-7, then you have the other ps that treats threads like processes.