
Thread: Frustrations With DF

  1. #1
    Administrator Dyyryath's Avatar
    Join Date
    Dec 2001
    Location
    North Carolina
    Posts
    1,850

    Frustrations With DF

    I got CC'd an email last night that IronBits had sent to his brother. In it he expressed a certain amount of dissatisfaction with Distributed Folding that he's been feeling lately. Evidently, he copied me on the email in the hopes that I would have something constructive to say on the matter.

    Essentially, his complaints went something like this:
    • There is a bandwidth issue for uploading results during protein changeovers.
    • They have ignored pleas to add timestamps for each entry to the logs.
    • He'd like error level exit codes for dealing with client crashes. Evidently there are some kind of exit codes, but they are 'undocumented'.
    • There are still occasional issues when uploading large groups of work which can result in lost work


    Now, IronBits thinks that DF is my 'pet project' and that I'll defend it at all costs. However, it'd probably be more accurate to say that it's simply the best project I've found in a while. Since I've already enumerated some complaints about the project, I'll list what I think is good about this project:
    • The science seems worthwhile to me. I've got nothing against SETI, or the various math and/or cryptology related projects, I'd just rather run something that might lead to increases in our understanding of human health problems.
    • The client exists for lots of platforms. For me, a project needs (as a bare minimum) Windows & Linux support. Sparc support is a nice addition. If it doesn't have at least the first two, I can't run it.
    • The project team is accessible. They don't always do what I'd like, but if I ask a question on the forums, I generally get some sort of response, which is a bigger deal than you might think. Getting admins on other (larger) projects to deal with individual users can be a real nightmare.


    Sometimes when I'm frustrated with one thing or the other on this project (like IronBits seemed to be yesterday), I consider working on something else for a while. I'll spend a day or two looking at other projects, but I invariably decide that I'll just stay where I'm at because I can't find another project that fits my requirements as well as DF does, even with its faults. I've been crunching DC since 1999 and I've participated in a bunch of projects. The one thing I've learned is that none of them are perfect.

    So, having said all of that, I got the distinct impression that IronBits expected me to do something about the way he was feeling. With that in mind, I'd like to use this thread as a place to discuss those things about DF that we feel are unsatisfactory. Once we've nailed down some specific gripes, we'll see what can be done about fixing them, either through the project admins or on our own. This project is pretty good, but it's not perfect. Since a large number of us seem to be in agreement that it's better than the alternatives, even with its faults, maybe the answer is to work to fix what's wonky with this one.

    So, who wants to go first? I've already got a couple of ideas about what might be done to address IronBits' gripes, but I'd like to hear what the rest of you have to say before I get into that...
    "So utterly at variance is destiny with all the little plans of men." - H.G. Wells

  2. #2
    Not here rsbriggs's Avatar
    Join Date
    Dec 2002
    Location
    Utah
    Posts
    1,400
    First, let me say that I believe that I'm feeling the same frustrations.

    Second, some recent (over the last week or so) investigations into other DC projects:

    While investigating United Devices UD (THINK) for the last two days, I've managed to help move the team from 887th to 883rd. Comments:
    • It isn't a project for the average person. It takes long hours of crunching on a top-end box just to produce a few points.
    • It isn't available for anything except Windows, so far as I can tell.
    • This feels like a big, slick, glossy, commercial operation run by IBM. I have some suspicions that this is a money-making operation for someone, in some fashion or another.
    • And you CAN'T TURN THE DAMN THING OFF. There isn't an option that lets you finish a piece of work and then stop crunching.


    RC5-72 - I've dropped out of this, due to no stats for weeks on end. Note that the main Distributed.Net developers were hired away and now work for UD...

    Seventeen Or Bust. My old favorite. Can't seem to get the folks in charge to communicate via the news page, except every couple of months. Don't know the overall benefit to humanity of solving the Sierpinski conjecture, especially given the fact that I have a terminal illness that protein folding might eventually help find a cure for. Making history as a finder of one of the largest prime numbers known does have a certain attraction to it, though, and this is a great project for benchmarking your hardware.

    ECC2-109 and other crypto applications don't hold much attraction for me these days. Good projects if you want a direct points-to-computer-power relationship.

    Folding@Home I'm becoming fonder and fonder of that project - clients available for a number of platforms, and the windows client is EXTREMELY well behaved - I've discovered that the "set CPU %" stuff works, and it shares CPU with SOB, UD, DF, DNET stuff quite well. I'm starting to lean that way more and more. Can't speak for the communication of the folks in charge, just downloaded, started up, and been crunching a few WUs for a couple of days now with no troubles.

    Third:

    There is a bandwidth issue for uploading results during protein changeovers.
    My immediate reaction would be to suggest to the folks in charge that the servers be split --> an upload server and a download server. I'm not too certain how the logistics would be worked out, though.

    They have ignored pleas to add timestamps for each entry to the logs.
    100% agreement here. This is an indication, to me at least, that the authors aren't professional software developers, or this would have been here since day one....

    He'd like error level exit codes for dealing with client crashes. Evidently there are some kind of exit codes, but they are 'undocumented'.
    The fact that the client can silently die, leaving a .lock file, without logging anything whatsoever speaks volumes to me (a Unix/C hack with 35+ years of programming experience). This is more than just an "exit code" issue...


    There are still occasional issues when uploading large groups of work which can result in lost work.
    It hurts, doesn't it? I've never lost a million point chunk all in one piece like IB, but I've certainly lost 15 or 20 entire 250 generation uploads over the last several months. There is something in the client/server exchanges that is not particularly robust, for certain.

    I REALLY LIKE DF. But, as a professional developer, finding out little things like the reason you can't "tail" (under Linux) the progress.txt file is because the file is deleted and re-created every time it is written to is very eye-opening about the underlying code and the (lack of) attention paid to performance optimizations...
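    To make that last point concrete, here's a tiny C sketch of the write pattern I'm describing versus a tail-friendly one. To be clear, this is a guess at what the client is doing, not the actual DF source; the file name is the only thing taken from the real client:

    #include <stdio.h>

    /* Guessed pattern: delete progress.txt and create a fresh file each time.
       A "tail -f" that is watching the old (now unlinked) inode never sees
       another byte, which matches the behaviour under Linux. */
    void write_progress_recreate(int gen)
    {
        remove("progress.txt");
        FILE *f = fopen("progress.txt", "w");
        if (!f) return;
        fprintf(f, "generation %d complete\n", gen);
        fclose(f);
    }

    /* Tail-friendly alternative: append to the same inode. */
    void write_progress_append(int gen)
    {
        FILE *f = fopen("progress.txt", "a");
        if (!f) return;
        fprintf(f, "generation %d complete\n", gen);
        fclose(f);
    }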


    ===bob briggs
    Last edited by rsbriggs; 09-09-2003 at 01:03 PM.
    FreeDC Mercenary


  3. #3
    Fixer of Broken Things FoBoT's Avatar
    Join Date
    Dec 2001
    Location
    Holden MO
    Posts
    2,137
    regarding the client, especially for people saying "we need a stable client"

    i think a very IMPORTANT difference between the DF client and say the SETI client (as an example) is:

    THE DF CLIENT CHANGES/CAN CHANGE WITH EVERY PROTEIN CHANGE

    the SETI client has only had 3 changes (i think, please correct me if needed) since i became aware of SETI over THREE YEARS ago. the DF client has been tweaked by Howard somewhat CONTINUOUSLY over the past 1.5 years we have been doing this, so my WAG (Wild Ass Guess) is that there have been 30-60 different versions of the DF client in half that time.

    my point is that the way howard runs DF, he writes all the code (as far as i know) and he distributes it on any given protein change as he sees fit. so unless something fundamental changes (like howard gets a full time staff?), DF will always be like this, a work in progress so to speak, from the client perspective at least

    i think those of us that prefer DF to the other DC projects (for all the various reasons) will have to live and deal with these types of "features" in the DF project for the foreseeable future. to me it is part of the "game", the interest, the excitement. in other words, it is a quality of DF that i actually LIKE (even though i run the client on anywhere from 5-100 boxen at a time) and that many people used to the stability of other DC projects WON'T like.

    i don't see things changing in this regard

    p.s. don't take this the wrong way, but i am not feeling frustrated. that doesn't mean i don't "feel your pain" , just that it all seems about the same as when we started DF way back when, perhaps the "Fog of War" has numbed me to all of it or maybe i am just
    Last edited by FoBoT; 09-09-2003 at 01:48 PM.
    Use the right tool for the right job!

  4. #4
    Well, from the other point of view, I don't really have any problems with the client right now. All my boxes are on 24/7 connections, so I don't have the troubles with massive uploads and server issues. I think during the changeover I may have topped out between 15-20 generations buffered, so I had no glitches there. As for the timestamps and exit codes issues, I can see where the project managers might want those for debugging purposes, but I can't see where I, personally, could make any use of those items. If it breaks, I look for whatever fixes are posted in the forums; if there's nothing there, I'd file a new error report, dump the directory and reinstall. Not saying any of these issues aren't real or don't need to be addressed, mind you.

  5. #5
    Not here rsbriggs's Avatar
    Join Date
    Dec 2002
    Location
    Utah
    Posts
    1,400
    Yeah, but the argument that "this is so-and-so's project and ze runs it the way ze wants to" is the same argument that has caused me to leave some other DC projects. If there isn't some room for give-and-take between the project leaders, and the people donating their time, money and resources, then the project will simply dwindle down to just the (possibly very few) people that exactly agree with everything...

    I have little enough time to do the things that I want to be doing, and being involved in a project (and I'm not saying this about DF) that is having problems with project-developer-egos versus considerations of the needs/desires of the volunteers, isn't one of them. I like DF. I like being able to "tweak the science on the fly", so long as the basic infrastructure is sound.....
    FreeDC Mercenary


  6. #6
    Fixer of Broken Things FoBoT's Avatar
    Join Date
    Dec 2001
    Location
    Holden MO
    Posts
    2,137
    Originally posted by rsbriggs
    Folding@Home I'm becoming fonder and fonder of that project - clients available for a number of platforms, and the windows client is EXTREMELY well behaved - I've discovered that the "set CPU %" stuff works, and it shares CPU with SOB, UD, DF, DNET stuff quite well. I'm starting to lean that way more and more. Can't speak for the communication of the folks in charge, just downloaded, started up, and been crunching a few WUs for a couple of days now with no troubles.
    the LARGE feature missing from F@H is OFFLINE/no-net support. it doesn't have it, and they (stanford) have answered the request several times by saying "not only no, but our stuff CAN'T work that way" (something to that effect)

    so for those of us with significant off-line boxen, F@H is out

    while i am at it, a generic "feature" that i would want in my "HGP" (Holy Grail Project) is NATIVE proxy support; that is, the project owner should release a SERVER version, not just the client, so that those of us in "no internet" environments could easily pass the work from our off-line boxen to the project servers

    Use the right tool for the right job!

  7. #7
    Senior Member
    Join Date
    Apr 2002
    Location
    Santa Barbara CA
    Posts
    355
    It does seem to me that the response to problems has gotten slower over the last few months. It is still well above the norm for DC projects though.

    I don't have any problems with the bandwidth. I just usually change all of my linux boxen to nonet before the change, download the client early the next day and then change back to internet access. On the occasions that I haven't been able to do that it seems to have worked fine doing the update.

    I am concerned about the lack of response on the ever-increasing memory usage. The linux clients I have that have been running for 4 days are using 10MB more memory. No response when I posted that on the last protein.

    And I still don't understand why the client just shuts down if it starts up and detects that the server is down.

    But on the plus side it supports OSX, and it runs at least as fast per MHz as on the x86 side of things.

    I like the fact that there is just one guy running the thing and he is not some faceless entity.

  8. #8
    Senior Member
    Join Date
    Mar 2002
    Location
    MI, U.S.
    Posts
    697

    Re: Frustrations With DF

    Originally posted by Dyyryath
    They have ignored pleas to add timestamps for each entry to the logs.
    This is the biggest thing I'd like to see too.

    Of course, I don't get errors in the logs very often, but I do get messages about "can't talk to server" about once or twice a week. But seeing as there are zero timestamps on messages, none of us can correlate the times we get them with each other, and there's absolutely no way to figure out a pattern.

    Now, you don't need a pattern if the problems that you're fixing are deterministic. However, they are most definitely not (see the uploading issues -- I haven't been bitten by these yet, but I only run 2 machines, and none of them are nonet...).

    He'd like error level exit codes for dealing with client crashes. Evidently there are some kind of exit codes, but they are 'undocumented'.
    This would be sort of nice, but I'd have to rewrite my init script to handle it (not that that's a problem, mind you ).

    It'd be nice if it got on the to-do list, in other words. dfGUI for Linux could also use the exit codes to tell you what went wrong -- but only if it had started the client and then been running for the entire time the client had been. So if you start the client, exit dfGUI, and restart it, it can't get the exit code. So yeah, it's not a huge deal, but it would be nice.

    The science seems worthwhile to me.
    Hear, hear! I agree with just about everything you say. I used to run dnet (RC5-64) and Seti, and just ended up getting way too bored.

    The client exists for lots of platforms. For me, a project needs (as a bare minimum) Windows & Linux support.
    Yep. Though Windows support could be dropped and I wouldn't care. I'm sure that would simplify a lot of the DF code immensely if it didn't have to worry about Windows portability -- but I also realize how many users use Windows, so it ain't gonna happen.

    The project team is accessible. They don't always do what I'd like, but if I ask a question on the forums, I generally get some sort of response, which is a bigger deal than you might think. Getting admins on other (larger) projects to deal with individual users can be a real nightmare.
    Aye to this one too.

    As for how to fix the stuff above:

    Well, it's not all that difficult to call time(NULL), store the return value, and call ctime() on the address of that return value. This will give you an ASCII representation of the time that you can plug into snprintf() with the standard "%s" format specifier and then write the result out to error.log. It's really not all that hard, and it wouldn't take me more than about a half hour (search through the files for all writes to the error.log stream, then modify each one). Assuming I'd written the code in the first place.
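    Something like this, roughly (a sketch only -- the helper name and the way the log handle gets passed around are made up, since nobody outside the project has seen the real code):

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    /* Prepend an ASCII timestamp to every line written to error.log. */
    static void log_error(FILE *errlog, const char *msg)
    {
        time_t now = time(NULL);           /* seconds since the epoch         */
        char *stamp = ctime(&now);         /* "Wed Sep 10 14:22:03 2003\n"    */
        char line[512];

        stamp[strlen(stamp) - 1] = '\0';   /* drop ctime()'s trailing newline */
        snprintf(line, sizeof(line), "[%s] %s\n", stamp, msg);
        fputs(line, errlog);
        fflush(errlog);                    /* so a crash doesn't eat the line */
    }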

    Error level codes shouldn't be all that difficult either. Just change exit(X); to exit(Y);, where X and Y are the old and new error codes, respectively. Of course, figuring out Y from the combination of which exit() you're looking at and the code before it (the actual error handler) would take longer, so maybe four or five hours (for me, assuming I'd written the code).
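    For illustration only -- the names and the particular values below are invented, since there's no documented list for the real client:

    /* One distinct, documented exit status per failure mode,
       instead of a generic exit(1) everywhere. */
    enum df_exit_code {
        DF_EXIT_OK           = 0,
        DF_EXIT_NO_SERVER    = 2,   /* couldn't reach the upload server */
        DF_EXIT_WRITE_ERROR  = 3,   /* e.g. the "File write error" case */
        DF_EXIT_BAD_WORKUNIT = 4    /* corrupt buffered work on disk    */
    };

    Each existing exit() call would then get the appropriate constant, and a wrapper script could check the exit status to decide whether to just restart the client, re-send buffered work, or flag the box for a human to look at.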

    This is also assuming that the DF client code is anywhere near "normal". Of course, nobody except Howard / Elena actually know that, so take these times with a grain of salt.
    "If you fail to adjust your notion of fairness to the reality of the Universe, you will probably not be happy."

    -- Originally posted by Paratima

  9. #9
    Ancient Haggis Hound Angus's Avatar
    Join Date
    Jan 2002
    Location
    Seattle/Norfolk Island
    Posts
    828
    Perfect example of the log file problem:

    These are the last 3 lines in my error.log file - anyone care to guess when the client died?

    ========================[ Sep 3, 2003 10:30 AM ]========================
    ERROR: [000.000] {foldtrajlite2.c, line 4616} Error during upload: NO RESPONSE FROM SERVER - WILL TRY AGAIN LATER
    FATAL ERROR: CoreLib [002.005] {ncbifile.c, line 715} File write error
    dfGui is sitting there happy as a clam - doesn't know anything about an error.

    Last progress file update was about 4AM today - that's the only indication of when things went sideways.

    ARRGH

  10. #10
    Senior Member Supp's Avatar
    Join Date
    Dec 2001
    Location
    Czechia, EU
    Posts
    558
    [little off-topic]

    While investigating United Devices UD (THINK) for the last two days, I've managed to help move the team from 887th to 883rd. Comments:
    ...etc.

    OT here, but...
    UD forced the BOINC team to close its source for at least 1.5 years because
    one of the BOINC developers worked for them a few years ago!
    This really doesn't help the DC environment flourish...

    [/little off-topic]
    rm -Rf /

  11. #11
    Target Butt IronBits's Avatar
    Join Date
    Dec 2001
    Location
    Morrisville, NC
    Posts
    8,619
    Well, I had really hoped for a return letter chock full of enthusiasm to boost my spirits

    I was repeating some of the problems folks had brought to my attention;
    some I have personally seen, some I have not.

    I keep my hand close to the heart of Free-DC and sense a slight dissatisfaction with DF as of late, and I'm unsure of how to handle it exactly.

    Problems that are reported get blamed on the hardware, the user, or the setup, and some problems with the client get ignored or are lost in the 'fever of the moment'.

    I have personally requested some features that would help us help them, like TIMESTAMPS for every entry made by the client and exit errorlevels to trap some of the nasty errors that seem to happen for unknown reasons, and was left feeling like I was being ignored or brushed off.
    I think they are paramount if we are going to get to the bottom of some of these irritating and frustrating problems. (note: I didn't say major problems)

    Overall, this is a good project, with a good client, and good support for MAJOR problems with the client/stats. (read: when a LOT of folks start complaining about the same thing)

    Compared to some other projects, we have very few participants, and I see no way for that user participation base to increase substantially because of the minor bugs, irritations and upload bandwidth problems. It's not seamless or smooth during transitions because of them...

    Folks like a steady diet of stats; without them, you pretty much have zero participation. When you screw them up so badly it messes up all the 3rd party stats engines, you get more tension in the ranks; do it too many times and you face a revolt and folks abandon the project.

    The project is suffering from understaffing and from being a non-professionally run project, i.e. lacking seasoned veteran programmers to work thru all the bugs (I'm not saying Howard is a bad programmer); there is just something missing. Maybe rsbriggs said it best - "the authors aren't professional software developers" - and it shows...

    If it were not for Dyyryath's most excellent STATS, I prolly would not be here, truth be told

    With that said, it's not that big of a deal to lose a box here or there to a crash, for whatever reason; at least, for the most part, you are not alone...
    The client doesn't trash the computer or interfere with anything you want to do
    The client does ramble along and they are still working on it.
    and there is always Dyyryath's STATS to keep you hooked!

    I rambled enough, nuff said, now
    them doggies !
    And show Ars your

  12. #12
    For the casual cruncher (1 or 2 machines at home), the large amount of memory required to run the -rt switch to produce any results in an acceptable amount of time will cause them to move to a different, less memory- and resource-demanding project. This problem also prohibits large scale deployments on work boxen, along with the aforementioned protein changeovers and large dumpages.

    I agree the project's science "may" prove beneficial in time. But so far all we have proven is that the "brute force" method of phase 1 is not the way to proceed. Results, good or bad, for phase 2 have yet to be determined! It may be some time yet before we know if we are on the right track with phase 2.

    TBH the project does not interest me; I'm more of a "numbers" type of person. Mathematics has always interested me, so most of my efforts are in those types of projects. Having said that, I do not mind helping out here from time to time, but it is hard to stay interested and focused on this project.

    -:Beyond:-


  13. #13
    Minister of Propaganda Fozzie's Avatar
    Join Date
    Jul 2003
    Location
    Bristol,UK
    Posts
    3,609

    Slightly off topic but hey

    Here's some Mathematics for you Beyond.

    You are still 3, yes count 'em 1..2..3 million behind me even after all that posturing.

    Face it old man you just keep running out of puff.

    PM me your address and I'll mail you a new inhaler.


    Here's a special from 3 million ahead.
    Alas poor Borg, I knew it Horatio



    http://www.butlersurvey.com/

  14. #14

    Re: Slightly off topic but hey

    Originally posted by Fozzie
    Here's some Mathematics for you Beyond.

    You are still 3, yes count 'em 1..2..3 million behind me even after all that posturing.

    Face it old man you just keep running out of puff.

    PM me your address and I'll mail you a new inhaler.


    Here's a special from 3 million ahead.

    3 million ahead as I pursue other interests in an effort to give "you" a chance; otherwise you would be so far behind me that you could not even see the dust trail left behind by my rapid climb up the charts.
    -:Beyond:-


  15. #15
    Minister of Propaganda Fozzie's Avatar
    Join Date
    Jul 2003
    Location
    Bristol,UK
    Posts
    3,609

    As they say around my neck of the woods

    Don't sing it , bring it.

    Alas poor Borg, I knew it Horatio



    http://www.butlersurvey.com/

  16. #16
    Ancient Programmer Paratima's Avatar
    Join Date
    Dec 2001
    Location
    West Central Florida
    Posts
    3,296

  17. #17
    Senior Member
    Join Date
    Sep 2002
    Location
    Meridian, Id
    Posts
    742

    Re: As they say around my neck of the woods

    Originally posted by Fozzie
    Don't sing it , bring it.






  18. #18
    One would hope we could have kept this on track a bit longer. A serious thread from the #1 team would, at least to me, be worthy of some comment by Howard. But if we just go off into trash talkin'.....


    I have to agree with most of the points that have been brought up.

    I don't have many problems with the client. I would LOVE to see timestamps in the errorlog. I am not a software developer, so I just have to go on you guys' experience that exit codes would help, and it sounds like they do.

    I don't think that, in the days of most computers having 512 MB or cheaply going up to 512 MB, the 70 or so MB that DF uses with -rt is that much of a pain. I have it installed on many of the computers at work and no one seems to notice.

    I agree with the bandwidth issues as well; there is nothing more frustrating than trying to get the new client and get back up and running after a changeover. But to be honest, how much of this is caused by people rushing to dump the 1,000 packets they had noneted before the 24 hour deadline? I don't see a corp. sponsor jumping to the rescue and giving DF an extra pipe for all of this work. I'd be interested to see what effect we have, at changeover, on the hospital's network pipe. Are we saturating the machine or the pipe?

    Well, just my 3 cents. I do hope Howard reads this thread and gives it some serious thought. He has a group of very talented people here willing to help him debug problems and other issues; I would hate to see him alienate them by not at least addressing some of the simple things.

  19. #19
    Minister of Propaganda Fozzie's Avatar
    Join Date
    Jul 2003
    Location
    Bristol,UK
    Posts
    3,609

    Sorry Doc

    I haven't had any real grief with the client bar the uploading of cached data.

    I haven't had any of them just stop but I have for the last few weeks been running everything as a service under NT/XP.

    Changeover is my only concern and they would need to do something drastic to allow all of the contributors the access that everyone wants.

    I don't think it's a bandwidth issue, just a number-of-concurrent-connections issue.

    Does anyone know what backend they are running?
    Alas poor Borg, I knew it Horatio



    http://www.butlersurvey.com/

  20. #20

    Re: Sorry Doc

    Originally posted by Fozzie

    Does anyone know what backend they are running.
    "the back-end of DF is an 8-way 440MHz HP N9000 running HP-UX11,"

    as posted by howard's boss right around the time of the blackout.

  21. #21
    Senior Member
    Join Date
    Mar 2002
    Location
    MI, U.S.
    Posts
    697
    Right, but what database software? Web server?

    Either the DB, the web server, or the OS itself is going to determine the limit on number of connections. It's likely not HP/UX, but of course you never know. I'd put my money on the database server, personally...
    "If you fail to adjust your notion of fairness to the reality of the Universe, you will probably not be happy."

    -- Originally posted by Paratima

  22. #22
    Target Butt IronBits's Avatar
    Join Date
    Dec 2001
    Location
    Morrisville, NC
    Posts
    8,619
    From Howard
    Good News. I've managed to figure out how to massage the NCBI error messages, so now there will be timestamps preceding all error messages in the error.log! This feature will be in the next release posted on the web site, probably in a week from now.
    This is something we have been wanting for a very long time!
    Looks like we get a client update along with a new protein next go around.

  23. #23
    Ancient Programmer Paratima's Avatar
    Join Date
    Dec 2001
    Location
    West Central Florida
    Posts
    3,296
    Hooray!!

  24. #24
    Fixer of Broken Things FoBoT's Avatar
    Join Date
    Dec 2001
    Location
    Holden MO
    Posts
    2,137
    he also listened and found/fixed a gen. 0 issue that was causing a loss of cpu cycles, howard listens!!
    Use the right tool for the right job!

  25. #25
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    Originally posted by Beyond

    .
    .
    .

    I agree the project's science "may" prove benefical in time. But so far all we have proven is that the "brute force" method of phase 1 is not the way to proceed. Results, good or bad, for phase 2 have yet to be determined! It may be some time yet before we know if we are on the right track with phase 2.

    .
    .
    .
    I agree totally with this point, and in fact the results to date show the same thing: brute force is not the way to proceed in phase II either. All that phase II has done is shift the distribution (RMSD vs frequency of RMSD) a little more towards 0, but it still stalls out with respect to time exactly like phase I.

    This is because Phase II is not a simple genetic algorithm, though it has elements of an SGA (per Howard's say-so), in that it does some localized gradient search about a "good" point found in generation 0. This is the same thing as a brute force search, but instead of investigating scattered points in a VERY large search space we now investigate scattered small areas within a VERY large search space. You'd expect to have the same distribution, but shifted towards 0.
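    To make the point concrete, here's a toy version of the search pattern I'm describing (random sampling, then local probing around the best sample). The "structure", the scoring function, and every number in it are made up for illustration; this is my reading of the method, not Howard's actual code:

    #include <stdio.h>
    #include <stdlib.h>

    #define DIM 20   /* toy stand-in for a protein's conformational degrees of freedom */

    static double score(const double *s)          /* lower = "closer to native"   */
    {
        double e = 0.0;
        for (int i = 0; i < DIM; i++)
            e += (s[i] - 0.5) * (s[i] - 0.5);     /* arbitrary toy "native" point */
        return e;
    }

    static void random_structure(double *s)
    {
        for (int i = 0; i < DIM; i++)
            s[i] = (double)rand() / RAND_MAX;
    }

    static void perturb(const double *src, double *dst, double radius)
    {
        for (int i = 0; i < DIM; i++)
            dst[i] = src[i] + radius * ((double)rand() / RAND_MAX - 0.5);
    }

    int main(void)
    {
        double best[DIM], cand[DIM];

        /* "Generation 0": pure random sampling, keep the best of 10,000. */
        random_structure(best);
        for (int i = 1; i < 10000; i++) {
            random_structure(cand);
            if (score(cand) < score(best))
                for (int j = 0; j < DIM; j++) best[j] = cand[j];
        }

        /* Later generations: probe a small area around that best point. */
        for (int i = 0; i < 10000; i++) {
            perturb(best, cand, 0.05);
            if (score(cand) < score(best))
                for (int j = 0; j < DIM; j++) best[j] = cand[j];
        }

        printf("final toy score: %f\n", score(best));
        return 0;
    }

    Whether you scatter single points or scatter little neighbourhoods like this, the fraction of the space you actually touch is essentially the same, which is why I expect the same distribution, just shifted.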

    I don't understand why DF needs to keep hammering away using essentially the same brute force method during phase II. Until an SGA is implemented, this approach will continue to be a brute force method. SGAs investigate schemas (an analogy would be finding little bits of the final form of the folded protein and stringing them together). My guess is that it's taking some time for the group to decide how to construct its phase III SGA. If that's the case, I can't see running this project until that development work is finished, because we're not adding anything to the body of knowledge regarding protein folding yet by crunching DF.

    Also, regarding the science involved, I've raised exactly the same points on the DF board several times and have gotten zip by way of any kind of response from Howard and company. I've even pointed out other projects that are using SGAs (notably DHEP) that use a methodology that doesn't require all of the algorithm's parameters to be perfectly tuned to the current protein being investigated (which will bite them in the arse when they do run an SGA; it's a reason why designing SGAs is still an art form).

    In my mind, F@H is a better project for protein folding from a scientific point of view as they've managed to at least publish some rather significant papers on the methodology and the results in some fairly prestigious journals and their test protein results have corresponded nicely to known folded structures.

    In summary, I appreciate the goals of the DF project and would be willing to crunch it as soon as they're moderately done with the algorithm. In my opinion, currently crunching DF phase II is rather pointless as we're just killing time until phase III is ready.

    prok

  26. #26
    Senior Member
    Join Date
    Mar 2002
    Location
    MI, U.S.
    Posts
    697
    Originally posted by prokaryote
    ... because we're not adding anything to the body of knowledge regarding protein folding yet by crunching DF.
    But DF is not trying to discover how proteins fold. All we're trying to do is figure out whether a massive, "mind-numbingly parallel" algorithm can get it relatively close.

    Besides, as Thomas Edison (I think) once said: "I haven't failed. I've simply found a thousand ways that don't work", or something like that. In Phase I, we were more or less testing the theory that "A completely random placing of amino acids will come close enough to the real structure of a protein," given enough parallel simulations. They've apparently proven that one wrong (actually, they haven't, it just depends on how you define "close enough" -- compared to other methodologies, DF was about in the middle of the road; see the CASP5 results).

    Phase II is testing the theory "A random-placing, followed by a bunch of calculations on the best candidate, will come close enough to the real structure of a protein." Different theory, different experiment.

    If Phase 3 uses an SGA (and I don't remember ever reading anything that said it definitely would, just a short comment to the effect of "it might"), then we would be testing the theory "An SGA will come close enough to the real structure of a protein", which is another completely different theory.

    In short, I view each phase as independent. I don't, therefore, see it as the algorithm being only partway-done, and I don't see the effort I've put in as "killing time". Heck, we knew from the get-go that the algorithm would change, that's pretty much a given. So would it have made sense back in phase 1 for everybody to quit just because it was going to change?

    After we go through an SGA-type stage (if we do...), would it make any sense to stop running the client when Howard makes some small algorithm changes to it, just because "it's no different from before"? Because unless I completely misunderstood your post (which is possible...), this all sounds like the logical conclusion of your line of reasoning.

    We are contributing to scientific knowledge whether or not we find a great algorithm. If nothing else, we can say "this definitely does not work", and move on after a while. But we have to find out first.
    "If you fail to adjust your notion of fairness to the reality of the Universe, you will probably not be happy."

    -- Originally posted by Paratima

  27. #27
    Fixer of Broken Things FoBoT's Avatar
    Join Date
    Dec 2001
    Location
    Holden MO
    Posts
    2,137
    i have no idea what the previous two posts are talking about, i'm a flashing 12:00 VCR type guy



    i don't understand the science, but luckily i am only a moderate stats whore, so it all balances out
    Use the right tool for the right job!

  28. #28
    People should be aware that prokaryote is currently one of the project leaders for his own distributed computing project, so he certainly has potential motivations for discouraging people from running the distributed folding project. Significantly, prokaryote didn't directly address how accurate the project currently is; while we finished in the middle of the pack during the CASP 5 trials with phase I, we've made significant improvements in accuracy since then. He's also trying to suggest that the client is simply using a brute force approach to get its answers, when this is no longer the case. Basically he's arguing the project is only worthwhile if it uses a simple genetic algorithm, since he's arguing it needs to predict the protein structure perfectly to be worthwhile.

    What he seems to be somehow completely missing is that the goal of the project is not necessarily to perfectly predict the shape of a folded protein structure, but just to get close enough that drug companies can start developing medicines based on the data. The Folding@home method is massively more computationally intensive and can generally only work on partial protein chains, and can't predict anything remotely near the size of protein that the Distributed Folding Project is predicting. The fact that prokaryote somehow missed this point makes me strongly suspect he is nowhere near as knowledgeable in this area as he tries to imply with his posting.
    Last edited by Aegion; 09-25-2003 at 01:42 AM.
    A member of TSF http://teamstirfry.net/

  29. #29
    Downsized Chinasaur's Avatar
    Join Date
    Dec 2001
    Location
    WA Wine Country
    Posts
    1,847
    Let's keep it civil and polite.

    The hallmark of Free-DC is politeness and courtesy.

    Conspiracy theory, meanness and general unpleasantness need to go offline.


    I'm so glad we've had this time together...


    Agent Smith was right!: "I hate this place. This zoo. This prison. This reality, whatever you want to call it, I can't stand it any longer. It's the smell! If there is such a thing. I feel saturated by it. I can taste your stink and every time I do, I fear that I've somehow been infected by it."

  30. #30
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    Hi bwkaz, here are some counterpoints to your observations about my reasons for not doing DF phase II anymore:

    Originally posted by bwkaz
    But DF is not trying to discover how proteins fold. All we're trying to do is figure out whether a massive, "mind-numbingly parallel" algorithm can get it relatively close.
    >> Granted, DF is not trying to understand how a protein folds, but instead to find the final folded structure of a protein (or a final folded state, since some proteins can have several stable folded structures) within a defined environment.

    Besides, as Thomas Edison (I think) once said: "I haven't failed. I've simply found a thousand ways that don't work", or something like that. In Phase I, we were more or less testing the theory that "A completely random placing of amino acids will come close enough to the real structure of a protein," given enough parallel simulations. They've apparently proven that one wrong (actually, they haven't, it just depends on how you define "close enough" -- compared to other methodologies, DF was about in the middle of the road; see the CASP5 results).
    >> I saw the CASP5 results. By entering the competition, the implied experiment was in fact whether it would be the best method to predict the final folded structure. I'm pretty sure that the participants didn't enter aiming for the middle of the pack, so in this case "close enough" means the best. I think that what they proved is that the search space is too large even given 10^10 random structural picks. Parallel or serial doesn't matter if each structure is picked at random, assuming independence of initial starting points.

    Phase II is testing the theory "A random-placing, followed by a bunch of calculations on the best candidate, will come close enough to the real structure of a protein." Different theory, different experiment.
    >> Different-ish theory, same experimental space, same general effect, same general results that a logical examination of the methods used would predict. Pick random points, expand a small area around a "best of". Net effect: pick random areas and search for the best structure. The odds that the best of 10,000 random points in the search space actually lies at the local minimum are not very good. Therefore, search the area around this best pick; odds on, you'll find a better fit than this best point, but not much better. Why? Because the search space is so large that an area or a random point have about the same net chance of finding the global best fit. Hence the spinning-the-wheels comment. We've seen what it can do so far (phase II), we've noticed that the distribution shape is the same but shifted towards 0 RMSD (which is a good thing), but how many data points do you need? Let's move on already. Really, I think that Phase II is strictly a means of testing the structure and stability of the algorithm itself in preparation for the phase III SGA implementation. I disagree that phase II is really about finding the best structure of a folded protein. I feel that this has been done.

    If Phase 3 uses an SGA (and I don't remember ever reading anything that said it definitely would, just a short comment to the effect of "it might"), then we would be testing the theory "An SGA will come close enough to the real structure of a protein", which is another completely different theory.
    >> Do a search on genetic algorithm within the official forum. It's mentioned plenty of times, and Howard and crew specifically talk about using SGAs in Phase III, which I think will prove to be a much better approach to finding "best" structures than the current, very similar (from a results-established statistical point of view) approaches of phase I and phase II (as far as we know about it).

    In short, I view each phase as independent. I don't, therefore, see it as the algorithm being only partway-done, and I don't see the effort I've put in as "killing time". Heck, we knew from the get-go that the algorithm would change, that's pretty much a given. So would it have made sense back in phase 1 for everybody to quit just because it was going to change?
    >> Again, I'd have to disagree. Phase II feeds upon Phase I; it establishes some algorithm structure necessary for implementing Phase III SGAs. I don't know how you define independent, but the above is, by most definitions that I know, dependent. My opinion is that we've learned about as much as we're going to learn by crunching phase II already. Yes, we knew the algorithm was going to change when we started phase I, but we didn't know the results of phase I a priori when it was run, otherwise it really would have been pointless. We do now, and we see that phase II produces essentially the same results, but shifted (as expected) given the statistical and probabilistic nature of the experiment and the response space. I don't think that I mentioned anywhere that the reason I don't want to crunch DF Phase II is solely based upon the fact that Howard's algorithmic approach necessitates change. I think you're taking what I said out of context completely. Phase II is about testing the structure of the algorithm in preparation for phase III, not really about finding the best structure.

    After we go through an SGA-type stage (if we do...), would it make any sense to stop running the client when Howard makes some small algorithm changes to it, just because "it's no different from before"? Because unless I completely misunderstood your post (which is possible...), this all sounds like the logical conclusion of your line of reasoning.
    >> My logical conclusion is that we're not gaining additional knowledge about producing efficient and worthwhile protein-structure-predicting algorithms (other than more endless confirmation that we're not) by pursuing phase II further. How you tie what I said about Phase III (an approach that I do favor, if it uses an altogether different method that can be shown mathematically and statistically to be different from the phase I and phase II approaches) to my reasons for being dissatisfied about phase II, I don't understand. It seems like you're fixating on one point (taken out of context about phase II) and trying to show that somehow my logic would dictate some bizarre position about phase III.

    We are contributing to scientific knowledge whether or not we find a great algorithm. If nothing else, we can say "this definitely does not work", and move on after a while. But we have to find out first.
    >> Yes that is true to a point. I could spend days dropping an apple from atop a table and contribute to scientific knowledge that gravitational attraction still works between the apple and the specific spot on the earth. Is it worthwhile? How many times do I have to drop the apple in order to convince myself that gravity exists given what is postulated about gravitational attraction? That's why I think that Howard and company are using this part of Phase II to keep people interested in the project while they figure out how to best implement the SGA approach or at least a significantly different approach than what Phase I and Phase II are. They're killing time, and trying to keep people interested while they work out phase III. When phase III comes out, I'll give it another go, but until then there are much better uses for idle CPU time within other DC projects (IMO).

    prok

  31. #31
    Ancient Haggis Hound Angus's Avatar
    Join Date
    Jan 2002
    Location
    Seattle/Norfolk Island
    Posts
    828
    Originally posted by Aegion
    People should be aware that prokaryote is currently one of the project leaders for his own distributed computing project, so he certainly has potential motivations for discouraging people from running the distributed folding project.
    Did I miss something here? Which project? DHEP? I thought that was run out of England by Miguel Garvie, and prok was merely an enthusiastic participant?

    Things are getting confused... where's my meds?

  32. #32
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    Originally posted by Aegion
    People should be aware that prokaryote is currently one of the project leaders for his own distributed computing project, so he certainly has potential motivations for discouraging people from running the distributed folding project. Significantly, prokaryote didn't directly address how accurate the project currently is, while we finished in the middle of the pack during the CASP 5 trials with phase I, we've made significant improvements since then in accuracy. He's also trying to suggest that the client is simply using a brute force aproach to get its answers, when this is no longer the case. Basicly he's arguing the project is only worthwhile if it uses a simple genetic algorithem, since he's arguing it needs to predict the protein structure perfectly to be worthwhile.
    People should be aware that prokaryote is NOT currently, and has NOT ever been, one of the project leaders for DHEP or ANY OTHER DC project, so I MOST CERTAINLY DO NOT have potential motivations for discouraging people from running the DF project. And even if I did, if I present a sound logical argument, how does this change its validity? Also, I didn't explicitly state that DF in general is not worthwhile. What I'm saying is that phase II is not providing new and useful information, and as such phase II is not worthwhile for crunching.

    Also, the title of the thread is "Frustrations with DF". Well, these are my frustrations! Sorry I forgot to clear this with you first.

    Significantly, I ran phase I as Eukaryote when no one knew what the results would be for CASP5. We were middle of the pack, as you state. We've made EXPECTED improvements in accuracy, given that phase II is a localized gradient search about a best point from a population of 10,000 randomly picked points within the enormous protein structure space. It has the same general distribution characteristics as phase I, only the central point has shifted more towards zero (a good thing, but expected). Phase II IS still essentially a brute force method. Why? Because we're still searching random small areas and points within an enormous protein structure space. If you want to really improve the accuracy of phase II, then just extend the search space to encompass the entire structure space about each best point!

    Basically, what I'm arguing is that DF phase II is not currently worthwhile, as its real purpose is to test the structure and validity of algorithmic changes that will ALLOW the implementation of phase III, which, by the way, has been stated several times by Howard and company to be a Simple Genetic Algorithm (which I feel would be a worthwhile approach). IMO, DF phase II has collected enough information to establish that the structure of the algorithm (not the results) works and is ready for phase III. What I'm saying is that phase III is not ready yet, and I propose that the reason DF phase II is allowed to go on is to keep people interested in the project until phase III is ready for release (at which point I'll contribute again).

    Basically, where do I state that the project is only worthwhile if it uses a simple genetic algorithm? (Any semi-intelligent approach other than brute force, or any combination of brute force and minor embellishments, would be worthwhile.) Also, I'll thank you not to put words into my mouth; I've never stated or implied that the project has to predict the protein structure perfectly. Howard and company nearly state this explicitly within the DF project goals; see below for a direct quote from their home page.


    What he seems to be somehow completely missing is that the goal of the project is not necessarily to perfectly predict the shape of a folded protein structure, but just to get close enough that drug companies can start developing medicines based on the data. The Folding@home method is massively more computationally intensive and can generally only work on partial protein chains, and can't predict anything remote near the size of protein that the Distributed Folding Project is predicting. The fact that prokaryote somehow missed this point makes me strongly suspect he is nowhere near as knowledgable in this area as he tries to imply with his posting.
    >> The fact that Aegion somehow missed the point that I'm not even a project leader for any DC project makes me strongly suspect he is nowhere near as knowledgeable in this area as he tries to imply with his posting. Personal attacks are wonderful, aren't they? They require no substance or intelligence, just innuendo and hearsay. Enough said.

    If it helps any, Aegion, my background is Mathematics, Statistics, Artificial Intelligence, Simple Genetic Algorithms, Evolutionary Theory, Biology, Electrical Engineering (semi-conductors), Quality and Reliability Engineering, and a bit of cognitive psych and physiology. Granted, the specific problems posed by DF are not in my area per se, but the proposed solutions most assuredly are.

    Here are the goals of DF ripped directly from their home page:



    Project Goals

    Proteins have a vast number of folds, larger than we could hope to compute even with distributed computing. Usually only one fold is found in nature. The Distributed Folding Project aims to test our new protein folding algorithm. We want to see if it can reproduce natural protein folds after making extremely large samples of many different folds.

    With your help, we will create the largest samples of protein folds ever computed. We have already sampled 1 Billion (1,000,000,000) folds for 5 small proteins, and are in the process of sampling 10 Billion (10,000,000,000) for another 10 large proteins. By the end of our first phase, we hope to make over 100 Billion protein folds spanning 15 different proteins.

    However, our results so far have shown us that pure random sampling is not sufficient, even with these large sample sizes, to get a structure sufficiently close to the true fold for a typical sized protein. To this end, we have revised the algorithm, making it more intelligent. Using an iterative approach, where each structure depends on previously generated ones, and a more intelligent search algorithm, we increase our chances of finding the correct fold. This algorithm is being tested and developed on a continual basis, and the Distributed Folding Project provides an ideal platform to rapidly test and evaluate new ideas and modifications to the algorithm, to determine what works and what does not. As we proceed we will continue to change and enhance the sampling algorithm over time.





    Sorry to say, I don't see anything about medicines or drug companies or such.

    What I do see is that they are trying to improve upon their algorithm with changes as dictated by the results. Gee, that sounds awfully familiar to what I've been saying. What I'm saying is that, as far as DF phase II is concerned, this is now known. Time for phase III. When phase III rolls out, I'll crunch again.

    As far as F@H is concerned, they've produced results deemed valid by their peers, enough so that they were accepted and published in several esteemed journals. I've folded for them as well. They're seeking to understand how proteins fold.

    So hmm, let's see: F@H is making progress on this front, and if one understands how a protein folds then one could transfer this understanding to other proteins of other sizes and predict how they would fold as well, arriving at a final folded state...

    Yep, you're right, understanding protein folding mechanisms and applying this knowledge (in the future) to help predict the final folded state of other proteins has nothing whatsoever to do with being able to predict a protein's final folded state like DF is trying to do.

    Sorry, my bad.



    prok

  33. #33
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    Okay, I got a little hot under the collar there. I apologize Aegion.

    To each their own. Crunch DF phase II or not, it's entirely up to each individual and I respect that choice.

    prok

  34. #34
    Ancient Programmer Paratima's Avatar
    Join Date
    Dec 2001
    Location
    West Central Florida
    Posts
    3,296
    I'll try to work with the few words that are left.

    Since this "Frustrations with DF" thread was started, many of the frustrations have been fixed! Howard reads and responds.

    The main reason that I (and several others) left other projects, primarily F@H, was that their admins don't.

    I (speaking only for myself here) count on the science guys to figure out what the project does. I know many of the words they use, but most of the sentences are incomprehensible. Forget the paragraphs. It's comfortable for me to run the client, therefore I do.

    You, prokaryote, are debating the merits of the science in the wrong forum. The official DF forum is the place for that. This thread is about operations.

    (And see, it is possible to defend a position in less than 10,000 words. I even had room to double-space.)
    Last edited by Paratima; 09-25-2003 at 08:08 AM.

  35. #35
    Not here rsbriggs's Avatar
    Join Date
    Dec 2002
    Location
    Utah
    Posts
    1,400
    Originally posted by Paratima
    I (speaking only for myself here) count on the science guys to figure out what the project does. I know many of the words they use, but most of the sentences are incomprehensible. Forget the paragraphs. It's comfortable for me to run the client, therefore I do.
    I'll second that. Paraphrasing what you said elsewhere - "you make the science work, I'll try to make it happen quickly".
    I might not know anything about protein conformational spaces, but I CAN get a count of how many of those little spaces I can make conform per hour, as compared to everyone else.
    FreeDC Mercenary


  36. #36
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    Originally posted by Paratima
    I'll try to work with the few words that are left.

    Since this "Frustrations with DF" thread was started, many of the frustrations have been fixed! Howard reads and responds.
    >> Sometimes he does, which is better than most projects.



    (And see, it is possible to defend a position in less than 10,000 words. I even had room to double-space.)
    >> If I did that then how would I live up to my title? I'd write more but apparently there's a quota on the number of w....

  37. #37
    Ancient Programmer Paratima's Avatar
    Join Date
    Dec 2001
    Location
    West Central Florida
    Posts
    3,296
    Touché.

  38. #38
    Originally posted by prokaryote
    Okay, I got a little hot under the collar there. I apologize Aegion.

    To each their own. Crunch DF phase II or not, it's entirely up to each individual and I respect that choice.

    prok
    I apologize too. I mistakenly thought you were one of the people in charge of DHEP, and the fact that I believed you would go to another forum area to criticize a different distributed computing program is what set me off.

    The medicine issue is covered in several places including here.

    Why fold proteins?

    Many genetic diseases are the result of dysfunctional proteins, usually caused by a mutation in the DNA sequence which encodes the protein. By learning the structures of these proteins, scientists can understand how a mutation affects the structure, and thus the function of that protein. By understanding the exact cause of the faulty protein, better cures can be developed more quickly. Also, knowing the structures of viral proteins, such as the HIV integrase protein, allows researchers to develop drugs which specifically target those proteins. This results in better treatments with fewer side effects and faster results.

    Unfortunately, solving the 3-D structure of a protein using experimental methods, such as X-ray Crystallography or Nuclear Magnetic Resonance Spectroscopy (NMR) can take from six months to over a year, for just a single protein. As well, certain types of proteins are not amenable to being solved by either of these methods. Of all the known proteins in the human genome, only about one quarter have had their structures experimentally solved. About another quarter can have their structures and functions inferred based upon their high sequence similarity to other proteins with known structures and/or functions. By solving the protein folding problem, we hope to be able to assign structures, and thus functions, to every protein in the human genome, and thus begin to decode what is perhaps the oldest historical document in existence, our DNA.
    http://bioinfo.mshri.on.ca/trades/

    A lot of the details regarding drug companies have been posted by Howard on the official forums over a prolonged period of time.

    Incidentally, here's Dr. Christopher Hogue's response to past accusations that Folding@Home has published more papers and DF isn't worthwhile at the moment.
    Are we doing cutting edge software or just me-too distributed computing? Oh please. Check our paper on the MoBiDiCK infrastructure (oops, we published in the CS literature, I guess it doesn't exist...). You will kindly note that it indicates our intentions and work towards distributed computing applied to protein folding and it clearly pre-dates the entire F@H project. We weren't first, but we aren't me-too, we've been deliberately staging this for some time now. I'm confident by our user accolades that we've done the right thing in waiting till it was ready before releasing our software.

    But, maybe this was the wrong strategy. Apparently I need to do science with press releases, not in scientific publications, 'cause no one reads them.
    http://www.free-dc.org/forum/showthr...5&pagenumber=2
    Last edited by Aegion; 09-25-2003 at 09:49 PM.
    A member of TSF http://teamstirfry.net/
