Any updates on the supposed "Memory Leak"



PinHead
06-25-2003, 02:18 AM
I know it's not critical for most "folders", but when running thin clients with minimal RAM, the extra few MB per generation forces a reboot every few days.

When I first noticed it, I thought it was due to the -g 25 switch that I was running. Although that significantly reduced the amount of RAM being lost, it did not totally remove the problem.

I haven't crashed any boxen yet, but trying to keep tabs on the last reboot every 3 days is keeping me :elephant: .

Is there any info we can provide to help?

Stardragon
06-25-2003, 11:58 AM
One of the things to look out for is running in quiet mode, as that will obviously require fewer resources.

However, if it is indeed a memory leak, which we are searching for vigorously, then it will be equally noticeable on all systems, so I don't believe additional information at this point will get at the source of the problem :(

Paratima
06-27-2003, 09:22 PM
FWIW, I seem to be getting memory-leak symptoms on all my Win2K boxen, whether running as service or not. I run ALL of them "silent", either as service or with -qt, just because the graphics take up such a huge percentage of the time.

My lone Win98/SE box has been running for a week withOUT the memory-leak syndrome, likewise my Linux boxen. So it may very well be W2K/XP-specific.

^7_of_9
06-28-2003, 01:02 AM
Win2K Pro Machine - No problem running -rt -qt launched from DFGUI in invisible mode

Win2K Server - No problem running -rt -qt launched from DFGUI in invisible mode on either processor (Except that it's slower than slow for some reason compared to same speed single processor machine :confused: )

WinXP - No problem running -rt -qt launched from DFGUI in invisible mode

Win98SE - No problem running -rt -qt launched from DFGUI in invisible mode

WinME - No problem running -rt -qt in Invisible Mode

Caldera Open Linux (Now A.K.A. SCO)- No problem running -rt -qt

CD-Nix - No problem running -rt -qt

All these Machines have been running 24/7 since I installed the client on them on Tuesday and there is no evidence of a memory leak on any of them. Memory usage stays at ~80MB per process.

IronBits
06-28-2003, 01:18 AM
25 + w2k boxen, some have no SP, some have SP1, SP2 or SP3, all dedicated clients, that means that is all they do.
Standard install of w2k, the DF client and DFGui, all with -qt -rt.
DFGui starts the clients.

Hey... this sounds really *stupid* but...
I said -qt -rt
You said -rt -qt
Order of the switches maybe???? :confused:
Naw... can't be... or is it?

ASUS, MSI, BIOSTAR, SOYO motherboards.
Various brands of video cards, NIC cards, some onboard...
All have DDR Ram (different manufacturers of RAM), AMD processors from xp1700 thru xp2600 and a couple xp2800 Bartons.
/me runs off to test the -rt -qt switch thing

Paratima
06-28-2003, 05:52 AM
Nah! Couldn't be. I hope.

dfGUI sets foldit.bat to -qt -rt.

^7_of_9
06-28-2003, 08:48 AM
It's just the way I typed em out.

In the Foldit.bat it is -qt -rt though

IronBits
06-28-2003, 09:50 AM
What motherboard, CPU, chipset and RAM type do you use?

/me :Pokes: Paratima
Client is up to 120mb already, it is not the order of the switches, you can relax. :D

^7_of_9
06-28-2003, 10:06 AM
Originally posted by ^7_of_9
Win2K Pro Machine - No problem running -rt -qt launched from DFGUI in invisible mode

Win2K Server - No problem running -rt -qt launched from DFGUI in invisible mode on either processor (Except that it's slower than slow for some reason compared to same speed single processor machine :confused: )

WinXP - No problem running -rt -qt launched from DFGUI in invisible mode

Win98SE - No problem running -rt -qt launched from DFGUI in invisible mode

WinME - No problem running -rt -qt in Invisible Mode

Caldera Open Linux (Now A.K.A. SCO)- No problem running -rt -qt

CD-Nix - No problem running -rt -qt

All these Machines have been running 24/7 since I installed the client on them on Tuesday and there is no evidence of a memory leak on any of them. Memory usage stays at ~80MB per process.


Win2K Server - Tyan S2460 Dual MP 1900+ 512MB PC2700 DDR Non-ECC (Doubles as team Stats and sig server as well as it is my Domain Controller as well)

WinXP - ECS K7S5A XP 1700+ 256MB PC2100 DDR

Win2K Pro - ECS K7S5A XP 1800+ 256MB PC2100 DDR
Win2K Pro - MSI K7T266 Pro2-RU XP 1900+ 512MB PC2100 DDR

WinME - ECS K7S5A XP 2100+ 128MB PC2100 DDR (Yes it's actually running with the -rt switch :eek: )

Caldera OpenLinux - Asus A7E T-Bird 1Ghz@1.11 256MB MB

CD-Nix - MSI K7T266 Pro XP1900+ 128MB PC2100 DDR (Yes it's actually using -rt too :eek: )

Ram is all Generic except for the Win2K server which I think I've got Crucial in there and the Win2K Pro 512MB one as well.

Can't use Cheap stuff in my main machine let alone the home Domain Controller (How'd we ever log onto our network :eek: :scared: :Pokes: )

IronBits
06-28-2003, 10:34 AM
Ok, well that pretty much covers the 'hardware' aspect of bug hunting.
I have most of what you have covered, and I still have the problem on all boxen. :cry:
Hope you don't mind humoring me a little bit while I try to figure out the differences.
ALL my computers, not just w2k, have this problem.

Where did you install the client?
Mine are in \dcprojects\distribfold (some C: and some D: )
Where are the environment variables TEMP and TMP pointing to?
I create a C:\TEMP and edit the user TEMP/TMP to point to this single directory.
Did you do anything special when you installed the client?
For example:
I created an autoupdate.cfg, and put my handle.txt file in there before running the client so I don't have to answer all the questions.
Did you go thru the manual setup and answer all the questions on each one?
I use DFGui on all the boxen to control them, do you?
Did you install any 3rd party software package that optimizes RAM or any other 'trick' software package?

When you bring up task manager, and you view the column labeled "Mem Usage" (you may have to click on View, Select Columns and add 'Memory Usage') it says it's only using ~80mb?
Can you add the columns called 'Page Faults' and 'Handles' and tell me what they show?
Thanks!

^7_of_9
06-28-2003, 05:45 PM
On one Win2K ("Central-Plexus") machine it's located in c:\Program Files\Distributed Folding

On the WinXP ("Enterprise") Machine it's in C:\Documents And Settings\Enterprise\Desktop\DF\Distribfold

Win2K Server ("Hive_Mind") instance #1 is: C:\Documents And Settings\Administrator.Domain_Name_Here\Desktop\DF-1

Win2K Server instance ("Hive_Mind") #2 is: C:\Documents And Settings\Administrator.Domain_Name_Here\Desktop\DF-2

WinME ("Vinculum") and Win98 ("Continuum") is C:\Windows\Desktop\DF

Caldera Linux ("Voyager") is /root/Desktop2/DC_Projects/Distributed_Folding

CD-Nix ("Picard" and "Kirk") (Now there's two as I just got another one up and running about 10 mins ago) is located on /mnt/distribfold via NFS on Voyager

Variables are all default on each machine.
Nothing special when I installed the client.
I set up each client manually (It's a PITA but I find it better off that way for now until I know this DC Project better. With SETI and others I can roll them out over the Domain Server when the machine logs on to the network :D )
All Windows Machines use DFGUI, Linux Machines use Client only
No 3rd party RAM optimizers (Never liked them)

You sure you wanna know what my Page Faults are? :lol:
266K on Central-Plexus :eek:
130 Handles Central-Plexus

28K and 113K Page Faults on Hive_Mind
110 and 90 Handles on Hive_Mind

/me goes to turn off Page Faults and Handles as he can't stand 2 more columns in the Task Manager. (It's bad enough having 6 columns already viewable in there w/o 2 more there :rotfl: )

Grumpy
06-28-2003, 06:03 PM
My 2400 MP Duallie is up to 170 MB per Client so far Win2000 SP3..anyone got some extra Ram :p

IronBits
06-28-2003, 08:02 PM
Nothing special when I installed the client.
I setup each client manually
Basically, the above seems to be about the only differences in the way we setup our computers.

Anyone else setup their clients manually AND still having the memory problem?
Thanks for the feedback! :thumbs:

Angus
06-28-2003, 08:41 PM
I install into a folder, then run foldit.bat the first time to set things up. I don't make any changes to TEMP space - whatever the prog wants to use is what it gets.

Then I install dfGUI in the same folder, and use it to tweak my settings for the title bar (CPU1, 2, etc.) and check the option to run the DF client hidden. I do not install as a service.

foldit.bat ends up with this line:
.\foldtrajlite -f protein -n native -qt -rt

On the server farm, I close dfGUI once the client is running, and monitor the progress using DCMonitor.

All of my W2K and W98 boxes exhibit the 'memory leak' problem.

I have gone to restarting each client every couple of days.

IronBits
06-28-2003, 11:54 PM
Well damn, that wasn't it either :(
I deleted the handle.txt and autoupdate.cfg file, and ran foldit.bat without any switches.
Answered all the questions and let it run for several hours.
Memory usage continued to climb :cry:

FBK
06-28-2003, 11:55 PM
FWIW: I didn't even realize this was an issue, until reading about it here. Most of my machines run either W2K or 98. Every machine that did not have DF2 restarted, every few days, had a huge memory footprint.

W2K AMD dually, running 238 hours using 217MB per instance.

Also a solid-running Win 98 machine was found with DF2 dead, with some message about not being able to allocate memory.

I am running console in quiet mode, not as a service. I am not running any 3rd party programs. This is happening on a variety of motherboards and hardware combinations.

Set up scheduler to delete lock and restart FOLDIT daily as a workaround for the time being.

FBK

Darkness Productions
06-29-2003, 11:56 AM
I have noticed, on my Windows machines at least, that when run as a service, at some point along the generation, the service will restart itself for no apparent reason.

When run as the text only client, it would do the same, but the chances of it starting back up were slim. None of my windows machines have high memory usage, however, some of my linux machines do:


356 gms8994 20 19 118m 118m 2152 R 99.4 15.7 837:11 foldtrajlite
Notice that this one's only been running for 14 hours...


31394 gms8994 20 19 103m 103m 2120 R 99.7 20.5 1290:18 foldtrajlite
And this one for 21 hours.

Both of the above machines have 768M and 512M of RAM respectively, and they both test good via memtest86.

However, they both also use PC2100 DDR RAM. Could that have anything to do with it, as they are my only two machines that do?
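
For anyone reading the two process listings above, here is a minimal sketch of how the second-to-last field lines up with the "14 hours" and "21 hours" figures. It assumes that field is accumulated CPU time in minutes:seconds, which fits the hours quoted but is not stated in the post; the helper name is made up for illustration.

```python
def cpu_time_hours(field):
    """Convert a 'minutes:seconds' string to hours (assumed field format)."""
    minutes, seconds = field.split(":")
    return (int(minutes) * 60 + int(seconds)) / 3600

# The two values are copied from the listings above.
for field in ("837:11", "1290:18"):
    print(field, "->", round(cpu_time_hours(field), 1), "hours")
```

Since both clients sit at roughly 99% CPU, CPU time and wall-clock time should be close, so the memory figures can be read as growth over about that many hours.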

FoBoT
06-29-2003, 06:26 PM
i just got home from vacation, one PC running ( -if -rt ) the whole time (WinXP Pro) has crunched the whole time, the memory being used by foldtrajlite.exe is up to 335MB

Grumpy
06-29-2003, 08:39 PM
I just had a look at one of my 98SE computers. Physical Memory: 256 MB. Physical Memory Free: 311K. The problem is that when you stop and restart to free up the memory, the client almost always starts putting out RMSDs that get higher and higher. So if you are on a good run and looking at breaking 5.00 and have to restart, you can kiss that hope goodbye :(

Beyond
06-29-2003, 09:41 PM
Interesting observation. I had not really looked that closely at what was happening, but now that you have mentioned it I took a look at the RMSD graphs on dfGUI and I see the same pattern of behavior. :(

PinHead
06-29-2003, 09:59 PM
Well so far I have it on:

Mandrake 8.2, 9.0, Win98SE, NT4 and Win2K.
P2, P3, P4 and XP processors.

Tried to let it go longer than 3 days and on day 5, 2 crashed with out of memory errors. 1 crashed with unable to find some file (basically it wasn't in memory any more).

magicfan241
07-01-2003, 08:17 AM
Is there a chance that the "memory leak" may not be a "memory leak" per se, but more something like the client storing all of the best ones from each generation in RAM and not dumping them at the end of the 250-generation set? I was going to do DF, but my boxen can't handle the memory usage that it is requiring.
I don't know if someone can test this, but it would explain the higher RMSD numbers when restarted: if it truly is a "learning" client, then the greater the data, the better the result. Or it could be a combination of a memory leak and it doing something like holding past results in RAM.

I don't know if someone could check this for me, my comp can't handle the mem req. for DF2.

Just my $0.02
Dancing Foliage
A.K.A.
magicfan241

Darkness Productions
07-01-2003, 08:34 AM
I think Howard made mention at one point that the logs and many other things were buffered in RAM, so that they wouldn't have to write to disk so often... If that's true, then it could explain somewhat why the memory usage keeps going up.

Angus
07-01-2003, 08:38 PM
:Pokes: (polite msg bump)

Anything new on this?

Does anyone at DF have ANY ideas about this yet? Seems like we've done a bit of data collection, but where is it leading?

TheOtherPhil
07-02-2003, 05:47 AM
After noticing my hourly production drop from ~12K to ~6K, I thought I'd restart my clients. The worst was my linux dually box (although it had been running the longest) with 440MB and 460MB per client. This machine has 1GB of RAM but it was still slowing it down. I'll check the next few updates to see if my daily production returns back to normal.

Angus
07-02-2003, 01:00 PM
Originally posted by Angus
:Pokes: (polite msg bump)

Anything new on this?

Does anyone at DF have ANY ideas about this yet? Seems like we've done a bit of data collection, but where is it leading?


BUMP

...waiting for some word from the DF team on this HUGE problem

Angus
07-03-2003, 03:58 PM
I take it from the silence of the DF team that we will not have a fix before the long weekend (US of A holiday).

Let's hope those suffering boxes will run that long without attention....

^7_of_9
07-04-2003, 09:32 AM
*bump* Everywhere I go I see people are dropping the client because of the memory leak killing machines left, right and centre. There's many a Bulletin Board where this is happening.

Can we get any kind of answer (Even if it's just to say you are working on it in some way) on this. Or do I have to hop on the next TTC Bus and drop on by Mt. Sinai in person to ask? :moon: :elephant::whip:

tpdooley
07-04-2003, 03:18 PM
All 5 of my win98se machines at work with 256Megs that have been running since the changeover to PhaseII were in memory starvation mode.. ranging from 1meg free to 20 megs free. Wondered why my scores seem so low.. and probably a blessing that some of my winxp systems stopped running the client because of the -s parameter.

AMDPHREAK
07-04-2003, 03:28 PM
Well if Howard and Co aren't responding, it is probably because they are beating their heads against another error or bug. Remember they want this to work too...:bang:

Can all of us with multi-boxen try and track this down? I have a prime candidate machine for a fresh install of Win2k, and a variety of other Win2K boxes to compare against. If Howard is busy, let's dig in and contribute what we can. (and thanx to those who have already done just that... keep it coming)

What I have gathered so far... (correct if wrong)

1) Non-specific to which MS OS is run. (any probs on linux, etc?)
2) Memory footprints of up to 300+ MB per instance when running with the -rt switch. (Any probs with -rt disabled?)
3) Restarting the process seems to dump the memory, but RMSD may suffer afterwards. Some machines even require reboot.
4) Possibly many users may be experiencing the problem and not know it, chalking it up to a crappy OS. (like moi)

Let's distill what we know and what we think and maybe Howard will be better able to fix it. Then again I may be full of it, so there ya go. :rotfl:

PinHead
07-04-2003, 06:32 PM
One thing I saw mentioned somewhere was asking about handles.

The other day I added Virtual Memory and Handles to my monitor.
To my surprise the handles were around 7500 after running for about 3 days. A restart brings the handles back down to between 125 and 150 and the memory back down to 90M.

So if you analyze the info in the pic ( hope it attached correctly), then you come up with one question:

What happens 1700 times in 12 hours and is approximately 22K in size?

When you know the answer, then that object (file handle, screen handle, blah blah handle) is not getting destroyed or released properly and is causing the memory drain.
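
As a rough check on the question above, here is a back-of-envelope sketch. It assumes only the two figures quoted in the question (about 1700 leaked objects in 12 hours, roughly 22K each) and treats 22K as 22 * 1024 bytes; the function name is made up for illustration.

```python
# Figures taken from the question above; both are approximate.
LEAKED_OBJECTS = 1700          # objects that were never released in the window
BYTES_PER_OBJECT = 22 * 1024   # "approximately 22K" each (assumed to mean KiB)
WINDOW_HOURS = 12

def leak_estimate(objects, bytes_each, hours):
    """Return (MB leaked in the window, MB leaked per 24 hours)."""
    total_mb = objects * bytes_each / (1024 * 1024)
    per_day_mb = total_mb * 24 / hours
    return total_mb, per_day_mb

total_mb, per_day_mb = leak_estimate(LEAKED_OBJECTS, BYTES_PER_OBJECT, WINDOW_HOURS)
print(f"{total_mb:.1f} MB per {WINDOW_HOURS} h, about {per_day_mb:.0f} MB per day")
```

That comes to a few tens of MB per half day, which is the same order of magnitude as the growth people are reporting in this thread.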

P.S. Yea Linux has the problem also.

[edit]
I have had the linux boxen run out of memory in as little as 32 hours and as much as 5 days. Nice thing is that the OS brings the client down safely.

AMDPHREAK
07-05-2003, 12:30 AM
After checking, I have 3 boxen running under 100MB usage after 9-12 hours, and 1 with 170MB after 12 hours. So it looks like I have one affected machine. Considering I run all of the affected machines with the same settings, I am really baffled as to the potential trigger on this one machine.

Two Win2K systems with identical settings and CLI installs should perform the same, right? I mean does the Mobo/RAM mfr really factor in (that is the only real difference except vid card and CPU clock from a working PC)

Too weird... But the affected box has 512M of RAM so we will see how bad it is willing to get.

magicfan241
07-05-2003, 09:07 AM
The only thing that I could think of being updated a whole lot in 3 days is the progress.txt file. Unless it also is the filelist.txt file. Neither of them is 22k (at my last checking), but getting lots of them in memory could bring about the memory hogging we are all seeing.

Digital Parasite
07-05-2003, 10:05 AM
Yes that is interesting about threads. On my XP system, one of my clients is using 102MB of RAM (123MB of Virtual) and its handle count is up to 622 now. It has 2 threads.

For some reason, my other DF client on the same machine (Dual Athlon) has 3 threads and it is using 142MB RAM (161MB Virtual) and its handle count is up to 3282.

The other interesting thing that I noticed is that I don't think I have stopped either of these two clients but one finished its 250 gen cycle, restarted and is up to gen 58, and that is the one using 102MB of RAM even though I started both of them at the same time. It might be possible that after the client completes its 250gen cycle that it recycles the memory back to 88MB. I have not confirmed this but people should watch their systems if you are close to finishing 250 gens to see if this is the case for you and report it here.

Jeff.

JetBlack69
07-07-2003, 11:04 AM
Jeff,

Rolling over from gen 250 to 0 did NOT drop the memory. When I started it was 88MB (I think), then on gen 250 the memory was 156MB, now at gen 0 it is 157MB.

EDIT: The highest I've let it go was around 275 MB before I restarted the client.

This is a fresh client install running as a service on Windows XP with dfGUI running.

Digital Parasite
07-07-2003, 11:12 AM
I was just going to post the same thing. I watched a client wrap from gen 250 to 0 and the memory usage didn't go down and the # of handles open didn't go down either.

So much for that theory...

Digital Parasite
07-07-2003, 01:25 PM
I think I may have found something. The "memory leak" might be related to how often the progress.txt file gets updated. At home I have my systems set to -g 5 and after letting the client run for a few days I am getting around 3000 handles.

On my machines at work I have -g 2 set and after 72 hours all my machines are using around 9000 handles so it seems the more updates that happen, the faster the handle count grows.

So the update rate seems to affect handle count but I'm not sure about memory usage. I will stop two of my clients, reset one to be -g 10 and see what the handle/memory usage is between the two tomorrow to compare.

Jeff.

Stardragon
07-07-2003, 01:30 PM
Hi all,

The reason for the silence is that we are still looking. We are going to run the client through special software that is designed to catch memory leaks that may exist in the code, but due to the way it works, it is a huge resource hog and would take a whole day to run. So this may not happen until the end of the week.

Would a few of you be kind enough to test Jeff's theory about the -g flag? It is entirely possible that more frequent writing will also result in more information sitting in memory waiting to be written out.

Thank you all for your help, as well as your patience.

Tateman
07-07-2003, 05:24 PM
After coming back this weekend to work, I noticed all my machines were borked. 900mb on one. 880mb on my main.

I'll try running two of my systems with different number of process updates.

JetBlack69
07-07-2003, 07:03 PM
I'll try with -g 0 (turns off updating progress.txt, right?) and see if memory use remains around 90-100.

BTW, my previous g setting was 5.

EDIT: I assume this will screw up dfGUI so it won't update my graphs.

Darkness Productions
07-07-2003, 08:56 PM
I think I speak for all of us when I say thanks for giving us an update, even if it's just a "we're looking, will report back later" kind of thing.

Project administrators who don't give their users any feedback don't stay project administrators for long


Originally posted by Stardragon
The reason for the silence is that we are still looking. We are going to run the client through special software that is designed to catch memory leaks that may exist in the code, but due to the way it works, it is a huge resource hog and would take a whole day to run. So this may not happen until the end of the week.

rsbriggs
07-07-2003, 09:21 PM
So - the current suspicion is that it might be a file handle leak? Possibly from progress.txt not being closed?
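
To make that suspicion concrete, here is a minimal sketch of what an unclosed-handle pattern looks like. The client's real code is not shown anywhere in this thread, so both update functions below are hypothetical. Python closes unreferenced files on its own, so the sketch deliberately keeps references in a list to imitate a program that opens a file on every update and never closes it.

```python
import os
import tempfile

leaked = []  # stands in for handles the program has lost track of

def write_progress_leaky(path, text):
    """Hypothetical update routine that opens the file and never closes it."""
    f = open(path, "w")
    f.write(text)
    f.flush()
    leaked.append(f)  # held forever, so the OS handle stays open

def write_progress_clean(path, text):
    """Same update, but the handle is released when the write finishes."""
    with open(path, "w") as f:
        f.write(text)

def open_handles():
    """Count handles from the leaky routine that are still open."""
    return sum(1 for f in leaked if not f.closed)

path = os.path.join(tempfile.mkdtemp(), "progress.txt")

before = open_handles()
for gen in range(50):
    write_progress_leaky(path, f"generation {gen}\n")
leaky_growth = open_handles() - before

before = open_handles()
for gen in range(50):
    write_progress_clean(path, f"generation {gen}\n")
clean_growth = open_handles() - before

print("leaky:", leaky_growth, "clean:", clean_growth)

for f in leaked:  # tidy up so the demo itself does not leak
    f.close()
```

With a pattern like the leaky one, the handle count would grow with the number of updates, which is the kind of behaviour to look for when comparing different -g settings.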

IronBits
07-07-2003, 10:50 PM
I don't use the -g flag at all. What does it use for default?

markhl
07-07-2003, 11:00 PM
I have not seen this memory leak issue; ran Phase II for about 2 weeks on a PIII 866 MHz running Windows XP SP1.
Running the text client, not as a service. Flags:

-df -rt -if for offline work
-df -rt -ut for brief connection to go online & upload results

However I never ran it for more than ~10 hours.

Grumpy
07-08-2003, 01:54 AM
I do not have that option ticked, so yeah, what does it default to ?

PinHead
07-08-2003, 01:54 AM
Originally posted by IronBits
I don't use the -g flag at all. What does it use for default?

I believe that "No -g" is the same as "-g5".

It updates the progress.txt file every 5 structures.

IronBits
07-08-2003, 02:27 AM
Using -g 0 on one boxen on UPS.
After 14 minutes, memory usage is not going up...
This is a good sign :)
Post follow up in the morning and let ya know what I find out.

Grumpy
07-08-2003, 09:42 AM
I have set my Duron 750 to 15, it is real slow and runs 98SE..it has the worst leak. I will post soon as to if the leak is reduced ;)

IronBits
07-08-2003, 10:04 AM
I have one that has been running for 2 hrs 15 min, with no -g switch.
It is using a tad more memory and has more pagefaults.
1st one listed below is the computer using -g 0 switch.
Hope it helps. I'll post again this evening.

Mem Usage   Page Faults   Runtime   Notes
88,816      26,047        0:00      start -g 0
97,644      1,106,427     7:19
99,636      1,129,604     2:22      no -g switch

Digital Parasite
07-08-2003, 01:46 PM
Well it looks like I have pretty good evidence that the -g switch is directly related to our memory leak problem.

Here are some results after 24 hours of folding:

Machine 1
-g 2
136MB RAM Usage
6991 Handles

Machine 2
-g 10
104MB RAM Usage
275 Handles

IronBits
07-08-2003, 09:29 PM
Mem Usage   Page Faults   Runtime   Handles   Notes
88,816      26,047        0:00      -         start -g 0
97,644      1,106,427     7:19      -
107,092     2,268,943     18:35     80

Grumpy
07-08-2003, 10:13 PM
I have DFGUI set to 15, memory usage is down to 43 MB Free Physical Ram. The same running time at defaults was 27 MB :( If it looks like a rat and smells like a rat, I say the rat leaks :mouserun:

JetBlack69
07-09-2003, 12:52 AM
After letting mine go for 24 hours, my mem usage is 145MB and I have 143 Handles with -g0.

MrMr
07-09-2003, 04:19 AM
Just some more datapoints:
850 minutes running : memory usage up from 105 to 130..159 M (IRIX64 bits clients -g 0, on-line)

1360 minutes running: memory usage up from 88 to 104M (Linux-icc clients -g 0 , off-line)

tpdooley
07-09-2003, 05:58 AM
how about adding the change in the generation# as well. (during the beta, I had a 26? hour gen, and someone else had a 2 day gen). are the handles related to the number of generations folded; or the number of incredibly difficult/easy generations folded?

Digital Parasite
07-09-2003, 08:39 AM
Here are some results after 42 hours of folding:

Machine 1
-g 2 (Current Gen: 105)
172MB RAM Usage
8582 Handles

Machine 2
-g 10 (Current Gen: 128)
116MB RAM Usage
395 Handles

As we can see with -g 2 the RAM usage is much higher and the number of handles is more than 10x the amount of the -g 10 machine. I have now recorded the generation # the clients are currently on so tomorrow we can see how many generations they have completed in that time.

IronBits
07-09-2003, 10:00 AM
Mem Usage   Page Faults   Runtime   Handles   Notes
88,816      26,047        0:00      -         start -g 0
97,644      1,106,427     7:19      -
107,092     2,268,943     18:35     80
116,696     3,425,957     30:44     113

IronBits
07-09-2003, 07:37 PM
Mem Usage   Page Faults   Runtime   Handles   Notes
88,816      26,047        0:00      -         start -g 0
97,644      1,106,427     7:19      -
107,092     2,268,943     18:35     80
116,696     3,425,957     30:44     113
125,600     4,473,971     39:58     126
Runs better with -g 0 switch for sure :)

pfb
07-10-2003, 05:07 AM
This is mine - ~32 hours run time, Windows XP, progress update set to 1, running as a service:

http://wibble.bounceme.net/DD/DF/hi_mem2.png

IronBits
07-10-2003, 09:17 AM
Mem Usage   Page Faults   Runtime   Handles   Notes
88,816      26,047        0:00      -         start -g 0
97,644      1,106,427     7:19      -
107,092     2,268,943     18:35     80
116,696     3,425,957     30:44     113
125,600     4,473,971     39:58     126
137,468     5,976,266     53:38     144

Digital Parasite
07-10-2003, 11:12 AM
My latest stats:

Machine 1
-g 2 (P4, Windows 2000, Service)
24 Hours: 136MB RAM Usage 6991 Handles
42 Hours: 172MB RAM Usage 8582 Handles (Current Gen: 105)
67 Hours: 196MB RAM Usage 9971 Handles (Current Gen: 141)


Machine 2
-g 10 (P4, Windows 2000, Service)
24 Hours: 104MB RAM Usage 275 Handles
42 Hours: 116MB RAM Usage 395 Handles (Current Gen: 128)
67 Hours: 128MB RAM Usage 533 Handles (Current Gen: 146)


So looking at my stats it's obvious that updating the progress.txt file more often contributes to the memory leak. Looking at IronBits' stats using -g 0 there is still a memory leak showing but much smaller (perhaps the updating of filelist.txt?).

Stardragon & Howard: That should at least give you an area to concentrate on in the code for your testing/investigation. Since this is happening on both Linux and Windows machines, it is unlikely to be an OS-related leak. There is probably something in the code that doesn't properly close or delete/free something.

Jeff.
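
The two sets of readings in this post can be turned into rough hourly growth rates. This is only a sketch: it uses the 24, 42 and 67 hour samples quoted above, assumes growth between the first and last sample is roughly linear, and the names are made up for illustration.

```python
# (hours, RAM in MB, handles) copied from the readings above.
samples = {
    "-g 2": [(24, 136, 6991), (42, 172, 8582), (67, 196, 9971)],
    "-g 10": [(24, 104, 275), (42, 116, 395), (67, 128, 533)],
}

def hourly_growth(rows):
    """Average MB/hour and handles/hour between the first and last sample."""
    (h0, mb0, n0), (h1, mb1, n1) = rows[0], rows[-1]
    span = h1 - h0
    return (mb1 - mb0) / span, (n1 - n0) / span

for flag, rows in samples.items():
    mb_rate, handle_rate = hourly_growth(rows)
    print(f"{flag}: {mb_rate:.2f} MB/h, {handle_rate:.1f} handles/h")
```

Over that window the -g 2 machine gains handles roughly ten times faster than the -g 10 machine, while its RAM grows only about two and a half times faster, which fits the idea that something besides the handles is also growing.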

JetBlack69
07-10-2003, 05:57 PM
Ok, it's been running for 62 hours and 50 minutes. It's using 176.6MB of ram and has 189 handles. It's on gen 165.

IronBits
07-10-2003, 07:46 PM
I think you have it about right Digital Parasite, not freeing up the ~malloc something. ;)


Mem Usage   Page Faults   Runtime   Handles   Notes
88,816      26,047        0:00      -         start -g 0
97,644      1,106,427     7:19      -
107,092     2,268,943     18:35     80
116,696     3,425,957     30:44     113
125,600     4,473,971     39:58     126
137,468     5,976,266     53:38     144
145,552     6,980,068     64:04     156

theBRAINbelly
07-10-2003, 09:04 PM
Here are some handles....

http://www.adflix.com/wb/Images/distrubfold-handles-1.jpg

rsbriggs
07-10-2003, 09:45 PM
Ouch - it's a wonder that system is still up and running......

Brian the Fist
07-10-2003, 10:25 PM
While I have scoured the code with a fine-toothed comb (don't take that too literally) I have not located any possible leaks myself yet. I have not been able to do a more robust check for leaks (we use Purify among other things) since I've been away but will do so shortly. The progress.txt updating thing is very bizarre though. I'll take another look at that and see if there are any unclosed file handles, though I've checked before.

Digital Parasite
07-11-2003, 12:36 PM
Last posting before I cycle these two boxes:

Machine 1
-g 2 (P4, Windows 2000, Service)
24 Hours: 136MB RAM Usage 6991 Handles
42 Hours: 172MB RAM Usage 8582 Handles (Current Gen: 105)
67 Hours: 196MB RAM Usage 9971 Handles (Current Gen: 141)
93 Hours: 224MB RAM Usage 11326 Handles (Current Gen: 183)


Machine 2
-g 10 (P4, Windows 2000, Service)
24 Hours: 104MB RAM Usage 275 Handles
42 Hours: 116MB RAM Usage 395 Handles (Current Gen: 128)
67 Hours: 128MB RAM Usage 533 Handles (Current Gen: 146)
93 Hours: 146MB RAM Usage 711 Handles (Current Gen: 173)

Ned
07-11-2003, 02:07 PM
It seems to me that the operating systems are too lazy to give the memory for the file handles back to the memory pool until after the process ends...

Perhaps if you dig deep enough into the operating systems' APIs you might find a way to re-use the file handles. (This may be assuming toooo much!).

Ned :rolleyes:

Angus
07-11-2003, 02:17 PM
Here's an interesting effect:

All four of these clients (on a 4 CPU box) were restarted at the same time yesterday afternoon, with the exact same settings. Each client runs from its own folder.

Go figure.....

DB7654321
07-14-2003, 04:12 AM
I've been experiencing this issue on my DF boxen, too.


BUMP.

AMDPHREAK
07-14-2003, 07:57 AM
Just an "outside the box" thought... if Howard and Co. cannot solve this problem quickly (and it sounds elusive enough to warrant this assumption) could Jeff just make a sub-version of DFGui that restarts the cli every 250 gens?

No more babysitting, no more out-of-memory related crashes. Not to say the DF team shouldn't keep looking, but wouldn't this be a comparatively easy stop-gap measure? I know there are lots of afflicted boxen and folders out there just waiting for SOME workaround other than manual restarts.

Just my .02 tossed in the ring for your consideration...:idea:

Ned
07-14-2003, 08:21 AM
It seems to me that the operating systems are too lazy to give the memory for the file handles back to the memory pool until after the process ends...

If Howard cannot find a generic way to stop the file handles from grabbing incremental amounts of memory, he should terminate foldtrajlite.exe every 50 generations (pick an appropriate number) or so with a specific return code that would signal foldit.bat to simply restart the client. That way, the program can pick a situation where no work is lost, the operating system will restore the resources, and we get a client requiring less handholding.
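
The return-code idea can be sketched as a small supervisor loop. Everything specific here is an assumption for illustration: the value of RESTART_CODE is made up, the client does not currently exit with any such code, and the stand-in child below is a throwaway script rather than foldtrajlite.

```python
import os
import subprocess
import sys
import tempfile

RESTART_CODE = 75  # hypothetical value meaning "stopped on purpose, start me again"

def supervise(cmd, max_restarts=1000):
    """Run cmd, restarting it each time it exits with RESTART_CODE."""
    restarts = 0
    while True:
        code = subprocess.run(cmd).returncode
        if code != RESTART_CODE or restarts >= max_restarts:
            return code, restarts
        restarts += 1

# Stand-in for the client: asks for a restart twice, then exits normally.
child = (
    "import sys\n"
    "from pathlib import Path\n"
    "p = Path(sys.argv[1])\n"
    "n = int(p.read_text()) if p.exists() else 0\n"
    "p.write_text(str(n + 1))\n"
    f"sys.exit({RESTART_CODE} if n < 2 else 0)\n"
)
counter = os.path.join(tempfile.mkdtemp(), "runs.txt")
final_code, restarts = supervise([sys.executable, "-c", child, counter])
print("exit code", final_code, "after", restarts, "restarts")
```

Any other exit code ends the loop, so a genuine crash or a deliberate stop is not papered over by endless restarts.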


could Jeff just make a sub-version of DFGui that restarts the cli every 250 gens?

dfGui does not have that kind of control over the client. Remember that it is designed to observe. dfGui could stop and restart the client on a time basis like say every 24 hours.

Ned

IronBits
07-14-2003, 09:44 AM
I have found a better solution
Windows Scheduler from http://www.splinterware.com/products/wincron.htm
It allows you to run a script HIDDEN, and has the ability to hide itself from the ICON tray
I use it, and a special script, to restart all my clients every 24 hours ;)

Digital Parasite
07-14-2003, 11:30 AM
dfGUI *could* watch the client and restart it every 250 generations, and it could also do it after X hours.

dfGUI knows when the client has finished a 250 gen set (that is when it takes a snapshot of the graphs) so I could add a feature to stop and re-start the client again then. I could also do that every X hours as well.

I was hoping that I wouldn't have to code such a feature but since it is a major problem for many (including me) if this doesn't get fixed soon I think I will add that to dfGUI.

Jeff.
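
The two restart triggers described in this post could be combined along these lines. This is a sketch of the decision only, not dfGUI's code: the function name, the 24-hour default, and the idea of detecting a finished set by the generation counter dropping back towards 0 are all assumptions.

```python
def should_restart(prev_gen, cur_gen, hours_running, max_hours=24.0):
    """Return True when a restart is due.

    A restart is due when the generation counter has wrapped (taken here
    as the sign that a 250-gen set finished) or when the client has been
    running for max_hours or longer.
    """
    finished_set = cur_gen < prev_gen
    too_long = hours_running >= max_hours
    return finished_set or too_long

print(should_restart(249, 0, hours_running=5.0))    # counter wrapped
print(should_restart(10, 11, hours_running=5.0))    # nothing to do yet
print(should_restart(10, 11, hours_running=30.0))   # time limit reached
```

Restarting at the wrap point is the gentler option, since a set has just completed; the time limit is a backstop for slow machines where a set can take days.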

Dyyryath
07-14-2003, 01:16 PM
This is OT for this thread, but I'd like to see dfGUI give me the number of 'points' per hour/day based on the current generation being crunched. It'd make it easier to compare output rates...

PinHead
07-15-2003, 01:14 AM
Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please Please!

Keep checking on this error!

Maybe it's a window handle and not a file handle. I have seen the handle and memory leak occur in as little as 36 hours.

Check the code where a stuck client displays the current alternate structure or where it doesn't get displayed. If it can happen in under 36 hours then maybe it is when it doesn't display the alt calculation.

^7_of_9
07-15-2003, 01:22 PM
[Bad Joke On]

It's about time we got a handle on this now ....

[Bad Joke Off]

Angus
07-15-2003, 06:22 PM
It really is getting old - having to restart 25 clients every day.

What is the timeline for DF to fix this? Are they DOING ANYTHING about it? It certainly can't take long to test a fix - the thing rears its ugly head very quickly.

Let's get a DAILY update on the progress.

pfb
07-15-2003, 06:25 PM
Angus et al - you seen this (http://www.free-dc.org/forum/showthread.php?s=&threadid=3591) thread? There's a new Windows client that should fix this problem...

erk
07-18-2003, 05:05 PM
And a bug fixed FreeBSD or Linux client when?

Brian the Fist
07-18-2003, 05:19 PM
We will be switching to a new client next Tues, as posted today in the News section of the website.

PinHead
07-18-2003, 10:45 PM
Apparently you don't consider the handle (which consumes memory) a memory leak. So could you please apply a similar code change to the linux client?????

Pretty Please!! with StarDragon on top!!

Brian the Fist
07-19-2003, 01:32 PM
Firstly, all clients use the same code. If we found a memory leak, we would find it in all of them and fix it in all of them. However we found no memory leaks after rigorous searching, only a handle leak - a Windows registry handle leak, which obviously doesn't apply under Linux.

Try the new version next week on Linux and if you suspect memory is still leaking, let us know and please post your evidence as well. Thanks.

erk
07-19-2003, 05:17 PM
Originally posted by Brian the Fist
Firstly, all clients use the same code. If we found a memory leak, we would find it in all of them and fix it in all of them. However we found no memory leaks after rigorous searching, only a handle leak - a Windows registry handle leak, which obviously doesn't apply under Linux.

Try the new version next week on Linux and if you suspect memory is still leaking, let us know and please post your evidence as well. Thanks.

Well at the moment there is absolutely an increase in memory usage related to the value of the -g flag, e.g. -g5 uses a lot more memory in 24 hours than -g25 does. This happens on all the FreeBSD, Linux, and MacOS X servers that I run.

IronBits
07-19-2003, 05:23 PM
Correct,
The rest will be fixed on Tuesday when the new client is released.
Until then, only the Windows version was fixed in a 'beta' ...

PinHead
07-20-2003, 01:05 AM
Originally posted by Brian the Fist
Firstly, all clients use the same code. If we found a memory leak, we would find it in all of them and fix it in all of them. However we found no memory leaks after rigorous searching, only a handle leak - a Windows registry handle leak, which obviously doesn't apply under Linux.

Try the new version next week on Linux and if you suspect memory is still leaking, let us know and please post your evidence as well. Thanks.

All I can tell you at the moment is that linux clients die after 3 to 5 days with an "out of memory error". The nice part about it is that the os sends the kill signal and the client doesn't corrupt its files.
Sometimes the ".lock" file is there and sometimes it is not.

But yes, the linux client does consume memory like the windows client.
I will try tuesday's client and post the results.

bwkaz
07-20-2003, 11:34 AM
Originally posted by PinHead
I will try tuesday's client and post the results.

As will I (though it won't be running for very long on any machine except the P3-800 that's being a router / firewall).

I did check up on that machine last night (it had been running since last time I restarted it because it was using too much memory -- that was July 7th), and found it using 280MB of virtual (VSZ, according to ps / top) and 180MB of physical (at least, it was what ps / top call RSS) memory. It hadn't gotten hit by the OOM killer yet, though (the machine has 256MB, and 384MB of swap), so I manually restarted it.

Of course, none of this information really does any good for Howard and company. But whatever.

If it still seems to be happening after Tuesday, what information would you want in a bugreport? One line of output of "ps aux" (the line for the foldtrajlite process)? One line of top's output? Any special arguments to pass to top to get some of the info you'd need?
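
For anyone collecting that line repeatedly, here is a rough sketch of pulling VSZ and RSS out of one line of "ps aux" output. It assumes the usual Linux column order (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND) with VSZ and RSS in KiB; the function name is made up for this example, and other systems may lay the columns out differently.

```python
def parse_ps_aux_line(line):
    """Extract pid, VSZ and RSS (converted to MB) and the command
    from one data line of `ps aux` output.

    Assumes the standard 11-column Linux layout; the command column
    may contain spaces, so only the first 10 separators are split on.
    """
    fields = line.split(None, 10)
    if len(fields) < 11:
        raise ValueError("expected 11 columns of `ps aux` output")
    return {
        "pid": int(fields[1]),
        "vsz_mb": int(fields[4]) / 1024,  # VSZ column is in KiB
        "rss_mb": int(fields[5]) / 1024,  # RSS column is in KiB
        "command": fields[10],
    }
```

Logging the result every few hours, together with the -g value in use, would show whether the growth is steady and how fast it is.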

Brian the Fist
07-20-2003, 12:30 PM
Originally posted by bwkaz

If it still seems to be happening after Tuesday, what information would you want in a bugreport? One line of output of "ps aux" (the line for the foldtrajlite process)? One line of top's output? Any special arguments to pass to top to get some of the info you'd need?

top or ps output would be perfect

Grumpy
07-21-2003, 06:29 PM
Well, for those curious, I just had to reboot my Win2000 computer, which is a Duallie. One instance runs the original Client, the other the upgraded version. Both were restarted when the update was posted. The Client using the original was up to 386 MB, the modified Client 96 MB.

:cheers: