DCMonitor and DFMonitor oddities



^7_of_9
10-30-2003, 12:39 PM
Been using DCMonitor for a while now, and not too long before the last changeover it stopped monitoring three Linux clients properly for some reason.

Until that time it always told me if the client was running (I'm assuming both these programs only look for the lock file to see if it's running). Nothing has changed whatsoever on any of the Linux clients. "Voyager" is the main machine, running NFS exports for the other two machines, which run diskless. The paths from the Windows machine that is running DF and DC Monitor are \\voyager\nfs\exports\Picard\distribfold, \\voyager\nfs\exports\Kirk\distribfold and \\voyager\DC\distribfold (the last one is the local client on that machine).

No permissions problem as I've still got full access to those directories and other shared ones on the same machine. (Using SAMBA for Windows access)

Now for some reason they report the client as always being "stopped" ... Here are the URLs of the HTML pages, which are updated once every 5 minutes for DCMonitor and every 10 minutes for DFMonitor:

DC Monitor: http://sigs.teampicard.com/DCMonitor/
DF Monitor: http://sigs.teampicard.com/DFMonitor/

You'll see that Voyager, Kirk and Picard are all either "Stalled" or "Stopped" (don't worry about Vinculum, as that's a machine that the new client always seems to crash on .. :rolleyes: ). I know for a fact they aren't, though: I've turned off quiet mode on the client and can see it going through the processes and everything, but the monitors still report that the client isn't running. I've even gone in and run a ps -e and a top to see what was going on, and no problems there either. My output is about what it should be as well.

So I'm REALLY :confused: on this one now ...

pfb
10-30-2003, 12:55 PM
I can't answer for DCMonitor but for DFMonitor:


(I'm assuming both these programs only look for the lock file to see if it's running)

DFMonitor doesn't - it looks at progress.txt only. If the file is there and has been updated recently, DF is running; if it's there but older than the "stalled" setting (in minutes), it hasn't been updated and DF is stalled; and if it's not there, DF is stopped.
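As a rough sketch of the check described above (this is not DFMonitor's actual code; the function name `classify_client` and its parameters are made up for illustration, and the 10-minute default simply mirrors the default stalled setting):

```python
import os
import time


def classify_client(progress_path, stalled_minutes=10, now=None):
    """Classify a client as 'stopped', 'stalled' or 'running'.

    Hypothetical helper: it only looks at whether progress.txt exists
    and how old its last-modified timestamp is.
    """
    # No progress.txt at all: treat the client as stopped.
    if not os.path.exists(progress_path):
        return "stopped"

    # Age of the file in minutes, measured against the local clock.
    now = time.time() if now is None else now
    age_minutes = (now - os.path.getmtime(progress_path)) / 60.0

    # Present but not updated within the stalled window: stalled.
    if age_minutes > stalled_minutes:
        return "stalled"
    return "running"
```

Note that a check like this trusts the modified time exactly as the share reports it, so a wrong timestamp on the share is enough to make a busy client look stalled.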

Do you know what the last modified date/time of those 3 progress.txt files is? And what do you have the stalled setting as? Is there a .lock file for each client?

I have noticed stalled being flagged a bit more often with this protein with the default setting of 10 mins - but would be a bit concerned if the age of progress.txt is a couple of hours...

It is a bit odd that the client is running (have you checked progress.txt to confirm this?) but the utils are saying stopped/stalled...if you had lost network access to them DFMon would say the clients had stopped...

/me getting a bit :confused: over what is happening as well

/edit - just remembered...had a similar issue with dfMon and my Linux client (shared via Samba) where Windows was removing an hour from the modified time, which meant it showed as 'stalled' - restarting Samba on the Linux box fixed it.
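One way to spot that kind of one-hour shift, as a minimal sketch only: check whether the apparent age of progress.txt is what you'd expect from a freshly updated file that has had exactly one hour taken off its timestamp. The function name `hour_skew_suspected` and the 600-second window are assumptions for illustration, not anything either monitor does.

```python
ONE_HOUR_SECONDS = 3600


def hour_skew_suspected(apparent_age_seconds, max_real_age_seconds=600):
    """Return True if the age looks like 'fresh file, shifted by one hour'.

    Hypothetical helper: apparent_age_seconds is the age of the file as
    seen over the share; max_real_age_seconds is how old a healthy
    progress.txt is ever expected to get.
    """
    # Remove the suspected one-hour shift and see if what is left
    # falls inside the normal update window.
    shifted_age = apparent_age_seconds - ONE_HOUR_SECONDS
    return 0 <= shifted_age <= max_real_age_seconds
```

If this comes back True for every client on the same share at once, the timestamps are a more likely culprit than the clients themselves.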

^7_of_9
10-30-2003, 02:16 PM
Thanks for the quick reply.

I'll have to check the Date/Timestamp on the files through Windows and also the machines themselves to see if that's the problem there. I know it's not a network not reachable thing as I can still get to their individual files over the network. You might have a good idea on the Samba issue there, I'll restart Samba tonight as well and see if that's the problem.

I don't think that it's the issue with the new protein being so large either as it started last week as well when it was on the 64 one. For the life of me I can't think of ANYTHING that happened around that time to cause everything to go out of kilter at all.

Also I just realised that DCMonitor also prob looks at the progress.txt file as well to give info on Generations and stuff too.

pfb
10-30-2003, 02:35 PM
With the Samba thing - you didn't have a time change last week? My problem occurred due to going back to GMT from BST...

^7_of_9
10-30-2003, 02:39 PM
Originally posted by pfb
With the Samba thing - you didn't have a time change last week? My problem occurred due to going back to GMT from BST...

AWESOME! I'm EST here and TOTALLY forgot about that! (I'm so used to my Server machine acting as my time server that I forgot the Linux machines don't use it as a time server.) Great news that it's not something I screwed up and is most likely only the "Fall Back" time change! :D

^7_of_9
10-30-2003, 09:47 PM
Well, the time trick wasn't it, as the Linux machines had successfully updated to the proper time by themselves.

I SSH'd into the main Linux machine and restarted Samba ... BOOM! back up and showing as running now. :)

Thanks for the help.