So I'm sure all the farmers and cluster owners have their own preferred method of monitoring their clients.

I generally have mosixview or mosixgui running on my cluster controller and can see 50 machines that way. I know that KDFold et al. can watch the progress.

Problem is, all of these steal processing power and bandwidth, and none of them works well over a WAN. Since my system is offsite, I'd like to watch it from 10 miles away now and again.

My non-clustered machines have a simple script that runs as a cron job. It does an uptime and pushes that to a web server where I can read it. Not optimal, but it got me thinking.

Being an old Sun hacker, I was wondering if rup still appears in modern unices.

So I logged into a Linux machine and sure enough, there it was!

You need portmap and rpc.rstatd running on each machine. Minimal processor utilization.
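Before pointing rup at a node, it can help to confirm that rstatd is actually registered with the portmapper there. A minimal sketch, assuming rpcinfo is installed on the controller; the function name has_rstatd and the RPCINFO variable are made up for illustration:

```shell
#!/bin/sh
# has_rstatd: succeed if the given host has rstatd registered with its
# portmapper. RPCINFO is a hypothetical override so the command can be
# swapped out for testing.
RPCINFO=${RPCINFO:-rpcinfo}

has_rstatd() {
    # $1 = host to probe. "rpcinfo -p" lists the RPC programs registered
    # on that host; rstatd is RPC program number 100001.
    "$RPCINFO" -p "$1" 2>/dev/null |
        awk '$1 == 100001 { found = 1 } END { exit !found }'
}

# Example:
# has_rstatd 10.100.10.1 && echo "rstatd is answering" || echo "no rstatd"
```

The exit status is that of awk, so the function can be used directly in an if or with && and ||.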

Then I wrote a simple script that runs as a cron job on my node controller:

rup 10.100.10.1 (for example)
rup 10.100.10.2
...

I named the script dfuptime.
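With 50 nodes on consecutive addresses, the list of rup lines can be collapsed into a loop. This is only a sketch of how dfuptime might look in that form. It assumes the nodes sit on one contiguous range, and the RUP variable is a made-up override for testing:

```shell
#!/bin/sh
# dfuptime (loop form): run rup against each node in an address range
# and print one status line per host.
RUP=${RUP:-rup}    # hypothetical override, handy for dry runs

dfuptime() {
    # $1 = network prefix, $2 = first host number, $3 = last host number
    i=$2
    while [ "$i" -le "$3" ]; do
        # Fold stderr into the report so that anything rup complains
        # about for a dead node still shows up in the output.
        "$RUP" "$1.$i" 2>&1
        i=$((i + 1))
    done
}

# Example: poll 10.100.10.1 through 10.100.10.50
# dfuptime 10.100.10 1 50
```

Setting RUP=echo gives a dry run that just prints the addresses that would be polled.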

That script spits out something that looks like:

Code:
10.100.10.39             up   6 days,  6:42,    load average: 0.99 0.97 0.95
10.100.10.40             up   6 days,  6:42,    load average: 0.99 0.97 0.98
10.100.10.41             up   6 days,  6:42,    load average: 0.99 0.97 0.96
10.100.10.42             up   6 days,  6:42,    load average: 1.00 0.99 0.95
10.100.10.43             up   6 days,  6:42,    load average: 0.99 0.97 0.95
10.100.10.44             up   6 days,  6:41,    load average: 0.99 0.97 0.94
10.100.10.45             up   6 days,  6:42,    load average: 0.99 0.97 0.94

I run it as a cron job every 20 minutes.

Then, as a second cron job every 22 minutes (to give the first one time to finish), I take the output from that script and pipe it into sendmail, addressed to the mobile address on my palmtop. Every 25 minutes or so I get an email that shows the uptime of all my machines.
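Roughly, the mailing half could look like the sketch below. It is not the exact script I run: the function name mail_report, the file paths, the address and the cron schedule in the comments are all placeholders. The only thing it relies on is that sendmail -t takes its recipients from the To: header.

```shell
#!/bin/sh
# mail_report: wrap a saved status report in minimal headers and hand it
# to sendmail. SENDMAIL is a hypothetical override for testing.
SENDMAIL=${SENDMAIL:-/usr/sbin/sendmail}

mail_report() {
    # $1 = recipient address, $2 = file holding the report
    {
        printf 'To: %s\n' "$1"
        printf 'Subject: dfuptime report\n'
        printf '\n'                # a blank line ends the headers
        cat "$2"
    } | "$SENDMAIL" -t             # -t: read recipients from the headers
}

# Hypothetical crontab entries, with the mail job trailing the poll by
# two minutes (paths and address are placeholders):
# 0,20,40 * * * * /usr/local/bin/dfuptime > /tmp/dfuptime.out 2>&1
# 2,22,42 * * * * . /usr/local/bin/mail_report.sh; mail_report me@example.com /tmp/dfuptime.out
```

Writing the report to a file first means the mail job never has to wait on a slow rup.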

They're dedicated to DF, so DF should be running all the time and be nearly the only thing running. That means I should see a processor load between 0.95 and 1.00 or so. If I see between 0.00 and 0.10, I know that the client has crashed and can deal with it.

The next step will be an awk/sed or perl script to filter out the load averages above 0.80 and only email me if a client has crashed. When that happens, I can email my cell phone instead of my palmtop and have notification within 20 minutes or so of a client crashing.
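A first stab at that filter in awk might look like this. It is a sketch that assumes the output format shown above, with the 1-minute figure first after "load average:". The function name find_idle is made up, and the 0.80 default is just the cutoff mentioned above.

```shell
#!/bin/sh
# find_idle: read dfuptime output on stdin and print only the hosts whose
# 1-minute load average is below the limit (default 0.80).
find_idle() {
    awk -v limit="${1:-0.80}" '
        NF == 0 { next }                       # skip blank lines
        {
            pos = index($0, "load average:")
            if (pos == 0) {                    # no load figures on this line
                print "no load data: " $0
                next
            }
            rest = substr($0, pos + length("load average:"))
            gsub(/,/, " ", rest)               # tolerate comma-separated figures
            split(rest, load, " ")
            if (load[1] + 0 < limit + 0) print $1, load[1]
        }'
}

# Example: only produce output (and so only send mail) when a node is idle.
# idle=$(dfuptime | find_idle 0.80)
# [ -n "$idle" ] && echo "$idle"
```

Lines without any load figures are passed through with a prefix, on the theory that a node that does not answer at all is at least as interesting as one sitting at 0.03.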

Benefit here is that this works on both clustered and non-clustered unices. Unless there's an r* series for Windows, you're outta luck there. (I think Cygnus may have such.)

A BIG WARNING, however:

The r* commands are inherently dangerous to your network security. I'm using them here in a push-only mode behind a *very* strong firewall. Use them at your own risk! You've been WARNED. Possible side effects include. . . . .

Anyway - anyone else have a cleaner way?