ZZZ again

**PY 222** · 07-20-2005, 08:06 PM

Ok, so what is up with the server now?

I am getting ZZZ on my clients.

Maybe it needs a kick.

**vaughan** · 07-20-2005, 08:32 PM

Urghh! Lots of those annoying "ZZZ - Sleep for 60 seconds" messages today

And my backup project - BOINC Sztaki will only d/l 1 task at a time and won't auto fetch new work so my CPU has been idling way too much today.

Methinks I need a 2nd backup project ...

**black_civic55** · 07-20-2005, 08:36 PM

Dumb question, but I also haven't had any time to figure it out for myself. Can I run eOn as my main project and have SOB on for when eOn fails, or will SOB take over completely?

**PY 222** · 07-20-2005, 08:51 PM

Originally posted by vaughan
Methinks I need a 2nd backup project ...

I've just moved my eOn boxes back to Find-A-Drug.

Maybe you can check that project out too.

**PY 222** · 07-20-2005, 08:52 PM

Originally posted by black_civic55
Dumb question, but I also haven't had any time to figure it out for myself. Can I run eOn as my main project and have SOB on for when eOn fails, or will SOB take over completely?

That is completely possible. I am guessing that you would want to run both projects at the same time but have one project on a higher priority level than the other.

Is that correct?

**black_civic55** · 07-20-2005, 10:17 PM

correct, pretty much SOB chillin there until eOn goes down.

**vaughan** · 07-21-2005, 12:39 AM

I have SOB running at Low Priority (not Idle) in the SOB Configuration and eOn running stock ie. as it installed, on a 3.2GHz P4 Prescott with hyperthreading. Windows XP Pro SP2 Task Manager shows 50% CPU usage for each project.

When eOn gets an attack of the ZZZs SOB is still running but it is only reported as using 50% CPU by Task Manager. At least the box isn't sitting idle.

**and_ped10** · 07-21-2005, 07:21 AM

Hi,

Sorry about the server down time.

The reason for the downtime is that we use a small Unix script to nurse the server; part of the nursing is to restart the server automatically if it stops. When the server restarts, information from the old server (the one that stopped) is copied over. There was a bug related to this procedure, but it should be fixed now.

Kindly Andreas

**PY 222** · 07-26-2005, 12:28 PM

I am seeing ZZZ again and it's been a while now.

Any idea when the server will be up?

**Bok** · 07-26-2005, 12:38 PM

pop a few P4's on SoB PY

Yeah I'm impatient to get the overtake done....

Bok

**magnav0x** · 07-26-2005, 01:24 PM

Originally posted by Bok
pop a few P4's on SoB PY Yeah I'm impatient to get the overtake done....

Bok

According to USD's frontpage, they should be switching efforts over to E@H tomorrow:

07-13-05 to 07-26-05 -> SoB
07-27-05 to 08-09-05 -> Einstein@Home

**PY 222** · 07-26-2005, 02:34 PM

Originally posted by Bok
pop a few P4's on SoB PY Yeah I'm impatient to get the overtake done....

Bok

LOL. You guys don't need me there.

Be patient Bok. Eventually you will overtake them.

**PY 222** · 07-26-2005, 03:57 PM

Still no word from the eOn people?

**vaughan** · 07-27-2005, 03:11 AM

Knock knock Graeme - time to reboot the eOn server.

:sleepy:

:sleepy:

:sleepy:

**and_ped10** · 07-27-2005, 03:58 AM

Hi everybody,

We always try to answer all questions ASAP!

I would like to know if your clients' output is still ZZZ. The server currently appears to be running steadily; we receive data from clients and it is stored.

An issue could be that the server is only checked every 5 minutes to see if it is running, so once in a while the server will be down, but for 5 minutes at most. If you checked the client's output during one of these bad blocks, that could explain the ZZZ.

Hope the answer is satisfying.

Kindly Andreas
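
A rough sketch of the kind of check-and-restart loop Andreas describes, in Python rather than the project's actual Unix script (which was not posted). The names `nurse`, `is_running` and `restart` are made up for illustration; only the 5-minute interval comes from the post.

```python
import time
from typing import Callable

CHECK_INTERVAL_S = 5 * 60  # the 5-minute check interval mentioned in the post


def nurse(is_running: Callable[[], bool],
          restart: Callable[[], None],
          checks: int,
          sleep: Callable[[float], None] = time.sleep,
          interval_s: float = CHECK_INTERVAL_S) -> int:
    """Poll the server `checks` times and restart it whenever it is down.

    `is_running` and `restart` are hypothetical callbacks standing in for
    whatever the real script does. Returns the number of restarts performed.
    """
    restarts = 0
    for _ in range(checks):
        if not is_running():
            restart()
            restarts += 1
        sleep(interval_s)  # the server can be down for up to this long unnoticed
    return restarts
```

With a loop like this, a crash right after a check goes unnoticed until the next check, which is where the "5 minutes at most" figure would come from. It would not explain outages that last much longer than that.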

**black_civic55** · 07-27-2005, 10:54 AM

it was out for a lot longer than 5 mins but it appears to be running just fine now

**AMDave** · 07-29-2005, 09:03 AM

No zzz at the moment

I have noticed the client gulping huge amounts of RAM at times.
I was getting a lot of "result of multiplier out of range" or some such.
eventually the client stopped altogether.

I trapped this from the log:

Code:

 [FIDA] ZZZ - Sleep for 60 seconds
 [FIDA] Reporting Results
 [FIDA] Connecting to Server
 [FIDA] Work Unit Completed
 [FIDA] Requesting Assignment
 [FIDA] Connecting to Server
 [FIDA] Shutting Down Client
Detected memory leaks!
Dumping objects ->
{6085632} normal block at 0x0153ABD8, 178480 bytes long.
 Data: <                > 0B 06 00 00 0C 06 00 00 0D 06 00 00 0E 06 00 00 
{5970611} normal block at 0x0145EA80, 37128 bytes long.
 Data: < ` f i >       >> 00 60 08 66 1C 69 A4 3E 00 D0 C6 98 91 C8 93 3E 
{5970610} normal block at 0x0132FB18, 6188 bytes long.
 Data: <                > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
{5970609} normal block at 0x015258E0, 12376 bytes long.
 Data: < |  B  ?  " B  ?> 94 7C CD E8 42 A4 F0 3F 8B C5 22 E2 42 A4 F0 3F 
{5970608} normal block at 0x013631B8, 12376 bytes long.
 Data: <    L  ?  T7L  ?> D7 96 FF 11 4C 8E FD 3F BF 83 54 37 4C 8E FD 3F 
{5970607} normal block at 0x014189C8, 12376 bytes long.
 Data: < m  i, ?p y i, ?> 80 6D 9E 1D 69 2C C6 3F 70 CA 79 95 69 2C C6 3F 
{5970606} normal block at 0x01225288, 12376 bytes long.
 Data: < (_j   ? /     ?> C0 28 5F 6A 01 84 C0 3F 80 2F F5 B7 00 84 C0 3F 
{5970605} normal block at 0x010C9028, 12376 bytes long.
 Data: <        : ag    > 00 86 AA 8C B3 15 F7 BF 3A DB 61 67 B3 15 F7 BF 
{5970604} normal block at 0x01191028, 12376 bytes long.
 Data: <   ao  @ J6 n  @> E2 05 AC 61 6F F7 20 40 98 4A 36 B2 6E F7 20 40 
{5970603} normal block at 0x011F7AE8, 12376 bytes long.
 Data: <   6   @       @> EF D2 1B 36 96 EC 20 40 E3 91 96 80 95 EC 20 40 
{5970601} normal block at 0x0116D828, 16 bytes long.
 Data: <    (    @   @  > CC CD CD CD 28 10 19 01 80 40 19 01 80 40 19 01 
{5970599} normal block at 0x0116D760, 16 bytes long.
 Data: <     z  @   @   > CC CD CD CD E8 7A 1F 01 40 AB 1F 01 40 AB 1F 01 
{5970598} normal block at 0x01421028, 37128 bytes long.
 Data: <   @   ?   @   ?> 01 00 00 40 0C 02 F2 3F 01 00 00 40 0C 02 F2 3F 
{5970597} normal block at 0x014DAFB0, 74256 bytes long.
 Data: <   @   ?   @   ?> 00 00 00 40 0C 02 F2 3F 00 00 00 40 0C 02 F2 3F 
{5970510} normal block at 0x014D9F78, 4096 bytes long.
 Data: <                > A4 04 00 00 09 00 00 00 A5 04 00 00 09 00 00 00 
{5969577} normal block at 0x010E7570, 6192 bytes long.
 Data: <    !   A   l   > 00 00 00 00 21 00 00 00 41 00 00 00 6C 00 00 00 
{5969576} normal block at 0x01163158, 96 bytes long.
 Data: <p c             > 70 06 63 00 0B 06 00 00 0B 06 00 00 CD CD CD CD 
{5969575} normal block at 0x01162F30, 504 bytes long.
 Data: <                > 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
{5969574} normal block at 0x011A30B8, 8 bytes long.
 Data: <       ?> 00 00 00 00 00 00 F0 3F 
{5969573} normal block at 0x011A3080, 12 bytes long.
 Data: <         0  > 01 00 00 00 01 00 00 00 B8 30 1A 01 
{5969572} normal block at 0x006307C8, 4 bytes long.
 Data: <  c > E8 06 63 00 
{5969571} normal block at 0x00630790, 4 bytes long.
 Data: <  c > E8 06 63 00 
{5969570} normal block at 0x006306E8, 120 bytes long.
 Data: < n      lxz ,C ?> 98 6E 12 83 C0 CA F7 BF 6C 78 7A A5 2C 43 FC 3F 
{5969569} normal block at 0x011A3028, 33 bytes long.
 Data: < Mg             > 00 4D 67 00 CD CD CD CD CD CD CD CD CD CD CD CD 
{5969565} normal block at 0x01188FA8, 56 bytes long.
 Data: <          c   c > D0 F9 0C 10 CC CD CD CD 90 07 63 00 94 07 63 00 
{5969564} normal block at 0x00630878, 376 bytes long.
 Data: <0   p c         > 30 F5 0C 10 70 06 63 00 00 00 00 00 0B 06 00 00 
{5969562} normal block at 0x0119D2E0, 6188 bytes long.
 Data: <                > 0C 00 00 00 0C 00 00 00 0C 00 00 00 0C 00 00 00 
{5969561} normal block at 0x0134C6E8, 37128 bytes long.
 Data: <                > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
{5969559} normal block at 0x00630670, 72 bytes long.
 Data: <8   x c 0       > 38 F9 0C 10 78 08 63 00 30 8C 18 01 CC CD CD CD 
{5969558} normal block at 0x01188C30, 832 bytes long.
 Data: <           `  F@> 01 01 01 CD CD CD CD CD 00 00 00 60 8F 82 46 40 
{21122} normal block at 0x01256060, 5 bytes long.
 Data: <true > 74 72 75 65 00 
{21121} normal block at 0x01256028, 6 bytes long.
 Data: <false > 66 61 6C 73 65 00 
{21120} normal block at 0x0116D8A0, 1 bytes long.
 Data: < > 00 
{21115} normal block at 0x0116D7A8, 24 bytes long.
 Data: <            .   > AC 01 0D 10 01 00 00 00 A0 D8 16 01 2E 00 CD CD 
{21109} normal block at 0x0116D728, 8 bytes long.
 Data: <        > 8C 01 0D 10 01 00 00 00 
{35} normal block at 0x00631E28, 33 bytes long.
 Data: < C              > 00 43 00 CD CD CD CD CD CD CD CD CD CD CD CD CD 
{34} normal block at 0x00631DD0, 40 bytes long.
 Data: <                > EC 00 0D 10 16 00 00 00 00 00 00 00 00 00 00 00 
Object dump complete.
0 bytes in 0 Free Blocks.
475609 bytes in 35 Normal Blocks.
16 bytes in 0 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 36846381 bytes.
Total allocations: -556487932 bytes.
Dimer UndoAssignment
Dimer Uninitialize
Dimer UndoAssignment
~DimerClass
WARNING: Memory Leaks!!!

beginning to end of dll

after dimer creation to before destruction

hope that makes sense to someone.
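
For anyone who wants to tally a dump like this one, here is a minimal Python sketch. It assumes the line format seen in the dump ("normal block at 0x..., N bytes long"); the function name `tally_leaks` is made up for illustration and is not part of the eOn client.

```python
import re
from typing import Iterable, Tuple

# Matches the per-block header lines of the dump; "Data:" lines are ignored.
BLOCK_RE = re.compile(r"normal block at 0x[0-9A-Fa-f]+, (\d+) bytes long")


def tally_leaks(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (block count, total bytes, largest block) for a leak dump."""
    sizes = []
    for line in lines:
        match = BLOCK_RE.search(line)
        if match:
            sizes.append(int(match.group(1)))
    return len(sizes), sum(sizes), max(sizes, default=0)


# Three block lines and one data line copied from the dump.
sample = [
    "{6085632} normal block at 0x0153ABD8, 178480 bytes long.",
    "{5970611} normal block at 0x0145EA80, 37128 bytes long.",
    "{5969572} normal block at 0x006307C8, 4 bytes long.",
    " Data: <  c > E8 06 63 00 ",
]
print(tally_leaks(sample))  # (3, 215612, 178480)
```

Sorting the sizes this way makes it easier to see which allocations dominate the leak.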

On that occasion it was not using a massive amount of RAM, at other times it has operated between 50Mb and 100Mb.

I shut it down several days ago and ran some other projects which were all stable. but I am running it again just now and it is using only 8.6 Mb

I will keep an eye on it again over the weekend to see what happens.

**AMDave** · 07-29-2005, 09:35 AM

...didn't take long

The test that was running seemed to be taking far too long

I checked the memory usage and it had just climbed past 65Mb and was still going ! I have stopped the client again.

Is this normal behaviour for the current tests ?

**vaughan** · 07-30-2005, 04:41 AM

The current tests are taking much longer to run and the HSize values are all over 2000.

AMDave, I've seen some of those Bad Prefs messages too but haven't checked the memory usage. Will keep an eye on it tho'.

**and_ped10** · 07-30-2005, 08:48 AM

Memory use,

We are aware of the problem with the memory leak. The reason it has not been fixed is that it is extremely hard to debug, and if the client software is restarted once in a while it is not a severe problem!

When a memory leak occurs, the client application takes up more memory than necessary, and as time goes on the running client ends up taking far too much memory. When this is the case, the client should be restarted. The memory leak is only a problem if the client application has been running for more than a week or so.

We are of course doing our very best to fix the bug, but we prefer to use the best software available now and get results rather than wait for a perfect version that may never be achieved.

Kindly Andreas

**AMDave** · 07-31-2005, 04:21 AM

Thanks Andreas.
I'll keep a look out for the new client.

**PY 222** · 08-04-2005, 12:00 AM

I think the server needs another kick.

I am seeing ZZZs again.

**and_ped10** · 08-04-2005, 09:24 AM

The server seems to be working OK! (05-08-04)

Are you still having trouble achieving new computational jobs?

Kindly Andreas

**alpha** · 08-04-2005, 10:36 AM

Yes. :sleepy:

**TeeJay** · 08-04-2005, 10:46 AM

Originally posted by and_ped10
The server seems to be working OK! (05-08-04)

Are you still having trouble achieving new computational jobs?

Kindly Andreas

YES ! Lots-o-ZZZ's still...

**IronBits** · 08-04-2005, 11:28 AM

YES! :sleepy:

**Thor** · 08-04-2005, 11:30 AM

Now I get many ZZZ's too

Seems like we are hitting that poor thing too hard

Thor

**PY 222** · 08-04-2005, 11:31 AM

On my home box I am seeing lots of ZZZs for about 3 minutes then there will be work, but once the work is done, it goes back to ZZZ.

Are we running low on work?

**and_ped10** · 08-05-2005, 04:26 AM

Hi guys,

We think the many ZZZs are due to congestion on the server. At the moment the server is only using about 2-5% of the CPU resources and results are coming in at a steady flow, so the server seems to perform as intended.

An issue is that there are some troublesome states during the simulation. When the system has to leave one of these troublesome states, the number of force calls (FC value) decreases to about 30 or so; the normal value is 300+. The force call value is an excellent indicator of how much work the clients have to do before reconnecting to the server.

As one can see from these values (300/30), the amount of communication goes up by a factor of at least 10 when a troublesome state is reached. We think this amount of communication is simply too much (maybe there is not enough bandwidth!) and congestion on the server occurs.
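
The factor of 10 can be worked out as below. This is only a sketch: it assumes every force call costs roughly the same compute time, so the time between a client's connections is proportional to the FC value of its job. The function name is made up for illustration.

```python
def reconnect_factor(normal_fc: int, troublesome_fc: int) -> float:
    """How many times more often a client reconnects when jobs shrink
    from `normal_fc` force calls to `troublesome_fc` force calls.

    Assumption: time per job is proportional to its force-call count.
    """
    if normal_fc <= 0 or troublesome_fc <= 0:
        raise ValueError("force-call counts must be positive")
    return normal_fc / troublesome_fc


print(reconnect_factor(300, 30))  # 10.0
```

Since 300 is quoted as "300+", the real factor in a troublesome state would be 10 or more.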

Another issue is that the new client software is more efficient (fewer force calls are needed), and combined with the issue described above this only makes the problem worse.

We are doing our very best to track down the bottle neck and get it fixed.

Kindly
Andreas

**AMDave** · 08-05-2005, 06:22 AM

Thanks Andreas.

So, it seems the "server" is I/O bound.

I recall from previous posts that the hard drives are replaced and so is the RAM and also that the server is on the Gigabit backbone of the campus. As these areas have each been addressed recently I rule them out. TCP/IP tests have shown that sometimes there are times of high I/O between the campus LAN and the Internet provider, but this is infrequent and not likely to be the source of the problem.

I still don't think it is the server itself but rather the WU handler.

If the CPU usage is so low then the WU handler is either waiting for something like disk I/O or has insufficient available "handles/threads" to serve the volume of the I/O requests.

So I have to return to 2 older questions...

Q - Has the WU handler been parallelised yet?

I have a reservation about this. Increasing the number of I/O channels may increase availability of the handler itself, but if there is a problem between the handler and the database it will only make a bottleneck worse to the point where no code optimisation is going to help. It will increase the maximum number of transactions waiting to complete but will not help to complete them faster.

thus the more important question is...

Q - When a WU is returned and the handler writes to the database, does it wait for the transaction to complete or does the database return a "queued reference" (releasing the database call) so that the handler can continue dealing with the I/O requests and handing out WUs whilst the database finishes the write transaction (and index updates etc...) by itself without holding up the handler?

Returning a queued transaction reference can allow the handler to move onto the next I/O request sooner whilst the database completes the read (or write) transaction in its own time.
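
A minimal sketch of the "queued reference" idea in Python, not the eOn server's actual design. `QueuedWriter`, `submit` and the list used as a stand-in for the database are all hypothetical. The handler gets a reference back straight away, and a single background thread completes the writes in order.

```python
import queue
import threading


class QueuedWriter:
    """Accept writes immediately and complete them on a background thread."""

    def __init__(self, store: list) -> None:
        self._store = store              # stand-in for the slow database
        self._pending: queue.Queue = queue.Queue()
        self._next_ref = 0
        self._lock = threading.Lock()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, result: str) -> int:
        """Queue a write and return a reference without waiting for it."""
        with self._lock:
            ref = self._next_ref
            self._next_ref += 1
        self._pending.put((ref, result))
        return ref

    def _drain(self) -> None:
        # Single consumer, FIFO queue: writes land in submission order.
        while True:
            item = self._pending.get()
            if item is None:             # sentinel: stop the worker
                self._pending.task_done()
                break
            self._store.append(item)     # the slow write would happen here
            self._pending.task_done()

    def close(self) -> None:
        """Flush outstanding writes and stop the worker thread."""
        self._pending.put(None)
        self._pending.join()
        self._worker.join()
```

Usage would look like `ref = writer.submit("wu-42")`, after which the handler can move on to the next request. As noted in the post, this only moves the waiting out of the handler; if the database itself cannot keep up, the queue just grows.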

"Perfect" data organisation, indexing and coding does not always lend itself well to high I/O volume databases. Tweaking the design with a high volume of fast reads and writes in mind can result in quite a different design. Design optimisations can prevent the read transactions from slowing down the write transactions in the database, and queuing the write transactions (usually the slowest) can really tune-out a database I/O bottleneck.

Just a couple of thoughts from past system observations. Hopefully I am waaaay off base and it is something really simple (like when I ran a LAN cable round the back of an audio speaker and had to diagnose my intermittent transmission problem LOL)

Keep at it Andreas.
We know you will track it down.

**PY 222** · 08-05-2005, 08:01 PM

How can it be server congestion when we are barely pushing 2500 points every 30 minutes based on Bok's stats?

Am I missing something here?

**AMDave** · 08-05-2005, 10:24 PM

Mmm.
Down to 842 in half an hour for the whole project.
That's looking terminal.

Kick the server please.

**PY 222** · 08-06-2005, 12:14 AM

Now the stats are borked.

Damn.. on a Friday night! NOOoOOOoooooOO... I need my stats fix!

**TeeJay** · 08-07-2005, 12:27 AM

Lots of ZZZ's over here...
Time to kick the server again ?

>>TeeJay

**IronBits** · 08-07-2005, 01:27 AM

I wonder if the planned outage has already begun?

http://www.free-dc.org/forum/showthr...&threadid=9828

**AMDave** · 08-07-2005, 01:58 AM

The server is still up.
The website is still running.
The database is still up (stats are still live)

The WU handler has gone into a spin......again !

I hope the guys can kick the server remotely, although doing so during a scheduled power outage would not be recommended.

[edit]
Hey I just got a WU !
Wow. It is still going.
But still 90% ZZZ time.
Time to fire up the back up project.
[/edit]

**IronBits** · 08-07-2005, 03:39 AM

You're right! We be workin again

**Fozzie** · 08-07-2005, 04:51 AM

have switched to LHC.

I'll check back on Eon tomorrow.

**IronBits** · 08-07-2005, 05:20 AM

:sleepy: again, oh well...

**AMDave** · 08-07-2005, 08:43 AM

RIGHT !!!
Time to get the hands dirty.
NB - I am pleased to say that my Post about Inbound database communication bottlenecks has nothing to do with this scenario. FIDA writes to a file. What the project does with the file is out of FIDA's scope.

I have had a look through COSM and FIDA.
I have ruled out the keyUtil, dlfcn, prolog, secMan and appMan modules.

Most of the time the client is able to get a job on the first attempt. Sometimes this is not true but it soon resolves. Thus I am concentrating on the returning of a result...

The client tries to return a result...
"[FIDA] Connecting to server"
This is then tested and if the connection is not made we would see the following message "[FIDA] Failed - Connect"

So. At least we can establish that we are ALL successfully connecting to the server.

The next step is "SendIdaMessage" which the client uses to send a message to the server. If this failed we would see
"[FIDA] Failed - SendIdaMessage" and the connection is closed and a failed status is returned. We would also then see the rest of the messages we get. So we can establish that the client is sending the result-message without failure...

So then a failed status appears and the client evaluates this to...
"[FIDA] Failed - RecvIdaMessage"

The subsequent message "[FIDA] Failed - Report result" is a direct result of the RecvIdaMessage failure and comes as the next sequential step in the program.
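
The same elimination can be done mechanically on a client log. This is a small Python sketch that assumes the message texts are exactly as quoted in this post; `first_failure` is a made-up helper, not part of FIDA or COSM.

```python
import re
from typing import Iterable, Optional

# "[FIDA] Failed - <step>" as quoted in the post.
FAILED_RE = re.compile(r"\[FIDA\] Failed - (.+)")


def first_failure(log_lines: Iterable[str]) -> Optional[str]:
    """Return the name of the first step reported as failed, or None."""
    for line in log_lines:
        match = FAILED_RE.search(line)
        if match:
            return match.group(1).strip()
    return None


log = [
    "[FIDA] Reporting Results",
    "[FIDA] Connecting to Server",
    "[FIDA] Failed - RecvIdaMessage",
    "[FIDA] Failed - Report result",
]
print(first_failure(log))  # RecvIdaMessage
```

Because later failures are consequences of the first one, only the first "Failed" line points at the step that actually went wrong.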

Taking a closer look at the code, the message "[FIDA] Failed - RecvIdaMessage" is evaluated as "!= V3_PASS" in the "client.c" module in a test of the "status" variable. There is nothing out of the ordinary here. It is simply doing as it is told.

The V3_FAIL (ie != V3_PASS) is returned by the "RecvIdaMessage" function in the "common.c" module (server-side).

The "RecvIdaMessage" function in "common.c" operates on the message itself. The first thing it does is receive it using the COSM network function v3NetRecv. Reading the FIDA template header for RecvIdaMessage gave me some annoyance... the remarks in the header state "The message body can be arbitrarily long" and just a couple of lines later the code employs a 5000 millisecond wait in the "hope" that the message will have finished arriving in that time. BUT... and this is a very big BUT... the COSM reference for v3NetRecv states "A suggested time for wait_ms is ( length + 5000 ), giving 5 seconds plus 1sec/KB of data allowing for enough data transfer time over modems and other net glitches."
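
To put numbers on that, here is a small Python sketch of the quoted rule of thumb. It assumes `length` is the message size in bytes (which is what "1sec/KB" implies) and uses a hypothetical link speed of 1 byte per millisecond for the comparison; the function names are made up for illustration.

```python
BASE_WAIT_MS = 5000  # the fixed 5000 ms wait used in the FIDA template


def suggested_wait_ms(length_bytes: int) -> int:
    """The quoted COSM suggestion: wait_ms = length + 5000."""
    if length_bytes < 0:
        raise ValueError("length must be non-negative")
    return length_bytes + BASE_WAIT_MS


def fits_fixed_wait(length_bytes: int,
                    bytes_per_ms: float = 1.0,
                    fixed_wait_ms: int = BASE_WAIT_MS) -> bool:
    """Would the transfer finish inside a fixed wait at the assumed speed?

    `bytes_per_ms` is an assumption (about 1 KB per second by default).
    """
    return length_bytes / bytes_per_ms <= fixed_wait_ms


print(suggested_wait_ms(20000))  # 25000
print(fits_fixed_wait(4000))     # True
print(fits_fixed_wait(6000))     # False
```

On a link that slow, any message over about 5000 bytes would miss a fixed 5-second wait, while the length-based wait grows with the message.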

Hopefully the eOn server and client have been compiled with a constant greater than the default of 5000 milliseconds to allow for the varying length of messages that may be returned by the client as well as the day-to-day grind that continues to gradually slow down the internet.

I believe this may be where the wheels fall off: as the result frame size changes, the transfer can exceed the allowed time limit.

** I have timed the Connect-to-Fail message cycle to about 20 seconds on the round trip for eOn at the moment **

v3NetRecv returns V3_PASS on success, or an error code on failure. It appears that it may be the first point at which the Fail code is returned.
The next set of tests on the header and body of the message generally all work fine & there are no hard-limits etc.. & no reason to suspect them.

If we assume that this call succeeds on the server, even then the trouble is not over, because the same code is used in the client for a message (WU) to be received by the client.

There is a very useful debug statement which is remarked-out in the FIDA template:
/* v3PrintA((ascii*)"number of bytes received = %u\n",bytes_received); */

Personally I think this should be enabled and disabled by passing a flag so that the message size & duration can be tested during run-time by the project operator without having to compile a separate executable.

Anyway that's my take on it.
I'm happy to hear from anyone else who'd like to trawl through FIDA & COSM code.
I wish I had the time to run tests using the eOn app, but I don't.

I hope that's the source of the problem, because at least that is easily fixed.
Best of luck.

[EDIT] Come to think of it, the time duration can be substituted with a variable and passed via a configuration setting contained within the outgoing messages, so you could actually change that length on-the-fly[/EDIT]
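
A sketch of what that could look like, in Python for brevity rather than the C of FIDA/COSM. Everything here is hypothetical: the setting names `wait_ms` and `log_bytes_received`, the `RecvSettings` type and the idea of carrying settings as a dictionary in each message are illustrations, not existing FIDA features.

```python
from dataclasses import dataclass, replace
from typing import Any, Mapping


@dataclass(frozen=True)
class RecvSettings:
    wait_ms: int = 5000               # default wait from the FIDA template
    log_bytes_received: bool = False  # the commented-out debug print


def apply_settings(current: RecvSettings,
                   incoming: Mapping[str, Any]) -> RecvSettings:
    """Return updated settings, ignoring missing or invalid values."""
    updated = current
    wait_ms = incoming.get("wait_ms")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(wait_ms, int) and not isinstance(wait_ms, bool) and wait_ms > 0:
        updated = replace(updated, wait_ms=wait_ms)
    flag = incoming.get("log_bytes_received")
    if isinstance(flag, bool):
        updated = replace(updated, log_bytes_received=flag)
    return updated
```

The receive path would then read `settings.wait_ms` instead of a compiled-in constant, and print the byte count only when `settings.log_bytes_received` is set, so the operator could change both without shipping a new executable.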
