ZZZ again



PY 222
07-20-2005, 08:06 PM
Ok, so what is up with the server now?

I am getting ZZZ on my clients.

Maybe it needs a kick. :spank:

vaughan
07-20-2005, 08:32 PM
Urghh! Lots of those annoying "ZZZ - Sleep for 60 seconds" messages today :swear:

And my backup project - BOINC Sztaki will only d/l 1 task at a time and won't auto fetch new work so my CPU has been idling way too much today.

Methinks I need a 2nd backup project ...

black_civic55
07-20-2005, 08:36 PM
dumb question but i also havent had any time to figure it out for myself. Can i run eOn as my main project and have SOB on for when eOn fails or will SOB take over completely?

PY 222
07-20-2005, 08:51 PM
Originally posted by vaughan
Methinks I need a 2nd backup project ...

I've just moved my eOn boxes back to Find-A-Drug.

Maybe you can check that project out too.

PY 222
07-20-2005, 08:52 PM
Originally posted by black_civic55
dumb question but i also havent had any time to figure it out for myself. Can i run eOn as my main project and have SOB on for when eOn fails or will SOB take over completely?

That is completely possible. I am guessing that you would want to run both projects at the same time but have one project on a higher priority level than the other.

Is that correct?

black_civic55
07-20-2005, 10:17 PM
correct, pretty much SOB chillin there untill eOn goes down.

vaughan
07-21-2005, 12:39 AM
I have SOB running at Low Priority (not Idle) in the SOB Configuration and eOn running stock ie. as it installed, on a 3.2GHz P4 Prescott with hyperthreading. Windows XP Pro SP2 Task Manager shows 50% CPU usage for each project.

When eOn gets an attack of the ZZZs SOB is still running but it is only reported as using 50% CPU by Task Manager. At least the box isn't sitting idle.

and_ped10
07-21-2005, 07:21 AM
Hi,

Sorry about the server down time.

The reason for the downtime is that we use a small unix script to nurse the server; part of that nursing is restarting the server automatically if it stops. When the server restarts, information from the old server (the one that stopped) is copied. There has been a bug related to this procedure, but it should be fixed now.

Kindly Andreas

PY 222
07-26-2005, 12:28 PM
I am seeing ZZZ again and it's been a while now.

Any idea when the server will be up?

Bok
07-26-2005, 12:38 PM
pop a few P4's on SoB PY ;) Yeah I'm impatient to get the overtake done.... :)

Bok

magnav0x
07-26-2005, 01:24 PM
Originally posted by Bok
pop a few P4's on SoB PY ;) Yeah I'm impatient to get the overtake done.... :)

Bok

According to USD's frontpage, they should be switching efforts over to E@H tomorrow:

07-13-05 to 07-26-05 -> SoB
07-27-05 to 08-09-05 -> Einstein@Home

PY 222
07-26-2005, 02:34 PM
Originally posted by Bok
pop a few P4's on SoB PY ;) Yeah I'm impatient to get the overtake done.... :)

Bok

LOL. You guys don't need me there.

Be patient Bok. Eventually you will overtake them. :jester:

PY 222
07-26-2005, 03:57 PM
Still no word from the eOn people?

vaughan
07-27-2005, 03:11 AM
Knock knock Graeme - time to reboot the eOn server. :Pokes:


:sleepy:

:sleepy:

:sleepy:

and_ped10
07-27-2005, 03:58 AM
Hi everybody,

We are always trying to answer all questions asap!

I would like to know if your clients' output is still zzz. The server currently appears to be running steadily; we receive data from clients and it is stored.

An issue could be that we only check every 5 minutes whether the server is running, so once in a while the server will be down, but for 5 minutes at most. If you checked the client's output during one of these bad blocks, this could explain the zzz.
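
A minimal sketch of the kind of periodic check described above, written in Python rather than the project's actual unix script (which is not shown in this thread). The function names, the `is_running` and `restart` callbacks, and the `max_passes` test hook are assumptions for illustration only; just the 5-minute interval comes from the post.

```python
import time
from typing import Callable, Optional

CHECK_INTERVAL_S = 300  # "every 5 minutes", per the post above


def watchdog_pass(is_running: Callable[[], bool], restart: Callable[[], None]) -> bool:
    """Run one check. Returns True if a restart was triggered."""
    if is_running():
        return False
    restart()
    return True


def watchdog_loop(
    is_running: Callable[[], bool],
    restart: Callable[[], None],
    interval_s: float = CHECK_INTERVAL_S,
    sleep: Callable[[float], None] = time.sleep,
    max_passes: Optional[int] = None,
) -> int:
    """Repeat the check forever, or max_passes times (hypothetical test hook).

    Returns the number of restarts that were triggered.
    """
    restarts = 0
    passes = 0
    while max_passes is None or passes < max_passes:
        if watchdog_pass(is_running, restart):
            restarts += 1
        passes += 1
        sleep(interval_s)
    return restarts
```

With a fixed interval like this, a server that stops just after a check stays down until the next pass, which is where the "5 minutes at most" figure would come from.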

Hope the answer is satisfying.

Kindly Andreas

black_civic55
07-27-2005, 10:54 AM
it was out for a lot longer than 5 mins but it appears to be running just fine now

AMDave
07-29-2005, 09:03 AM
No zzz at the moment

I have noticed the client gulping huge amounts of RAM at times.
I was getting a lot of "result of multiplier out of range" or some such.
Eventually the client stopped altogether.

I trapped this from the log:


[FIDA] ZZZ - Sleep for 60 seconds
[FIDA] Reporting Results
[FIDA] Connecting to Server
[FIDA] Work Unit Completed
[FIDA] Requesting Assignment
[FIDA] Connecting to Server
[FIDA] Shutting Down Client
Detected memory leaks!
Dumping objects ->
{6085632} normal block at 0x0153ABD8, 178480 bytes long.
Data: < > 0B 06 00 00 0C 06 00 00 0D 06 00 00 0E 06 00 00
{5970611} normal block at 0x0145EA80, 37128 bytes long.
Data: < ` f i > >> 00 60 08 66 1C 69 A4 3E 00 D0 C6 98 91 C8 93 3E
{5970610} normal block at 0x0132FB18, 6188 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{5970609} normal block at 0x015258E0, 12376 bytes long.
Data: < | B ? " B ?> 94 7C CD E8 42 A4 F0 3F 8B C5 22 E2 42 A4 F0 3F
{5970608} normal block at 0x013631B8, 12376 bytes long.
Data: < L ? T7L ?> D7 96 FF 11 4C 8E FD 3F BF 83 54 37 4C 8E FD 3F
{5970607} normal block at 0x014189C8, 12376 bytes long.
Data: < m i, ?p y i, ?> 80 6D 9E 1D 69 2C C6 3F 70 CA 79 95 69 2C C6 3F
{5970606} normal block at 0x01225288, 12376 bytes long.
Data: < (_j ? / ?> C0 28 5F 6A 01 84 C0 3F 80 2F F5 B7 00 84 C0 3F
{5970605} normal block at 0x010C9028, 12376 bytes long.
Data: < : ag > 00 86 AA 8C B3 15 F7 BF 3A DB 61 67 B3 15 F7 BF
{5970604} normal block at 0x01191028, 12376 bytes long.
Data: < ao @ J6 n @> E2 05 AC 61 6F F7 20 40 98 4A 36 B2 6E F7 20 40
{5970603} normal block at 0x011F7AE8, 12376 bytes long.
Data: < 6 @ @> EF D2 1B 36 96 EC 20 40 E3 91 96 80 95 EC 20 40
{5970601} normal block at 0x0116D828, 16 bytes long.
Data: < ( @ @ > CC CD CD CD 28 10 19 01 80 40 19 01 80 40 19 01
{5970599} normal block at 0x0116D760, 16 bytes long.
Data: < z @ @ > CC CD CD CD E8 7A 1F 01 40 AB 1F 01 40 AB 1F 01
{5970598} normal block at 0x01421028, 37128 bytes long.
Data: < @ ? @ ?> 01 00 00 40 0C 02 F2 3F 01 00 00 40 0C 02 F2 3F
{5970597} normal block at 0x014DAFB0, 74256 bytes long.
Data: < @ ? @ ?> 00 00 00 40 0C 02 F2 3F 00 00 00 40 0C 02 F2 3F
{5970510} normal block at 0x014D9F78, 4096 bytes long.
Data: < > A4 04 00 00 09 00 00 00 A5 04 00 00 09 00 00 00
{5969577} normal block at 0x010E7570, 6192 bytes long.
Data: < ! A l > 00 00 00 00 21 00 00 00 41 00 00 00 6C 00 00 00
{5969576} normal block at 0x01163158, 96 bytes long.
Data: <p c > 70 06 63 00 0B 06 00 00 0B 06 00 00 CD CD CD CD
{5969575} normal block at 0x01162F30, 504 bytes long.
Data: < > 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{5969574} normal block at 0x011A30B8, 8 bytes long.
Data: < ?> 00 00 00 00 00 00 F0 3F
{5969573} normal block at 0x011A3080, 12 bytes long.
Data: < 0 > 01 00 00 00 01 00 00 00 B8 30 1A 01
{5969572} normal block at 0x006307C8, 4 bytes long.
Data: < c > E8 06 63 00
{5969571} normal block at 0x00630790, 4 bytes long.
Data: < c > E8 06 63 00
{5969570} normal block at 0x006306E8, 120 bytes long.
Data: < n lxz ,C ?> 98 6E 12 83 C0 CA F7 BF 6C 78 7A A5 2C 43 FC 3F
{5969569} normal block at 0x011A3028, 33 bytes long.
Data: < Mg > 00 4D 67 00 CD CD CD CD CD CD CD CD CD CD CD CD
{5969565} normal block at 0x01188FA8, 56 bytes long.
Data: < c c > D0 F9 0C 10 CC CD CD CD 90 07 63 00 94 07 63 00
{5969564} normal block at 0x00630878, 376 bytes long.
Data: <0 p c > 30 F5 0C 10 70 06 63 00 00 00 00 00 0B 06 00 00
{5969562} normal block at 0x0119D2E0, 6188 bytes long.
Data: < > 0C 00 00 00 0C 00 00 00 0C 00 00 00 0C 00 00 00
{5969561} normal block at 0x0134C6E8, 37128 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{5969559} normal block at 0x00630670, 72 bytes long.
Data: <8 x c 0 > 38 F9 0C 10 78 08 63 00 30 8C 18 01 CC CD CD CD
{5969558} normal block at 0x01188C30, 832 bytes long.
Data: < ` F@> 01 01 01 CD CD CD CD CD 00 00 00 60 8F 82 46 40
{21122} normal block at 0x01256060, 5 bytes long.
Data: <true > 74 72 75 65 00
{21121} normal block at 0x01256028, 6 bytes long.
Data: <false > 66 61 6C 73 65 00
{21120} normal block at 0x0116D8A0, 1 bytes long.
Data: < > 00
{21115} normal block at 0x0116D7A8, 24 bytes long.
Data: < . > AC 01 0D 10 01 00 00 00 A0 D8 16 01 2E 00 CD CD
{21109} normal block at 0x0116D728, 8 bytes long.
Data: < > 8C 01 0D 10 01 00 00 00
{35} normal block at 0x00631E28, 33 bytes long.
Data: < C > 00 43 00 CD CD CD CD CD CD CD CD CD CD CD CD CD
{34} normal block at 0x00631DD0, 40 bytes long.
Data: < > EC 00 0D 10 16 00 00 00 00 00 00 00 00 00 00 00
Object dump complete.
0 bytes in 0 Free Blocks.
475609 bytes in 35 Normal Blocks.
16 bytes in 0 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 36846381 bytes.
Total allocations: -556487932 bytes.
Dimer UndoAssignment
Dimer Uninitialize
Dimer UndoAssignment
~DimerClass
WARNING: Memory Leaks!!!

beginning to end of dll

after dimer creation to before destruction

hope that makes sense to someone.
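
For anyone who wants to total up a dump like the one above, here is a small sketch in Python. It assumes only the line shape visible in the log (`{id} normal block at 0x..., N bytes long.`); the function name and the regular expression are my own, not part of FIDA or the eOn client.

```python
import re
from typing import Tuple

# Matches lines such as: {6085632} normal block at 0x0153ABD8, 178480 bytes long.
BLOCK_RE = re.compile(r"\{(\d+)\} normal block at 0x([0-9A-Fa-f]+), (\d+) bytes long\.")


def summarize_leaks(log_text: str) -> Tuple[int, int]:
    """Return (number of leaked normal blocks, total leaked bytes) found in log_text."""
    sizes = [int(match.group(3)) for match in BLOCK_RE.finditer(log_text)]
    return len(sizes), sum(sizes)
```

The two numbers it returns can be compared against the "bytes in ... Normal Blocks" summary line that the dump prints at the end.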

On that occasion it was not using a massive amount of RAM; at other times it has operated between 50 MB and 100 MB.

I shut it down several days ago and ran some other projects, which were all stable, but I am running it again just now and it is using only 8.6 MB.

I will keep an eye on it again over the weekend to see what happens.

AMDave
07-29-2005, 09:35 AM
...didn't take long

The test that was running seemed to be taking far too long

I checked the memory usage and it had just climbed past 65 MB and was still going! I have stopped the client again.

Is this normal behaviour for the current tests?

vaughan
07-30-2005, 04:41 AM
The current tests are taking much longer to run and the HSize values are all over 2000.

AMDave, I've seen some of those Bad Prefs messages too but haven't checked the memory usage. Will keep an eye on it tho'.

and_ped10
07-30-2005, 08:48 AM
Memory use,

We are aware of the problem with the memory leak. The reason it has not been fixed is that it is extremely hard to debug, and if the client software is restarted once in a while it is not a severe problem!

When a memory leak occurs, the client application takes up more memory than necessary, and as time goes on the running client ends up taking far too much memory. When this is the case, the client should be restarted. The memory leak is only a problem if the client application has been running for more than a week or so.

We are of course doing our very best to fix the bug, but we prefer to use the best software we have at present and get results rather than wait for a perfect version that may never be achieved.

Kindly Andreas

AMDave
07-31-2005, 04:21 AM
Thanks Andreas.
I'll keep a look out for the new client.

PY 222
08-04-2005, 12:00 AM
I think the server needs another kick.

I am seeing ZZZs again. :crazy:

and_ped10
08-04-2005, 09:24 AM
The server seems to be working OK! (05-08-04)

Are you still having trouble achieving new computational jobs?

Kindly Andreas

alpha
08-04-2005, 10:36 AM
Yes. :sleepy:

TeeJay
08-04-2005, 10:46 AM
Originally posted by and_ped10
The server seems to be working OK! (05-08-04)

Are you still having trouble achieving new computational jobs?

Kindly Andreas

YES ! Lots-o-ZZZ's still...

IronBits
08-04-2005, 11:28 AM
YES! :sleepy:

Thor
08-04-2005, 11:30 AM
Now I get many ZZZ's too

Seems like we are hitting that poor thing too hard:smoking:


Thor

PY 222
08-04-2005, 11:31 AM
On my home box I am seeing lots of ZZZs for about 3 minutes then there will be work, but once the work is done, it goes back to ZZZ.

Are we running low on work?

and_ped10
08-05-2005, 04:26 AM
Hi guys,

We think the many zzz are due to congestion on the server. At the moment the server is only using 2-5% of the CPU resources and results are coming in at a steady flow, so the server seems to perform as intended.

One issue is that there are some troublesome states during the simulation. When the system has to leave one of these troublesome states, the number of force calls (FC value) decreases to 30 or so; the normal value is 300+. The force call value is an excellent indicator of how much work the clients have to do before reconnecting to the server.

As one can see from these values (300/30), the amount of communication goes up by a factor of at least 10 when a troublesome state is reached. We think this amount of communication is simply too much (maybe there is not enough bandwidth!) and congestion occurs on the server.
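
The factor-of-10 arithmetic can be written out as a tiny Python sketch. It assumes, as the post implies but does not state exactly, that the work a client does per connection is proportional to the force-call count, so the reconnect rate scales with its inverse. The function name is made up for illustration.

```python
def relative_connection_rate(normal_fc: float, troubled_fc: float) -> float:
    """How many times more often a client reconnects in a troublesome state.

    Assumes work per connection is proportional to the force-call (FC) count.
    """
    if normal_fc <= 0 or troubled_fc <= 0:
        raise ValueError("force-call counts must be positive")
    return normal_fc / troubled_fc


# Values quoted in the post: about 300 force calls normally, about 30 in a troublesome state.
factor = relative_connection_rate(300, 30)
```

Since the normal value is given as "300+", the real factor would be at least this large under the same assumption.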

Another issue is that the new client software is more efficient (fewer force calls are needed), and combined with the issue described above this only increases the problem.

We are doing our very best to track down the bottleneck and get it fixed.

Kindly
Andreas

AMDave
08-05-2005, 06:22 AM
Thanks Andreas.

So, it seems the "server" is I/O bound.

I recall from previous posts that the hard drives are replaced and so is the RAM and also that the server is on the Gigabit backbone of the campus. As these areas have each been addressed recently I rule them out. TCP/IP tests have shown that sometimes there are times of high I/O between the campus LAN and the Internet provider, but this is infrequent and not likely to be the source of the problem.

I still don't think it is the server itself but rather the WU handler.

If the CPU usage is so low then the WU handler is either waiting for something like disk I/O or has insufficient available "handles/threads" to serve the volume of the I/O requests.

So I have to return to 2 older questions...

Q - Has the WU handler been parallelised yet?

I have a reservation about this. Increasing the number of I/O channels may increase availability of the handler itself, but if there is a problem between the handler and the database it will only make a bottleneck worse to the point where no code optimisation is going to help. It will increase the maximum number of transactions waiting to complete but will not help to complete them faster.

thus the more important question is...

Q - When a WU is returned and the handler writes to the database, does it wait for the transaction to complete, or does the database return a "queued reference" (releasing the database call) so that the handler can continue dealing with the I/O requests and handing out WUs whilst the database finishes the write transaction (and index updates etc...) by itself without holding up the handler?

Returning a queued transaction reference can allow the handler to move onto the next I/O request sooner whilst the database completes the read (or write) transaction in its own time.
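
To make the "queued reference" idea concrete, here is a rough Python sketch. It is not how the eOn server or its database layer works (that code is not shown in this thread); `QueuedWriter`, `submit` and `write_fn` are hypothetical names, and a single background thread stands in for the database doing the write in its own time.

```python
import queue
import threading


class QueuedWriter:
    """Accepts results immediately and performs the slow write on a background thread."""

    def __init__(self, write_fn):
        self._write_fn = write_fn        # the slow operation, e.g. a database insert
        self._pending = queue.Queue()    # results waiting to be written
        self._next_ref = 0
        self._lock = threading.Lock()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, result):
        """Queue a result and return a reference without waiting for the write."""
        with self._lock:
            ref = self._next_ref
            self._next_ref += 1
        self._pending.put((ref, result))
        return ref

    def _drain(self):
        # Single consumer, so writes happen in submission order.
        while True:
            ref, result = self._pending.get()
            try:
                self._write_fn(ref, result)
            finally:
                self._pending.task_done()

    def wait_until_flushed(self):
        """Block until every queued write has completed."""
        self._pending.join()
```

The handler would call `submit()` and move straight on to the next request. As noted above, this only hides the write latency; if writes are slower than arrivals on average, the queue simply grows.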

"Perfect" data organisation, indexing and coding does not always lend itself well to high I/O volume databases. Tweaking the design with a high volume of fast reads and writes in mind can result in quite a different design. Design optimisations can prevent the read transactions from slowing down the write transactions in the database, and queuing the write transactions (usually the slowest) can really tune-out a database I/O bottleneck.

Just a couple of thoughts from past system observations. Hopefully I am waaaay off base and it is something really simple (like when I ran a LAN cable round the back of an audio speaker and had to diagnose my intermittent transmission problem LOL)

Keep at it Andreas.
We know you will track it down. :thumbs:

PY 222
08-05-2005, 08:01 PM
How can it be server congestion when we are barely pushing 2500 points every 30 minutes based on Bok's stats?

Am I missing something here?

AMDave
08-05-2005, 10:24 PM
Mmm.
Down to 842 in half an hour for the whole project.
That's looking terminal.

Kick the server please.

PY 222
08-06-2005, 12:14 AM
Now the stats are borked.

Damn.. on a Friday night! NOOoOOOoooooOO... I need my stats fix! :swear:

TeeJay
08-07-2005, 12:27 AM
Lots of ZZZ's over here...
Time to kick the server again ?

>>TeeJay

IronBits
08-07-2005, 01:27 AM
I wonder if the planned outage has already begun? :bang:
http://www.free-dc.org/forum/showthread.php?s=&threadid=9828

AMDave
08-07-2005, 01:58 AM
The server is still up.
The website is still running.
The database is still up (stats are still live)

The WU handler has gone into a spin......again !

I hope the guys can kick the server remotely, although doing so during a scheduled power outage would not be recommended.


Hey I just got a WU !
Wow. It is still going.
But still 90% ZZZ time.
Time to fire up the back up project.

IronBits
08-07-2005, 03:39 AM
You're right! We be workin again :D

Fozzie
08-07-2005, 04:51 AM
have switched to LHC.

I'll check back on Eon tomorrow.

IronBits
08-07-2005, 05:20 AM
:sleepy: again, oh well...
:mouserun:

AMDave
08-07-2005, 08:43 AM
RIGHT !!!
Time to get the hands dirty.
NB - I am pleased to say that my Post about Inbound database communication bottlenecks has nothing to do with this scenario. FIDA writes to a file. What the project does with the file is out of FIDA's scope.

I have had a look through COSM and FIDA.
I have ruled out the keyUtil, dlfcn, prolog, secMan and appMan modules.

Most of the time the client is able to get a job on the first attempt. Sometimes this is not true but it soon resolves. Thus I am concentrating on the returning of a result...

The client tries to return a result...
"[FIDA] Connecting to server"
This is then tested and if the connection is not made we would see the following message "[FIDA] Failed - Connect"

So. At least we can establish that we are ALL successfully connecting to the server.

The next step is "SendIdaMessage", which the client uses to send a message to the server. If this failed we would see
"[FIDA] Failed - SendIdaMessage", the connection would be closed and a failed status returned. We would also then see the rest of the messages we get. So we can establish that the client is sending the result-message without failure...

So then a failed status appears and the client evaluates this to...
"[FIDA] Failed - RecvIdaMessage"

The subsequent message "[FIDA] Failed - Report result" is a direct result of the RecvIdaMessage failure and comes as the next sequential step in the program.

Taking a closer look at the code, the message "[FIDA] Failed - RecvIdaMessage" is evaluated as "!= V3_PASS" in the "client.c" module in a test of the "status" variable. There is nothing out of the ordinary here. It is simply doing as it is told.

The V3_FAIL (ie != V3_PASS) is returned by the "RecvIdaMessage" function in the "common.c" module (server-side).
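
The sequence traced above can be summarised in a short Python sketch. This is a paraphrase of the flow as described in this post, not the actual "client.c" code: the numeric values of `V3_PASS` and `V3_FAIL`, the function names, and the callback parameters are all placeholders. Only the log strings are taken from the thread.

```python
V3_PASS = 0  # placeholder value; the real constant lives in the COSM headers
V3_FAIL = 1  # placeholder standing in for "anything != V3_PASS"


def report_result(connect, send_message, recv_message, log):
    """Walk the connect -> send -> receive steps, logging the first failure."""
    log("[FIDA] Connecting to Server")
    if connect() != V3_PASS:
        log("[FIDA] Failed - Connect")
        return V3_FAIL
    if send_message() != V3_PASS:
        log("[FIDA] Failed - SendIdaMessage")
        return V3_FAIL
    if recv_message() != V3_PASS:
        log("[FIDA] Failed - RecvIdaMessage")
        return V3_FAIL
    return V3_PASS


def reporting_step(connect, send_message, recv_message, log):
    """Outer step: the 'Report result' failure follows from any inner failure."""
    log("[FIDA] Reporting Results")
    status = report_result(connect, send_message, recv_message, log)
    if status != V3_PASS:
        log("[FIDA] Failed - Report result")
    return status
```

Seeing "Failed - RecvIdaMessage" without "Failed - Connect" or "Failed - SendIdaMessage" therefore means the first two steps passed, which is the reasoning used above.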

The "RecvIdaMessage" function in "common.c" operates on the message itself. The first thing it does is receive it using the COSM network function v3NetRecv. Reading the FIDA template header for RecvIdaMessage gave me some annoyance: the remarks in the header state "The message body can be arbitrarily long", and just a couple of lines later it employs a 5000 millisecond time lapse in the "hope" that the message will have finished arriving in that time. BUT, and this is a very big BUT, the COSM reference for v3NetRecv states "A suggested time for wait_ms is ( length + 5000 ), giving 5 seconds plus 1sec/KB of data allowing for enough data transfer time over modems and other net glitches."
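
The quoted suggestion works out as follows, shown as a small Python sketch. Only the formula `length + 5000` comes from the COSM reference quoted above; the function and constant names are mine, and this is not a call into COSM itself.

```python
BASE_WAIT_MS = 5000  # the fixed 5-second allowance from the quoted suggestion


def suggested_wait_ms(length_bytes: int, base_ms: int = BASE_WAIT_MS) -> int:
    """Suggested wait: one millisecond per byte (roughly 1 sec/KB) plus a fixed base."""
    if length_bytes < 0:
        raise ValueError("length_bytes must not be negative")
    return length_bytes + base_ms


# A fixed 5000 ms wait matches the suggestion only for an empty message;
# a 1 KB body would already be allowed about a second more.
wait_for_1kb = suggested_wait_ms(1024)
```

Keeping `base_ms` as a parameter is one way to make the fixed part adjustable without touching the formula.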

Hopefully the eOn server and client have been compiled with a constant greater than the default of 5000 milliseconds to allow for the varying length of messages that may be returned by the client as well as the day-to-day grind that continues to gradually slow down the internet.

I believe this may be where the wheels fall off: as the result frame size changes, the transfer can exceed the allowed time limit.

** I have timed the Connect-to-Fail message cycle to about 20 seconds on the round trip for eOn at the moment **

v3NetRecv returns V3_PASS on success, or an error code on failure. It appears that it may be the first point at which the Fail code is returned.
The next set of tests on the header and body of the message generally all work fine & there are no hard-limits etc.. & no reason to suspect them.

If we assume that this call succeeds on the server, even then the trouble is not over, because the same code is used in the client for a message (WU) to be received by the client.

There is a very useful debug statement which is remarked-out in the FIDA template:
/* v3PrintA((ascii*)"number of bytes received = %u\n",bytes_received); */

Personally I think this should be enabled and disabled by passing a flag so that the message size & duration can be tested during run-time by the project operator without having to compile a separate executable.

Anyway that's my take on it.
I'm happy to hear from anyone else who'd like to trawl through FIDA & COSM code.
I wish I had the time to run tests using the eOn app, but I don't.

I hope that's the source of the problem, because at least that is easily fixed.
Best of luck.

Come to think of it, the time duration can be substituted with a variable and passed via a configuration setting contained within the outgoing messages, so you could actually change that length on-the-fly

AMDave
08-07-2005, 09:40 AM
Now the messages are "[FIDA] Failed - Connect"

The server is down.

A quick check on the website verifies this.


Can someone please restart the server ?

IronBits
08-07-2005, 12:46 PM
Nice work!
The server is in a planned power outage today :(

AMDave
08-08-2005, 05:58 AM
Eek.
The SQL Server is down.:eek:

graeme
08-08-2005, 05:14 PM
The server was down for a few hours on Sun, and mysql froze last night, and is now restarted.

This is some excellent advice from AMDave. We are currently using the suggested time limit of (bytes_to_receive + 5000) ms. I'll increase this to see how it changes things. We really appreciate that you dug into the code, and suggested this change.

The server will also be down for about an hour on Wed. Our bad memory has been replaced, so we'll add that to what we have, and update the operating system.

TeeJay
08-09-2005, 08:44 AM
Originally posted by graeme
The server was down for a few hours on Sun, and mysql froze last night, and is now restarted.

This is some excellent advice from AMDave. We are currently using the suggested time limit of (bytes_to_receive + 5000) ms. I'll increase this to see how it changes things. We really appreciate that you dug into the code, and suggested this change.

The server will also be down for about an hour on Wed. Our bad memory has been replaced, so we'll add that to what we have, and update the operating system.

Graeme, any idea why we are still getting lots-o-ZZZ's ?
Seeing ZZZ's since Saturday night...

Thanks,
>>TeeJay

AMDave
08-09-2005, 12:12 PM
TeeJay

I think we are getting close to the source of the ZZZ.
The project team are studying the data and may have to recompile, test and deploy their software. It is more than just a couple of days' job.

Keep crunching mate. :thumbs:



graeme,

I have not been able to trap the average bytes_to_receive size, so I cannot calculate the increment accurately (milliseconds), but as you say a small increase of about 2000 to 3000 milliseconds should clear most of the problems and prevent the server from getting hung up on the handshaking.

Obviously the increment should be the minimum possible to maximise the return rate (this sounds like a classic simplex goal programming problem for the students :) ) You will have a better idea on the server end of what the actual transmission length is.

It looks like the "common.c" portion of FIDA code is used in the build of both the client and the server apps, so you can't set a different time limit for inbound and outbound. But it does look like a new build will involve a new client version and server-side app. If you want to have a test run, I guess you could deploy on the backup server and hand out a few test clients and if all goes well then deploy in production.

There are points in the code where monitoring functions could be inserted so you could see on the server end if remote clients are having problems, but those would have to be written up.

Sounds like more beer and pizza for omer again ;)

graeme
08-09-2005, 06:03 PM
I tried increasing the delay time, and it doesn't seem to help client communication. The problem seems to be associated with overloading the server. One major problem now is that Andreas is trying a new problem which contains very few atoms. This has dramatically increased the hits to the server. Also, we need to figure out a better way to initiate the searches, because too many are failing right now. Searches which fail after a short time are also contributing to server traffic.

I've switched back to our Mg system. We'll run this system until the end of the week, when Andreas will have a new (larger) Cu grain boundary system.

AMDave
08-10-2005, 05:10 AM
graeme,

per my note above, the "common.c" file (which contains the RecvIdaMessage function) is used in both the client and server executables. (This is as defined in the Makefile in the "fida/source" folder, not the one in the "fida" folder.)

To successfully deploy the time adjustment you would have to redeploy the updated client executable to all contributors in addition to replacing the server executable, because I expect that the clients are still communicating with the server on the shorter timeframe regardless of the change to the server executable.

The change back to the Mg model has made an enormous difference. I hope the results are still useful back in the lab.

However, down the road you may have another fast resolving / rejecting task which may result in the same symptom. This is a shortcoming of the FIDA code in its current version, I think, but one that can be easily addressed if you amend the FIDA code to accept the #milliseconds as a parameter and pass it from the App. Then you would be able to remotely adjust the duration on the clients and the server as required from the comfort of your office, without having to recompile and redeploy the client and server executables each time you need to increase or decrease the duration, by including the appropriate duration value in the library file of each model.

Given that different projects are expected to have different duration requirements (as stated by the COSM references) I am surprised that the FIDA handler was not already written in this way.

I also note that there is scope for FIDA to handle multiple Apps (ie you could have the server handle multiple models at once).

Do the Fida and Cosm code frameworks still have developer support ?

graeme
08-10-2005, 01:30 PM
Yes that's right. I increased the delay time in IdaMessageRecv to as much as 30 seconds, and rebuilt the server and a test client. So this was not a test of the entire network, but rather a test to see if one client would behave better. It did not. I was getting the same fraction of failed connection attempts -- they just happened at longer intervals.

What I see is that the tcp/ip stack on the server gets to a point where the number of incoming connections increases faster than they are being processed. It looks like it reaches this point, the stack builds up, and then they clear due to timing out. So although it's possible that having a longer wait time for all clients (and the server) might help, I have the impression that it's mainly a problem of having too many connections as compared to the rate at which the server can handle them. And, as you say, testing this change on all clients would be difficult because we have no automatic way of updating the client binary.
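
The build-up-then-clear behaviour described above can be illustrated with a toy queue model in Python. This is a deliberately crude sketch, not a model of the Linux tcp/ip stack: the tick-based timing, the fixed arrival and service rates, and the function name are all assumptions.

```python
from collections import deque
from typing import List


def simulate_backlog(arrivals_per_tick: int, served_per_tick: int,
                     timeout_ticks: int, ticks: int) -> List[int]:
    """Return the number of waiting connections after each tick.

    Each tick: new connections arrive, the server handles up to
    served_per_tick of the oldest ones, and any connection that has
    waited timeout_ticks or longer is dropped.
    """
    waiting = deque()  # arrival tick of each pending connection, oldest first
    history = []
    for tick in range(ticks):
        waiting.extend([tick] * arrivals_per_tick)
        for _ in range(min(served_per_tick, len(waiting))):
            waiting.popleft()
        while waiting and tick - waiting[0] >= timeout_ticks:
            waiting.popleft()
        history.append(len(waiting))
    return history
```

When arrivals exceed the service rate, the backlog in this model grows until timeouts start removing connections, after which most of what "clears" the queue is timing out rather than being served. A longer timeout raises the plateau but does not change the imbalance.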

One of our goals in designing the fida system is that there should be a clean separation between the communication code and the computational code. Our idea was that scientists, who like to program in languages like fortran, should be able to use the system without writing any communication code into their computational program. This is one reason there is so little data passed between fida and the application being run.

That said, fida does know about the size of the data being passed to and from the application, and uses this information to set the wait time on IdaMessageRecv. Each time data is passed between client and server, a header is sent first, and this small packet contains the size of the message. Longer messages have a longer wait time, because the bytes_to_receive variable is set by the message size. We have allocated the 1 s/kB + 5 s time suggested by the cosm guys. When the client app gets updated, for example, a very large message of about 1MB is sent (in chunks), and there is no problem even on slow connections. The additional 5 seconds, which dominates for small messages, is hard coded, but it's not clear that adjusting this would help.
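
As a rough illustration of a size-carrying header driving the wait time, here is a Python sketch. The real fida header layout is not given in this thread, so the 4-byte big-endian length used here is purely hypothetical, as are the function names; only the `(bytes_to_receive + 5000) ms` rule comes from the posts above.

```python
import struct

HEADER_FORMAT = "!I"  # hypothetical header: one unsigned 32-bit body length, network byte order
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def pack_message(body: bytes) -> bytes:
    """Prefix the body with a header carrying its length."""
    return struct.pack(HEADER_FORMAT, len(body)) + body


def read_header(data: bytes) -> int:
    """Extract the announced body length from the start of a message."""
    (bytes_to_receive,) = struct.unpack(HEADER_FORMAT, data[:HEADER_SIZE])
    return bytes_to_receive


def wait_ms_for(bytes_to_receive: int, base_ms: int = 5000) -> int:
    """Wait time scaled by message size: (bytes_to_receive + base) milliseconds."""
    return bytes_to_receive + base_ms
```

For a 100-byte body this gives 5100 ms, so the fixed 5 seconds dominates for small messages, as noted above.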

There is no support in fida to run multiple apps at the same time.

Fida is still under development, but we have only been making small changes as needed, and seeing if we can improve efficiency. Omer was trying to make larger changes; adding an adaptive thread system for the server, support for a secondary server, and dealing with firewalls. Unfortunately he went off to do an internship for the summer, and we still need to see if eon can fit into his PhD project.

Cosm has been under development ever since I first saw it. The arrow on the page http://www.mithral.com/projects/cosm/ , indicating project status, has not budged since the year 2000. I'm not sure what's going on. The guys working on that project seem very sharp, but perhaps they have moved onto other things.

Bok
08-10-2005, 01:46 PM
Graeme,

I've been following this thread over the last few days with some interest.

I'm by no means an expert on the linux tcp/ip stack but has any tuning been done on it ?

What does sysctl -a return ?

I found a somewhat decent link here (http://www.psc.edu/networking/projects/tcptune/#Linux)

which might help (I presume it's a linux box ?)

Is the nic interface running at 100Mbps ? or 1000Mbps hopefully.

Bok

graeme
08-10-2005, 02:15 PM
That sysctl command certainly returns a lot of information. I have not tweaked this at all, except on an older server to reduce the TCP timeout. This was a major problem when linux clients were not able to load the library app and pinged the server for a new one as fast as they could.

I'll read through the page you sent and see if we can improve the communication. It is a 2.4 linux kernel, on a Gbit line.

The following

/proc/sys/net/ipv4/tcp_timestamps
/proc/sys/net/ipv4/tcp_window_scaling
/proc/sys/net/ipv4/tcp_sack

are turned on, as suggested.

But the

/proc/sys/net/core/rmem_default - default receive window
/proc/sys/net/core/rmem_max - maximum receive window
/proc/sys/net/core/wmem_default - default send window
/proc/sys/net/core/wmem_max - maximum send window

/proc/sys/net/ipv4/tcp_rmem - memory reserved for TCP rcv buffers
/proc/sys/net/ipv4/tcp_wmem - memory reserved for TCP snd buffers

tend to have lower values than suggested on that page.
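
For comparing the current values against a tuning guide, a small Python sketch that reads the settings listed above. The paths are the ones named in this post; the function name, the `root` parameter (there so it can be pointed at a test directory) and the choice to return `None` for unreadable entries are my own assumptions.

```python
from pathlib import Path
from typing import Dict, List, Optional, Sequence

# The /proc/sys entries named in the post, relative to /proc/sys.
TCP_SETTINGS = [
    "net/ipv4/tcp_timestamps",
    "net/ipv4/tcp_window_scaling",
    "net/ipv4/tcp_sack",
    "net/core/rmem_default",
    "net/core/rmem_max",
    "net/core/wmem_default",
    "net/core/wmem_max",
    "net/ipv4/tcp_rmem",
    "net/ipv4/tcp_wmem",
]


def read_settings(names: Sequence[str] = TCP_SETTINGS,
                  root: str = "/proc/sys") -> Dict[str, Optional[List[str]]]:
    """Read each setting as a list of whitespace-separated fields.

    Entries that do not exist or cannot be read are reported as None.
    """
    values: Dict[str, Optional[List[str]]] = {}
    for name in names:
        try:
            values[name] = (Path(root) / name).read_text().split()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    for name, fields in read_settings().items():
        print(name, "=", " ".join(fields) if fields is not None else "(unavailable)")
```

Entries such as tcp_rmem and tcp_wmem hold three numbers (minimum, default, maximum), which is why each value is kept as a list of fields rather than a single integer.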

Thanks for this advice. I have been a little confused about why the tcp/ip stack seems to bottleneck when we are far from the bandwidth and processor limitations of the machine. I know that the tcp/ip protocol has significant overheads, and perhaps adjusting these values will reduce them.

Mustard
08-16-2005, 03:37 PM
everything I've got running is loaded out with pagefuls of 60 second timeouts...... :(

graeme
08-16-2005, 03:43 PM
The downtime that was supposed to take place last Wed. is happening now. We're increasing the ram and updating the OS. If everything goes smoothly, we'll be back in 1 hour.

and_ped10
08-16-2005, 03:46 PM
Hi Guys

The server is being upgraded at the moment (August 16). We should be running on full power (110%) tomorrow!

Kindly Andreas

graeme
08-16-2005, 04:05 PM
we're back.

Mustard
08-16-2005, 05:22 PM
Thank you for letting us know. :)

PY 222
08-17-2005, 12:18 AM
Did you guys make changes to the client or server again?

I see an increase in production points but then again I see ZZZs as well.

Looks like we are back to the same issue again.

black_civic55
08-17-2005, 12:36 AM
i havent noticed any zzz's but the output is surely way up

Mustard
08-17-2005, 02:24 AM
I'm seeing tons of zzz's a short work unit, and the server must be getting hammered. My net traffic is going nuts with all the retries...... :( Getting too much wasted time again. Think I'll move stuff to another project.

and_ped10
08-17-2005, 05:53 AM
Hi again,

The reason why you are seeing all these ZzZ is that the system has entered a "bad" state. We are doing our very best to "fix" the problem.

The point is that the system should be allowed to do whatever it wants to do, and with the new setup it sometimes wants to go into a state that the algorithm has not yet been optimized to handle. The optimization relies on tuning parameters and changing parts of the source code.

Since we are doing the optimization, we prefer to let the system stay in that state in order to gather information. With this information we will be able to make the right changes in the source code and tune the free parameters efficiently.

Hope you understand our priority even though you "waste" clock cycles.

Greetings Andreas

PY 222
08-17-2005, 12:33 PM
What else can we do to minimize the "waste" of CPU cycles?

Can we start up another client to soak up the remainder of the CPU cycles when the other client is attempting to connect to the server?

Fozzie
08-17-2005, 12:52 PM
4 running on a XP3200 and still haven't got constant 100%

Installed in folders eon1, 2, 3 and 4.

and_ped10
08-17-2005, 03:28 PM
Hi guys,

I wrote "waste" clock cycles. The reason why I wrote "waste" is that during the ZzZ periods the client does not take any significant CPU resources.

The idea of starting more clients on the same node (processor) could reduce the problem, but I do not think it will solve it completely, as Fozzie's experience shows. The reduction in "wasted" cycles would be due to the fact that at least one client is doing work while the other clients are trying to contact the server.

Kindly Andreas

jasong
08-17-2005, 07:19 PM
I've got 5 clients running at once, which seems to be the optimal number.

On a side note, my seizure disorder is making displaying those 5 instances the mental equivalent of the 4th of July to a 5-year-old.

black_civic55
08-17-2005, 08:50 PM
sorry jason i don't know if i was supposed to laugh at that or not. if you're taking your disorder lightly then yea that was pretty funny.

jasong
08-17-2005, 09:09 PM
Originally posted by black_civic55
sorry jason i don't know if i was supposed to laugh at that or not. if you're taking your disorder lightly then yea that was pretty funny.
Yeah, I was kidding. With all my mental problems, I find it helpful to poke fun at myself. Makes it easier for all involved. I've seen other people in similar situations deny that they're unusual, which scares 90% of the population.

black_civic55
08-17-2005, 09:16 PM
it's good that you're that comfortable. so now.....:rotfl: at your 4th of july comment

PY 222
08-21-2005, 03:04 PM
graeme, anything up with the server today?

Seeing lots of ZZZ on a few of my boxes.

graeme
08-21-2005, 03:48 PM
Andreas fixed a big problem in the windows client code last week which has helped a lot. But still, his new system is quite small and quite a few of the work unit searches are failing. The combination of these factors means that each work unit is completing very quickly and we are still pushing the limits of the server. He will be working on this next week. Assuming he can tune the parameters of the server to get a higher fraction of successful searches, each will take longer as they complete, and the performance should be better. Also, his results are looking quite good, getting towards millisecond timescales, and he could afford to increase the system size.

PY 222
08-21-2005, 05:53 PM
I would be more than happy to let you guys run whatever you all need on one of my servers.

I have dual Xeons 2.8GHz HT and dual Opteron 242s, both with 2 GB RAM and I can put in whatever flavour of Linux you wish and give you complete root access to them.

Let me know if you are interested.

PCZ
08-21-2005, 06:03 PM
PY222

Can you give me root access to a hundred or so ?
Just short term a year or two, three at most :D

graeme
08-21-2005, 06:11 PM
Hey, thanks for the offer -- that's very generous. What we really need, though, is someone who knows something about client/server code. I don't understand why our server runs into trouble when it does. When it starts giving communication errors, it is well within its physical memory (4GB), using about 1/100th of the theoretical bandwidth limit (10 Mbit/s of a 1000 Mbit/s gigabit line), and almost none of the CPU (dual Opteron 246). I think there is something in our server code which is hitting a limit -- perhaps something to do with timing, threads, or the TCP/IP stack. I'm pretty convinced that with better software, our hardware would be fine. We are truly a bunch of network neophytes.

PY 222
08-21-2005, 06:27 PM
Originally posted by PCZ
PY222

Can you give me root access to a hundred or so ?
Just short term a year or two, three at most :D

:crazy: NEVER! :jester:



graeme, let's hope you guys solve the issue soon. :thumbs: If you need hardware, you know where to find me.

PCZ
08-21-2005, 07:19 PM
Just did a pathping to your server to check out the network links and the results were excellent.

zero packet loss and a solid 127ms from the UK.

Tracing route to eon.cm.utexas.edu [146.6.143.207]
over a maximum of 30 hops:
0 xp-64 [172.31.158.93]
1 172.31.158.253
2 cr0.lscher.uk.easynet.net [82.108.10.150]
3 82.111.102.241
4 ge1-3-0-41.br1.enwkg.uk.easynet.net [82.110.108.161]
5 ge3-1-0-0.br0.bllon.uk.easynet.net [195.172.211.74]
6 ge0-0-0-0.br0.wslon.uk.easynet.net [195.172.211.210]
7 ge0-3-0-0.br1.wslon.uk.easynet.net [212.135.125.18]
8 ge0-0-0-0.br1.thlon.uk.easynet.net [195.172.211.214]
9 217.204.60.94
10 ge-5-0-2.402.ar2.lon3.gblx.net [67.17.212.93]
11 so0-0-0-2488m.ar3.jfk1.gblx.net [67.17.72.30]
12 qwest.ar3.jfk1.gblx.net [208.50.13.170]
13 jfk-core-02.inet.qwest.net [205.171.30.17]
14 iah-core-03.inet.qwest.net [205.171.31.6]
15 iah-edge-08.inet.qwest.net [205.171.31.86]
16 65.112.240.186
17 ser2-v60.gw.utexas.edu [192.12.10.2]
18 ser9-v703.gw.utexas.edu [128.83.9.1]
19 wel-v755.gw.utexas.edu [128.83.9.114]
20 eon.cm.utexas.edu [146.6.143.207]

Computing statistics for 500 seconds...
Source to Here This Node/Link
Hop RTT Lost/Sent = Pct Lost/Sent = Pct Address
0 xp-64 [172.31.158.93]
0/ 100 = 0% |
1 1ms 0/ 100 = 0% 0/ 100 = 0% 172.31.158.253
0/ 100 = 0% |
2 14ms 0/ 100 = 0% 0/ 100 = 0% cr0.lscher.uk.easynet.net [82.108.10.150]
0/ 100 = 0% |
3 9ms 0/ 100 = 0% 0/ 100 = 0% 82.111.102.241
0/ 100 = 0% |
4 10ms 0/ 100 = 0% 0/ 100 = 0% ge1-3-0-41.br1.enwkg.uk.easynet.net [82.110.108.161]
0/ 100 = 0% |
5 17ms 0/ 100 = 0% 0/ 100 = 0% ge3-1-0-0.br0.bllon.uk.easynet.net [195.172.211.74]
0/ 100 = 0% |
6 13ms 0/ 100 = 0% 0/ 100 = 0% ge0-0-0-0.br0.wslon.uk.easynet.net [195.172.211.210]
0/ 100 = 0% |
7 12ms 0/ 100 = 0% 0/ 100 = 0% ge0-3-0-0.br1.wslon.uk.easynet.net [212.135.125.18]
0/ 100 = 0% |
8 14ms 0/ 100 = 0% 0/ 100 = 0% ge0-0-0-0.br1.thlon.uk.easynet.net [195.172.211.214]
0/ 100 = 0% |
9 12ms 0/ 100 = 0% 0/ 100 = 0% 217.204.60.94
0/ 100 = 0% |
10 12ms 0/ 100 = 0% 0/ 100 = 0% ge-5-0-2.402.ar2.lon3.gblx.net [67.17.212.93]
0/ 100 = 0% |
11 79ms 0/ 100 = 0% 0/ 100 = 0% so0-0-0-2488m.ar3.jfk1.gblx.net [67.17.72.30]
0/ 100 = 0% |
12 78ms 0/ 100 = 0% 0/ 100 = 0% qwest.ar3.jfk1.gblx.net [208.50.13.170]
0/ 100 = 0% |
13 80ms 0/ 100 = 0% 0/ 100 = 0% jfk-core-02.inet.qwest.net [205.171.30.17]
0/ 100 = 0% |
14 125ms 0/ 100 = 0% 0/ 100 = 0% iah-core-03.inet.qwest.net [205.171.31.6]
0/ 100 = 0% |
15 123ms 0/ 100 = 0% 0/ 100 = 0% iah-edge-08.inet.qwest.net [205.171.31.86]
0/ 100 = 0% |
16 127ms 0/ 100 = 0% 0/ 100 = 0% 65.112.240.186
0/ 100 = 0% |
17 132ms 0/ 100 = 0% 0/ 100 = 0% ser2-v60.gw.utexas.edu [192.12.10.2]
0/ 100 = 0% |
18 127ms 0/ 100 = 0% 0/ 100 = 0% ser9-v703.gw.utexas.edu [128.83.9.1]
0/ 100 = 0% |
19 127ms 0/ 100 = 0% 0/ 100 = 0% wel-v755.gw.utexas.edu [128.83.9.114]
0/ 100 = 0% |
20 127ms 0/ 100 = 0% 0/ 100 = 0% eon.cm.utexas.edu [146.6.143.207]


Trace complete.

Bok
08-21-2005, 07:30 PM
Originally posted by graeme
Hey, thanks for the offer -- that's very generous. What we really need, though, is someone who knows something about client/server code. I don't understand why our server runs into trouble when it does. When it starts giving communication errors, it is well within its physical memory (4GB), using about 1/100th of the theoretical bandwidth limit (10 Mbit/s of a 1000 Mbit/s gigabit line), and almost none of the CPU (dual Opteron 246). I think there is something in our server code which is hitting a limit -- perhaps something to do with timing, threads, or the TCP/IP stack. I'm pretty convinced that with better software, our hardware would be fine. We are truly a bunch of network neophytes.

Perhaps we could persuade Dyyryath to give some input...

Bok

Thor
08-24-2005, 10:14 AM
I'm starting to see ZZZ again...

Might be because a large number of wu's have a hsize of 0. Therefore they are processed quite fast.

Somebody should have a look at it.


Greets Thor

PY 222
08-26-2005, 07:33 PM
More ZZZs again.

You guys might want to look into the server again.

Thor
08-29-2005, 09:30 AM
ZZZ
ZZZ
ZZZZZZ again...

someone needs to kick the server again!

ZZZ
ZZZ
ZZZ

Greets Thor

vaughan
09-08-2005, 10:46 PM
ZZZ again.

graeme
09-09-2005, 02:37 AM
Thanks for the messages. We were sleeping on the job.

Thor
09-26-2005, 07:43 AM
Here they are again!

ZZZ ZZZ ZZZ ZZZ all over the place...:(


Greets Thor

EDIT: Seems like it was only a longer hiccup, although it lasted at least 30 min...

vaughan
09-30-2005, 11:56 AM
ZZZ again

AMDave
10-06-2005, 09:39 AM
:eek: and again

rcoulter
10-08-2005, 08:48 PM
Time to boot the server again.

AMDave
10-08-2005, 10:45 PM
:beep: ...and awaaaay we go again

Thanks fellas.

graeme
10-09-2005, 12:16 AM
np, thanks for the notes.

rcoulter
10-11-2005, 07:48 AM
It would appear that the ZZZZ's are back.

and_ped10
10-11-2005, 07:58 AM
Should be fixed by now.

Andreas

rcoulter
10-12-2005, 07:39 PM
Back in full force. Need a restart.

Randy

and_ped10
10-13-2005, 04:59 AM
Thank you for the info,

The server has been restarted by now.

Cheers,
Andreas

rcoulter
10-23-2005, 09:09 AM
Sunday Morning and time for a reboot.

Randy

and_ped10
10-23-2005, 11:08 AM
Has been done by now ;-)

AMDave
10-30-2005, 01:34 AM
here's a twist.
No ZZZs, everything is going including the stats database.

however, the stats updates have been in "trickle" mode now for several hours for some reason.

The stats update service still seems to be running but is only showing 1 or 2 wu's here and there.

Perhaps the results cache is building and building without being processed or the results are being processed and the stats update is running in some broken form.

At least we are all still crunching.
I hope the project is getting the benefit of the results.
Don't know. No way to tell at the moment.

AMDave
10-30-2005, 04:45 AM
I've been watching the stats recover.

It seems the project is processing the current wu's, but those wu's that were missing from the stats earlier seem to have stayed missing.

Oh well. we're back on track now anyway.

and_ped10
10-30-2005, 06:23 AM
Hi,

The behavior you are seeing is due to the system being in a troublesome state.

When the system is in a troublesome state, almost all calculations done by the clients are discarded by the server. The reason is that there is a set of criteria that the results from the clients have to fulfill; one example is that the new state should be connected to the current state.

Even though the calculations are discarded, the work done is necessary, as it is impossible to tell beforehand which calculations will give good or bad results.

Because the statistics are based on good searches, a drop will show up when the system is in a troublesome state.

Take care,
Andreas

AMDave
10-31-2005, 03:49 AM
Thanks Andreas.

I think I understand.

Although the clients may find "Good Pref."(s) they may not fit the current state of the model as they move towards a lesser "high" point than the optimum solution, (ie the clients climb a lower peak that is not the summit)

My understanding is that this happens from time to time with the Monte Carlo method.

Would that be a fair analogy ?

Just to clarify, during the slow period my clients were finding "Good prefs." Does this mean that they were not deemed good by the model on the server even though they were deemed good by the client?

and_ped10
10-31-2005, 10:43 AM
Originally posted by AMDave
Thanks Andreas.

I think I understand.

Although the clients may find "Good Pref."(s) they may not fit the current state of the model as they move towards a lesser "high" point than the optimum solution, (ie the clients climb a lower peak that is not the summit)

When you get the output "Good Pref" from the client, the client has always found a summit. The "Good Pref" indicates that the client has calculated the eigenfrequency of the lowest eigenmode at the lowest summit point (saddle point), and that the obtained value was reasonable.


My understanding is that this happens from time to time with the Monte Carlo method.

Would that be a fair analogy ?

The characteristic of Monte Carlo algorithms is that they rely on a random number. In the algorithm used in the EON code, the random number decides which state shift happens, chosen from a table of possible state shifts. It is the work of building this table that is done by the clients.

When the server makes its pick from the table, the pick depends on the random number and on two values that are specific to each state shift. The two values are:
(I) the energy barrier the system has to overcome to make the shift happen (larger value, less likely)
(II) the prefactor that describes how often the system would try to make the transition (larger value, more likely). It is this value that is calculated when the client outputs 'Good Pref'.
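
To make the pick from the table a bit more concrete, here is a minimal sketch in Python. It is not the EON server code. It assumes the usual textbook (Arrhenius) form, rate = prefactor * exp(-barrier / (kB * T)), which is consistent with "larger barrier, less likely" and "larger prefactor, more likely" but is an assumption on top of what is written above. The names shift_rates and pick_shift, the units (eV for barriers, 1/s for prefactors) and the temperature argument are all made up for the example.

```python
import math
import random

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def shift_rates(table, temperature):
    """Rate of each state shift, assuming the Arrhenius form.

    table is a list of (barrier_in_eV, prefactor_in_1_per_s) pairs.
    """
    return [
        prefactor * math.exp(-barrier / (K_B * temperature))
        for barrier, prefactor in table
    ]


def pick_shift(table, temperature, rng=random):
    """Pick the index of one state shift with probability proportional to its rate."""
    rates = shift_rates(table, temperature)
    threshold = rng.random() * sum(rates)  # the random number makes the choice
    running = 0.0
    for index, rate in enumerate(rates):
        running += rate
        if threshold < running:
            return index
    return len(rates) - 1  # guard against floating-point round-off


if __name__ == "__main__":
    # Two hypothetical state shifts: same prefactor, different barriers.
    example_table = [(0.5, 1e13), (0.6, 1e13)]
    print(pick_shift(example_table, temperature=300.0))
```

With these made-up numbers the first shift is picked far more often than the second, because its barrier is lower.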


Just to clarify, during the slow period my clients were finding "Good prefs." Does this mean that they were not deemed good by the model on the server even though they were deemed good by the client?
Yes. An example of this could be that the client has found a saddle point, but when it is analyzed it turns out not to be connected to the original state. The client got lost during its search, so to speak. Keep in mind that the search space has more than 3000 dimensions, so there are plenty of different ways to go!

Cheers
Andreas

AMDave
11-01-2005, 04:51 AM
Wow.
That was a great reply.
Thank you for taking the time to respond in such detail.
This is a real insight into how the decision-model works in relation to the work done by the client.

Thanks again Andreas. :thumbs:

AMDave
11-06-2005, 03:59 PM
back on topic...

21 minutes of ZZZ so far (at time of post)

and_ped10
11-06-2005, 06:05 PM
The server has been restarted by now. Hopefully it will stay stable;-)

Cheers andreas

rcoulter
11-23-2005, 10:10 PM
Time for a reboot.

Randy

and_ped10
11-24-2005, 04:56 AM
The server is back on track now.

Cheers
Andreas

AMDave
11-26-2005, 09:06 PM
Hi andreas.
Happy thanksgiving to you too.

I notice that the clients are crunching away but the stats are not showing any increments in the results.

:idea:

Is this a "troublesome" state in the model at the moment, or has the SQL server got a problem ?

graeme
11-26-2005, 10:06 PM
Thanks for pointing this out. The stats are running now, and should be up to date, and reflect all work done.

Gunslinger
11-27-2005, 07:16 AM
Originally posted by graeme
Thanks for pointing this out. The stats are running now, and should be up to date, and reflect all work done.
I don't think all work has been credited - I would have expected to see well over 5000 units credited on my account to catch up the last 4 days, but only a couple of thousand have turned up... :(

rcoulter
11-30-2005, 07:25 PM
The server needs restarting

Randy

and_ped10
12-01-2005, 04:44 AM
The server seems to be running steadily now.

Cheers Andreas

rcoulter
12-07-2005, 08:09 AM
Time to reboot.

Randy

and_ped10
12-07-2005, 12:08 PM
Done ;-)

Thor
12-07-2005, 05:45 PM
I still see quite some ZZZ's:Pokes:

Are the wu's so small at the moment? They just fly past!



Greets Thor

and_ped10
12-08-2005, 04:52 AM
The server collects and processes results as it should at the moment.

Silverthorne
12-09-2005, 11:49 PM
Anyone else seeing lots of ZZZ's?

Thor
12-10-2005, 07:09 AM
Not right now, but the stats file for Bok's stats doesn't seem to be updating...

Maybe someone can fix this so that the stats will kick in again?

Thanks!


Thor

Edit: Now I'm also starting to see some ZZZ's

rcoulter
12-10-2005, 08:03 AM
Andreas

The server is getting almost continuous ZZZ's, with short work units and the stats server has been down for the last 8 or 9 hours.

Randy

AMDave
12-10-2005, 08:15 AM
yep.
pages and pages and pages of zzz
all afternoon and evening (UTC+10 :) ) in fact
occasionally interspersed with a work unit here and there

I also notice that the stats file has not refreshed for several hours now, or rather it may have been, but the stats server has not updated any results for several hours, so the stats file has not changed at all.

I also notice that the average ping times have blown out. Me thinks there may be something else afoot

None-the-less, the stats server could cop a good kick about now :help:

ps andreas / graeme - tricky question - is there anywhere on the server that a script calls itself and also "includes" an environment settings file that appends a file path to the path? If it calls itself enough times, the path gets too long and things start to fall over because some things cannot be found. I have reproduced this on a Solaris box. Just wondering. Still trying to dig up some reason for the server's cyclic instability that is apparent from our end.
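
To show what I mean, a minimal sketch in Python (the real thing would be a shell script, and I have no idea whether the eOn server does anything like this). The directory /opt/eon/bin and both function names are made up. The naive version grows the path on every call; the second version only appends the directory if it is not already there.

```python
import os


def naive_append(path_value, directory, sep=os.pathsep):
    """What a re-included settings file effectively does: always append."""
    return path_value + sep + directory


def append_once(path_value, directory, sep=os.pathsep):
    """Append the directory only if it is not already an entry in the path."""
    entries = [entry for entry in path_value.split(sep) if entry]
    if directory not in entries:
        entries.append(directory)
    return sep.join(entries)


if __name__ == "__main__":
    grown = safe = "/usr/bin"
    for _ in range(100):  # the script calling itself 100 times
        grown = naive_append(grown, "/opt/eon/bin", ":")
        safe = append_once(safe, "/opt/eon/bin", ":")
    print(len(grown), len(safe))
```

After 100 rounds the naive path holds 100 copies of the same directory, while the guarded one still has a single copy.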

[ed]
hmm. nothing wrong with the internet
http://www.internettrafficreport.com/main.htm
[ed/]

and_ped10
12-10-2005, 08:16 AM
Hi guys,

The simulation has reached a very troublesome state. That is the reason why you are getting small work units and see lots of ZZZs.

I have tried to tweak the simulation in order to get out of the troublesome state. Hope that it is working a little better now ;-)

Cheers Andreas

AMDave
12-10-2005, 08:44 AM
I see WUs on all clients :cool:
cheers :cheers:

still no stats updates on the eOn site tho :confused:

AMDave
12-10-2005, 09:24 AM
:idea:

andreas

is there any way to script the detection of a troublesome state and perform the tweak you just did ?

graeme
12-10-2005, 09:32 AM
I was the cause of the problem. There was some error with mysql last night which prevented the addition of new groups. I restarted mysql, which solved the problem, but forgot to remove our lock files, which prevent multiple scripts from updating the stats at the same time.
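
For the curious, a leftover lock file blocks everything because each script gives up as soon as it sees the file. Below is a minimal sketch in Python of a lock file that also treats a very old lock as stale. This is not our actual stats script; the function names and the one-hour stale_after default are made up for the example, and removing a stale lock this way still leaves a small race window if two scripts notice it at the same moment.

```python
import os
import time


def acquire_lock(path, stale_after=3600.0):
    """Try to take the lock by creating the file atomically.

    Returns True if this process now holds the lock, False otherwise.
    A lock file older than stale_after seconds is assumed to be left
    over from a crashed or interrupted run and is removed.
    """
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        try:
            age = time.time() - os.path.getmtime(path)
        except FileNotFoundError:
            return False  # the holder just released it; the caller can retry
        if age < stale_after:
            return False  # someone else is (probably) still updating
        try:
            os.remove(path)
        except FileNotFoundError:
            pass  # another process already cleaned it up
        return acquire_lock(path, stale_after)
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    return True


def release_lock(path):
    """Remove the lock file; a missing file is not an error."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```

A script would call acquire_lock at the start, skip the stats update if it returns False, and call release_lock in a finally block so a crash part-way through does not leave the file behind.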

Thor
12-10-2005, 10:06 AM
So this time not the server but graeme needed a kick:D

Thanks for repairing it and keeping us posted:thumbs:


Greets Thor

Silverthorne
12-10-2005, 08:33 PM
Lots of ZZZ's again. :confused:

Thor
12-11-2005, 04:48 AM
yep, here too!

Anybody home on a sunday???


Thor

Silverthorne
12-13-2005, 07:23 AM
Is it just me or does the EON client ZZZ a lot?

Longbow
12-13-2005, 05:07 PM
It does. That is why I run another project along side it and use a program called SetPriority.

http://gilchrist.ca/jeff/SetPriority/
(Thanks again Jeff for this wonderful tool!)

I set the thread priority of Eon to -2 (default is -15) and it will use 100% of the processor. When it Zzzz's my other project (Distributed Particle Accelerator Design) will pick up the slack.

Edit:

I just checked the CPU times under task manager.
Eon 40:15:37
DPAD 58:47:49

So even with Eon getting the higher priority setting my secondary project gets more CPU time due to Eon's Zzzz's

and_ped10
12-14-2005, 04:06 AM
Is it still behaving badly?

Silverthorne
12-14-2005, 07:21 AM
Seems to be running better here.

Thor
12-14-2005, 08:33 AM
Looks better now, but it was really bad :sleepy: :sleepy: :sleepy:


Thor

Silverthorne
12-17-2005, 08:21 PM
Originally posted by Longbow
It does. That is why I run another project along side it and use a program called SetPriority.

http://gilchrist.ca/jeff/SetPriority/
(Thanks again Jeff for this wonderful tool!)

I set the thread priority of Eon to -2 (default is -15) and it will use 100% of the processor. When it Zzzz's my other project (Distributed Particle Accelerator Design) will pick up the slack.

Edit:

I just checked the CPU times under task manager.
Eon 40:15:37
DPAD 58:47:49

So even with Eon getting the higher priority setting my secondary project gets more CPU time due to Eon's Zzzz's

Does the thread priority need to be reset every time the computer reboots?

Longbow
12-18-2005, 07:55 AM
Yes, but you might want to check out this message where some others are writing scripts to do it.

http://free-dc.org/forum/showthread.php?s=&threadid=10394

I rarely reboot so it isn't really an issue for me.

Edit: Some people are also running multiple instances of Eon to try and battle the Zzzz's, but I find this just leaves you with multiple sleeping clients.

IronBits
12-18-2005, 11:20 AM
There is also this program you can get for only $35.
http://www.iarsn.com/taskinfo.html
It does so much more, but
Change process priority and
Make Process Priority Settings Persistent
are nice features applicable to this thread. :thumbs:

Silverthorne
12-18-2005, 03:42 PM
Thanks for the info, Taskinfo is just what I need.

rcoulter
01-15-2006, 01:18 AM
ZZZs and more ZZZs

Randy

and_ped10
01-21-2006, 10:23 AM
Sorry that I have not responded earlier!

The system has been caught in a really bad state for the last week or so. I have tried to resolve the problem by restarting the server in a state prior to the current one. Each time I have done this, the system has somehow always ended up back at the troublesome point it was removed from. Therefore I have decided to let the simulation run for 5-8 days even though it is in a bad state, and just hope that it will get out somehow.

Cheers Andreas

rcoulter
01-24-2006, 10:46 PM
This must be real annoying having to restart the server(s) all the time.

Randy

vaughan
02-02-2006, 02:14 AM
Getting lots of :sleepy: again. :Pokes: the server please.

rcoulter
02-07-2006, 08:03 AM
Server has been almost completely down for several hours.

Randy

AMDave
02-07-2006, 10:13 AM
please :bouncy: the server
my monitors have gone all boring :coffee:
:rotfl:

Paratima
02-07-2006, 12:16 PM
More interesting now. :clap:

Fozzie
02-12-2006, 03:50 PM
you like looking at a load of Zzzzz's

:Pokes: :Pokes: :Pokes: that gorram server.

Paratima
02-15-2006, 09:47 PM
It's OK again now, but it's gone yo-yo on us maybe half a dozen times today. 2-3 hours of running like a bat, then solid ZZZ's for a half hour or better. :bang:

alpha_fruit
02-20-2006, 07:27 AM
I am a new e0n user and all I'm getting is this; Unable to open CuEMTCli3.dll for writing.

Seems like e0n does a lot of zzzzz's, maybe this isn't the project for me.

Anyone else have this problem again?

I really like participating, but if this keeps happening, I will terminate and start something else.

Paratima
02-20-2006, 08:35 AM
Sounds like it's not running in the proper directory. If you're starting it from a desktop icon, check the Properties and make sure that the "Start In" bit names the actual full path to client.exe.

Likewise, if it's starting from the Startup folder or somewhere else in the Start Menu, be sure the "Start In" is set correctly. :thumbs:

alpha_fruit
02-20-2006, 08:41 AM
Since I am not PC savvy, could you tell me where to find what I need to fix?

When I downloaded the software, it worked great, then it went to zzzz's. Do we know why?

But if you will tell me where to look and what to do I will try to fix it.
TIA

Paratima
02-20-2006, 09:44 AM
alpha-fruit, I've sent you a Private Message. I think we can work this out pretty easily.

Paratima
02-20-2006, 10:00 AM
In fact, I just re-installed the Windows client and think I know the problem.

When Eon installs under Windows, it makes an entry in the Startup program group. It incorrectly sets the "Start In" location to C:\WINDOWS\SYSTEM32 when it should be "C:\Program Files\Eon", complete WITH the enclosing quotes. Consequently, whenever the program starts, it can't find its required files and just spins its wheels.

To correct this, left-click on Start, Programs, Startup, then right-click on Eon, then left-click on Properties. In the box that says Start in, delete what's there and type in "C:\Program Files\Eon" with the quotes. You could even copy and paste it in. Click OK. Then run the program.

To run it immediately, just go back to Start, Programs, Startup, and double-click on Eon. Should work a treat.

I'll send a copy of this conversation to Graeme and nudge him toward fixing the installer. Good work for uncovering a glitch! :hifi:
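
For whoever ends up looking at the installer or the client: the "Start In" trouble comes from looking for files relative to the current working directory. One common way around it is to look next to the program itself. A minimal sketch in Python follows; I don't know how the Eon client actually locates its files, so this is an illustration only, and the function names are made up.

```python
import sys
from pathlib import Path


def program_dir(argv0=None):
    """Directory containing the running program, whatever "Start In" says."""
    return Path(argv0 if argv0 is not None else sys.argv[0]).resolve().parent


def data_file(name, argv0=None):
    """Full path of a file that lives next to the program."""
    return program_dir(argv0) / name


if __name__ == "__main__":
    # Resolved next to the program, not in the current working directory.
    print(data_file("CuEMTCli3.dll"))
```

The argv0 parameter exists only so the functions can be tried with a pretend program path.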

alpha_fruit
02-20-2006, 12:59 PM
alpha-fruit, I've sent you a Private Message. I think we can work this out pretty easily.


Got it and answered.:hifi:

Fozzie
02-20-2006, 01:02 PM
boxen switched to a living project.

Let me know when it's back if you would.

Paratima
02-20-2006, 01:13 PM
Erm, hiccuped a bit here, but it's fine now. But hey, no rush for you to come back... :rotfl:

Paratima
02-20-2006, 01:17 PM
Or maybe it's discriminating against Brits. You didn't unplug the trans-Atlantic cable did you? :jester:

Paratima
02-20-2006, 01:45 PM
Now that I look again at the stats, Fozz, I think you ought to take off maybe a week. Maybe two! :clap: :rock: