-
Administrator
Stats rewrite
I'll try to keep some updates on the rewrite...
Today I rewrote the XML parser using XML:Parser:SAX which greatly reduces the memory requirements by not holding the whole XML document in memory. Seems fairly quick too. I'm monitoring it running Einstein stats right now (parsing host stats) and it doesn't look too bad...
[freedc@statsbox test]$ cat /proc/3498/status
Name: a_gather2.pl
State: S (sleeping)
SleepAVG: 98%
Tgid: 3498
Pid: 3498
PPid: 409
TracerPid: 0
Uid: 501 501 501 501
Gid: 501 501 501 501
FDSize: 256
Groups: 501
VmPeak: 138600 kB
VmSize: 138596 kB
VmLck: 0 kB
VmHWM: 19376 kB
VmRSS: 19372 kB
VmData: 16608 kB
VmStk: 84 kB
VmExe: 12 kB
VmLib: 7688 kB
VmPTE: 252 kB
StaBrk: 00604000 kB
Brk: 01591000 kB
StaStk: 7fff913fde60 kB
Threads: 1
SigQ: 0/16379
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001086
SigCgt: 0000000180000000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
Cpus_allowed: 7fffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
Mems_allowed: 00000000,00000001
I've also started splitting the script into components. This is a bit more tricky the way I have the DB set up as I've got to synchronize them, but it shouldn't be that difficult..
I already have the single scripts that purely do user project ranking and team project ranking
All on the dev box right now only..
If people are interested in this level of detail, I'm mroe than happy to post more
Bok
-
=>Team Joker<=
All them 00000000 make my eyes go
I am sure I can speak for all of us and say
We really appreciate all that you do for us Bok
-
TeAm AnandTech
Yes, indeed. Although I have no idea what all the numbers mean, I'm just happy there's someone like Bok who KNOWS what they mean and gives us some fantastic stats.
-
Administrator
Ok, part 2...
I re-implemented again the gathering script using XML::Twig this time and for a change, inserting the data into a temporary table (more on that later)
Testing on Einstein hosts (~ 373000) took 20mins to insert all the data. Not sure if I can make that any faster.
Now to run ranking on the hosts...this is where I've used advice that came from Zain Upton over at BoincSynergy quite some time ago (thanks Zain!)
Let's make this simpler. I have a table that contains amongst other things
id, score, projrank
I create a new temp table containing id + rank (which is an auto-incrementing field meaning as you insert, mysql starts at 1 and incremements).
I then perform the following steps
select curtime();
alter table hosts add index forstats (score desc,id);
insert into temp (id) select id from hosts order by score desc,id;
update hosts,temp set hosts.projrank =temp.rank where hosts.id=temp.id;
select curtime();
And the results are... (index was already created hence the error)
+-----------+
| curtime() |
+-----------+
| 11:25:36 |
+-----------+
1 row in set (0.00 sec)
ERROR 1061 (42000): Duplicate key name 'forstats'
Query OK, 373261 rows affected (5.50 sec)
Records: 373261 Duplicates: 0 Warnings: 0
Query OK, 296975 rows affected (3 min 55.44 sec)
Rows matched: 373261 Changed: 296975 Warnings: 0
+-----------+
| curtime() |
+-----------+
| 11:29:37 |
+-----------+
1 row in set (0.00 sec)
4 mins to rank them isn't too bad.. Current stats script probably would take 10 times that using arrays in perl
Bok
-
almost retired
-
Administrator
Some more investigation on doing project ranking.. Playing about with mysql
This is actually on a table of users in Rosetta, all run on a dual xeon 3.6
mysql> select count(*) from tempuser2;
+----------+
| count(*) |
+----------+
| 138626 |
+----------+
1 row in set (0.00 sec)
Show the top 10 via score.. no ranks...
mysql> select projrank,name,metric1 from tempuser2 order by metric1 desc limit 0,10;
+----------+----------------------+-----------------+
| projrank | name | metric1 |
+----------+----------------------+-----------------+
| 0 | XS_DDTUNG | 26074289.110836 |
| 0 | PY 222 | 19296317.008225 |
| 0 | UW-Madison CAE | 12952821.188437 |
| 0 | lunarcom | 9870414.385457 |
| 0 | Housing and Food Ser | 8993245.2656 |
| 0 | Keith E. Laidig | 8424687.856815 |
| 0 | TeAm Enterprise | 7903186.796393 |
| 0 | lwh1 | 7256810.819434 |
| 0 | TCU Computer Science | 5758233.912814 |
| 0 | Oldguy | 5320341.380613 |
+----------+----------------------+-----------------+
10 rows in set (0.00 sec)
Then a wee bit of mysql magic...
mysql> set @rownum = 0;
Query OK, 0 rows affected (0.00 sec)
mysql> update tempuser2, (select@rownum := @rownum+1 AS newrank, id as user_id from tempuser2 order by metric1 desc) r set tempuser2.projrank = newrank where user_id = id;
Query OK, 138626 rows affected (8.33 sec)
Rows matched: 138626 Changed: 138626 Warnings: 0
mysql> select projrank,name,metric1 from tempuser2 order by metric1 desc limit 0,10;
+----------+----------------------+-----------------+
| projrank | name | metric1 |
+----------+----------------------+-----------------+
| 1 | XS_DDTUNG | 26074289.110836 |
| 2 | PY 222 | 19296317.008225 |
| 3 | UW-Madison CAE | 12952821.188437 |
| 4 | lunarcom | 9870414.385457 |
| 5 | Housing and Food Ser | 8993245.2656 |
| 6 | Keith E. Laidig | 8424687.856815 |
| 7 | TeAm Enterprise | 7903186.796393 |
| 8 | lwh1 | 7256810.819434 |
| 9 | TCU Computer Science | 5758233.912814 |
| 10 | Oldguy | 5320341.380613 |
+----------+----------------------+-----------------+
10 rows in set (0.00 sec)
mysql>
Seems just a little faster.....
Now if only I could get the xml parsing to run faster..... still this is MUCH better than before.
Now I've just got to code in
1. Updates to actual tables from temp.
2. movements
3. milestones
And of course I have a new machine to play with at our server provider which is somewhat more powerful....
Bok
-
TeAm AnandTech
Hehe, glad you like your new "toy" !
-
=>Team Joker<=
-
-
=>Team Joker<=
A Thanks to you Bok and your coding team, for all the tireless effort put in this project
THANK YOU
-
-
TeAm AnandTech
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules