Thread: First Step: Data Storage

  1. #1
    Administrator Dyyryath's Avatar
    Join Date
    Dec 2001
    Location
    North Carolina
    Posts
    1,850

    First Step: Data Storage

    From reading what's already been posted, I'd say that we're all more or less on the same sheet of music. Given that, our first step should probably be to design a system to store the data we're going to gather. Once that's complete (or at least in a usable form), we can begin to work on ways to put stuff into it and take stuff out of it.

    Arachnid uses a sort of project-agnostic schema. It contains more pre-calculated data than we are probably interested in (this makes it fast, but expensive in terms of CPU time and storage space), but it may make a good basis for a lighter design.

    In Arachnid, each project gets several tables:

    Team Index: This gives each team a unique ID, and contains descriptive stuff about the team.
    Team Entry: This is the big table for holding team data. It contains a great deal of pre-calculated data and is basically the data archive. For each update (by timestamp) there is one entry for each team that exists at that update.
    Team Current: Contains only the most recent update for each team. This gives us a small fast table to access when we don't need to pull history.
    User Index: This works just like the team index, giving each user a unique id, as well as containing their team affiliation, status, etc.
    User Entry: This works just like the team data table. It's an archive full of pre-calculated data.
    User Current: This table always contains only the latest update's data for each user. It's designed to stay small and is used by display scripts when only the most recent data is needed. This allows for fast access by the front end.
    Movement: This table tracks user movement from team to team.
    SystemState: Currently contains the timestamp of the most recently completed update.
    Project Entry: For projects that I'm tracking in their entirety (i.e. DF), this contains the project wide data in roughly the same format as the Team Entry and User Entry tables.

    You can grab a version of the SQL used to create these tables right here.

    You'll notice that there is an awful lot of pre-calculated info in these tables. I think most of that needs to go. You'll also see some redundant data in there that needs to be dealt with.

    One of the things that *has* worked pretty well is allowing two data types for every project. Most projects have two, some have more, some have fewer. Two, however, seems like a pretty good compromise. For projects that don't have two, you just insert a NULL into the unused column. For projects with more than two, you can generally get away with ignoring all but the two most important. For example, in DF we have points and structures.
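
    As a quick illustration of that convention (a sketch with made-up values, not the actual schema):

    Code:
    # Sketch only: per-update entry rows with two generic metric slots.
    # A project with a single data type leaves the second slot as None (SQL NULL).
    df_row    = {"user_id": 42, "metric1": 1500.0, "metric2": 312}    # e.g. DF: points + structures
    other_row = {"user_id": 42, "metric1": 875.0,  "metric2": None}   # project with only one data type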

    Any thoughts or comments? I'll probably sit down and build a slimmer table structure for everyone's perusal later tonight...
    "So utterly at variance is destiny with all the little plans of men." - H.G. Wells

  2. #2
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    So, are we talking about what the database should look like then? Would this include "contrived" stats as well? The reason I'm asking is that if we want to do a version of a "Mega-Stat" then we'll need some additional columns to store commensurate data.

    Is there any way to port the contents of another thread to this forum? In particular, this thread.

    prok

  3. #3
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    DHEP has a bunch of different data types, though the most important are effort units and circuits.

    thanks.

  4. #4
    Stats God in Training Darkness Productions's Avatar
    Join Date
    Dec 2001
    Location
    The land of dp!
    Posts
    4,164
    Dyy -

    Silly question. I've noticed that you use an EntryID in your tables (as I did, since I started working with your design years ago). Is it really necessary? After looking through all my code, I can't find a place where it would be useful, and it's just extra fluff in the database taking up space.

    Could it easily be removed without negative side-effects, or does Arachnid make use of it?

  5. #5
    Administrator Dyyryath's Avatar
    Join Date
    Dec 2001
    Location
    North Carolina
    Posts
    1,850
    DP: I use it primarily as a unique key. That isn't strictly necessary for all database types, but some require it so I've left it in to ensure that I can migrate the data to different storage systems easily.

    Additionally, any changes we make here won't need to work with Arachnid. I was just offering those templates as a starting point for a new system. I expect the finished product will be much different.

    prokaryote: A 'mega-stat' system will require additional columns (or tables). Before we try to determine what those are and how they should be organized, we'll need to determine how the mega stat will be generated.

    I've seen a variety of ways to do this. My favorite is probably using a 'CPU power' style unit for each project. For example, we choose a processor to use as our baseline metric (let's say a P3-1000) and determine how it performs in each project. For Folding@Home, we may determine that the P3-1000 does 'z' points per hour. This becomes the official 'standard' for the Folding@Home megastat. To determine a user's megastat for that project, you simply divide their total points by 'z' to acquire the number of CPU hours they've contributed. The scores for each project they participate in can simply be added together for a total megastat.
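
    Just to make the arithmetic concrete (the baseline rates below are made-up placeholders, not real benchmark numbers), a rough sketch:

    Code:
    # Sketch of the 'CPU power' megastat idea. The per-hour baseline rates are
    # invented placeholders; real values would come from benchmarking the reference
    # P3-1000 on each project.
    BASELINE_RATE = {
        "folding_at_home": 10.0,       # points per hour on the reference box (invented)
        "distributed_folding": 25.0,   # points per hour on the reference box (invented)
    }

    def megastat(user_totals):
        """Convert each project's total into baseline CPU hours and sum them."""
        return sum(total / BASELINE_RATE[proj] for proj, total in user_totals.items())

    # A user with 5000 F@H points and 2500 DF points gets 500 + 100 = 600 'CPU hours':
    print(megastat({"folding_at_home": 5000, "distributed_folding": 2500}))  # 600.0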

    This is not the most accurate way to measure contributions, especially on projects like Distributed Folding where the amount of work generated by a processor in a set amount of time can change (i.e. protein changeovers). It is also less than ideal due to the differences in certain processor architectures and how they perform with different projects. However, it is extremely easy for end users to grasp, and I think that is important.

    In fact, that's probably my one 'guiding' principle with stats. If you make something that is ultra cool, but difficult to use or understand, you've gone too far. I've seen some unbelievably thorough stats systems over the last couple of years that the creators were justifiably proud of...except that they were a mess to actually use. You don't want to make something that a user can't figure out pretty quickly. Having to spend time 'learning' to use a complicated stats site means that it's either too complex or has a poor user interface.

    Now, that *doesn't* mean that I'm married to the idea of using a 'CPU power' style metric. It's just my first stab at the idea. If we can come up with a more elegant solution that can still be easily explained to end users, I'd love to hear it...
    "So utterly at variance is destiny with all the little plans of men." - H.G. Wells

  6. #6
    Administrator Bok's Avatar
    Join Date
    Oct 2003
    Location
    Wake Forest, North Carolina, United States
    Posts
    24,469
    Blog Entries
    13
    I've spent a little time looking at these tables..

    Interesting ideas in them. I guess I was thinking of something like this, but not on this scale...

    So, do you download the data from a webpage into the user tables first, and then a separate script inserts/updates the values into the team tables and then the project tables? I actually like that idea. It would fix a lot of the issues I have with team generation without using CGI.

    As to the content, I would certainly increase the number of metrics to track to 3. DHEP, D2OL, TSC, and LM all spring to mind. I don't think space is that much of an issue. CPU is more important, but with a decent processor, especially SMP, reading/updating thousands of rows is OK. We could easily split this out onto multiple CPUs anyway.

    Bok

  7. #7
    Administrator Dyyryath's Avatar
    Join Date
    Dec 2001
    Location
    North Carolina
    Posts
    1,850
    You'd be surprised at how quickly it adds up. My run for just the top 100 teams (Folding@Home) takes about an hour on a dual Athlon MP2200 rig.

    I actually do both the users and the teams in one script, but the loop generally goes something like this:
    • Get a team's data (usually an HTML page, TXT page, or delimited file)
    • Loop over and process every user in the team, adding a row to the user entry table and adding the user's values to the team's total values (variables in memory)
    • Add a row in the team entry table for this team based on the team's now completed total values and add the team's total values to the project's total values (also vars in memory)
    • Go on to the next team


    For this layout, I considered all non-team affiliated users to be on a 'meta-team' of sorts. They get processed in a loop just like the one above, save that no actual row gets entered into the team entry table at the end.

    Once all of the teams (including the meta-team of non-affiliated members) are processed, I add a row to the project entry table for the project as a whole.
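
    Put together, the whole pass looks roughly like the sketch below (the helper functions, attributes, and table names are just placeholders for illustration, not Arachnid's actual code):

    Code:
    # Rough pseudocode-style sketch of the flow described above.
    # fetch_team_page(), parse_users(), and insert_row() are hypothetical helpers.
    def run_update(project, timestamp):
        project_totals = [0, 0]                    # project-wide totals kept in memory
        for team in project.teams:                 # includes the 'meta-team' of unaffiliated users
            team_totals = [0, 0]                   # team totals kept in memory
            page = fetch_team_page(team)           # HTML page, TXT page, or delimited file
            for user in parse_users(page):
                insert_row("user_entry", timestamp, user)      # one row per user per update
                team_totals[0] += user.metric1
                team_totals[1] += user.metric2
            if not team.is_meta_team:              # unaffiliated users get no team row
                insert_row("team_entry", timestamp, team, team_totals)
            project_totals[0] += team_totals[0]
            project_totals[1] += team_totals[1]
        insert_row("project_entry", timestamp, project, project_totals)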

    To illustrate this, I've grepped the comments from Arachnid's DF spider. You can get them right here. The formatting left over from the script code itself should serve to illustrate the loops pretty well.

    Keep in mind that the format of the data retrieved from the project can force changes to this flow. ChessBrain springs immediately to mind.

    As for multiple data types for our schema, three sounds fine to me. My biggest concern is moving the pre-calculated fields like LastUpdate, LastDay, and LastWeek out of the spidering/gathering scripts and into the rendering scripts. That should speed up processing dramatically: it's less computationally expensive and requires far fewer SQL selects/inserts, which means less disk access. It will also reduce storage requirements.
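
    Roughly what that looks like on the rendering side (assuming those fields are just deltas against earlier archived rows; the numbers below are made up):

    Code:
    # Sketch: derive LastUpdate/LastDay/LastWeek at render time from archived
    # running totals instead of pre-calculating and storing them at spider time.
    def render_deltas(current_total, total_at_previous_update, total_24h_ago, total_7d_ago):
        return {
            "LastUpdate": current_total - total_at_previous_update,
            "LastDay":    current_total - total_24h_ago,
            "LastWeek":   current_total - total_7d_ago,
        }

    # Totals pulled from the user entry archive by timestamp (made-up numbers):
    print(render_deltas(5000, 4950, 4800, 3500))
    # {'LastUpdate': 50, 'LastDay': 200, 'LastWeek': 1500}
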
    "So utterly at variance is destiny with all the little plans of men." - H.G. Wells

  8. #8
    Member Cmarc's Avatar
    Join Date
    Dec 2002
    Location
    SF Bay Area
    Posts
    70
    Interesting stuff here.

    Being cursed with a horrible connection, I've been toying with the notion of moving to some sort of hosted solution since day one (and kept pushing it off to some date in the future when I'd have cleaned up my stats system sufficiently to make it possible). The obvious difficulties were lowering the data storage and CPU load requirements of my patchwork system (thrown together for a small ecc2 team stats page, then expanded to other projects in a less than ideal way).

    The worst issue I had was the monstrous size of the user entry table for SoB. While it's a small project, 24 updates a day with several thousand lines per update quickly added up. After a few months of running the stats, the table was well over a gigabyte in size, which brought problems beyond excessive storage requirements.

    Removing older entries was an easy fix, but it meant losing some of the history I always planned to use in some way.

    Finally I came to a compromise: I decided to keep the entire history without all the calculated values I'd added to the source data, and to add additional tables for short-term history.

    Looking at the data I displayed at the time, I mainly needed a week's worth of history. Anything further back could be calculated as needed from the far more limited user entry table I ended up with. Beyond shrinking my storage requirements by approximately 60% (I had more calculated data than source information), this sped up a lot of things at the time, from the actual update process to the queries that display the resulting data, simply by virtue of moving to smaller tables.

    Cheers,
    Marc

  9. #9
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    Originally posted by Dyyryath
    prokaryote: A 'mega-stat' system will require additional columns (or tables). Before we try to determine what those are and how they should be organized, we'll need to determine how the mega stat will be generated.

    I've seen a variety of ways to do this. My favorite is probably using a 'CPU power' style unit for each project. For example, we choose a processor to use as our baseline metric (let's say a P3-1000) and determine how it performs in each project. For Folding@Home, we may determine that the P3-1000 does 'z' points per hour. This becomes the official 'standard' for the Folding@Home megastat. To determine a user's megastat for that project, you simply divide their total points by 'z' to acquire the number of CPU hours they've contributed. The scores for each project they participate in can simply be added together for a total megastat.

    This is not the most accurate way to measure contributions, especially on projects like Distributed Folding where the amount of work generated by a processor in a set amount of time can change (i.e. protein changeovers). It is also less than ideal due to the differences in certain processor architectures and how they perform with different projects. However, it is extremely easy for end users to grasp, and I think that is important.

    In fact, that's probably my one 'guiding' principle with stats. If you make something that is ultra cool, but difficult to use or understand, you've gone too far. I've seen some unbelievably thorough stats systems over the last couple of years that the creators were justifiably proud of...except that they were a mess to actually use. You don't want to make something that a user can't figure out pretty quickly. Having to spend time 'learning' to use a complicated stats site means that it's either too complex or has a poor user interface.

    Now, that *doesn't* mean that I'm married to the idea of using a 'CPU power' style metric. It's just my first stab at the idea. If we can come up with a more elegant solution that can still be easily explained to end users, I'd love to hear it...
    Okay, we're both in agreement that a system based upon CPU power is chock-full of issues.

    Not so sure what's hard to understand about the mega stat mentioned before. It's simple and elegant in that there's no need for calculating some ideal or standard CPU benchmark.

    Let me take another stab at explaining how it works.

    1. If I only have one project, then the case is pretty simple: the person with the most WU is the farthest along.

    ---|------------------------A----------------------B------>
       0


    As you can see above, person B is farther from the origin (the 0 point) than A is; therefore, he/she is ahead. How do you measure this? Just figure out the distance from the origin to points A and B. That is your statistic: distance from the origin. Okay, simple enough in one dimension, or for one project.


    2. Now let's imagine that we have two projects.


    as you can see above
    [Attached image: two-project (2D) illustration]

  10. #10
    Administrator Dyyryath's Avatar
    Join Date
    Dec 2001
    Location
    North Carolina
    Posts
    1,850
    I'm assuming there was more coming with this explanation?

    I actually understood what you were getting at before, I just wasn't sure it would be intuitive for the average person. Maybe I'm underestimating the average person or maybe we just need a really clear, concise explanation.

    With that in mind, let's hear the rest of this one.
    "So utterly at variance is destiny with all the little plans of men." - H.G. Wells

  11. #11
    Target Butt IronBits's Avatar
    Join Date
    Dec 2001
    Location
    Morrisville, NC
    Posts
    8,619
    I would think the first step would be to 'gather' up all the DATA and pre-format it and VALIDATE it.
    2nd step, get it into a DB... from there, the sky's the limit

  12. #12
    Administrator Dyyryath's Avatar
    Join Date
    Dec 2001
    Location
    North Carolina
    Posts
    1,850
    Hey prokaryote, I've got a rough draft of the new database schema ready, but I'm still waiting to hear you expound on your ideas for the mega-stat before I put it up.

    Let's hear it!
    "So utterly at variance is destiny with all the little plans of men." - H.G. Wells

  13. #13
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    Sorry,

    I managed to lose the reply box during editing, and with all of the key mashing I didn't realize that only half of what I'd typed actually made it. In the interim, I've been rebuilding my main system, which just came up today.

    Okay, let's see: two dimensions, or two projects.

    In the picture, the horizontal line is the project 1 axis and the vertical is the project 2 axis. The circles are just equidistance lines, included for illustration. So in this case the same principle applies: the farther from the origin you are, the higher your ranking.

    Moving to three projects is more of the same, only now it's in 3D; it's still the case that the larger the distance from the origin, the higher the ranking, and so on.

    Since this is a relative stat, you don't need to calculate some standard CPU and then figure out how many work units it does for each project. Why? Because each axis can only have values from 0 to 1.

    How to calculate this? For each project, just divide each person's total work units by the maximum that has been contributed for that project by one of the participating members of the mega-stat. This produces a unitless number between 0 and 1. You've just now made each project equally contributing to the mega-stat and you don't have to worry about all of the issues surrounding creating a standard CPU benchmark for each project.
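
    A quick sketch of that step (made-up numbers, just to show the arithmetic):

    Code:
    # Sketch: normalize each project axis to [0, 1] by dividing each person's total
    # by the maximum contributed to that project by any participating member.
    def normalize(user_totals, project_maxima):
        return {proj: total / project_maxima[proj] for proj, total in user_totals.items()}

    # e.g. a user with 750 WU in a project whose leader has 1500 WU scores 0.5 on that axis:
    print(normalize({"project1": 750, "project2": 500}, {"project1": 1500, "project2": 1000}))
    # {'project1': 0.5, 'project2': 0.5}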

    If you want to weight each project by popularity, project age, etc., it's easy enough to do: just scale the axes accordingly. The idea is very similar to the one site that ranks teams by participation across most of the DC projects around, except that mine doesn't use Hamming distance like his does (that's where you calculate distance by just adding the contributing project totals together, instead of calculating the Euclidean distance, which is what I'm proposing).

    NOTE: Why is Hamming distance not a very good thing? Because it warps the equidistance lines so that, instead of being intuitive like regular distance, they become somewhat bizarre and lead to some non-intuitive strategies for increasing your rank that aren't very fair. See the picture below:
    [Attached image: Hamming vs. Euclidean equidistance lines, with points A and B]

  14. #14
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    In the Hamming distance picture, the equidistance lines are catenary curves with fixed points on each axis, sagging towards the origin. So how is this not fair? Well, if you look at the illustration, both A and B are equidistant from the origin since they lie on the same equidistance curve, but B is clearly a larger Euclidean (or intuitive) distance from the origin than A is!

    How does this happen? Because by using an additive scheme to calculate distance (using a two project or 2D example as illustrated) someone who has a value of 1/2 in project 1 and 1/2 in project 2 has a Hamming distance (Additive distance --> 1/2 + 1/2 = 1). This is the same as someone who has a value of 1 in project 1 and 0 in project 2 (1 + 0 = 1).

    Using Euclidean distance, person A would have a value of 0.707 and person B would have a value of 1 (Which holds for what our eyes tell us, that B is farther from the origin than A).
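
    A quick check of those two numbers (just the arithmetic from the example above):

    Code:
    import math

    A = (0.5, 0.5)   # half the leader's total in both projects
    B = (1.0, 0.0)   # leads project 1, nothing in project 2

    def additive(p):                     # the 'Hamming' (additive) distance
        return sum(p)

    def euclidean(p):                    # the intuitive straight-line distance
        return math.sqrt(sum(x * x for x in p))

    print(additive(A), additive(B))      # 1.0 1.0       -> A and B tie under the additive scheme
    print(euclidean(A), euclidean(B))    # 0.707... 1.0  -> B correctly ranks ahead of A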

    That is the problem with the current relative based mega-stat scheme at the one website (I forgot where it is) that everyone uses.

    Of course, using some sort of computed standard CPU benchmark can fix this by scaling the axes in such a way that the equidistance lines are returned to their normal circular shape (but it's a lot of work, and for some projects you'd have to compute a new standard value for each problem they work on. I'm thinking of DHEP here, where the amount of time to complete a circuit varies between projects and within a project, given the same CPU!).

    Or you can avoid the whole issue and use a relative based stat and Euclidean distance.

    Bottom line: for the mega-stat, use Euclidean (intuitive) distance, which is easy to understand. The larger the value compared to everyone else, the higher your rank.

    The rest of the garbage I used in the other thread was just examples of how different types of comparisons could be made (the radar graph, the graph of potential distance versus actual distance, etc.). That might be of interest to people who'd like to figure out exactly where they stand relative to someone else and to the maximum obtainable distance they could reach.

  15. #15
    Administrator Dyyryath's Avatar
    Join Date
    Dec 2001
    Location
    North Carolina
    Posts
    1,850
    This doesn't sound like a bad way to do it, so long as people grasp that it's completely relative in nature. If we display the values for the mega-stat (rather than just a ranking), people who are not outproducing the leader will notice that their numbers get smaller rather than larger, even though they are turning in work.

    Additionally, let's see if I can get a little clarification on how you think we should combine the scoring for multiple projects. We'll use two notional guys and two projects as an example.

    Guy 1 has 1500wu in project 1. He's the leader.
    Guy 1 has 0wu in project 2. He's not contributing to this project.

    Guy 2 has 750wu in project 1. He's middle of the pack.
    Guy 2 has 500wu in project 2, where the leader has 1000wu. Again, he's middle of the pack.

    That should reflect this:

    Originally posted by prokaryote
    How does this happen? Because by using an additive scheme to calculate distance (using a two project or 2D example as illustrated) someone who has a value of 1/2 in project 1 and 1/2 in project 2 has a Hamming distance (Additive distance --> 1/2 + 1/2 = 1). This is the same as someone who has a value of 1 in project 1 and 0 in project 2 (1 + 0 = 1).

    Using Euclidean distance, person A would have a value of 0.707 and person B would have a value of 1 (Which holds for what our eyes tell us, that B is farther from the origin than A).
    Now, using this:

    Originally posted by prokaryote
    How to calculate this? For each project, just divide each person's total work units by the maximum that has been contributed for that project by one of the participating members of the mega-stat. This produces a unitless number between 0 and 1. You've just now made each project equally contributing to the mega-stat and you don't have to worry about all of the issues surrounding creating a standard CPU benchmark for each project.
    I'm getting the following:

    Guy 1 gets a '1' for project 1, as he's the leader. He gets a '0' for project 2, since he's not competing. So far, so good.

    Guy 2 gets a '.5' for project 1 (750/1500) and a '.5' for project 2 (500/1000).

    How did you combine project 1 and project 2 to arrive at '.707' for Guy 2?
    "So utterly at variance is destiny with all the little plans of men." - H.G. Wells

  16. #16
    Stats Developer prokaryote's Avatar
    Join Date
    Dec 2001
    Location
    Broomfield, Colorado. USA
    Posts
    270
    Hi Dyyryath,

    Yes it's important that everyone realize that it is a completely relative stat. So if someone wants absolutes, that's a much stickier problem. The CPU benchmark is one way to do that.

    If you throw in a contrived "max-man" and a "zero guy" as two participants then a person can see what kind of progress they're making compared to the best possible score and to the origin. So even though their number may be shrinking versus max man, they'll be making progress versus zero guy.

    To get the value 0.707 I just used the formula for calculating Euclidean distance:

    Distance = square root {(coordinate project A)^2 + (coordinate project B)^2}.

    --> Distance = sqrt ( (1/2)^2 + (1/2)^2 ) = sqrt ( 1/2 ) = 1 / sqrt (2) ~ 0.707

    The concept continues in a similar manner for more dimensions.
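
    For instance, the same formula in code for any number of projects (a sketch only; the three-project numbers are made up):

    Code:
    import math

    def megastat(normalized_scores):
        """Euclidean distance from the origin over the per-project [0, 1] scores."""
        return math.sqrt(sum(s * s for s in normalized_scores))

    print(megastat([0.5, 0.5]))         # Guy 2 from above: ~0.707
    print(megastat([1.0, 0.0]))         # Guy 1 from above: 1.0
    print(megastat([0.5, 0.5, 0.25]))   # a hypothetical three-project participant: 0.75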

    prok.
    Last edited by prokaryote; 03-09-2004 at 12:58 PM.

  17. #17
    Stats God in Training Darkness Productions's Avatar
    Join Date
    Dec 2001
    Location
    The land of dp!
    Posts
    4,164
    This method seems like it would scale at O(1), which would be nice when adding new projects... My main concern would be the number of calculations that need to be done for every person. It would be 1 calc per person per project, so x*y. Assuming we run this mega-stat for, say, 10 projects, and each project has 500 users, that's not too bad (5,000 calculations). What happens when you go to 30 projects at 5,000 users per project (erring on the side of caution)? Even at 10 calcs per second, it's gonna take quite a while to get all the data down and stored...

    Or am I blowing smoke?
