======Exercise 43: A Simple Statistics Engine======

This is a simple algorithm I use for collecting summary statistics "online", or without storing all of the samples. I use this in any software that needs to keep some statistics, such as mean, standard deviation, and sum, but where I can't store all the samples needed. Instead, I can just store the rolling results of the calculations, which is only 5 numbers.

======Rolling Standard Deviation And Mean======

The first thing you need is a sequence of samples. This can be anything from the time to complete a task, the number of times someone accesses something, or even star ratings on a website. It doesn't really matter what, just so long as you have a stream of numbers and you want to know the following summary statistics about them:

  * **sum**: This is the total of all the numbers added together.
  * **sum squared (sumsq)**: This is the sum of the square of each number.
  * **count (n)**: This is the number of samples you've taken.
  * **min**: This is the smallest sample you've seen.
  * **max**: This is the largest sample you've seen.
  * **mean**: This is the most likely middle number. It's not actually the middle, since that's the median, but it's an accepted approximation for it.
  * **stddev**: Calculated as $\sqrt{(sumsq - sum \cdot mean) / (n - 1)}$, where sqrt is the square root function in the math.h header.

I will confirm this calculation works using R, since I know R gets these right:

<code>
> s <- runif(n=10, max=10)
> s
 [1] 6.1061334 9.6783204 1.2747090 8.2395131 0.3333483 6.9755066 1.0626275
 [8] 7.6587523 4.9382973 9.5788115
> summary(s)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.3333  2.1910  6.5410  5.5850  8.0940  9.6780
> sd(s)
[1] 3.547868
> sum(s)
[1] 55.84602
> sum(s * s)
[1] 425.1641
> sum(s) * mean(s)
[1] 311.8778
> sum(s * s) - sum(s) * mean(s)
[1] 113.2863
> (sum(s * s) - sum(s) * mean(s)) / (length(s) - 1)
[1] 12.58737
> sqrt((sum(s * s) - sum(s) * mean(s)) / (length(s) - 1))
[1] 3.547868
>
</code>

You don't need to know R, just follow along while I explain how I'm breaking this down to check my math:

  * **lines 1-4**: I use the function runif to get a "random uniform" distribution of numbers, then print them out. I'll use these in the unit test later.
  * **lines 5-7**: Here's the summary, so you can see the values that R calculates for these.
  * **lines 8-9**: This is the stddev using the sd function.
  * **lines 10-11**: Now I begin to build this calculation manually, first by getting the sum.
  * **lines 12-13**: The next piece of the stddev formula is the sumsq, which I can get with sum(s * s). This tells R to multiply the whole s list by itself and then sum those. The power of R is being able to do math on entire data structures like this.
  * **lines 14-15**: Looking at the formula, I then need the sum multiplied by the mean, so I do sum(s) * mean(s).
  * **lines 16-17**: I then combine the sumsq with this to get sum(s * s) - sum(s) * mean(s).
  * **lines 18-19**: That needs to be divided by $n-1$, so I do (sum(s * s) - sum(s) * mean(s)) / (length(s) - 1).
  * **lines 20-21**: Finally, I sqrt that and get 3.547868, which matches the number R gave me for sd above.
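To double-check the formula outside of R, here's a quick standalone C program (my addition, not part of the exercise's code) that runs the same arithmetic over the ten samples from the R session. Build it with something like cc check.c -o check -lm:

<code c>
#include <stdio.h>
#include <math.h>

int main(void)
{
    // The same ten samples R generated above.
    double s[] = { 6.1061334, 9.6783204, 1.2747090, 8.2395131,
        0.3333483, 6.9755066, 1.0626275, 7.6587523, 4.9382973,
        9.5788115 };
    int n = sizeof(s) / sizeof(s[0]);
    double sum = 0.0;
    double sumsq = 0.0;
    int i = 0;

    for(i = 0; i < n; i++) {
        sum += s[i];
        sumsq += s[i] * s[i];
    }

    double mean = sum / n;
    double stddev = sqrt((sumsq - sum * mean) / (n - 1));

    // Should print approximately sum 55.846020, sumsq 425.1641,
    // mean 5.584602, stddev 3.547868, matching the R output.
    printf("sum %f, sumsq %f, mean %f, stddev %f\n",
            sum, sumsq, mean, stddev);

    return 0;
}
</code>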
======Implementation======

That's how you calculate the stddev, so now I can make some simple code to implement this calculation.

<code c>
#ifndef lcthw_stats_h
#define lcthw_stats_h

typedef struct Stats {
    double sum;
    double sumsq;
    unsigned long n;
    double min;
    double max;
} Stats;

Stats *Stats_recreate(double sum, double sumsq, unsigned long n,
        double min, double max);

Stats *Stats_create();

double Stats_mean(Stats *st);
double Stats_stddev(Stats *st);

void Stats_sample(Stats *st, double s);
void Stats_dump(Stats *st);

#endif
</code>

Here you can see I've put the calculations I need to store in a struct, and then I have functions for sampling and getting the numbers. Implementing this is then just an exercise in converting the math:

<code c>
#include <math.h>
#include <lcthw/stats.h>
#include <stdlib.h>
#include <lcthw/dbg.h>

Stats *Stats_recreate(double sum, double sumsq, unsigned long n,
        double min, double max)
{
    Stats *st = malloc(sizeof(Stats));
    check_mem(st);

    st->sum = sum;
    st->sumsq = sumsq;
    st->n = n;
    st->min = min;
    st->max = max;

    return st;

error:
    return NULL;
}

Stats *Stats_create()
{
    return Stats_recreate(0.0, 0.0, 0L, 0.0, 0.0);
}

double Stats_mean(Stats *st)
{
    return st->sum / st->n;
}

double Stats_stddev(Stats *st)
{
    return sqrt((st->sumsq - (st->sum * st->sum / st->n)) / (st->n - 1));
}

void Stats_sample(Stats *st, double s)
{
    st->sum += s;
    st->sumsq += s * s;

    if(st->n == 0) {
        st->min = s;
        st->max = s;
    } else {
        if(st->min > s) st->min = s;
        if(st->max < s) st->max = s;
    }

    st->n += 1;
}

void Stats_dump(Stats *st)
{
    fprintf(stderr,
            "sum: %f, sumsq: %f, n: %lu, min: %f, max: %f, mean: %f, stddev: %f",
            st->sum, st->sumsq, st->n, st->min, st->max,
            Stats_mean(st), Stats_stddev(st));
}
</code>

Here's what each function in stats.c does:

  * **Stats_recreate**: I'll want to load these numbers from some kind of database, and this function lets me recreate a Stats struct from them (see the sketch after this list).
  * **Stats_create**: Simply calls Stats_recreate with all 0 values.
  * **Stats_mean**: Using the sum and n, it gives the mean.
  * **Stats_stddev**: Implements the formula I worked out, with the only difference being that I calculate the mean with st->sum / st->n in this formula instead of calling Stats_mean.
  * **Stats_sample**: This does the work of maintaining the numbers in the Stats struct. When you give it the first value, it sees that n is 0 and sets min and max accordingly. Every call after that keeps increasing sum, sumsq, and n. It then figures out if this new sample is a new min or max.
  * **Stats_dump**: Simple debug function that dumps the stats so you can view them.
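To show how these fit together, here's a small usage sketch (mine, not from the exercise) of the sample/dump cycle and the save-and-restore round trip that Stats_recreate exists for, assuming the stats.h above:

<code c>
#include <lcthw/stats.h>
#include <assert.h>
#include <stdlib.h>

int main(void)
{
    Stats *st = Stats_create();
    assert(st != NULL);

    // Only the five rolling numbers get stored, never the
    // samples themselves.
    Stats_sample(st, 6.1);
    Stats_sample(st, 9.6);
    Stats_sample(st, 1.2);
    Stats_dump(st);

    // Pretend the five numbers were written to a database and
    // read back later: the recreated struct behaves the same.
    Stats *loaded = Stats_recreate(st->sum, st->sumsq, st->n,
            st->min, st->max);
    assert(Stats_mean(loaded) == Stats_mean(st));

    free(loaded);
    free(st);
    return 0;
}
</code>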
#include "minunit.h" #include <lcthw/stats.h> #include <math.h> const int NUM_SAMPLES = 10; double samples[] = { 6.1061334, 9.6783204, 1.2747090, 8.2395131, 0.3333483, 6.9755066, 1.0626275, 7.6587523, 4.9382973, 9.5788115 }; ======Stats expect = {====== .sumsq = 425.1641, .sum = 55.84602, .min = 0.333, .max = 9.678, .n = 10, }; double expect_mean = 5.584602; double expect_stddev = 3.547868; #define EQ(X,Y,N) (round((X) * pow(10, N)) == round((Y) * pow(10, N))) char *test_operations() { int i = 0; Stats *st = Stats_create(); mu_assert(st != NULL, "Failed to create stats."); for(i = 0; i < NUM_SAMPLES; i++) { Stats_sample(st, samples[i]); } Stats_dump(st); mu_assert(EQ(st->sumsq, expect.sumsq, 3), "sumsq not valid"); mu_assert(EQ(st->sum, expect.sum, 3), "sum not valid"); mu_assert(EQ(st->min, expect.min, 3), "min not valid"); mu_assert(EQ(st->max, expect.max, 3), "max not valid"); mu_assert(EQ(st->n, expect.n, 3), "max not valid"); mu_assert(EQ(expect_mean, Stats_mean(st), 3), "mean not valid"); mu_assert(EQ(expect_stddev, Stats_stddev(st), 3), "stddev not valid"); return NULL; } char *test_recreate() { Stats *st = Stats_recreate(expect.sum, expect.sumsq, expect.n, expect.min, e xpect.max); mu_assert(st->sum == expect.sum, "sum not equal"); mu_assert(st->sumsq == expect.sumsq, "sumsq not equal"); mu_assert(st->n == expect.n, "n not equal"); mu_assert(st->min == expect.min, "min not equal"); mu_assert(st->max == expect.max, "max not equal"); mu_assert(EQ(expect_mean, Stats_mean(st), 3), "mean not valid"); mu_assert(EQ(expect_stddev, Stats_stddev(st), 3), "stddev not valid"); return NULL; } char *all_tests() { mu_suite_start(); mu_run_test(test_operations); mu_run_test(test_recreate); return NULL; } ======RUN_TESTS(all_tests);====== There's nothing new in this unit test, except maybe the EQ macro. I felt lazy and didn't want to look up the standard way to tell if two double values are close, so I made this macro. The problem with double is that equality assumes totally equal, but I'm using two different systems with slightly different rounding errors. The solution is to say I want the numbers to be "equal to X decimal places". I do this with EQ by raising the number to a power of 10, then using the round function to get an integer. This is a simple way to round to N decimal places and compare the results as an integer. I'm sure there's a billion other ways to do the same thing, but this works for now. The expected results are then in a Stats struct and then I simply make sure that the number I get is close to the number R gave me. ======How To Use It====== You can use the standard deviation and mean to determine if a new sample is "interesting", or you can use this to collect statistics on statistics. The first one is easy for people to understand so I'll explain that quickly using an example for login times. Imagine you're tracking how long users spend on a server and you're using stats to analyze it. Every time someone logs in, you keep track of how long they are there, then you call Stats_sample. I'm looking for people are a on "too long" and also people who seem to be on "too quickly". Instead of setting specific levels, what I'd do is compare how long someone is on with the mean (plus or minus) 2 * stddev range. I get the mean and 2 * stddev, and consider login times to be "interesting" if they are outside these two ranges. 
Since I'm keeping these statistics using a rolling algorithm, this is a very fast calculation, and I can then have the software flag the users who are outside of this range. This doesn't necessarily point out people who are behaving badly, but instead it flags potential problems that you can review to see what's going on. It's also based on the behavior of all the users, which avoids the problem where you pick some arbitrary number that's not based on what's really happening.

The general rule you can get from this is that the mean plus or minus 2 * stddev is an estimate of where about 95% of the values are expected to fall (assuming the values are roughly normally distributed), and that anything outside that range is interesting.

The second way to use these statistics is to go meta and calculate them for other Stats calculations. You basically do your Stats_sample like normal, but then you run Stats_sample on the min, max, n, mean, and stddev of that sample. This gives a two-level measurement, and lets you compare samples of samples.

Confusing, right? I'll continue my example above and add that you have 100 servers that each hold a different application. You are already tracking users' login times for each application server, but you want to compare all 100 applications and flag any users that are logging in "too much" on all of them. The easiest way to do that is, each time someone logs in, to calculate the new login stats and then add that Stats struct's elements as samples to a second set of Stats. What you end up with is a series of stats that can be named like this:

  * **mean of means**: This is a full Stats struct that gives you the mean and stddev of the means of all the servers. Any server or user who is outside of this is worth looking at on a global level.
  * **mean of stddevs**: Another Stats struct that produces the statistics of how all of the servers range. You can then analyze each server and see if any of them have unusually wide-ranging numbers by comparing their stddev to this mean of stddevs statistic.

You could do them all, but these are the most useful. If you wanted to then monitor servers for erratic login times, you'd do this:

  * User John logs into and out of server A. Grab server A's stats and update them.
  * Grab the mean of means stats, take A's mean, and add it as a sample. I'll call this m_of_m.
  * Grab the mean of stddevs stats, and add A's stddev to it as a sample. I'll call this m_of_s.
  * If A's mean is outside of m_of_m.mean + 2 * m_of_m.stddev, then flag it as possibly having a problem.
  * If A's stddev is outside of m_of_s.mean + 2 * m_of_s.stddev, then flag it as possibly behaving too erratically.
  * Finally, if John's login time is outside of A's range, or A's m_of_m range, then flag it as interesting.

Using this "mean of means" and "mean of stddevs" calculation, you can do efficient tracking of many metrics with a minimal amount of processing and storage.
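To make the two-level scheme concrete, here's a minimal sketch of how it might be wired up, assuming the stats.h above. The GlobalStats struct and the record_login function are my names for illustration, not part of the exercise:

<code c>
#include <lcthw/stats.h>

// One Stats per server holds its login times; this global pair
// holds the stats *of* those stats.
typedef struct GlobalStats {
    Stats *m_of_m;    // samples are each server's current mean
    Stats *m_of_s;    // samples are each server's current stddev
} GlobalStats;

void record_login(GlobalStats *global, Stats *server, double login_time)
{
    // First level: update this server's own rolling stats.
    Stats_sample(server, login_time);

    // Second level: feed the server's summary numbers in as
    // samples, building the "mean of means" and "mean of stddevs".
    Stats_sample(global->m_of_m, Stats_mean(server));
    Stats_sample(global->m_of_s, Stats_stddev(server));
}
</code>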
======Extra Credit======

  * Convert Stats_stddev and Stats_mean to static inline functions in the stats.h file instead of in the stats.c file.
  * Use this code to write a performance test of string_algos_test.c. Make it optional, and have it run the base test as a series of samples and then report the results.
  * Write a version of this in another programming language you know. Confirm that this version is correct based on what I have here.
  * Write a little program that can take a file full of numbers and spit these statistics out for them.
  * Make the program accept a table of data that has headers on one line and all the other numbers on the lines after it, separated by any number of spaces. Your program should then print out these stats for each column by the header name.

Copyright (C) 2010 Zed A. Shaw