Part B--Other Applications

7.7 Some Statistical Tools

One of the most common applications of computing machinery is that of the analysis of data. It is often the case that a researcher is faced with large amounts of data describing the results of experiments. In order to make sense of this, certain standard tools are often applied, and in this section, two of the simplest are considered. The first of these is the mean or ordinary average:

The mean of a group of data is their sum divided by the number of items in the group.

Besides the mean itself, one often wants to know something about how closely clustered are the data points. For instance the mean of the data

13,14,13,15,16,13,15,16,14,15,14,14,13,12,13

is 14 and so is the mean of the data

1,1,1,1,1,1,1,1,1,1,1,1,1,1,196

yet these two data sets are very different, and there has to be a way to measure that difference.

One way of expressing the scattering of data from the mean is given by the standard deviation. It is found by taking the difference between the mean and each data item, squaring this number and adding all these squares. One then divides by the number of items and takes the square root of the final result.

where M is the mean. That is, the standard deviation is the square root of the expression on the right.

NOTES: The expression above is used only when the standard deviation is computed on the entire population (all possible data.) If the statistics are gathered on a sample of the population, the denominator is changed to n - 1.

The quantity (standard deviation)2 is called the variance of the data.

To construct a satisfactory Modula-2 routine for computing standard deviation, note that if the numerator of the above expression is expanded, one gets:

or

Substituting the formula for the mean, namely

yields

or

In the case of a sample population, the n in the main denominator becomes n - 1 and this formula is

If a summation notation is employed, these two are written:

Whichever of these (whole population or sample) is needed, this form is much easier to work with than the original definition, because the algorithm can operate by storing running totals of numbers as they are read or entered and also storing (separately) the running sum of their squares. The standard deviation can be determined at any point for the data entered thus far, or after all items have been entered, for the division by n or by n - 1 can be done at any time.

One could even decide to save storage space by not storing all the data items in an array as they are examined, but instead keeping track only of the two running totals and the number of items. While this suggestion is not in fact adopted in the code that follows, it could be an important one if the number of data entries in some disk file were very large and the amount of available memory comparatively small.

To obtain maximum flexibility, the statistical functions have been divided into two library modules, one a low level module whose only purpose is to accumulate the running number of items, sum, and sum of squares as data is fed to it, and to report the three statistical measures when desired. The second, higher level module drives the lower level one and does the computations of mean and standard deviation using results obtained from it.

DEFINITION MODULE LowStats;

(* Library of commonly used low level statistical functions
design by R. Sutcliffe & portions of implementation by Mark Harder
last revision 1993 04 06 *)

PROCEDURE Reset ();
  (* Use this procedure before starting to call accumulating variables for a new calculation
  Pre: none
  Post: the number of items, sum, and sum of squares are all set to zero. Max is set to MIN (REAL) and min is set to MAX (REAL)
  NOTE: The initialization code of the module calls Reset. *)

PROCEDURE Enter (x : REAL);
  (* Pre: If this is the first call of this procedure for a new set of data, Reset must be called first.
  Post: the number of items, running sum and running sum of squares are updated *)

PROCEDURE Size () : CARDINAL;
  (* Pre: none
  Post: returns the number of items accumulated since the last call to Reset. *)

PROCEDURE Sum () : REAL;
  (* Pre: none
  Post: returns the sum of the items accumulated since the last call to Reset. *)

PROCEDURE SumSquares () : REAL;
  (* Pre: none
  Post: returns the sum of the squares of the items accumulated since the last call to Reset. *)

PROCEDURE Max () : REAL;
  (* Pre: none
  Post: returns the largest of the items accumulated since the last call to Reset. *)

PROCEDURE Min () : REAL;
  (* Pre: none
  Post: returns the smallest of the items accumulated since the last call to Reset. *)

END LowStats.

IMPLEMENTATION MODULE LowStats;

(* Library of commonly used low level statistical functions
design by R. Sutcliffe & portions of implementation by Mark Harder
last revision 1993 04 06 *)

VAR
  count : CARDINAL;
  sum, sumSq, max, min : REAL;

PROCEDURE Reset ();

BEGIN
  count := 0;
  sum := 0.0;
  sumSq := 0.0;
  max := MIN (REAL);
  min := MAX (REAL);
END Reset;

PROCEDURE Enter (x : REAL);

BEGIN
  INC (count);
  sum := sum + x;
  sumSq := sumSq + x * x;
  IF x > max
    THEN
      max := x
    END;
  IF x < min
    THEN
      min := x
    END;
END Enter;

PROCEDURE Size () : CARDINAL;

BEGIN
  RETURN count;
END Size;

PROCEDURE Sum () : REAL;

BEGIN
  RETURN sum;
END Sum;

PROCEDURE SumSquares () : REAL;

BEGIN
  RETURN sumSq;
END SumSquares;

PROCEDURE Max () : REAL;

BEGIN
  RETURN max;
END Max;

PROCEDURE Min () : REAL;

BEGIN
  RETURN min;
END Min;


BEGIN (* initialization code *)
  Reset;

END LowStats.

Observe that no error handling has been done. There could be problems if the real type is overflowed, for example. In the exercises, the reader is asked to remedy this oversight. In the higher level module that follows, there is some redundancy, for the procedures Largest and Smallest return results that could be obtained by calling the lower level module directly. However, there are other possible higher end modules that could be written here; not all of them might need the maximum and minimum items.

DEFINITION MODULE Stats;
     
(* Library of commonly used statistical functions
design by R. Sutcliffe & portions of implementation by Mark Harder
last revision 1993 04 06 *)

PROCEDURE EnterData (items : ARRAY OF REAL;
                      numItems : CARDINAL);
  (* Pre: the items in use are numbered 0 .. numItems - 1
  Post: data is ready for analysis with statistical functions below *)

PROCEDURE Largest () : REAL;
  (* Pre: none
  Post: The highest value in the last array submitted with EnterData is returned *)

PROCEDURE Smallest () : REAL;
  (* Pre: none
  Post: The lowest value in the last array submitted with EnterData is returned *)

PROCEDURE Mean() : REAL;
  (* Pre: none
  Post: The mean of all the values in the last array submitted with EnterData is returned *)

PROCEDURE VariancePop () : REAL;
  (* Pre: none
  Post: The population variance of all the values in the last array submitted with EnterData is returned *)

PROCEDURE VarianceSamp () : REAL;
  (* Pre: none
  Post: The sample variance of all the values in the last array submitted with EnterData is returned *)

PROCEDURE StdDevPop () : REAL;
  (* Pre: none
  Post: The population standard deviation of all the values in the last array submitted with EnterData is returned *)
     
PROCEDURE StdDevSamp () : REAL;
  (* Pre: none
  Post: The sample standard deviation of all the values in the last array submitted with EnterData is returned *)

END Stats.

IMPLEMENTATION MODULE Stats;
     
(* Library of commonly used statistical functions
design by R. Sutcliffe & portions of implementation by Mark Harder
last revision 1993 04 06 *)

FROM LowStats IMPORT
  Reset, Enter, Size, Sum, SumSquares, Max, Min;
FROM RealMath IMPORT
  sqrt;

PROCEDURE EnterData (items : ARRAY OF REAL; numItems : CARDINAL);
VAR
  count :  CARDINAL;

BEGIN
  Reset;
  FOR count := 0 TO numItems - 1
    DO
      Enter (items [count]);
    END;
END EnterData;

PROCEDURE Largest () : REAL;

BEGIN
  RETURN Max();
END Largest;

PROCEDURE Smallest () : REAL;

BEGIN
  RETURN Min();
END Smallest;

PROCEDURE Mean () : REAL;

BEGIN
  RETURN Sum() / FLOAT (Size());
END Mean;

PROCEDURE VariancePop () : REAL;

VAR
  size : REAL;
  
BEGIN
  size := FLOAT ( Size ());
  RETURN (SumSquares () - (( Sum() *  Sum()) / size)) / size;
END VariancePop;

PROCEDURE VarianceSamp () : REAL;

VAR
  size : REAL;
  
BEGIN
  size := FLOAT ( Size ());
  RETURN (SumSquares () - (( Sum() *  Sum()) / size)) / (size - 1.0);
END VarianceSamp;

PROCEDURE StdDevPop () : REAL;

BEGIN
  RETURN sqrt (VariancePop ());
END StdDevPop;

PROCEDURE StdDevSamp () : REAL;

BEGIN
  RETURN sqrt (VarianceSamp ());
END StdDevSamp;

END Stats.

Notice how the work has been distributed so that most of the procedures have only a line or two of code. This makes easier to debug than it would be otherwise. In fact, when the test program below was compiled and run, the only errors found were in the client. The library modules needed no corrections after the initial compilation; all their functions worked correctly the first time.

MODULE TestStats;
     
(* by R. Sutcliffe
to test the statistics modules
last revision 1993 04 06 *)

FROM Stats IMPORT
  EnterData, Largest, Smallest, Mean, VariancePop, VarianceSamp,
  StdDevPop, StdDevSamp;
FROM SRealIO IMPORT
  ReadReal, WriteFixed;
FROM STextIO IMPORT
  WriteString, WriteLn;
FROM SWholeIO IMPORT
  WriteCard;
FROM RedirStdIO (* non-standard *) IMPORT
  OpenOutput, CloseOutput;

  
PROCEDURE WriteStats;

BEGIN
  WriteString ("largest is "); 
  WriteFixed (Largest (), 2, 0);
  WriteLn;
  WriteString ("smallest is "); 
  WriteFixed (Smallest (), 2, 0);
  WriteLn;
  WriteString ("mean is "); 
  WriteFixed (Mean (), 2, 0);
  WriteLn;
  WriteString ("population variance is "); 
  WriteFixed (VariancePop (), 2, 0);
  WriteLn;
  WriteString ("sample variance is "); 
  WriteFixed (VarianceSamp (), 2, 0);
  WriteLn;
  WriteString ("population standard deviation is "); 
  WriteFixed (StdDevPop (), 2, 0);
  WriteLn;
  WriteString ("sample standard deviation is "); 
  WriteFixed (StdDevSamp (), 2, 0);
  WriteLn;
END WriteStats;

TYPE
  DataArray = ARRAY [0 .. 20] OF REAL;

VAR
  theData : DataArray;

BEGIN
  theData [0] := 13.0; theData [1] := 14.0; theData [2] := 13.0;
  theData [3] := 15.0; theData [4] := 16.0; theData [5] := 13.0;
  theData [6] := 15.0; theData [7] := 16.0; theData [8] := 14.0;
  theData [9] := 15.0; theData [10] := 14.0; theData [11] := 14.0;
  theData [12] := 13.0; theData [13] := 12.0;
  theData [14] := 13.0;
  OpenOutput;
  EnterData (theData, 15);
  WriteString ("       First Run:");
  WriteLn; 
  WriteStats;
  WriteLn;
  theData [0] := 1.0; theData [1] := 1.0; theData [2] := 1.0;
  theData [3] := 1.0; theData [4] := 1.0; theData [5] := 1.0;
  theData [6] := 1.0; theData [7] := 1.0; theData [8] := 1.0;
  theData [9] := 1.0; theData [10] := 1.0; theData [11] := 1.0;
  theData [12] := 1.0; theData [13] := 1.0; theData [14] := 196.0;
  EnterData (theData, 15);
  WriteString ("       Second Run:");
  WriteLn; 
  WriteStats;
  CloseOutput;

END TestStats.

As can be seen, the data chosen for the two runs are the very collections with which this discussion began. The results are given below:

       First Run:
largest is  16.00
smallest is  12.00
mean is  14.00
population variance is  1.33
sample variance is  1.43
population standard deviation is  1.15
sample standard deviation is  1.20

       Second Run:
largest is  196.00
smallest is  1.00
mean is  14.00
population variance is  2366.00
sample variance is  2535.00
population standard deviation is  48.64
sample standard deviation is  50.35

Contents