Dispersion analysis

in mathematical statistics

A statistical method for detecting the effect of individual factors on the results of an experiment, and for the subsequent planning of similar experiments. Dispersion analysis was originally proposed by R.A. Fisher [1] for the processing of the results of agricultural trials, aimed at establishing the conditions under which a given agricultural crop yields a maximal harvest. Modern applications of dispersion analysis embrace a wide scope of problems in economics, sociology, biology, and technology; they are usually treated in terms of the statistical theory of detection of systematic differences between the results of direct measurements carried out under specific varying conditions.

Suppose that the values of unknown constants $ a _ {1} \dots a _ {I} $ can be measured by certain methods or using certain instruments $ M _ {1} \dots M _ {J} $, and that the systematic error $ b _ {ij } $ in each case may depend, in principle, both on the method $ M _ {j} $ chosen and on the unknown value $ a _ {i} $ to be measured. Then the results of such experiments are sums of the form

$$ x _ {ijk} = a _ {i} + b _ {ij} + y _ {ijk} , $$

$$ i = 1 \dots I ; \ j = 1 \dots J ; \ k = 1 \dots K , $$

where $ K $ is the number of independent measurements of the unknown magnitude $ a _ {i} $ by the method $ M _ {j} $, and $ y _ {ijk} $ is the random error of the $ k $-th measurement of $ a _ {i} $ by the method $ M _ {j} $ (it is assumed that all $ y _ {ijk} $ are independent identically-distributed random variables with mathematical expectation zero: $ {\mathsf E} y _ {ijk} = 0 $). Such a linear model is known as a two-factor scheme of dispersion analysis; the first factor is the true value of the magnitude being measured, the second is the method of measurement. In this case the same number $ K $ of independent measurements is made for every possible combination of values of the first and second factors (this assumption is immaterial for the purposes of dispersion analysis, and has only been introduced for the sake of clarity).

An example of such a situation is a competition between $ I $ sportsmen whose performance is evaluated by $ J $ referees, each participant appearing $ K $ times (being allowed $ K $ "attempts"). Here, $ a _ {i} $ is the true value of the performance index of sportsman number $ i $; $ b _ {ij} $ is the systematic error introduced in the evaluation of the performance of the $ i $-th sportsman by the $ j $-th referee; $ x _ {ijk} $ is the evaluation of the performance of the $ i $-th sportsman at the $ k $-th attempt, given by the $ j $-th referee; while $ y _ {ijk} $ is the respective random error. Such a setup is typical of the so-called subjective examination of the quality of a number of objects, carried out by a group of independent experts. Another example is a statistical study of the productivity of an agricultural crop as it depends on $ I $ kinds of soil and $ J $ methods of soil tillage, $ K $ independent experiments being performed for each type of soil $ i $ and each tillage method $ j $. In this example, $ b _ {ij} $ is the true value of the productivity of the crop for the $ i $-th type of soil tilled by the $ j $-th method, $ x _ {ijk} $ is the respective observed productivity of the crop in the $ k $-th trial, and $ y _ {ijk} $ is its random error caused by random factors; as regards the value of $ a _ {i} $, this may reasonably be equated to zero in agricultural experiments (see also [5]).

Let $ c _ {ij} = a _ {i} + b _ {ij} $, and let $ c _ {i*} $, $ c _ {*j} $ and $ c _ {**} $ be the results of averaging $ c _ {ij} $ over the corresponding indices, i.e.

$$ c _ {i*} = \frac{1}{J} \sum _ { j } c _ {ij} ,\ \ c _ {*j} = \frac{1}{I} \sum _ { i } c _ {ij} , $$

$$ c _ {**} = \frac{1}{IJ} \sum _ { ij } c _ {ij} = \frac{1}{I} \sum _ { i } c _ {i*} = \frac{1}{J} \sum _ { j } c _ {*j} . $$

Also, let $ \alpha = c _ {**} $, $ \beta _ {i} = c _ {i*} - c _ {**} $, $ \gamma _ {j} = c _ {*j} - c _ {**} $, and $ \delta _ {ij} = c _ {ij} - c _ {i*} - c _ {*j} + c _ {**} $. The idea of dispersion analysis is based on the obvious identity

$$ \tag{1 } c _ {ij} = \alpha + \beta _ {i} + \gamma _ {j} + \delta _ {ij} ,\ \ i = 1 \dots I ,\ j= 1 \dots J . $$

If the symbol $ ( c _ {ij} ) $ denotes a vector of dimension $ IJ $, obtained from a matrix $ \| c _ {ij} \| $ of order $ I \times J $ by some pre-set mode of ordering of its entries, then (1) may be written down as the equation

$$ \tag{2 } ( c _ {ij} ) = ( \alpha _ {ij} ) + ( \beta _ {ij} ) + ( \gamma _ {ij} ) + ( \delta _ {ij} ) , $$

where all the vectors are of dimension $ IJ $, and $ \alpha _ {ij} = \alpha $, $ \beta _ {ij} = \beta _ {i} $, $ \gamma _ {ij} = \gamma _ {j} $. Since the four vectors on the right-hand side of (2) are orthogonal, $ \alpha _ {ij} = \alpha $ is the best approximation of the function $ c _ {ij} $ in the arguments $ i $ and $ j $ by a constant magnitude (in the sense of the minimum sum of the square deviations $ \sum _ {ij} ( c _ {ij} - \alpha ) ^ {2} $). In the same sense $ \alpha _ {ij} + \beta _ {ij} = \alpha + \beta _ {i} $ is the best approximation of $ c _ {ij} $ by a function which depends only on $ i $; $ \alpha _ {ij} + \gamma _ {ij} = \alpha + \gamma _ {j} $ is the best approximation of $ c _ {ij} $ by a function depending only on $ j $, and $ \alpha _ {ij} + \beta _ {ij} + \gamma _ {ij} = \alpha + \beta _ {i} + \gamma _ {j} $ is the best approximation of $ c _ {ij} $ by a sum of functions, one of which (e.g. $ \alpha + \beta _ {i} $) depends on $ i $ only, while the other depends on $ j $ only. This fact, which had been established by Fisher [1] in 1918, subsequently served as the foundation of the theory of quadratic approximation of functions.
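The decomposition (1) and the orthogonality of the vectors in (2) can be checked numerically. The following is a minimal sketch in plain Python; the function name `decompose` and the $ 2 \times 3 $ table are hypothetical and serve only as an illustration.

```python
def decompose(c):
    """Split an I x J table c into alpha, beta_i, gamma_j, delta_ij as in (1)."""
    I, J = len(c), len(c[0])
    row = [sum(c[i]) / J for i in range(I)]                       # c_{i*}
    col = [sum(c[i][j] for i in range(I)) / I for j in range(J)]  # c_{*j}
    alpha = sum(row) / I                                          # c_{**}
    beta = [r - alpha for r in row]
    gamma = [q - alpha for q in col]
    delta = [[c[i][j] - row[i] - col[j] + alpha for j in range(J)]
             for i in range(I)]
    return alpha, beta, gamma, delta


# Hypothetical table of values c_ij (assumption, not taken from the article).
c = [[1.0, 2.0, 6.0],
     [5.0, 4.0, 6.0]]
I, J = len(c), len(c[0])
alpha, beta, gamma, delta = decompose(c)

# Identity (1): the four terms rebuild every entry of the table.
rebuilt = [[alpha + beta[i] + gamma[j] + delta[i][j] for j in range(J)]
           for i in range(I)]

# Orthogonality in (2): inner products of the IJ-dimensional vectors vanish.
dot_alpha_beta = sum(alpha * beta[i] for i in range(I) for j in range(J))
dot_beta_gamma = sum(beta[i] * gamma[j] for i in range(I) for j in range(J))
dot_beta_delta = sum(beta[i] * delta[i][j] for i in range(I) for j in range(J))
dot_gamma_delta = sum(gamma[j] * delta[i][j] for i in range(I) for j in range(J))
```

For this table the effects $ \beta _ {i} $ and $ \gamma _ {j} $ each sum to zero, as do the rows and columns of $ \delta _ {ij} $, which is what makes the inner products vanish.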

In the above example related to sports competition, the function $ \delta _ {ij} $ expresses the "interaction" of the $ i $-th sportsman with the $ j $-th referee (a positive value of $ \delta _ {ij} $ signifies an "overestimate", i.e. systematically high estimates by the $ j $-th referee of the performances of the $ i $-th sportsman, while a negative value of $ \delta _ {ij} $ signifies an "underestimate", i.e. estimates which are systematically too low). A necessary condition to be met by a group of experts is for all $ \delta _ {ij} $ to be equal to zero. In the case of agricultural experiments, such an equality is regarded as a hypothesis to be verified by experimental results, since the main objective is to find values of $ i $ and $ j $ such that the function (1) attains its maximum value. If this hypothesis is correct, then

$$ \max c _ {ij} = \alpha + \max \beta _ {i} + \max \gamma _ {j} , $$

which means that the detection of the best "soil" and "tillage" may be effected separately, with the result that the experimental work required is considerably reduced (for example, one may test out all $ I $ types of "soil" for one specific mode of tillage, thus finding the best type of soil, after which one tests out all $ J $ modes of "tillage" on that type of soil and finds out the best way; the total number of trials, including repetitions, will be $ ( I + J ) K $). If, on the other hand, the hypothesis $ \{ \textrm{ all } \delta _ {ij} = 0 \} $ is false, $ \max c _ {ij} $ can only be found by performing the "complete plan" described above, involving $ IJK $ experiments for $ K $ repetitions.
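The saving can be illustrated by a small sketch in plain Python. The effect values below are hypothetical, and all $ \delta _ {ij} $ are set to zero by construction; the two-stage search then inspects $ I + J $ cells of the table instead of $ IJ $ and reaches the same maximum.

```python
# Hypothetical additive model: c_ij = alpha + beta_i + gamma_j, all delta_ij = 0.
alpha = 10.0
beta = [-1.0, 0.0, 1.0]   # "soil" effects (assumed values)
gamma = [0.5, -0.5]       # "tillage" effects (assumed values)
I, J = len(beta), len(gamma)
c = [[alpha + beta[i] + gamma[j] for j in range(J)] for i in range(I)]

# Stage 1: compare all soils under one arbitrarily fixed tillage method j0.
j0 = 1
best_i = max(range(I), key=lambda i: c[i][j0])

# Stage 2: compare all tillage methods on the soil found in stage 1.
best_j = max(range(J), key=lambda j: c[best_i][j])

two_stage_max = c[best_i][best_j]
full_max = max(max(row) for row in c)   # exhaustive search over the complete plan

cells_two_stage = I + J   # each cell is repeated K times, giving (I + J) K trials
cells_full = I * J        # complete plan: I J K trials
```

If some $ \delta _ {ij} $ were non-zero, the best soil found in stage 1 could depend on the choice of `j0`, and the two-stage result would no longer be guaranteed to coincide with `full_max`.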

In the case of sports competitions the function $ \gamma _ {ij} = \gamma _ {j} $ may be treated as the systematic error committed by the $ j $-th referee in relation to all the sportsmen. Thus, $ \gamma _ {j} $ is a measure of the "rigour" or "mildness" of the $ j $-th referee. Ideally, all $ \gamma _ {j} $ are zero, but in practice one has to deal with non-zero values of $ \gamma _ {j} $ and take this into account in summing up the results of evaluations (e.g. one may base the comparison of the performances of the individual sportsmen not on the true values $ \alpha + \beta _ {1} + \gamma _ {j} \dots \alpha + \beta _ {I} + \gamma _ {j} $ themselves, but rather on the ordering of these numbers by magnitude, since this ordering is the same for all $ j = 1 \dots J $). Finally, the sum of the two remaining functions $ \alpha _ {ij} + \beta _ {ij} = \alpha + \beta _ {i} $ depends only on $ i $, and may therefore be used as a measure of the performance of the $ i $-th competitor. However, it must be borne in mind that here $ \alpha + \beta _ {i} = a _ {i} + b _ {i*} \neq a _ {i} $, and for this reason ordering the competitors according to the values of $ \alpha + \beta _ {i} $ (or according to $ \alpha + \beta _ {i} + \gamma _ {j} $ for any given $ j $) may not be identical with the ordering according to the values of $ a _ {i} $. In the practical processing of expert evaluations this fact is neglected, since the above-mentioned "complete plan" does not provide for separate evaluations of $ a _ {i} $ and $ b _ {i*} $. Thus, the number $ \alpha + \beta _ {i} = a _ {i} + b _ {i*} $ is a characteristic not only of the performance of the $ i $-th sportsman, but also, to a certain extent, of the attitude of the experts towards his performance. This is why the results of subjective expert evaluations made at different times (in particular, during different Olympic games) can hardly be regarded as comparable. In the case of agricultural trials, on the other hand, no such difficulties arise, since all $ a _ {i} = 0 $, i.e. $ \alpha + \beta _ {i} = b _ {i*} $.

The true values of the functions $ \alpha $, $ \beta _ {i} $, $ \gamma _ {j} $, and $ \delta _ {ij} $ are not known and are expressed in terms of the unknown functions $ c _ {ij} $. Accordingly, the first stage in dispersion analysis is to find statistical estimators for $ c _ {ij} $ from the results $ x _ {ijk} $ of observations. An unbiased linear estimator for $ c _ {ij} $ with minimal dispersion is expressed by the formula

$$ {\widehat{c} } _ {ij} = x _ {ij*} = \frac{1}{K} \sum _ { k } x _ {ijk} . $$

Since $ \alpha $, $ \beta _ {i} $, $ \gamma _ {j} $, and $ \delta _ {ij} $ are linear functions of the entries of the matrix $ \| c _ {ij} \| $, the unbiased linear estimators for these functions with minimal dispersion are obtained by replacing the arguments $ c _ {ij} $ by the respective estimators $ {\widehat{c} } _ {ij} $, viz.

$$ \widehat \alpha = x _ {***} ,\ {\widehat \beta } _ {i} = x _ {i**} - x _ {***} , \ {\widehat \gamma } _ {j} = x _ {*j*} - x _ {***} , $$

$$ {\widehat \delta } _ {ij} = x _ {ij*} - x _ {i**} - x _ {*j*} + x _ {***} , $$

and the random vectors $ ( {\widehat \alpha } _ {ij} ) $, $ ( {\widehat \beta } _ {ij} ) $, $ ( {\widehat \gamma } _ {ij} ) $, and $ ( {\widehat \delta } _ {ij} ) $, defined in the same way as $ ( \alpha _ {ij} ) $, $ ( \beta _ {ij} ) $, $ ( \gamma _ {ij} ) $, and $ ( \delta _ {ij} ) $ introduced above, are orthogonal, i.e. are uncorrelated random vectors (in other words, any two components belonging to different vectors have correlation coefficient zero). In addition, any difference of the form

$$ x _ {ijk} - x _ {ij*} = x _ {ijk} - {\widehat{c} } _ {ij} $$

is uncorrelated with any component of these four vectors. Consider the five sets of random variables $ \{ x _ {ijk} \} $, $ \{ x _ {ijk} - x _ {ij*} \} $, $ \{ {\widehat \beta } _ {i} \} $, $ \{ {\widehat \gamma } _ {j} \} $, and $ \{ {\widehat \delta } _ {ij} \} $. Since

$$ x _ {ijk} - x _ {ij*} = y _ {ijk} - y _ {ij*} ,\ \ {\widehat \beta } _ {i} = \beta _ {i} + ( y _ {i**} - y _ {***} ) , $$

$$ {\widehat \gamma } _ {j} = \gamma _ {j} + ( y _ {*j*} - y _ {***} ) , $$

$$ {\widehat \delta } _ {ij} = \delta _ {ij} + ( y _ {ij*} - y _ {i**} - y _ {*j*} + y _ {***} ) , $$

the dispersions of the empirical distributions corresponding to these sets are expressed by the formulas

$$ S ^ {2} = \frac{1}{IJK} \sum _ { ijk } ( x _ {ijk} - x _ {***} ) ^ {2} , $$

$$ S _ {0} ^ {2} = \frac{1}{IJK} \sum _ { ijk } ( x _ {ijk} - x _ {ij*} ) ^ {2} = \frac{1}{IJK} \sum _ { ijk } ( y _ {ijk} - y _ {ij*} ) ^ {2} , $$

$$ S _ {1} ^ {2} = \frac{1}{I} \sum _ { i } {\widehat \beta } _ {i} ^ {2} = \frac{1}{I} \sum _ { i } [ \beta _ {i} + ( y _ {i**} - y _ {***} ) ] ^ {2} , $$

$$ S _ {2} ^ {2} = \frac{1}{J} \sum _ { j } {\widehat \gamma } _ {j} ^ {2} = \frac{1}{J} \sum _ { j } [ \gamma _ {j} + ( y _ {*j*} - y _ {***} ) ] ^ {2} , $$

$$ S _ {3} ^ {2} = \frac{1}{IJ} \sum _ { ij } {\widehat \delta } _ {ij} ^ {2} = \frac{1}{IJ} \sum _ { ij } [ \delta _ {ij} + ( y _ {ij*} - y _ {i**} - y _ {*j*} + y _ {***} ) ] ^ {2} . $$

These empirical dispersions are sums of squares of random variables, any two of which are uncorrelated provided they belong to different sums; also, the identity

$$ S ^ {2} = S _ {0} ^ {2} + S _ {1} ^ {2} + S _ {2} ^ {2} + S _ {3} ^ {2} , $$

explaining the origin of the term "dispersion analysis" , is valid for all $ y _ {ijk} $.
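The estimators and the decomposition of the total empirical dispersion can be reproduced on a small data set. This is only a sketch in plain Python: the observations `x[i][j][k]` (with $ I = J = K = 2 $) and the helper `mean` are hypothetical, and the variable names ending in `_sq` stand for $ S ^ {2} $, $ S _ {0} ^ {2} , \dots , S _ {3} ^ {2} $.

```python
def mean(values):
    """Arithmetic mean of any iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical observations x[i][j][k]; I = J = K = 2.
x = [[[1.0, 3.0], [4.0, 6.0]],
     [[5.0, 7.0], [4.0, 6.0]]]
I, J, K = len(x), len(x[0]), len(x[0][0])
cells = [(i, j) for i in range(I) for j in range(J)]

# Averages over the starred indices.
x_ij = [[mean(x[i][j]) for j in range(J)] for i in range(I)]   # x_{ij*}
x_i = [mean(x_ij[i]) for i in range(I)]                        # x_{i**}
x_j = [mean(x_ij[i][j] for i in range(I)) for j in range(J)]   # x_{*j*}
x_all = mean(x_i)                                              # x_{***}

# Estimators of alpha, beta_i, gamma_j, delta_ij.
alpha_hat = x_all
beta_hat = [x_i[i] - x_all for i in range(I)]
gamma_hat = [x_j[j] - x_all for j in range(J)]
delta_hat = [[x_ij[i][j] - x_i[i] - x_j[j] + x_all for j in range(J)]
             for i in range(I)]

# Empirical dispersions of the five sets of random variables.
S_sq = mean((x[i][j][k] - x_all) ** 2 for i, j in cells for k in range(K))
S0_sq = mean((x[i][j][k] - x_ij[i][j]) ** 2 for i, j in cells for k in range(K))
S1_sq = mean(b ** 2 for b in beta_hat)
S2_sq = mean(g ** 2 for g in gamma_hat)
S3_sq = mean(delta_hat[i][j] ** 2 for i, j in cells)

# The identity S^2 = S_0^2 + S_1^2 + S_2^2 + S_3^2 holds for any data.
gap = S_sq - (S0_sq + S1_sq + S2_sq + S3_sq)
```

Because the identity is purely algebraic, `gap` is zero (up to rounding) whatever values are placed in `x`, provided every cell $ ( i , j ) $ holds the same number $ K $ of observations.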

Let $ I, J, K \geq 2 $ and let

$$ s _ {0} ^ {2} = \frac{K}{K- 1} S _ {0} ^ {2} ,\ s _ {1} ^ {2} = \frac{IJK}{I- 1} S _ {1} ^ {2} ,\ s _ {2} ^ {2} = \frac{IJK}{J- 1} S _ {2} ^ {2} , $$

$$ s _ {3} ^ {2} = \frac{IJK}{( I- 1 ) ( J- 1 ) } S _ {3} ^ {2} ; $$

then

$$ {\mathsf E} s _ {0} ^ {2} = \sigma ^ {2} ,\ \ {\mathsf E} s _ {1} ^ {2} = \sigma ^ {2} + \frac{JK}{I- 1} \sum _ { i } \beta _ {i} ^ {2} ,\ \ {\mathsf E} s _ {2} ^ {2} = \sigma ^ {2} + \frac{IK}{J- 1} \sum _ { j } \gamma _ {j} ^ {2} , $$

$$ {\mathsf E} s _ {3} ^ {2} = \sigma ^ {2} + \frac{K}{ ( I- 1 )( J- 1) } \sum _ { ij } \delta _ {ij} ^ {2} , $$

where $ \sigma ^ {2} $ is the dispersion of the random errors $ y _ {ijk} $.

These formulas form the basis of the second stage in dispersion analysis, namely the clarification of the effect of the first and of the second factor on the experimental results (in agricultural trials the first factor is the "soil" type, the second is the mode of "tillage"). For instance, in order to verify the hypothesis that the two factors are mutually "independent", i.e. that $ \sum _ {ij} \delta _ {ij} ^ {2} = 0 $, it is reasonable to compute the dispersion proportion $ s _ {3} ^ {2} / s _ {0} ^ {2} = F _ {3} $. If this ratio is significantly different from one, the hypothesis is rejected. In the same way, the hypothesis $ \sum _ {j} \gamma _ {j} ^ {2} = 0 $ is usefully verified by the proportion $ s _ {2} ^ {2} / s _ {0} ^ {2} = F _ {2} $, which should also be compared with one; if it is also known that $ \sum _ {ij} \delta _ {ij} ^ {2} = 0 $, the expression

$$ \frac{( IJK - I - J + 1 ) s _ {2} ^ {2} }{IJ ( K- 1 ) s _ {0} ^ {2} + ( I- 1 ) ( J- 1 ) s _ {3} ^ {2} } = F _ {2} ^ { * } , $$

rather than $ F _ {2} $, should be compared with one. A statistic for the verification of the hypothesis $ \sum _ {i} \beta _ {i} ^ {2} = 0 $ can be constructed in a similar manner.
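As a hedged illustration, the ratios can be computed directly from the sums of squares. The sketch below is plain Python with hypothetical data and a hypothetical function name `two_factor_ratios`. It treats the denominator of $ F _ {2} ^ { * } $ as a pooled estimator of $ \sigma ^ {2} $, i.e. $ s _ {0} ^ {2} $ and $ s _ {3} ^ {2} $ weighted by $ IJ ( K- 1 ) $ and $ ( I- 1 ) ( J- 1 ) $, whose sum is $ IJK - I - J + 1 $. The ratio for the first factor, called `F1` here, is built by analogy with $ F _ {2} $.

```python
def mean(values):
    """Arithmetic mean of any iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def two_factor_ratios(x):
    """Mean squares s0..s3 and ratios F1, F2, F3, F2* for observations x[i][j][k]."""
    I, J, K = len(x), len(x[0]), len(x[0][0])
    x_ij = [[mean(x[i][j]) for j in range(J)] for i in range(I)]
    x_i = [mean(x_ij[i]) for i in range(I)]
    x_j = [mean(x_ij[i][j] for i in range(I)) for j in range(J)]
    x_all = mean(x_i)

    # Sums of squares; ss0 / (IJK) equals S_0^2, ss1 / (IJK) equals S_1^2, etc.
    ss0 = sum((x[i][j][k] - x_ij[i][j]) ** 2
              for i in range(I) for j in range(J) for k in range(K))
    ss1 = J * K * sum((x_i[i] - x_all) ** 2 for i in range(I))
    ss2 = I * K * sum((x_j[j] - x_all) ** 2 for j in range(J))
    ss3 = K * sum((x_ij[i][j] - x_i[i] - x_j[j] + x_all) ** 2
                  for i in range(I) for j in range(J))

    # Divisors that make each s_m^2 an unbiased estimator of sigma^2
    # when the corresponding effects vanish.
    f0, f1, f2, f3 = I * J * (K - 1), I - 1, J - 1, (I - 1) * (J - 1)
    s0, s1, s2, s3 = ss0 / f0, ss1 / f1, ss2 / f2, ss3 / f3

    pooled = (f0 * s0 + f3 * s3) / (f0 + f3)   # used only if all delta_ij = 0
    return {"s0": s0, "s1": s1, "s2": s2, "s3": s3,
            "F1": s1 / s0, "F2": s2 / s0, "F3": s3 / s0,
            "F2_star": s2 / pooled}


# Hypothetical observations with I = J = K = 2.
x = [[[1.0, 3.0], [4.0, 6.0]],
     [[5.0, 7.0], [4.0, 6.0]]]
result = two_factor_ratios(x)
```

Pooling is only justified when the absence of interaction is known in advance; otherwise $ s _ {3} ^ {2} $ overestimates $ \sigma ^ {2} $ and `F2_star` is biased downwards.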

The exact meaning of the concept of a significant difference of the above expressions from one may be defined only in terms of the distribution law of the random errors $ y _ {ijk} $. The situation most extensively studied in dispersion analysis is that of all $ y _ {ijk} $ being normally distributed. In such a case $ ( {\widehat \alpha } _ {ij} ) $, $ ( {\widehat \beta } _ {ij} ) $, $ ( {\widehat \gamma } _ {ij} ) $, $ ( {\widehat \delta } _ {ij} ) $ are independent random vectors, while $ s _ {0} ^ {2} $, $ s _ {1} ^ {2} $, $ s _ {2} ^ {2} $, $ s _ {3} ^ {2} $ are independent random variables, and the statistics

$$ IJ ( K- 1 ) \frac{s _ {0} ^ {2} }{\sigma ^ {2} } ,\ ( I- 1 ) \frac{s _ {1} ^ {2} }{\sigma ^ {2} } ,\ ( J- 1 ) \frac{s _ {2} ^ {2} }{\sigma ^ {2} } , $$

$$ ( I- 1)( J- 1 ) \frac{s _ {3} ^ {2} }{\sigma ^ {2} } $$

will have non-central chi-squared distributions with $ f _ {m} $ degrees of freedom and with non-centrality parameters $ \lambda _ {m} $, $ m= 0 , 1 , 2 , 3 $, where

$$ f _ {0} = IJ ( K- 1 ) ,\ f _ {1} = I- 1 ,\ f _ {2} = J- 1 ,\ f _ {3} = ( I- 1 )( J- 1 ) ; $$

$$ \lambda _ {0} = 0 ,\ \lambda _ {1} = JK \sum _ { i } \frac{\beta _ {i} ^ {2} }{\sigma ^ {2} } ,\ \lambda _ {2} = IK \sum _ { j } \frac{\gamma _ {j} ^ {2} }{\sigma ^ {2} } , $$

$$ \lambda _ {3} = K \sum _ { ij } \frac{\delta _ {ij} ^ {2} }{\sigma ^ {2} } . $$

If the non-centrality parameter is zero, the non-central chi-squared distribution becomes identical with the ordinary chi-squared distribution. Thus, if the hypothesis $ \lambda _ {3} = 0 $ is true, the proportion $ s _ {3} ^ {2} / s _ {0} ^ {2} = F _ {3} $ has an $ F $-distribution with parameters $ f _ {3} $ and $ f _ {0} $ (the distribution of the dispersion proportion). Let $ x $ be the number for which the probability of the event $ \{ F _ {3} > x \} $ equals a pre-set value $ \epsilon $ known as the significance level (tables of the function $ x = x ( \epsilon ; f _ {3} , f _ {0} ) $ can be found in most textbooks on mathematical statistics). The verification criterion of the hypothesis $ \lambda _ {3} = 0 $ is that if the observed value of $ F _ {3} $ is greater than $ x $, the hypothesis is rejected; otherwise, the hypothesis is said not to be in contradiction with the experimental results. Criteria based on the statistics $ F _ {2} $ and $ F _ {2} ^ { * } $ are constructed in a similar manner.
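Where no table is at hand, the critical value $ x ( \epsilon ; f _ {3} , f _ {0} ) $ can be approximated by simulation. The sketch below is a Monte Carlo approximation, not the tabulated function itself. It uses only Python's standard `random` module and the fact that a chi-squared variable with $ f $ degrees of freedom is a gamma variable with shape $ f / 2 $ and scale 2. The function name `f_critical_mc`, the number of simulations, the seed and the observed ratio are all assumptions made for illustration.

```python
import random


def f_critical_mc(eps, f_num, f_den, n_sim=20000, seed=0):
    """Monte Carlo estimate of x such that P{F > x} = eps for F(f_num, f_den)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_sim):
        # Independent chi-squared variables, each divided by its degrees of freedom.
        num = rng.gammavariate(f_num / 2.0, 2.0) / f_num
        den = rng.gammavariate(f_den / 2.0, 2.0) / f_den
        draws.append(num / den)
    draws.sort()
    # Empirical upper eps-quantile of the simulated ratios.
    return draws[min(int((1.0 - eps) * n_sim), n_sim - 1)]


# Degrees of freedom for I = J = K = 2: f3 = (I-1)(J-1) = 1, f0 = IJ(K-1) = 4.
f3, f0 = 1, 4
eps = 0.05
x_crit = f_critical_mc(eps, f3, f0)

F3_observed = 4.0              # hypothetical observed value of s3^2 / s0^2
reject = F3_observed > x_crit  # reject lambda_3 = 0 only if F3 exceeds x
```

For these parameters the tabulated value is approximately 7.71, so an observed ratio of 4 does not contradict the hypothesis $ \lambda _ {3} = 0 $ at the 5% level. The simulated estimate fluctuates around the tabulated value, and a table or a library quantile function is preferable when precision matters.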

The following stages in dispersion analysis materially depend not only on the nature of the problem to be solved, but also on the results of the statistical verification of the hypothesis during the second stage. Thus, as has been seen, the truth of the hypothesis $ \lambda _ {3} = 0 $ in agricultural trials permits a more economical design of subsequent experiments (if the hypotheses $ \lambda _ {3} = 0 $ and $ \lambda _ {2} = 0 $ are both true, the productivity depends only on the type of "soil" , and subsequent experiments may be performed in the framework of one-factor dispersion analysis); if the hypothesis $ \lambda _ {3} = 0 $ is false, it is reasonable to look for a third, hitherto unrecognized, factor which is relevant to the problem. If the types of "soil" and "tillage" methods were varied not only locally but in different geographic zones, climatic or geographic conditions may act as such a third factor, and the processing of the observations must involve a three-factor dispersion analysis.

In the case of expert evaluations, if the hypothesis $ \lambda _ {3} = 0 $ has been statistically confirmed, it is permissible to order the objects being compared (e.g. sportsmen) according to the values of $ \widehat \alpha + {\widehat \beta } _ {i} $, $ i= 1 \dots I $. If the hypothesis $ \lambda _ {3} = 0 $ is false (in the case of sports competition this indicates "interaction" between some competitors and referees), the obvious course is to recalculate all results after discarding the values $ x _ {ijk} $ with pairs of indices $ ( i, j) $ for which the absolute values of the statistical estimators $ {\widehat \delta } _ {ij} $ exceed some pre-set permissible level. This means that certain entries of the matrix $ \| x _ {ij*} \| $ are deleted, and the plan of dispersion analysis becomes incomplete.

Models of modern dispersion analysis comprise a wide circle of real experimental schemes (e.g. schemes of incomplete plans, with randomly or non-randomly selected elements $ x _ {ij*} $). The respective statistical conclusions are often still in the stage of development. At the time of writing (1987) particular problems in which the results of the observations $ x _ {ijk} = c _ {ij} + y _ {ijk} $ are not identically-distributed random variables are still far from being solved; even more difficult problems are those in which the values $ x _ {ijk} $ are dependent. The problem of factor selection has not been solved, even in the linear case. This problem may be formulated as follows. Let $ c = c( u , v) $ be a continuous function and let $ u = u ( z, w) $ and $ v = v( z, w) $ be arbitrary linear functions in the variables $ z $ and $ w $. Given the values of $ z _ {1} \dots z _ {I} $ and $ w _ {1} \dots w _ {J} $, $ c _ {ij} $ may be determined for any given choice of the linear functions $ u $ and $ v $ by the formula

$$ c _ {ij} = c [ u ( z _ {i} , w _ {j} ), v ( z _ {i} , w _ {j} )] , $$

and one can construct the dispersion analysis of these variables from the results of the respective observations $ x _ {ijk} $. The problem is to find the linear functions $ u $ and $ v $ for which the value of the sum of the squares $ \sum _ {ij} \delta _ {ij} ^ {2} $, where

$$ \delta _ {ij} = c _ {ij} - c _ {i*} - c _ {*j} + c _ {**} , $$

is minimal (on the assumption that the function $ c( u , v) $ is not known). In terms of dispersion analysis, the problem is reduced to a statistical determination of the factors $ z = z( u , v) $ and $ w = w( u , v) $ corresponding to "least interaction" .

References

[1]	R.A. Fisher, "Statistical methods for research workers" , Oliver & Boyd (1925)
[2]	H. Scheffé, "The analysis of variance" , Wiley (1959)
[3]	A. Hald, "Statistical theory with engineering applications" , Wiley (1952)
[4]	G.W. Snedecor, W.G. Cochran, "Statistical methods: applied to experiments in agriculture and biology" , Iowa State College Collegiate Press (1957)
[5]	M.S. Nikulin, "Application of the model of two-factor analysis of variance without interaction" J. Soviet Math. , 25 : 3 (1984) pp. 1196–1207 Zap. Nauchn. Sem. Leningrad. Otdel. Mat. Inst. Steklov. Stud. Mat. Stat. , 108 : 5 (1981) pp. 134–153

Comments

The phrase "dispersion analysis" is out of use and has been replaced by analysis of variance.