Correlation (in statistics)

A dependence between random variables not necessarily expressed by a rigorous functional relationship. Unlike functional dependence, a correlation is, as a rule, considered when one of the random variables depends not only on the other (given) one, but also on several random factors. The dependence between two random events is manifested in the fact that the conditional probability of one of them, given the occurrence of the other, differs from the unconditional probability. Similarly, the influence of one random variable on another is characterized by the conditional distributions of one of them, given fixed values of the other.

Let $X$ and $Y$ be random variables with given joint distribution, let $m_{X}$ and $m_{Y}$ be the expectations of $X$ and $Y$, let $σ_{X}^{2}$ and $σ_{Y}^{2}$ be the variances of $X$ and $Y$, and let $ρ$ be the correlation coefficient of $X$ and $Y$. Assume that for every possible value $X = x$ the conditional mathematical expectation $y (x) = E [Y ∣ X = x]$ of $Y$ is defined; then the function $y (x)$ is known as the regression of $Y$ given $X$, and its graph is the regression curve of $Y$ given $X$. The dependence of $Y$ on $X$ is manifested in the variation of the mean values of $Y$ as $X$ varies, although for each fixed value $X = x$, $Y$ remains a random variable with a well-defined spread. In order to determine to what degree of accuracy the regression reproduces the variation of $Y$ as $X$ varies, one uses the conditional variance of $Y$ for a given $X = x$ or its mean value (a measure of the spread of $Y$ about the regression curve):

$σ_{Y ∣ X}^{2} = E [Y - E (Y ∣ X = x)]^{2} .$

If $X$ and $Y$ are independent, then all conditional mathematical expectations of $Y$ are independent of $x$ and coincide with the unconditional expectation: $y (x) = m_{Y}$; and then also $σ_{Y ∣ X}^{2} = σ_{Y}^{2}$. When $Y$ is a function of $X$ in the strict sense of the word, then for each $X = x$ the variable $Y$ takes only one definite value and $σ_{Y ∣ X}^{2} = 0$. Similarly one defines $x (y) = E [X ∣ Y = y]$ (the regression of $X$ given $Y$). A natural index of the concentration of the distribution near the regression curve $y (x)$ is the correlation ratio

$η_{Y ∣ X}^{2} = 1 - \frac{σ_{Y ∣ X}^{2}}{σ_{Y}^{2}} .$
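The definitions above can be followed through on a small discrete distribution. The sketch below is only an illustration: the function name `regression_and_correlation_ratio` and the four-point joint distribution are hypothetical, and $σ_{Y ∣ X}^{2}$ is taken in its averaged form, i.e. the mean of the conditional variance over $X$.

```python
def regression_and_correlation_ratio(pmf):
    """pmf maps (x, y) -> probability of that pair.

    Returns (regression, mean_cond_var, eta2), where regression[x] = E[Y | X = x],
    mean_cond_var = E[(Y - y(X))^2] and eta2 = 1 - mean_cond_var / Var(Y).
    """
    # Marginal distribution of X.
    p_x = {}
    for (x, _), p in pmf.items():
        p_x[x] = p_x.get(x, 0.0) + p

    # Unconditional mean and variance of Y.
    m_y = sum(y * p for (_, y), p in pmf.items())
    var_y = sum((y - m_y) ** 2 * p for (_, y), p in pmf.items())

    # Regression of Y given X: conditional mean for every value x.
    regression = {
        x: sum(y * p for (xx, y), p in pmf.items() if xx == x) / p_x[x]
        for x in p_x
    }

    # Mean spread of Y about the regression curve.
    mean_cond_var = sum((y - regression[x]) ** 2 * p for (x, y), p in pmf.items())

    eta2 = 1.0 - mean_cond_var / var_y
    return regression, mean_cond_var, eta2


# Hypothetical joint distribution: four equally likely points.
pmf = {(0, 0): 0.25, (0, 2): 0.25, (1, 1): 0.25, (1, 3): 0.25}
regression, mean_cond_var, eta2 = regression_and_correlation_ratio(pmf)
print(regression, mean_cond_var, eta2)
```

For this distribution the conditional means are 1 and 2, the variance of $Y$ is 1.25 and the mean conditional variance is 1, so the correlation ratio is about 0.2: most of the spread of $Y$ is not explained by the regression.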

One has $η_{Y ∣ X}^{2} = 0$ if and only if the regression has the form $y (x) = m_{Y}$ , and in that case the correlation coefficient $ρ$ vanishes and $Y$ is not correlated with $X$ . If the regression of $Y$ given $X$ is linear, i.e. the regression curve is the straight line

$y (x) = m_{Y} + ρ \frac{σ_{Y}}{σ_{X}} (x - m_{X}),$

then

$σ_{Y ∣ X}^{2} = σ_{Y}^{2} (1 - ρ^{2}) \quad \text{and} \quad η_{Y ∣ X}^{2} = ρ^{2};$

if, moreover, $| ρ | = 1$ , then $Y$ is related to $X$ through an exact linear dependence; but if $η_{Y ∣ X}^{2} = ρ^{2} < 1$ , there is no functional dependence between $Y$ and $X$ . There is an exact functional dependence of $Y$ on $X$ , other than a linear one, if and only if $ρ^{2} < η_{Y ∣ X}^{2} = 1$ . With rare exceptions, the practical use of the correlation coefficient as a measure of the lack of dependence is justifiable only when the joint distribution of $X$ and $Y$ is normal (or close to normal), since in that case $ρ = 0$ implies that $X$ and $Y$ are independent. Use of $ρ$ as a measure of dependence for arbitrary random variables $X$ and $Y$ frequently leads to erroneous conclusions, since $ρ$ may vanish even when a functional dependence exists. If the joint distribution of $X$ and $Y$ is normal, then both regression curves are straight lines and $ρ$ uniquely determines the concentration of the distribution near the regression curves: When $| ρ | = 1$ the regression curves merge into one, corresponding to linear dependence between $X$ and $Y$ ; when $ρ = 0$ one has independence.
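A standard illustration of $ρ$ vanishing in the presence of a functional dependence is $Y = X^{2}$ with $X$ symmetric about zero. The sketch below assumes, purely for illustration, that $X$ is uniform on $\{-1, 0, 1\}$, and computes the quantities exactly with rational arithmetic; all variable names are hypothetical.

```python
import math
from fractions import Fraction

# Assumed example: X uniform on {-1, 0, 1} and Y = X**2 exactly.
support = [-1, 0, 1]
p = Fraction(1, 3)
pairs = [(x, x ** 2) for x in support]

m_x = sum(p * x for x, _ in pairs)
m_y = sum(p * y for _, y in pairs)
var_x = sum(p * (x - m_x) ** 2 for x, _ in pairs)
var_y = sum(p * (y - m_y) ** 2 for _, y in pairs)
cov_xy = sum(p * (x - m_x) * (y - m_y) for x, y in pairs)

# Correlation coefficient: covariance divided by the product of standard deviations.
rho = float(cov_xy) / math.sqrt(float(var_x * var_y))

# Each x carries a single y, so the regression is y(x) = x**2 itself
# and the spread of Y about the regression curve is zero.
regression = {x: Fraction(y) for x, y in pairs}
mean_cond_var = sum(p * (y - regression[x]) ** 2 for x, y in pairs)
eta2 = 1 - mean_cond_var / var_y

print(rho, eta2)
```

Here the covariance is exactly zero, so $ρ = 0$, while the correlation ratio equals 1: this is the case $ρ^{2} < η_{Y ∣ X}^{2} = 1$ of an exact non-linear functional dependence.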

When studying the interdependence of several random variables $X_{1}, \dots, X_{n}$ with a given joint distribution, one uses multiple and partial correlation ratios and coefficients. The latter are evaluated using the ordinary correlation coefficients between $X_{i}$ and $X_{j}$, which together form the correlation matrix. A measure of the linear relationship between $X_{1}$ and the totality of the other variables $X_{2}, \dots, X_{n}$ is provided by the multiple-correlation coefficient. If the mutual relationship of $X_{1}$ and $X_{2}$ is assumed to be determined by the influence of the other variables $X_{3}, \dots, X_{n}$, then the partial correlation coefficient of $X_{1}$ and $X_{2}$ with respect to $X_{3}, \dots, X_{n}$ is an index of the linear relationship between $X_{1}$ and $X_{2}$ relative to $X_{3}, \dots, X_{n}$.
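For three variables both coefficients have well-known closed forms in terms of the pairwise correlation coefficients. The following sketch covers only this case $n = 3$; the function names and the numerical values are hypothetical, and the general case is usually handled through the inverse of the correlation matrix, which is not shown here.

```python
import math


def partial_corr(r12, r13, r23):
    """Partial correlation of X1 and X2 with respect to X3."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


def multiple_corr_squared(r12, r13, r23):
    """Squared multiple-correlation coefficient of X1 on (X2, X3)."""
    return (r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2)


# Assumed values: X1 and X2 are related only through X3,
# so r12 equals the product r13 * r23.
r12, r13, r23 = 0.64, 0.8, 0.8
partial = partial_corr(r12, r13, r23)
multiple_sq = multiple_corr_squared(r12, r13, r23)
print(partial, multiple_sq)
```

With these values the partial correlation is zero up to rounding, although the ordinary coefficient between $X_{1}$ and $X_{2}$ is 0.64, and the squared multiple-correlation coefficient is about 0.64, the same as the squared correlation of $X_{1}$ with $X_{3}$ alone.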

For measures of correlation based on rank statistics (cf. Rank statistic) see Kendall coefficient of rank correlation; Spearman coefficient of rank correlation.

Mathematical statisticians have developed methods for estimating coefficients that characterize the correlation between random variables or tests; there are also methods to test hypotheses concerning their values, using their sampling analogues. These methods are collectively known as correlation analysis. Correlation analysis of statistical data consists of the following basic practical steps: 1) the construction of a scatter plot and the compilation of a correlation table; 2) the computation of sampling correlation ratios or correlation coefficients; 3) the testing of statistical hypotheses concerning the significance of the dependence. Further investigation may consist in establishing the concrete form of the dependence between the variables (see Regression).

Among the aids to analysis of two-dimensional sample data are the scatter plot and the correlation table. The scatter plot is obtained by plotting the sample points on the coordinate plane. Examination of the configuration formed by the points of the scatter plot yields a preliminary idea of the type of dependence between the random variables (e.g. whether one of the variables increases or decreases on the average as the other increases). Prior to numerical processing, the results are usually grouped and presented in the form of a correlation table. In each entry of this table one writes the number $n_{i j}$ of pairs $(x, y)$ with components in the appropriate grouping intervals. Assuming that the grouping intervals (in each of the variables) are equal in length, one takes the centres $x_{i}$ (or $y_{j}$) of the intervals and the numbers $n_{i j}$ as the basis for calculation.
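The grouping step can be sketched as follows. This is a minimal illustration, assuming equal-width half-open intervals that start at chosen origins; the function names, the origins, the widths and the sample pairs are all hypothetical.

```python
from collections import Counter


def correlation_table(pairs, x0, hx, y0, hy):
    """Count the pairs falling in each cell of the grouping grid.

    Cell (i, j) covers [x0 + i*hx, x0 + (i+1)*hx) in x
    and [y0 + j*hy, y0 + (j+1)*hy) in y.
    """
    counts = Counter()
    for x, y in pairs:
        i = int((x - x0) // hx)
        j = int((y - y0) // hy)
        counts[(i, j)] += 1
    return counts


def centre(k, origin, width):
    """Centre of the k-th grouping interval."""
    return origin + (k + 0.5) * width


# Hypothetical sample of five pairs (x, y).
pairs = [(0.2, 1.1), (0.7, 1.9), (1.3, 2.5), (1.8, 3.4), (0.4, 1.4)]
table = correlation_table(pairs, x0=0.0, hx=1.0, y0=1.0, hy=1.0)
for (i, j), n_ij in sorted(table.items()):
    print(centre(i, 0.0, 1.0), centre(j, 1.0, 1.0), n_ij)
```

Only the non-empty cells are stored; the centres returned by `centre` play the role of $x_{i}$ and $y_{j}$ in the subsequent calculations.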

For more accurate information about the nature and strength of the relationship than that provided by the scatter plot, one turns to the correlation coefficient and correlation ratio. The sample correlation coefficient is defined by the formula

$\hat{ρ} = \frac{\sum_{i} \sum_{j} (x_{i} - \bar{x}) (y_{j} - \bar{y}) n_{i j}}{\sqrt{\sum_{i} n_{i \cdot} (x_{i} - \bar{x})^{2}} \sqrt{\sum_{j} n_{\cdot j} (y_{j} - \bar{y})^{2}}},$

where

$n_{i \cdot} = \sum_{j} n_{i j}, \quad n_{\cdot j} = \sum_{i} n_{i j},$

$n = \sum_{i} \sum_{j} n_{i j}$ is the total number of observations, and

$\bar{x} = \frac{\sum_{i} n_{i \cdot} x_{i}}{n}, \quad \bar{y} = \frac{\sum_{j} n_{\cdot j} y_{j}}{n} .$
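The formula for the sample correlation coefficient translates directly into a computation on the correlation table. In the sketch below the function name `sample_correlation` and the 3-by-3 table are hypothetical; the table is stored as a list of rows, so that `n[i][j]` corresponds to $n_{i j}$.

```python
import math


def sample_correlation(x, y, n):
    """Sample correlation coefficient from grouped data.

    x[i], y[j] are the interval centres and n[i][j] the cell counts.
    """
    total = sum(map(sum, n))
    row = [sum(r) for r in n]          # marginal counts n_{i.}
    col = [sum(c) for c in zip(*n)]    # marginal counts n_{.j}

    x_bar = sum(ni * xi for ni, xi in zip(row, x)) / total
    y_bar = sum(nj * yj for nj, yj in zip(col, y)) / total

    num = sum(
        n[i][j] * (x[i] - x_bar) * (y[j] - y_bar)
        for i in range(len(x))
        for j in range(len(y))
    )
    sxx = sum(ni * (xi - x_bar) ** 2 for ni, xi in zip(row, x))
    syy = sum(nj * (yj - y_bar) ** 2 for nj, yj in zip(col, y))
    return num / math.sqrt(sxx * syy)


# Hypothetical correlation table with 15 observations.
x = [1, 2, 3]
y = [10, 20, 30]
n = [
    [4, 1, 0],
    [1, 3, 1],
    [0, 1, 4],
]
rho_hat = sample_correlation(x, y, n)
print(rho_hat)
```

For this table the means are 2 and 20, the numerator is 80 and the two sums of squares are 10 and 1000, so the coefficient is about 0.8.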

In the case of a large number of independent observations governed by one and the same near-normal distribution, $\hat{ρ}$ is a good approximation to the true correlation coefficient $ρ$. In all other cases the correlation ratio is recommended as the characteristic of the strength of the relationship, since its interpretation does not depend on the type of dependence being studied. The sample value $\hat{η}_{Y ∣ X}^{2}$ is computed from the entries in the correlation table:

$\hat{η}_{Y ∣ X}^{2} = \frac{\frac{1}{n} \sum_{i} n_{i \cdot} (\bar{y}_{i} - \bar{y})^{2}}{\frac{1}{n} \sum_{j} n_{\cdot j} (y_{j} - \bar{y})^{2}},$

where the numerator represents the spread of the conditional mean values $\bar{y}_{i} = \frac{1}{n_{i \cdot}} \sum_{j} n_{i j} y_{j}$ about the unconditional mean $\bar{y}$ (the sample value $\hat{η}_{X ∣ Y}^{2}$ is defined analogously). The quantity $\hat{η}_{Y ∣ X}^{2} - \hat{ρ}^{2}$ is used as an indicator of the deviation of the regression from linearity.
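A sketch of the sample correlation ratio computed from a correlation table is given below. The function name and both tables are hypothetical; the common factor $1 / n$ cancels and is omitted, and rows with no observations are skipped because they have no conditional mean.

```python
def sample_correlation_ratio(y, n):
    """Sample correlation ratio of Y given X from grouped data.

    y[j] are the interval centres and n[i][j] the cell counts.
    """
    total = sum(map(sum, n))
    col = [sum(c) for c in zip(*n)]    # marginal counts n_{.j}
    y_bar = sum(nj * yj for nj, yj in zip(col, y)) / total

    between = 0.0
    for row in n:
        n_i = sum(row)                 # marginal count n_{i.}
        if n_i == 0:
            continue                   # empty row: no conditional mean
        y_bar_i = sum(nij * yj for nij, yj in zip(row, y)) / n_i
        between += n_i * (y_bar_i - y_bar) ** 2

    spread = sum(nj * (yj - y_bar) ** 2 for nj, yj in zip(col, y))
    return between / spread


# Hypothetical table whose conditional means 12, 20, 28 lie on a straight line.
eta2_linear = sample_correlation_ratio([10, 20, 30], [[4, 1, 0], [1, 3, 1], [0, 1, 4]])

# Hypothetical table in which Y is determined by X, but not linearly.
eta2_curved = sample_correlation_ratio([10, 20], [[0, 5], [5, 0], [0, 5]])

print(eta2_linear, eta2_curved)
```

The first table gives a ratio of about 0.64. With interval centres 1, 2, 3 for $X$, its sample correlation coefficient is about 0.8, so the indicator $\hat{η}_{Y ∣ X}^{2} - \hat{ρ}^{2}$ is essentially zero, in agreement with the linear conditional means. The second table gives a ratio of about 1.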

The testing of hypotheses concerning the significance of a relationship is based on the distributions of the sample correlation characteristics. In the case of a normal distribution, the value of the sample correlation coefficient $\hat{ρ}$ is significantly distinct from zero if

$\hat{ρ}^{2} > \left[ 1 + \frac{n - 2}{t_{α}^{2}} \right]^{- 1},$

where $t_{α}$ is the critical value of the Student $t$-distribution with $(n - 2)$ degrees of freedom corresponding to the chosen significance level $α$. If $ρ \neq 0$ one usually uses the Fisher $z$-transform, with $\hat{ρ}$ replaced by $z$ according to the formula

$z = \frac{1}{2} \ln (\frac{1 + \hat{ρ}}{1 - \hat{ρ}}) .$

Even for relatively small values of $n$, the distribution of $z$ is well approximated by the normal distribution with mathematical expectation

$\frac{1}{2} \ln \frac{1 + ρ}{1 - ρ} + \frac{ρ}{2 (n - 1)}$

and variance $1 / (n - 3)$ . On this basis one can now define approximate confidence intervals for the true correlation coefficient $ρ$ .
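Both the significance criterion and the interval construction can be sketched in a few lines. The function names are hypothetical. The critical value $t_{α}$ is assumed to be supplied by the user, for instance from tables, and the interval below neglects the small bias term $ρ / (2 (n - 1))$ in the expectation of $z$, which is a simplification.

```python
import math
from statistics import NormalDist


def is_significant(r, n, t_alpha):
    """Check r^2 > [1 + (n - 2) / t_alpha^2]^(-1) for a given critical value."""
    threshold = 1.0 / (1.0 + (n - 2) / t_alpha ** 2)
    return r * r > threshold


def fisher_interval(r, n, level=0.95):
    """Approximate confidence interval for rho via the Fisher z-transform.

    Simplification: the bias term rho / (2 (n - 1)) is ignored.
    """
    z = 0.5 * math.log((1 + r) / (1 - r))
    half_width = NormalDist().inv_cdf(0.5 + level / 2) / math.sqrt(n - 3)
    # tanh is the inverse of the z-transform.
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Hypothetical sample: r = 0.8 from n = 15 observations.
# 2.160 is the tabulated two-sided 5% critical value of the
# Student t-distribution with 13 degrees of freedom.
significant = is_significant(0.8, 15, t_alpha=2.160)
low, high = fisher_interval(0.8, 15)
print(significant, low, high)
```

For these values the coefficient is significantly distinct from zero, and the approximate 95% interval runs from roughly 0.49 to roughly 0.93. The interval is not symmetric about 0.8, since the transform is non-linear.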

For the distribution of the sample correlation ratio and for tests of the linearity hypothesis for the regression, see [3].

References

[1]	H. Cramér, "Mathematical methods of statistics", Princeton Univ. Press (1946)
[2]	B.L. van der Waerden, "Mathematische Statistik", Springer (1957)
[3]	M.G. Kendall, A. Stuart, "The advanced theory of statistics", 2. Inference and relationship, Griffin (1979)
[4]	S.A. Aivazyan, "Statistical research on dependence", Moscow (1968) (In Russian)