Encyclosphere.org ENCYCLOREADER
  supported by EncyclosphereKSF

Correlation (in statistics)

From Encyclopedia of Mathematics


A dependence between random variables that is not necessarily expressed by a rigorous functional relationship. Unlike functional dependence, a correlation is, as a rule, considered when one of the random variables depends not only on the other (given) one, but also on several random factors. The dependence between two random events is manifested in the fact that the conditional probability of one of them, given the occurrence of the other, differs from the unconditional probability. Similarly, the influence of one random variable on another is characterized by the conditional distributions of one of them, given fixed values of the other.

Let $X$ and $Y$ be random variables with a given joint distribution, let $m_X$ and $m_Y$ be the expectations of $X$ and $Y$, let $\sigma_X^2$ and $\sigma_Y^2$ be the variances of $X$ and $Y$, and let $\rho$ be the correlation coefficient of $X$ and $Y$. Assume that for every possible value $X=x$ the conditional mathematical expectation $y(x)=\mathsf{E}[Y\mid X=x]$ of $Y$ is defined; then the function $y(x)$ is known as the regression of $Y$ given $X$, and its graph is the regression curve of $Y$ given $X$. The dependence of $Y$ on $X$ is manifested in the variation of the mean values of $Y$ as $X$ varies, although for each fixed value $X=x$, $Y$ remains a random variable with a well-defined spread. In order to determine how accurately the regression reproduces the variation of $Y$ as $X$ varies, one uses the conditional variance of $Y$ for a given $X=x$ or its mean value (a measure of the spread of $Y$ about the regression curve):

$$\sigma_{Y\mid X}^2 = \mathsf{E}\bigl[Y-\mathsf{E}(Y\mid X=x)\bigr]^2.$$
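As a minimal numerical sketch (with a hypothetical simulated sample in which $Y$ depends on $X$ plus an independent random factor), the regression $y(x)$ and the conditional variances can be estimated by grouping the observations on the value of $X$:

```python
import random
from statistics import mean, pvariance

random.seed(0)
# Hypothetical sample: Y depends on X and on an independent random factor.
data = [(x, 2 * x + random.gauss(0, 1)) for x in [0, 1, 2, 3] * 250]

xs = sorted({x for x, _ in data})
# Regression of Y given X: conditional means y(x) = E[Y | X = x].
y_of_x = {x: mean(y for x0, y in data if x0 == x) for x in xs}
# Conditional variance of Y for each fixed X = x (spread about the regression).
var_given_x = {x: pvariance([y for x0, y in data if x0 == x]) for x in xs}
```

Here the conditional means increase with $x$ (reflecting the dependence), while each conditional variance stays close to the variance of the random factor.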

If $X$ and $Y$ are independent, then all conditional mathematical expectations of $Y$ are independent of $x$ and coincide with the unconditional expectation: $y(x)=m_Y$; and then also $\sigma_{Y\mid X}^2=\sigma_Y^2$. When $Y$ is a function of $X$ in the strict sense of the word, then for each $X=x$ the variable $Y$ takes only one definite value and $\sigma_{Y\mid X}^2=0$. Similarly one defines $x(y)=\mathsf{E}[X\mid Y=y]$ (the regression of $X$ given $Y$). A natural index of the concentration of the distribution near the regression curve $y(x)$ is the correlation ratio

$$\eta_{Y\mid X}^2 = 1-\frac{\sigma_{Y\mid X}^2}{\sigma_Y^2}.$$

One has $\eta_{Y\mid X}^2=0$ if and only if the regression has the form $y(x)=m_Y$, and in that case the correlation coefficient $\rho$ vanishes and $Y$ is not correlated with $X$. If the regression of $Y$ given $X$ is linear, i.e. the regression curve is the straight line

$$y(x) = m_Y + \rho\frac{\sigma_Y}{\sigma_X}(x-m_X),$$

then

$$\sigma_{Y\mid X}^2 = \sigma_Y^2(1-\rho^2)\quad\text{and}\quad \eta_{Y\mid X}^2 = \rho^2;$$

if, moreover, $|\rho|=1$, then $Y$ is related to $X$ through an exact linear dependence; but if $\eta_{Y\mid X}^2=\rho^2<1$, there is no functional dependence between $Y$ and $X$. There is an exact functional dependence of $Y$ on $X$ other than a linear one if and only if $\rho^2<\eta_{Y\mid X}^2=1$. With rare exceptions, the practical use of the correlation coefficient as a measure of the lack of dependence is justifiable only when the joint distribution of $X$ and $Y$ is normal (or close to normal), since in that case $\rho=0$ implies that $X$ and $Y$ are independent. Use of $\rho$ as a measure of dependence for arbitrary random variables $X$ and $Y$ frequently leads to erroneous conclusions, since $\rho$ may vanish even when a functional dependence exists. If the joint distribution of $X$ and $Y$ is normal, then both regression curves are straight lines and $\rho$ uniquely determines the concentration of the distribution near the regression curves: when $|\rho|=1$ the regression curves merge into one, corresponding to linear dependence between $X$ and $Y$; when $\rho=0$ one has independence.
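The warning that $\rho$ may vanish under an exact functional dependence can be checked directly. In this minimal sketch (illustrative values), $X$ is symmetric about zero and $Y=X^2$, so $Y$ is an exact function of $X$ and yet the covariance, hence $\rho$, is zero:

```python
from statistics import mean

# X symmetric about 0 and Y = X^2: exact functional dependence, yet rho = 0.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
mx, my = mean(xs), mean(ys)
cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
# cov vanishes, so the correlation coefficient is 0 despite Y = X^2.
```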

When studying the interdependence of several random variables $X_1,\dots,X_n$ with a given joint distribution, one uses multiple and partial correlation ratios and coefficients. The latter are evaluated using the ordinary correlation coefficients between $X_i$ and $X_j$, the totality of which forms the correlation matrix. A measure of the linear relationship between $X_1$ and the totality of the other variables $X_2,\dots,X_n$ is provided by the multiple-correlation coefficient. If the mutual relationship of $X_1$ and $X_2$ is assumed to be determined by the influence of the other variables $X_3,\dots,X_n$, then the partial correlation coefficient of $X_1$ and $X_2$ with respect to $X_3,\dots,X_n$ is an index of the linear relationship between $X_1$ and $X_2$ relative to $X_3,\dots,X_n$.
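For three variables, the partial and multiple correlation coefficients can be computed directly from the entries of the correlation matrix by the standard formulas; in this sketch the matrix entries $r_{12}, r_{13}, r_{23}$ are illustrative values, not taken from the text:

```python
import math

# Assumed pairwise correlations of (X1, X2, X3) -- illustrative values.
r12, r13, r23 = 0.6, 0.4, 0.5

# Partial correlation of X1 and X2 with X3 held fixed.
r12_3 = (r12 - r13 * r23) / math.sqrt((1 - r13**2) * (1 - r23**2))

# Multiple correlation of X1 with the pair (X2, X3).
R1_23 = math.sqrt((r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2))
```

Note that the multiple-correlation coefficient is never smaller than the absolute value of any single correlation with $X_1$.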

For measures of correlation based on rank statistics (cf. Rank statistic) see Kendall coefficient of rank correlation; Spearman coefficient of rank correlation.
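The Spearman coefficient, for instance, is simply the ordinary correlation coefficient applied to the ranks of the observations; a minimal Python sketch for the no-ties case (the helper names are ours):

```python
from statistics import mean

def ranks(v):
    # Rank of each observation, 1 = smallest (assumes no ties).
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    # Ordinary correlation coefficient computed on the ranks.
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx)**2 for a in rx) * sum((b - my)**2 for b in ry)) ** 0.5
    return num / den
```

Any strictly increasing relationship gives the value $+1$, and a strictly decreasing one gives $-1$.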

Mathematical statisticians have developed methods for estimating coefficients that characterize the correlation between random variables or tests, as well as methods for testing hypotheses concerning their values, using their sampling analogues. These methods are collectively known as correlation analysis. Correlation analysis of statistical data consists of the following basic practical steps: 1) the construction of a scatter plot and the compilation of a correlation table; 2) the computation of sampling correlation ratios or correlation coefficients; 3) the testing of statistical hypotheses concerning the significance of the dependence. Further investigation may consist in establishing the concrete form of the dependence between the variables (see Regression).

Among the aids to analysis of two-dimensional sample data are the scatter plot and the correlation table. The scatter plot is obtained by plotting the sample points on the coordinate plane. Examination of the configuration formed by the points of the scatter plot yields a preliminary idea of the type of dependence between the random variables (e.g. whether one of the variables increases or decreases on the average as the other increases). Prior to numerical processing, the results are usually grouped and presented in the form of a correlation table. In each entry of this table one writes the number $n_{ij}$ of pairs $(x,y)$ with components in the appropriate grouping intervals. Assuming that the grouping intervals (in each of the variables) are equal in length, one takes the centres $x_i$ (or $y_j$) of the intervals and the numbers $n_{ij}$ as the basis for calculation.

For more accurate information about the nature and strength of the relationship than that provided by the scatter plot, one turns to the correlation coefficient and correlation ratio. The sample correlation coefficient is defined by the formula

$$\hat\rho = \frac{\sum_i\sum_j (x_i-\bar x)(y_j-\bar y)\,n_{ij}}{\sqrt{\sum_i n_{i\cdot}(x_i-\bar x)^2\;\sum_j n_{\cdot j}(y_j-\bar y)^2}},$$

where

$$n_{i\cdot} = \sum_j n_{ij},\qquad n_{\cdot j} = \sum_i n_{ij}$$

and

$$\bar x = \frac{1}{n}\sum_i n_{i\cdot}x_i,\qquad \bar y = \frac{1}{n}\sum_j n_{\cdot j}y_j.$$
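This grouped-data formula for $\hat\rho$ can be evaluated cell by cell; in the sketch below the counts $n_{ij}$ and the interval centres are illustrative numbers, not data from the text:

```python
import math

# Illustrative correlation table: n[i][j] counts pairs in cell (i, j);
# xs and ys are the centres of the grouping intervals.
xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0]
n = [[5, 1],
     [2, 4],
     [1, 6]]

n_tot = sum(sum(row) for row in n)
ni = [sum(row) for row in n]                              # marginal n_{i.}
nj = [sum(row[j] for row in n) for j in range(len(ys))]   # marginal n_{.j}
xbar = sum(ni[i] * xs[i] for i in range(len(xs))) / n_tot
ybar = sum(nj[j] * ys[j] for j in range(len(ys))) / n_tot

num = sum(n[i][j] * (xs[i] - xbar) * (ys[j] - ybar)
          for i in range(len(xs)) for j in range(len(ys)))
den = math.sqrt(sum(ni[i] * (xs[i] - xbar) ** 2 for i in range(len(xs)))
                * sum(nj[j] * (ys[j] - ybar) ** 2 for j in range(len(ys))))
rho_hat = num / den
```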

In the case of a large number of independent observations, governed by one and the same near-normal distribution, $\hat\rho$ is a good approximation to the true correlation coefficient $\rho$. In all other cases the correlation ratio is recommended as a characteristic of the strength of the relationship, since its interpretation is independent of the type of dependence being studied. The sample value $\hat\eta_{Y\mid X}^2$ is computed from the entries in the correlation table:

$$\hat\eta_{Y\mid X}^2 = \frac{\frac{1}{n}\sum_i n_{i\cdot}(\bar y_i-\bar y)^2}{\frac{1}{n}\sum_j n_{\cdot j}(y_j-\bar y)^2},$$

where the numerator represents the spread of the conditional mean values $\bar y_i$ about the unconditional mean $\bar y$ (the sample value $\hat\eta_{X\mid Y}^2$ is defined analogously). The quantity $\hat\eta_{Y\mid X}^2-\hat\rho^2$ is used as an indicator of the deviation of the regression from linearity.
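Starting from a small illustrative correlation table (the counts below are our own example, not data from the text), $\hat\eta_{Y\mid X}^2$ is obtained from the conditional group means of $y$:

```python
# Illustrative correlation table: n[i][j] counts pairs in cell (i, j);
# ys are the centres of the grouping intervals for y.
ys = [10.0, 20.0]
n = [[5, 1],
     [2, 4],
     [1, 6]]

n_tot = sum(sum(row) for row in n)
ni = [sum(row) for row in n]                              # marginal n_{i.}
nj = [sum(row[j] for row in n) for j in range(len(ys))]   # marginal n_{.j}
ybar = sum(nj[j] * ys[j] for j in range(len(ys))) / n_tot
# Conditional means of y within each x-group.
ybar_i = [sum(n[i][j] * ys[j] for j in range(len(ys))) / ni[i]
          for i in range(len(n))]

num = sum(ni[i] * (ybar_i[i] - ybar) ** 2 for i in range(len(n))) / n_tot
den = sum(nj[j] * (ys[j] - ybar) ** 2 for j in range(len(ys))) / n_tot
eta2 = num / den
```

By construction the ratio lies between 0 and 1.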

The testing of hypotheses concerning the significance of a relationship is based on the distributions of the sample correlation characteristics. In the case of a normal distribution, the value of the sample correlation coefficient $\hat\rho$ is significantly distinct from zero if

$$\hat\rho^2 > \left[1+\frac{n-2}{t_\alpha^2}\right]^{-1},$$

where $t_\alpha$ is the critical value of the Student $t$-distribution with $n-2$ degrees of freedom corresponding to the chosen significance level $\alpha$. If $\rho\neq0$, one usually uses the Fisher $z$-transform, with $\hat\rho$ replaced by $z$ according to the formula

$$z = \frac{1}{2}\ln\frac{1+\hat\rho}{1-\hat\rho}.$$

Even at relatively small values of $n$ the distribution of $z$ is a good approximation to the normal distribution with mathematical expectation

$$\frac{1}{2}\ln\frac{1+\rho}{1-\rho} + \frac{\rho}{2(n-1)}$$

and variance $1/(n-3)$. On this basis one can define approximate confidence intervals for the true correlation coefficient $\rho$.
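Such an interval is obtained by forming a normal-theory interval on the $z$ scale and transforming its endpoints back (the inverse of the Fisher transform is the hyperbolic tangent). The inputs below are illustrative: $\hat\rho=0.57$ from $n=50$ observations, and $1.96$ is the two-sided normal quantile for a 95% interval; the small bias term $\rho/(2(n-1))$ is neglected here.

```python
import math

# Approximate 95% confidence interval for rho via the Fisher z-transform.
rho_hat, n_obs = 0.57, 50          # illustrative sample values
z = 0.5 * math.log((1 + rho_hat) / (1 - rho_hat))
half = 1.96 / math.sqrt(n_obs - 3)  # z is approx. normal with variance 1/(n-3)
# Back-transform the endpoints to the rho scale.
ci = (math.tanh(z - half), math.tanh(z + half))
```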

For the distribution of the sample correlation ratio and for tests of the linearity hypothesis for the regression, see [3].

References

[1] H. Cramér, "Mathematical methods of statistics", Princeton Univ. Press (1946)
[2] B.L. van der Waerden, "Mathematische Statistik", Springer (1957)
[3] M.G. Kendall, A. Stuart, "The advanced theory of statistics", Vol. 2: Inference and relationship, Griffin (1979)
[4] S.A. Aivazyan, "Statistical research on dependence", Moscow (1968) (In Russian)

How to Cite This Entry: Correlation (in statistics) (Encyclopedia of Mathematics) | Licensed under CC BY-SA 3.0. Source: https://encyclopediaofmath.org/wiki/Correlation_(in_statistics)