In statistics and in probability theory, distance correlation or distance covariance is a measure of dependence between two paired random vectors of arbitrary, not necessarily equal, dimension. The population distance correlation coefficient is zero if and only if the random vectors are independent. Thus, distance correlation measures both linear and nonlinear association between two random variables or random vectors. This is in contrast to Pearson's correlation, which can only detect linear association between two random variables. Distance correlation can be used to perform a statistical test of dependence with a permutation test. One first computes the distance correlation (involving the re-centering of Euclidean distance matrices) between two random vectors, and then compares this value to the distance correlations of many shuffles of the data.
The classical measure of dependence, the Pearson correlation coefficient,[1] is mainly sensitive to a linear relationship between two variables. Distance correlation was introduced in 2005 by Gábor J. Székely in several lectures to address this deficiency of Pearson's correlation, namely that it can easily be zero for dependent variables. Correlation = 0 (uncorrelatedness) does not imply independence while distance correlation = 0 does imply independence. The first results on distance correlation were published in 2007 and 2009.[2][3] It was proved that distance covariance is the same as the Brownian covariance.[3] These measures are examples of energy distances.
The distance correlation is derived from a number of other quantities that are used in its specification, specifically: distance variance, distance standard deviation, and distance covariance. These quantities take the same roles as the ordinary moments with corresponding names in the specification of the Pearson product-moment correlation coefficient.
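Let us start with the definition of the sample distance covariance. Let $(X_k, Y_k)$, $k = 1, 2, \ldots, n$, be a statistical sample from a pair of real-valued or vector-valued random variables $(X, Y)$. First, compute the $n \times n$ distance matrices $(a_{j,k})$ and $(b_{j,k})$ containing all pairwise distances

$$a_{j,k} = \|X_j - X_k\|, \qquad b_{j,k} = \|Y_j - Y_k\|, \qquad j, k = 1, 2, \ldots, n,$$

where $\|\cdot\|$ denotes the Euclidean norm. Then take all doubly centered distances

$$A_{j,k} := a_{j,k} - \bar a_{j\cdot} - \bar a_{\cdot k} + \bar a_{\cdot\cdot}, \qquad B_{j,k} := b_{j,k} - \bar b_{j\cdot} - \bar b_{\cdot k} + \bar b_{\cdot\cdot},$$

where $\bar a_{j\cdot}$ is the j-th row mean, $\bar a_{\cdot k}$ is the k-th column mean, and $\bar a_{\cdot\cdot}$ is the grand mean of the distance matrix of the X sample. The notation is similar for the b values. In the matrices of centered distances $(A_{j,k})$ and $(B_{j,k})$ all rows and all columns sum to zero. The squared sample distance covariance (a scalar) is the arithmetic average of the products $A_{j,k} B_{j,k}$:

$$\operatorname{dCov}_n^2(X, Y) := \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} A_{j,k}\, B_{j,k}.$$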
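The computation of the squared sample distance covariance (pairwise distances, double centering, average of products) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the implementation of any particular package; the function names `distance_matrix`, `double_center` and `dcov2` are hypothetical, and each observation is assumed to be a tuple of floats.

```python
import math

def distance_matrix(points):
    """Pairwise Euclidean distances between observations (tuples of floats)."""
    return [[math.dist(p, q) for q in points] for p in points]

def double_center(d):
    """Subtract row and column means, then add back the grand mean."""
    n = len(d)
    row = [sum(r) / n for r in d]
    col = [sum(d[j][k] for j in range(n)) / n for k in range(n)]
    grand = sum(row) / n
    return [[d[j][k] - row[j] - col[k] + grand for k in range(n)]
            for j in range(n)]

def dcov2(xs, ys):
    """Squared sample distance covariance: average of the products A[j][k] * B[j][k]."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must contain the same number of observations")
    A = double_center(distance_matrix(xs))
    B = double_center(distance_matrix(ys))
    return sum(A[j][k] * B[j][k] for j in range(n) for k in range(n)) / n**2

# The two samples may have different dimensions (here 1 and 2).
xs = [(0.0,), (1.0,), (2.0,), (3.0,)]
ys = [(0.0, 0.0), (1.0, 1.0), (4.0, 2.0), (9.0, 3.0)]
print(dcov2(xs, ys))
```

The sketch costs on the order of n² time and memory, which is adequate for small samples.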
The statistic $T_n = n \operatorname{dCov}_n^2(X, Y)$ determines a consistent multivariate test of independence of random vectors in arbitrary dimensions. For an implementation see the dcov.test function in the energy package for R.[4]
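A permutation test based on this statistic can be sketched as follows. The code is an illustration only and does not reproduce dcov.test; the names `centered_distances` and `permutation_test`, the default of 199 permutations, and the fixed seed are assumptions made for the example. It relies on the fact that shuffling the Y sample amounts to applying the same permutation to the rows and columns of its doubly centered distance matrix, so the centering is done only once.

```python
import math
import random

def centered_distances(points):
    """Doubly centered Euclidean distance matrix of a list of points (tuples)."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    row = [sum(r) / n for r in d]      # d is symmetric: row means equal column means
    grand = sum(row) / n
    return [[d[j][k] - row[j] - row[k] + grand for k in range(n)]
            for j in range(n)]

def permutation_test(xs, ys, n_perm=199, seed=0):
    """Return (T_n, p_value) for the null hypothesis that xs and ys are independent."""
    n = len(xs)
    A, B = centered_distances(xs), centered_distances(ys)

    def statistic(order):
        # T_n = n * dCov_n^2 with the Y observations taken in the given order.
        total = sum(A[j][k] * B[order[j]][order[k]]
                    for j in range(n) for k in range(n))
        return n * total / n**2

    order = list(range(n))
    observed = statistic(order)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        if statistic(order) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Nonlinear dependence: y = x^2 on a symmetric grid.
xs = [(-1 + 2 * i / 19,) for i in range(20)]
ys = [(x[0] ** 2,) for x in xs]
t_n, p = permutation_test(xs, ys)
print(t_n, p)
```

A small p-value indicates that the observed statistic is unusually large compared with its values under random re-pairings of the two samples.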
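The population value of distance covariance can be defined along the same lines. Let X be a random variable that takes values in a p-dimensional Euclidean space with probability distribution μ and let Y be a random variable that takes values in a q-dimensional Euclidean space with probability distribution ν, and suppose that X and Y have finite expectations. Write

$$a_\mu(x) := E[\|X - x\|], \qquad D(\mu) := E[a_\mu(X)], \qquad d_\mu(x, x') := \|x - x'\| - a_\mu(x) - a_\mu(x') + D(\mu),$$

and define $a_\nu$, $D(\nu)$ and $d_\nu$ analogously for Y.

Finally, define the population value of squared distance covariance of X and Y as

$$\operatorname{dCov}^2(X, Y) := E\big[d_\mu(X, X')\, d_\nu(Y, Y')\big].$$

One can show that this is equivalent to the following definition:

$$\begin{aligned}
\operatorname{dCov}^2(X, Y) :={} & E[\|X - X'\|\,\|Y - Y'\|] + E[\|X - X'\|]\, E[\|Y - Y'\|] \\
& - E[\|X - X'\|\,\|Y - Y''\|] - E[\|X - X''\|\,\|Y - Y'\|] \\
={} & E[\|X - X'\|\,\|Y - Y'\|] + E[\|X - X'\|]\, E[\|Y - Y'\|] - 2 E[\|X - X'\|\,\|Y - Y''\|],
\end{aligned}$$

where E denotes expected value, and $(X, Y)$, $(X', Y')$ and $(X'', Y'')$ are independent and identically distributed. Distance covariance can be expressed in terms of the classical Pearson covariance, cov, as follows:

$$\operatorname{dCov}^2(X, Y) = \operatorname{cov}(\|X - X'\|, \|Y - Y'\|) - 2 \operatorname{cov}(\|X - X'\|, \|Y - Y''\|).$$

This identity shows that the distance covariance is not the same as the covariance of distances, $\operatorname{cov}(\|X - X'\|, \|Y - Y'\|)$. The covariance of distances can be zero even if X and Y are not independent.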
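The equivalence of the two definitions can be checked numerically on the empirical distribution of a sample, where every expectation becomes an average over indices. The sketch below uses hypothetical helper names and a toy sample chosen for illustration. It compares the average of products of doubly centered distances with the moment form (the empirical analogue of E[ab] + E[a]E[b] − 2E[ab″]), and also returns the sample covariance of distances to show that it is a different quantity.

```python
import math

def distances(points):
    """Pairwise Euclidean distances between observations (tuples of floats)."""
    return [[math.dist(p, q) for q in points] for p in points]

def dcov2_centered(xs, ys):
    """Average product of doubly centered distances."""
    n = len(xs)

    def center(d):
        row = [sum(r) / n for r in d]  # symmetric matrix: row means equal column means
        grand = sum(row) / n
        return [[d[j][k] - row[j] - row[k] + grand for k in range(n)]
                for j in range(n)]

    A, B = center(distances(xs)), center(distances(ys))
    return sum(A[j][k] * B[j][k] for j in range(n) for k in range(n)) / n**2

def dcov2_moments(xs, ys):
    """Return (moment form of dCov_n^2, sample covariance of distances)."""
    n = len(xs)
    a, b = distances(xs), distances(ys)
    s1 = sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / n**2
    s2 = (sum(map(sum, a)) / n**2) * (sum(map(sum, b)) / n**2)
    s3 = sum(sum(a[j]) * sum(b[j]) for j in range(n)) / n**3
    return s1 + s2 - 2 * s3, s1 - s2

xs = [(-2.0,), (-1.0,), (0.0,), (1.0,), (2.0,)]
ys = [(x[0] ** 2,) for x in xs]
via_centering = dcov2_centered(xs, ys)
via_moments, cov_of_distances = dcov2_moments(xs, ys)
print(via_centering, via_moments, cov_of_distances)
```

On this toy sample the two forms of the squared distance covariance agree (about 0.448), while the sample covariance of distances is about 0.544.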
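Alternatively, the distance covariance can be defined as the weighted L2 norm of the distance between the joint characteristic function of the random variables and the product of their marginal characteristic functions:[6]

$$\operatorname{dCov}^2(X, Y) = \frac{1}{c_p c_q} \int_{\mathbb{R}^{p+q}} \frac{\left|\varphi_{X,Y}(s, t) - \varphi_X(s)\, \varphi_Y(t)\right|^2}{|s|_p^{1+p}\, |t|_q^{1+q}} \, dt \, ds,$$

where $\varphi_{X,Y}(s, t)$, $\varphi_X(s)$ and $\varphi_Y(t)$ are the characteristic functions of (X, Y), X, and Y, respectively, p and q denote the Euclidean dimensions of X and Y (and thus of s and t), and $c_p$, $c_q$ are constants given by $c_d = \pi^{(1+d)/2} / \Gamma((1+d)/2)$.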
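The distance variance is a special case of distance covariance when the two variables are identical. The population value of distance variance is the square root of

$$\operatorname{dVar}^2(X) := E[\|X - X'\|^2] + \big(E[\|X - X'\|]\big)^2 - 2 E[\|X - X'\|\,\|X - X''\|],$$

where X, X′ and X″ are independent and identically distributed random variables and E denotes the expected value.

The sample distance variance is the square root of

$$\operatorname{dVar}_n^2(X) := \operatorname{dCov}_n^2(X, X) = \frac{1}{n^2} \sum_{k, l} A_{k,l}^2,$$

which is a relative of Corrado Gini's mean difference introduced in 1912 (but Gini did not work with centered distances).[8]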
The distance standard deviation is the square root of the distance variance.
The distance correlation [2][3] of two random variables is obtained by dividing their distance covariance by the product of their distance standard deviations. The distance correlation is the square root of

$$\operatorname{dCor}^2(X, Y) = \frac{\operatorname{dCov}^2(X, Y)}{\sqrt{\operatorname{dVar}^2(X)\, \operatorname{dVar}^2(Y)}},$$

and the sample distance correlation is defined by substituting the sample distance covariance and distance variances for the population coefficients above.
For easy computation of sample distance correlation see the dcor function in the energy package for R.[4]
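As an illustration of the definition (and not a substitute for the R function), the sample distance correlation of two univariate samples can be sketched in plain Python. The function names are hypothetical, absolute differences stand in for the Euclidean norm because the data are scalar, and returning 0 when either sample is constant is a convention assumed here. The example uses y = x² on a symmetric grid, for which the Pearson correlation is zero although y is a function of x.

```python
import math

def centered(values):
    """Doubly centered matrix of absolute differences for a univariate sample."""
    n = len(values)
    d = [[abs(u - v) for v in values] for u in values]
    row = [sum(r) / n for r in d]      # symmetric matrix: row means equal column means
    grand = sum(row) / n
    return [[d[j][k] - row[j] - row[k] + grand for k in range(n)]
            for j in range(n)]

def mean_product(A, B):
    """Average of the entrywise products of two n-by-n matrices."""
    n = len(A)
    return sum(A[j][k] * B[j][k] for j in range(n) for k in range(n)) / n**2

def dcor(xs, ys):
    """Sample distance correlation: sqrt(dCov_n^2 / sqrt(dVar_n^2(x) * dVar_n^2(y)))."""
    A, B = centered(xs), centered(ys)
    denom = math.sqrt(mean_product(A, A) * mean_product(B, B))
    if denom == 0.0:
        return 0.0                     # assumed convention for a constant sample
    return math.sqrt(max(mean_product(A, B), 0.0) / denom)

def pearson(xs, ys):
    """Ordinary Pearson correlation coefficient, for comparison."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x for x in xs]
print(pearson(xs, ys), dcor(xs, ys))
```

Here the Pearson coefficient vanishes while the sample distance correlation is clearly positive, which is the behaviour the definition is designed to deliver.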
The property that $\operatorname{dCov}(X, Y) = 0$ if and only if X and Y are independent is the most important effect of working with centered distances.
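The statistic $\operatorname{dCov}_n^2(X, Y)$ is a biased estimator of $\operatorname{dCov}^2(X, Y)$.
An unbiased estimator of $\operatorname{dCov}^2(X, Y)$ was given by Székely and Rizzo.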
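(iv) If X and Y are independent, then $\operatorname{dVar}(X + Y) \le \operatorname{dVar}(X) + \operatorname{dVar}(Y)$. Equality holds in (iv) if and only if one of the random variables X or Y is a constant.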
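An unbiased estimator of the squared distance covariance can be sketched with a modified centering, assuming the "U-centering" construction attributed to Székely and Rizzo: the diagonal is set to zero, row and column sums are divided by n − 2, the total by (n − 1)(n − 2), and the sum of products over off-diagonal entries by n(n − 3). The function names below are hypothetical, the formula is stated from that assumption rather than from this article, and at least four observations are required.

```python
import math

def u_center(d):
    """U-centered version of a symmetric distance matrix with zero diagonal."""
    n = len(d)
    row = [sum(r) for r in d]          # row sums (equal to column sums by symmetry)
    total = sum(row)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                out[i][j] = (d[i][j]
                             - row[i] / (n - 2)
                             - row[j] / (n - 2)
                             + total / ((n - 1) * (n - 2)))
    return out

def dcov2_unbiased(xs, ys):
    """Unbiased-type estimate of dCov^2; unlike dCov_n^2 it can be negative."""
    n = len(xs)
    if n != len(ys) or n < 4:
        raise ValueError("need two samples of equal size n >= 4")
    A = u_center([[math.dist(p, q) for q in xs] for p in xs])
    B = u_center([[math.dist(p, q) for q in ys] for p in ys])
    s = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n) if i != j)
    return s / (n * (n - 3))

xs = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
ys = [(0.0,), (1.0,), (4.0,), (9.0,), (16.0,), (25.0,)]
print(dcov2_unbiased(xs, ys))
```

Because the estimate may fall below zero for nearly independent samples, it is not suitable for taking a square root without further handling.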
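Distance covariance can be generalized to include powers of Euclidean distance. Define

$$\operatorname{dCov}^2(X, Y; \alpha) := E[\|X - X'\|^\alpha\, \|Y - Y'\|^\alpha] + E[\|X - X'\|^\alpha]\, E[\|Y - Y'\|^\alpha] - 2 E[\|X - X'\|^\alpha\, \|Y - Y''\|^\alpha].$$

Then for every $0 < \alpha < 2$, X and Y are independent if and only if $\operatorname{dCov}^2(X, Y; \alpha) = 0$. This characterization does not hold for exponent $\alpha = 2$; in this case, for bivariate (X, Y), $\operatorname{dCor}(X, Y; \alpha = 2)$ is a deterministic function of the Pearson correlation. If $a_{k,l}$ and $b_{k,l}$ are α powers of the corresponding distances, $0 < \alpha \le 2$, then the α sample distance covariance can be defined as the nonnegative number for which

$$\operatorname{dCov}_n^2(X, Y; \alpha) := \frac{1}{n^2} \sum_{k, l} A_{k,l}\, B_{k,l}.$$

One can extend $\operatorname{dCov}$ to metric-space-valued random variables X and Y: if X has law μ in a metric space with metric d, then define $a_\mu(x) := E[d(X, x)]$, $D(\mu) := E[a_\mu(X)]$, and (provided $a_\mu$ is finite, i.e., X has finite first moment) $d_\mu(x, x') := d(x, x') - a_\mu(x) - a_\mu(x') + D(\mu)$. Then if Y has law ν (in a possibly different metric space, with finite first moment), define

$$\operatorname{dCov}^2(X, Y) := E\big[d_\mu(X, X')\, d_\nu(Y, Y')\big].$$

This is non-negative for all such X, Y if and only if both metric spaces have negative type. Here, a metric space (M, d) has negative type if $(M, d^{1/2})$ is isometric to a subset of a Hilbert space. If both metric spaces have strong negative type, then $\operatorname{dCov}^2(X, Y) = 0$ if and only if X and Y are independent.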
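The sample version with exponent α can be sketched by raising each pairwise distance to the power α before double centering. The function name `dcov2_alpha` and the toy data are assumptions for illustration. For univariate data and α = 2 the doubly centered matrix factorizes as −2(x_j − x̄)(x_k − x̄), so the statistic equals four times the squared sample covariance; the example shows it vanishing for y = x² on a symmetric grid, whereas α = 1 still detects the dependence.

```python
import math

def dcov2_alpha(xs, ys, alpha=1.0):
    """Squared sample distance covariance with distances raised to the power alpha."""
    if not 0.0 < alpha <= 2.0:
        raise ValueError("alpha must satisfy 0 < alpha <= 2")
    n = len(xs)

    def centered(points):
        d = [[math.dist(p, q) ** alpha for q in points] for p in points]
        row = [sum(r) / n for r in d]  # symmetric matrix: row means equal column means
        grand = sum(row) / n
        return [[d[j][k] - row[j] - row[k] + grand for k in range(n)]
                for j in range(n)]

    A, B = centered(xs), centered(ys)
    return sum(A[j][k] * B[j][k] for j in range(n) for k in range(n)) / n**2

xs = [(-2.0,), (-1.0,), (0.0,), (1.0,), (2.0,)]
ys = [(x[0] ** 2,) for x in xs]
print(dcov2_alpha(xs, ys, alpha=1.0))  # positive: the dependence is detected
print(dcov2_alpha(xs, ys, alpha=2.0))  # zero up to rounding: reduces to Pearson covariance
```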
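The original distance covariance has been defined as the square root of $\operatorname{dCov}^2(X, Y)$, rather than the squared coefficient itself.
Alternately, one could define distance covariance to be the square of the energy distance: $\operatorname{dCov}^2(X, Y)$.
Under these alternate definitions, the distance correlation is also defined as the square $\operatorname{dCor}^2(X, Y)$, rather than the square root.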
Brownian covariance is motivated by generalization of the notion of covariance to stochastic processes. The square of the covariance of random variables X and Y can be written in the following form:

$$\operatorname{cov}(X, Y)^2 = E\big[(X - E(X))\,(X' - E(X'))\,(Y - E(Y))\,(Y' - E(Y'))\big],$$

where E denotes the expected value and the prime denotes independent and identically distributed copies. We need the following generalization of this formula. If U(s), V(t) are arbitrary random processes defined for all real s and t then define the U-centered version of X by

$$X_U := U(X) - E_X\big[U(X) \mid \{U(t)\}\big]$$

whenever the subtracted conditional expected value exists, and denote by $Y_V$ the V-centered version of Y.[3][13][14] The (U,V) covariance of (X,Y) is defined as the nonnegative number whose square is

$$\operatorname{cov}_{U,V}^2(X, Y) := E\big[X_U\, X'_U\, Y_V\, Y'_V\big]$$

whenever the right-hand side is nonnegative and finite. The most important example is when U and V are two-sided independent Brownian motions (Wiener processes) with expectation zero and covariance |s| + |t| − |s − t| = 2 min(s,t) (for nonnegative s, t only). (This is twice the covariance of the standard Wiener process; here the factor 2 simplifies the computations.) In this case the (U,V) covariance is called Brownian covariance and is denoted by

$$\operatorname{cov}_W(X, Y).$$
There is a surprising coincidence: the Brownian covariance is the same as the distance covariance,

$$\operatorname{cov}_W(X, Y) = \operatorname{dCov}(X, Y),$$

and thus Brownian correlation is the same as distance correlation.
On the other hand, if we replace the Brownian motion with the deterministic identity function id, then $\operatorname{cov}_{\mathrm{id}}(X, Y)$ is simply the absolute value of the classical Pearson covariance,

$$\operatorname{cov}_{\mathrm{id}}(X, Y) = |\operatorname{cov}(X, Y)|.$$
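The identity case can be verified directly from the definition of the (U,V) covariance. The following short derivation is a sketch that assumes X and Y have finite second moments. Since id is deterministic, the conditional expectation in the centering reduces to an ordinary expectation, and the independence of $(X, Y)$ and $(X', Y')$ lets the fourfold product factor:

$$\begin{aligned}
X_{\mathrm{id}} &= X - E(X), \qquad Y_{\mathrm{id}} = Y - E(Y), \\
\operatorname{cov}_{\mathrm{id}}^2(X, Y) &= E\big[(X - E(X))\,(X' - E(X'))\,(Y - E(Y))\,(Y' - E(Y'))\big] \\
&= E\big[(X - E(X))\,(Y - E(Y))\big]\; E\big[(X' - E(X'))\,(Y' - E(Y'))\big] \\
&= \operatorname{cov}(X, Y)^2 .
\end{aligned}$$

Taking the nonnegative square root gives the absolute value of the Pearson covariance.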
Other correlational metrics, including kernel-based correlational metrics (such as the Hilbert-Schmidt Independence Criterion or HSIC) can also detect linear and nonlinear interactions. Both distance correlation and kernel-based metrics can be used in methods such as canonical correlation analysis and independent component analysis to yield stronger statistical power.
Original source: https://en.wikipedia.org/wiki/Distance correlation.