Correlation describes the relationship of two factors to one another (see cause and effect). When two factors are found to go together, there are only three explanations: the two things may be entirely unrelated, their association a mere coincidence; both may be caused by some third factor; or one of them may be causing the other. It is the task of statistics to point out the existence of a correlation, but it is the task of science to discover what relationship, if any, exists.
Cancer clusters are often said to indicate the existence of a biological hazard, but, by the laws of chance, in any group of people who get a rare disease (or win the lottery) there will always be some who live close together. This does not mean that something in their environment caused their tragedy or good fortune.
If you flip coins or throw dice, from time to time you will get runs of similar results, such as four heads in a row. You might try to find the cause of these runs in what you were thinking about at the time, hoping to find that you could influence the toss or roll. But this is nonsense.
When unrelated events occur together, the co-occurrence is dismissed as a matter of coincidence.
Two commuters might happen to see each other on a train, not because either intended to meet the other, but simply because each happens to dislike crowded trains and prefer trains with dining cars. When the regular train is crowded or lacks a dining car, both may independently decide to take a later train. Seeing each other in the dining car of that later train would not be mere coincidence; it would stem from the conditions on the earlier train.
Cases of shark attacks correlate with sales of ice cream, but no one thinks that the bites make people buy ice cream, or that eating ice cream makes you vulnerable to sharks. Both things go up and down in an annual cycle, because people go to the beach more in the summer. They swim and buy ice cream because it's hot. Swimming exposes them to (rare) shark attacks.
In scientific subjects like chemistry, physics, and astronomy, many laws of cause and effect have been discovered. Put two chemicals together, and they form a compound. Push something, and its momentum changes. The scientific method is useful in describing the relationship between events, especially when, despite their best efforts, no one has been able to find an exception (see independent review).
In common usage, correlation denotes an association of one variable with another in quite general terms; for example, one might say, "success is correlated with hard work". In mathematics, however, and in science and engineering, which make use of mathematical concepts, correlation is a technical term with a precise definition.
Correlation must be distinguished from causation (see the article on correlation is not causation). When one factor changes and another factor changes with it, the two are correlated; the correlation may reflect a direct relationship between them, or both changes could be the result of changes in a third factor. For example, the prices of two unrelated goods might increase during a period of inflation; the two price rises are correlated with each other, but neither has caused the other.
Once a correlation is established, scientists may conduct research to determine causation. Are respiratory deaths causing air pollution, or is it the other way around? It is easy to determine that sickness among the elderly does not cause air pollution; rather, chemicals like sulfur dioxide (typically from coal-burning power plants) are the culprits. Cities and states measure the amounts of pollutants in the air, and epidemiologists can use these data, comparing them with the number of people who develop respiratory diseases.
Regulations that restrict air pollution are made on the basis of these correlations, and on the cause-and-effect relationships which the correlations help scientists to discover. However, activists have sometimes created false correlations by selective use of data.[2]
This section is at the level of advanced high-school mathematics (e.g. A-level or Baccalauréat) and can be skipped by general readers.
The correlation coefficient, also known as Pearson's r, is a statistical measure of association between two continuous variables. It is defined as:
$$r = \frac{\sum Z_X Z_Y}{n}$$
where Z_X is the Z-score of the independent variable X, Z_Y is the Z-score of the dependent variable Y, and n is the number of paired observations of X and Y.
Thus, Pearson's r is the arithmetic mean of the products of the paired Z-scores. The Z-scores used in the correlation coefficient must be calculated using the population formula for the standard deviation; thus:
$$Z_X = \frac{X - M_X}{SD_X} \qquad\text{and}\qquad Z_Y = \frac{Y - M_Y}{SD_Y}$$
where M_X and M_Y are the arithmetic means of X and Y, and SD_X and SD_Y are their population standard deviations:
$$SD_X = \sqrt{\frac{\sum (X - M_X)^2}{n}} \qquad\text{and}\qquad SD_Y = \sqrt{\frac{\sum (Y - M_Y)^2}{n}}$$
where n is, as before, the number of observations of X and Y.
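As a concrete illustration of these definitions, here is a minimal Python sketch (the function name and the sample data are illustrative, not from any particular source) that computes Pearson's r as the mean product of population Z-scores:

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: the mean of the products of population Z-scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Population standard deviations: divide by n, not n - 1.
    sdx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sdy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    zx = [(x - mx) / sdx for x in xs]
    zy = [(y - my) / sdy for y in ys]
    return sum(a * b for a, b in zip(zx, zy)) / n

# Hours worked vs. weekly pay: a perfectly linear relationship, so r = 1.
hours = [10, 20, 30, 40]
pay = [150, 300, 450, 600]
print(pearson_r(hours, pay))  # 1.0 (up to floating-point rounding)
```

Library routines such as numpy.corrcoef compute the same quantity and should agree with this sketch up to floating-point error.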
Pearson's r varies between -1 and +1. A value of zero indicates that no linear association is present between the variables; a value of +1 indicates the strongest possible positive association, and a value of -1 the strongest possible negative association. In a positive relationship, as variable X increases in value, variable Y also increases (e.g. as the number of hours worked (X) increases, weekly pay (Y) also increases). In a negative relationship, as variable X increases in value, variable Y decreases (e.g. as the number of alcoholic beverages consumed (X) increases, the score on a test of hand-eye coordination (Y) decreases).
It is important to note that, while a correlation coefficient may be calculated for any set of paired numbers, there is no guarantee that the coefficient is statistically significant. That is to say, without statistical significance we cannot be confident that the computed correlation is an accurate reflection of reality. Significance may be assessed using a variant of the standard Student's t-test, defined as:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
The results of this test should be evaluated against the standard t-distribution with degrees of freedom equal to n-2. As usual in significance tests, a result described as significant (i.e. a low value of P in the t test) means that the chance of getting a correlation coefficient at least as large (whichever the direction) as that observed, if there is in fact no correlation, is small. A statistically significant result can therefore arise if the actual correlation is strong, even if the dataset is small, or if the actual correlation is weak but many data are observed.
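To make the test concrete, the following sketch uses SciPy's t-distribution to turn r and n into a t statistic and a two-sided P value (the helper name is ours, and the printed figures are approximate):

```python
from scipy import stats

def correlation_significance(r, n):
    """t statistic and two-sided P value for H0: no correlation."""
    t = r * (n - 2) ** 0.5 / (1 - r ** 2) ** 0.5
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # sf gives the upper-tail probability
    return t, p

# A strong correlation in a small sample is significant...
print(correlation_significance(0.9, 10))    # t ≈ 5.84, P ≈ 0.0004
# ...and so is a weak correlation in a large sample.
print(correlation_significance(0.1, 1000))  # t ≈ 3.18, P ≈ 0.0015
```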
One should note that, for most random variables, a correlation of 0 does not imply that they are independent. However, if the two variables are jointly normally distributed (bivariate normal), then a correlation of 0 does imply independence.
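This can be checked numerically. If X is standard normal and Y = X², then Y is completely determined by X, yet their correlation is zero because the relationship is symmetric rather than linear (a minimal NumPy sketch; the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x ** 2  # y is completely determined by x, so the two are dependent

# The symmetry of the relationship makes the linear association vanish.
print(np.corrcoef(x, y)[0, 1])  # close to 0
```

The same example anticipates the linearity caveat discussed below: r captures only the linear component of a relationship.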
It is important to note that the correlation coefficient is designed for variables measured at the interval or ratio level, that is, variables that vary continuously (a ratio scale also has a meaningful zero). As such, naively correlating a dichotomous variable (e.g. sex) with a ratio variable (e.g. IQ) is inappropriate and will return results that are difficult to interpret.
Additionally, correlation measures only a linear relationship between X and Y. Thus, an increase in variable X is assumed to be associated with the same change in Y across all values of X and Y.
Suppose a polling firm was hired to talk to voters as they voted. Each voter was asked "Are you a Republican?", "Are you a blue collar worker?", "Are you male?", "Do you live in West Virginia?", and "Did you vote for Donald Trump?" We can record that data on a spreadsheet using one for yes and zero for no:
| Person | Republican? | Blue collar? | Male? | WVa? | Trump? |
|--------|-------------|--------------|-------|------|--------|
| A | 1 | 1 | 1 | 1 | 1 |
| B | 1 | 1 | 1 | 1 | 1 |
| C | 1 | 1 | 1 | 0 | 1 |
| D | 0 | 1 | 1 | 1 | 1 |
| E | 0 | 0 | 1 | 0 | 0 |
| F | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 0 | 1 | 0 |
| Correlation with Trump? | 0.75 | 1.00 | 0.73 | 0.42 | |
Whether the person voted for Trump is the dependent variable, and each of the other answers is an independent variable. The correlation coefficient can then be computed comparing the column of each independent variable with the Trump column. What these data tell us is that, in this sample, being blue collar is a perfect predictor of voting for Trump: the correlation coefficient is 1. The next best predictor is whether the voter is a Republican, where the columns match 6 out of 7 times. Then come gender and West Virginia residence. Assuming that the sample was large enough for these correlations to be statistically significant, we could build a model to predict whether other voters would vote for Trump by asking the same questions and computing a score using the formula:
score = 0.75 × Republican + 1.00 × Blue collar + 0.73 × Male + 0.42 × WVa
A perfect score would be 2.897 (the sum of the unrounded coefficients), and someone with a score of 0 would be most unlikely to vote for Trump. Of course, the polling firm would be using many more questions and would experiment to fine-tune the questions to get the best predictive scores. They would also use much larger data samples.
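The arithmetic above can be verified with a short NumPy sketch (the array layout and variable names are ours):

```python
import numpy as np

# Columns: Republican, Blue collar, Male, WVa, Trump (1 = yes, 0 = no).
data = np.array([
    [1, 1, 1, 1, 1],  # A
    [1, 1, 1, 1, 1],  # B
    [1, 1, 1, 0, 1],  # C
    [0, 1, 1, 1, 1],  # D
    [0, 0, 1, 0, 0],  # E
    [0, 0, 0, 0, 0],  # F
    [0, 0, 0, 1, 0],  # G
])
trump = data[:, 4]
coeffs = np.array([np.corrcoef(data[:, i], trump)[0, 1] for i in range(4)])
print(coeffs.round(2))  # [0.75 1.   0.73 0.42]

# Score a hypothetical voter who answered yes to every question.
new_voter = np.array([1, 1, 1, 1])
print(new_voter @ coeffs)  # ≈ 2.897, the "perfect score" above
```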
This fine-tuning process is a form of learning. When a computer does this by itself it is called machine learning.
For a more detailed treatment, see Correlation is not causation.
Correlation is a symmetric statistic: it makes no mathematical distinction between independent and dependent variables. As a consequence, it is impossible to assert causation on the basis of a correlation alone, even one that is statistically significant. When a significant correlation has been identified, it is possible that X → Y (i.e. X causes Y), that Y → X (i.e. Y causes X), or that X ← Z → Y (i.e. both X and Y are caused by an additional variable, Z). Caution must be exercised in asserting causation from correlation, and claims to this effect must be viewed with considerable skepticism.
Aron, Arthur, Elaine N. Aron, and Elliot J. Coups. 2008. Statistics for the Behavioral and Social Sciences: A Brief Course, 4th ed. Upper Saddle River, NJ: Prentice Hall.