Cause and effect can be observed

Correlation and Causality

A well-known example from statistics: the more people get married in Kentucky, the more people drown after falling out of a fishing boat. With a correlation coefficient of r = 0.952, this relationship is, from a statistical point of view, almost perfect. But does that mean you should not get married in Kentucky? Or is the per-capita consumption of cheese responsible for unhappy deaths caused by becoming tangled in bedsheets? Here, too, a strong correlation can be observed (r = 0.947).

In both cases, the answer is most likely “no”. Instead, these examples illustrate that a correlation is far from proving causality. So what are correlation analyses good for, and what do you have to pay attention to when interpreting them?

Correlative relationships

An analysis of the correlation between two variables is interesting whenever we want to know whether there is a statistical relationship between these variables and in which direction it runs. We distinguish between four basic scenarios, which the following example, together with the short simulation sketch after the list, should clarify: "Is there a connection between the number of weekly working hours and how often a person goes to a restaurant?"

  • No connection: knowing the weekly working hours allows no statement about the frequency of restaurant visits.
  • Positive correlation: the more a person works per week, the more often they go to a restaurant.
  • Negative correlation: the more a person works per week, the less often they go to a restaurant.
  • Non-linear relationship: both a below-average and an above-average number of weekly working hours increase the frequency of restaurant visits.
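
As a quick illustration, here is a small Python sketch (not from the original article; the data are simulated and the variable names are hypothetical) that generates each of the four scenarios for the working-hours example and reports Pearson's r:

```python
# Simulate the four scenarios for the hypothetical working-hours vs.
# restaurant-visits example and check what Pearson's r reports in each case.
import numpy as np

rng = np.random.default_rng(42)
hours = rng.uniform(20, 60, size=500)          # weekly working hours
noise = rng.normal(0, 1, size=500)

scenarios = {
    "no connection": rng.normal(3, 1, size=500),            # independent of hours
    "positive":       0.1 * hours + noise,                  # more hours -> more visits
    "negative":      -0.1 * hours + noise,                  # more hours -> fewer visits
    "non-linear":     0.01 * (hours - 40) ** 2 + noise,     # U-shaped relationship
}

for name, visits in scenarios.items():
    r = np.corrcoef(hours, visits)[0, 1]
    print(f"{name:15s} r = {r:+.2f}")
```

Note that the U-shaped (non-linear) case yields an r close to 0, even though hours and visits are clearly related: the correlation coefficient only captures the linear part of the relationship.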

Whether the observed relationship is actually causal, and which variable is cause and which is effect, are questions that a correlation analysis cannot answer. Let us assume we observed a positive correlation in our example. One explanation could be that people who work longer have less time to cook and therefore go to restaurants more often. Alternatively, it is also conceivable that people who like to eat out have to work more in order to afford their frequent restaurant visits. And a purely coincidental correlation cannot be ruled out either, as the two introductory examples should make clear.

No causality in correlation

So we do not know whether there is a genuinely causal connection, nor what exactly is cause and what is effect. Nevertheless, it can of course be desirable to derive causality from a correlative relationship through (well-researched) interpretation of the content. However, it is very important to be aware that these interpretations, as conclusive as they may appear, are never statistically supported by the correlation itself.

Proving causality

In fact, a causal relationship can never be fully proven with statistical methods (although there are newer approaches in statistics, e.g. in the field of causal inference). The best approximation is a controlled experiment, i.e. manipulating the independent variable X (assumed to be the cause, e.g. weekly working hours) while observing the dependent variable Y (assumed to be the effect, e.g. number of restaurant visits). If Y changes as a result of the manipulation of X, it can be assumed, at least statistically, that the two factors are related.
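
To make the idea of a controlled experiment concrete, here is a small simulated sketch (hypothetical numbers, not part of the original text): participants are randomly assigned to one of two working-hour conditions, and the resulting restaurant visits are compared. In this simulation the effect of hours on visits is causal by construction, so the group comparison picks up the manipulated effect:

```python
# Approximate a controlled experiment: manipulate X (weekly working hours)
# via random assignment and observe Y (restaurant visits).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200

# Randomly assign participants to a "reduced hours" or "extended hours" condition.
assigned_hours = rng.choice([30, 50], size=n)

# Assumed data-generating process: hours causally increase restaurant visits.
visits = 0.1 * assigned_hours + rng.normal(0, 1, size=n)

group_low = visits[assigned_hours == 30]
group_high = visits[assigned_hours == 50]

# If manipulating X shifts Y, a two-sample t-test should detect the difference.
t, p = stats.ttest_ind(group_high, group_low)
print(f"mean difference: {group_high.mean() - group_low.mean():.2f}, p = {p:.4f}")
```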

Correlation coefficient

Various correlation coefficients are available for calculating correlations; which one is chosen depends on the scale of the data and the assumed form of the relationship. The two most important are the Pearson correlation coefficient and the Spearman correlation coefficient. The former is used when both variables to be correlated are metric (interval-scaled) and normally distributed. The Spearman correlation, on the other hand, is calculated from rank data and is therefore also suitable for ordinal and non-normally distributed data. Both coefficients take values in the interval between r = -1 and r = 1, where r = -1 describes a perfect negative and r = 1 a perfect positive relationship.
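
The following sketch (simulated data, not from the article) shows how both coefficients can be computed with SciPy's `pearsonr` and `spearmanr`, and how they diverge when the relationship is monotone but non-linear:

```python
# Compare Pearson and Spearman on data with a monotone but non-linear relationship.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=300)
y = np.exp(x) + rng.normal(0, 50, size=300)   # monotone, but strongly non-linear

r_pearson, p_pearson = pearsonr(x, y)
r_spearman, p_spearman = spearmanr(x, y)

print(f"Pearson  r = {r_pearson:.2f}")   # attenuated by the non-linearity
print(f"Spearman r = {r_spearman:.2f}")  # close to 1 for a monotone relationship
```

Because Spearman works on ranks, any monotone relationship pushes it toward 1 or -1, while Pearson only measures the strength of linear association.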

Practical use of correlations

In statistical practice, correlations are often used as part of exploratory data analysis, i.e. as a first indication of possible statistical effects that are then investigated further with more complex methods such as regression analysis. This is all the more relevant because a simple correlation analysis cannot control for further variables. It therefore implicitly assumes that only X affects Y and that no other factors influence Y, which is an extremely implausible assumption in most settings.
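
A minimal sketch of this point, with simulated data and a hypothetical confounder called income (none of these names come from the article): the simple correlation between working hours and restaurant visits looks substantial, yet a regression that also controls for income shows, in this simulation, no direct effect of hours at all:

```python
# A confounder ("income") drives both working hours and restaurant visits.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 1000

income = rng.normal(0, 1, size=n)                  # confounder
hours = 0.8 * income + rng.normal(0, 1, size=n)    # income influences hours
visits = 1.0 * income + rng.normal(0, 1, size=n)   # income influences visits; hours do not

print("correlation hours/visits:", np.corrcoef(hours, visits)[0, 1].round(2))

# Multiple regression controlling for income.
X = sm.add_constant(np.column_stack([hours, income]))
model = sm.OLS(visits, X).fit()
print(model.params.round(2))   # coefficient on hours shrinks toward 0 once income is controlled
```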

Summary

It is important to understand that statistical correlations alone cannot be used to make statements about causal relationships. All statistical models are merely abstractions of reality and in most cases will never be able to capture the actual causal relationship between variables. Or, to use the words of the famous statistician George Box: "All models are wrong ... but some of them are useful." If you need support in selecting or calculating correlations, our statistics team will be happy to help you.

About the author

Jakob Gepp

Numbers have always been my passion, and as a data scientist and statistician at STATWORX I can fulfill my nerdy needs. I am also responsible for our blog. So if you have any questions or suggestions, just send me an email!


STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. If you have questions or suggestions, please write us an e-mail addressed to blog (at) statworx.com.