Saturday, April 5, 2008

Cause & correlation

There is a common misconception in quantitative analysis - and in the general media - that because two things are strongly correlated, either positively or negatively, there exists a causal relationship between the two. That is, changes in one thing are causing changes in another.

A causal relationship is something like: I stub my toe, and my toe hurts. Or: a lightning strike can cause a bush-fire. A correlation is a different thing. For example, it has been shown that the incidence of crime is higher during hot weather, but it would not be true to say that hot weather causes crime.

We can measure the extent to which two variables are correlated using Pearson's correlation co-efficient. It is calculated from paired observations of two variables and indicates how closely a change in one is mirrored by a linear change in the other, on a scale from -1 (perfect negative correlation) to +1 (perfect positive correlation). For example, we might record the daily temperature and the incidence of crime (ignoring the possibility of reporting errors that might arise from the different weather conditions).

Plotting these on a graph might illustrate the presence of a relationship between these two variables. The correlation co-efficient quantifies the strength of this relationship.
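As a rough illustration of how the co-efficient is calculated, here is a minimal Python sketch. The function name `pearson_r` and the temperature and crime figures are invented for this example; they are not real data.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation co-efficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (proportional to covariance).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to each variance).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable never changes")
    return sxy / math.sqrt(sxx * syy)

# Invented figures, for illustration only: daily maximum temperature (Celsius)
# and the number of crimes reported on the same days.
temperatures = [18, 21, 24, 27, 30, 33, 36]
crimes = [40, 44, 43, 51, 55, 54, 62]

print(round(pearson_r(temperatures, crimes), 2))
```

A value close to +1 here would say only that the two series rise together in these made-up numbers; it says nothing about why.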

However, there are different kinds of causal relationship, and none of them relies on the correlation co-efficient as the determining factor.

Firstly, a variable might be a necessary condition if it must be present in order for the other condition (the outcome) to occur. An outcome might have several necessary conditions before it might eventuate. In the absence of any one, the outcome does not occur.

Alternatively, a variable might be a sufficient condition if it is enough to trigger the outcome. For example, a lightning strike is sufficient to start a bush-fire, although it is hardly necessary - bush fires can be started in any number of ways.

The strongest causal relationships exist when a condition is both necessary and sufficient to create an outcome. Getting shot by a gun is a necessary & sufficient condition for a gun-shot wound - to cite a trivial example.
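The difference between necessary and sufficient conditions can be sketched as two checks over a set of observations. This is only an illustration: the function names and the records are invented, and passing a check means the data are consistent with the condition being necessary or sufficient, not that it has been proven so.

```python
def is_necessary(observations):
    """True if the outcome never occurred without the condition being present."""
    return all(condition for condition, outcome in observations if outcome)

def is_sufficient(observations):
    """True if the condition was never present without the outcome occurring."""
    return all(outcome for condition, outcome in observations if condition)

# Invented records, for illustration only: (lightning_strike, bush_fire)
records = [
    (True, True),    # lightning, fire
    (False, True),   # no lightning, but a fire started some other way
    (False, False),  # no lightning, no fire
    (True, True),    # lightning, fire
]

print("necessary: ", is_necessary(records))
print("sufficient:", is_sufficient(records))
```

In these made-up records the second row is a fire without lightning, so lightning fails the necessary check while still passing the sufficient one. Note that both checks pass trivially on an empty list, which is one more reason to treat them as consistency checks rather than proof.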

In our user experience work, we often come across circumstances where we record an outcome and look for conditions to help explain why. For example, the movement of an ad banner might coincide with an increase in the click-through rate. As humans we would naturally assume a causal relationship between the new position and the increase. But a single occurrence may be nothing more than coincidence. If it happens consistently, we might suspect that there's something deeper going on.

However, it is important to understand that there's no quantitative test that one can perform that proves a causal relationship between an event and an outcome. You can establish a strong likelihood that a causal relationship exists - and you can search for a counter-example that rules one out - but you cannot prove causation.

So, be skeptical when you hear people quote a correlation co-efficient and then start to talk about cause.

[For a philosophical discussion of causality, Wikipedia offers a good starting point.]


Lary Stucker said...

People do not leave enough comments just to say "good article" but look, I just did... Good article.

Unknown said...

Thanks for the article...which raises another question about correlation analyses. A colleague recently chastised me by saying:
"Reporting p values with Pearson correlations is a wrong practice that people with little stats knowledge have perpetrated throughout the medical literature. Up until a few years ago, I never saw a paper with a p value after the r or r-squared. It means nothing. A correlation is a correlation. It's a continuous relationship. r-squared or r tell you how strong or weak the correlation is. There's no threshold and hence, no meaningful information contained in a p value with it."

Is this true?

Alan James Salmoni said...

Jeff: I hope Steve doesn't mind if I butt in here and try to answer your question. I don't think there is a problem with a p-value for correlations. There is a danger though - as sample size increases, the size of correlation needed to reach significance drops. With a correlation of 0.50 and a sample size of around 10, it would probably not be significant. If your sample increased to millions, it definitely would be, even with the same correlation. You need to be cautious about inferring anything from a correlation's p-value, which is why it was often not reported.
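
To make the sample-size point concrete, here is a small Python sketch. It uses the Fisher z-transformation with a normal approximation, which is a swapped-in shortcut rather than the exact t-test usually reported with a Pearson correlation, so the p-values are approximate. The function name `approx_p_value` is invented for this example.

```python
import math

def approx_p_value(r, n):
    """Approximate two-sided p-value for a Pearson correlation r from n pairs.

    Uses the Fisher z-transformation and a normal approximation (an
    assumption made for this sketch, not the exact t-test).
    """
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n < 4:
        raise ValueError("the approximation needs at least 4 pairs")
    # Under the null hypothesis of no correlation, atanh(r) is roughly
    # normal with standard deviation 1 / sqrt(n - 3).
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided tail probability of a standard normal beyond |z|.
    return math.erfc(abs(z) / math.sqrt(2))

# Same correlation, different sample sizes.
for n in (10, 20, 1_000_000):
    print(n, approx_p_value(0.5, n))
```

Under this approximation a correlation of 0.50 falls short of the usual 5% level with 10 pairs and passes it with 20, while with a million pairs even a very weak correlation would pass.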

Suze Ingram said...

I first read about causality vs. correlation a few months ago in "Freakonomics" by Steven Levitt and I've been thinking about it ever since. Great to read your thoughts on the topic, Doc!

markmatthewsphd said...

while correlation does not equal causality, correlation data can provide decision makers with a degree of relative accuracy when attempting to make a prediction... in your weather example, reliable and valid data correlating temperature and crime (depending on what type of crime of course), might provide law enforcement agencies a metric helpful in planning staffing needs during different times of the year... another great example is LSAT scores and first-year law school grades... strongly correlated... not causal...