Saturday, April 5, 2008

Cause & correlation

There is a common misconception in quantitative analysis - and in the general media - that because two things are strongly correlated, either positively or negatively, there exists a causal relationship between the two. That is, changes in one thing are causing changes in another.

A causal relationship is something like: I stub my toe, and my toe hurts. A lightning strike can cause bush-fires. Correlation on its own is a different matter. For example, it has been shown that the incidence of crime is higher during hot weather, but it does not follow that hot weather causes crime.

We can measure the extent to which two variables are correlated using Pearson's correlation co-efficient. This value takes two variables and looks at how closely a change in one variable is mirrored in the other. For example, we might record the daily temperature and the incidence of crime (ignoring the possibility of reporting errors that might arise from the different weather conditions).

Plotting these on a graph might illustrate the presence of a relationship between these two variables. The correlation co-efficient quantifies the strength of this relationship.

However, there are varying degrees of causality and none of them are reliant on the correlation co-efficient as the determining factor.

Firstly, a variable might be a necessary condition if it must be present in order for the other condition (the outcome) to occur. An outcome might have several necessary conditions before it might eventuate. In the absence of any one, the outcome does not occur.

Alternatively, a variable might be a sufficient condition if it is enough to trigger the outcome. For example, a lightning strike is sufficient to start a bush-fire, although it is hardly necessary - bush fires can be started in any number of ways.

The strongest causal relationships exist when a condition is both necessary and sufficient to create an outcome. Getting shot by a gun is a necessary & sufficient condition for a gun-shot wound - to cite a trivial example.

In our user experience work, we often come across circumstances where we record an outcome and look for conditions to help explain why. For example, the movement of an ad banner might coincide with an increase in the click-through rate. As humans we would naturally assign a causal relationship between the new position and the increase. But it may be coincidence only. If it happens consistently, we might suspect that there's something deeper going on.

However, it is important to understand that there's no quantitative test that one can perform that proves a causal relationship between an event and an outcome. You can determine the strong likelihood that a causal relationship exists - and you can search for an exception that proves the lack of a causal relationship - but no proof exists.

So, be skeptical when you hear people quote a correlation co-efficient and then start to talk about cause.

[For a philosophical discussion of causality, Wikipedia offers a good starting point.]

Wednesday, March 12, 2008

What is an 'average'?

The term 'average' is bandied about a lot in user research, and although most people have an idea of what might be meant by the term, it can have several different meanings depending on the context and the data being analysed. So, to lay the second plank of our foundation concepts, let's take a look at the different types of average and the ways in which we might use or understand this term.

Averages help us understand the 'middle' of a set of numbers, but they have another purpose too: to estimate the value we could expect to see if we selected a single value at random. There are five commonly-used 'averages':
  • mode: the most commonly-occurring value
  • median: the middle value, when all values are ordered
  • mean: a calculated value, summing all observed values and dividing by the number of values
  • weighted average: a mean where some values are given greater weight than others
  • moving average: a mean where only the last n values are included in the calculation.

Modes
Modes are typically used when the data is categorical, in whatever form. For example, when analysing the data from a survey we might have a response to Gender with Male (78) and Female (64). The mode in this case would be male.

We might also have a situation where it isn't meaningful to report a figure that isn't a whole number - which could easily occur in each of the other types of average. Say, for example, our data is recording the number of tertiary qualifications held by UX practitioners. Our responses range from 0 upwards. Let's say we have the following table of responses.

Qualifications   Respondents
0                7
1                26
2                23
3                9
4                1
Total (n)        66

Now, we could calculate a mean, or a median for that matter, but it doesn't make sense to report that UX practitioners have, on average, 1.560606 tertiary qualifications. Either you have a qualification; or you don't. In this case, the mode, 1 qualification, slightly under-represents what we would expect to receive in response if we asked a UX practitioner the question at random.

Medians
The term median means, literally, middle. To find the median value, we rank all of the observed values in order, and select the value that falls in the middle of the ordered list. In the above example, this would look something like: 000000011111111111111111111111111222222222222222222222223333333334. The middle of this sequence falls between 1 and 2, so our median value is half-way between these, i.e. 1.5.

Now, as mentioned above, this doesn't make practical sense, but it does illustrate the concept of medians. It also illustrates the need to take care when calculating an average!
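As a rough check on the three figures quoted above for the qualifications data, the sketch below expands the table into individual responses and uses Python's standard `statistics` module. The variable names are ours, chosen purely for illustration.

```python
from statistics import mean, median, mode

# Frequency table from the article: qualifications held -> number of respondents.
counts = {0: 7, 1: 26, 2: 23, 3: 9, 4: 1}

# Expand the table into one entry per respondent (66 values in total).
responses = [q for q, n in counts.items() for _ in range(n)]

qual_mode = mode(responses)      # most commonly occurring value
qual_median = median(responses)  # middle of the ordered list; averages the two middle values
qual_mean = mean(responses)      # sum of the values divided by how many there are

print(len(responses), qual_mode, qual_median, round(qual_mean, 6))
```

With an even number of responses, `median` averages the 33rd and 34th values, which is why it lands between 1 and 2.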

Means
The mean is what most people think of when they hear the term 'average'. It is also called the 'arithmetic mean' and is calculated by adding up all of the observations and dividing by the number of observations.

A mean is very useful for characterizing the expected value of a collection of observations. It can accommodate the most common form of measured data: continuous data. For example, we can calculate a mean time-to-completion for a usability task, and the figure is meaningful whatever the individual results. In our previous article on time-to-completion data, we calculated a mean figure of 143.8725s.

Weighted Average
There are occasions when we need to give more weight to one set of observations versus another. An example might be the page view data for a Web site as a predictor for tomorrow's traffic. Many Web sites are cyclical in terms of the peaks and troughs in their visitor numbers. So in trying to determine tomorrow's traffic, today's traffic numbers are less important than, say, the traffic from a week ago.

Let's say our traffic looked a little like this:
Monday 12,358
Tuesday 14,122
Wednesday 14,823
Thursday 13,905
Friday 13,733
Saturday 11,064
Sunday 8,899

A straight mean calculation places as much importance on last Monday's traffic as it does on yesterday's. However, from a forecasting perspective, last Monday is likely to be a better indicator for this Monday, so we can give it more weight, like so:

Day                Page views   Weighting   Weighted page views
Monday             12358        4           49432
Tuesday            14122        1           14122
Wednesday          14823        1           14823
Thursday           13905        1           13905
Friday             13733        1           13733
Saturday           11064        1           11064
Sunday             8899         1           8899
Total                           10
Weighted average                            12597.8

We've increased the influence that last Monday's observation will have on the predicted value by giving it a weighting factor of 4. In doing so we increase the overall number of 'observations' from 7 to 10. Our weighted average is 12,597.8, as a forecast for this Monday's traffic. This compares to a straight mean of 12,700.57. So our weighting provides us with a reduced prediction.
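The weighted calculation can be sketched in a few lines of Python. This is only an illustration using the page view figures above; the dictionary layout and variable names are our own assumptions.

```python
# Last week's page views, as listed in the article.
page_views = {
    "Monday": 12358,
    "Tuesday": 14122,
    "Wednesday": 14823,
    "Thursday": 13905,
    "Friday": 13733,
    "Saturday": 11064,
    "Sunday": 8899,
}

# Every day gets a weighting of 1, except last Monday, which gets 4.
weights = {day: 1 for day in page_views}
weights["Monday"] = 4

total_weight = sum(weights.values())
weighted_avg = sum(page_views[d] * weights[d] for d in page_views) / total_weight
straight_mean = sum(page_views.values()) / len(page_views)

print(total_weight, weighted_avg, round(straight_mean, 2))
```

Note that the divisor is the sum of the weights (10), not the number of days (7).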

Moving Average
The last type of average helps us to deal with time series data - observations made over a period on a regular basis. It recognises that when calculating an expected value, the most recent observations are likely to be better predictors than data going back to the earliest observations made. This is particularly true of something like Web site traffic data, where the overall size of the pool of potential visitors is increasing, so we would expect the overall traffic to be increasing also.

Moving averages are used frequently in economics, particularly with respect to share prices, where high volatility makes older historical data a poor guide to current values.

A moving average is usually calculated on an on-going basis using the last n observations. Examples might be to use a 5-day moving average, or a 20-day moving average as part of our analysis. So let's say we are tracking our page view data (from above) for a period of three weeks. The 5-day average can be calculated from the fifth day onwards:

Day         Page views   5-day average
Monday      12,358
Tuesday     14,122
Wednesday   14,823
Thursday    13,905
Friday      13,733       13,788.2
Saturday    11,064       13,529.4
Sunday      8,899        12,484.8
Monday      12,589       12,038.0
Tuesday     14,222       12,101.4
Wednesday   14,813       12,317.4
Thursday    14,099       12,924.4
Friday      14,011       13,946.8
Saturday    10,781       13,585.2
Sunday      9,203        12,581.4
Monday      12,993       12,217.4
Tuesday     14,330       12,263.6
Wednesday   15,198       12,501.0
Thursday    14,078       13,160.4
Friday      14,215       14,162.8
Saturday    11,144       13,793.0
Sunday      9,126        12,752.2

You can see that the average changes each day, which is the point.
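For anyone wanting to reproduce the 5-day column, here is a minimal sketch in Python. The function name `moving_average` is our own, not a library call, and the list holds the three weeks of page views from the table.

```python
def moving_average(values, n):
    """Return the mean of each run of n consecutive values, oldest first."""
    return [sum(values[i - n:i]) / n for i in range(n, len(values) + 1)]

# Three weeks of daily page views, Monday to Sunday, from the article.
page_views = [
    12358, 14122, 14823, 13905, 13733, 11064, 8899,
    12589, 14222, 14813, 14099, 14011, 10781, 9203,
    12993, 14330, 15198, 14078, 14215, 11144, 9126,
]

five_day = moving_average(page_views, 5)

# The first average lines up with the fifth observation (the first Friday).
for views, avg in zip(page_views[4:], five_day):
    print(views, round(avg, 1))
```

The result has four fewer entries than the input, because no 5-day average exists until five observations have been recorded.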

Different types of 'average' are useful in different circumstances: something that we'll touch on in future articles in the series.

Thursday, January 17, 2008

Calculating correlation co-efficients

The correlation co-efficient for a set of pairs of data provides a measure of the strength and direction (positive or negative) of the linear relationship between two variables. The most commonly used is the Pearson correlation co-efficient, which uses a least squares method of calculating the dispersion of the data pairs from a theoretical straight-line (linear) relationship.

The correlation co-efficient, r, for a set of (x,y) data pairs is calculated as follows:

$$r = \frac{\sum (x - x^{*})(y - y^{*})}{n \, s_x \, s_y}$$

The following steps can be followed:
  1. Calculate the average values for both x (x*) & y (y*);
  2. For each row, calculate (x - x*) and (y - y*);
  3. For each row, calculate (x - x*)^2, (y - y*)^2, and (x - x*)(y - y*);
  4. Add up the values in each column, and store the totals;
  5. For both x & y values, calculate the standard deviation, sx and sy, by dividing the totals for (x - x*)^2 and (y - y*)^2 by the number of rows and taking the square root of each result;
  6. Calculate r by dividing the total for the column of (x - x*)(y - y*) by (n × sx × sy), where n is the number of rows in the table (i.e. the number of x,y pairs).

The following table should help to illustrate the calculation:


x        y      (x-x*)  (x-x*)^2  (y-y*)     (y-y*)^2      (x-x*)(y-y*)
1        8.56   -4      16        1.865556   3.480297531   -7.46222222
2        8.23   -3      9         1.535556   2.357930864   -4.60666667
3        7.62   -2      4         0.925556   0.856653086   -1.85111111
4        7.12   -1      1         0.425556   0.181097531   -0.42555556
5        6.99   0       0         0.295556   0.087353086   0
6        7.05   1       1         0.355556   0.126419753   0.355555556
7        4.98   2       4         -1.71444   2.939319753   -3.42888889
8        5.37   3       9         -1.32444   1.754153086   -3.97333333
9        4.33   4       16        -2.36444   5.590597531   -9.45777778
Total                   60                   17.37382222   -30.85
Mean     x* = 5, y* = 6.694444
Std dev  sx = 2.581989, sy = 1.38939724
r = -0.9555



In the above table, the x values represent the number of guests, and the y values represent the conversion rate given as a percentage. The columns headed (x-x*)^2 and (y-y*)^2 are used in the calculation of the standard deviations for x and y (sx and sy). Once the last column is calculated, the values are totaled, giving the numerator (upper value of the fraction) in the equation for r.

For the above example, r is calculated as:

$$r = \frac{-30.85}{9 \times 2.581989 \times 1.38939724} \approx -0.9555$$

The correlation co-efficient helps us determine whether a linear relationship exists between the variables. A strong correlation does not by itself indicate a causal relationship in the data, although causal relationships often show strong correlation.

Note: the value for the correlation co-efficient r can range from -1 to 1. A value towards either end of the range indicates a strong correlation between the variables; values close to 0 indicate very little or no correlation.
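The six steps listed earlier in this article can be sketched in Python as follows. The function name `pearson_r` and the variable names are our own; the data are the nine guest and conversion-rate pairs from the worked table, and the standard deviations divide by n, as in step 5.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation co-efficient, following the article's six steps."""
    n = len(xs)
    x_mean = sum(xs) / n                                  # step 1
    y_mean = sum(ys) / n
    dx = [x - x_mean for x in xs]                         # step 2
    dy = [y - y_mean for y in ys]
    sum_dx2 = sum(d * d for d in dx)                      # steps 3 and 4
    sum_dy2 = sum(d * d for d in dy)
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sx = sqrt(sum_dx2 / n)                                # step 5
    sy = sqrt(sum_dy2 / n)
    return sum_dxdy / (n * sx * sy)                       # step 6

guests = [1, 2, 3, 4, 5, 6, 7, 8, 9]
conversion = [8.56, 8.23, 7.62, 7.12, 6.99, 7.05, 4.98, 5.37, 4.33]

r = pearson_r(guests, conversion)
print(round(r, 4))
```

The negative sign tells us that conversion rate falls as the number of guests rises; it says nothing about why.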

Wednesday, January 16, 2008

Analysing time-to-completion data

Let's start off with a simple example: we're going to analyse data on the time-to-completion for a Web site booking process. For our purposes, the process has five steps to it, with step 5 representing the confirmation or 'thank you' page presented at the end of the transaction.

In this example we're interested only in the time taken by a customer to complete a booking transaction. We're not interested in abandonment or completion rates; we're not interested in why or where a booking was abandoned. (Note: these are important things to consider on your site, but not the focus of this article.)

Our data is recorded in our site database as a series of date-time pairs:
  • transaction start time;
  • transaction end time.
From this we can calculate a time in seconds for the completion of the transaction. The server is probably capable of presenting us with values to as many decimal places as we like, but let's keep it at three for now (giving us milliseconds). For example, our time-to-completion for our first booking might be 136.277s.

Now, because our server can provide us with whatever level of precision we like, this data can be classified as continuous, quantitative data.

Suppose we have the following data for Day 1 of our study:
1 136.277
2 191.119
3 122.98
4 177.538
5 135.788
6 142.225
7 138.802
8 141.003
9 119.65
10 133.343

Firstly, we're interested in the fastest (or shortest) completion and the slowest (or longest). This is easy to find by just sorting the data:
1 119.65
2 122.98
3 133.343
4 135.788
5 136.277
6 138.802
7 141.003
8 142.225
9 177.538
10 191.119
We can see that the fastest transaction was completed in slightly under 2 minutes, and the slowest took just over 3 minutes 11 seconds. That also tells us that the transaction completion times spanned 71.469s.
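The same sort-and-compare can be done in a couple of lines of Python. This is just a sketch with the Day 1 figures typed in by hand; the variable names are ours.

```python
# Day 1 completion times in seconds, in the order they were recorded.
times = [136.277, 191.119, 122.98, 177.538, 135.788,
         142.225, 138.802, 141.003, 119.65, 133.343]

ordered = sorted(times)
fastest, slowest = ordered[0], ordered[-1]
spread = slowest - fastest  # the range of the data

print(fastest, slowest, round(spread, 3))
```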

It's often easier to visualize the data by creating a chart of some sort. For continuous data like we have here, a scatterplot can be revealing...

You can see that the values for completion time are mostly grouped around the 130-140s mark, with some lower and a couple much higher.

We may want to know what was going on at the time the two high completion times occurred:
  • were these more complicated transactions?
  • was the server performing slowly at those times?
However, those questions are more about Web analytics and less about statistics, so we'll press on.

One thing we are interested in is knowing exactly where the 'average' completion time lies. This is an easy figure for most people to understand, and it's a nice, single figure to use as a benchmark. So, in this case, we can determine the 'average' - by which most people mean 'the middle' - in one of two ways:
  1. the median value - or the value of the data point that falls in the middle of our ordered data; or
  2. the arithmetic mean - a calculated figure which is the sum of all observed values divided by the total number of observations (add 'em up and divide by how many there are!).
Our median figure actually falls between the 5th & 6th observations, so we add them together and halve the result. This gives us a median of 137.5395s.

Our arithmetic mean (what most people think of when they say 'average') works out as 143.8725s. The mean is usually denoted by x̄ (x-bar) for a sample or μ (mu) for a population, but we'll just use x*.
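Both figures are easy to confirm with Python's standard `statistics` module. The sketch below assumes the ten Day 1 times are held in a plain list; the names are ours.

```python
from statistics import mean, median

# Day 1 completion times in seconds.
times = [136.277, 191.119, 122.98, 177.538, 135.788,
         142.225, 138.802, 141.003, 119.65, 133.343]

middle = median(times)   # average of the 5th and 6th ordered values
x_star = mean(times)     # add 'em up and divide by how many there are

print(round(middle, 4), round(x_star, 4))
```

The mean sits about six seconds above the median here because the two slow bookings pull it upwards.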

Which one should we use? It doesn't matter as long as you clearly indicate which it is that you're presenting. Both provide a meaningful measure of the 'middle' of the data. However, it is less ambiguous, and less prone to confusion, to quote the mean figure for continuous data.

We can investigate a little further now, to see how much variability there is in the data. Many data sets of random measurements approximately follow what is known as a Normal (or Gaussian) distribution, otherwise known as a Bell Curve. One of the characteristics of this distribution is that observations are concentrated around a central point - the mean - and range outwards at known rates based on the number of standard deviations.

Standard deviation is a measure of the variability in the data and is calculated as follows:
$$s = \sqrt{\frac{\sum (x - x^{*})^2}{n}}$$

To demonstrate with our data:

Obs     x         x*         (x-x*)     (x-x*)^2
1       119.65    143.8725   -24.2225   586.7295
2       122.98    143.8725   -20.8925   436.4966
3       133.343   143.8725   -10.5295   110.8704
4       135.788   143.8725   -8.0845    65.35914
5       136.277   143.8725   -7.5955    57.69162
6       138.802   143.8725   -5.0705    25.70997
7       141.003   143.8725   -2.8695    8.23403
8       142.225   143.8725   -1.6475    2.714256
9       177.538   143.8725   33.6655    1133.366
10      191.119   143.8725   47.2465    2232.232
Total                                   4659.403
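The table can be reproduced with a short Python sketch. We divide the sum of squares by n, treating the ten bookings as the whole population, and show the sample version (dividing by n - 1) alongside for comparison. The variable names are ours.

```python
from math import sqrt
from statistics import pstdev, stdev

# Day 1 completion times in seconds.
times = [136.277, 191.119, 122.98, 177.538, 135.788,
         142.225, 138.802, 141.003, 119.65, 133.343]

n = len(times)
x_star = sum(times) / n
sum_sq = sum((t - x_star) ** 2 for t in times)  # total of the (x-x*)^2 column
variance = sum_sq / n                           # population variance
sd = sqrt(variance)                             # population standard deviation

# Cross-check against the standard library, plus the sample (n - 1) figure.
library_sd = pstdev(times)
sample_sd = stdev(times)

print(round(sum_sq, 3), round(variance, 4), round(sd, 4), round(sample_sd, 4))
```

The sample figure is always a little larger than the population figure for the same data, because it divides by a smaller number.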

Our variance is given as 4659.403/10 = 465.9403. And our standard deviation s is the square root of the variance, 21.5857. [Note: Microsoft Excel's STDEV function assumes the data is a sample rather than the population itself, and divides by n - 1 rather than n, so its result will differ. We'll discuss the difference between sample and population data in another article.]

If our data were 'normal' - i.e. bell-shaped - we could estimate the upper and lower limits of our 'expected' data based on our mean and standard deviation. We expect roughly 2/3 of all values to fall within one standard deviation of the mean, or somewhere between 122.29s (143.8725 - 21.5857) and 165.46s (143.8725 + 21.5857). We expect roughly 95.5% of all observed values to fall within 2 standard deviations of the mean, or somewhere between 100.70s (143.8725 - 2 x 21.5857) and 187.04s (143.8725 + 2 x 21.5857).

Three standard deviations either side of the mean should account for around 99.7% of all observed values - giving a range of between 79.12s and 208.63s.

You can see from these ranges that all of our completion times fall within three standard deviations of the mean. Statistically speaking, we'd call this data 'normal' - in the 'everyday' sense of the word. In other words, although those two higher values seem to stick out of the scatterplot, statistically they're not atypical.
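To see how many of our ten observations actually fall inside each band, we can count them. This is a rough sketch using the population standard deviation; the `within` dictionary is our own device for holding the counts.

```python
from statistics import mean, pstdev

# Day 1 completion times in seconds.
times = [136.277, 191.119, 122.98, 177.538, 135.788,
         142.225, 138.802, 141.003, 119.65, 133.343]

centre = mean(times)
sd = pstdev(times)  # population standard deviation (divides by n)

within = {}
for k in (1, 2, 3):
    lower, upper = centre - k * sd, centre + k * sd
    within[k] = sum(lower <= t <= upper for t in times)
    print(f"{k} sd: {lower:.2f}s to {upper:.2f}s contains {within[k]} of {len(times)}")
```

On the Day 1 data only the slowest booking (191.119s) falls outside two standard deviations, and nothing falls outside three.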

We need to be careful at this point not to draw too many conclusions from this one day's worth of data. We could use this data to attempt to predict what tomorrow's completion times might look like, but the reality is that it wouldn't tell us much with only 10 observations from which to draw. However, that's a topic for another day when we discuss statistical inferences and estimating population parameters based on a test sample. But first (and next) we need to discuss sampling techniques: the next topic in this series.