Wednesday, January 16, 2008

Analysing time-to-completion data

Let's start off with a simple example: we're going to analyse data on the time-to-completion for a Web site booking process. For our purposes, the process has five steps to it, with step 5 representing the confirmation or 'thank you' page presented at the end of the transaction.

In this example we're interested only in the time taken by a customer to complete a booking transaction. We're not interested in abandonment or completion rates; we're not interested in why or where a booking was abandoned. (Note: these are important things to consider on your site, but not the focus of this article.)

Our data is recorded in our site database as a series of date-time pairs:
  • transaction start time;
  • transaction end time.
From this we can calculate a time in seconds for the completion of the transaction. The server is probably capable of presenting us with values to as many decimal places as we like, but let's keep it at three for now (giving us milliseconds). For example, the time-to-completion for our first booking might be 136.277s.
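In code, that calculation is just the subtraction of two timestamps. Here's a minimal Python sketch - the timestamp format and the sample values are assumptions for illustration, not taken from a real booking system:

```python
from datetime import datetime

def completion_seconds(start: str, end: str) -> float:
    """Time-to-completion in seconds, to three decimal places (milliseconds)."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"   # assumed timestamp format
    started = datetime.strptime(start, fmt)
    finished = datetime.strptime(end, fmt)
    return round((finished - started).total_seconds(), 3)

# Hypothetical start/end pair, chosen so the result matches our first booking:
print(completion_seconds("2008-01-15 14:02:01.103", "2008-01-15 14:04:17.380"))
# -> 136.277
```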

Now, because our server can provide us with whatever level of precision we like, this data can be classified as continuous, quantitative data.

Suppose we have the following data for Day 1 of our study:
1 136.277
2 191.119
3 122.98
4 177.538
5 135.788
6 142.225
7 138.802
8 141.003
9 119.65
10 133.343

Firstly, we're interested in the fastest (or shortest) completion and the slowest (or longest). These are easy to find by simply sorting the data:
1 119.65
2 122.98
3 133.343
4 135.788
5 136.277
6 138.802
7 141.003
8 142.225
9 177.538
10 191.119
We can see that the fastest transaction was completed in slightly under 2 minutes, and the slowest took just over 3 minutes 11 seconds. That also tells us that the transaction completion times spanned a range of 71.469s.
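If you'd rather let the computer do the sorting, the same figures fall out of a few lines of Python (a sketch using the Day 1 values above):

```python
times = [136.277, 191.119, 122.98, 177.538, 135.788,
         142.225, 138.802, 141.003, 119.65, 133.343]

fastest = min(times)           # 119.65  - just under 2 minutes
slowest = max(times)           # 191.119 - just over 3 minutes 11 seconds
spread = slowest - fastest     # 71.469  - the span of completion times

print(sorted(times))           # the ordered list shown above
```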

It's often easier to visualize the data by creating a chart of some sort. For continuous data like we have here, a scatterplot can be revealing...
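One quick way to draw such a scatterplot is with matplotlib; a minimal sketch, assuming the Day 1 times are sitting in a Python list:

```python
import matplotlib.pyplot as plt

times = [136.277, 191.119, 122.98, 177.538, 135.788,
         142.225, 138.802, 141.003, 119.65, 133.343]

plt.scatter(range(1, len(times) + 1), times)   # booking number vs completion time
plt.xlabel("Booking number")
plt.ylabel("Time to completion (s)")
plt.title("Day 1 booking completion times")
plt.show()
```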

You can see that the values for completion time are mostly grouped around the 130-140s mark, with some lower and a couple much higher.

We may want to know what was going on at the time the two high completion times occurred:
  • were these more complicated transactions?
  • was the server performing slowly at those times?
However, those questions are more about Web analytics and less about statistics, so we'll press on.

One thing we are interested in is knowing exactly where the 'average' completion time lies. This is an easy figure for most people to understand, and it's a nice, single figure to use as a benchmark. So, in this case, we can determine the 'average' - by which most people mean 'the middle' - in one of two ways:
  1. the median value - or the value of the data point that falls in the middle of our ordered data; or
  2. the arithmetic mean - a calculated figure which is the sum of all observed values divided by the total number of observations (add 'em up and divide by how many there are!).
Our median figure actually falls between the 5th and 6th observations, so we add those two values together and halve the result. This gives us a median of 137.5395s.

Our arithmetic mean (what most people think of when they say 'average') works out as 143.8725s. The mean of a sample is usually denoted by x̄ ('x-bar') or μ (mu), depending on the context, but we'll just use x̄.
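Both figures are one-liners if you have Python's statistics module to hand - a quick sketch to check the arithmetic above:

```python
import statistics

times = [119.65, 122.98, 133.343, 135.788, 136.277,
         138.802, 141.003, 142.225, 177.538, 191.119]

print(statistics.median(times))   # 137.5395 - halfway between the 5th and 6th values
print(statistics.mean(times))     # 143.8725 - the sum of all ten values divided by 10
```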

Which one should we use? It doesn't matter as long as you clearly indicate which it is that you're presenting. Both provide a meaningful measure of the 'middle' of the data. However, it is less ambiguous, and less prone to confusion, to quote the mean figure for continuous data.

We can investigate a little further now, to see how much variability there is in the data. A typical data set of random measurements follows what is known as a Normal (or Gaussian) distribution, otherwise known as a Bell Curve. One of the characteristics of this distribution is that observations are concentrated around a central point - the mean - and spread outwards at known rates based on the number of standard deviations.

Standard deviation is a measure of the variability in the data and is calculated as follows:
s = sqrt( Σ(x - x̄)² / n )

To demonstrate with our data:

      x         x̄          (x - x̄)     (x - x̄)²
 1    119.65    143.8725    -24.2225     586.7295
 2    122.98    143.8725    -20.8925     436.4966
 3    133.343   143.8725    -10.5295     110.8704
 4    135.788   143.8725     -8.0845      65.35914
 5    136.277   143.8725     -7.5955      57.69162
 6    138.802   143.8725     -5.0705      25.70997
 7    141.003   143.8725     -2.8695       8.23403
 8    142.225   143.8725     -1.6475       2.714256
 9    177.538   143.8725     33.6655    1133.366
10    191.119   143.8725     47.2465    2232.232
                             Total      4659.403

Our variance is given as 4659.403/10 = 465.9403. And our standard deviation s is the square root of the variance, 21.5857. [Note: the formulae most commonly used in Microsoft Excel assume the data is a sample rather than the population itself, so its results will differ slightly. We'll discuss the difference between sample and population data in another article.]
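In Python the whole working above collapses into a few lines, and the sample-versus-population distinction in the Excel note is explicit in the function names - pvariance/pstdev divide by n, while variance/stdev divide by n - 1 (a sketch, using the Day 1 data):

```python
import math
import statistics

times = [119.65, 122.98, 133.343, 135.788, 136.277,
         138.802, 141.003, 142.225, 177.538, 191.119]

mean = statistics.mean(times)
squared_deviations = [(x - mean) ** 2 for x in times]   # the last column of the table

variance = sum(squared_deviations) / len(times)         # population variance, ~465.94
std_dev = math.sqrt(variance)                           # population standard deviation, ~21.59

print(statistics.pvariance(times), statistics.pstdev(times))   # divide by n (as above)
print(statistics.variance(times), statistics.stdev(times))     # divide by n - 1 (Excel's STDEV/VAR)
```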

If our data were 'normal' - i.e. bell-shaped - we can estimate the upper and lower limits of our 'expected' data based on our mean and standard deviation. We expect roughly two-thirds (about 68%) of all values to fall within one standard deviation of the mean, or somewhere between 122.29s (143.8725 - 21.5857) and 165.46s (143.8725 + 21.5857). We expect roughly 95.5% of all observed values to fall within 2 standard deviations of the mean, or somewhere between 100.70s (143.8725 - 2 x 21.5857) and 187.04s (143.8725 + 2 x 21.5857).

Three standard deviations either side of the mean should account for around 99.7% of all observed values - giving a range of between 79.12s and 208.63s.
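The same ranges - and a count of how many of our ten observations actually land inside each - can be checked directly (a sketch, continuing with the Day 1 data):

```python
import statistics

times = [119.65, 122.98, 133.343, 135.788, 136.277,
         138.802, 141.003, 142.225, 177.538, 191.119]

mean = statistics.mean(times)     # 143.8725
sd = statistics.pstdev(times)     # ~21.5857 (population standard deviation)

for k in (1, 2, 3):
    lower, upper = mean - k * sd, mean + k * sd
    inside = sum(lower <= t <= upper for t in times)
    print(f"within {k} SD: {lower:.2f}s to {upper:.2f}s "
          f"({inside} of {len(times)} observations)")
```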

You can see from these ranges that all of our completion times fall within three standard deviations of the mean. Statistically speaking, we'd call this data 'normal' - in the 'everyday' sense of the word. In other words, although those two higher values seem to stick out of the scatterplot, statistically they're not atypical.

We need to be careful at this point not to draw too many conclusions from this one day's worth of data. We could use this data to attempt to predict what tomorrow's completion times might look like, but the reality is that it wouldn't tell us much with only 10 observations from which to draw. However, that's a topic for another day when we discuss statistical inferences and estimating population parameters based on a test sample. But first (and next) we need to discuss sampling techniques: the next topic in this series.

1 comment:

Steve 'Doc' Baty said...

Note: the Excel functions mentioned above are stdev() for standard deviation, var() for variance, and average() for the mean.