Thursday, December 20, 2007

The nature of data

“He uses statistics as a drunken man uses lamp-posts
for support rather than illumination.”
Andrew Lang (1844-1912)
One of the most basic elements of user research, or any form of research for that matter, is the data that we gather from our tests & measurements. Yet for many people, the distinction between the different types of data we gather is a mystery. So before we get into any of the actual analysis techniques we might bring to bear on our data, let's go over this fundamental concept - it will help frame the rest of what we do further along.

Data can be broken down into two broad categories:
  1. Category - something that is characterised
  2. Quantity - something that is measured
Categorical Data
Looking at categorical data first, there are two main types:
  • Nominal (named) categories
  • Ordinal (ordered) categories
A nominal category is one where you essentially apply a label to the object, but that label carries no inherent 'value'. For example, dividing a group of people up by gender (male, female) or religion (Christian, Muslim, Buddhist, Atheist, Agnostic etc) allows us to assign one label or another. But there is no 'value' associated by which we could order or rank the labels.
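
As a rough sketch of what that means in practice, nominal labels can be counted and compared for equality, but nothing in the data tells us how to order them. The responses below are invented purely for illustration:

```python
from collections import Counter

# Hypothetical survey responses; the values are made up for illustration.
responses = [
    "Christian", "Atheist", "Buddhist", "Christian",
    "Muslim", "Atheist", "Christian",
]

# Counting how often each label occurs is a valid operation on nominal data.
counts = Counter(responses)

# The most frequent label (the mode) is meaningful...
most_common_label, most_common_count = counts.most_common(1)[0]
print(most_common_label, most_common_count)

# ...but sorting the labels would only give alphabetical order,
# which says nothing about the categories themselves.
```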

An ordinal category is one where the label does carry some inherent value by which the resulting divisions can be ordered (thus the name). There are two types of ordinal categories - ordered, & ranked.

Ordered categories are those where we apply labels such as "Good", "Better", "Best" and multiple objects within the group can have the same label. Another example might be that most common of resume techniques - the 'Novice --> Expert' categorisation used for technology competencies.

By contrast, in a Ranked categorisation we don't apply a specific label to each object. Instead, each object is ordered according to the category - say, height, or years of practice - and assigned its rank in that ordered collection.

Let's say I have a group of people and I am categorising them by height in a ranked manner. My hypothesis might be something like: "The shortest person in the room is usually the smartest". You can see that I'm not interested in knowing the actual, measured height of the shortest person; all I'm interested in is being able to differentiate them from everyone else.

So, I take my room full of people and I line them up from tallest to shortest. In a ranked categorisation, the data I record is just their rank order: 1, 2, 3 etc.
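
Here is a minimal sketch of that line-up. The names and heights are hypothetical, and the heights only stand in for the physical act of lining people up; the point is that the recorded data is the rank alone. The helper rank_by_height is my own name for illustration, and it does not handle ties.

```python
# Hypothetical people and heights (cm), used only to simulate the line-up.
heights_cm = {"Ana": 182.0, "Ben": 171.5, "Cleo": 165.0, "Dev": 190.2}

def rank_by_height(heights):
    """Return {name: rank}, where rank 1 is the tallest person."""
    ordered = sorted(heights, key=heights.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}

# The ranked data keeps the order but throws the measurements away.
ranks = rank_by_height(heights_cm)
print(ranks)
```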

Something else to take note of with both ordered & ranked categories: although the labeling scheme allows me to arrange the data in some structured manner, there is no inherent measure of distance between category labels. Let me give two examples to illustrate:

1. Ordered categorisation: "Novice, Intermediate, Experienced, Expert". There is no way to judge from this classification just how much 'better' an "Intermediate" practitioner is than a "Novice". How big is the gap between "Experienced" and "Expert"? All we can tell is that one is better than the other.

2. Ranked categorisation: looking at my room full of people lined up from tallest to shortest, we don't know (and don't care) what the gap in height is between the tallest and the next tallest person. Is it 10cm? 20cm? It might be 40cm, but this has no bearing on our classification.
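
To illustrate the ordered case, here is a small sketch in which the labels can be compared but not subtracted. The skills listed and the is_better helper are assumptions made up for this example:

```python
# The order of this list is the only 'value' the labels carry.
LEVELS = ["Novice", "Intermediate", "Experienced", "Expert"]

def is_better(a, b):
    """Return True if level a sits higher in the ordering than level b."""
    return LEVELS.index(a) > LEVELS.index(b)

# Hypothetical resume entries; several items can share the same label.
skills = {"HTML": "Expert", "SQL": "Intermediate", "Python": "Intermediate"}

print(is_better(skills["HTML"], skills["SQL"]))    # comparison is meaningful
print(is_better(skills["SQL"], skills["Python"]))  # equal labels: neither is better

# Subtracting list positions would produce a number, but that number is an
# artefact of the encoding, not a measure of the gap between two levels.
```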

Quantity Data
Quantitative data is probably the most easily understood (and most abused) form of data. It comes in two forms:
  • discrete (counts)
  • continuous (measurements)
The easiest way to understand the distinction between the two types of quantitative data is that continuous data can be measured to whatever degree of precision you choose, whereas discrete data goes up in 'steps' (usually whole numbers).

For example, if we are recording the number of gears on a bicycle, this would be a discrete type of data as it doesn't make sense to record "2.6" gears. Something like the time-to-completion of a usability task would be treated as a continuous measurement, since we could record the data as '2 minutes', '2 minutes 23 seconds', or '2 minutes 23.275 seconds' depending on our desired level of precision.
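
The same distinction can be sketched in code. The values and the format_duration helper below are hypothetical; the idea is simply that a count is stored as a whole number, while a measurement can be reported at whatever precision we choose:

```python
# Discrete: a count goes up in whole-number steps.
gear_count = 21

# Continuous: a measurement, here a hypothetical task time in seconds.
elapsed_seconds = 143.275

def format_duration(seconds, decimals=0):
    """Format a duration as minutes and seconds at the chosen precision."""
    # Round first so the seconds part never displays as 60.
    seconds = round(seconds, decimals)
    minutes, remainder = divmod(seconds, 60)
    return f"{int(minutes)} min {remainder:.{decimals}f} s"

# The same measurement reported at two levels of precision.
print(format_duration(elapsed_seconds))              # nearest second
print(format_duration(elapsed_seconds, decimals=3))  # nearest millisecond
```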


Garumoo said...

Nice recap Steve. What irks me is when someone measures something as Category data when instead it could be measured for Quantity data.

What are some reasons why I might choose to go straight to a Category/Ordinal data framework, and avoid gathering Quantity data in the first place (which could then of course be abstracted to Category/Ordinal)?

Steve 'Doc' Baty said...

Eric, measuring is typically more time-consuming and might therefore be avoided if under time/resource constraints.

Take my example of the room-full of people: I could have measured the height of all of them, loaded it into a spreadsheet, and sorted it in ascending/descending order. But that would take significantly longer than simply asking the people to arrange themselves. Since their actual height isn't the focus of the research, you could decide to save yourself that time & use it elsewhere instead.