This supplementary section (which takes off from the section of the same name in the book version of PSY-Q) explains what statistical testing is and why we need to bother, and then goes on to introduce between- and within-subjects t-tests, with interactive examples using Excel.
Statistical Testing: What’s the point?
When I was growing up, a common refrain in our household was “Cup of tea please, Ben; and milk in first”. To this day, my mother swears blind that tea tastes better if the milk is added before the tea, rather than after, and continues to sip suspiciously any cup of tea that she is offered in order to check that her rule has been followed. Although my mother usually detected my attempts to trick her into accepting milk-in-last tea (perhaps because of my guilty expression), I was generally rather suspicious of her claims that she could actually tell the difference.
A noted statistician and biologist, Sir Ronald Fisher (1890-1962), had a similar experience with a colleague named Dr Muriel Bristol-Roach, and was similarly suspicious. Rather than just muttering under his breath and refusing to make the tea, Fisher sat down and devised a statistical test that could be used to settle the question once and for all: Fisher’s exact test. Fisher gave Bristol-Roach eight cups of tea, telling her that the milk had been added before the tea in four, and after in the others, and asked her to say which was which.
That’s all very well, but why do we need a statistical test here? Well, say that our taster got four cups right and four cups wrong (i.e., a 50% success rate). Would you want to conclude that she can tell the difference between milk-in-first and milk-in-last tea? Of course not; it’s easy to see that someone who simply flipped a coin for each cup could expect to get half of them right by sheer chance alone. Now what if she got five right and three wrong? Again, we wouldn’t want to say that she can tell the difference, because our coin-flipper would only have to be slightly luckier than average to get the same result. What about six right and two wrong? Seven right and one wrong? All eight right? How many cups does our tea taster have to correctly identify before we conclude that she really can tell the difference and didn’t just get lucky?
The point of Fisher’s test is to answer this question. It turns out that the probability of getting all eight correct by chance alone, using the coin-flipping method, is 1 in 70, or 0.014 (i.e., 1 divided by 70). Scientists call this a p (short for probability) value. Now, we can never entirely rule out the possibility that a particular result – such as getting all eight cups correct – is down to chance alone; our coin-flipper could just have been extremely lucky. Instead, scientists have agreed on a rule of thumb, known as p<0.05: If (a) the probability of a particular result (e.g., getting all eight cups correct) happening by chance alone is less than 1 in 20 (i.e., 0.05) and (b) this result then happens, then we accept that the result almost certainly didn’t happen by chance alone. That is, we accept the result as statistically significant.
So, let’s round off our tea-tasting example. The story goes that Dr Bristol-Roach did, in fact, get all eight cups correct. Because the probability of getting all eight correct by chance alone – 1 in 70 (p=0.014) – is clearly less than our cut-off of 1 in 20 (i.e., p<0.05) – we accept that the result did not occur by chance alone, and that…drum roll …she actually can tell whether or not the milk was added first. Score 1-0 to my mum.
Now that you know this, you can try your own tea-tasting challenge at home. Don’t like tea? Well, maybe you would like to find out whether you can tell Pepsi™ from Coke™, diet drinks from regular, brand-name groceries from supermarket own-brand equivalents, or even – just for the grown-ups – red wine from white wine (surprisingly difficult if both are chilled). Just remember: With eight drinks, you need to get them all right to significantly beat chance (if you’re particularly thirsty and up the number of drinks to ten, you only need to get 9/10 correct).
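For the curious, you can check these chance calculations with a few lines of Python (this sketch isn’t part of the original challenge; note that the ten-drink figure assumes you judge each drink independently, like a coin flip, rather than knowing in advance that exactly half are of each kind):

```python
from math import comb

# Fisher's design: 8 cups, 4 milk-in-first; the taster picks which 4.
# Only 1 of the comb(8, 4) possible picks is exactly right.
ways = comb(8, 4)            # 70 ways to choose 4 cups from 8
p_all_correct = 1 / ways     # = 1/70, i.e. about 0.014, below the 0.05 cut-off
print(f"{ways} ways; p = {p_all_correct:.3f}")

# Ten drinks, guessing each one independently (a fair-coin model):
# P(at least 9 of 10 right) = [comb(10, 10) + comb(10, 9)] / 2**10
p_nine_or_more = (comb(10, 10) + comb(10, 9)) / 2 ** 10
print(f"p(9 or more out of 10 by luck) = {p_nine_or_more:.4f}")  # still < 0.05
```

Run it and you’ll see both probabilities fall below the 0.05 cut-off, which is why all eight (or nine of ten) is enough to beat chance.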
Um, this is all fascinating stuff, but this business about milk-in-first tea seems a long way from the type of question that is addressed in typical Psychology studies. How does statistical testing work there?
OK, let’s look at a typical Psychology kind of question: “Do women smile more often than men?”. To find out, we get hold of 10 men and 10 women and measure how often each person smiles per minute. Let’s say the women average 10.5 smiles per minute, and the men 8.5. So we can say that women do indeed smile more than men and go home, right?
Wrong. The problem is that if you take a group of 20 people and split them into two groups of 10 randomly – without regard to their gender – it’s pretty unlikely that the average smiling rate for the two groups will be identical, though it will probably be pretty close (let’s say it’s 9.8 smiles per minute for one group and 9.2 for the other). Now, here comes the crucial part. In our original example, one group (women) averaged 10.5 smiles per minute, whilst the other (men) averaged 8.5. So, how do we know that the difference between the two groups is due to the fact that one is male and one is female, as opposed to the fact that if you divide people into two groups at random, one will just so happen to smile a bit more than the other? The answer is that we don’t. And this is why we need a statistical test.
Although the maths is a bit more complicated, the test essentially does the same thing as Fisher’s “tea” test. It figures out the probability of a result like the one we got (i.e., a difference of 2 smiles per minute; 10.5 vs 8.5) happening by chance alone if we just divided people into two groups completely at random. Again, if this probability is less than 0.05 (p<0.05) and the result then happens, we conclude that this result wasn’t down to chance alone and that the difference is statistically significant.
Statistical testing, like many things, is much easier to understand if you actually do it, which is why I’ve put together an Excel spreadsheet (“BetweenT.xlsx”).
If you type in the number of smiles per minute for 10 women and 10 men, the spreadsheet will automatically run a statistical test, and tell you whether any difference between the groups meets this p<0.05 cut-off for statistical significance. Ideally, you should go out and observe 10 men and 10 women and generate some real results, but if you can’t be bothered, here are some made-up ones:
By the way, this spreadsheet isn’t specific to comparing men and women on the number of smiles per minute; you can use it to compare any two groups on anything at all. Simply replace “men” (B6) and “women” (E6) with the names of the two groups that you want to compare (perhaps “northerners” and “southerners” or “young people” and “old people”) and “number of smiles per minute” (B1) with whatever you want to compare them on (perhaps “number of pies eaten per week” or “number of own teeth still remaining”).
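If you’d like to peek under the bonnet, here is a rough Python sketch of the kind of calculation a between-subjects (two-sample) t-test involves. The scores are invented for illustration (they happen to average 10.5 and 8.5, matching the example above), and the cut-off of 2.101 is the standard two-tailed 5% critical value for 18 degrees of freedom (10 + 10 − 2), taken from t-tables:

```python
from statistics import mean, stdev

# Invented smiles-per-minute scores for two groups of 10 (illustrative only)
women = [11, 10, 12, 9, 10, 11, 13, 10, 9, 10]
men = [8, 9, 7, 10, 8, 9, 8, 7, 9, 10]

n1, n2 = len(women), len(men)
# Pooled standard deviation (assumes similar spread in the two groups)
sp = (((n1 - 1) * stdev(women) ** 2 + (n2 - 1) * stdev(men) ** 2)
      / (n1 + n2 - 2)) ** 0.5
# t = difference between the group means, scaled by the variability
t = (mean(women) - mean(men)) / (sp * (1 / n1 + 1 / n2) ** 0.5)

# Two-tailed 5% critical value for 18 degrees of freedom (from t-tables)
CRITICAL = 2.101
print(f"t = {t:.2f}; significant at p < 0.05: {abs(t) > CRITICAL}")
```

With these made-up numbers, t comes out around 3.79, comfortably past the cut-off: a difference this big is very unlikely to arise from splitting 20 people into two groups at random.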
But, sometimes, we’re not interested in differences between two groups of people (e.g., women vs men; young people vs old people) but within individual people. For example, suppose we want to test the idea that everyone – regardless of age and gender – smiles more in the morning than the evening. For this, we need a different type of test, which a second spreadsheet (WithinT.xlsx) will run for you.
Again, it’s more fun to conduct the study yourself, but – if not – here are some made-up data:
Again, you can replace the headings and numbers with anything that you care to test people on (e.g., Do people weigh more in January than November?; Do office workers send more emails on a Monday than a Thursday?). Just remember: It’s the first sheet (BetweenT.xlsx) when you’re looking for a difference between different people (what Psychologists call a between-subjects test), and the second sheet (WithinT.xlsx) when you’re looking for a difference within the same people (what Psychologists call a within-subjects test).
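The within-subjects version can be sketched in Python too. The trick is that each person acts as their own comparison: the test works on each individual’s difference score. The morning and evening figures below are invented, and 2.262 is the standard two-tailed 5% critical value for 9 degrees of freedom (10 people − 1), from t-tables:

```python
from statistics import mean, stdev

# Invented smiles-per-minute scores for the SAME 10 people, morning vs evening
morning = [10, 12, 9, 11, 10, 13, 8, 11, 10, 12]
evening = [8, 11, 7, 10, 9, 11, 8, 9, 9, 10]

# A within-subjects (paired) t-test works on each person's difference score
diffs = [m - e for m, e in zip(morning, evening)]
n = len(diffs)
t = mean(diffs) / (stdev(diffs) / n ** 0.5)

# Two-tailed 5% critical value for n - 1 = 9 degrees of freedom (from t-tables)
CRITICAL = 2.262
print(f"t = {t:.2f}; significant at p < 0.05: {abs(t) > CRITICAL}")
```

Because everyone in this made-up sample smiles at least as much in the morning, the difference scores are consistent and t easily clears the cut-off.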
By the way, it is important to be aware that this idea of testing whether apparently-meaningful differences between groups could have arisen by chance alone – which is done as a matter of course in all sciences – has not entered public consciousness. Newspapers are always reporting as meaningful tiny differences between groups that are extremely likely to have arisen by chance alone. Worse still, I have seen newspaper articles in which a researcher’s claim that – for example – a minuscule increase in crime rates is “not statistically significant” is portrayed as the special pleading of a boffin who is determined to hide the truth. In fact, the truth is that if an “increase” in crime rates is not statistically significant, there’s no reason to think that crime rates have actually changed at all.
The book version of PSY-Q is a collection of interactive psychological tests that allow you to measure your personality, intelligence, moral values, thinking style, impulsivity, skill at drawing, capacity for logical reasoning, musical taste, multitasking ability, susceptibility to illusions (both visual and mental), preferences in a romantic partner and – as they say – much, much more.
 I also prefer milk-in-first, but for purely temperature-related reasons: If the tea is added before the milk, the cup gets hot, meaning that the beverage takes too long to cool down to a drinkable temperature. Don’t get me started on the practice of adding hot milk to coffee.
 Confusingly, another commonly-used test is actually called the tea-test – well, t-test, but it’s pronounced the same – though that test was invented for a different beverage-related purpose: quality control at the Guinness brewery in Dublin.
 Why 1 in 70? Because there are 70 different ways to pick four cups from eight, and only one of these correctly gives you the four milk-in-first cups.
 The real answer is “yes, a bit, but only if they know they are being observed”.