Based on a PsyArXiv preprint with the admittedly slightly provocative title “Why most experiments in psychology failed: sample sizes required for randomization to generate equivalent groups as a partial solution to the replication crisis” a modest debate erupted on Facebook (see here; you need to be in the PsychMAP group to access the link, though) and Twitter (see here, here, and here) regarding randomization.
John Myles White was nice enough to produce a blog post with an example of why Covariate-Based Diagnostics for Randomized Experiments are Often Misleading (check out his blog; he has other nice entries, e.g. about why you should always report confidence intervals over point estimates).
I completely agree with the example he provides (except that where he says ‘large, finite population of N people’ I assume he means ‘large, finite sample of N people drawn from an infinite population’). This is what puzzled me about the whole discussion. I agreed with (almost all) arguments provided; but only a minority of the arguments seemed to concern the paper. So either I’m still missing something, or, as Matt Moehr ventured, we’re talking about different things.
So, hoping to get to the bottom of this, I’ll also provide an example. It probably won’t be as fancy as John’s example, but I have to work with what I have :-)
I’ll start with some definitions (not taking any chances this time :-)).
Population: the infinitely large group of people about which the researchers want to make statements. Samples are drawn randomly from this population.
Sample: a finitely large group of people drawn from the population. We'll assume that samples are drawn at random, which here means that every individual in the infinite population has exactly the same probability of being drawn.
Randomization: a procedure where each participant is placed in a group (or sub-sample) in a completely non-systematic (i.e. random) way, say based on a list of 0s and 1s as generated by http://random.org (a short R sketch of such a procedure follows after these definitions).
Confounding: the situation where a variable (which is then called a confounder) distorts the outcomes of a study because of its association with both the independent and the dependent variable (e.g. the way 'high temperature' confounds conclusions regarding a causal association between ice cream sales and drownings; or, in an experiment where the dependent variable is not measured at baseline, the way a difference in the dependent variable that already existed before the manipulation can cause researchers to wrongly believe that the manipulation had an effect).
Nuisance variable: a variable which, if sufficiently different between groups, is a confounder.
Covariate: a variable which is associated with the dependent variable.
Representative: I added this after John Myles White noticed that it was omitted, and that that omission was problematic. It's much longer than the rest, because I think he's probably right, so it pays to explain this in more depth. Therefore: see the section at the bottom of this post.
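To make the randomization definition concrete, here is a minimal sketch in R; the participant labels and the number of participants are made up purely for illustration:

```r
# A minimal sketch of simple randomization: each of 8 hypothetical
# participants is assigned to group 0 or 1 completely at random.
set.seed(42)                      # only to make the example reproducible
participants <- paste0("p", 1:8)  # hypothetical participant identifiers
groups <- sample(c(0, 1), size = length(participants), replace = TRUE)
data.frame(participant = participants, group = groups)
```

Note that with simple randomization like this, the two groups are not even guaranteed to be of equal size; that is a separate issue from the equivalence discussed below.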
In my interpretation of the Twitter/Facebook discussion Richard D. Morey, Tal Yarkoni, John Myles White, and Dean Eccles made three points that I don’t think apply. Because the discussion didn’t seem to converge towards agreement, I might very well misunderstand their positions, so I’ll start by repeating what I perceived to be their positions (or necessary implications of those positions).
Randomization is guaranteed to yield groups (sub-samples) that are representative of the population that the sample is drawn from. This guarantee is not conditional upon sample size (i.e. also works with small samples).
If randomization results in non-equivalent groups, this is not a problem for your experiment. The sub-groups are both still representative of the population. This dynamic, too, does not depend on the sample size.
Both these dynamics/guarantees apply to each individual study; they don't describe dynamics that only apply over multiple (formally, an infinite number of) studies, like confidence intervals do (i.e. out of an infinite number of samples, 95% of the computed confidence intervals will contain the population mean; but this doesn't apply to any given study, i.e. it doesn't mean that for any single sample there's a 95% probability that the population mean is in the interval; that probability is 0% or 100%).
The last point is what Nguyen et al., in their paper where they conclude that sample sizes of 1000 participants are needed to make sure estimates aren't biased, address with the law of large numbers. This third point, i.e. that although in any single study randomization can fail to achieve its goal (and although randomization has several goals, I'll stick to the 'safeguard against confounding' one here), over multiple studies this is resolved, I will address as I discuss points 1 and 2. I'll start with point 1 (that seems to make the most sense :-)).
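As an aside, the confidence interval logic in that third point is easy to illustrate with a quick simulation. This is only a sketch; the sample size of 20 and the population mean of 0 are arbitrary choices:

```r
# Sketch: coverage of 95% confidence intervals over many samples.
# The population mean is 0; any single interval either contains it or not.
set.seed(1)
coverage <- replicate(10000, {
  x <- rnorm(20, mean = 0, sd = 1)   # one small sample from the population
  ci <- t.test(x)$conf.int           # 95% confidence interval for this sample's mean
  ci[1] <= 0 & ci[2] >= 0            # TRUE if this interval contains the population mean
})
mean(coverage)   # close to .95 across samples; for any single sample it is 0 or 1
```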
So, let us take an infinitely large population. Of people; or better yet, of guinea pigs. Two types of guinea pigs exist: there are long-haired guinea pigs and short-haired guinea pigs.
We randomly take a sample of 4 guinea pigs out of this infinite population of variedly haired guinea pigs. We happen to get 2 short-haired and 2 long-haired guinea pigs.
Now, we randomize them into two groups. There are two possible outcomes of this process: either each group has one of each, or each group has two guinea pigs of the same type. We happen to end up with the latter situation.
Group 1 consists of the two short-haired guinea pigs; group 2 consists of the two long-haired ones.
If it is true that each group created by the randomization process is representative of the population, both groups must generalize completely to the entire population. This means we only need one group to draw conclusions about the population. So, we’ll give group 1 to our neighbour, who happens to really like guinea pigs. We then examine the sample in group 2. We conclude all guinea pigs have long hair.
Given the low sample size, we’re very uncertain of this conclusion; but on the other hand, there’s no variation, and if the sample is entirely representative of the population, there is no bias in any particular direction.
Now, this will be wrong of course, because short-haired guinea pigs also exist. Our conclusion is less accurate than if we'd ended up with the other situation, in which our group 2 would have contained one short-haired and one long-haired guinea pig.
Of course, over several replications, we’d easily find out that we’re wrong.
But researchers usually have a ‘Conclusions’ section at the end of their papers, and it rarely says “we will not draw any conclusions because this is not a meta-analysis”. And for any given study, a sample can be more or less representative of the population. The same goes for sub-samples created by randomization. Any sub-sample can be more or less representative. And the more participants you have, the less likely it is going to be that any given sub-sample happens to be one of the ones that’s not representative.
In this example, it makes sense to follow Terry Pratchett's reasoning: "It is well-known that in an infinite universe everything that can be imagined must exist somewhere". Or, applied to this example: given that it is possible to 'manually' select a completely unrepresentative sample from an infinite population, it follows that random selection from that infinite population will, in a non-zero percentage of random selection procedures, yield that completely unrepresentative sample. The likelihood that any given randomization yields that completely unrepresentative sample decreases as the sample size increases. That's why it's very tricky to generalize to the entire population based on a random sample of N=1 (not impossible, depending on what you study, but tricky - especially in psychology).
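A quick simulation makes that relationship with sample size tangible. This is only a sketch, and the sample sizes are arbitrary illustrations:

```r
# Sketch: how often does randomizing a perfectly balanced sample (half
# long-haired, half short-haired) into two halves produce a group that
# contains only one type of guinea pig?
prop_one_sided <- function(n, reps = 10000) {
  hair <- rep(c("long", "short"), each = n / 2)  # a perfectly balanced sample
  mean(replicate(reps, {
    group1 <- sample(hair, n / 2)                # a random half goes to group 1
    length(unique(group1)) == 1                  # TRUE if group 1 is all one type
  }))
}
sapply(c(4, 8, 16, 32), prop_one_sided)
# With n = 4 this happens in roughly a third of all randomizations;
# with n = 32 it practically never happens.
```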
I hope that at least now I managed to explain why I don’t agree with point 1 sufficiently for others to pinpoint where my logic fails. So, on to point 2.
We want to do an experiment with the guinea pigs. We didn't randomize them for nothing. Our experiment involves letting them listen to the newest Haken album (Affinity) to see whether this has an impact on their mood. Now, if we were omniscient (which we are in this example), we'd know that in our population, long-haired guinea pigs love progressive metal, whereas the short-haired ones tend to go more for trip-hop, or, on the occasional Sunday morning, light classical music. Short-haired guinea pigs couldn't care less about Haken, and their mood is not affected (d = 0.0). Long-haired ones love it and their mood is considerably lifted (d = 1.0). Thus, our intervention has an effect size in the population of d = 0.5.
But, we don’t know all this. However, on the basis of research where rabbits and mice were exposed to Dream Theater and Queensryche, we expect to find an effect size of d = 0.5. (Lucky us! How often does that happen, that you happen to guess the exact same effect size that exists in the population? :-)) So, we power for d = 0.5, recruit 128 guinea pigs to obtain 80% power, and randomize them into two groups.
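(For those who want to verify that number: base R's power.t.test reproduces it; the code below is just that check.)

```r
# Sketch: required sample size for d = 0.5, 80% power, two-sided alpha = .05
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")
# yields approximately 64 guinea pigs per group, i.e. 128 in total
```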
I’ll save you the picture with 128 guinea pigs. Suffice to say that we happen to get exactly 64 guinea pigs in each group. And out of all possible randomizations, we happen to get this one:
Control group: 48 long-haired and 16 short-haired guinea pigs.
Experimental group: 16 long-haired and 48 short-haired guinea pigs.
Expressed in effect sizes, we get nothing in the control group, and in total 16 standard deviations of mood increase in the experimental group (1 d for every long-haired guinea pig). This means the observed effect of our intervention is d = 16/64 = 0.25. Using pdExtreme in R we get a p-value of .16, so we conclude our intervention has no effect.
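(The same p-value can be obtained with base R by converting d to a t value for two groups of 64; this is the standard conversion, not necessarily the exact computation pdExtreme performs.)

```r
# Sketch: p-value for an observed d of 0.25 with 64 guinea pigs per group
d  <- 0.25
n1 <- 64
n2 <- 64
t_value <- d * sqrt((n1 * n2) / (n1 + n2))          # convert d to a t value
p_value <- 2 * pt(-abs(t_value), df = n1 + n2 - 2)  # two-sided p
round(p_value, 2)   # approximately .16
```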
In this specific study, because we were unlucky when randomizing, our experiment 'failed'. Our design was invalid because, unknown to us, a variable that moderated the effect of our intervention (hair length) was unequally distributed over the groups.
Now, with short- and long-haired guinea pigs, this is easy to prevent. After all, everybody knows their musical preferences, so in a study like this you’d take those into account.
But with psychological research, we usually haven’t figured out enough to know what likely moderators are. Therefore, we often don’t know whether we run the risk of ending up with non-equivalent groups with respect to a nuisance variable. We don’t even know how many nuisance variables there are. Any given single experiment, therefore, has a probability that the randomization procedure ‘betrays’ the researchers, in that it creates two sub-samples that render the design invalid.
(I say 'invalid' because that is what I think happens here. This is not a statistical problem; it's a methodological problem. Randomization is a methodological tool to create two groups that are equivalent with respect to potential confounders; randomization allows one to exclude 'third variables' as explanations for an association between the independent variable and the dependent variable. That is, if the randomization 'works', which it is not guaranteed to do for any single study (unless the sample size is very large).)
Of course, over studies, this would be resolved ‘automatically’. But in any single study, that is of little consolation.
So, this is why I think it’s a good idea to allow researchers to compute their required sample size to be relatively certain that randomization, for any single study, yields equivalent groups. Note that the point of the paper is not to provide researchers with an ‘excuse’ if randomization fails. Quite the opposite: the paper provides researchers with tools to ensure that at least one potential cause of replication failure can be eliminated. Well, not eliminated of course - but set at a desired level of likelihood.
Nguyen et al., in the study I referred to earlier (“Simple randomization did not protect against bias in smaller trials”), ran some simulations and concluded that you’d need about 1000 participants to stop worrying about bias. I think it will be interesting to code some simulations for different degrees of moderation and then implement those in userfriendlyscience as well, so that researchers can get more accurate estimates. I’m kind of hoping somebody else gets around to that before I do though :-)
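A rough sketch of one such simulation is shown below; it looks only at imbalance on a single dichotomous nuisance variable, not yet at degrees of moderation, and the 50% prevalence and the 10-percentage-point imbalance threshold are arbitrary choices of mine, not values from the Nguyen et al. paper:

```r
# Sketch: for a dichotomous nuisance variable with 50% prevalence, how often
# does simple randomization leave the two groups differing by more than
# 10 percentage points on that variable?
prob_nonequivalent <- function(n, threshold = 0.10, reps = 10000) {
  mean(replicate(reps, {
    nuisance <- rbinom(n, 1, 0.5)              # nuisance variable in the sample
    group <- sample(rep(0:1, length.out = n))  # simple 50/50 randomization
    abs(mean(nuisance[group == 0]) - mean(nuisance[group == 1])) > threshold
  }))
}
sapply(c(50, 100, 250, 500, 1000), prob_nonequivalent)
# The probability of a >10-point imbalance only becomes small once you
# have several hundred participants.
```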
And, of course, you don't know how many nuisance variables exist. In the discussion somebody proposed an infinite number; but I don't think it's that bad. However, the fact that we don't know how many there are is no reason to ignore this, or to conveniently assume that they'll even out.
I'm really curious whether there's some logical error in these examples that means they are impossible.
In any case, I’d like to thank everybody who took the time to explain these matters in these discussions! They yielded some very interesting reads :-)
The definition of representativeness
As John Myles White remarked, a problem in the reasoning above is that I did not define what 'representative' means. So I'll do that here.
To compensate for the omission, I’ll even offer two definitions :-)
The first definition
The first is the ‘working definition’ I use of representative (I deliberately didn’t look up what others have to say, instead working from what I apply in my ‘daily professional life’ (and probably, to the exasperation of some, in my ‘daily private life’ :-))).
A given sample is representative of the relevant population if the chosen methodology yields statistics of interest that have expected values equal to the respective population values. An example: if you want to estimate the prevalence of guinea pig possession in the Netherlands, and you draw an entirely random ('aselect') sample, and have a measurement method for establishing guinea pig possession that has no systematic bias (e.g. a normally distributed measurement error with a mean of 0, to suggest something exotic :-)), the expected value of your sample mean is the population mean. This doesn't mean that in any given sample the sample mean equals the population mean; but if you were to repeat your sampling an infinite number of times and average all sample means, you would necessarily obtain the population mean. The mean, in that case, is an unbiased estimate of the population mean. That means that the value you obtain generalizes to the population; the mean will be representative of the population mean. These dynamics mean that a sampling distribution of the mean exists, which has a number of properties that allow estimation of the accuracy of the estimate, e.g. using confidence intervals. With narrow confidence intervals, the likelihood that the population mean is very far removed from any given value of the sample mean is lower (not zero, but lower).
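A small sketch of what I mean; the population prevalence of 20% and the sample size of 25 are made-up numbers:

```r
# Sketch: the sample proportion is an unbiased estimate of the population
# prevalence, even though any single small sample can be quite far off.
set.seed(2)
sample_means <- replicate(10000, mean(rbinom(25, 1, 0.20)))  # many samples of n = 25
mean(sample_means)   # averages out to approximately .20, the population value
range(sample_means)  # ...but individual samples vary widely around that value
```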
This definition of "representativeness" hinges upon dynamics of 'randomization' that allow us to make statements about the population based on single samples; but the certainty with which we can make those statements is a function of the sample size. With small samples, any given sample is likely to produce values that differ wildly from the population values (as illustrated by the wide confidence intervals). In other words, with small samples, you should probably refrain from making any statements whatsoever. Where it concerns estimates of associations between variables, a 'small sample' can still be hundreds of participants, depending on the measurement level of the variables you're studying and what you consider 'wild deviation' from the population value; Cohen's d's sampling distribution, for example, only allows accuracy with hundreds of participants. Therefore, what counts as a 'small sample' might be quite a lot of people, depending on what exactly you study. Most textbook findings in psychology, for example, have tiny sample sizes (sometimes only dozens of participants), and therefore, by chance alone, many of the obtained statistics will differ substantially from the respective population values in those studies. The statistics will generally be representative; but that representativeness doesn't guarantee that they are close to the population value.
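To illustrate how wide Cohen's d's sampling distribution is, here is a sketch; the population effect of d = 0.5 and the two per-group sample sizes are arbitrary choices:

```r
# Sketch: how widely Cohen's d estimates spread around a population value of
# d = 0.5, for two different per-group sample sizes.
simulate_d <- function(n_per_group, reps = 10000) {
  replicate(reps, {
    control      <- rnorm(n_per_group, mean = 0.0)
    experimental <- rnorm(n_per_group, mean = 0.5)
    pooled_sd <- sqrt((var(control) + var(experimental)) / 2)
    (mean(experimental) - mean(control)) / pooled_sd  # Cohen's d for this sample
  })
}
quantile(simulate_d(20),  c(.025, .975))  # roughly -0.1 to 1.1: a huge range
quantile(simulate_d(200), c(.025, .975))  # roughly 0.3 to 0.7: far more accurate
```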
The second definition
This second definition is the one I think most researchers in psychology, and definitely most professionals in the media and in university communication offices, use; it kind of resembles the belief in the law of small numbers.
This definition is basically the same as the first definition, but without acknowledging that the estimates derived from any single study or any single sample might be very far off from the population value.
Therefore, when this definition of representativeness is applied, the applier assumes that if the conditions for obtaining a representative estimate are met, the obtained statistic can be seen as literally 'representing' the population value; it is assumed to have a value that is close to the population value, and it is considered to be very informative as to the population value. However, for small samples, it may not be. In fact, in large samples, it may not be either; it will just be less likely to have a value that's far removed from the population value.
Let’s call that second definition not representativeness, but ‘literal representation’.
This 'literal representation' is, as far as I know, a fallacy. Yet, also as far as I know, a widely held one. Examples abound: a recent one concerns a study into why the return trip is felt to be shorter than whatever the opposite is called (in Dutch we have the words "heenweg" and "terugweg", which describe the first and second legs of a round trip, respectively).
This was extensively covered by the media (by e.g. the L.A. Times and Vox; PLoS ONE has a convenient list). The paper describes a between-subjects experiment with 20 participants, randomized into two conditions.
Clearly, the estimates derived from this sample were considered ‘representative’ in the second definition.
And they probably were representative in the first definition, but that doesn’t afford such strong conclusions. Those strong conclusions were drawn nonetheless, and widely distributed. This is just one example of a frequently repeating phenomenon.
It’s this second ‘line of thinking’ that I think is the one that we need to ‘connect to’ in discussions of randomization; and it is this second line of thinking that I think can be ameliorated by enhanced awareness of the fact that in small samples, randomization is ‘fickle’.