Power and sample size
This is a quick journey through statistical power and sample size for statistical hypothesis testing.
At the start of any kind of scientific investigation, there needs to be a clear idea of the aims of the study. Ideally an investigator sets out to provide empirical evidence for a small number of research hypotheses. To allow formal statistical assessment of such experimental hypotheses each can be translated into a statistical test problem which consists of a null hypothesis about population parameters and an alternative hypothesis (typically a translation of the hypothesis that the investigator is trying to prove). The two hypotheses have to be mutually exclusive events and cover the whole parameter space. A typical example would be an experiment where the investigator sets out to show the existence of a group difference in the mean value of some normally distributed outcome. Here the null hypothesis would be that the population means in the groups are the same versus the alternative hypothesis that they differ.
The objective of a statistical test is to either accept or reject the null hypothesis in the light of the study data. When carrying out a statistical test wrong decisions might be made, namely the null hypothesis is rejected when it should be accepted or vice versa. Please see the diagram below:
| | Truth: null hypothesis true | Truth: null hypothesis false (= alternative hypothesis true) |
|---|---|---|
| Decision: accept | True negative (probability 1-α) | False negative, Type II error (probability β) |
| Decision: reject | False positive, Type I error (probability α) | True positive, power (probability 1-β) |
As can be seen from the above diagram there are two types of error: so-called Type I errors, or rejecting the null hypothesis when it is in fact true, and so-called Type II errors, or accepting the null hypothesis when it is in fact false. Their respective probabilities are commonly referred to as the significance level α and the type II error probability β. This leaves two boxes remaining: the probability of accepting the null hypothesis when it is in fact true and the probability of rejecting the null hypothesis when it is in fact false. The second of these statements is the definition of statistical power.
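In symbols, the error probabilities and the power can be summarised as follows. The notation H_0 for the null hypothesis is introduced here purely for illustration and does not appear elsewhere on this page.

```latex
\begin{align*}
\alpha    &= P(\text{reject } H_0 \mid H_0 \text{ true})  && \text{(Type I error probability, significance level)} \\
\beta     &= P(\text{accept } H_0 \mid H_0 \text{ false}) && \text{(Type II error probability)} \\
1 - \beta &= P(\text{reject } H_0 \mid H_0 \text{ false}) && \text{(power)}
\end{align*}
```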
While the type I error probability of a statistical test is controlled by the significance level, the power cannot be set directly. Rather, for a given significance level, the power of a test depends on the sample size and the effect size that the test is aiming to detect. This implies that once two out of the three numbers sample size, power and effect size have been specified, the third can be calculated. Most importantly, the sample size that allows detection of a given effect by a statistical test with a specified power can be calculated. Such calculations are often referred to as sample size calculations or power analysis.
For example, when planning to carry out an independent samples t-test to test the null hypothesis of zero group difference in the mean value of some normally distributed outcome, the relevant effect size measure (often called a standardized effect size) is the difference between the group means divided by the common within-group standard deviation of the outcome. The following figures demonstrate the interlinking of standardized effect size, sample size (n1 = size of group 1, n2 = size of group 2) and power (as a percentage).
(1) “For a given sample size the power increases with the standardized effect size that the test is trying to detect.”
(2) “For a given power the sample size necessary to detect an effect increases as the standardized effect size decreases.”
(3) “For a given standardized effect size the power to detect the effect increases as the sample size increases.”
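The three relationships can also be checked numerically. The following is a minimal sketch that assumes a two-sided test and replaces the t distribution with a normal approximation, so its results are slightly optimistic for small groups. The function names `approx_power` and `smallest_n` are hypothetical helpers written for this illustration, not part of any sample size package.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n1, n2, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of equal means.

    Assumption: normal approximation to the independent samples t-test.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative hypothesis.
    shift = effect_size * sqrt(n1 * n2 / (n1 + n2))
    # Probability of falling into either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def smallest_n(effect_size, target_power=0.80, alpha=0.05):
    """Smallest equal group size whose approximate power reaches the target."""
    n = 2
    while approx_power(effect_size, n, n, alpha) < target_power:
        n += 1
    return n


# (1) Fixed sample size: power rises with the standardized effect size.
for d in (0.2, 0.5, 0.8):
    print(f"n1=n2=20, effect size {d}: power {approx_power(d, 20, 20):.2f}")

# (2) Fixed power: the required sample size rises as the effect size falls.
for d in (0.8, 0.5, 0.2):
    print(f"effect size {d}: about {smallest_n(d)} per group for 80% power")

# (3) Fixed effect size: power rises with the sample size.
for n in (10, 20, 40, 80):
    print(f"effect size 0.5, n1=n2={n}: power {approx_power(0.5, n, n):.2f}")
```

When the effect size is zero the function returns the significance level itself, which is a quick sanity check on the approximation.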
Carrying out such calculations as part of the planning of a research study is important for ethical reasons. A study with too many subjects may be deemed wasteful of resources and unethical due to the involvement of too many people. On the other hand studies with too few people will be unlikely to detect clinically important effects, so again could be considered to be wasteful of resources and unethical. Obviously ethical implications of sample size will vary with the background of the study. For example, clearly the implications of unnecessarily applying an invasive treatment are larger than those of merely filling out a questionnaire.
To answer the basic question “what sample size do I need?”, five steps need to be worked through:
I. Specification of the statistical test problem:
a. What is the null hypothesis?
b. What is the alternative hypothesis?
c. Is this a 1-sided or 2-sided test?
d. What significance level is required?
II. Selection of an appropriate statistical significance test:
a. Choice depends on the study design. (For example, is this a repeated measures design, or is there any kind of relationship between the units of observation?)
b. Choice depends on the scale of the response variable. (Is the response a continuous or a categorical outcome?)
c. For continuous variables choice further depends on assumptions that we are willing to make. (normal distribution?, constant variances in groups?, not willing to make any distributional assumptions?)
d. etc.
III. Specification of the effect size to be detected:
a. Ideally this effect size should reflect the smallest clinically significant effect, that is the smallest effect that would result in a change in clinical practice.
b. Often “expert knowledge” is used to determine the effect size. That is, knowledge from previous studies or a pilot study is used to specify the expected size of the effect, and the current study is then aimed at detecting an effect of this size or larger. Note, however, that previous studies can only provide estimates of effects and that if the true effect is smaller than the estimate the current study might not have enough power to detect it using this approach.
c. “Rough calculations” can be carried out by specifying “small”, “moderate” or “large” effect sizes when the smallest clinically significant effect cannot be specified due to lack of knowledge. This approach has been used by Cohen (1988) and is described in a little more detail below. While perhaps appealing to the practitioner, this less informative approach should only be used as a last resort.
IV. Power requirement:
a. There is a general acceptance that this should be at least 80%, though the greater the better.
b. (The IoP Ethics committee will not accept study proposals that do not at least provide 80% power.)
V. Sensitivity analysis:
a. This is to see how the sample size varies when the input values vary within reasonable levels, e.g.
i. the power requirement changes within a reasonable interval
ii. the effect size changes within a plausible range (e.g. confidence intervals)
iii. another appropriate test (e.g. a non-parametric test) is used
b. One would have more trust in a sample size calculation if the sample size is sufficient even under the worst case scenario
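A sensitivity analysis of this kind can be sketched as a small grid of calculations. The example below assumes a two-sided independent samples t-test at the 5% level and uses a normal approximation with a commonly used small-sample correction term (a quarter of the squared critical value). Both the approximation and the helper name `n_per_group` are assumptions made for this illustration, and the grid values are arbitrary examples.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, power=0.80, alpha=0.05):
    """Approximate per-group sample size for a two-sided two-sample t-test.

    Assumption: normal approximation plus the correction term z_alpha**2 / 4.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2 + z_alpha ** 2 / 4
    return ceil(n)


# Illustrative input ranges: vary the power requirement and the effect size.
powers = [0.80, 0.85, 0.90]
effect_sizes = [0.5, 0.8, 1.0]
grid = {(p, d): n_per_group(d, power=p) for p in powers for d in effect_sizes}

print("power  " + "  ".join(f"d={d:<4}" for d in effect_sizes))
for p in powers:
    print(f"{p:<6.2f} " + "  ".join(f"{grid[(p, d)]:<6}" for d in effect_sizes))

# The worst case is the largest requirement anywhere in the grid.
worst_case = max(grid.values())
print("Worst case over the grid:", worst_case, "per group")
```

Planning for the worst case in the grid corresponds to the most demanding combination, here the highest power requirement together with the smallest effect size.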
3. Cohen’s “small”, “moderate” and “large” effect sizes
It can sometimes be difficult to interpret exactly what an effect size might mean and hence assess clinical significance, in particular when dealing with new scales. Cohen (1988) likened standardized effect sizes (differences in group means divided by within-group standard deviation) to the heights of teenage girls. Thus:
- Effect size = 0.2 (small effect size) – difference in mean heights between girls aged 15 and 16
- Effect size = 0.5 (moderate effect size) – difference in mean heights between girls aged 14 and 18
- Effect size = 0.8 (large effect size) – difference in mean heights between girls aged 12 and 20
The following figure illustrates the separations of two normal distributions for a “moderate” and a “very large” (standardized effect size = 3) effect size.
a) “Moderate” effect
b) “Very large” effect
It is often argued that a study should at least be able to detect what Cohen has called a “large” effect size, that is a standardized effect of 0.8-1, which separates two standard normal distributions by about a sixth of their range.
A study is planned to compare two groups of patients (groups A and B) using a cognitive assessment scale (CAMTOT). Significance testing is to be performed at the 5%-level and a power of at least 80% is required. The minimum difference in mean CAMTOT scores between groups that the researcher would like to be able to detect with the study is ten points. A pilot study was performed which yielded a pooled within-group standard deviation of CAMTOT scores of 10 points.
Here the null hypothesis to be tested is that the target population mean CAMTOT scores in group A and B are the same. The alternative hypothesis is that they differ. Assuming normality of the CAMTOT scores and homogeneity of variance in the two groups an independent samples t-test might be used to formally assess the null hypothesis.
Using a sample size calculator (e.g. nQuery Advisor; see also FAQ3.1) we can work out that 17 in each group would be needed for an independent samples t-test at the 5% significance level to be able to detect a mean difference of 10 points or larger with 80% power, assuming that the within-group standard deviation is 10 points (standardized effect size = 1).
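For readers without access to a dedicated calculator, the calculation can be approximated in a few lines. This is a sketch under stated assumptions, not the method used by nQuery Advisor: it uses a normal approximation with a small-sample correction term, and the function name `sample_size_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(mean_difference, within_group_sd, power=0.80, alpha=0.05):
    """Approximate per-group n for a two-sided independent samples t-test.

    Assumption: normal approximation plus the correction term z_alpha**2 / 4,
    used here in place of an exact noncentral t calculation.
    """
    # Standardized effect size: mean difference in units of the within-group SD.
    effect_size = mean_difference / within_group_sd
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = z.inv_cdf(power)           # quantile matching the power requirement
    n = 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2 + z_alpha ** 2 / 4
    return ceil(n)


# Inputs from the example: 10-point difference, SD of 10, 5% level, 80% power.
n = sample_size_per_group(mean_difference=10, within_group_sd=10)
print(f"Required sample size: {n} per group")
```

Under these assumptions the sketch gives 17 per group, in line with the figure quoted above. A package based on the noncentral t distribution remains the better choice for a final calculation.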
The stated effect size requirement translates into a “large” effect size. What would happen if the true effect was only of “moderate” size, that is, if the true mean difference was only 5 points? A power calculation based on the same test problem and t-test as above shows that a sample size of 17 in each group would have only 29% power to detect such a reduced difference in means (standardized effect size = 0.5). If the true effect was smaller than anticipated we would therefore not have sufficient power to find it.
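A rough check of this power figure is possible by hand. The working below assumes a normal approximation to the two-sided t-test, with d the standardized effect size, n the size of each group and Φ the standard normal distribution function; these symbols are introduced here for illustration only.

```latex
\begin{align*}
\text{power} &\approx \Phi\!\left(d\sqrt{n/2} - z_{1-\alpha/2}\right) \\
             &= \Phi\!\left(0.5\sqrt{17/2} - 1.960\right) \\
             &= \Phi(1.458 - 1.960) = \Phi(-0.502) \approx 0.31
\end{align*}
```

The normal approximation is slightly optimistic for groups this small, which is consistent with the somewhat lower value of 29% obtained from the t-test calculation.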
Alternatively then we might calculate the sample size that would be needed to detect a 5 point difference with 80% power. A new sample size calculation shows that reducing the effect size from a standardized effect of 1 (“large”) to 0.5 (“moderate”) increases the necessary sample size to 64 subjects per group.
Finally, we might check the sensitivity of our calculation to changing the statistical test that will be used to detect the group difference in CAMTOT means. Assuming that we would be using a nonparametric equivalent to the independent samples t-test (say the Mann-Whitney U-test), we calculate that 20 in each group would be needed to detect a 10 point difference or larger with 80% power. This sample size requirement increases to 69 subjects per group for detecting a 5 point difference or larger using a nonparametric test.
Course notes for Power Analysis using nQuery Advisor
Text books on sample size calculations and power
Papers
Howell DC. Power. In: Everitt BS and DC Howell (eds.) (2005) Encyclopedia of Statistics in Behavioral Science, Wiley, p.1558-1564.


