Gluud C and Thorlund K (2025). Long overdue recognition of Klim McPherson’s 1974 article on sequential analysis of trial data

© Christian Gluud. Email: cgluud@ctu.dk


Cite as: Gluud C and Thorlund K (2025). Long overdue recognition of Klim McPherson’s 1974 article on sequential analysis of trial data. JLL Bulletin: Commentaries on the history of treatment evaluation (https://www.jameslindlibrary.org/articles/long-overdue-recognition-of-klim-mcphersons-1974-article-on-sequential-analysis-of-trial-data/)


Abstract

In 1974, Klim McPherson published a two-page note in the New England Journal of Medicine entitled “Statistics: The problem of examining the data more than once”. Drawing on the tradition of sequential analysis shaped by his mentor, Peter Armitage, McPherson demonstrated how repeated interim statistical significance testing inflates the type I error rate. He provided a simple table of adjusted significance thresholds to preserve nominal error levels. His use of standard p-value thresholds and clear tabular illustrations made this complex statistical issue accessible to clinicians and other non-statisticians, paving the way for more widespread understanding and eventual adoption. Although the article was modestly cited compared with later seminal works, such as the articles proposing the O’Brien-Fleming monitoring curve and the Lan-DeMets alpha-spending function, McPherson’s article sowed the conceptual seeds that underpin modern sequential methods in clinical trials, methods which have since been extended into meta-analysis and evidence synthesis. The legacy of McPherson’s 1974 article lives on not only in every group sequential clinical trial design, but also in any context in which analysis of accumulating patient data involves repeated statistical significance testing, thereby protecting both patients and science.

Introduction

In 1974, Klim McPherson published a brief yet seminal note in the New England Journal of Medicine entitled “Statistics: The problem of examining the data more than once” (McPherson 1974). In it, he offered an elegantly succinct and accessible demonstration of a crucial issue in clinical trial design: how repeated interim analyses inflate the type I error rate, increasing the likelihood of mistakenly concluding that a new treatment is beneficial; and how adjusted significance thresholds can maintain the desired nominal type I error (e.g. 5%).

Klim McPherson

Using two simple tables – one outlining the inflation of the type I error in a few common scenarios and one outlining the adjusted significance thresholds preserving the desired nominal type I error under the same scenarios – McPherson conveyed with utmost simplicity and clarity what had previously only been addressed using more complex mathematical approaches (Armitage 1954; Armitage 1958; Armitage 1960; Armitage, McPherson & Rowe 1969; McPherson & Armitage 1971). For instance, in one illustrative scenario, McPherson showed that testing accumulating data ten times without adjustment, before the required sample size has been reached, raises the nominal 5% false-positive rate (type I error) to nearly 20% (McPherson 1974). Under the same scenario, he showed that a nominal significance threshold of 1.07% at each test would maintain a desired overall type I error rate of 5% (see Tables 1 and 2 in McPherson’s article) (McPherson 1974). Importantly, he also emphasised that repeated analyses should be pre-specified in the trial protocol, including the number and timing of the planned looks (McPherson 1974). While his recommendations laid the foundations for later formal frameworks for sequential designs, the article itself has rarely been cited outside methodological circles, despite its continued presence in biostatistics training materials.
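
McPherson’s figures can be verified by simulation. Below is a minimal Monte Carlo sketch (our own illustration, not McPherson’s calculation; the number of simulated trials, the random seed and the 20 observations added between looks are arbitrary assumptions) of ten interim z-tests on data accumulating under the null hypothesis: testing at the conventional 5% level at every look produces an overall false-positive rate of roughly 19–20%, whereas testing at approximately the 1.07% level at every look keeps it near 5%, in line with McPherson’s Tables 1 and 2.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2025)      # arbitrary seed
n_trials = 50_000                      # simulated trials under the null hypothesis
n_looks = 10                           # number of interim analyses
n_per_look = 20                        # observations added between looks (assumed)

def overall_type_1_error(per_look_alpha):
    """Proportion of null trials declared 'significant' at least once across the looks."""
    z_crit = norm.ppf(1 - per_look_alpha / 2)            # two-sided critical value
    data = rng.standard_normal((n_trials, n_looks * n_per_look))
    cum_sums = np.cumsum(data, axis=1)
    n_at_look = np.arange(1, n_looks + 1) * n_per_look   # sample size at each look
    z = cum_sums[:, n_at_look - 1] / np.sqrt(n_at_look)  # z-statistic at each look
    return np.mean(np.any(np.abs(z) > z_crit, axis=1))

print("Unadjusted 5% threshold at every look :", overall_type_1_error(0.05))    # roughly 0.19
print("Adjusted 1.07% threshold at every look:", overall_type_1_error(0.0107))  # roughly 0.05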

From where did McPherson get his insight?

McPherson was trained by Peter Armitage, the doyen of British medical statistics (Reynolds & Nagaraja 2024). They co-authored two articles on repeated significance testing, published in the Journal of the Royal Statistical Society in 1969 and 1971 (Armitage, McPherson & Rowe 1969; McPherson & Armitage 1971). Armitage’s “Sequential Medical Trials” monograph codified sequential methods for clinicians (Armitage 1960), and in the preface to the second edition Armitage thanked McPherson for his assistance with the book (Armitage 1974). The lineage from Armitage to McPherson places the 1974 article (McPherson 1974) firmly within a tradition that would soon blossom into modern sequential designs.

What was unique about McPherson’s 1974 article

Previous articles and books on repeated statistical significance testing in clinical trials were invariably mathematically ‘heavy’, and arguably not readily accessible to the average clinical researcher with little or no statistical training (Armitage 1954; Armitage 1958; Armitage 1960; Armitage, McPherson & Rowe 1969; McPherson & Armitage 1971; Haybittle 1971; Armitage 1974). While all these articles were replete with tables of hand-calculated (or occasionally computer-simulated) inflated probabilities of type I error under various repeated significance testing scenarios across multiple types of data (binary, continuous, etc.), McPherson’s 1974 two-page article was (to our knowledge) the first to present a single, digestible table outlining only a few key scenarios that clinical researchers would typically recognise as important (McPherson 1974). Perhaps more importantly, the equally simple second table in McPherson’s article (McPherson 1974), outlining adjusted significance thresholds for p-values, could well be perceived as a didactic masterpiece. While previous articles had conceptually done the same, their focus was typically on thresholds for non-standard test statistics (usually some linear transformation of the z-value, varying from article to article). McPherson’s shift to significance thresholds expressed as p-values conveyed the problem of (and the solution to) repeated significance testing in terminology that the non-statistically minded clinical researcher could easily comprehend.

How was McPherson’s 1974 article received?

As of 6 July 2025, ScholarGPS lists 172 citations of McPherson’s 1974 New England Journal of Medicine article across five decades. This is a respectable number, yet modest when compared with other seminal papers such as Peter O’Brien’s and Thomas Fleming’s introduction of the now widely adopted O’Brien-Fleming group sequential monitoring curve (O’Brien & Fleming 1979), with 2774 citations in ScholarGPS on 6 July 2025, or Gordon Lan and David DeMets’ introduction of the flexible (‘test when you want’) alpha-spending function (Lan & DeMets 1983), with 1518 citations in ScholarGPS on that day. The difference in citations may say less about the merit of McPherson’s article than about timing and audience. In the early 1970s, most clinical investigators worked with paper case‑report forms, hand calculators and limited access to biostatisticians. Interim monitoring was logistically daunting, so the value of McPherson’s clear warnings may not have been immediately obvious. Furthermore, other articles and books had provided much more comprehensive tables of type I error inflation, which meant that McPherson’s 1974 article was likely never cited for the significance thresholds applied within a trial. As computing power grew and data monitoring committees became standard (Armitage 2013), Lan and DeMets’ flexible spending‑function approach, supported by user‑friendly software implementations, naturally drew more citations. Nonetheless, the conceptual seeds had been sown by McPherson.

Why did widespread adoption of McPherson’s ideas take time?

Several contemporary forces delayed implementation of statistical adjustment for repeated testing of accumulating data in clinical trials. In the 1970s, computing constraints and immature data management practices meant that interim data were seldom available until late in a trial, so timely interim analyses were rarely feasible. Throughout the 1980s and 1990s, the clinical trial statistics community witnessed a lively debate about what type of alpha-spending was optimal when considering ethics, interim statistical power, and other factors. It was not until the late 1990s (or even the early 2000s) that most of the community had come to favour conservative boundaries such as those of O’Brien-Fleming or Haybittle-Peto (O’Brien & Fleming 1979; Haybittle 1971; Peto et al. 1976). Once spending‑function software matured and guidance documents from both the U.S. Food and Drug Administration (FDA) and the European Medicines Agency (EMA), first published in 1998, endorsed group sequential designs using pre‑specified alpha‑spending (U.S. Department of Health 1998; European Medicines Agency 1998), McPherson’s insight became an operational norm rather than an academic curiosity. Yet, because of the two-decade debate over optimal alpha-spending, the references favoured for citation were those that proposed or compared monitoring boundaries (e.g. Pocock vs quadratic alpha-spending vs O’Brien-Fleming).

Progress since 1974

The advances in sequential testing of accumulating clinical trial data have been remarkable. From the early proposals of formal group sequential boundaries, such as that of O’Brien-Fleming, to the introduction of the Lan-DeMets alpha-spending function, to the availability of open-source code for alpha-spending functions and user-friendly commercial software (e.g. Cytel’s EAST software (Cytel Inc. 2023)), to the regulatory endorsement of these methods, it is clear that McPherson’s 1974 article had a ‘snowball effect’ on the evolution of repeated significance testing practices in clinical trials.
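
To illustrate the flexibility that attracted so many users, the sketch below (our own illustration, with arbitrarily assumed information fractions; the actual stopping boundaries require the joint distribution of the sequential z-statistics, which dedicated software such as EAST computes numerically) shows how the two classic Lan-DeMets spending functions allocate an overall 5% type I error across four planned looks: the O’Brien-Fleming-type function spends almost no alpha early, whereas the Pocock-type function spends it more evenly.

import numpy as np
from scipy.stats import norm

alpha = 0.05
info_fractions = np.array([0.25, 0.50, 0.75, 1.00])   # planned looks (assumed)

# O'Brien-Fleming-type spending function: alpha*(t) = 2 - 2*Phi(z_{alpha/2} / sqrt(t))
obf_spent = 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(info_fractions)))

# Pocock-type spending function: alpha*(t) = alpha * ln(1 + (e - 1) * t)
pocock_spent = alpha * np.log(1 + (np.e - 1) * info_fractions)

for t, obf, poc in zip(info_fractions, obf_spent, pocock_spent):
    print(f"information fraction {t:.2f}: cumulative alpha spent "
          f"OBF-type {obf:.4f}, Pocock-type {poc:.4f}")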

Perhaps beyond what McPherson had even imagined, the methods have been further extended into evidence synthesis. In 1998, Pogue and Yusuf first proposed cumulative meta-analytic monitoring boundaries based on an optimal information size (i.e. a required meta-analysis sample size) (Pogue & Yusuf 1998). These developments were reflected in the Copenhagen Trial Unit’s series of papers on trial sequential analysis (TSA) and the accompanying software, which ensure proper adjustment for repeated significance testing in cumulative meta-analysis (2005 and onwards) (Thorlund et al. 2005; Wetterslev et al. 2008; Wetterslev et al. 2009; Thorlund et al. 2009; Imberger et al. 2015; Imberger et al. 2016; Thorlund et al. 2017; Thorlund et al. 2021). Other groups have proposed alternative methodologies and approaches (Thomas et al. 2024).
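
As a rough indication of what such an information-size calculation involves, the sketch below (our own illustration with assumed input values; it follows the general logic of Pogue & Yusuf (1998) and Wetterslev et al. (2009) rather than the exact formulas implemented in the TSA software) computes a required information size for a dichotomous outcome as the sample size of a single adequately powered trial, and then inflates it to allow for between-trial heterogeneity (diversity, D²).

from scipy.stats import norm

alpha, beta = 0.05, 0.10     # desired maximum type I and type II error rates
p_control = 0.20             # anticipated control group event proportion (assumed)
rrr = 0.20                   # anticipated relative risk reduction (assumed)
diversity = 0.25             # anticipated between-trial heterogeneity, D^2 (assumed)

p_experimental = p_control * (1 - rrr)
p_mean = (p_control + p_experimental) / 2

# Fixed-effect required information size: total participants needed, analogous
# to the sample size of a single adequately powered two-group trial.
z_sum = norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)
fixed_is = 4 * z_sum**2 * p_mean * (1 - p_mean) / (p_control - p_experimental)**2

# Diversity adjustment inflates the requirement for a random-effects setting.
adjusted_is = fixed_is / (1 - diversity)

print(f"Fixed-effect required information size : {fixed_is:,.0f} participants")
print(f"Diversity-adjusted information size    : {adjusted_is:,.0f} participants")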

Remaining challenges

While the regulatory environment now strongly endorses error control in clinical trials, randomised clinical trials as well as cumulative evidence syntheses still vary in rigour. The use of interim analyses and data monitoring and safety committees has increased sharply during the past 35 years, yet several articles have pointed out that many randomised clinical trials lack stringency in their use of interim analyses and data monitoring committees (Montori et al. 2005; Mills et al. 2006; Fleming et al. 2017; Schöffski 2021; Ciolino, Kaizer & Bonner 2023; Bodden, Hilgers & König 2025). Many systematic reviews accept primary analyses as well as repeated updates without adjustment, risking over‑interpretation of early trends (Thorlund et al. 2005; Wetterslev et al. 2008; Wetterslev et al. 2009; Thorlund et al. 2009; Imberger et al. 2015; Imberger et al. 2016; Thorlund et al. 2017; Thorlund et al. 2021). Despite publication of several studies demonstrating the control of type I and II errors by using TSA (Thorlund et al. 2005; Wetterslev et al. 2008; Wetterslev et al. 2009; Thorlund et al. 2009; Imberger et al. 2015; Imberger et al. 2016; Thorlund et al. 2017; Thorlund et al. 2021), an expert panel within the Cochrane Collaboration advocated against the use of TSA in Cochrane reviews in 2017 (Expert Panel 2017), although this opposition has been slightly softened in the latest edition of the Cochrane Handbook (Thomas et al. 2024). Moreover, the majority of the more than 10,000 meta-analyses or systematic reviews applying TSA lack proper protocolisation or contain other mistakes (Riberholt et al. 2024). Wider and better uptake of TSA or comparable tools would honour McPherson’s spirit by safeguarding against spurious precision at the individual trial level as well as at the meta‑analytic level.

Concluding remarks

McPherson’s 1974 note was seminal: it conveyed both the problem of repeated significance testing in clinical trials and a solution to it, to a broader audience that had not previously been reached. Fifty-one years on, the article deserves overdue homage – not because later work eclipsed it, but because it made that work possible. Its legacy lives on in every sequential design and interim analysis plan that now protects patients and science alike.

Acknowledgements
We thank Iain Chalmers and Mike Clarke for asking us to prepare this commentary on Klim McPherson’s article, as well as for helpful language amendments to this article. We also thank Klim McPherson’s daughter, Tess McPherson, for clarifying her father’s full name, Christopher Klim McPherson, and providing his photograph.

 

References

Armitage P (1954). Sequential tests in prophylactic and therapeutic trials. Quarterly Journal of Medicine 23: 255-274.

Armitage P (1958). Sequential methods in clinical trials. American Journal of Public Health 48: 1395-1402.

Armitage P (1960). Sequential Medical Trials. Oxford. Blackwell Scientific Publications. First edition. 1-105.

Armitage P, McPherson CK, Rowe BC (1969). Repeated significance tests on accumulating data. J R Stat Soc [A] 132:235-244.

Armitage P (1974). Sequential Medical Trials. Oxford. Blackwell Scientific Publications. Second edition. 1-194.

Armitage P (2013). The evolution of ways of deciding when clinical trials should stop recruiting. Journal of the Royal Society of Medicine 107(1): 34-9. (https://www.jameslindlibrary.org/articles/the-evolution-of-ways-of-deciding-when-clinical-trials-should-stop-recruiting/)

Bodden D, Hilgers R-D, König F (2025). Randomization in clinical trials with small sample sizes using group sequential designs. PLoS One 20(6): e0325333. https://doi.org/10.1371/journal.pone.0325333.

Ciolino JD, Kaizer AM, Bonner LB (2023). Guidance on interim analysis methods in clinical trials. J Clin Trans Science 7(1):e124. doi: 10.1017/cts.2023.552.

Cytel Inc. (2023). EAST® (Version 6.6). Software for the design, simulation, and monitoring of clinical trials. Cytel Inc., Cambridge, MA, USA. https://cytel.com/east-horizon/.

European Medicines Agency (1998). ICH E9: Statistical Principles for Clinical Trials (CPMP/ICH/363/96). Committee for Proprietary Medicinal Products (CPMP), London. https://www.ema.europa.eu/en/documents/scientific-guideline/ich-e9-statistical-principles-clinical-trials-step-5_en.pdf.

Expert Panel Consensus Statement (2017). Should Cochrane apply error-adjustment methods when conducting repeated meta-analyses? https://methods.cochrane.org/sites/methods.cochrane.org/files/uploads/tsa_expert_panel_guidance_and_recommendation_final.pdf.

Fleming TR, DeMets DL, Roe MT, Wittes J, Calis KA, Vora AN, Meisel A, Bain RP, Konstam MA, Pencina MJ, Gordon DJ, Mahaffey KW, Hennekens CH, Neaton JD, Pearson GD, Andersson TL, Pfeffer MA, Ellenberg SS (2017). Data monitoring committees: Promoting best practices to address emerging challenges. Clin Trials. 14(2):115-123. doi: 10.1177/1740774516688915.

Haybittle JL (1971). Repeated assessment of results in clinical trials of cancer treatment. Br J Radiol. 1971 Oct;44(526):793-7. doi: 10.1259/0007-1285-44-526-793.

Imberger G, Gluud C, Boylan J, Wetterslev J (2015). Systematic reviews of anesthesiologic interventions reported as statistically significant: problems with power, precision, and type 1 error protection. Anesth Analg. 121:1611–22.

Imberger G, Thorlund K, Gluud C, Wetterslev J (2016). False positive findings in cumulative meta-analysis with and without application of trial sequential analysis: an empirical review. BMJ Open. 6(8):e011890.

Lan KKG, DeMets DL (1983). Discrete sequential boundaries for clinical trials. Biometrika 70:659–63.

McPherson CK, Armitage P (1971). Repeated significance tests on accumulating data when the null hypothesis is not true. J R Stat Soc [A] 134:15-25.

McPherson K (1974). Statistics: The problem of examining accumulating data more than once. N Engl J Med 290:501-502.

Mills E, Cooper C, Wu P, Rachlis B, Singh S, Guyatt GH (2006). Randomized trials stopped early for harm in HIV/AIDS: a systematic survey. HIV Clin Trials. 7(1):24-33. doi: 10.1310/FEED-6T8U-0BUG-6HQH.

Montori VM, Devereaux PJ, Adhikari NKJ, Burns KEA, Eggert CH, Briel M, Lacchetti C, Leung TW, Darling E, Bryant DM, Bucher HC, Schünemann HJ, Meade MO, Cook DJ, Erwin PJ, Sood A, Sood R, Lo B, Thompson CA, Zhou Q, Mills E, Guyatt GH (2005). Randomized trials stopped early for benefit: a systematic review. JAMA 294(17):2203-09.

O’Brien PC, Fleming TR (1979). A multiple testing procedure for clinical trials. Biometrics 35:549–56.

Peto R, Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N, McPherson K, Peto J, Smith PG (1976). Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br. J. Cancer 34 (6): 585–612.

Pogue J, Yusuf S (1998). Overcoming the limitations of current meta-analysis of randomised controlled trials. The Lancet 351:47-52.

Riberholt CG, Olsen MH, Milan JB, Hafliðadóttir SH, Svanholm JH, Pedersen EB, Lew CCH, Asante MA, Pereira Ribeiro J, Wagner V, Kumburegama BWMB, Lee ZY, Schaug J P, Madsen C, Gluud C (2024). Major mistakes or errors in the use of trial sequential analysis in systematic reviews or meta-analyses: the METSA systematic review. BMC Medical Research Methodology 24 (1):196. doi: 10.1186/s12874-024-02318-y.

Reynolds PS, Nagaraja CH (2024). A tribute to Peter Armitage. Chance 37:49-54.

Schöffski P (2021). Importance and role of independent data monitoring committees (IDMCs) in oncology clinical trials. BMJ Open 11(10):e047294. doi: 10.1136/bmjopen-2020-047294.

Thomas J, Askie LM, Berlin JA, Elliott JH, Ghersi D, Simmonds M, Takwoingi Y, Tierney JF, Higgins JPT (2024). Chapter 22: Prospective approaches to accumulating evidence [last updated October 2019]. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions version 6.5. Cochrane, 2024. Available from cochrane.org/handbook.

Thorlund K, Wetterslev J, Brok J, Gluud C (2005). Trial sequential analyses of six meta-analyses considering heterogeneity and trial weight. Cochrane Colloquium, Melbourne, Australia. https://abstracts.cochrane.org/2005-melbourne/trial-sequential-analyses-six-meta-analyses-considering-heterogeneity-and-trial.

Thorlund K, Devereaux PJ, Wetterslev J, Guyatt G, Ioannidis JP, Thabane L, Gluud LL, Als-Nielsen B, Gluud C (2009). Can trial sequential monitoring boundaries reduce spurious inferences from meta-analyses? Int J Epidemiol. 38:276–86.

Thorlund K, Engstrøm J, Wetterslev J, Brok J, Imberger G, Gluud C (2017). User Manual for Trial Sequential Analysis (TSA). Copenhagen Trial Unit, Centre for Clinical Intervention Research, Copenhagen, Denmark, free-ware available at www.ctu.dk/tsa.

Thorlund K, Engstrøm J, Wetterslev J, Gluud C (2021). Java Trial Sequential Analysis software ver. 0.9.5.10 Beta is available for free at http://www.ctu.dk/tsa/.

U.S. Department of Health and Human Services Food and Drug Administration, Center for Drug Evaluation and Research (CDER) and Center for Biologics Evaluation and Research (CBER) (1998). Adaptive Designs for Clinical Trials of Drugs and Biologics. Guidance for Industry. FDA 1-33. https://www.fda.gov/media/78495/download.

Wetterslev J, Thorlund K, Brok J, Gluud C (2008). Trial sequential analysis may establish when firm evidence is reached in cumulative meta-analysis. J Clin Epidemiol. 61:64–75.

Wetterslev J, Thorlund K, Brok J, Gluud C (2009). Estimating required information size by quantifying diversity in a random-effects meta-analysis. BMC Med Res Methodol 9:86.