Detection and evaluation of the effects of treatments were given impetus in the late 19th century by the arrival of some effective medical interventions, such as salicylic medicines, phenacetin, and diphtheria antitoxins (Morse 1878; Roux 1894; Sørensen 1896; Fibiger 1898; Haas 1983; Hróbjartsson 1998; Lafont 2007).
One remarkable commentator during this important period was the Danish physician Povl Heiberg (1868-1963) (Gluud 2008). The clarity of Heiberg’s 1897 article is very impressive, and it is difficult to imagine that it was written more than a century ago (Heiberg 1897). Heiberg comments on the fruitlessness of arguments among specialists whose opinions are based on weak evidence. Instead he proposes that scientific evidence about the effects of interventions should be sought using “statistical-therapeutic experiments, with the real aim of finding a safe and common therapy that every unbiased physician is obliged to give” to the patient (Heiberg 1897).
Heiberg acknowledges that such experiments cannot be applied to every clinical question; that physicians have difficulties in dealing with statistics; and that statistical methods can be misused (Heiberg 1897). But he advocates strongly the use of clinical trials and statistical tests before deciding whether interventions should be introduced in clinical practice, and he warns against embarking on a trial if one does not have access to a sufficient number of patients.
Heiberg makes it clear that he appreciates the dangers of systematic errors (‘biases’) as well as random errors (resulting from ‘the play of chance’) in clinical research (Heiberg 1897). The inspiration for his clear understanding of these two key components for obtaining reliable results from therapeutic experiments is likely to have come from the Danish mathematical statistician TN Thiele (1838-1910). Thiele deals with both types of errors in his 1889 book (Thiele 1889; Lauritzen 1999; Lauritzen 2007). Heiberg also realises that diseases may be subject to periodic changes, and that, when attempting to evaluate therapeutic effects, errors or erroneous judgements may result from insufficient diagnostic precision, and variations in age and other prognostic variables. Heiberg stresses the need for homogeneous patient populations and notes that the circumstances of hospital admission may affect what we would nowadays call ‘the severity spectrum’ (Heiberg 1897). He even recognises that bias may be introduced when rumours that a new intervention is successful are responsible for attracting, differentially, the interest of patients with less severe disease. When relying on historical controls, this leads to the false inference that the new intervention is responsible for the apparently improved outcome.
To avoid false conclusions resulting from such lopsided comparisons, Heiberg proposes that experimental interventions should be used in alternate patients, although he actually recommends treatment allocation by date of admission to hospital as a way of reducing selection bias. He also recognises that observer bias may arise from unblinded experiments, especially if a participant can observe the reaction of the patient next to him.
Heiberg is particularly forceful in arguing that patients assigned to standard treatment (control group) cannot be considered deprived of a chance of cure. He recognises that use of the specious thinking underlying such notions short-circuits the empirical process necessary for protecting patients from unrecognised adverse effects of experimental interventions, and hinders therapeutic progress. Accordingly, alternate allocation to new or standard treatments poses no ethical problem to him.
As well as covering Danish discussions on medical statistical thinking during the 1800s, Heiberg reviews its international historical development, referring to Bernoulli (1713), Laplace (1812), Poisson (1837), Bouillaud (1840), Gavarret (1840), Hirschberg (1874), Westergaard (1882), and Thiele (1889) (for references, see original or translation of Heiberg’s article). In particular, Heiberg stresses Gavarret’s pioneering role in introducing statistics into medicine (Gavarret 1840). As a result, Heiberg recognises that random error (from ‘the play of chance’) in small data sets can lead to mistaken inferences. Using numerous tables, he shows how this can happen, and why it is necessary to clarify how likely it is that differences between treatment comparison groups can be explained by the play of chance. A striking aspect is his clear conception of the implications of fluctuation greater than that predicted by binomial or Poisson variability (overdispersion). He notes that, whenever this occurs, it is necessary to examine the data for possible sources of heterogeneity, such as seasonal variation in the severity or incidence of a disease.
Like Gavarret two generations earlier, Heiberg refrains from explaining the mathematical tools he presents, and it is unclear what he has invented himself and how much he has adopted from his teachers. Who were Heiberg’s teachers? Heiberg had read Westergaard’s Statistikkens Theori i Grundrids (Fundamental Theory of Statistics) (1890), as well as the guidelines for medical statistics in Westergaard’s Die Lehre von der Mortalität und Morbilität (The Theory of Mortality and Morbidity) (1882). Heiberg also knew Sørensen’s Ledetraad for Læger ved statistiske Undersøgelser (Guidelines for Doctors in Statistical Analyses) (1889), which was acclaimed because of its clear and easy-to-understand presentation. As already mentioned, Heiberg was familiar with Thiele’s textbook Forelæsninger over almindelig Iagttagelseslære (Lectures on general observation theory) (1889). Thiele was an internationally renowned expert in mathematical statistics, who was well ahead of his time, and his books contain early formulations of analysis of variance and other novel statistical techniques (Schweder 1999; Lauritzen 1999; Lauritzen 2007).
What are Heiberg’s mathematical tools? Essentially those that follow from the use of the Gaussian approximation to calculate how frequently deviations of a certain magnitude can be expected, given standard binomial and Poisson situations. A warning is needed about Heiberg’s terminology at this point. When he speaks about “the law of large numbers”, he means precisely how frequently deviations of a certain magnitude can be expected, given standard binomial and Poisson situations (Heiberg 1897).
Heiberg takes it as an empirical matter whether the law holds, the envisaged alternative being fluctuations that are greater than predicted by binomial or Poisson variability (overdispersion). (In modern texts ‘the laws of large numbers’ are theorems about the mathematical conditions under which the distribution of an average has the usual asymptotic properties. This has nothing to do with the empirical question of whether a given data source is overdispersed.)
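The two ideas above can be illustrated with a minimal Python sketch. It is not Heiberg’s own calculation: the counts are invented for illustration, and the function names are hypothetical. The first function uses the Gaussian approximation to ask how often a binomial count should deviate from its expectation by at least a given amount. The second compares the observed scatter of several counts with the scatter that binomial variability would predict, using a Pearson chi-square divided by its degrees of freedom.

```python
import math

def prob_deviation_at_least(n, p, d):
    """Gaussian approximation to P(|X - n*p| >= d) for X ~ Binomial(n, p)."""
    sd = math.sqrt(n * p * (1 - p))
    return math.erfc(d / (sd * math.sqrt(2)))

def dispersion_ratio(deaths, patients):
    """Pearson chi-square divided by its degrees of freedom.

    Values near 1 are consistent with binomial variability;
    values well above 1 point to overdispersion.
    """
    p = sum(deaths) / sum(patients)  # pooled proportion dying
    chi2 = sum(
        (d - n * p) ** 2 / (n * p * (1 - p))
        for d, n in zip(deaths, patients)
    )
    return chi2 / (len(deaths) - 1)

# Invented example: deaths among patients in four successive periods.
deaths = [12, 30, 9, 25]
patients = [100, 100, 100, 100]

print(prob_deviation_at_least(100, 0.19, 8))
print(dispersion_ratio(deaths, patients))
```

In this invented example the dispersion ratio is far above 1, which is the situation in which Heiberg would look for sources of heterogeneity rather than trust the binomial calculation.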
A very sophisticated statistical analysis in Heiberg’s paper is one that involves an age-adjusted mortality comparison between two successive calendar years. First, each pair of age classes gives rise to a 2-by-2 table (year versus outcome) (shown on p 22 in the original and on p 15 in the translation). The table compares means using the expected number of deaths (indirect standardisation), with the proportion dying in the first year used to calculate the expected mortality in the second. This technique and terminology were introduced in the 18th century and are prominent in Westergaard’s writings (Keiding 1987). Niels Keiding is uncertain where Westergaard learned the standardisation technique and conjectures “that he picked it up during his visits to England” (around 1880) (Keiding 1987).
The calculated standard error that Heiberg presents in the 4th column of the calculation table (shown on p 22 in the original and p 15 in the translation) involves the added sophistication of taking into account that the ‘expected numbers’ themselves propagate an associated binomial uncertainty. So in the end, each age class is dealt with by its own 2-by-2 table analysis, summarised using a z statistic (= √χ²) in the final column. Furthermore, the deviations and their variances are summed (note that 14.2² = 3.4² + 11.5² + …), and a summary z of 6.4 is calculated as evidence of a difference in mortality. The inspiration for this is likely to have come from Westergaard’s book, but we have only found hints there as to how such an analysis should be done, and no fully worked example. Neither did we find any hints of such an analysis in Sørensen’s Guidelines for Doctors in Statistical Analyses (Sørensen 1889). Inspiration from the visionary Danish mathematical statistician Thiele could at most be indirect, as we have found no similar models or hints to such models in his books produced before (Thiele 1889; Lauritzen 2007) and after (Thiele 1903) Heiberg’s remarkable article.
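The structure of this calculation can be sketched in Python. This is a sketch under stated assumptions rather than a reconstruction of Heiberg’s table: the counts are invented, the function names are hypothetical, and the use of the pooled proportion in the variance is our assumption, chosen because it makes each age-class z equal to the square root of the ordinary Pearson chi-square for that 2-by-2 table.

```python
import math

def stratum(n1, d1, n2, d2):
    """Deviation and variance for one age class.

    n1, d1: patients and deaths in the first year
    n2, d2: patients and deaths in the second year
    """
    expected = n2 * d1 / n1        # expected deaths in year 2 at the year-1 rate
    deviation = d2 - expected
    p = (d1 + d2) / (n1 + n2)      # pooled proportion (assumption of this sketch)
    # Binomial variance of d2 plus the variance carried by the expected number.
    variance = n2 * p * (1 - p) * (1 + n2 / n1)
    return deviation, variance

def summary_z(strata):
    """Sum the deviations and the variances over age classes, then standardise."""
    results = [stratum(*s) for s in strata]
    total_deviation = sum(dev for dev, _ in results)
    total_variance = sum(var for _, var in results)
    return total_deviation / math.sqrt(total_variance)

# Invented data: (n1, d1, n2, d2) for three age classes.
age_classes = [
    (200, 40, 220, 30),
    (150, 45, 160, 35),
    (100, 50, 90, 40),
]

for s in age_classes:
    dev, var = stratum(*s)
    print(f"deviation {dev:.1f}, standard error {math.sqrt(var):.1f}, z {dev / math.sqrt(var):.2f}")
print(f"summary z {summary_z(age_classes):.2f}")
```

The summary standard error is the square root of the sum of the squared age-class standard errors, which is the relationship noted in the text for Heiberg’s own figures.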
In any case, Heiberg’s procedure is analogous to that proposed on somewhat intuitive grounds by Mantel and Haenszel more than 60 years later (Mantel and Haenszel 1959). This procedure has subsequently been given theoretical underpinning by many researchers (Kuritz 1988). Mantel and Haenszel do not refer to Heiberg’s work or to Westergaard’s work for that matter. We will be interested to see whether there is an earlier example than Heiberg’s of a confounder-adjusted 2-by-2 analysis equivalent to the Mantel-Haenszel procedure. Incidentally, Heiberg does not comment on the evidential impact of his z = 6.4 (we find 6.0), possibly because he did not have a Gaussian distribution table at hand. The observed z is actually equivalent to P ≈ 10⁻⁹.
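For readers who wish to check the order of magnitude, a z value can be converted into a two-sided Gaussian tail probability with the complementary error function from the Python standard library. This is a minimal sketch; the function name is hypothetical, and the two-sided convention is our assumption.

```python
import math

def two_sided_p(z):
    """Two-sided Gaussian tail probability for a standardised deviation z."""
    return math.erfc(abs(z) / math.sqrt(2))

# The recalculated value (6.0) and Heiberg's own value (6.4).
for z in (6.0, 6.4):
    print(f"z = {z}: P = {two_sided_p(z):.1e}")
```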
Heiberg’s vocabulary differed from present-day ‘statistical test’ vocabulary. For example, confidence limits were presented under ad hoc names, such as Gavarret’s ‘limits of oscillation’ (Gavarret 1840). Comparing Heiberg’s text with modern texts on clinical trials, one also notes a number of methodological caveats of which Heiberg was probably unaware. Heiberg did not seem to recognise the dangers of treatment allocation by date or simple alternation, where foreknowledge of upcoming allocations can result in selection biases. These dangers were not noticed until the 1930s. Bradford Hill, for example, became aware that an alternate allocation scheme had not been strictly observed in a Medical Research Council trial conducted in the early 1930s and that selection bias had thus probably undermined the validity of the comparisons made in the study (Chalmers 2003; Chalmers 2008). The consequences of inadequate generation of the allocation sequence and inadequate allocation concealment became fully recognised only about a century later (Schulz 1995; Moher 1998; Kjaergard 2001; Wood 2008). However, from a modern-day perspective, Heiberg nowhere erred, except when writing towards the end of his essay about calculating “the chances that this [observed] difference is real.” In this he makes the all too common slip of interpreting a P-value as the probability that the null hypothesis is false. Heiberg may well be excused for this, as the application of probabilistic reasoning about chance errors in medicine was still in its infancy.
This JLL Bulletin article has been co-published in Preventive Medicine 2009; 48:600-3.
We thank Iain Chalmers and Jan P Vandenbroucke for helpful comments and suggestions on earlier drafts.
Chalmers I (2003). Fisher and Bradford Hill: theory and pragmatism? International Journal of Epidemiology 32:922-924.
Chalmers I (2008). MRC Therapeutic Trials Committee’s report on serum treatment of lobar pneumonia, BMJ 1934. The James Lind Library (https://www.jameslindlibrary.org/).
Fibiger J (1898). Om Serumbehandling af Difteri [About serum treatment of diphtheria]. Hospitalstidende 6: 309-325 and 337-350.
Gavarret LDJ (1840). Principes généraux de statistique médicale: ou développement des règles qui doivent présider à son emploi. [General principles of medical statistics, or development of rules that should govern their use]. Paris: Bechet Jeune & Labé. The James Lind Library (https://www.jameslindlibrary.org/).
Gluud C (2008). Povl Heiberg (1868-1963). The James Lind Library (https://www.jameslindlibrary.org/articles/povl-heiberg-1868-1963/).
Haas H (1983). History of antipyretic analgesic therapy. Am J Med 75(5A):1-3.
Heiberg P (1897). Studier over den statistiske undersøgelsesmetode som hjælpemiddel ved terapeutiske undersøgelser [Studies on the statistical study design as an aid in therapeutic trials]. Bibliotek for Læger 89:1-40.
Hróbjartsson A, Gøtzsche PC, Gluud C (1998). The controlled clinical trial turns 100 years: Fibiger’s trial of serum treatment of diphtheria. BMJ 317:1243-1245.
Keiding N (1987). The method of expected number of deaths, 1786-1886-1986. International Statistical Review 55:1-20.
Kjaergard LL, Villumsen J, Gluud C (2001). Reported methodological quality and discrepancies between large and small randomised trials in meta-analyses. Ann Intern Med 135:982-989.
Kuritz SJ, Landis JR, Koch GG (1988). A general overview of Mantel-Haenszel methods: applications and recent developments. Ann Rev Public Health 9:123-160.
Lafont O (2007). From the willow to aspirin. Rev Hist Pharm (Paris) 55:209-16.
Lauritzen SL (1999). Aspects of T.N. Thiele’s contributions to statistics. http://www.stat.fi/isi99/proceedings.html (accessed 2 May 2008).
Lauritzen SL (2007). Thiele: Pioneer in Statistics. Oxford: Oxford University Press.
Mantel N, Haenszel W (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute 22:719-48.
Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, Tugwell P, Klassen TP (1998). Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet 352:609-613.
Morse N (1878). Ueber eine neue Darstellungsmethode der Acetylamidophenole. [On a new production method for acetylamidophenole]. Berichte der deutschen chemischen Gesellschaft 11: 232–233. doi:10.1002/cber.18780110151.
Roux ME, Martin ML, Chaillou MA (1894). Trois cents cas de diphtérie traités par le sérum antidiphtérique. [Three hundred patients treated with anti-diphtheria serum]. Ann Inst Pasteur 8: 640-662.
Schulz KF, Chalmers I, Hayes RJ, Altman DG (1995). Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 273: 408-412.
Schweder T (1999). Early statistics in the Nordic countries – when did the Scandinavians slip behind the British? http://www.stat.fi/isi99/proceedings/arkisto/varasto/schw0844.pdf.
Sørensen S (1896). Forsøg med Serumterapi ved Difteritis. [Trials on serum therapy for diphtheria]. Hospitalstidende 4: 621-628.
Sørensen T (1889). Ledetraad for Læger ved statistiske Undersøgelser. [Guidelines for Doctors in Statistical Analyses]. Published with support from the Danish General Medical Association. Supplement to Ugeskrift for Læger XX;6: 1-62. Kjøbenhavn: Bianco Lunos Kgl. Hof-Bogtrykkeri.
Thiele TN (1889). Forelæsninger over almindelig Iagttagelseslære. [Lectures on Studies of General Observations]. Referred to by Heiberg P (1897). Studier over den statistiske undersøgelsesmetode som hjælpemiddel ved terapeutiske undersøgelser. [Studies on the statistical study design as an aid in therapeutic trials]. Bibliotek for Læger 1897;89:1-40.
Thiele TN (1903). Theory of Observations. London: Layton. (Reprinted in Ann Math Statist 1931; 2:165-308).
Westergaard H (1882). Die Lehre von der Mortalität und Morbilität. [The Theory of Mortality and Morbidity]. Jena: Gustav Fischer Verlag.
Westergaard H (1890). Statistikkens Theori i Grundrids. [Fundamental Theory of Statistics]. København. (also published in German 1890. [Die Grundzüge der Theorie der Statistik.] Jena: Gustav Fischer Verlag.)
Wood L, Egger M, Gluud LL, Schulz K, Jüni P, Altman DG, Gluud C, Martin RM, Wood AJ, Sterne JA (2008). Empirical evidence of bias: Methodological quality and treatment effect estimates in controlled trials with different interventions and outcomes. Meta-epidemiological study. BMJ 336:601-605.