The role of probabilistic reasoning in medicine has been a source of controversy for millennia (Matthews 2020). This reflects its implications for such vexed questions as the extent to which medicine is an art or science, and the ability of insights from clinical trials to inform the treatment of individuals. By the mid-19th century, these issues had been joined by a third: the role of statistical analysis in assessing the effectiveness of therapies tested in patient studies (Tröhler 2020a, b). Ironically, while early advocates of such “numerical methods” claimed they brought objective clarity to clinical decisions, their principal effect was to provoke more debate, frequently fuelled by misunderstandings and ill-founded arguments. As we shall see, a key driver was the reliance of numerical methods on concepts drawn from the theory of probability, a branch of mathematics notoriously prone to misconceptions. When the debate went into abeyance in the 1880s the sceptics of numerical methods were in the ascendancy, where they would remain until the early decades of the 20th century (Matthews 1995).
The ceasefire followed criticism of a remarkable contribution to the debate published in 1877 by Carl Liebermeister (Ineichen 1994, Matthews 1995, Tröhler 2020a). Born in 1833 in Ronsdorf, Germany, Liebermeister studied medicine and held several senior posts, including Professor of Internal Medicine at Basel, until his death in 1901 (Baumberger 1980, Seneta et al. 2004, Tröhler 2020a). Liebermeister wrote many papers, not all on clinical topics; a list of 93 of his publications, originally compiled by his daughter Marie in 1919, can be found in Baumberger (1980, pp 159–69). His best-known contribution to medicine, the eponymous rule relating changes in body temperature to pulse rate, is still invoked today (eg Voet 2018).
Liebermeister’s controversial contribution to the debate over the role of statistical methods in medicine (Liebermeister 1877) is far less well-known – and, we will argue, undeservedly so. It takes the form of a 28-page paper entitled Über Wahrscheinlichkeitsrechnung in Anwendung auf therapeutische Statistik (“On Probability Theory Applied to Therapeutic Statistics”; henceforth ÜW). Published in 1877 in Sammlung Klinischer Vorträge (“Collection of Clinical Lectures”), edited by the distinguished German surgeon Professor Richard Volkmann, it attracted considerable attention at the time. Today it is all but forgotten except by medical historians. However, it remains remarkable in many respects, not least its relevance for current concerns about the use of statistical methods in the interpretation of clinical studies. Indeed, we will argue that ÜW contains the building-blocks of a paradigm shift in the use of statistics in medicine which would have better served clinicians than today’s predominant methodology.
The purpose of this article is to bring Liebermeister’s work and its significance to a wider audience. We describe its contents and discuss their implications, both for medical science as practised at the time and today. We also describe the reaction ÜW provoked among Liebermeister’s contemporaries, and in particular the supposedly damning critiques that left sceptics of numerical methods in the ascendancy for decades to come. We will show that these critiques rested on various misconceptions about the theory of probability held by authors lacking Liebermeister’s considerable ability in the subject. Finally, to further assist recognition of Liebermeister’s remarkable contribution to the history of medical science, we provide supplementary material online, including the first English translation of ÜW and a detailed technical commentary.
Liebermeister’s ÜW is his sole contribution to the long-running debate on the role of probability theory in medicine. This is somewhat surprising, given the quantitative turn taken by this debate in the early 19th century and Liebermeister’s considerable mathematical abilities, made strikingly apparent in ÜW. This shift had been prompted by claims concerning the efficacy of certain medical procedures made by French physicians Louis & Civiale (Tröhler 2020c) in the 1830s. Advocates of the use of probability theory argued that assessments of efficacy required more than just raw statistics such as simple counts of how many patients did or did not benefit, for if the numbers involved were inadequate, any apparent success could simply be the result of chance. Probability theory, it was argued, provided objective tools for assessing the risk of being thus misled.
The quality of the debate was constrained by the inadequate technical background of many of the participants. This was somewhat remedied in 1840 with the publication of Principes généraux de statistique médicale by the mathematically trained French physician Jules Gavarret (Gavarret 1840; see Huth 2008 and Tröhler 2020c for reviews). Gavarret’s methods drew heavily on results in probability theory published in 1837 by the distinguished French mathematician Siméon-Denis Poisson (1781 – 1840). In essence, these offered a means of assessing whether the difference in proportion of patients benefiting from different treatments was so large that it could not reasonably be ascribed to chance. This was achieved by a kind of “confidence interval” calculated from data using formulas based on Poisson’s work. The resulting limites d’oscillation, as Gavarret called them, gave the range of values within which the difference in proportion could be said to lie with a certain probability, P. If this range did not include zero – corresponding to no difference – then the possibility that the difference was merely a chance effect could be ruled out with probability P.
However, in developing his methodology, Gavarret faced the same question that had confronted mathematicians since the emergence of probability theory in the 17th century: what value of P is sufficiently compelling? This question had a direct bearing on whether medicine can be deemed “scientific”, given the long-standing view that this requires that its findings be both certain – i.e. P = 100% – and objective (Matthews 2020). Gavarret recognised that the former was unattainable, not least because it would require an infinite number of patients. This still left the problem of determining a value of P above which chance could reasonably be discarded as the explanation of a finding. In the absence of any obvious objective approach, Gavarret adopted the seemingly arbitrary value used by Poisson of 99.53%. This figure has its origins in nothing more profound than computational convenience and practicality (see Appendix 3). Perhaps recognising the potential criticism that this made the choice “unscientific”, Gavarret attempted to put forward a rationale for the adoption of 99.53% as a threshold above which chance effects could safely be ruled out. It rested on what he called a “general principle” that (Gavarret 1840 p 258):
“from the moment a test observer has arrived at a high degree of probability relative to the existence of a fact, he can use it as if he were absolutely certain” [emphasis added]
To this he added two criteria to be met by any specific choice of probability threshold: firstly, that it be “high enough to leave no doubt”, and secondly, that it not require “too large a number of observations” in order for findings to exceed the probability threshold. As to its specific value, Gavarret simply resorted to an appeal to authority: 99.53% should be adopted because that was the value chosen by Poisson, “an authority which doubtless no-one will try to dispute the importance of”. Gavarret concluded by claiming that:
“It is conceivable, in fact, that it would be unreasonable to raise the slightest doubt as to the existence of an event, when there is a bet of 212 against 1 whether it has occurred or not”.
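The arithmetic linking the 99.53% threshold to Gavarret's bet can be checked in a few lines. The sketch below assumes that the threshold corresponds to the error-function integral evaluated at 2 (roughly 2.83 standard errors), which is a common reading of Poisson's figure but is not stated in the passage above; the helper name odds_against_chance is purely illustrative.

```python
import math

def odds_against_chance(p: float) -> float:
    """Convert a probability P that chance is excluded into odds of P to (1 - P)."""
    return p / (1.0 - p)

# Assumed origin of the threshold: erf(2) = (2/sqrt(pi)) * integral from 0 to 2 of exp(-t^2) dt
poisson_threshold = math.erf(2.0)

print(f"erf(2)         = {poisson_threshold:.4f}")
print(f"odds at 99.53% = {odds_against_chance(0.9953):.0f} to 1")
print(f"odds at 95%    = {odds_against_chance(0.95):.0f} to 1")
```

Rounded to the nearest whole number, the odds at 99.53% come out at 212 to 1, matching the bet Gavarret describes, whereas a 95% threshold corresponds to odds of only 19 to 1.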
If Gavarret’s attempt to give this threshold some objective basis seems unconvincing, it is because there can be no such basis, a fact insufficiently appreciated in relation to today’s de facto standard of 95% (Kennedy-Shaffer 2019), and indeed the (remarkably similar) threshold of 99.50% now advocated by some in connection with the so-called “replication crisis” (Benjamin et al 2018). This was picked up by the German physiologist Adolph Fick, who noted that the 99.53% threshold “does not at all have intrinsic reasons” (Matthews 1995 p 53). He did not suggest an alternative, recognising perhaps both the impossibility of justifying one and the dangers of highlighting this “unscientific” aspect of Gavarret’s approach, of which he was otherwise supportive.
A more practical concern even among supporters of Gavarret’s approach was that its probability threshold appeared to demand the use of very large numbers of patients. This prompted the mathematically-trained German clinician Julius Hirschberg (1843 – 1925) to publish revised formulas involving the less demanding 91.6% confidence level (Hirschberg 1874). His justification for reducing the threshold was perfectly reasonable: findings failing to meet Gavarret’s standard still contained valuable evidence. However, Hirschberg’s precise choice of threshold was apparently dictated by the patently subjective desire to lend credibility to a method of removing cataracts developed by his one-time teacher Albrecht von Graefe (Matthews 1995 pp 53-56).
It was against this background that Liebermeister entered the debate with the publication of ÜW in 1877, offering a far more radical solution to the problem of arbitrary thresholds.
Liebermeister’s remarkable achievement
Liebermeister’s ÜW opens with the bold statement that “practical doctors” such as himself had often shown “irrefutably” that making conclusive assessments of treatments requires the use of probability theory. The reference to “practical doctors” is telling, as it seems directed at sceptics who might otherwise dismiss Liebermeister as one of the “strangers at the bedside” – i.e. mathematicians – seeking to challenge the physician’s role (Tröhler 2021). He then sets out the specific goal of his paper:
“[…] to examine how large the probability is that the observed differences in success are not simply due to chance. And for this question only probability calculus gives the necessary indication.”
While insisting that this is a crucial first step in assessing a therapy, Liebermeister is careful not to over-claim, stressing that the theory of probability “is completely incapable” of determining the actual cause of the difference; that remains “a matter for clinical analysis”. He then goes on to concede that the reason probability theory had failed to become part of the clinician’s toolkit is not because its importance had not been recognised, but because the available theory “has so far been too incomplete and inconvenient”. Liebermeister credits Poisson as having developed the requisite theory, but observes that the simplifying assumptions underpinning the resulting formulas come at a heavy price: the need for “hundreds and often many thousands of individual observations”. This, he adds, would raise many practical problems. He then makes a key observation: that the need for such large numbers has been routinely accepted by the medical community to the point of becoming
“…an unshakeable dogma that series of observations which do not consist of very large numbers cannot prove anything at all, that it is unscientific to want to draw conclusions from small numbers”. [Emphasis added].
But, asks Liebermeister, is this assumption actually valid? He cites the possibility that even a small study of a highly effective treatment – such as quinine with malarial patients – could still produce strong evidence of efficacy. Liebermeister then argues that the failure of the existing theory to cope with small numbers suggests it is still “highly imperfect”, and that mathematicians “have not yet provided us” with tools for such situations. He points out that effect sizes would obviously need to be substantial for a small trial to rule out chance to a given standard of probability. Even so, there would be situations where even small studies could rule out chance with high confidence. Liebermeister took the strikingly modern view that all well-conducted studies, whatever their outcome, should be allowed to contribute to the cumulative evidence for the reality of an effect:
“Truly, we are not so rich in gold coins in the empirical foundations of therapy that we could be advised to throw all silver coins into the water! And a handful of silver coins is often worth more than a single gold coin.”
Such gathering of evidence of varying strength is redolent of today’s notion of the systematic review (eg Chalmers & Altman 1995). Liebermeister’s prescience is, however, most striking in his identification of what remains arguably the principal barrier to making the most of clinical data: the use of probability thresholds to decide “what works”. He argued that Poisson/Gavarret’s threshold of a 99.53% probability and Hirschberg’s less demanding 91.6% were both arbitrary, and that both squandered the full power of probabilistic reasoning in assessing clinical evidence – an argument that remains valid for today’s de facto threshold of 95%. Liebermeister called for this simplistic “pass/fail” approach to be replaced by the calculation for each study of the exact probability that its findings represent a genuine effect:
“In fact, if the probability calculus is to be applied to the assessment of therapeutic results with benefit and in an extensive way, then it is necessary…. that one can calculate with certainty and accuracy for each available observation material with which degree of probability chance is excluded. Only when this is possible can we use all our series of observations in a scientific way, by giving each of them exactly the value it deserves.”
Liebermeister then proceeds to develop what had eluded such giants of statistical methodology as Poisson: a means of reliably calculating this probability for any size of study. To achieve this remarkable result, Liebermeister used an approach that allowed him to avoid the approximations that had undermined reliable inferences from very small trials. In doing so, he preceded Fisher’s development of the celebrated Exact Test (Fisher 1934 Section 21.02) by more than 50 years. Moreover, Liebermeister’s approach is free of interpretational issues which bedevil the use of standard statistical methods to this day.
Liebermeister’s formula and its application
Liebermeister’s formula and its derivation can be found in the German original and English translation of ÜW; see supplementary material Appendix 1 and Appendix 2 respectively. Appendix 3 carries a technical commentary setting the mathematics in historical context. Despite the considerable complexity of the derivation and the formula, a brief non-technical description of each is vital to a proper appreciation of Liebermeister’s achievement and its implications.
First, in keeping with the practice of the time, Liebermeister worked within the so-called Bayesian inference paradigm, in which data are transformed into insight via Bayes’s Theorem. Published posthumously in 1764 and named after the English clergyman-mathematician Thomas Bayes, the theorem allows an initial level of belief in a hypothesis – its so-called prior probability – to be updated in the light of newly-acquired data, resulting in a new level of belief, or posterior probability (for a non-technical explanation see eg Matthews 2017 pp 135-146). Liebermeister was thus seeking a means of allowing even small amounts of data to update an initial level of belief about a treatment. The means of doing this is today known as a likelihood ratio, and it gives the relative chances of observing the results obtained on the basis of each hypothesis under test. In Liebermeister’s case, there were just two hypotheses – that the treatment was genuinely effective or not – with the evidence taking the form of the relative numbers of treated and untreated patients who recovered in their respective groups. The greater the difference in these relative numbers, the less likely mere chance could have been responsible. But how much less likely? Liebermeister needed a means of capturing the effect of chance in such a comparison, and turned to the time-honoured mathematical model of black and white balls in urns. This allowed him to arrive at a formula for the probability that mere chance could have led to more white balls being plucked at random from the “treatment” urn compared with the “control” urn.
The derivation is a demanding exercise in advanced mathematics based on Bayes’s Theorem (see online supplementary material, Appendix 3), but leads to the desired outcome: a formula giving the probability that the treatment under test is effective, given the data from a clinical study. It should be stressed that Liebermeister’s use of Bayes’s Theorem ensures that this (posterior) probability is free of the counterintuitive interpretation of p-values that remain widely used in the assessment of study outcomes.
Contrary to common perception, standard (two-sided) p-values bear no simple relationship to the probability of chance accounting for the outcome, and are notoriously misleading (see eg Wasserstein & Lazar 2016). In contrast, the Bayesian posterior probability is exactly what it appears to be: the probability that the treatment is effective, given the observed outcomes. Moreover, Liebermeister’s derivation has the remarkable feature of leading to a formula applicable to all sample sizes. This is not the case for Poisson’s formula, which is based on an assumption which is only valid for large sample sizes (see Appendix 3). As such, Liebermeister had preceded by over 50 years the work of the celebrated statistician Ronald Aylmer Fisher (1890 – 1962), whose well-known Exact Test is widely used to assess differences between small samples – albeit via the problematic concept of p-values (Fisher 1934 Section 21.02).
Liebermeister had no reason to mention the interpretational benefits of his approach over p-values, as these are part of an inferential paradigm that came to prominence after his death. Known as Null Hypothesis Significance Testing (NHST), it supplanted the Bayesian approach for reasons beyond the scope of this article (see eg McGrayne 2011 chapter 3). Instead, Liebermeister stressed what was, at the time, the critical advantage of his probability formulas: their probabilistic reliability regardless of sample size. He proceeded to demonstrate this with a set of worked examples, including some based on real clinical studies. These allowed him to highlight the ability of the formulas to extract valuable insight even from small studies. They include the case of quinine and malarial patients, where Liebermeister reports a study (probably his own) involving two groups of 12 patients: 10 of those treated with quinine had become free of fever three days later, compared with just 2 of those left untreated. Applying his formula, Liebermeister calculated the odds against so large a difference being a fluke as 1666 to 1 (99.94%) – thus confirming his claim that small studies can nevertheless produce compelling evidence if the effect size is sufficiently large.
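The quinine figures can be reproduced with a short exact calculation. The sketch below is not a transcription of Liebermeister's formula; it assumes independent uniform priors on the two recovery rates, in line with the convention described in the text, and evaluates the resulting posterior probability in closed form using exact fractions. The function name prob_no_benefit is hypothetical.

```python
from fractions import Fraction
from math import comb, factorial

def prob_no_benefit(a: int, b: int, c: int, d: int) -> Fraction:
    """Posterior probability that the untreated recovery rate exceeds the treated one.

    a, b: recovered / not recovered among treated patients
    c, d: recovered / not recovered among untreated patients
    Assumes independent uniform priors on both rates (an assumption of this sketch).
    """
    n1, n2 = a + b, c + d
    # The treated rate x has posterior density (n1+1)!/(a! b!) * x^a * (1-x)^b.
    norm = Fraction(factorial(n1 + 1), factorial(a) * factorial(b))
    total = Fraction(0)
    for k in range(c + 1):
        # P(untreated rate > x) = sum over k <= c of C(n2+1, k) x^k (1-x)^(n2+1-k);
        # each term integrates against the treated density in closed form.
        integral = Fraction(
            factorial(a + k) * factorial(b + n2 + 1 - k),
            factorial(n1 + n2 + 2),
        )
        total += comb(n2 + 1, k) * norm * integral
    return total

p_chance = prob_no_benefit(10, 2, 2, 10)   # 10 of 12 treated vs 2 of 12 untreated
print(f"probability treatment is better: {float(1 - p_chance):.4f}")
print(f"odds: about {float((1 - p_chance) / p_chance):.0f} to 1")
```

Under these assumptions the probability comes out at 99.94% with odds of about 1662 to 1. The quoted 1666 to 1 is what results if the two probabilities are first rounded to 0.9994 and 0.0006.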
In another telling example, Liebermeister examines the assertion that small differences in relative proportions provide no evidential weight of an effect, his aim being
“…to show in what striking way the meaning of the formulas used by representatives of medical statistics has been misunderstood by them….”
This appears to be a direct criticism of Hirschberg (1874), who gave the example of two groups of 300 patients with the same disease, where the mortality in one group is 22%, compared with 16% in the other. Hirschberg argued that given that the difference is just 6%, the true mortality rate is very likely to be the same. Liebermeister disputes this, insisting that “the general practitioner” would undoubtedly regard the difference as real. He then shows that even using Poisson’s original formulas – which he states are “not very exact for such small numbers” – the probability against the difference being a fluke is 93.97%, or odds of over 15 to 1. While this fails to pass the (arbitrary) 99.53% probability threshold promoted by Gavarret and his supporters, Liebermeister argues that odds of 15 to 1 are
“….certainly not meaningless. It will depend to a large extent on other circumstances and considerations [such as] whether one wishes to consider them sufficient to take an important decision in relation to future treatment or anything similar”.
Liebermeister then uses his own formula, and finds a probability of 96.91%, adding that this gives “an even somewhat larger significance” than the standard theory. For technical reasons, this is not quite correct; for an explanation, see example 6 in the Applications section of Appendix 3. Nevertheless, Liebermeister is making several important points here. First, he is highlighting the dangers inherent in using thresholds to decide whether to accept a specific finding as genuine. In the real world, such decisions are rarely clear cut, but depend on context. As such, they require a more nuanced approach than simply “pass/fail”, and this is what Liebermeister’s formula provides: the probability that effectiveness has been demonstrated by the study in question, thus allowing it to be assessed on its own merits. This, in turn, has clear practical value: the ability to base actions on probabilities. Liebermeister illustrates this by assuming that the higher mortality rate was observed in an ordinary hospital, while the lower rate came from identical conditions in a barrack hospital. He then asks whether the evidence is strong enough to justify building another barrack hospital, arguing that this ultimately depends on the associated cost:
“Where the construction of the barracks would be easy to carry out, one would probably proceed without question to that result. Where, on the other hand, there would be particular difficulties and inconveniences connected with it, and there would be no urgent need for a careful decision, it would be preferable to wait and see whether further observations would increase or decrease the probability.”
This is an example of an informal decision analysis combining the cost of different decisions with the available probabilistic evidence. It is remarkable that Liebermeister considers the possibility of collecting additional data to obtain more reliable evidence on the existence of a difference between the two groups. This highlights another key feature of Liebermeister’s approach to inference: recognition of the limits of inference based on probability theory. Even if the observed mortality rates had ruled out chance to better than the 99.53% standard adopted by advocates of probabilistic analysis, this means only that chance is highly unlikely to account for the difference. The true cause remains to be identified – a key point often overlooked to this day.
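The large-sample figure quoted for Hirschberg's example of 66 versus 48 deaths in groups of 300 can be checked with a modern normal approximation. This is a stand-in for the Poisson-based formulas, not a reproduction of them, and the function name is hypothetical. The one-sided value is included only for comparison; that it bears on the discrepancy with the 96.91% from Liebermeister's own formula is a conjecture, since the text defers the explanation to Appendix 3.

```python
import math

def large_sample_probabilities(deaths_1, n_1, deaths_2, n_2):
    """Normal-approximation probabilities that a difference in proportions is not due to chance.

    Returns (two_sided, one_sided). A modern stand-in for the large-sample
    formulas discussed in the text, not a transcription of them.
    """
    p1, p2 = deaths_1 / n_1, deaths_2 / n_2
    variance = p1 * (1 - p1) / n_1 + p2 * (1 - p2) / n_2
    u = abs(p1 - p2) / math.sqrt(2 * variance)   # argument of the error function
    two_sided = math.erf(u)                      # P(|chance difference| < observed)
    one_sided = 0.5 * (1 + math.erf(u))          # P(chance difference < observed)
    return two_sided, one_sided

two_sided, one_sided = large_sample_probabilities(66, 300, 48, 300)
print(f"two-sided: {two_sided:.4f}, odds {two_sided / (1 - two_sided):.1f} to 1")
print(f"one-sided: {one_sided:.4f}")
```

The two-sided value reproduces the 93.97% and the odds of over 15 to 1 cited above. The one-sided value is about 97.0%, close to the figure from Liebermeister's formula.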
Limitations of Liebermeister’s method
Liebermeister was well aware that the validity of his method depends on additional assumptions. He states at the beginning of his article that it is important to make sure that the groups to be compared do not differ with respect to important characteristics:
“Certainly, with the accomplishment of this mathematical and formal part, our task is far from being completed. Rather, the question then arises as to whether the two series of observations, in which the difference in success occurred with different treatment, can really be regarded as comparable in every other respect. There might have also been a decisive change in the character of the disease, in the intensity of the cause of the disease, whether a change in the various other moments, on which the outcome of the disease may depend, has not caused the differences in the observed success.”
Concerns about non-comparability of groups were common in the 19th century according to Morabia (2011). Vandenbroucke (2002) describes the work by Mill (1843) and by Claude Bernard (Bernard 1865) as the first accounts discussing the problem of comparability. It is noteworthy how precisely Liebermeister describes the possibility that potential confounders may have caused an observed difference between groups.
In example 5 he compares mortality rates among patients with acute pneumonia in a hospital in Basel, Switzerland. He compares patients treated with antipyretic methods to historical controls without that treatment. He remarks that
“through precise clinical analysis it was established that the two series of observations were comparable in every other regard”
Similarly, in example 6 he compares the mortality rates 66/300 and 48/300, an example taken from Hirschberg (1874). Liebermeister argues that there is moderate, but not overwhelming evidence for a true difference in mortality between the two groups. He then asks what factors may have contributed to the reduction in mortality:
“The question, of course, as to what is the cause of the reduction in mortality, whether a possible difference in treatment or a change in the nature of the epidemic or any other change in circumstances, is not a matter for mathematical analysis, but for clinical analysis.”
These examples show that Liebermeister was well aware of the problem of confounding and that there remains a role for clinical analysis in identifying the true cause of any difference in efficacy. The solution is now known to lie not in “clinical analysis” but in experimental design, specifically the use of randomised allocation. The power of this methodology to counteract bias was only starting to be recognised by the time of Liebermeister’s death in 1901 (Hróbjartsson, Gøtzsche & Gluud 1998; Chalmers et al. 2011).
Liebermeister concludes the substantive part of ÜW by pointing out that the formulas he has derived “are not only applicable to therapeutic statistics, but also to a large number of other problems in probability calculus”. History records that while both statements are true, Liebermeister’s remarkable achievement was destined to be forgotten even within clinical medicine until long after his death. Likely explanations for this will be explored in the next section.
ÜW also includes two appendices covering technical points and giving a more detailed derivation of the formulas. The first appendix deals with a key issue confronting anyone using Bayes’s Theorem to turn data into insight. In essence, the theorem shows how a prior level of belief expressed as a probability should be updated in the light of data, producing a posterior probability. But how should that prior level of belief be set? This question has dogged Bayesian inference since its emergence over 250 years ago. Liebermeister’s solution was to use a convention widely applied at the time (and since), and which assumed a complete lack of prior insight about the possibility that a finding could be due to chance. Known technically as a “non-informative” uniform prior probability distribution, this assumption greatly simplifies the derivation (see Appendix 3). However, Liebermeister was well aware that other choices could be made:
“When dealing with tasks concerning the so-called posterior probability, it is not uncommon to be under the illusion that one is approaching the observations without any preconditions. In reality, this is never the case and naturally cannot be the case”.
For example, there may be pre-existing studies that suggest a given treatment is effective; Bayesian inference allows these to be taken into account using prior probabilities. Liebermeister was also aware that prior beliefs can greatly influence the outcome of a Bayesian calculation. This is especially true with the levels of evidence typical of small studies – the focus of Liebermeister’s attention – as their relatively weak evidential weight may barely alter prior beliefs. Liebermeister mentions that the balls-in-urns model can accommodate other priors, but he gives no mathematical details. As we shall see, this led to criticism of his entire approach, on the grounds that the assumption of complete prior ignorance would often lead to conclusions inconsistent with those based on “common sense” beliefs of physicians (Korteweg 1877). Yet it must also be admitted that such “common sense” has a chequered record in the history of medicine. It is unclear whether Liebermeister developed the necessary mathematical detail, or rebutted the criticism in general terms. What is known is that the addition of “informative” priors to Liebermeister’s model is far from trivial. The full theory was only developed long after his death by Altham (1969), unaware of the existence of Liebermeister’s pioneering work (Altham, pers. comm. 2020).
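The sensitivity of small studies to the choice of prior can be illustrated by simulation. The sketch below is not Altham's theory, nor anything found in ÜW: it sidesteps the analytical difficulty by drawing samples from Beta posteriors. The Beta(6, 6) prior is an arbitrary, hypothetical choice standing in for prior knowledge that recovery rates in both groups lie near one half, and the function name, seed and number of draws are likewise assumptions.

```python
import random

def prob_treatment_better(recovered_t, n_t, recovered_c, n_c,
                          prior_a=1.0, prior_b=1.0, draws=100_000, seed=1877):
    """Monte Carlo estimate of P(treated rate > control rate | data).

    prior_a, prior_b: parameters of a Beta prior applied independently to both
    recovery rates; (1, 1) is the uniform prior. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_t = rng.betavariate(prior_a + recovered_t, prior_b + n_t - recovered_t)
        rate_c = rng.betavariate(prior_a + recovered_c, prior_b + n_c - recovered_c)
        wins += rate_t > rate_c
    return wins / draws

# Quinine example from the text: 10 of 12 treated vs 2 of 12 untreated recovered.
uniform = prob_treatment_better(10, 12, 2, 12)
informed = prob_treatment_better(10, 12, 2, 12, prior_a=6.0, prior_b=6.0)
print(f"uniform prior:    {uniform:.4f}")
print(f"Beta(6, 6) prior: {informed:.4f}")
```

With only 12 patients per group, the hypothetical informative prior pulls the posterior probability noticeably below the value obtained under prior ignorance. With several hundred patients per group the same prior would make little difference.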
The critical response to Liebermeister’s ÜW
Liebermeister’s contribution to the debate over probabilistic reasoning in clinical medicine was exceptional; the response was not. Very soon after the publication of ÜW, critical reviews appeared revealing yet again the gulf separating the statistical and clinical communities.
Hirschberg (1877; original German text and English translation available in supplementary material) drew on his unusual combination of mathematical expertise and clinical experience to critique Liebermeister’s most impressive claim: that his formulas extract inferential value from even small trials. Hirschberg’s concern was that physicians would be bamboozled by Liebermeister’s arcane formulas into over-interpreting their output. As evidence, he cited worked examples in ÜW in which the formulas gave huge odds against the outcome of a clinical study (of an antipyretic) being merely a chance effect, in one case exceeding trillions to one. Hirschberg pointed out that this is vastly greater than estimates of the probability the sun will rise tomorrow, given it has done so for at least the last 5,000 years. “Are we really to believe”, he asks, “that the superiority of [the antipyretic therapy] has a tremendously greater degree of probability than the return of daylight?”. After pointing to real-world constraints that can undermine such apparent certainty, such as accurate diagnosis and clear end-points, Hirschberg makes clear his belief that probability theory can best serve clinicians as a guide to plausible levels of efficacy, not as dichotomous proof. He argued that a confidence interval (“Territorium der Chance”) of the kind advocated by the French clinician Gavarret (Tröhler 2020c) could help clinicians make reasonable decisions about efficacy. While these standards might be somewhat arbitrary, they suggest Hirschberg envisaged a classification system for strength of evidence, based on which (if any) of these standards are met by a specific study outcome.
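Hirschberg's sunrise comparison can be made concrete under one assumption: that the estimate he had in mind is the classical rule of succession, which gives a probability of (n + 1)/(n + 2) for one more success after n successes without failure. The passage does not name the rule, so this is a guess at his reference point, and the day count is a rough figure that ignores leap days.

```python
from fractions import Fraction

def rule_of_succession(successes: int) -> Fraction:
    """Probability of one more success after an unbroken run, under a uniform prior."""
    return Fraction(successes + 1, successes + 2)

days = 5000 * 365                      # roughly 5,000 years of sunrises
p_sunrise = rule_of_succession(days)
odds = p_sunrise / (1 - p_sunrise)     # exact odds in favour of another sunrise
print(f"odds of sunrise tomorrow: about {float(odds):,.0f} to 1")
```

On this reading the odds in favour of tomorrow's sunrise are under two million to one, far short of the trillions to one that Liebermeister's formulas assigned to the antipyretic result.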
He then turns to Liebermeister’s remarkable claim to have found probability formulas whose reliability does not depend on trial size. Hirschberg expresses no doubts about their mathematical validity, and emphasises that he does not dispute the value of small studies, or even individual cases. Rather, his concern was that when used in such cases, the formulas are sensitive to practical issues such as the choice of clinical end-point and the impact of stopping a study early, in what would now be called interim analyses. In summary, Hirschberg’s critique reflects a concern that the application of apparently sophisticated probability theory can be undermined by inadequate data: “[I]t would not be desirable to clothe the cautious empirical groping by calculation with a shining semblance of higher certainty”. Such concerns still resonate almost 150 years later.
Similar sentiments appear in a more vituperative critique of ÜW published the same year by Johannes Korteweg (1851 – 1930), a young Dutch surgeon based in Leiden (Korteweg 1877). In a footnote, Korteweg acknowledges help from his older brother Diederik, a gifted mathematician later celebrated for his work on the theory of waves. Like Hirschberg, however, the focus of Korteweg’s concerns is not the formal derivation of Liebermeister’s formulas, but their reliability as a source of clinical insight.
Like Hirschberg, Korteweg points out that reliable conclusions about genuine efficacy based on probability theory require that “[T]he series of numbers to be compared are also comparable in all respects – that is, that the disease character of the cases in both series was the same, the cause of the disease was equally powerful, etc”. While Liebermeister acknowledges this constraint in ÜW, Korteweg insists it is far from trivial, describing it as one of “…the sources of errors which even someone like Liebermeister cannot avoid when applying the method of the exact sciences to a subject which is not yet entitled to the name “exact”.”
This leads Korteweg to take issue with Liebermeister’s starting premise: that clinical studies can be modelled as equivalent to the random selection of coloured balls from urns. It is notable that neither Korteweg nor, presumably, his mathematically inclined brother questioned whether this was an appropriate model for the clinical studies of the time, given that randomisation played no explicit role in their execution (and would fail to do so until well into the 20th century). Instead, Korteweg focuses on another, nonetheless important, difference between the probabilistic model and clinical reality: the former assumes nothing is known about the relative proportions of the different colours in each urn. Korteweg points out that many clinical studies involve diseases about which much is known concerning mortality rates and treatment efficacy, at least approximately. Liebermeister’s formulas have no place for such insights, including those based on what Korteweg calls “common sense” about treatments which “cannot be brought under a mathematical calculation”. Korteweg is here alluding to the problem of the incorporation of prior insight – especially of a subjective nature – in the Bayesian approach to assessing new evidence. He was wrong to imply Liebermeister was unaware of the problem: Liebermeister explicitly deals with it, albeit discursively, in Supplemental Note 1 to ÜW. Korteweg is nevertheless correct to warn that such prior evidence can outweigh that produced by single clinical studies, especially those involving small numbers – precisely the kind Liebermeister had brought within reach of probabilistic analysis. Used in isolation, the apparently impressive odds of effectiveness produced by the formulas could lead to what Korteweg calls the “smothering” of insight from other sources.
He then turns to what he believes is Liebermeister’s selective use of published studies of new therapies to demonstrate the value of his formulas, warning of the dangers of what is now termed publication bias:
“[C]ommon sense teaches that 100 new drugs are tried without results when 10 have been announced with results. The truth calculus shows that of 100 new but indifferent drugs, 50 happen to be worse, of the latter 50, some, e.g. 10, will happen to give a result that seems worth mentioning. These 10 will be made known and will be subject to further investigation. Liebermeister selects from these 10 the one that seems to be the best by chance and bets 100 to 1 that the good result of this drug is not due to chance alone…”
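Korteweg’s selection argument can be illustrated numerically. The sketch below assumes that each indifferent drug independently has a 1% chance of producing a result that looks convincing at odds of roughly 100 to 1. Both the 1% figure and the function name are illustrative assumptions, not Korteweg’s own calculation.

```python
def prob_at_least_one_fluke(n_drugs, alpha):
    """Chance that at least one of n_drugs ineffective drugs clears the threshold.

    Assumes the trials are independent and that each ineffective drug
    clears the threshold by chance with probability alpha.
    """
    return 1 - (1 - alpha) ** n_drugs


# Assumption: odds of "100 to 1" treated as roughly a 1% chance of a fluke.
alpha = 0.01

best_of_10 = prob_at_least_one_fluke(10, alpha)
best_of_100 = prob_at_least_one_fluke(100, alpha)

print(f"At least one apparent success among 10 indifferent drugs:  {best_of_10:.3f}")
print(f"At least one apparent success among 100 indifferent drugs: {best_of_100:.3f}")
```

Under these assumptions, picking the best of 100 indifferent drugs yields an apparently convincing result more often than not, so odds quoted for the selected drug alone overstate the evidence.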
Korteweg ends his critique with a stinging – and prophetic – final paragraph:
“It is to be hoped that Liebermeister’s treatise will be one of the most useful editions of the Klinischer Vortrage Sammlung and that it will achieve exactly the opposite of what it is intended to do. Its purpose was to push the often erroneous subjective judgment of the physician into the background and to replace it with numbers that would guide, if not overwhelm, the so easily mistaken mind. After the failed attempt made by Liebermeister, everyone will consider it more desirable for the time being to place common sense above scholarship, above facts, but above all above faith in numbers”. [Emphasis in original].
The last substantive contemporary critique of ÜW appears to have come from a young German military physician, Friedrich Martius (1850 – 1923), and forms part of his review of the role of probability theory in clinical medicine (Martius 1881). This had already been the focus of an earlier work (Martius 1878) in which he argued that probabilistic methods are a “makeshift necessity” lacking the robustness needed to cut through the complexities of clinical observation and achieve the goal of making medicine “scientific”. Instead, Martius claimed that simple, empirical induction based on experiment and observation is the surer path to reliable insight. In his follow-up critique Martius sought to clarify “the still very much mistaken logical foundations of statistics and probability calculus” applied in medicine, including the work of Liebermeister.
The result was a protracted curate’s egg of a review combining both familiar and novel criticisms with mathematical errors. Martius starts by conceding that statistics – i.e. raw data such as patient numbers – and probability per se are essential tools in the drive to make medicine scientific. The dispute, he states, is about “the scope of this source of knowledge” (p 337). He takes issue with Liebermeister’s claim that the chief barrier to the use of probabilistic methods is just the practical difficulty of applying the formulas – including the need for large numbers of patients. Instead, he reiterates his belief in the superiority of experimental induction, and then repeats the now-familiar doubts about modelling patient outcomes via the extraction of balls from urns, claiming that if the arguments for this approach were valid, physicians could only reject its adoption on the grounds of “their mathematical ineptitude” (which Martius inadvertently demonstrates by bungling simple probability calculations (pp 352, 362)).
Martius then turns to what he sees as compelling arguments against those claiming probabilistic methods can make medicine “scientific”. First, he picks up on the different probability thresholds advocated by Poisson/Gavarret and Hirschberg for ruling out chance effects (Held & Matthews 2022). While Fick (1866; Matthews 1995) had already noted that the Poisson/Gavarret threshold of 99.53% was based principally on practical constraints, Martius went further, arguing that Hirschberg’s suggestion of a lower probability threshold revealed the “arbitrariness and uncertainty of the whole procedure”. Nor was he impressed by having a means of extracting insight even from very small clinical trials. He declares (p 376): “In fact, I believe that one cannot ignore the conviction that the Liebermeister formulas, though more practically applicable, are more unscientific than those of Poisson”.
Part of the reason seems to be Liebermeister’s elimination of the need for any threshold for deciding when an effective therapy has been identified, such as the 99.53% adopted by Poisson/Gavarret. Liebermeister believed this would encourage the accumulation of evidence from any well-conducted trial (“silver coins”), including those that might otherwise be discarded as “failures”. Martius, in contrast, claimed this would encourage the practice of “subjective preference” over whether a therapy had worked or not.
Martius’s objections also seem to reflect a belief that Liebermeister’s formulas were based on (unspecified) “other assumptions” to those used by Poisson, and “completely neglect the Law of Large Numbers”. This law, first formalised by the Swiss mathematician Jacob Bernoulli (1655 – 1705), states that the probability of an event can be estimated ever more precisely from its observed frequency as the number of observations increases. This somewhat vague statement was later made mathematically precise and underpins the “confidence interval” formulas of Poisson and Gavarret. Martius seems to imply, however, that Liebermeister’s formulas only work with very small numbers because they have ignored this Law, thus violating a basic tenet of inference incorporated into Poisson’s formulas. This is understandable, given Poisson’s own peremptory (but incorrect) statement that the Law is “the basis of all applications of the calculus of probabilities” (Matthews 1995 p26). Given Martius’s lack of mathematical training, it is also unsurprising that he then fails to appreciate that the Law (of which he gives a somewhat muddled explanation on p 365) has not been ignored, but circumvented by Liebermeister’s inspired use of a mathematical model that obviates the use of approximations reliable only for large sample sizes (for more details, see Appendix 3).
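The dependence of the Poisson/Gavarret approach on large samples can be seen in a short calculation. The sketch below assumes the interval takes the form of the observed proportion plus or minus 2√2 standard errors, which under a normal approximation corresponds to the 99.53% threshold mentioned above. The function name and the example figures (a 30% success rate at three trial sizes) are hypothetical.

```python
import math


def poisson_gavarret_limits(successes, trials):
    """Approximate limits p +/- 2*sqrt(2*p*(1-p)/trials) for an observed proportion.

    Assumption: this large-sample form is used purely for illustration.
    """
    p = successes / trials
    half_width = 2 * math.sqrt(2 * p * (1 - p) / trials)
    return p - half_width, p + half_width


# Normal-approximation coverage of +/- 2*sqrt(2) standard errors is erf(2).
coverage = math.erf(2)
print(f"Coverage of the interval: {coverage:.4f}")

widths = []
for trials in (20, 200, 2000):
    successes = trials * 3 // 10  # hypothetical 30% success rate
    low, high = poisson_gavarret_limits(successes, trials)
    widths.append(high - low)
    print(f"n = {trials:4d}: limits {low:.3f} to {high:.3f}")
```

With only 20 patients the limits span most of the range of possible success rates, which illustrates why formulas of this kind offered little to clinicians working with small series.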
Martius ends by stating that his “rather negative” views should not be taken as implying rejection of the use of statistics in medicine. Rather, his aim was to counter “the repeated attempts to exceed the competence of the method on which statistics and probability calculations are based”. Progress in medicine, he insists, “lies in experimental induction, not in the numerical method”.
Martius’s 15,000-word critique appears to mark the end of substantive public discussion of ÜW during its author’s lifetime. It is perhaps significant that the principal critics were all relatively young and raised similar concerns. This suggests they represented the nature and depth of the scepticism towards probabilistic methods which would prevail for decades to come.
The relevance of ÜW for clinical biostatistics in the 21st century.
The descent of ÜW into obscurity was rapid. Soon after Martius’s critique, what may be the first discussion in English of Liebermeister’s methodology appeared in the form of a puzzle sent to a British educational journal. It had been posed by the distinguished Scottish mathematician and physician Donald MacAlister (1854 – 1934), and concerned a small clinical trial of treatments for blood poisoning (MacAlister 1882). After stating the relative efficacy of the two treatments under test, MacAlister asked readers for the probability of the apparently higher efficacy of one approach being merely a fluke. MacAlister credited Liebermeister with providing the means of answering such questions, noting that his formulas showed that some clinical insight could be obtained even from small samples. Yet after some correspondence about their underlying assumptions, Liebermeister’s formulas failed to make any further impact. In the 1940s, the American biostatistician Charles Winsor (1895 – 1951) praised Liebermeister’s “clear understanding of what statistical methods could and could not do for the practitioner” but declared “it is clear that few of us today would use the Liebermeister solution” (Winsor 1948).
This might suggest that ÜW had already been rendered obsolete by more sophisticated inferential methods. This was not the case: Winsor’s remarks merely reflect the fact that by the 1940s other inferential paradigms had pushed aside Bayesianism, along with methods based upon it such as Liebermeister’s formulas. Fisher’s Exact Test, their equivalent in the so-called frequentist paradigm, had by then been published, apparently in complete ignorance of Liebermeister’s work (Fisher 1934). Even after Liebermeister’s method was independently rediscovered and began to appear in textbooks during the Bayesian renaissance of the 1980s, Liebermeister himself remained forgotten.
Those few who did encounter ÜW admired its sophistication and practicality. Zabell (1989) described its formulas as “impressive… rigorously derived [with] several examples involving actual clinical data”. Similar praise appeared in Ineichen (1994 in German), Seneta (1994), Seneta and Phipps (2001). Flatly contradicting Winsor, Seneta et al (2004) argued that “There is little reason today for using less powerful exact tests in preference to Liebermeister’s now that it has been brought out of oblivion”.
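The relationship between Liebermeister’s measure and Fisher’s Exact Test can be checked numerically. The sketch below is an illustration under stated assumptions, not Liebermeister’s own derivation. It takes his measure to be the posterior probability that one success rate lies below the other under independent uniform priors, and compares it with the one-sided Fisher p-value computed after adding one to each of the two diagonal cells, which we understand to be the identity discussed by Altham (1969) and Seneta and Phipps (2001). The function names and the example table are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def beta_int(x, y):
    """Beta function B(x, y) for positive integers, as an exact fraction."""
    return Fraction(factorial(x - 1) * factorial(y - 1), factorial(x + y - 1))


def posterior_p1_below_p2(a, b, c, d):
    """P(p1 < p2 | data) with uniform priors, for a 2x2 table [[a, b], [c, d]].

    Row 1 has a successes and b failures, row 2 has c successes and d failures,
    so p1 ~ Beta(a+1, b+1) and p2 ~ Beta(c+1, d+1) a posteriori.
    """
    a1, b1, a2, b2 = a + 1, b + 1, c + 1, d + 1
    total = Fraction(0)
    for i in range(a2):
        total += beta_int(a1 + i, b1 + b2) / (
            (b2 + i) * beta_int(1 + i, b2) * beta_int(a1, b1)
        )
    return total


def fisher_upper_tail(a, b, c, d):
    """One-sided Fisher p-value: P(X >= a) with all margins held fixed."""
    n1, n2, k = a + b, c + d, a + c
    denom = comb(n1 + n2, k)
    return sum(
        Fraction(comb(n1, x) * comb(n2, k - x), denom)
        for x in range(a, min(n1, k) + 1)
    )


def liebermeister_measure(a, b, c, d):
    """Fisher's one-sided p-value after adding 1 to the diagonal cells a and d."""
    return fisher_upper_tail(a + 1, b, c, d + 1)


# Hypothetical small trial: 2 of 3 recover on treatment, 1 of 3 on control.
table = (2, 1, 1, 2)
print("Posterior P(p1 < p2): ", posterior_p1_below_p2(*table))
print("Liebermeister measure:", liebermeister_measure(*table))
print("Fisher one-sided p:   ", fisher_upper_tail(*table))
```

For this illustrative table the two Bayesian quantities agree exactly, and both are smaller than the Fisher p-value, which is consistent with the claim that Fisher’s test is the more conservative of the two.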
Ironically, Winsor’s comments appeared just as the most serious objection to Liebermeister’s methods was losing its power. Critics had repeatedly warned of the naïveté of interpreting the outcome of clinical studies in terms of the random extraction of balls from urns. From the mid-20th century onwards, clinical trials increasingly featured random allocation of patients to the treatment or control arms. Introduced to counter biased allocation by triallists (Chalmers 2011), this also effectively turned each patient into a ball in Liebermeister’s urn model. The resulting randomised controlled trial (RCT) cuts through the inferential difficulties that Liebermeister himself recognised as hampering the adoption of his methods.
In truth, there were many other barriers. Some had deep cultural roots, harking back to ancient disputes over whether medicine could be regarded as a “science” and the scope for quantification in decision-making (Matthews 2020). Many 19th-century clinicians would have shared Martius’s distaste for a mathematical model which represents patients as coloured balls. Those willing to set aside such qualms would have found Liebermeister’s formulas unfamiliar and possibly intimidating. Even if they succeeded in extracting the probability of a finding not being a mere fluke, they would still have faced the ineluctable problem of making sense of the answer. Unless all other causes had been eliminated, how reliable was this apparent probability of efficacy – and how high should it be? In the absence of randomisation, the former is very hard to assess; even in its presence, the latter remains a deeply controversial question to this day (Gibson 2021).
Nevertheless, Liebermeister’s concern about the abuse of arbitrary thresholds in deciding “what works”, his insistence that all well-conducted trials have inferential value and his use of Bayesian techniques for its extraction would today put him at the cutting edge of the debate over the use of statistics in medicine. Had the perspicacity of ÜW been recognised when it was published, it might have led to more questions being asked about the inferential methods now causing so much concern within the medical research community (eg Goodman 2016). As the drive to find more reliable methods continues, we believe Liebermeister should be recognised as one of the founding fathers of 21st century evidence-based medicine.
LH is grateful to Flurin Condrau for useful discussions, to Valentina Held for support in translating Liebermeister (1877) into English, Patricia Altham for comments on the Exact Test, Stephen Senn for helpful comments on an early version of this manuscript and Klaus Dietz for bringing Liebermeister to his attention. RAJM thanks Iain Chalmers for his enthusiasm for this project, Ulrich Tröhler, and Wolfram Liebermeister and Klaus Dietz for assistance with relevant literature. Both authors are grateful to Peter Diggle and Håvard Rue for comments on drafts.
Supplementary material (available online here: https://osf.io/4dvpk/): This consists of five appendices: (1) The text of ÜW in the original German; (2) The first English translation of ÜW; (3) A technical commentary covering the mathematical aspects of ÜW; (4) The text of Hirschberg’s 1877 commentary on ÜW in the original German; (5) This commentary translated into English.
Altham PME (1969). “Exact Bayesian Analysis of a Contingency Table and Fisher’s ‘Exact’ Significance Test.” J Roy Stat Soc B 31: 261–69.
Altham PME (2020). personal communication to LH.
Baumberger HR (1980). Carl Liebermeister (1833-1901). Züricher Medizingeschichtliche Abhandlungen New Series No. 137. 1545-1614 Zurich: Juris Druck Verlag.
Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers EJ, and others. (2018). “Redefine Statistical Significance.” Nature Human Behaviour 2: 6–10.
Bernard CI (1865). Introduction à l’étude de la médecine expérimentale (Reprinted Paris: Flammarion 1966).
Chalmers I (2011). Why the 1948 MRC trial of streptomycin used treatment allocation based on random numbers. J Roy Soc Med 104 (9) 383-386.
Chalmers I, Dukan E, Podolsky SH, Davey Smith G (2011). The advent of fair treatment allocation schedules in clinical trials during the 19th and early 20th centuries. JLL Bulletin: Commentaries on the history of treatment evaluation (https://www.jameslindlibrary.org/articles/the-advent-of-fair-treatment-allocation-schedules-in-clinical-trials-during-the-19th-and-early-20th-centuries/)
Chalmers I, Altman DG (Eds) (1995). Systematic Reviews. London: BMJ Publishing.
Fick Adolf (1866). Die Medicinische Physik (Braunschweig: F Vieweg und Sohn).
Fisher RA (1934). Statistical Methods for Research Workers. 5th ed. Edinburgh: Oliver & Boyd.
Gavarret LDJ (1840). Principes de Statistique Médicale: ou développement des règles qui doivent présider à son emploi. Béchet Jeune et Labé, Libraires de la faculte de medicine de Paris. See also 1844 translation into German by Landmann, S. (1844). Allgemeine Grundsätze der medicinischen Statistik.
Gibson EW (2021). The Role of p-Values in Judging the Strength of Evidence and Realistic Replication Expectations, Statistics in Biopharmaceutical Research 13:1 6-18.
Goodman SN (2016). Aligning statistical and scientific reasoning. Science 352 (6290): 1180-1.
Held L, Matthews RAJ (2022). Carl Liebermeister and the emergence of modern medical statistics Part 1: his remarkable work in historical context. J Roy Soc Med (in press).
Hirschberg J (1874). Die Mathematischen Grundlagen der Medizinischen Statistik. Leipzig: Verlag von Veit & Comp.
Hirschberg J (1877). Ueber Wahrscheinlichkeitsrechnung in Anwendung auf therapeutische Statistik Berliner Klinische Wochenschrift 21: 7 297-299.
Hróbjartsson A, Gøtzsche PC, Gluud C (1998). The controlled clinical trial turns 100 years: Fibiger’s trial of serum treatment of diphtheria BMJ 1998;317:1243–5.
Huth E (2008). “Jules Gavarret’s Principes Généraux de Statistique Médicale.” Journal of the Royal Society of Medicine 101 (4): 205–12. https://doi.org/10.1258/jrsm.2008.081008.
Ineichen R (1994). “Der “Vierfeldertest” von Carl Liebermeister (Bemerkungen zur Entwicklung der medizinischen Statistik im 19. Jahrhundert).” Historia Mathematica 21: 28–38.
Kennedy-Shaffer L (2019). Before p< 0.05 to beyond p< 0.05: using history to contextualize p-values and significance testing. The American Statistician, 73(sup1), pp 82-90.
Korteweg JA (1877). Over de Toepassing der Statistiek op Medische Wetenschappen Weekblad van het Nederlandsch Tijdschrift voor Geneeskunde 21 553-560
Liebermeister C (1877). Über Wahrscheinlichkeitsrechnung in Anwendung auf Therapeutische Statistik. Sammlung Klinischer Vorträge (Innere Medicin No. 31-64) 110: 935-962.
MacAlister D (1882). Probability and Listerism Educational Times Reprints 37 40-42.
Martius F (1878). “Die Principien der wissenschaftlichen Forschung in der Therapie.” Volkmanns Sammlung Klinischer Vorträge 139: 1169–88.
Martius F (1881). “Die Numerische Methode (Statistik und Wahrscheinlichkeitsrechnung) mit besonderer Berücksichtigung ihrer Anwendung auf die Medicin.” Virchows Archiv für Pathologische Anatomie und Physiologie und für Klinische Medicin 83: 336–377.
Matthews JR (1995). Quantification and the Quest For Medical Certainty. Princeton: Princeton University Press.
Matthews RAJ (2017). Chancing It (London: Profile Books) pp 135-146.
Matthews RAJ (2020). The origins of the treatment of uncertainty in clinical medicine. Part 1: ancient roots, familiar disputes. Journal of the Royal Society of Medicine, 113(5), pp 193-196.
McGrayne SB (2011). The theory that would not die. Yale University Press
Mill JS (1843). A System of Logic, Ratiocinative and Inductive. London: John W. Parker.
Morabia A (2011). “History of the Modern Epidemiological Concept of Confounding.” Journal of Epidemiology & Community Health 65 (4): 297–300. https://doi.org/10.1136/jech.2010.112565
Poisson (1837). Recherches sur la Probabilité des Jugements en Matière Criminelle et en Matière Civile , Bachelier, Paris.
Seneta E (1994). “Carl Liebermeister’s Hypergeometric Tails.” Historia Mathematica 21: 453–62.
Seneta E, Phipps MC (2001). “On the Comparison of Two Observed Frequencies.” Biometrical Journal 43: 23–43.
Seneta E, Seif FJ, Liebermeister H, Dietz K. (2004). “Carl Liebermeister (1833-1901): A Pioneer of the Investigation and Treatment of Fever and the Developer of a Statistical Test.” Journal of Medical Biography 12: 215–21.
Tröhler U (2020a). “Probabilistic Thinking and the Evaluation of Therapies, 1700-1900.” JLL Bulletin: Commentaries on the History of Treatment Evaluation. https://www.jameslindlibrary.org/articles/probabilistic-thinking-and-the-evaluation-of-therapies-1700-1900/.
Tröhler U (2020b). Probabilistic thinking and evaluation of therapies: an introductory overview. J Roy Soc Med 113(7), pp 274-277.
Tröhler U (2020c). The French road to Gavarret’s clinical application of probabilistic thinking Part 2: Louis-Denis-Jules Gavarret. J Roy Soc Med, 113(9), pp 360-366.
Tröhler U (2021). Conclusions and perspectives, part II: social, national, and long-term perspectives. Journal of the Royal Society of Medicine, 114(3), pp 132-139.
Vandenbroucke JP (2002). “The History of Confounding.” Sozial- Und Praventivmedizin 47 (4): 216–24. https://doi.org/10.1007/BF01326402.
Voets PJ (2018). Central line-associated bloodstream infections and catheter dwell-time: A theoretical foundation for a rule of thumb. J Theor Biol 445: 31-2.
Wasserstein RL, Lazar NA (2016). The ASA’s statement on p-values: context, process, and purpose. Am. Stat. 70, 129–133 (doi:10.1080/00031305.2016.1154108)
Winsor CP (1948). Probability and Listerism. Human Biology 20: 161–69.
Zabell S (1989). “Discussion of Robin L. Plackett: “Fisher’s History of Inverse Probability”.” Statistical Science 4: 261–63.