Heiberg P (1897)

Studier over den statistiske undersøgelsesmetode som hjælpemiddel ved terapeutiske undersøgelser [Studies on the statistical study design as an aid in therapeutic trials]. Bibliotek for Læger 89:1-40.

Key passage(s)



1. “Heiberg stresses the problems facing the physician when choosing which medical treatment to use: the large number of choices among different drugs, how to assess benefits and harms, and whom to believe. Heiberg notes that scientific proof of the benefit of a drug can only be reached through a numerically planned trial – the statistical-therapeutic experiment. If one replaces the word ‘serum’ in Heiberg’s article with ‘humanised monoclonal antibodies’ this introduction looks surprisingly contemporary.”

I. Introduction

“The time when all effective medicaments could be written on a fingernail is over. The number of efficacious medications has increased rapidly during recent years. A variety of antipyretics, soporifics, and pain-killers are at physicians’ disposal, and at the same time a completely new line of medicines has been introduced, for example, specific medicines, like different serums, developed to an exquisite degree.

Some of the more recent preparations have evident effects. No one, having administered acetyl salicylic acid for rheumatic fever only a few times, would doubt its very favourable influence on the condition of the patient. But for many other contemporary remedies this does not hold true; the effect may not be so evident and may not be that stable. The physician must rely on the experience of other physicians in order to build confidence in the drug.”



“But these experiences! Some physicians provide one medical opinion, others another. Shall the physician then begin to trust the authority of one author at the expense of distrusting another? I find the latter a dangerous approach. Shall the physician then count how many authors are for and how many are against? For sure, an equally dangerous approach. But what then makes so many physicians refrain from judgement of a report on the astonishing potential of a new medication, even if it derives from a source in which fraud is unlikely to occur deliberately?

Among various reasons, what they answer is also true: “We have been repeatedly disappointed, and we became disappointed even when the most promising figures asserting the benefits of the new drug were produced”. From this very answer we understand the physicians’ inability to assess a numerical report on the effect of a drug because physicians are not used to working with numbers that are so large that the percentages have particular meaning. Usually, one must have the original, absolute numbers in order to be able to assess the value of the report – but the original numbers are often not published!

Especially now, when statistical examinations on serum are produced one after the other to justify the beneficial effect of the different types of serum, an attempt to clarify the requirements for statistics within the therapeutic area should be justified. Internal medicine is about to leave expectant treatment and enter direct curative treatment methods. Everywhere, on a larger or smaller scale, trials are being instituted using the guidance that experimental pathology mainly can offer to physicians. But how rarely these experiments are planned, keeping in mind the simplest requirements of a statistical therapeutic experiment!”



“Even so, the “statistical results” are often used in a more or less enthusiastic proclamation of the beneficial effect of the new drugs. If the trials are performed in this way, then one can easily “cure everything with all sorts of things”, and the ever lurking scepticism ends up condemning, for example, serum therapy, since it becomes almost impossible to distinguish between the really good serum types and the speculative “good” types.

As Prof. Sørensen stressed recently, preliminary information about the usefulness of a drug can be obtained through the use of clinical methods. But in some cases, the stringent – the scientific – evidence of the effect of the drug should rather be proved through a numerical, planned study, a statistical-therapeutic experiment, the aims of which should really be to find a safe and common therapy that every unbiased physician is obliged to give; not because he believes that it is the right therapy, but because he knows it with certainty and is as sure of it as one can ever be of scientific facts derived from natural sciences.”



2. “Heiberg stresses that statistics can only deal with probability. Although statistics alone cannot prove anything, they can help decide which interventions should be adopted and which abandoned. Heiberg also realises that when interventions have dramatic effects they do not need to undergo statistical testing.”

III. The purpose of the statistical method

“Statistics have proved it, therefore the case is true”. This way of thinking is sometimes put forward by physicians. But the expressed judgement reveals a considerable misunderstanding of the essence of the statistical method and of the help it can give. The contribution of statistics is more humble. It is not so much the appreciation of the ability to prove something, but rather to test the value of an expressed judgement in a specific way.

In the natural sciences – the empirical sciences – all our postulates, all our judgements2 are expressed with greater or lesser probability. This is true for both Kepler’s laws and Laennec’s stethoscopy. While it never occurs to us to ask how large the probability is for these observed laws to come into force, it is different when we get to the areas where research is doubtful and we try to find new truths, new judgements. Here, we have to keep in mind the size of the probabilities which we calculate.

Within therapy, there is, in fact, an area in which a lot is still open to doubt, and where work is performed and new statements, new judgements, are being sought. In addition, the special circumstances prevailing in this field make it necessary to be as sure as possible about the probability with which a judgement is expressed for one or another new method of treatment.”



“Within physics, quite often, a few well-planned and well-performed experiments may settle the case, and yet when it comes to evaluation of a series of fine and stable observations, the physicist resorts to numerical considerations. If the astronomer is to evaluate a series of observations then he would also use numerical considerations, and it is from the group of astronomers that the study of observations has received a remarkable number of contributions. But if the exact sciences can use numerical methods as an aid, should then not “conjectural” therapy be able to use the amazing cleansing agent for poorly motivated judgements, which the statistical method can be?

The statistical method can do nothing alone. It is only an aid. Especially in therapy, it is very important to remember that it is just an aid to cleaning, to sweeping out large numbers of badly founded treatments, and to preventing the emerged, useless, or even harmful treatments from becoming consolidated and gaining influence. The following quotation by Fenger3 gives an example of how much even a very critical observer can be deluded by a simple subjective estimate of the conditions: “I cannot really imagine how I would treat inflammatory diseases in the abdominal region without leeches and blood-cups. I am absolutely sure that this would at least deprive me of a strong recovery-agent, after having observed the results on countless occasions etc.”

Thus, the statistical method is a touchstone, a purge. However, the purge only clears of guilt those sins resulting from abuse of numbers.”



“But will every new or old treatment be purified by taking this test? The answer must be “no”. By no means all treatments need to be criticised in this way and by no means all are suitable for this method of criticism. It is as ridiculous to investigate statistically the benefit of removing a corpus alienum from the cornea as it is impossible to statistically evaluate, for example, the benefit of amputation rather than extirpation of the organ in case of an adenosarcoma in the kidney. But this leads us to the question of the areas that therapeutic statistics should cover. Before addressing this issue, it will be advisable to discuss the statistical method in more detail.”



3. “Heiberg explains how clinical investigators usually only have access to limited numbers of patients and how the outcomes of such limited numbers of patients may be dramatically influenced by random error (‘play of chance’).”

“If all ‘external circumstances’ at a hospital or another bounded area remain unchanged for several years, the risk of a specific disease would most probably stay unchanged, be of constant value. However, small fluctuations would, of course, be observed for individual years. In other words, the risk for each single year may become either a little higher or a little lower than the actual constant risk. Applying the law of large numbers, the small fluctuations would be grouped in a symmetrical way around the average – in the described circumstances around the actual risk. In addition, positive and negative fluctuations would alternate randomly. ‘External circumstances’ means circumstances related to hospitalisation, delimitation of the spread of the disease, treatment etc.”



“If you know the course of one disease in a large number of cases and thereby know how many of these people die, you usually talk about the mortality of the disease or – probably expressed in a better way – about the risk of the disease, as pointed out by J. Carlsen4. The term ‘mortality’ is often used for a completely different concept: the incidence of death attributable to a disease as a rate relative to the population. Thus, the risk of a disease becomes the proportion of all the dead over all examined patients with the disease. The following 3 examples5 show 2000 patients observed in 20 groups for 20 successive years. In each example, the risk of the disease is set to be constant and is based on the numerical circumstances; it shows that large fluctuations can be expected. By letting the size of each group be 100, the fluctuations in percentages become obvious.

In the 3 examples, the fluctuations from an actual constant risk are distributed hypothetically, following the law of large numbers, approximately as shown in the table below.”




“In the second example, the risk is set to 20 per cent only, and here, due to the numerical circumstances, the annual fluctuations range from 11 to 28 per cent. In other words, the risk in per cent for one year can be more than twice as high as another year.

In the third example, the risk is set to 50 per cent only, and here, due to the numerical circumstances, the annual fluctuations range from 39 to 60 per cent.

The investigator of a therapeutic study rarely deals with more than a couple of hundred cases, and it is obvious from the above that he has to be very careful in assigning any value to small differences between observed percentages. If the random variations observed in a small study, like one including only 20 groups, are this big, how much higher will the risk not be if, eg, only 2 groups out of the innumerable possible ones are available?

As is obvious from the above, it is useless to discuss the higher or lower risk in the following groups. The example is taken from the very recent literature. (See Table below.)”
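The fluctuations quoted in the second and third examples can be checked with a short calculation. The following is a minimal editorial sketch in Python, not Heiberg's own computation: it assumes that each group of 100 patients behaves like 100 independent cases with the same constant risk (a binomial model), and the function names and the random seed are illustrative choices. Under that assumption the standard deviation of the observed risk is 4 percentage points at a risk of 20 per cent and 5 points at 50 per cent, so the quoted ranges of 11 to 28 and 39 to 60 per cent lie roughly two standard deviations either side of the constant risk.

```python
import math
import random


def binomial_sd_percent(risk, group_size):
    """Standard deviation, in percentage points, of the observed risk in one group."""
    return 100 * math.sqrt(risk * (1 - risk) / group_size)


def simulate_groups(risk, group_size=100, n_groups=20, seed=1897):
    """Deaths per group when every patient has the same constant risk.

    The seed is arbitrary; it only makes the simulation repeatable.
    """
    rng = random.Random(seed)
    return [
        sum(rng.random() < risk for _ in range(group_size))
        for _ in range(n_groups)
    ]


if __name__ == "__main__":
    for risk in (0.20, 0.50):
        sd = binomial_sd_percent(risk, 100)
        deaths = simulate_groups(risk)
        print(
            f"risk {risk:.0%}: sd = {sd:.1f} points, "
            f"simulated range {min(deaths)} to {max(deaths)} deaths per 100"
        )
```

Because each group has 100 patients, the number of deaths in a group is also its risk in per cent, which is the device Heiberg uses to make the fluctuations obvious.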




“Do the actual circumstances now agree with the law of large numbers? In reality, would fluctuations in the risk of a specific disease under constant ‘external circumstances’ be distributed in the same way as in the examples given above? This question cannot be answered a priori, but has to be examined specifically. It is possible that the risk of some diseases under constant ‘external circumstances’ fluctuates in other ways, eg, is subject to periodical variation. Anyhow, you will never come across fluctuations smaller than those according to the law of large numbers. This is what is important. You should not start looking for the cause of the fluctuations unless they become large enough or are recurring regularly in a specific direction, so that it becomes possible to claim that they have exceeded the unavoidable random fluctuations. This may – perhaps periodically – be a changing risk, a change in the ‘external circumstances’ that has escaped your attention, or a changed treatment.”



“Investigations of whether the exponential law of errors can be used in studies on the risks of the specific diseases are, as far as I know, still not available in the literature. All medical writers have taken this for granted, while actually it is only – and quite rightly – a conclusion by analogy. And after all, conclusions by analogy are not popular in the natural sciences. The following paragraph will show that the use of the exponential law of errors is proper in some particular cases.

The following example is given to show how a single factor, eg, age, can be eliminated by means of the method of expected deaths.

A study in a limited area, like a town or a major health insurance society, showed that under constant ‘external circumstances’ the fluctuations in risk of a specific disease fell within the limits of the law of large numbers. Then the following development occurred in 2 successive years:


If the above numbers are ordered in such a way that comparison becomes possible, we obtain the following table:


Fluctuations in the same direction appear in all 4 age groups: in 2 of them they are quite substantial, and for all groups, after having eliminated the age-factor, a fluctuation of 6 times the average error is observed.”
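The tables for this example are not reproduced here, but the arithmetic behind the method of expected deaths can be sketched. The following Python sketch is an editorial illustration under stated assumptions, not Heiberg's calculation: all figures in `age_groups` are hypothetical, the first year's risk in each age group is treated as known without sampling error, and 'average error' is read as the standard deviation of the expected number of deaths under a binomial model. Applying each age group's earlier risk to the later year's cases removes the effect of a changed age distribution before observed and expected deaths are compared.

```python
import math

# Hypothetical figures for illustration only (not Heiberg's data).
# Each age group: (cases year 1, deaths year 1, cases year 2, deaths year 2)
age_groups = {
    "0-5": (200, 80, 220, 55),
    "5-15": (300, 60, 280, 35),
    "15-40": (150, 15, 160, 10),
    "40+": (100, 30, 90, 18),
}


def expected_deaths_comparison(groups):
    """Return (observed, expected, deviation in units of the standard deviation).

    Expected deaths in year 2 are obtained by applying each age group's
    year-1 risk to that group's year-2 cases, which eliminates age as a factor.
    """
    observed = 0
    expected = 0.0
    variance = 0.0
    for cases1, deaths1, cases2, deaths2 in groups.values():
        risk1 = deaths1 / cases1
        observed += deaths2
        expected += cases2 * risk1
        variance += cases2 * risk1 * (1 - risk1)  # binomial variance per group
    deviation = (observed - expected) / math.sqrt(variance)
    return observed, expected, deviation


if __name__ == "__main__":
    obs, exp, dev = expected_deaths_comparison(age_groups)
    print(f"observed {obs}, expected {exp:.0f}, deviation {dev:.1f} standard deviations")
```

A deviation of several standard deviations, as in these made-up figures, is far outside the unavoidable random fluctuations; as the text goes on to say, the numbers alone cannot tell whether the cause is a spontaneous decline or a new treatment.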



“These numbers clearly show that a new causal factor has appeared, but whether this is a result of spontaneous decline of the risk of the disease or a decline caused by a new treatment, no analyses of numbers can tell us.”



4. “Heiberg explains how statistics can be used in experiments in which existing treatments are withdrawn or maintained, at random, to examine their effects.”

“Statistics has a special position in therapeutics in that it most often serves the experiment, whereas statistics is usually applied to existing things. In therapeutics, one can rightly talk about a statistical experiment because one can add or remove causal factors randomly.”



5. “Heiberg explains how fluctuations in outcomes may reflect the catchment area of the patients, diagnostic methods, the age of the patients, and spontaneous changes (‘natural history’). An improvement in prognosis may actually reflect ‘natural history’, but be mistakenly assumed to be due to new ‘therapeutics’.”

“Now we come to the requirements that your data must fulfil. Naturally, data will, in most cases, come from large hospitals, where hundreds, even thousands of patients with the same disease, are treated over the years.

Is it then enough to obtain your data from the same hospital department, and if possible, during a period when the same consultant was in charge of guarding against variation in diagnosis? No, it is not. You also have to keep an eye on the age of the patients in the 2 groups that you compare.”



“One last factor, so far only vaguely mentioned, is spontaneous fluctuations of the risk of some diseases.

For instance, the risk of typhoid fever has probably been declining constantly in Copenhagen during the last 30 years, so every new treatment has been able to refer to increasingly beautiful numerical results.”



6. “Heiberg explains how spontaneous fluctuations in outcomes may totally corrupt the interpretation of what we call today ‘time-series studies’. One may cover oneself by conducting random selection of every second patient for the experimental intervention, but it must be strictly every second patient. Hereby, Heiberg shows his appreciation of selection (allocation) bias. Heiberg also knows the problems connected with such selection, and suggests that one use ‘day of hospital admission’ as the basis for allocation. His suggestion has since been used by many, but we now realise that this method of treatment allocation allows foreknowledge of the allocation, and may thus be subject to selection bias.”

“In addition to material from hospitals, new material, some of which might be used for therapeutic-statistical studies, is now becoming available from medical reports from larger towns. In Ugeskrift for Læger 1895 Nr. 47, I tried to demonstrate how it might be possible to use material from the city medical officer’s annual report to analyse whether serum treatment of diphtheria had had any influence worth mentioning or not.”



“Based on the experiences from 1875-94, this paper tried to prove that the law of large numbers could be applied, at least in the age-group 5-15 years, to assess a fluctuation in risk. But may one here not become frustrated by a spontaneous drop in the risk of the disease? Definitely this is possible, but this would hardly happen at the same time as the introduction of a new method of treatment.

Is this an insurmountable difficulty for statistical-therapeutic research? Probably not. If a study has proved that in a specific case it is proper to use the law of large numbers, then – as mentioned above – it would be quite a coincidence if a spontaneous fall in risk were to take place at the same time at which one started a new method of treatment. But if one is to cover oneself, there is only one way: to treat every second patient with the new method and every other patient with the old one. But ‘every second’ must, from a statistical point of view, be understood as mechanically every second, with no subjective choice coming first; one may, for example, take every second of the most severely affected cases.

In practice, it will most likely be very difficult to treat every second patient in one way and every other patient in another way. It is much easier and equally safe to treat patients admitted on odd dates using one method and those admitted on even dates using the other method. Therefore, you can still restrict the new treatment to the more severely affected. In this way, the desired result, to divide the material into 2 groups where everything is exactly identical except one single causal factor, is reached without help from the subjective assessment.”
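The two mechanical allocation rules described above can be written down in a few lines. This is a minimal editorial sketch in Python; the function names and the labels "new" and "old" are illustrative assumptions, and which treatment goes with odd dates is an arbitrary choice. It shows only that neither rule leaves room for the investigator's judgement once it is fixed in advance.

```python
from datetime import date


def allocate_by_admission_date(admitted):
    """Odd day of the month -> new treatment, even day -> old treatment."""
    return "new" if admitted.day % 2 == 1 else "old"


def allocate_alternately(n_patients):
    """Strict alternation in order of admission: 1st new, 2nd old, 3rd new, ..."""
    return ["new" if i % 2 == 0 else "old" for i in range(n_patients)]


if __name__ == "__main__":
    # Hypothetical admission dates, for illustration only.
    for admitted in (date(1897, 3, 1), date(1897, 3, 2), date(1897, 3, 31)):
        print(admitted.isoformat(), allocate_by_admission_date(admitted))
    print(allocate_alternately(5))
```

Both rules are predictable: anyone who knows the date or the previous patient's treatment knows the next allocation in advance, which is the foreknowledge problem noted in the commentary on this passage. Date-based allocation is also slightly unbalanced, because a month with 31 days is followed by another odd date on the 1st.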



“If, in a previous study, you have made sure that it is appropriate to use the law of large numbers for the present problem, or if you have proceeded as in the above mentioned way and still find a numerical, clearly pronounced difference between the 2 treatment methods, you may claim without any risk that you have enriched the therapy with an observation of such merit that it will indeed have some value.

Which factor in the new treatment is the effective one is a question in its own right. The so-called homeopathic treatment of pneumonia is an enlightening example showing that the effective factor of a treatment may completely escape the investigator’s attention.

Is it then correct to use the statistical-therapeutic experiment since it will often require not only a few, but indeed a series of control cases?

Recently, in Denmark the use of oxygen in case of carbon monoxide poisoning has been advocated. Because not everyone with this disease dies and because of the small number of patients occurring even in a larger span of years, to use counts is out of the question when one may use the clinical method, ie, compare cases that are comparable as far as possible. If you, for example, have 2 individuals of nearly the same age, and both have been subjected to carbon monoxide intoxication in exactly the same way, are you then entitled to use oxygen in only one of the cases? Or is this equivalent to depriving the other patient of a chance? I think, that as an observing, inquiring physician, you are entitled to use one of the patients as control. In my opinion, you cannot deprive the other patient of the chance until you have established, based on a study, if breathing in oxygen is actually beneficial.

If the observing therapist does not dare to assume this responsibility, then his position will come down to a hatstand for the other sciences, and he must give up practising his own criticism of the drugs that are offered to him from various places. But if it is proper to let one individual be a control-case, it is also proper – yes, not only proper, but even in some cases necessary – to use rows of individuals as control-cases when this is the only way in which the truth can be acknowledged, in which progress can be made. Here again, the notion that one deprives the individuals of a chance must be refuted since it is exactly this point that needs to be proven, and the only way to determine universal validity is through the results obtained from the statistical-therapeutic experiment.

Therapy must claim its right to choose and pick among all the remedies on offer, and therapy in a number of fields, quite justifiably, must resort to the statistical-therapeutic experiment in order to make the proper choice.”



7. “Heiberg explains how bias from the wishful thinking of an investigator may influence trial participants, whose accounts of their symptoms may then be subject to reporting bias. Heiberg acknowledges the necessity of having large numbers of observations to reduce random error (‘play of chance’). He is also aware of the basis for reliable causal inferences: in order to ascribe a difference between the comparison groups to the experimental intervention, groups must be similar in respect of all factors apart from the experimental intervention.”

VI. Introductory remarks about the plan for a statistical-therapeutic experiment

“If you are facing a specific task, for example, evaluating a new treatment for pneumonia crouposa, there are a number of questions that you should clarify before proceeding with the trial.”



“Maybe the new treatment is based on a solid experimental foundation. You are in favour of it in advance, and preliminary clinical studies might even have strengthened your confidence in the method. But if you, as the investigator, have confidence in the method, you cannot help having a psychological influence on the patients. And in the clinical evaluation, where the subjective state of the patients always plays an important role, you may easily obtain too favourable an impression of the new method. Consider what the history of pharmacology teaches us about every new medicine. In the beginning when it is tested it seems to be effective for lots of different diseases. This is especially the case when the trials are run in a hospital department, where each patient can read the effect on the patient next to him. When you then eventually subtract all the stuff resulting from autosuggestion or psychological transfer from others, you may perhaps end up with a modest area where the remedy is continuously indicated”.8 Clearly, I do not believe that during an acute febrile disease risk numbers will be much influenced by such a psychological transfer, but this is still a factor that cannot be completely ignored.

For this reason too, it will be desirable to work with a series of control-cases, in which you may even make the whole external device related to the new method, and only omit the factor considered to be the effective one.

Far more important is probably the rumour of a new ‘effective’ treatment for a disease quickly reaching the public, and hereby, often a number of mild cases are attracted to the hospitals and in this way the risk of the disease becomes artificially reduced.”



“Similar to the possible spontaneous fluctuation in the risk of the disease, this problem may be avoided by the use of control-cases. This way of planning your trial also ensures against the often rather large lurches that may appear between the different groups. That it should be decided objectively, which cases should be experimental – and which should be control, is a requirement that is probably easier to state than put into practice.

The easiest way to perform this, may be, as said before, to treat the newly admitted patients every second day or perhaps every second week in the same way.

Even though you only want to treat the more severe cases of a disease using a new method, you might as well use this method to divide all the cases into an experimental group and a control group.

Furthermore, you have to know for sure whether you have a sufficiently large number of cases at your disposal compared to the disease risk. Even under favourable circumstances – as with, eg, pneumonia crouposa in a large hospital department – you will soon become aware that a statistical-therapeutic experiment can only rarely be carried out in weeks or months; more often it requires several years.

If you have divided your material into 2 groups in which all factors except the treatment are identical, and there is a difference in the disease risk of the 2 groups, you should use the method of expected deaths in order to assess the chances that this difference is real. If the chance is only small, one may consider adding more to the given basis than if the chance is considerable.”
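Why the same percentages can mean little in a small trial and a good deal in a larger one can be shown with a short calculation. The sketch below is an editorial illustration in Python using the modern pooled comparison of two proportions, which is offered as a stand-in and is not claimed to be Heiberg's exact procedure; the function name and all patient numbers are hypothetical. It expresses the difference in risk between an experimental and a control group in units of its standard error.

```python
import math


def risk_difference_in_standard_errors(deaths_new, cases_new, deaths_old, cases_old):
    """Difference in risk (new minus old) divided by its pooled standard error."""
    pooled = (deaths_new + deaths_old) / (cases_new + cases_old)
    standard_error = math.sqrt(
        pooled * (1 - pooled) * (1 / cases_new + 1 / cases_old)
    )
    difference = deaths_new / cases_new - deaths_old / cases_old
    return difference / standard_error


if __name__ == "__main__":
    # Hypothetical trials with identical percentages: 15% versus 25% mortality.
    small = risk_difference_in_standard_errors(15, 100, 25, 100)
    large = risk_difference_in_standard_errors(60, 400, 100, 400)
    print(f"100 per group: {small:.2f} standard errors")
    print(f"400 per group: {large:.2f} standard errors")
```

With 100 patients per group the difference stays below two standard errors and could easily be the play of chance; with 400 per group the same percentages give a difference of about three and a half standard errors. This is the sense in which more material must be added when the chance that the difference is real is still small.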

Translation by Christian Gluud



Povl Heiberg (1868-1963)