Welsh BC, Podolsky SH, Zane SN (2020). Between medicine and criminology: Richard Cabot’s contribution to the design of experimental evaluations of social interventions in the late 1930s.

© Brandon C. Welsh. Email: b.welsh@northeastern.edu

Cite as: Welsh BC, Podolsky SH, Zane SN (2020). Between medicine and criminology: Richard Cabot’s contribution to the design of experimental evaluations of social interventions in the late 1930s. JLL Bulletin: Commentaries on the history of treatment evaluation (https://www.jameslindlibrary.org/articles/between-medicine-and-criminology-richard-cabots-contribution-to-the-design-of-experimental-evaluations-of-social-interventions-in-the-late-1930s/)


In 1935, Richard Cabot began designing the Cambridge-Somerville Youth Study (CSYS) to evaluate the impact on youth delinquency of a social intervention of “directed friendship”. By the commencement of the study in 1937, in an attempt to ensure that like were being compared with like among intervention versus control groups, hundreds of boys had been “matched” into pairs based on 142 separate variables. The matching procedure was elaborate and adhered to the principle that “each personality would be studied both statistically and configurationally” (Powers and Witmer 1951, pp 62-63), a nod to educational research (McCall 1923). But Cabot introduced a final methodological maneuver—a coin flip to determine which boy within each pair would be assigned to the intervention group.

Today, the CSYS is recognized as the first randomized trial in criminology (Weisburd and Petrosino 2004), one of the earliest randomized clinical trials of a social intervention (Forsetlund et al. 2007), and seemingly the first to use alternate or random allocation after matching study participants into pairs in the social and behavioral sciences (Welsh et al. 2019b). Historical research has previously investigated the personal, professional, and institutional influences that inspired Cabot’s vision for the study and its research design (Welsh et al. 2017; 2019b; see also O’Brien 1985; Evison 1995).

Yet Cabot—physician and social interventionist alike—lived at the interface of medicine and the social sciences. His study marks a telling moment in the history of attempts to compare like with like—of the evolving articulation among pre-allocation stratification, matching, alternate allocation, random allocation, and other innovations intended to ensure fair comparison of the effects of interventions. It is also a useful starting point from which to explore the migration of innovations across seemingly siloed disciplines.

This article uses Cabot’s iconic study as a focal point from which to explore the tension between pre-allocation stratification and matching on the one hand, and alternate allocation and random allocation on the other, as a means of ensuring comparisons of like with like. It also examines the comparative histories of this tension in the social sciences (with a focus on criminology) and medicine as a means of further contextualizing Cabot’s study.

Richard Clarke Cabot and the Cambridge-Somerville Youth Study

Richard Clarke Cabot (1868-1939) was a physician and professor of clinical medicine and social ethics at Harvard University (Anon 1939; White 1939; Crenner 2005). He made a number of important contributions to medicine and public health, including differential diagnosis using blood (Cabot 1896), the establishment of the clinical pathological conference (Hajar 2015), and, in 1905, starting the first medical social work program in the country at Massachusetts General Hospital (Cabot 1919a; Stuart 2004). In the social sciences, he is best known for advancing the field of social ethics (Cabot 1926), advocating for social work practice and research, serving as president of the National Conference of Social Work (Cabot 1931), and developing and directing the Cambridge-Somerville Youth Study.

The CSYS would be Cabot’s final project, consuming the last five years of his life. In 1935, he incorporated a charity named after his late wife, the Ella Lyman Cabot Foundation, with the express purpose of funding an experimental intervention for young boys judged to be at increased risk of becoming delinquent (Powers 1949). In the same year, he created a selection committee, composed of three prominent practitioners in juvenile and criminal justice, to identify and recruit boys for the study (see Powers and Witmer 1951, pp 53-54, n. 2). The committee was charged with recruiting boys who were between the ages of 5 and 13 years, attended public and parochial schools, lived in the working-class areas of Cambridge and Somerville (Massachusetts), and were deemed to be “pre-delinquent”. Characteristics of pre-delinquency included “persistent truancy, persistent breaking of the rules, sex difficulties, petty pilfering and stealing, failing to return home after school, and, among the kindergartners, temper tantrums” (Cabot 1935).

A large number of boys were referred to the committee, mostly by local schools (approximately 77%), local welfare agencies, churches, and the police. Information on the boys was collected from a wide range of sources, including elementary school teachers, juvenile courts, physicians, and the parents of eligible boys. Case files of eligible boys were turned over to the “matchers,” a group of psychologists employed by the study. The process of matching involved two steps. A group of older boys (n=80) were first observed on overnight camping trips to assess relevant social and personality traits to operationalize matching parameters. The psychologists then matched all boys (n=650) using 142 variables (rated on an 11-point scale) covering a wide range of characteristics, including physical health, emotional and social adjustment, father’s occupation, teacher ratings of “average” or “difficult,” mental health, aggressiveness, acceptance of authority, discipline, and delinquency or disruption at home (Powers and Witmer 1951). This resulted in 325 matched pairs, whom the researchers referred to as “diagnostic twins” (deQ. Cabot 1940, p 146).

Following the matching process, one member of each matched pair was randomly allocated—based on a coin toss—to the treatment group. Overseen by the study director, the process of random allocation was staggered, beginning on November 1, 1937, and ending on May 13, 1939, with the intervention officially starting on June 1, 1939.¹

In the 1930s, the preventive intervention was described as character development through positive role models, also termed “directed friendship” (Powers 1950, p 21). The intervention was similar to today’s mentoring programs (Eddy et al. 2017). Boys in the treatment group received individual counseling and home visits by paid professional counselors, known as “case workers” at the time. Counseling activities included taking the boys on trips and to recreational activities, tutoring them in reading and arithmetic, encouraging them to participate in the YMCA and in summer camps, playing games with them at the project’s center, encouraging them to attend places of worship, and giving advice and general support to the boys’ families. Participants were enrolled in the intervention for a mean of 5.5 years, with case workers visiting the treatment boys on average twice a month. The control group received no special services.

In 1942, resource shortages—owing to the country’s involvement in World War II—resulted in the study being scaled back to 253 matched-pairs² and the intervention ending in 1945 (instead of running for 10 years as planned by Cabot). When a boy was dropped from the treatment group, his diagnostic twin in the control group was also dropped (McCord 1984). A comparison of all of the remaining pairs—using group differences in arithmetic means—indicated that there were no statistically significant differences at baseline between the treatment and control groups on a wide range of variables (e.g., age, IQ, referral to the study as “average” or “difficult,” mental health; Powers and Witmer 1951, pp 80-81).

There have been four CSYS post-intervention assessments of criminal behavior and other outcomes covering major periods of the life-course, including transition from adolescence to early adulthood, early adulthood, middle-age, and old age. The first two assessments found no statistically significant differences in criminal behavior between the groups (Powers and Witmer 1951; McCord and McCord 1959a, 1959b). At 30 years post-intervention (mean age = 47 years), it was found that the program had produced harmful effects: compared with the control group, treatment group participants were statistically significantly more likely to have committed 2 or more crimes, to have symptoms or signs of alcoholism, mental illness, high blood pressure, and heart trouble, to have occupations with lower prestige, and to have died before their 35th birthday (McCord 1978, 1981). In the most recent assessment, 72 years post-intervention (and likely the most prolonged follow-up of randomized cohorts in history), no statistically significant differences between the treatment and control groups were found in age at or cause of death (Welsh et al. 2019a).

Pre-allocation stratification, matched pairs, and randomization: the view from the present

The key benefit of random allocation in experiments is that it “ensures that any known and unknown differences between groups at baseline are due to chance and not to any systematic bias” (Forsetlund et al. 2007, p 371). In short, it establishes the basis for like-with-like comparisons. This represents an improvement over traditional matching designs, where observations of known characteristics are used to create “balanced” groups, but unknown differences between groups may remain and lead to allocation bias and, consequently, to less internally valid comparisons. Random allocation (and alternation) turns “all systematic sources of bias into random ones” (Rubin 1974, p 693); it is thus expected to eliminate the problem of confounding and is generally seen today as “necessary to produce the most valid and unbiased estimates of the effects” of social interventions (Farrington and Welsh 2006, p 60).

It is possible, however, to combine these two approaches. One way has been to first stratify participants (by age, sex, or other characteristics) prior to random allocation. A closely related approach, employed by Cabot, has been to create pairs matched according to particular characteristics, followed by random allocation within pairs to treatment or to control conditions. According to its supporters (Ariel and Farrington 2010; Weisburd and Gill 2014), this hybrid study design has several advantages over simple random allocation.
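As an illustration only, random allocation within matched pairs can be sketched in a few lines of Python. The function name, the participant identifiers, and the seed are hypothetical; the sketch assumes that matching has already been done and does not attempt to reproduce the CSYS procedure.

```python
import random

def allocate_within_pairs(pairs, seed=None):
    """Randomly assign one member of each matched pair to treatment.

    `pairs` is a list of 2-tuples of participant identifiers that have
    already been matched; the matching itself is outside this sketch.
    """
    rng = random.Random(seed)
    treatment, control = [], []
    for first, second in pairs:
        # The "coin flip": one outcome sends the first member to treatment,
        # the other outcome sends the second member.
        if rng.random() < 0.5:
            treatment.append(first)
            control.append(second)
        else:
            treatment.append(second)
            control.append(first)
    return treatment, control

# Hypothetical identifiers for three matched pairs.
pairs = [("A1", "A2"), ("B1", "B2"), ("C1", "C2")]
treatment, control = allocate_within_pairs(pairs, seed=1937)
```

Whatever the coin shows, the two groups end up the same size and every pair contributes exactly one member to each group, which is the property that distinguishes this design from simple random allocation.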

First, although randomization is designed to help eliminate confounding, covariate imbalance is still possible (Ariel and Farrington 2010). That is, the treatment and control groups may still differ by chance. This is less likely with large samples, but in small trials it can threaten internal validity (Balzer et al. 2015).³ Attempts to balance the treatment and control groups by matching on known, measured covariates prior to random allocation might help reduce this possibility. However, “depending on the choice of variables used to make the statistical adjustments for imbalances, the likelihood of bias may increase rather than decrease” (Chalmers 1989; Detre et al. 1981).

Second, matching prior to random allocation can improve study power when the matching is effective, meaning that there is a positive within-pair correlation on relevant variables (Wacholder and Weinberg 1982). By decreasing variation within matched pairs on known covariates, matching can improve the precision of estimated treatment effects—as has been demonstrated using over 5,000 simulated datasets (Balzer et al. 2015). There is some debate, however, about whether these advantages necessarily hold in cases of matched-pair cluster randomization, whereby units such as households or schools rather than individuals are matched and randomized (see e.g., Campbell, Donner and Klar 2007; Donner and Klar 2004; Ivers et al. 2012; but see Imai, King and Nall 2009a, 2009b).
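The precision argument can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of the simulations in Balzer et al. (2015): it assumes a single hypothetical prognostic covariate, an outcome equal to that covariate plus noise, a true treatment effect of zero, and matching by pairing participants with adjacent covariate values.

```python
import random
import statistics

def simulate(n_pairs=50, n_trials=1000, seed=0):
    """Compare the spread of effect estimates under two allocation schemes."""
    rng = random.Random(seed)
    simple_estimates, paired_estimates = [], []
    for _ in range(n_trials):
        # Hypothetical prognostic covariate, sorted so that neighbours are similar.
        x = sorted(rng.gauss(0, 1) for _ in range(2 * n_pairs))
        # Outcome depends on the covariate plus noise; the true effect is zero.
        y = [xi + rng.gauss(0, 0.5) for xi in x]

        # Simple random allocation: shuffle everyone, treat the first half.
        idx = list(range(2 * n_pairs))
        rng.shuffle(idx)
        treated = [y[i] for i in idx[:n_pairs]]
        controls = [y[i] for i in idx[n_pairs:]]
        simple_estimates.append(statistics.mean(treated) - statistics.mean(controls))

        # Matched pairs: neighbours (2k, 2k+1) form a pair, coin flip within each.
        diffs = []
        for k in range(n_pairs):
            a, b = 2 * k, 2 * k + 1
            if rng.random() < 0.5:
                a, b = b, a
            diffs.append(y[a] - y[b])
        paired_estimates.append(statistics.mean(diffs))

    # Standard deviation of the estimates across trials: smaller means more precise.
    return statistics.pstdev(simple_estimates), statistics.pstdev(paired_estimates)

simple_sd, paired_sd = simulate()
```

Under these assumptions the matched-pair estimates vary less from trial to trial than the simple-randomization estimates, because most of the covariate-driven variation cancels within pairs. The gain shrinks as the covariate becomes less predictive of the outcome.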

Third, random allocation within matched pairs provides a straightforward way of dealing with differential attrition, which can present serious challenges for longer follow-up assessments of controlled trials. Here, the researcher can drop both members of the pair in the event one member is missing. The downside is that this can result in considerable attrition and small final samples. Such considerations of course represent our present perspective.
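The pair-wise handling of attrition described above can be sketched as follows. The function name and identifiers are hypothetical, and the sketch assumes that each pair is stored as a (treatment, control) tuple.

```python
def drop_incomplete_pairs(pairs, lost):
    """Keep only pairs in which neither member was lost to follow-up."""
    lost = set(lost)
    return [(t, c) for t, c in pairs if t not in lost and c not in lost]

# Hypothetical example: one treatment boy is lost, so his control twin is dropped too.
pairs = [("T1", "C1"), ("T2", "C2"), ("T3", "C3")]
remaining = drop_incomplete_pairs(pairs, lost=["T2"])
```

The example also shows the cost noted above: each lost participant removes two observations from the analysis.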

Back to history: the context of the social sciences

There was a growing interest in experimentation in the social sciences during the 1920s (McCall 1923; Brearly 1931; Chapin 1931).⁴ Prior to the advent of alternate or random allocation, comparison groups for experiments were generated within a number of trials using “systematically balanced designs” (Box 1980, p 2), which involved “matching on prognostic variables” (Forsetlund et al. 2007, p 371). Early experimenters actually regarded matching designs as preferable to random allocation, because chance selection was thought to produce larger errors than matching on measured variables (McCall 1923).

Educationalists started to use comparison groups in educational experiments as early as 1908 (Winch 1908), but used matching rather than random allocation to create comparison groups. For example, an early textbook’s treatment of experimental design for education research promoted “measurement” (i.e., matching) over “chance selection” (i.e., alternation, rotation, or random allocation) for generating equivalent comparison groups (McCall 1923, p 42). While random sampling was agreed to be the best means of achieving representative participants, there was no agreement that random allocation to comparison groups was necessarily the best means of achieving group equivalence. Thus, McCall (1923, p 42) wrote: “Measurement, if adequate and accurate, is the best basis for selecting subjects irrespective of their number. Chance selection is merely an economical substitute for measurement, and is practicable only where the number of experimental subjects is sufficiently large.” As Forsetlund and colleagues (2007, p 374) observe, “McCall clearly regarded reliance on chance as inferior to active matching of groups using measures of general ability, and we have been unable to find any account of him having used chance (random allocation) to generate comparison groups in intervention studies.”

This can be observed in early experiments in education that used matching rather than random allocation for generating equivalent comparison groups. For example, Winch (1910) reported on a series of experiments on memory in 10-year-old children. In each experiment, a class of students was first divided into a treatment group that performed rote memory exercises and a control group that did not. To generate equivalence between groups, the whole class first took a memory test and was then divided into two groups with roughly equal total scores. Following this intervention, the two groups were given substance memory tests (i.e., stories) to evaluate whether the rote memory exercises improved substance memory, as assessed by comparing mean group scores.

In another early example of matching, Thorndike and Ruger (1916) reported on an experiment comparing the test outcomes of students exposed to recirculated air compared with those exposed to fresh air. Two groups of students were matched based on the results of a series of practice exercises involving addition, number and letter checking, and finding and copying addresses. The two groups were then given a series of tests to measure their ability across a number of academic topics. The average initial scores in each group were roughly the same, and the groups were deemed equivalent. From January to April, the treatment group was exposed to recirculated air in their classroom while the comparison group was only exposed to fresh air (i.e., opening windows). At the end of the semester, the groups were compared in terms of improvement from initial scores across a number of examinations.

In a more advanced educational setting, Chapin (1931) noted that 59 experiments were conducted at the University of Minnesota to investigate the effects of class size on academic achievement. In one such experiment, researchers compared a large class of 59 students with a small class of 21 students. Students in each class were assessed on intelligence scores and grades, and 11 students from the small class (treatment group) were matched with 11 students from the large class (control group). The classes had the same instructor, text, and method of instruction, and the mean final exam outcomes of the matched subgroups were compared.

Matching based on initial measurements appears to have remained the dominant approach to experimentation in education into the 1930s. There were, however, a few examples of experiments in education using random allocation prior to 1937, all of which appear to have taken place at Purdue University in the late 1920s and early 1930s (see Forsetlund et al. 2007). For example, Walters (1931, 1932) investigated whether counseling services improved the performance of freshman students who were deemed to be “academically delinquent.” In one study, freshman students with failing grades were “divided into three groups by random sampling”: a group with instructor counselors, a group with student counselors, and a control group with no counselors (Walters 1932, p 229).

By 1937, Cabot regarded matching alone as insufficient to ensure a fair assessment of his intervention. In keeping with his protestations to social workers and others to evaluate their projects and use rigorous methods (Cabot 1931), Cabot managed to set the bar even higher for evaluating his own delinquency prevention intervention, a decision that would come to be seen as significant for evaluation science.

In reporting on the results of the first evaluation of the CSYS, Edwin Powers and Helen Witmer (1951, p 78) noted that Cabot deemed matching on its own to be insufficient: “The next question was to determine whether any given boy should fall into the treatment or the control group. It was evident that an arbitrary decision might give rise to a constant error. The proper method of determining this question was, of course, by chance. Accordingly, a coin was flipped and the cases fell into the treatment or comparison groups in accordance with its fall.” Powers and Witmer went on to reflect in their 1951 assessment, “It was believed that, even if the measures used in the matching were not perfectly reliable, chance would tend to preserve, in groups as large as 325 each, an even balance of important factors” (p. 78).⁵ Later commentators on the study have supported this view: “The researchers used this paired or fully blocked design because the experimental treatment was lengthy and complex, so they sought to maximize the equivalence of the comparisons they could make” (Weisburd and Gill 2014, p 100).

Closely tied to this view was the need to overcome additional uncertainties about testing the effects of an intervention on social behavior (Claghorn 1927). Writing in the book chapter on the matching process, Powers and Witmer (1951, p 82) concluded with the following: “This account of the matching process, unavoidably complex, brings to light the difficulties of achieving adequate experimental controls in investigations of therapeutic methods; indeed, in any investigation that concerns itself with personalities or social behavior. For the purpose of this Study, however, it was vital to attempt as sound an equating as possible.” Recent historical research on the origins of the research design of the CSYS has suggested two plausible explanations for Cabot’s decision to follow matching with random allocation (Welsh et al. 2019b). First, it represented a natural carry-over from Cabot’s background in clinical practice and research. This had long been the prevailing view about the research design in general, held by those conducting research on and writing about the study (e.g., McCord 1992). Second, the combination of matching and random allocation appealed to Cabot on the grounds of using an even more rigorous method of experimentation than what was, in the late 1930s, the standard of the day. In fact, Cabot’s decision to employ random allocation within matched pairs was made in the context of a debate among leading statisticians regarding the best method for creating equivalent groups for purposes of experimentation. We will next provide a deeper exploration of this history.

The historical context of medicine and public health

The advent of matching
Within medicine, from the turn of the 20th century onward, despite the plethora of uncontrolled studies, some researchers began to consider how to compare like with like in clinical experiments. Many such researchers, often investigating the prophylaxis or treatment of infectious diseases, employed “alternate allocation” studies, in which patient A would receive one intervention, patient B an alternative or nothing, and so forth.⁶ Dozens of such studies were conducted during the first half of the 20th century (Chalmers et al. 2011). As discussed elsewhere (Chalmers 2005; Bothwell and Podolsky 2016), by the 1930s and 1940s, concerns over researchers believing that they could ‘improve on’ allocation schedules based on alternation or random allocation (for example, by preferentially steering the sickest patients to the novel treatment) led Austin Bradford Hill to conceal allocation schedules from those entering patients as participants in controlled trials. Concealed allocation schedules in the 1948 Medical Research Council (MRC) study of streptomycin for pulmonary tuberculosis contributed to its subsequent iconic status in the history of treatment evaluation (MRC 1948).

But there were other methods offered for ensuring fair medical treatment comparisons. One important technique was that of “matching,” or attempting to ensure, a priori rather than solely in post-hoc analysis, equivalent representation of seemingly relevant characteristics among treated and untreated groups. To some extent, such notions extend to James Lind’s own assertion that the cases of the sailors in his experiment comparing different treatments for scurvy, “were as similar as I could have them.” Yet while the history of matching merits far more inquiry, it appears that by the early 20th century, some trialists demanded increasing attention to ensuring such matched characteristics.

The earliest articulation that we have found of intentional matching as a method per se in the evaluation of therapeutics appears in a 1912 paper by Harry Lee Barnes on the treatment of tuberculosis with tuberculin, although this was a retrospective analysis. Superintendent of the State Sanatorium in Rhode Island, Barnes conducted a retrospective analysis of 150 patients treated at the sanatorium between 1907 and 1912. As he stated, comparisons “should be drawn between two classes of patients, those who take the treatment and those who do not, and these parallels should be made from cases that are as similar in prognosis as possible. For this study, an attempt was made to match each one of the 150 patients taking tuberculin against another patient of the same classification, according to the National Association, and also anatomically according to Turban, and likewise to match only cases having similar records of bacilli in the sputum, temperature, pulse, respiration, general condition, weight, race and year of discharge” (Barnes 1912).⁷ Finding no evidence of benefit of tuberculin, Barnes nevertheless considered that his methodology itself represented an advance: “While not perfect it should be much superior to slipshod methods of stating results of treatment and if widely adopted it would help to weed out more rapidly worthless methods of treatment in pulmonary tuberculosis. If applied to mooted questions like the ‘value of climate,’ it would eventually solve them, as the fruitless war of theories and opinions would eventually be displaced by evidence.” Nevertheless, Barnes presciently acknowledged: “Drawbacks to the use of this method are the abundance of material required and the amount of labor necessary to carry it out.”

We see emphasis on the need for equivalence around key factors in prospective studies in such prominent locations as Major Greenwood’s and Udny Yule’s World War I-era paper on “The Statistics of Anti-Typhoid and Anti-Cholera Inoculation, and the Interpretation of Such Statistics in General” (Greenwood and Yule 1914-1915) and in the subsequent American Public Health Association “Working Program against Influenza” (Anon 1919). As John Eyler (2009) has pointed out, in the wake of the mass of uncontrolled (or poorly controlled, even to contemporary judges) influenza vaccination studies, the “Working Program” authors stated as one of their key characteristics of a valid study comparing vaccinated to unvaccinated individuals that “the relative susceptibilities of the two groups should be equal, as measured by age and sex distribution,” as well as exposure history.⁸ It is perhaps telling that there is no mention of alternate allocation (let alone random allocation) in the “Working Program.” Indeed, it did not specify how such groups were to be rendered equivalent with respect to such factors.

More formal attention to a priori matching around key characteristics appeared in several prominent studies throughout the 1920s.⁹ In Harriette Chick’s and colleagues’ investigation of the influence of diet and sunlight on rickets in institutionalized children in postwar Vienna, “the children on admission were placed in two groups upon Diets I and II, care being taken that the infants in each group should be as similar as possible in age, general condition, and development, and that they should remain under identical conditions of general management and hygiene during their stay in the hospital” (Chick et al. 1923; Chick 1976). To further ensure equivalence with respect to environmental exposure, “the children in the two dietetic groups occupied adjoining cots in the wards, so that differences in the degree of illumination and exposure to fresh air were minimized as much as possible.” In Elmer McCollum’s controlled study of supplementary milk in 84 institutionalized children in Baltimore divided into treatment and control groups, “every effort was made … so that any child in one group was comparable in age, size and condition to a child in the other group” (McCollum 1924; Pollock 2005). Likewise, in Harold Corry Mann’s study of milk supplementation among institutionalized children outside of London, care was taken to divide active versus control groups according to age, as well as a combined rating score of height and weight (Corry Mann 1926; Pollock 2005).
And in a study not related to nutrition, but of vaccines for preventing common colds at the University of Manchester, researchers took 144 volunteers “and divided them into two equal groups by sorting the cards [filled out by the volunteers] first according to the sex of the volunteers, and then according to the dates on which the last cold was recorded.” As they continued, emphasizing the characteristics they were seemingly able to control versus those they weren’t: “Thus the two groups were approximately alike with regard to sex-distribution and with regard to the period which had elapsed since the last cold, in all other respects the distribution was random” (Ferguson et al. 1927). In none of the four studies was there any overt mention of how participants were allocated to active versus control groups.

Matching and stratification in alternate allocation studies
By contrast, among those focusing on alternate allocation as a means of ensuring the comparison of like with like, matching prior to allocation could be described as an impractical luxury, especially when researchers felt that with strict alternation and a large enough sample size, important characteristics would distribute sufficiently evenly among active versus control groups. Patients with pneumonia at Harlem Hospital were given polyvalent antiserum (treating multiple pneumococcal serotypes),¹⁰ and “because of the importance of treating patients at the earliest moment it was impracticable to alternate [the patients by pneumococcal serotype], since often at least twelve hours would have been lost before this was determined” (Park et al. 1928). William Park, Jesse Bullowa, and Milton Rosenblüth, conducting the study, “believed that with a sufficiently large series the distribution of case by type would be equalized between the treated and the untreated group,” and indeed this proved to be the case. Similarly, when the British MRC began its own study of anti-pneumococcal antiserum a few years later, while they clearly enunciated exclusion criteria (e.g., no patients with advanced heart disease, no patients under the age of 20 or over the age of 60) to avoid confounding factors, their plan “still left altogether unregulated the chance scatter of distribution of patients with severe or mild pneumonia into either the serum or control groups, and also of those admitted for treatment early or relatively late in the progress of the disease” (MRC 1934).
As with the American pneumonia researchers, it “was thought better not to attempt a deliberate sorting of cases in respect of mildness or severity, but to trust that the distortion of chance scatter would become almost negligible in a fairly large number of cases.” However, analysis of the MRC trial noted that in some of the participating sites more “severe” cases had ended up in the treated groups rather than in the control groups, evidence that set the stage for future discussions of the limitations of unconcealed allocation schedules (Chalmers 2013).

To some extent, such divisions between the use of a priori stratified studies and alternate allocation may be considered to represent the practical differences between planned, slowly enrolling studies of chronic conditions or preventive measures, and interventions in acute illnesses like pneumonia. But certain researchers did take pains to carefully stratify patients into subgroups for comparison before alternate allocation took place. In the early 1920s, Nicholas Kopeloff and George Kirby, at the New York State Psychiatric Institute on Ward’s Island, investigated the impact of the elimination of focal infections (dental, tonsillar, or cervical) on psychiatric illness (Kopeloff and Kirby 1923; Wessely 2009). As they noted, “because of the difficulties of interpretation inherent in an investigation of this nature, it seemed desirable to reduce the study as nearly as possible to the terms of an experiment.” They chose alternate allocation as their primary means for ensuring equivalence among treated versus untreated patients, but also noted that “an attempt was made to place in the two different groups, patients comparable as to sex, age, duration of psychosis, diagnosis, prognosis, and infective conditions.” It is unclear how exactly they attempted to operationalize – or reconcile with alternate allocation – this methodological foreshadowing of “minimization” (for the advent of later concerns with such “minimization,” see Pocock and Simon 1975). But a decade later, Massachusetts General Hospital’s Donald King, studying the inhalation of carbon dioxide to prevent postoperative pulmonary complications, was far more explicit in describing his attempt to stratify patients prior to alternate allocation. He began by noting that “since the sex of the patient and the type of abdominal operation play so important a part, the patients were divided according to sex and then grouped according to the type of abdominal operation. Every other patient, in the subgroups of each sex, was treated” (King 1933). He continued: “This alternation gave, for instance, a group of men who had had operations on the stomach and who had had hyperventilation induced, to compare with an equal number of men who had had operations on the stomach but who had not had hyperventilation induced. … Thus, statistics were available for male and female cases, treated and untreated, in the different groups of abdominal operations and hernia repair.”¹¹

Random allocation within matched-pairs
Such general tensions between matching and alternate allocation would be paralleled among those who first broached the mixed application of matching and random allocation within medicine and public health, bringing us still closer to Cabot’s study. By the 1920s, Ronald Fisher had advocated random allocation among agricultural plots in his “The Arrangement of Field Experiments” (Fisher 1926). Ian Hacking has noted that W.S. Gosset “and a majority of traditionalists believed that ‘matched’ or ‘balanced’ arrangements were less subject to error, more instructive, and in general entitled one to draw firmer instances” (Hacking 1988, p 429; Cox 2009), and that Gosset eventually favored “balanced randomization” as a happy compromise.12

This played out in a fascinating way in 1930 and 1931 with respect to a “nutritional experiment on a very large scale” (Student [i.e., Gosset], 1931) that followed upon the milk studies described above (Pollock 2005). In Lanarkshire, Scotland, 20,000 students from 67 schools were studied in the spring of 1930 to assess the effects of milk supplementation on growth. In any given school, for the most part, “the teachers selected the two classes of pupils, those getting milk and those acting as ‘controls’, in two different ways. In certain cases they selected them by ballot and in others on an alphabetical system” (Leighton and McKinlay 1930). However, “in any particular school where there was any group to which these methods had given an undue proportion of well-fed or ill-nourished children, others were substituted in order to attain a more level selection.” In other words, a rough form of “matching” was added to the process to ensure the comparison of like with like. The study seemed to favor the inclusion of milk; but most important to our inquiry, by 1931, it had led Gosset to produce a methodological deconstruction of the study (see also Pollock 2005). For Gosset, foreshadowing the concerns of those who revealed well-intentioned cheating with unconcealed allocation schedules, “unconscious selection” (later in the paper referred to as “unconscious bias”), seemingly manifested in the attempt at matching, could lead to the production of unequal comparison groups (as seemed to have been the case in the Lanarkshire study).
Especially focusing on a sub-question of the study concerning the relative utility of raw versus pasteurized milk, Gosset noted that the studied students “were not random samples from the same population; they were selected samples from populations which may have been different, … [and] I would be very chary of drawing any conclusions from these small biased differences.” As he gently lamented, “this experiment, in spite of all the good work which was put into it, just lacked the essential condition of randomness which would have enabled us to prove the fact.” Instead, Gosset proposed that if the experiment were to be repeated “on the same spectacular scale,” then: “The ‘controls’ and ‘feeders’ should be chosen by the teachers in pairs of the same age group and sex, and as similar in height, weight and especially physical condition (i.e. well or ill nourished) as possible, and divided into ‘controls’ and ‘feeders’ by tossing a coin for each pair.” In a fascinating subsequent section of the paper, concerning the comparison between raw versus pasteurized milk, Gosset noted that among 20,000 children, there should be, on average, about 50 pairs of identical twins and that “the error of the comparison between them may be relied upon to be so small that 50 pairs of these would give more reliable results than the 20,000 with which we have been dealing.” Again, he proposed a plan whereby the researchers would “‘Feed’ one of each pair on raw and the other on pasteurized milk, deciding in each case which is to take raw milk by the toss of a coin.”13 Cabot’s “diagnostic twins” had been foreshadowed by such literal twins.

That same year witnessed the publication by J. Burns Amberson and colleagues of “A Clinical Trial of Sanocrysin in Pulmonary Tuberculosis.” Twenty-four patients “free from serious complications” participated in the study. “On the basis of clinical, X-ray and laboratory findings the 24 patients were divided into two approximately comparable groups of 12 each. The cases were individually matched, one with another, in making this division. Obviously, the matching could not be precise, but it was as close as possible, each patient having previously been studied independently by two of us.” Finally, “by a flip of the coin, one group became identified as group I (sanocrysin-treated) and the other as group II (control)” (Amberson et al. 1931, pp 403-404).

The Amberson et al. study did not uncover any beneficial effects of sanocrysin; indeed the drug was shown to have nasty side effects. Joseph Gabriel has demonstrated the origins of the trial at the intersection of mutual public health service and pharmaceutical industry (Parke Davis) interest in an objective assessment of the drug, with the trial entailing blinding of patients to prevent a “psychic influence” on healing (Gabriel 2014). More germane to our line of inquiry, the origins of the single coin toss to determine the allocation of the two groups of patients are less apparent from the archival record. George McCoy, who had played a large role in the APHA vaccine protocols that emphasized matching (as mentioned above), likewise supported this therapeutic trial through his role as the director of the national Hygienic Laboratory. While the expressed need for a controlled study (even in discussions of the animal studies that preceded the human study) is evident throughout the record, and while the “plan” for the trial initially called for 100 treatment and 50 control patients, there is no formal mention of either matching (beyond the intent to choose groups of patients “on the basis of pulmonary lesions that are as nearly as possible comparable as regards extent and character of disease”) or random allocation in the plan (“Plan for Clinical Test of Sanocrysin,” November 1926, RG 443, General Records of the NIH, 1930-1948, box 21, “Sanocrysin Clinical Tests”).14 Clearly, however, by the late 1920s and early 1930s, certain trialists were extending beyond would-be matched controls to the addition of random allocation as an additional mechanism to ensure fair comparisons among treatment and control groups.


Despite our extensive searching within the Cabot and Sheldon Glueck15 papers at Harvard, it is unclear whether Cabot was aware of Kopeloff and Kirby’s trial, King’s study, Gosset’s dissection of the Lanarkshire study, Amberson et al.’s tuberculosis trial, or Austin Bradford Hill’s discussion of the “Principles of Medical Statistics” in the Lancet in 1937. On the one hand, Cabot conducted research on tuberculosis in his early medical career (Cabot 1919b) and wrote about the disease and its treatment in his medical textbooks, including later editions published after 1931 (e.g., Cabot 1937; Cabot and Adams 1938). On the other hand, we have not found reference to any of these studies, either in his medical publications or in his personal notes or correspondence.

And in tracing the history of treatment evaluation and the conduct of fair comparisons, it would seem that there is more of a direct line from the advent of alternate allocation, through concerns over its improper implementation, to the advent of randomized clinical trials (RCTs), than there is from Amberson, Cabot, or even Gosset to Bradford Hill (Chalmers 2005). In this reading, the mixed matching plus randomization proposals and studies of the 1920s and 1930s seem to be a relative dead end, albeit one reflecting increasing concern to provide objective assessment of novel interventions in the interwar years. These developments ensured that like would be compared with like, unmeasured variables would be unbiasedly distributed among comparison groups, and that, by concealing the allocation schedule, the allocation system itself could not be cheated.

The combination of matching with random allocation in prospective clinical trials would continue to be deployed in both the social sciences and medicine throughout the 20th and into the 21st centuries, followed by evolving debate over its advantages and limitations (Billewicz 1964; Bland and Altman 1994; Farrington and Welsh 2006; Ariel and Farrington 2010). The design itself serves as a cornerstone of the evolving articulation of stratification, matching, randomization, and similar innovations for ensuring fair comparisons are made in trials. Key to this history has been Richard Cabot’s Cambridge-Somerville Youth Study, the first large-scale matched-randomized trial and one of the earliest randomized trials of a social intervention (Forsetlund et al. 2007).

Prior research (Welsh et al. 2019b) pointed to the study’s design representing a natural carry-over from Cabot’s background in clinical and research medicine, as well as the design appealing to him on the grounds that it would be even more rigorous than alternation or simple random allocation. We found additional support for the influence of the latter, with Cabot viewing matching as insufficient on its own to achieve equivalence between treatment and control groups. Equally compelling was Cabot’s added concern about “achieving adequate experimental controls” in evaluating an intervention that was focused on social behavior. The implication is that the social world compared to the physical world was less known to experimentalists. This is to take nothing away from Cabot’s staunch advocacy for social workers and his repeated call for them to evaluate their interventions using rigorous methods. Famously, in his presidential address to the National Conference of Social Work, he made clear his desire for an age of rigorous evaluation in social work: “Pending the much-to-be-desired epoch when we shall control our results by comparison with a parallel series of cases in which we did nothing” (Cabot 1931, p 448).

More broadly, our research has situated both Cabot and his study in the midst of the social sciences and medicine and public health as they have wrestled with the uses of a priori stratification, matching, alternate allocation, and random allocation and attempted to compare like with like in the 20th and 21st centuries. Matching – whether as an independent form of ensuring seemingly unbiased comparisons or as an a priori component of alternate allocation or random allocation – has so far received insufficient attention (as have related notions of stratification and exclusion criteria). Our research is an attempt to address this historical gap. Additionally, we have attempted to place the history of the social sciences, medicine, and public health in direct conversation with one another. The boundaries among such disciplines are indeed indistinct and dynamic. For example, the 1916 study cited earlier on the role of air quality in education was overseen by the New York State Commission on Ventilation, with noted public health pioneer Charles-Edward Amory Winslow among the Commission’s listed members (Thorndike and Ruger 1916). “Cross-ventilation” among the social sciences, medicine, and public health themselves has persisted to this day. Richard Cabot likely served as only the most prominent of individuals who straddled – or at least engaged with – multiple disciplines. We hope historians will follow suit and that this article will stimulate further attention in this direction.


We are especially grateful to the reference staff at Harvard University Archives, Harvard Law School Library’s Historical and Special Collections, and the Center for the History of Medicine at the Countway Library. We also wish to thank Iain Chalmers for suggesting the central questions that guided our research, as well as for his sage advice and insightful comments throughout the development of this article.

This article was judged article of the month by the American Society of Criminology’s Division of Experimental Criminology.

This James Lind Library article has been republished in the Journal of the Royal Society of Medicine in 2 parts.

Welsh BC, Podolsky SH, Zane SN (2021). Richard Cabot, pair-matched random allocation, and the attempt to compare like with like in the social sciences and medicine: Part 1: The context of the social sciences. JRSM 2021;114:212-217.

Podolsky SH, Welsh BC, Zane SN (2021). Richard Cabot, pair-matched random allocation, and the attempt to compare like with like in the social sciences and medicine: Part 2: The context of medicine and public health. JRSM 2021;114:264-270.


Amberson JB, McMahon BT, Pinner M (1931). A clinical trial of sanocrysin in pulmonary tuberculosis. American Review of Tuberculosis 24:401-435.

Anon (1919). A working program against influenza. American Journal of Public Health 9:1-13.

Anon (1939). Deaths: Richard Clarke Cabot. JAMA 112:2079.

Ariel B, Farrington DP (2010). Randomized block designs. In: Piquero AR, Weisburd D, eds. Handbook of quantitative criminology. New York: Springer, p 437-454.

Balzer LB, Petersen ML, van der Laan MJ, SEARCH Consortium (2015). Adaptive pair-matching in randomized trials with unbiased and efficient effect estimation. Stat Med 34:999-1011.

Barnes HL (1912). Report of 150 cases of pulmonary tuberculosis treated with tuberculin. JAMA 59:332-333.

Billewicz WZ (1964). Matched samples in medical investigations. Brit J Prev Soc Med 18:167-173.

Bland JM, Altman DG (1994). Matching. BMJ 309:1128.

Bothwell LE, Podolsky SH (2016). The emergence of the randomized, controlled trial. NEJM 375:501-504.

Box JF (1980). R.A. Fisher and the design of experiments, 1922-1926. American Statistician 34:1-7.

Brearly HC (1931). Experimental sociology in the United States. Social Forces 10:196-199.

Brown L (1934). Harry Lee Barnes, M.D. Trans Am Clin & Clim Soc 50:lvii-lxi.

Cabot RC (1896). A guide to the clinical examination of the blood for diagnostic purposes. New York: W. Wood.

Cabot RC (1919a). Social work: Essays on the meeting-ground of doctor and social worker. Boston: Houghton Mifflin.

Cabot RC (1919b). Tuberculosis of the lungs. War Medicine 2:978.

Cabot RC (1926). Adventures on the borderlands of ethics. New York: Harper.

Cabot RC (1931). Treatment in social case work and the need of criteria and of tests of its success and failure. Hospital Social Services 24:435-453.

Cabot RC (1935). Letter to Miss Gertrude Duffy, June 3, 1935. HUG 4255: Box 97. Richard Clarke Cabot Papers, Pusey Library, Harvard University Archives.

Cabot RC (1937). A layman’s handbook of medicine: with special reference to social workers, 2nd ed. Boston: Houghton Mifflin.

Cabot RC, Adams FD (1938). Physical diagnosis, 12th ed. Baltimore: Williams and Wilkins.

Campbell MJ, Donner A, Klar N (2007). Developments in cluster randomized trials and Statistics in Medicine. Statistics in Medicine 26:2-19.

Cecil RL, Larsen NP (1922). Clinical and bacteriological study of one thousand cases of lobar pneumonia. JAMA 79:343-349.

Chalmers I (1989). Evaluating the effects of care during pregnancy and childbirth. In: Chalmers I, Enkin M, Keirse MJNC, eds. Effective care in pregnancy and childbirth. Oxford: Oxford University Press, p 3-38.

Chalmers I (2005). Statistical theory was not the reason that randomisation was used in the British Medical Research Council’s clinical trial of streptomycin for pulmonary tuberculosis. In: Jorland G, Opinel A, Weisz G, eds. Body counts: medical quantification in historical and sociological perspectives. Montreal: McGill-Queens University Press, p 309-334.

Chalmers I (2013). UK Medical Research Council and multicentre clinical trials: from a damning report to international recognition. JLL Bulletin: Commentaries on the history of treatment evaluation (https://www.jameslindlibrary.org/articles/uk-medical-research-council-and-multicentre-clinical-trials-from-a-damning-report-to-international-recognition/) [Republished in the Journal of the Royal Society of Medicine 2013;106:498-509.]

Chalmers I, Dukan E, Podolsky SH, Davey Smith G (2011). The advent of fair treatment allocation schedules in clinical trials during the 19th and early 20th centuries. JLL Bulletin: Commentaries on the history of treatment evaluation (https://www.jameslindlibrary.org/articles/the-advent-of-fair-treatment-allocation-schedules-in-clinical-trials-during-the-19th-and-early-20th-centuries/) [Republished in the Journal of the Royal Society of Medicine 2011;105:221-227.]

Chapin S (1931). The problem of controls in experimental sociology. Journal of Educational Sociology 4:541-551.

Chick H (1976). Study of rickets in Vienna, 1919-1922. Medical History 20:41-51.

Chick H, Dalyell EJ, Hume EM, Mackay HMM, Henderson Smith H, Wimberger H (1923). Observations upon the prophylaxis and cure of rickets at the University Kinderklinik, Vienna. In: Medical Research Council. Studies of rickets in Vienna, 1919-22. Special Report Series No. 77. London: His Majesty’s Stationery Office, p 19-94.

Corey Mann HC (1926). Diets for boys during the school age. Medical Research Council Special Report Series No. 105. London: HMSO.

Cox DR (2009). Randomization for concealment. JLL Bulletin: Commentaries on the history of treatment evaluation (https://www.jameslindlibrary.org/articles/randomization-for-concealment/) [Republished in the Journal of the Royal Society of Medicine 2010;103:72-73.]

Claghorn KH (1927). The problem of measuring social treatment. Social Service Review 1:181-193.

Crenner C (2005). Private practice: in the early twentieth-century medical office of Richard Cabot. Baltimore: Johns Hopkins University Press.

deQ. Cabot PS (1940). A long-term study of children: The Cambridge-Somerville Youth Study. Child Development 11:143-151.

Detre KM, Peduzzi P, Chan Y-K (1981). Clinical judgement and statistics. Circulation 63:239-240.

Donner A, Klar N (2004). Pitfalls of and controversies in cluster randomization trials. American Journal of Public Health 94:416-22.

Eddy JM, Martinez CR Jr, Grossman JB, Cearley JJ, Herrera D, Wheeler AC, Rempel JS, Foney D, Gau JM, Burraston BO, Harachi TW, Haggerty KP, Seeley JR (2017). A randomized controlled trial of a long-term professional mentoring program for children at risk: Outcomes across the first 5 years. Prev Sci 18:899-910.

Evison IS (1995). Pragmatism and idealism in the professions: the case of Richard Clarke Cabot, 1868-1939. Unpublished dissertation. Chicago: University of Chicago.

Eyler J (2009). The fog of research: influenza vaccine trials during the 1918-19 pandemic. Journal of the History of Medicine and Allied Sciences 64:401-428.

Farrington DP, Welsh BC (2006). A half century of randomized experiments on crime and justice. Crime and Justice: A Review of Research 34:55-132.

Ferguson FR, Davey AFC, Topley WWC (1927). The value of mixed vaccines in the prevention of the common cold. Journal of Hygiene 26:98-109.

Fibiger J (1898). Om Serumbehandling af difteri [On Treatment of Diphtheria with Serum]. Hospitalstidende 6:309-325.

Forsetlund L, Chalmers I, Bjørndal A (2007). When was random allocation first used to generate comparison groups in experiments to assess the effect of social interventions? Economics of Innovation and New Technology 6:371-384.

Fisher RA (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture 33:503-513.

Gabriel JM (2014). The testing of sanocrysin: science, profit, and innovation in clinical trial design, 1926-31. Journal of the History of Medicine and Allied Sciences 69:604-632.

Greenwood M, Yule GU (1914-1915). The statistics of anti-typhoid and anti-cholera inoculation, and the interpretation of such statistics in general. Proc R Soc Med (Sect Epidemiol State Med) 8:113-190.

Hacking I (1988). Telepathy: origins of randomization in experimental design. Isis 79:427-451.

Hajar R (2015). The Clinicopathologic Conference. Heart Views 16:170-173.

Hill AB (1937). Principles of medical statistics. I–the aim of the statistical method. Lancet 1:41-43.

Imai K, King G, Nall C (2009a). The essential role of pair matching in cluster-randomized experiments, with application to the Mexican universal health insurance evaluation. Statistical Science 24:29-53.

Imai K, King G, Nall C (2009b). Rejoinder: matched pairs and the future of cluster-randomized experiments. Statistical Science 24:65-72.

Ivers NM, Halperin IJ, Barnsley J, Grimshaw JM, Shah BR, Tu K, Upshur R, Zwarenstein M (2012). Allocation techniques for balance at baseline in cluster randomized trials: a methodological review. Trials 13:e120.

King DS (1933). Postoperative pulmonary complications. II. Carbon dioxide as a preventive in a controlled series. JAMA 100:21-26.

King DS (1935). Correspondence–diathermy in lobar pneumonia. NEJM 213:1324-1325.

Kopeloff N, Kirby GH (1923). Focal infection and mental disease. American Journal of Psychiatry 3:149-197.

Leighton G, McKinlay P (1930). Milk consumption and the growth of school children. Edinburgh: Department of Health for Scotland (and London: HMSO).

McCall WA (1923). How to experiment in education. New York: Macmillan.

McCollum EV (1924). The nutritional value of milk. In: Rogers LA, Lenoir RD, eds. World’s Dairy Congress, Washington, D.C., 2-10 October 1923. Washington: US Government Printing Office, p 421-427.

McCord J (1978). A thirty-year follow-up of treatment effects. American Psychologist 33:284-9.

McCord J (1981). Consideration of some effects of a counseling program. In: Martin SE, Sechrest LB, Redner R, eds. New directions in the rehabilitation of criminal offenders. Washington, DC: National Academy Press, p 394-405.

McCord J (1984). A longitudinal study of personality development. In: Mednick SA, Harway M, Finello KM, eds. Handbook of longitudinal research, vol. 2. New York: Praeger, p 522-531.

McCord J (1992). The Cambridge-Somerville Study: a pioneering longitudinal experimental study of delinquency prevention. In: McCord J, Tremblay RE, eds. Preventing antisocial behavior: interventions from birth through adolescence. New York: Guilford Press, p 196-206.

McCord J, McCord W (1959a). A follow-up report on the Cambridge-Somerville Youth Study. Annals of the American Academy of Political and Social Science 322:89-96.

McCord W, McCord J (1959b). Origins of crime: a new evaluation of the Cambridge-Somerville Youth Study. New York: Columbia University Press.

Medical Research Council (1934). The serum treatment of lobar pneumonia. BMJ 1:241-245.

Medical Research Council (1948). Streptomycin treatment of pulmonary tuberculosis. BMJ 2:769-782.

O’Brien L (1985). ‘A bold plunge into the sea of values’: the career of Dr. Richard Cabot. New England Quarterly 58:533-553.

Park WH, Bullowa JGM, Rosenblüth MB (1928). The treatment of lobar pneumonia with refined specific antibacterial serum. JAMA 91:1503-1508.

Pocock SJ, Simon R (1975). Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. Biometrics 31:103-115.

Podolsky SH (2006). Pneumonia before antibiotics: therapeutic evolution and evaluation in twentieth-century America. Baltimore: Johns Hopkins University Press.

Pollock JI (2005). Two controlled trials of supplementary feeding of British school children in the 1920s. JLL Bulletin: Commentaries on the history of treatment evaluation (https://www.jameslindlibrary.org/articles/two-controlled-trials-of-supplementary-feeding-of-british-school-children-in-the-1920s/) [Republished in the Journal of the Royal Society of Medicine 2006;99:323-327.]

Powers E (1949). An experiment in prevention of delinquency. Annals of the American Academy of Political and Social Science 261:77-88.

Powers E (1950). Some reflections on juvenile delinquency. Federal Probation 14:21-26.

Powers E, Witmer HL (1951). An experiment in the prevention of delinquency: the Cambridge-Somerville Youth Study. New York: Columbia University Press.

Rubin D (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66:688-701.

Stuart PH (2004). Individualization and prevention: Richard C. Cabot and early medical social work. Social Work in Mental Health 2:7-20.

Student [W.S. Gosset] (1931). The Lanarkshire milk experiment. Biometrika 23:398-406.

Thorndike EL, Ruger GJ (1916). The effects of outside air and recirculated air upon the intellectual achievement and improvement of school pupils. School & Society 4:260-264.

Wacholder S, Weinberg CR (1982). Paired versus two-sample design for a clinical trial of treatments with dichotomous outcome: Power considerations. Biometrics 38:801-812.

Walters JE (1931). Seniors as counselors. Journal of Higher Education 2:446-448.

Walters JE (1932). Measuring effectiveness of personnel counseling. Personnel Journal 11:227-236.

Weisburd D, Gill CE (2014). Block randomized trials at places: rethinking the limitations of small N experiments. Journal of Quantitative Criminology 30:97-112.

Weisburd D, Petrosino A (2004). Experiments, criminology. In: Kempf-Leonard K, ed. Encyclopedia of social measurement. San Diego: Academic Press, p 877-884.

Welsh BC, Zane SN, Rocque M (2017). Delinquency prevention for individual change: Richard Clarke Cabot and the making of the Cambridge-Somerville Youth Study. Journal of Criminal Justice 52:79-89.

Welsh BC, Zane SN, Zimmerman GM, Yohros A (2019a). Association of a crime prevention program for boys with mortality 72 years after the intervention: a follow-up of a randomized clinical trial. JAMA Network Open 2:1-11 (e190782).

Welsh BC, Dill NE, Zane SN (2019b). The first delinquency prevention experiment: a socio-historical review of the origins of the Cambridge-Somerville Youth Study’s research design. Journal of Experimental Criminology 15:441-451.

Wessely S (2009). Surgery for the Treatment of Psychiatric Illness: The Need to Test Untested Theories. JLL Bulletin: Commentaries on the history of treatment evaluation (https://www.jameslindlibrary.org/articles/surgery-for-the-treatment-of-psychiatric-illness-the-need-to-test-untested-theories/) [Republished in the Journal of the Royal Society of Medicine 2009;102:445-451.]

White PD (1939). Richard Clarke Cabot. NEJM 220:1049-1052.

Winch WH (1908). The transfer of improvement in memory in school-age children. British Journal of Psychology 2:284-293.

Winch WH (1910). The transfer of improvement in memory in school-age children, II. British Journal of Psychology 3:386-405.


  1. There were two violations of the random allocation procedure. First, 8 boys were matched after the intervention began. Second, “brothers were assigned to that group to which the first of siblings was randomly assigned.” This violation involved a total of 40 boys, 21 in the treatment group and 19 in the control group (McCord 1992, p 199, n. 1).
  2. The decision to drop cases from the study was based on counselors’ views on a range of factors (e.g., cooperativeness, seriousness of the problem, distance) for each treatment boy under his or her charge (Powers and Witmer 1951, pp 142-144).
  3. This was recognized, with respect to alternate allocation studies, as early as in Austin Bradford Hill’s elaboration of the “Principles of Medical Statistics” in 1937 (see Hill 1937).
  4. Brearly (1931) also notes that the term “experiment” was used in at least seven ways, only one of which involved “controlled experiments.”
  5. Edwin Powers was on the study research staff from 1937 to 1947. He also served as the study director from 1941 to 1951.
  6. Other versions of this included alternating the treatment plan every other day (see Fibiger 1898), or, in an anticipation of cluster randomization, alternating wards of a hospital between treatment and control groups (see Cecil and Larsen 1922).
  7. For more on Barnes, see Brown (1934). Barnes’ paper was identified through a Google ngram search (4/17/20) of “patients were matched.” Sadly, JAMA no longer supports full-text searching of its issues, and we suspect there are more relevant articles within its run.
  8. It is likely not a coincidence that the widespread mortality of the Great Influenza likewise stimulated the Metropolitan Life Insurance Company to fund a series of alternate allocation trials of anti-pneumococcal serum in hospitals in Boston and New York, providing a crucial stimulus to this parallel methodological innovation (Podolsky 2006).
  9. Such analysis was augmented by examining each of the studies in the James Lind Library between 1920 and 1934.
  10. Pneumococcal pneumonia, then considered the dominant form of pneumonia, had been subdivided into subtypes (e.g., Type I, II, etc.) based on the mutually informing immunological serotype of the pneumococcal capsule and the complementary serum directed against such a capsule (Podolsky 2006).
  11. For King’s methodological dissection of another alternate allocation study two years later, see King (1935).
  12. Gosset, who published under the pseudonym Student, was the originator of the “Student’s t-test” while working at the Guinness brewery.
  13. On the one hand, this became the occasion for the aside that “some way of distinguishing the children from each other is necessary or the mischievous ones will play tricks.” On the other hand, such a plan points not only to the importance, in Gosset’s mind, of matching background characteristics to achieve a valid comparison, but also to the importance of genetic aspects to the overall substrate of each individual at the time of study.
  14. Many thanks to Joseph Gabriel for generously making available the more than 500 images of pages from this collection.
  15. Sheldon and Eleanor Glueck were research professors at Harvard Law School. They had a personal and professional relationship with Cabot, and were influential in Cabot’s development of the CSYS.