The Replication Problem


This post by Allan Franklin concerns the replication problem – that is, the ability or inability to reproduce an experimental result – in physics. The replication problem is at least as important in biology and medicine, so I asked Prof. Franklin for permission to post the introduction to his book, Is It the Same Result? Replication in Physics. I hope it will engender an interesting discussion here. I will be the moderator of that discussion. –Matt Young

Allan Franklin is professor of physics emeritus at the University of Colorado. He began his career as an experimental high-energy physicist and later changed his research area to history and philosophy of science, particularly the roles of experiment. He has twice been chair of the Forum on the History of Physics of the American Physical Society and served two terms on the Executive Council of the Philosophy of Science Association. In 2016, Franklin received the Abraham Pais Prize for History of Physics from the American Physical Society. He is the author of eleven books, including, most recently, Shifting Standards: Experiments in Particle Physics in the Twentieth Century; What Makes a Good Experiment? Reasons and Roles in Science; and Is It the Same Result? Replication in Physics.

One of the interesting issues in the philosophy of experiment is that of the replicability of experimental results.1 The scientific community enthusiastically endorses the idea that “Replication – the confirmation of results and conclusions from one study obtained independently in another – is considered the scientific gold standard (Jasny, Chin et al. 2011).” The underlying argument for this is that if an experiment has succeeded in revealing a real phenomenon or accurately measuring a quantity, then that success should reappear when the experiment is repeated under the same circumstances or when it is reproduced in a different experiment. There are, however, questions about whether this standard is universally, or even typically, applied.2

Ian Hacking has noted that “… no one ever repeats an experiment. Typically serious repetitions of an experiment are attempts to do the same thing better—to produce a more stable, less noisy version of the phenomenon (Hacking 1983, p. 231).” Jack Steinberger, one of the leaders of a group that performed one of the second set of measurements of η+-, discussed below [in the book], concurs. “When we first proposed this experiment, we took it for granted that a more precise measurement of φ+- [the phase of the CP-violating amplitude] might have given a clue on the origin of CP violation, still one of the outstanding problems. This was the physics motivation for constructing the detector. There was another, purely experimental [reason]: we saw a way of doing a much better measurement than had been done (private communication to the author, emphasis added).” At the time, φ+- had not been well measured. In order to measure φ+-, however, one must use an interference technique, which involves both the magnitude and the phase of the amplitude, so that, in a sense, the measurement of η+- is free. It is not, however, a requirement that a replication be better. It must simply be good enough.

In this discussion I will adopt a broad definition of replication. It will not be limited to performing the experiment again with the same or a very similar experimental apparatus. I will also include experiments performed with different apparatuses and experiments that examine different phenomena that bear on the same theory or hypothesis.3

In all of this work there is an implicit assumption that the experimental results are credible. Franklin has suggested that this credibility is provided by the use of an epistemology of experiment, a set of strategies used to argue for the correctness of an experimental result (Franklin 2002, pp. 2-6, chapter 6).4 He has also argued for the rationality of these strategies by embedding them within a Bayesian approach (Franklin and Howson 1988). The use of these strategies will be discussed below and is often an important part of determining whether a replication is successful or not.
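
To make the Bayesian framing concrete, here is a minimal numerical sketch in Python of how successive independent confirmations raise the probability that a result is correct. The prior and likelihoods are invented for illustration; they are not values taken from Franklin and Howson (1988).

    # Bayesian updating on an experimental result.
    # h = "the reported effect is real"; each independent replication
    # either confirms the result or fails to.

    p_h = 0.5              # illustrative prior probability that the effect is real
    p_confirm_h = 0.8      # P(confirming result | effect is real)
    p_confirm_not_h = 0.1  # P(confirming result | effect is not real)

    for n in range(1, 4):
        # Bayes' theorem:
        # P(h | confirm) = P(confirm | h) P(h) /
        #                  [P(confirm | h) P(h) + P(confirm | not h) P(not h)]
        numerator = p_confirm_h * p_h
        p_h = numerator / (numerator + p_confirm_not_h * (1.0 - p_h))
        print(f"after confirmation {n}: P(effect is real) = {p_h:.3f}")

Each update assumes the confirmations are statistically independent, which is one way of seeing why “different” experiments (note 3) carry more evidential weight than repetitions that share the same potential systematic errors.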

There are several interesting questions involving the replication of experimental results. The first is how similar or different the results must be to count as successful or unsuccessful replications. Scientists have offered different answers to this question.5 Recently, questions have been raised concerning experimental results in high-energy physics. How statistically significant must a result be to be considered a discovery?6 Should the same standards be applied to a discovery and to a confirmation of that discovery? These questions do not have simple answers. The issue is more complex than merely stating that the results must agree within a certain number of standard deviations of the previous result or of a theoretical prediction to count as successful replication or as evidence for the hypothesis. A similar caveat applies to differences.
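
As a baseline for what “agree within a certain number of standard deviations” means in practice, the following Python fragment computes the difference between two hypothetical measurements in units of their combined uncertainty. The numbers are invented for illustration, and, as the text stresses, the resulting value is a starting point for judgment, not a verdict on replication.

    import math

    # Two hypothetical measurements of the same quantity: value, sigma.
    x1, s1 = 9.8, 0.3
    x2, s2 = 10.5, 0.4

    # Difference in units of the combined standard deviation, assuming
    # independent, roughly Gaussian uncertainties.
    z = abs(x1 - x2) / math.sqrt(s1**2 + s2**2)
    print(f"The results differ by {z:.1f} combined standard deviations.")  # 1.4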

In the episodes discussed below I will illustrate some of the different answers that have been given to those questions. Perhaps the simplest, and most appropriate, answer is “It depends.” How one answers the question will depend on the experimental question being asked and on the nature and quality of the initial experimental result and of the attempted replications. I suggest that there is, in fact, a spectrum that runs from clearly successful to clearly failed replications. In between are results that are regarded as being in agreement or disagreement depending on the circumstances. How the issue of agreement or disagreement was resolved will also be discussed.

One further possible problem concerning the replication of experimental results is what one might call the “bandwagon effect,” the possibility that experimenters are biased toward results that agree with previous results or with accepted theory.7 As stated by the Particle Data Group, which assembles the “Review of Particle Physics,” the standard reference on particle properties, “The old joke about the experimenter who fights the systematics until he or she gets the ‘right’ answer (read ‘agrees with previous experiments’) and then publishes contains a germ of truth (Kelly, Horne et al. 1980, p. S286).” Possible examples of this are discussed below. One technique that experimenters have devised to overcome this possible bias is “blind analysis,” the practice of setting the selection criteria on the data before the final result is calculated and known.8
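
One common implementation of blind analysis hides the answer behind a secret offset: the selection criteria are tuned and frozen while the analysts see only the offset value, and the offset is removed only afterward. The Python sketch below uses invented data to show the idea; in a real experiment the offset would, of course, be generated and kept out of the analysts’ sight rather than living in the same script.

    import random

    rng = random.Random(42)

    # A secret offset, fixed once and (in practice) hidden from the
    # analysts until the analysis is frozen.
    secret_offset = rng.uniform(-1.0, 1.0)

    # Hypothetical raw measurements of some quantity near 10.0.
    data = [10.0 + rng.gauss(0.0, 0.5) for _ in range(1000)]

    def blinded_result(cut):
        """Mean of the measurements passing the cut, shifted by the hidden offset."""
        selected = [x for x in data if abs(x - 10.0) < cut]
        return sum(selected) / len(selected) + secret_offset

    # The selection criterion is chosen while only the blinded value is visible ...
    final_cut = 1.5
    print("blinded result:  ", round(blinded_result(final_cut), 3))

    # ... and the offset is subtracted only after the cut is final.
    print("unblinded result:", round(blinded_result(final_cut) - secret_offset, 3))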

In this essay I will discuss the question of what one might mean by a successful replication, or by getting the same result. I will demonstrate, by examining cases from the history and practice of physics, that saying that two results are the same, or that they agree, is not always easy or obvious. There seems to be no universally agreed-upon standard for successful or failed replication or for the agreement or disagreement of experimental results. I will present cases in which the agreement between experimental results was accepted by a segment, but certainly not all, of the physics community. In some episodes, later experimental results persuasively showed that the initial results were wrong, even though the initial evidence appeared to be quite strong. These examples will illustrate the problems and difficulties of replication.

Notes

  1. Various terms have been used to describe the replication of experiments. These include “repetition” or “repeatability” to describe the performance of an experiment under the same or almost the same conditions. “Reproduction” or “reproducibility” has meant that the experimental conditions have changed. I believe that “repetition” shows the consistent operation of the experimental apparatus, although it can, on occasion, argue for the correctness of a result. “Reproducibility” argues for the validity, or correctness, of an experimental result. In the narrower context of standardized measurement, the International Organization for Standardization has decreed (ISO 21748:2010(E), p. 3): “Repeatability conditions include: the same measurement procedure or test procedure; the same operator; the same measuring or test equipment used under the same condition; the same location; repetition over a short period of time.” Reproducibility requires only that the measurement reappear under changed conditions. That is (ISO 21748:2010(E), p. 3): “reproducibility conditions [are] observation conditions where independent test/measurement results are obtained with the same method on identical test/measurement items in different test or measurement facilities with different operators using different equipment[.]” Source: “Guidance for the use of repeatability, reproducibility and trueness estimates in measurement uncertainty estimates,” Publication ISO 21748:2010(E). I will use “replication” to describe both.

  2. For a discussion of the replication problem in psychology see (Aarts, Anderson et al. 2015), (Gilbert, King et al. 2016), and (Anderson, Bahnik et al. 2016, p. 1037-c).

  3. Franklin and Howson (1984) have argued that “different” experiments provide more support for a hypothesis or an experimental result than replications of the “same” experiment. Here “different” experiments are those which have different theories of the experimental apparatus. Those theories can be compared by examining their consequence classes.

  4. These strategies include: 1) Experimental checks and calibration, in which the experimental apparatus reproduces known phenomena; 2) Reproducing artifacts that are known in advance to be present; 3) Elimination of plausible sources of error and alternative explanations of the result; 4) Using the results themselves to argue for their validity. In this case one argues that there is no plausible malfunction of the apparatus, or background effect, that would explain the observations; 5) Using an independently well-corroborated theory of the phenomena to explain the results; 6) Using an apparatus based on a well-corroborated theory; 7) Using statistical arguments; 8) Manipulation, in which the experimenter manipulates the object under observation and predicts what they would observe if the apparatus were working properly. Observing the predicted effect strengthens belief both in the proper operation of the experimental apparatus and in the correctness of the observation; 9) The strengthening of one’s belief in an observation by independent confirmation; 10) Using “blind” analysis, a strategy for avoiding possible experimenter bias, by setting the selection criteria for “good” data independently of the final result.

  5. This has been the subject of recent discussions of experiments in psychology. See (Simons 2013, Srivastava 2015). It is interesting to note that the statistical criterion for a significant effect used in psychology is two standard deviations, whereas particle physics demands a five-sigma effect for a discovery claim.

  6. In high-energy physics and in gravitational-wave physics the statistical criterion for a discovery is that the observed effect be five standard deviations above background. For a discussion and history of the criterion see (Franklin 2013a, Prologue). A short calculation translating these sigma thresholds into tail probabilities appears after these notes.

  7. For a discussion of this issue see (Franklin 1986, Chapter 8). See also the history of η+-, the magnitude of the CP-violating amplitude in K meson decay, discussed below.

  8. See (Franklin 2002, Chapter 6).
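
The statistical thresholds mentioned in notes 5 and 6 can be translated into tail probabilities with a short calculation. The Python sketch below uses only the standard library and the one-sided convention customary in particle physics.

    import math

    def one_sided_p(z):
        """One-sided Gaussian tail probability for a z-sigma excess."""
        return 0.5 * math.erfc(z / math.sqrt(2.0))

    print(f"2 sigma: p = {one_sided_p(2.0):.2e}")  # ~2.3e-02
    print(f"5 sigma: p = {one_sided_p(5.0):.2e}")  # ~2.9e-07

The two-sided value at two sigma, about 0.046, is what underlies the p < 0.05 convention in psychology; at five sigma the corresponding chance of a background fluctuation is smaller by roughly five orders of magnitude.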

Bibliography

Aarts, A.A., J.E. Anderson, et al. (2015). “Estimating the Reproducibility of Psychological Science.” Science 349(6251): aac4716.

Anderson, C.J., S. Bahnik, et al. (2016). “Response to Comment on ‘Estimating the Reproducibility of Psychological Science’.” Science 351: 1037-c.

Franklin, A. (1986). The Neglect of Experiment. Cambridge, Cambridge University Press.

Franklin, A. (2002). Selectivity and Discord. Pittsburgh, University of Pittsburgh Press.

Franklin, A. (2013a). Shifting Standards: Experiments in Particle Physics in the Twentieth Century. Pittsburgh, University of Pittsburgh Press.

Franklin, A. and C. Howson (1984). “Why Do Scientists Prefer to Vary Their Experiments?” Studies in History and Philosophy of Science 15: 51-62.

Franklin, A. and C. Howson (1988). “It Probably is a Valid Experimental Result: A Bayesian Approach to the Epistemology of Experiment.” Studies in History and Philosophy of Science 19: 419-427.

Gilbert, D.T., G. King, et al. (2016). “Comment on ‘Estimating the Reproducibility of Psychological Science’.” Science 351: 1037-b.

Hacking, I. (1983). Representing and Intervening. Cambridge, Cambridge University Press.

Jasny, B.R., G. Chin, et al. (2011). “Again, and Again, and Again…” Science 334: 1225.

Kelly, R.L., C.P. Horne, et al. (1980). “Review of Particle Properties.” Reviews of Modern Physics 52: S1-S286.

Simons, D. (2013). “What Counts as a Successful Replication?” Blog post.