Like words, numbers constitute a representational language. As such, it is fair to expect that, in practical applications, numbers can be shown to be anchored to real objects and real processes in the real world. In our present context, statistical findings reported from a controlled, or other so-called “systematic,” trial must be expected to accurately reflect the essential qualities of the medical (or other) practices under review. But such a happy coincidence cannot always be confirmed.
Thus, in a trial of homeopathy, if verum is found to perform no better than placebo, it is usually concluded that homeopathy “failed” the test. But the corresponding question must also be addressed – both as a qualitative test of the specific trial and as a principled challenge to the capabilities of the controlled trial format: is such a format – the idealized or abstract representation of the controlled trial – capable in the first place of measuring the relatively evanescent properties of homeopathic medicine?
At the outset, let me be clear that I do, in fact, consider the controlled trial capable of measuring homeopathic action; I have not always been convinced this was true, but at this point I absolutely endorse such a belief. Nevertheless, we encounter at least three matters of concern:
first, in adapting to homeopathic practices, the controlled trial is an experimental method that tests “real objects and real processes” indirectly, that is, it measures perception of those processes, rather than the processes themselves, as for example might be accomplished through laboratory testing;
second, we are not able to confirm that the controlled trial can adequately account for the evanescent quality of those processes in homeopathy, such as the generally subtle symptomatic display generated by homeopathic remedies, or the gradual appearance of symptomatic responses over a period of days or weeks; and,
third, in designing a trial environment – a protocol – that seeks to measure homeopathy, are researchers able to consistently control for confounding factors that arise from the complications inherent in the processes and products being examined? For example, are they able to assure us that they have harvested all symptomatic responses in the verum group, even when the remedy being tested has a proving record of hundreds or thousands of symptoms?
In the present installment of this two-part series, we will explore the implications of the second point, above, in establishing limits to the ability of the controlled trial to record subtle differences between “real” objects and processes, when we attempt to measure those processes indirectly. In discussing this question, I will not restrict myself to a review only of issues directly related to homeopathy, but will draw on a fairly broad range of examples, to clarify the types of concerns that must be addressed, as the controlled trial continues to develop into a more mature research method.
The limitations encountered in applying the controlled trial to various practices are obvious, palpable, and well recognized, even in the skeptical camp, though no one until now seems to have made a particular point of them. In the case of the general academic and medical communities, this failure can be attributed to a lack of expertise in so-called “systematic” research, and even to a lack of interest in it. In the case of the skeptical community, the failure to acknowledge the obvious limitations of their practice can only be attributed, and must be attributed, to bias. The significance of this is clear for our judgment regarding the ultimate merits of this art form (no sarcasm intended: after all, in our own domains we are all quite accustomed to acknowledging that, even as scientists or professionals, the quality of our work depends not only on concrete knowledge, but also on our individual abilities to apply our knowledge and skills “artfully”).
In the second part of this two-part series, we will explore the more practical end of this question, that is, the problem of evaluating the quality, reliability, and credibility of the “evidence” produced by specialists in one or the other of research methodologies. Specifically, we will focus on the evidentiary value of clinical observations, for example, as reported in the traditional clinical case study, and on the quality of evidence produced from so-called “systematic” research.
That human perception is, or is made out to be, notoriously unreliable in recording objective reality is the cornerstone of the modern faith in numbers, in the form of the controlled trial, as capable of providing relief from the distortions introduced by observer bias. It is ironic, then, that in the so-called “systematic” trial we invariably are measuring human perception, and only human perception, which, to underscore the obvious, we already know to be inaccurate. Therefore, if homeopathy fails in a controlled trial, the conclusion suggests itself that all we have demonstrated, with certainty, is that fallible human perception failed to distinguish real effects from imaginary ones. To the skeptic who is hostile to the bone toward homeopathic practice, such a circular, self-reinforcing scenario must be very reassuring.
But in the face of 200 years of unrelenting controversy, and the competition between evidence produced empirically and evidence produced by quantitative measurement, we need something to break the deadlock, beyond a self-satisfied presumption that numbers are more “objective” than perceptions. Reinforcing this need is the fact, remarked above, that the controlled trial is doing little more than “observing the observer”; in other words, it is simply applying the human perceptual apparatus through a different filter than usual. In short, we must find a way to confirm what the statistician asserts as a matter of faith: that numbers really are more objective and more reliable than observation. And to do that, we must measure the capability of the statistician’s methods against real differences between real objects and real processes in the real world – differences that are already known and readily demonstrated.
In other words, if we take two objects, which are differently constituted and readily distinguished, and if in the context of a “scientific” experiment the controlled trial fails to distinguish between them, then we have established a limitation to the methodology of controlled research. Further, if such a limitation exists in one domain, it must be assumed that it exists in other domains as well – as we assume as a matter of principle, in any case, since even a skeptic would not insist that the controlled trial is “good to go” for any scientific application one cared to imagine.
Once this essential point is finally grasped, it becomes apparent that the practitioner of the art of controlled research must henceforth be expected to present not only a protocol for a proposed trial, but also a Statement of Efficacy, placing the trial within the context of the characteristics of the object under his gaze. The statistician must be expected to define precisely – in fact, statistically – how sensitive an instrument is needed to accurately measure the object, or process, he seeks to measure. By comparison, the naked eye possesses sufficient power to observe Betelgeuse. On the other hand, a sixteen-inch catadioptric telescope, a clear night sky, and a time exposure might be needed before one can record the presence of a 15th-magnitude dwarf.
It is beyond the scope of this paper to suggest specific formulae to accomplish this task, in specifying the limits or tolerances of the controlled trial in measuring medical process. But the forthcoming discussion does intend to establish, at least in a rudimentary way, what variety of process it is, in a trial of homeopathy, that demands more precise (statistical) characterization, in the service of improving precision and accuracy of research.
In our more earthly domain – as compared to the stargazer – a placebo control may be all that is needed to establish the influence of bias on observation. But, in some circumstances, randomization between the arms of a trial may be essential to ensure accurate results. In other circumstances, a crossover design may be needed. These are by now standard modifications to the format of the controlled trial, long since introduced to adapt this important methodology to varying circumstances in trialing conventional medications. It is unfortunate that no one has thought it advisable to introduce similar, standardized, demonstrably effective safeguards into trials of practices, medical or otherwise, that do not conform in structure or process to the characteristic profile of the allopathic method. One can only believe this failure reflects the roots of controlled research in the allopathic tradition, and the reluctance of those who are possessed of a violent disdain for alternative practices to explore technical innovations that might cast more light on those practices than shadow.*
The Pepsi Challenge
Blinded, randomized, and replicated many times over, The Pepsi Challenge was one of the most successful marketing campaigns in recent memory, demonstrating to an affable public the unquestionable superiority of taste that Pepsi enjoyed over Coke – especially considering that devotees of Coke, participating in these trials, often revealed, through the shocked expressions on their faces, that they preferred the “wrong” product. Of course, we must note that the reported outcomes were not nitpicked by a panel of scientists, and the advertising spots were paid for, rather than being earned through a blinded peer-review screening process…
…so, in other words, these methodological inadequacies allowed numerous confounders to create suspicion, in the minds of some, concerning the validity of the findings. For example, in one iteration of the “trial,” the bottle of Pepsi was covered with a wrapper marked with an “M,” while the bottle of Coke was covered with a wrapper marked with a “Q.” Of course, the management of the Coca-Cola bottling company objected that, because of this, the only fair conclusion regarding this trial was that people liked the letter “M” better than they liked the letter “Q.”
Ahh, peer review, indeed!
In any case, what these “trials” really do demonstrate is what the skeptical community has told us all along: human perception can’t be trusted. People don’t even know what they like, well enough, to distinguish it from a competitor.
However, these trials also tell us that experimental measurements of human perception are unable to distinguish one product from another: Coke and Pepsi have similar, but nevertheless differing, recipes. And they do not taste the same. Many people have no preference between them, but many others are devoted to one or the other. And yet, under the watchful eye of the experimentalist, they cannot tell the difference between them. And, since the trial, when replicated by the manufacturers of Coke, showed in that instance that people preferred Coke over Pepsi, we are forced to conclude that the randomized, blinded trial produced no consistent evidence of a difference in taste between the two products, even though the real differences between them are clearly documented in the patented recipes of each.
In short, the differences between Coke and Pepsi escaped the notice of the controlled trial. Do we on that account conclude that the two products are indistinguishable? I think instead we are well advised to seek an answer to these questions: why did human perception fail, and why was the controlled trial unable to confirm real differences between real products in the real world?
Clearly, if the skeptical community wishes to establish, scientifically, the superiority of one product over the other, it will have to improve the methodology embodied in its trials, to better account for whatever confounders they discover that gummed up the works.
Facetiousness aside, however, I would recommend that the same challenge awaits them in their efforts to trial homeopathy, high end audio, and other non-allopathic processes. A few suggestions may suffice to illustrate the types of experiments that might be constructed, to confirm whether the controlled trial is capable of measuring real objects and real processes in the real world:
a) In a blinded trial, auditors will be asked to identify which performance of a piano sonata or a song is being played on a very expensive audio system, and which is performed by a live artist, sitting at a real piano positioned between the loudspeakers before them.
b) Blinded auditors identify which selection is performed on a Stradivarius, and which on an instrument by Giuseppe Guarneri del Gesù.
If the auditors cannot distinguish live from recorded performance, would the experimentalist therefore assure us there was no difference between the two performances? If the auditors cannot distinguish Stradivarius from Guarneri, would the experimentalist therefore assure us there was no difference between the instruments?
Being charitable, I assume the experimentalist – the statistician – would concede the trial outcome was invalid. Which is exactly the point: we have run into a limit on the capability of the controlled trial to measure reality.
Taking this a step further, I would suggest a series of trials of high-end audio gear, to begin to establish the gradations of qualitative steps, the continuum, that characterizes the range of quality found in increasingly fine audio components.
a) Blinded comparison of a boom box (System 1: S1) with a $300,000 high-end stereo system (S2).
b) …replace the boom box with a Bose (S3)
c) …replace the Bose with a $10,000 component system. (S4)
d) …replace the amplifier in S4 with a $10,000 pre-amp and a $10,000 power amp. (S5)
e) …replace the power amp in S5 with an $80,000 pair of Class A mono-blocks. (S6)
f) …replace the $100 interconnects with $1,000 interconnects. (S7)
g) … (S8 , S9….)
The hypothesis: as the component systems become increasingly expensive, the auditor will be less successful distinguishing which “arm” utilized, for example, the expensive cable and which utilized the inexpensive cable: in other words, in a $300,000 stereo system, the sound quality might be so good that the improvement achieved by incorporating more expensive cable would be relatively less noticeable, against the “background” of higher quality sound reproduction.
But experimentation on intermediate system set-ups could help isolate measurable effects of specific components, for example, by comparing S9 with itself, alternately played through the expensive and the inexpensive interconnects (the usual procedure), but also testing it against a less expensive system altogether: does changing the interconnect affect the ability of the auditor to distinguish S9 from S3?
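To make the scoring of such comparisons concrete, each blinded session can be treated as a simple forced-choice experiment: the auditor makes repeated A/B identifications, and an exact binomial test against chance tells us whether discrimination has been demonstrated. The sketch below is purely illustrative – the session labels, trial counts, and scores are hypothetical, and the 0.05 threshold is only a conventional choice, not a claim about what an adequate trial of audio gear would require:

```python
from math import comb

def binomial_p_value(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided p-value: the probability of scoring `correct` or better
    purely by guessing in a two-alternative blinded comparison."""
    return sum(comb(trials, k) * chance**k * (1 - chance)**(trials - k)
               for k in range(correct, trials + 1))

# Hypothetical sessions: an auditor compares S9 against S3, first with
# the expensive interconnects installed, then with the inexpensive ones.
for label, correct, trials in [("S9 (expensive cable) vs S3", 14, 16),
                               ("S9 (inexpensive cable) vs S3", 10, 16)]:
    p = binomial_p_value(correct, trials)
    verdict = "distinguishable" if p < 0.05 else "not demonstrated"
    print(f"{label}: {correct}/{trials} correct, p = {p:.3f} -> {verdict}")
```

The point of the exercise is not the particular numbers, but that a failure to reach significance in such a session establishes only a limit of the instrument, not the identity of the systems compared.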
Not to be cynical, but I can already hear the excuses emanating from the skeptical community, about how impractical and expensive it would be to conduct such a series of trials.
But our eminently practical friends ignore, as usual, the fact that science is not the handmaiden of practicality. Knowledge comes at a price, it takes effort, and it takes a determination to be exhaustive in its analyses; it requires that we resist assuming that our methodology works – especially in the absence of independent corroboration of its findings – and that, instead, we seek confirmation of our results.
Science takes work: work such as the indefatigable efforts of Darwin to record and organize the full range of empirical facts to be discovered in the field by the persistent observer; or the efforts of Freud to document a nearly endless variety of mental “products,” to analyze their inter-relationships exhaustively, and to publish his findings in the 24 volumes that make up his lifetime’s labors; or the life’s work of Hahnemann, his tireless effort to record each and every symptomatic response to remedies, in order to provide a comprehensive database on which to build a reliable medical practice.
Against such untiring labor, such monumental achievements, frankly, it is hard to credit very much the efforts of a relative handful of researchers, in producing fewer than 200 hastily contrived controlled trials over a period of several decades, that passed minimal scrutiny for methodological adequacy, and that present negative findings against homeopathy. Have any of these gentlemen, or all of them together, committed more than a few months, or even cumulatively more than a few years, to experimentation into homeopathic medicine? How do we credit the allopathic physician, who manages to fit a few trials of homeopathic remedies and herbal teas into the time that is left to him after he completes a busy day practicing in the ER, or scrubbing up for surgery? The more so, as these trials are easy to nitpick analytically – they simply fail to measure accurately that which they claim to measure.
Placebo, or Verum?
The Goldmann Visual Field Test measures peripheral vision. The patient looks into a hemispheric bowl (the perimeter), in which points of light flash on and off at different places in the field. The patient presses a button to register when he sees a point of light, and the points he misses define the impairment he may have in his visual field. But it is common in these tests for subjects to “see” a point of light when there isn’t one there. In this situation, the examiner knows when there is a point of light, and when there isn’t, so there isn’t any question when the subject gets it “wrong.” And yet, in this test it doesn’t matter if the patient is wrong: all that counts are the right answers. The wrong answers may be chalked up to placebo effect, but they have no bearing on whether other perceptions are real, or not, and do not affect the patient’s score.
This is relevant to the questions with which we are faced, since everyone can produce a placebo response, even those who are receiving verum. Because of this, some percentage of verum subjects will produce both verum and placebo responses, and this little twist must be accounted for in calculating outcomes in a controlled trial: the real rate of verum response will be greater than what is reflected in the statistics, because a greater or lesser percentage of those responses will be concealed, statistically, within the response rate returned by the control. True, in most situations the impact of this fact on our confidence in findings will not be great, but we cannot assume that this is always true, unless we are satisfied with being completely irresponsible in our approach to scientific investigation.
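As a rough numerical illustration of this masking – using entirely hypothetical response rates and a deliberately simplified overlap model of my own devising – consider how the observed responder rate in the verum arm changes depending on whether verum responders are drawn independently from the arm, or entirely from among the placebo responders:

```python
def observed_rate(p_placebo: float, p_verum: float, overlap: float) -> float:
    """Fraction of an arm counted as 'responders' when each participant is
    counted once, whether they produce one kind of response or both.
    overlap = fraction of verum responders assumed to come from among the
    placebo responders (0 = drawn independently, 1 = complete overlap)."""
    independent = p_placebo + p_verum - p_placebo * p_verum
    complete = max(p_placebo, p_verum)
    return (1 - overlap) * independent + overlap * complete

CONTROL = 0.30   # hypothetical placebo response rate, assumed equal in both arms
VERUM = 0.25     # hypothetical true verum response rate

for overlap in (0.0, 0.5, 1.0):
    rate = observed_rate(CONTROL, VERUM, overlap)
    print(f"overlap {overlap:.1f}: verum arm {rate:.3f} vs control {CONTROL:.3f}")
```

At complete overlap the verum arm returns the same responder rate as the control, even though a quarter of its subjects responded to the medicine: the verum responses have been swallowed whole by the placebo count.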
In the case of trials of homeopathy, this dynamic takes on an added twist, since every trial participant is capable of producing proving symptoms. Because of this, in principle, every placebo response of a verum subject must be considered, in a homeopathic proving trial, to be, potentially, superimposed over a verum response, whether the verum response is observed, or observable, or not.
Such a dynamic is at work in all clinical trials, of allopathic as well as homeopathic medicines. Yet it is characteristic of homeopathy, and of homeopathic trials, that most symptomatic responses are mild, and that the small doses applied mean that only those trial participants who are most sensitive to the remedies will respond to them in the first place.
These facts have the unexpected consequence that those participants who populate the group that responds to verum, will, because of their sensitivities, also populate the group that is most subject to the power of suggestion. In short, in a homeopathic proving trial, the same group of participants – the most sensitive participants in the trial – will tend to respond to both verum and placebo: statistically, this will mean that the experimental group, to take the most extreme case, may in reality produce twice the number of symptom responses as the control, yet “perform” no better than placebo, statistically.
In practice, no effort is made in the controlled trial to distinguish placebo from verum responses. In a formula, both (placebo and verum) may be represented by “s” (symptom), and the formula looks like this:
s : s (fig. 1)
If there were 20 responders in both the experimental arm and the control arm, then this outcome shows that verum performed no better than placebo:
20 : 20 (fig. 2)
Of course, in this scenario – the usual practice in the controlled trial – “s” refers not to each symptom that is produced, but to the fact that individual participants in the trial produced one or more symptoms. In the control arm, only placebo responses occur, for the simple reason, of course, that verum has not been administered.
But in the experimental arm, verum responses may be masked – statistically – because the single group of 20 participants can account for both numbers: the 20 individuals who produced the placebo symptoms, may also have produced the verum symptoms, yet they get counted only once.
Thus, to accurately record outcomes in such a trial, the type of symptomatic response must be differentiated as either a verum symptom (vs) or a placebo symptom (ps):
vs + ps : ps
In the experimental arm, therefore, if “John” produced a verum symptom and also a placebo symptom, in the first formula (fig. 1), he would be counted only once, as a “responder” in the experimental group, even though he produced two symptoms. In short, the verum response is lost to the final count (fig. 2).
Given these considerations, it is imperative that the symptoms produced by trial participants be analyzed clinically, not just statistically, since “real” symptomatic responses will typically present a different profile than symptomatic responses produced by placebo. Thus, a clinically based analysis of responses may be able to differentiate, in many cases, those responses that were truly “placebo responses,” from those that represented proving symptoms.**
In this case, then, “John” – a “complex responder” as we might call him – would be counted twice, his placebo response chalked up as “ps” and his verum response chalked up as “vs.” Then, when the final tally was made, we would have, potentially, a record (just as an example, for 20 responders in the experimental group) of 20 placebo responses and as many as another 20 verum responses. The formula then shows…
vs + ps : ps
20 + 20 : 20 =
40 : 20
…et voila! Verum has outperformed placebo, after all!
As a precautionary note to researchers who may choose to implement this measure, it should be emphasized that a trial participant should only be counted once as a placebo responder, and once as a verum responder. After all, the purpose of this “double count” is to assure that verum symptoms are not neutralized by placebo symptoms, in the summary results of the statistical survey. But we do not want to count all symptoms, just whether the individual participant should be counted as responding to placebo and/or verum: after all, if “John” produced 38 placebo responses, for example, he could single-handedly bury the verum group by his own extreme susceptibility!
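The counting rule described above – one tick per response type per participant, never one tick per symptom – can be sketched in a few lines. The participant records here are invented for illustration; nothing about them comes from an actual trial:

```python
# Hypothetical per-participant records from a proving trial's experimental arm.
# Each participant may carry a verum symptom ("vs"), a placebo symptom ("ps"),
# or both -- the "complex responder" described above.
experimental = [
    {"id": "John", "vs": True,  "ps": True},   # complex responder
    {"id": "Mary", "vs": True,  "ps": False},
    {"id": "Ana",  "vs": False, "ps": True},
]
control = [
    {"id": "Tom",  "vs": False, "ps": True},   # control can only respond to placebo
]

def tally(arm):
    """Count each participant at most once per response type, so a single
    highly suggestible subject cannot swamp the totals on his own."""
    vs = sum(1 for p in arm if p["vs"])
    ps = sum(1 for p in arm if p["ps"])
    return vs, ps

naive = sum(1 for p in experimental if p["vs"] or p["ps"])  # one tick per responder
vs, ps = tally(experimental)
_, control_ps = tally(control)

print(f"naive responder count (experimental): {naive}")
print(f"differentiated count: vs + ps : ps = {vs} + {ps} : {control_ps}")
```

In the naive count, John’s verum symptom disappears into his single “responder” tick; the differentiated tally preserves it, while still refusing to credit him with more than one “ps,” however many placebo symptoms he produces.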
The failure of statisticians, over the past 200 years, to notice this situation is simply one more example of bias. It represents the universal human tendency to sustain faith in the value of our own beliefs in the face of a challenge from unfamiliar sources. In the present context, the statistician – conveniently – sees no reason to question the methods he has used to test his preferred (allopathic) medical practices, and assumes they are, prima facie, adequate for testing everything else as well.
In short, remembering that homeopathic remedies typically produce mild symptoms, and that the small homeopathic dose is specifically designed to produce mild, subtle effects, it is not surprising that such effects are often missed in “objective” trials. As we now see, this problem may be exacerbated, since many of the symptoms may be masked by a cloud of undifferentiated statistics, counting one thing but claiming to have counted everything else in the world, too.
So long as no one is challenging them, the statisticians amongst us appear quite content insisting the marbles are either black or white … never mind that one over there, with the stripes.
Finally, in this context, it is not surprising to recall that the statistician has no working definition of placebo, no laboratory-based, no clinical, and no descriptive standards by which to measure the specific symptomatic product. Instead, he relies exclusively on the statistical behavior of the control group, to set the standard for the experimental group. But such a procedure guarantees that the “complex responder,” as we have dubbed him, is counted only once, eliminating the (statistical) testimony of his second symptom, the one that, arguably, is a response to the real medicine.
Such, at least, is the attitude that must be adopted in designing a trial intended to measure homeopathic action: in other words, the statistician must be able to prove whether such a process is or is not affecting the observed results of the trial … assuming it is not asking too much, to expect accuracy to characterize the products of scientific inquiry.
In the present paper I have tried to show, what should in any case be obvious, that reality is sometimes difficult to measure. It may be costly and inconvenient, but it remains true that scientific research that deserves the name demands we demonstrate that we have in fact measured our subject accurately and exhaustively.
To that end, the scientific community, the general public, and our government leaders – who often look to research scientists for guidance – should insist, at long last, that standards be established, based on proofs, that our research methods return honest results. After all, making one’s way, blindfolded, through a maze of self-referencing calculations, is not a substitute for good observation or accurate measurement.
* In the editorial at the head of the present issue, I have discussed this point in some detail, in respect to the example of Belladonna, a homeopathic remedy with 1040 associated symptoms, as compared to the one symptom that, archetypically, characterizes the field of action of the allopathic medicament.
** See my book review in the March 2006 issue of this journal for a discussion of this process.