CHAPTER III-2
BACKGROUND AND ANALYSIS OF THE RESAMPLING METHOD
INTRODUCTION
The term "resampling" has been applied to a variety of
techniques for statistical inference, among which stochastic
permutation and the bootstrap are the most characteristic.
Resampling methods are evolving rapidly, and their scopes and
interrelationships are not always clear. Therefore, the aim of
this chapter is to distinguish the various techniques falling
under the resampling rubric from other related techniques, in
order to aid discussion of the set of methods.
There are two domains corresponding to the term
"resampling". The wider domain includes all uses of simulation
techniques for statistical inference (though not uses of
simulation for the development of other techniques). The
narrower sub-domain includes only simulation techniques that
reuse the observed data to constitute a universe from which to
draw repeated samples (without replacement = permutation
techniques, with replacement = bootstrap techniques).
Another way to state the definition of the narrower,
quintessential domain: Those techniques that a) use in their
entirety (though not necessarily with replacement) the sample
data to repeatedly produce hypothetical samples, either by
drawing subsamples stochastically or by rearranging the original
observations stochastically, and then b) compare the results of
those simulation samples to the observed sample.
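The two sampling schemes in this definition can be sketched in a few lines of code. The data values below are purely illustrative; the point is only the mechanical difference between rearranging the observations (without replacement) and redrawing them (with replacement).

```python
import random

random.seed(1)

# Illustrative observed data (any small sample will do).
observed = [12.1, 9.8, 14.0, 11.5, 10.2, 13.3]

# Permutation-style experimental sample: the observed data re-used in
# their entirety, WITHOUT replacement (a stochastic rearrangement).
perm_sample = random.sample(observed, len(observed))

# Bootstrap-style experimental sample: the observed data re-used WITH
# replacement (each draw comes from the full original sample).
boot_sample = [random.choice(observed) for _ in observed]

print(perm_sample)
print(boot_sample)
```

In both cases every experimental sample is built entirely from the observed data; the only choice is whether a drawn observation is set aside or returned to the pool.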
It will not always be easy to keep these two domains - or
domain and sub-domain - clearly distinguished. Additionally, I
argue that it is often useful to extend the term "resampling"
beyond statistical inference and into probability, where one
generates the simulation samples from a known device rather than
from an unknown universe estimated by the observed data, because
the mathematical simulation processes are identical in
probability and statistics.
Because resampling is still in its early stages, there is
little consensus about its definitions as well as its practices,
which means that the discussion will inevitably have many loose
ends and be open to many rebuttals. But I hope that the reader
will view the vigor and yeastiness of the controversy as
indicating that this is the beginning of a discussion where there
are important issues to be discussed, rather than concluding that
the discussion that follows is unsatisfactory because it is
subject to so many criticisms. (Indeed, absence of loose ends
typically indicates that a topic is so settled that no further
discussion is needed.) The appropriate question, as I see it, is
not whether there are flaws in the discussion to follow, but
rather whether the issues deserve to be aired in public where
they can be thrashed out.
The next section discusses the intellectual paths that have
led to the general resampling method. The following section
provides a classification of resampling methods and discusses
their characteristics. After that comes a section of comment.
ROADS TO RESAMPLING
Several quite different intellectual roads have led to the
body of methods called "resampling" as of 1996. The fact that
such different roads lead to the same place may be considered
empirical evidence for the inevitability of the general approach
to inferential statistics, even before high-speed computers
became commonplace and cheap.
Dwass and Chung-Fraser: Approximation of the Classical Method
The first publications of any of the techniques that now
make up the resampling kit bag were by Meyer Dwass in 1957 and by
J. H. Chung and D. A. S. Fraser in 1958. Both papers pointed to the
value of Fisher's permutation test (1935; see also Pitman's
advances in the direction begun by Fisher; 1937, pp. 322-335)
misleadingly called the "randomization" technique. Both noted
that with a large sample the "exact" Fisher test is not feasible
because of the computational difficulty (before the age of
powerful computers). They then suggested that a randomly
selected subset of the possible permutations could
provide the benefits of the permutation test without excessive
computational cost.
The underlying idea was to use the power of sampling, in a
fashion similar to the way it is used in empirical samples from
large universes of data, in order to approximate the ideal test
based on the complete set of permutations. And they showed that
the approximation would be quite satisfactory. So their vision
was to gain the benefits of the classical array of methods -
though not a parametric test in this case - by the technical
device of simulation sampling.
Unfortunately, this technical trick does not work in the
case of parametric tests themselves, because there is no way to
use the device of sampling to replace the formulaic basis and the
tabular superstructure of such methods as the Normal-based z test
or the t-test (though the stochastic permutation test may be seen
as a substitute for the t-test and hence a way of evading its
use). Therefore, the path opened by Dwass and Chung-Fraser does
not immediately broaden out intellectually into the resampling
highway, though Draper and Stoneman built on the earlier work
with an application of the permutation test to regression (1966).
The first method used for comparison of liquor prices in
Chapter III-1 is a permutation test.
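The Dwass and Chung-Fraser idea - approximating the exact permutation test by a randomly generated subset of the possible rearrangements - can be sketched as follows. The two groups of prices are hypothetical stand-ins, not the liquor-price data of Chapter III-1.

```python
import random

random.seed(0)

# Hypothetical two-sample data, stand-ins for prices under two regimes.
group_a = [4.82, 5.29, 4.89, 4.95, 5.10, 4.45, 4.64]
group_b = [4.35, 4.15, 4.20, 4.55, 3.80, 4.00]

observed_diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

pooled = group_a + group_b
n_a = len(group_a)

# Rather than enumerate all C(13, 7) = 1716 possible rearrangements (the
# "exact" Fisher test), draw a random subset of them, as Dwass and
# Chung-Fraser proposed.
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
    if diff >= observed_diff:
        count += 1

p_value = count / trials  # approximate one-sided permutation p-value
print(p_value)
```

With enough trials the approximation comes as close to the exact test as one pleases, which was precisely the point of their proposal.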
It should be noted that the stochastic permutation test is
one of the two central resampling techniques; it is true
"resampling" in the sense of treating the observed data as the
best guess about the nature of the universe of interest, and then
re-using those data as the basis for experimental samples. It
should also be noted that stochastic permutation and the
bootstrap are identical except for whether or not the samples are
taken with replacement, and the two methods converge toward the
same result as observed sample size increases; in many
applications it is difficult or impossible to establish a clear
philosophic justification for the use of one or the other
technique. There is one conceptual difference, however; the
stochastic permutation may plausibly be seen as a sampling
"approximation" to an "exact" technique; no such notion is
possible for the bootstrap, so the latter is even more distant
from the conventional approach than is the former.
The inherent sensibleness of a stochastic permutation test
is evidenced by its independent, nearly simultaneous discovery,
and also by its later re-discovery by me (1969) and by Feinstein
(1973).
Additional evidence of this is the independent re-discovery of
the same stochastic permutation principle in the somewhat
different context of tests of significance with survival data in
1970 by Forsythe and Frey.
The foregoing writers viewed a stochastic simulation as a
less exact approximation of the ideal test. There was no mention
that the essential nature of simulation differs from formulaic
tests in not requiring counting the points in the sample space,
the central element in probability theory.
Barnard's Test for the Fit to a Distribution
In a few brief paragraphs in a 1963 comment that is
difficult to find even with the citation in hand, Barnard
(1963) suggested a simulation test for how well a given sample
fits a theoretical distribution - specifically, a runs test based
on comparison to the results of drawings from a (horizontal)
distribution - and envisioned doing the work with the aid of a
computer. As with Dwass and Chung-Fraser, Barnard was offering a
simulation technique as an inexact substitute for a formulaic
method when the formulaic method is infeasible. This was stated
clearly in a comment by Hope (1968) in the context of further
work on Barnard's test. Hope recommended that Monte Carlo tests
not be the tool of first resort. "It is preferable to use a
known test of good efficiency instead of a Monte Carlo test
procedure ..." (Hope, 1968, p. 582). Hope did go further,
however, and recommended that a resampling test be used when "the
necessary conditions for applying the [conventional] test may not
be satisfied, or the underlying distribution may be unknown or it
may be difficult to decide on an appropriate test criterion.
Also, it is possible that only a physical model can be obtained
which cannot be expressed in mathematical terms" (p. 582). But
resampling still is seen as a second-best method. [1]
Barnard's test is in the penumbra of resampling because it
uses an independent device - a coin, or random numbers - to
generate the trial simulation samples. It may be viewed either
as a third category intermediate between formulaic methods and
core resampling, or as a member of the larger resampling domain
but not of the core sub-domain.
An Overall Approach to Inference, and the Bootstrap
In 1967 (Simon, 1969a, 1969b, chapters 23-25; 3rd edition
with Paul Burstein, 1985) I developed the resampling method in a
very general context, starting with first principles of
simulation and statistical inference rather than with any
particular formulaic device. I illustrated the general idea and
showed its breadth and power with a variety of methods (including
the bootstrap and the stochastic permutation test and many
others) for a range of problems including hypothesis tests,
confidence intervals, fits to distributions, fixing of sample
size, and other statistical needs. The intellectual basis was
the centuries-old practice of experimentation to learn the odds
in gambling games, together with the idea of Monte Carlo
simulations of complex physical phenomena at Rand during and
after World War II (see Ulam, 1976, pp. 196-199; Metropolis and
Ulam, 1949; for their group, Monte Carlo was "a statistical
approach to the study of differential equations", a device for
dealing with the "completely intractable task...in closed form"
[pp. 335, 337, 338], that is, a taking from statistics, whereas
for me the method was an approach to statistical practice, a
giving to statistics). I referred to the work as "Monte Carlo"
and I wrote that it departed from the earlier work of Dwass and
Chung-Fraser (seen as examples of the practice of resampling at
large) and at Rand in two main ways: 1) I dealt with simple
problems where the value of the technique's use was that persons
could arrive at sound solutions that are perfectly understandable
rather than using mysterious formulas that are often wrongly
chosen; this included problems in probability as well as
statistical inference. 2) Despite illustrating the use of the
same general method on probabilistic problems, I focused on the
problems of statistical inference rather than those considered in
studies of probability, and mapped out the entire range of
applied problems in inferential statistics. This emphasis on a
new general technique to be used across the board, rather than on
a single particular device to be used in a particular situation,
was the most radical innovation and the aspect of the work that
evoked (and still evokes) the most resistance, because it calls
into question the existing body of formulaic methods.
When first developing this material I was not aware of the
work of Dwass and of Chung and Fraser and hence, like Feinstein
after me (1973), I re-invented their idea. [2] I subsequently
attributed to them the entire vision of a Monte Carlo approach to
statistical inference, though in retrospect it can be seen that
their view of the matter was much less general.
When in 1976 I (with Atkinson and Shevokas) published
results of controlled experiments showing that persons arrive at
more correct answers to basic statistical problems when they are
taught and employ resampling methods rather than conventional
formulaic methods, I wrote: "It must be emphasized that the
Monte Carlo method as described here really is intended as an
alternative to conventional analytic methods in actual problem-
solving practice...the simple Monte Carlo method described here
is complete in itself for handling most - perhaps all - problems
in probability and statistics" (1976, p. 734). I believe that
that statement, with the vision that it expresses, is the radical
departure from previous thought on the practice of statistics.
The development of particular techniques is subsidiary.
It seems to me that this general statement is the most
important element in the resampling approach.
The Bootstrap: Re-sampling with Replacement
The comparison of liquor prices in private-enterprise and
state-owned systems discussed in Chapter III-1 can also be done
by sampling with replacement rather than by permutation; such a
test was dubbed the "bootstrap" by Bradley Efron in 1979. It was
first published in three examples in Simon (1969b), and in
further discussion in correspondence with Kruskal (1969); it was
in common use at the University of Illinois in the early 1970s,
being the only method used for hypothesis-testing in the 1976
text by Atkinson, Shevokas, and Travers.
Recently I have concluded that a bootstrap-type test has
better theoretical justification than a permutation test in this
case, though the two reach almost identical results with a sample
this large. The following discussion of which test is most
appropriate brings out the underlying natures of the two
approaches, and
illustrates how resampling raises issues which tend to be buried
amidst the technical complexity of the formulaic methods, and
hence are seldom discussed in print.
Imagine a class of 42 students, 16 men and 26 women, who come
into the room and sit in 42 fixed seats. We measure the distance
of each seat to the lecturer, and assign each a rank. The women
sit in ranks 1-5, 7-20, etc., and the men in ranks 6, 22, 25-26,
etc. You ask: Is there a relationship between sex and ranked
distance from the front? Here the permutation procedure that
resamples without replacement - as used above with the state
liquor prices - quite clearly is appropriate.
Now, how about if we work with actual distances from the
front? If there are only 42 seats and they are fixed, the
permutation test and sampling without replacement again is
appropriate. But how about if seats are movable?
Consider the possible situation in which one student can
choose position without reference to others. That is, if the
seats are movable, it is not only imaginable that A would be
sitting where B now is, with B in A's present seat - as was the
case with the fixed chairs - but A could now change distance from
the lecturer while all the others remain as they are. Sampling
with replacement now is appropriate. (To use a technical term,
the cardinal data provide more actual degrees of freedom - more
information - than do the ranks).
Note that (as with the liquor prices) the seat distances do
not comprise an infinite population. Rather, we are inquiring
whether a) the universe should be considered limited to a given
number of elements, or b) could be considered expandable without
change in the probabilities; the latter is a useful definition of
"sampling with replacement".
As of 1996, the U.S. state liquor systems seem to me to
resemble a non-fixed universe (like non-fixed chairs) even though
the actual number of states is presently fixed. The question the
research asked was whether the liquor system affects the price of
liquor. We can imagine another state being admitted to the
union, or one of the existing states changing its system, and
pondering how the choice of system will affect the price. And
there is no reason to believe that (at least in the short run)
the newly-made choice of system would affect the other states'
pricing; hence it makes sense to sample with replacement (and use
the bootstrap) even though the number of states clearly is not
infinite or greatly expandable.
In short, the presence of interaction - a change in one
entity causing another entity also to change - implies a finite
universe composed of those elements, and use of a permutation
test. Conversely, when one entity can change independently, an
infinite universe and sampling with replacement with a bootstrap
test is indicated.
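The near-identity of the two tests on a given body of data, noted above, is easy to demonstrate: run both on the same pooled observations and compare the resulting probabilities. The figures below are illustrative stand-ins for the two groups of prices.

```python
import random

random.seed(42)

def mean(xs):
    return sum(xs) / len(xs)

# Illustrative stand-ins for prices in two groups of states.
private = [4.82, 5.29, 4.89, 4.95, 5.10, 4.45, 4.64, 5.25, 4.70]
state = [4.35, 4.15, 4.20, 4.55, 3.80, 4.00, 4.60, 4.10]

observed = mean(private) - mean(state)
pooled = private + state
n = len(private)
trials = 20_000

def resample_p(with_replacement):
    # Proportion of experimental samples whose mean difference is at
    # least as large as the observed one.
    count = 0
    for _ in range(trials):
        if with_replacement:  # bootstrap-style redrawing
            draw = [random.choice(pooled) for _ in pooled]
        else:  # permutation-style rearrangement
            draw = random.sample(pooled, len(pooled))
        if mean(draw[:n]) - mean(draw[n:]) >= observed:
            count += 1
    return count / trials

p_perm = resample_p(with_replacement=False)
p_boot = resample_p(with_replacement=True)
print(p_perm, p_boot)  # with samples of this size, the two are very close
```

The only line that differs between the two procedures is the one that draws the experimental sample, which is the whole of the replacement question.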
Efron's Route to the Bootstrap and Development of It
Efron connects his work - at first, his apparent rediscovery of the
bootstrap and later his wider applications of resampling - to the
jackknife. "Historically the subject begins with the Quenouille-Tukey
jackknife, which is where we will begin also" (1979; 1982, p. 1). This
connection to the jackknife is immediately obvious in the title of his
first work on the subject, and in his discussions on the subject since
then.
Diaconis and Efron later wrote: "There are close
theoretical connections among the methods [cross-validation,
jackknife, bootstrap]. One line of thinking develops them all,
as well as several others, from the bootstrap" (1983, p. 130).
But this statement refers to the logical connections, which are
the reverse of Efron's historical process.
However, the jackknife (Quenouille, 1956, and Tukey, 1958)
and cross-validation (or "sample splitting"; see Mosier, 1951,
pp. 5-11), are entirely outside the definitions of resampling,
whether narrow or broad. They are connected with each other and
with the bootstrap by a very different line of thinking than the
concept of resampling; rather, they share the common aim of
inferring reliability. That is, the motivations of Quenouille
and Tukey in inventing the jackknife, and of Efron in
developing the bootstrap, may have been similar. The natures of
the devices are very different, to wit:
Cross-validation separates the available data into two or
more segments, and then tests the model generated in one segment
against the data in the other segment(s); clearly there is no re-
use of the same data, nor is there any use of repeated simulation
trials.
The jackknife - which, as Low (entry in the Encyclopedia of
Statistical Science, Kotz & Johnson, v. 8, 1983) notes,
"reduce[s] the size of the sample [really, the resampling
universe] in each of the re-computations of the statistic" - does
not use the data in their entirety for each trial, [3] a key
characteristic of resampling. That is, the observations omitted
from the experimental samples are designated systematically,
whereas in resampling (see definition below) observations are
probabilistically omitted from the experimental samples. To put
it differently, every jackknife analysis of a given set of data
produces the same result, unlike resampling processes (assuming
no problem with the seed in the random-number generator).
The jackknife has in common with the resampling techniques
discussed here the partial re-use of data, but it does not a)
resample from them, or b) use all of them for any given sample. The
jackknife has more in common with such scientific practices as
examining the results when leaving off the extreme observations
in a sample -- as is suggested visually by some of Tukey's
graphic techniques -- than it does with other methods included in
the present definition of resampling. (The jackknife also makes
use of the t distribution, which also puts it outside of the
basic definitions of resampling as discussed below.)
Indeed, though Efron (quoted above) came to the bootstrap
and then resampling more generally by way of the jackknife, he
notes: "In fact it would be more logical to begin with the
bootstrap..." (1982, p. 1). (And indeed, discussion of the
jackknife has diminished severely over time in connection with
resampling.) But it is even more logical to begin with the
general vision of resampling as embodied in the definition given
here and in the wide range of techniques shown in my 1969 book.
The literature, as I read it, has been moving more and more
toward that vision. And there is some movement in
introductory texts to present resampling techniques as tools of
first resort rather than tools to which one should turn only when
stymied in the search for formulaic methods.
There seems to have been no connection between Efron's
development and the concept of Monte Carlo simulation, and
simulation in general; the index of Efron and Tibshirani (1993)
lists them only with reference to specific practices in a few
particular bootstrap applications, with no reference to Stan Ulam
(the putative father of the Monte Carlo method and label at
Rand). Nor is there connection between Efron's work and that of
Dwass and Chung-Fraser; to my knowledge, they are never referred
to in his writings (though I have not examined them all).
At the end of this survey of the origins of resampling, it
is interesting to note that the long tradition of experimental
studies of distributions and properties of estimators in
statistics and econometrics, with Student being an early
distinguished example, did not enter into the thinking of any of
the intellectual streams discussed above. Nor did the use of
simulation for pedagogical illustration such as the sampling
distribution of the sample mean.
THE CHARACTERISTICS AND THE CLASSIFICATION OF METHODS
The previous section briefly described resampling, giving
both a core definition and a wider definition. This section goes
into more detail about the characteristics of resampling
techniques in contrast to techniques that are outside the
resampling domain(s).
Re-use of the Available Data to Generate Repeated Samples
Systematic re-use of the available data is the central
characteristic of resampling, and it is at the heart of the
following core definition of resampling: Use in their entirety
(though not necessarily with replacement) the observed data to
repeatedly produce experimental samples, either by drawing
subsamples stochastically or by rearranging the original
observations stochastically, and then compare the results of
those simulation-trial samples to the original sample data.
Consider this Efron-Tibshirani definition of the bootstrap:
"A bootstrap sample x* = (x1*, x2*, ..., xn*) is obtained by
randomly sampling n times, with replacement, from the original
data points x1, x2, ..., xn." Another similar and very clear
statement elsewhere: "Each bootstrap sample has n elements,
generated by sampling with replacement n times from the original
data set" (p. 13, below their Figure 2.1). If we now amend that
definition by writing "with or without replacement", the
permutation test is included and the definition is a formal and
precise description of core resampling methods. [4]
The wider definition of resampling also includes other uses
of simulation techniques for statistical inference that generate
samples by random drawings from distributions derived other than
from the observed data - for example, Barnard's drawings from a
horizontal distribution against which to compare an observed
sample. This wider definition would seem to encompass devices to
serve all or most of the purposes of statistical inference, and
such a wide definition was the basis of the suggestion made
explicitly in Simon, Atkinson, and Shevokas (1976) that
resampling be thought of as the first option in all situations.
If I were to choose a label for the wider domain, I would
call it the "best-guess-universe" method. This not only has the
virtue of including inferential simulation methods other than
those that re-use the data, but this label also points up that
when one has a better guess about the universe than just the
observed data - when the data are very few, for example, and
other information or assumptions should be used in a Bayesian
spirit - one should then not be limited to the use of the
observed data.
One can also broaden the definition of core resampling to
include not only problems in probabilistic statistics but also
problems in probability, by including the phrase "or the data-
generating mechanism (such as a die)" after "observed data" in
the definition above. Problems in pure probability may at first
seem different in nature than the probabilistic-statistical
(inverse probability) problems, and foreign to the concerns of
statisticians. But the same logic as stated in the definition
above applies to problems in probability as to problems in
inferential statistics. The only difference is that in
probability problems the model is known in advance -- say, the
model implicit in a deck of cards plus a game's rules for dealing
and counting the results -- rather than the model being inferred
from, and best described by, the observed data, as in resampling
statistics.
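A pure-probability example in this spirit, where the data-generating mechanism (a die) is known in advance so that no inference about an unknown universe is involved:

```python
import random

random.seed(7)

# A classic pure-probability question: the chance of at least one six
# in four throws of a fair die. The "universe" here is the known die,
# not an estimate built from observed data.
trials = 100_000
hits = sum(
    any(random.randint(1, 6) == 6 for _ in range(4))
    for _ in range(trials)
)
estimate = hits / trials
exact = 1 - (5 / 6) ** 4  # closed-form answer, about 0.518, for comparison
print(round(estimate, 3), round(exact, 3))
```

The simulation logic is identical to that of the statistical tests above; only the source of the model differs, which is the point of extending the term "resampling" into probability.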
Efron has given a definition in the same spirit: "You use
the data to estimate probabilities and then you pick yourself up
by your bootstraps and see how variable the data are in that
framework" (Science, 13 July, 1984, p. 157). Though Efron
focuses upon the variability of a sample statistic in this
definition, the centrality of re-use is apparent. (And though he
was referring only to the bootstrap technique, this definition
obviously applies to permutation tests as well.)
It has been noted earlier that the jackknife and cross-
validation do not fit the definition of resampling. Nor do other
standard closed-form methods in inference.
Non-Use of the Gaussian Distribution
The non-use of the Normal distribution is another of the
central characteristics of resampling. This non-Gaussian
characteristic separates the lines of work included here as
resampling from such methods as cross-validation, which is likely
to use a Gaussian-distribution-based test to determine the
goodness of fit of the model, and in any case does not break with
the older tradition in this respect.
The Normal distribution might enter into resampling work if
the problem is to test whether a given sample fits the Gaussian
shape reasonably closely. And it might be used to broaden the
best-guess universe when there are very few observed data.
This is one of the two characteristics that Diaconis and
Efron also cite as fundamental to the methods under discussion
here. They have written of "freedom from two limiting factors
that have dominated statistical theory since its beginnings: the
assumption that the data conform to a bell-shaped curve and the
need to focus on statistical measures whose theoretical
properties can be analyzed mathematically". And they say that
"Freedom from the reliance on Gaussian assumptions is a signal
development in statistics" (1983, p. 116).
Even more generally, resampling proceeds without the use of
any theoretical distributions, which is another reason not to
consider the jackknife as a resampling method.
The entire body of resampling methods (see e. g. Simon,
1969b, 1993; Noreen, 1986; and Efron and Tibshirani, 1993)
proceeds without the Gaussian distribution. It should be noted,
however, that there are several reasons for departing from the
Gaussian distribution. My aim was to avoid the use of any
intellectual device or formula that the typical user does not
understand completely, all the way down to the intuitive roots.
The use of any parametric test founded on the Gaussian
distribution fails on this criterion, if only because of the
intuitive difficulty of the very formula for the Gaussian
distribution, which few know and fewer understand. Technical
advantages such as the increase in efficiency and reduction in
bias that non-parametric (especially simulation) tests often (but
not always) provide are in my view a bonus, rather than the
central motivation that they were for Efron.
Computer (or Computational) Intensivity
In the view of many, involvement with computers is central
to the resampling methods under discussion here. Noreen called his
book Computer Intensive Methods for Testing Hypotheses (1989),
and Diaconis and Efron titled their 1983 Scientific American
article "Computer-Intensive Methods in Statistics".
In my view, however, computer intensivity is not a
fundamental demarcation between resampling and conventional
methods. For small data sets, resampling tests can often be done
quite satisfactorily without any calculating machinery, let alone
high-powered machinery. For example, the law-school example that
is the centerpiece of the Diaconis-Efron "Computer-Intensive
Methods..." article can be done with a pack of 15 cards. A hundred samples
of 15 draws (with replacement) provides quite a satisfactory test
for most purposes, and (aside from the computation of the
correlation coefficient for each sample, which is not part of the
bootstrap operation) can be done in an hour or two, less time
than a conventional test might take even if the user did not have
to look up the conventional formula. A computer is more
convenient than shuffling cards, of course. But a thousand
repetitions of that test can be done on the cheapest and most
primitive personal computer in a couple of minutes at the most,
which is not computationally intensive. And doing the test
without the intercession of the computer often helps make the
process intuitively clear to the person who performs the test.
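That card-and-bootstrap procedure can be sketched on a computer rather than by hand. The fifteen (LSAT, GPA)-style pairs below are illustrative stand-ins, not necessarily the published law-school figures.

```python
import random

random.seed(3)

# Illustrative (LSAT, GPA)-style pairs for 15 schools -- stand-ins, not
# necessarily the published law-school figures.
pairs = [(576, 3.39), (635, 3.30), (558, 2.81), (578, 3.03), (666, 3.44),
         (580, 3.07), (555, 3.00), (661, 3.43), (651, 3.36), (605, 3.13),
         (653, 3.12), (575, 2.74), (545, 2.76), (572, 2.88), (594, 2.96)]

def corr(ps):
    # Pearson correlation coefficient, computed from first principles.
    n = len(ps)
    mx = sum(x for x, _ in ps) / n
    my = sum(y for _, y in ps) / n
    sxy = sum((x - mx) * (y - my) for x, y in ps)
    sxx = sum((x - mx) ** 2 for x, _ in ps)
    syy = sum((y - my) ** 2 for _, y in ps)
    return sxy / (sxx * syy) ** 0.5

# Bootstrap: resample the 15 pairs WITH replacement (the "pack of 15
# cards"), recomputing the correlation for each experimental sample.
boot_rs = sorted(corr([random.choice(pairs) for _ in pairs])
                 for _ in range(1000))

# A rough 90 percent interval for the correlation coefficient.
print(round(boot_rs[50], 2), round(boot_rs[949], 2))
```

Only the computation of the correlation coefficient itself requires a formula; the bootstrap operation proper is nothing but shuffling and drawing.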
An example of a practical problem in hypothesis-testing,
performed without the computer by a research assistant in an hour
or so (Lyon and Simon, 1968), concerned whether average state
income is related to the price elasticity of demand for
cigarettes. The arc elasticity was estimated for 73 state tax
changes, and then the medians were calculated for the 36 tax
changes among the high-income states and the 36 tax changes among
the low-income states. A Monte Carlo randomization test was then
conducted by shuffling cards, and twenty trials were sufficient
to show that the difference in observed medians was not
infrequent on the null hypothesis of no difference due to income.
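The same shuffling test can be sketched in code. The elasticities below are randomly generated stand-ins, since the original figures are not reproduced here, and more trials are run than the twenty that sufficed by hand.

```python
import random

random.seed(11)

# Randomly generated stand-ins for arc elasticities of 72 tax changes,
# half from high-income and half from low-income states.
high = [random.gauss(-0.45, 0.3) for _ in range(36)]
low = [random.gauss(-0.50, 0.3) for _ in range(36)]

def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    return (s[mid - 1] + s[mid]) / 2 if len(s) % 2 == 0 else s[mid]

observed = abs(median(high) - median(low))
pooled = high + low

# The card-shuffling test: repeatedly deal the pooled elasticities into
# two hands of 36 and count how often the difference in medians is at
# least as large as the one observed.
trials = 1000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    if abs(median(pooled[:36]) - median(pooled[36:])) >= observed:
        count += 1
print(count / trials)  # proportion of shufflings at least this extreme
```

Each pass of the loop is one "deal" of the cards; the printed proportion plays the role that the count of extreme deals played in the hand-done test.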
Almost any regression analysis is at least as computer-
intensive as most resampling methods.
A difference between resampling methods (as defined here)
and the jackknife and cross-validation is that though heavy use
of the computer may not be necessary in many problems to arrive
at acceptable resampling estimates, more intensive computing will
produce more precise estimates. This is not true of the
jackknife or cross-validation, which further distinguishes them
from the methods referred to here as resampling. If statistical
significance had been "in the cards" in the cigarette-tax case
above, a much larger number of trials could have been drawn in a
couple of hours.
Because flipping coins and taking samples of random numbers
with paper and pencil is cumbersome, and a nuisance after a
while, I developed the Resampling Stats language in 1973. It was
programmed in batch mode by Dan Weidenfeld for a mainframe (Simon
and Weidenfeld, 1974), then in interactive mode for the Apple
about 1980 by Derek Kumar, then for the IBM-PC starting in 1983,
and in 1991 for the Macintosh. Standard languages such as Basic,
or even languages written for the specific purpose of simulation
(except APL), do not allow the user to write a program which
closely resembles the operations one does by hand in resampling
simulations, as does Resampling Stats. Nor do conventional
statistical packages that provide a bootstrap option, such as
Minitab or RATS. The language and program are illustrated below.
Though the use of computers may not be crucial, there is no
doubt that the easy and cheap access to personal computers has
greatly advanced the use of resampling methods.
Some Non-Issues
Because of the identification of the bootstrap with the
whole of resampling on the part of many persons, it is worth
noting characteristics of the bootstrap that are not necessary
characteristics of resampling tests generally.
The bootstrap samples with replacement. But permutation
tests and other resampling tests such as some correlation and
matching tests sample without replacement (though correlation
tests may also be done with replacement if it is judged
appropriate). So the issue of replacement is not a defining
characteristic of resampling.
Efron wrote that "Originally I called the bootstrap
distribution the 'combination distribution'. That is because it
takes combinations of the original data rather than permutations.
There are no permutations to take in a one-sample problem."
(letter of April 26, 1984). This characteristic distinguishes
the bootstrap from permutation-test resampling in the line of
Dwass and of Chung and Fraser, and also from the one-sample
correlation problem for which I proposed a measure of association
(different from the correlation coefficient) which gets the job
done, yet is intuitive and requires no formalism to explain
(1969, examples 16-19, pp. 399-409), and is amenable to a
resampling test of significance. But this characteristic is
specific to the bootstrap and not to resampling at large.
Intended for Complex and Difficult Problems, versus For All
Problems
Here we come to the crucial distinction between the point of
view urged here and that of many other writers on resampling,
including Hope (as quoted earlier), Westfall and Young (1993),
and Hall (1992). The orientation away from routine problems also
is seen
in this quote from Mosteller: "It gives us another way to get
empirical information in circumstances that almost defy
mathematical analysis." (Kolata, 1988, p. C1). And though Efron's
primary illustration (1983) -- the law-school GPA and LSAT
correlation problem -- is well-handled by standard techniques,
the main (though perhaps not exclusive) purpose of Efron's
bootstrap seems to be to handle problems that are not easily
dealt with by standard techniques, e.g., "the bootstrap can
routinely answer questions which are far too complicated for
traditional statistical analysis" (Efron and Tibshirani, 1986, p.
54). And "...the new methods free the statistician to attack
more complicated problems, exploiting a wider array of
statistical tools" (Diaconis and Efron, 1983, p. 116). In this respect
Efron's focus is similar to that of the original Monte Carlo
simulations of probabilistic problems sufficiently difficult to
defy analytic solution, as noted earlier in connection with Ulam.
Most of the articles in the technical literature describe
advanced applications. (This is explainable in considerable
degree by the fact that the technical journals do not favor
simple applications or transparent and "obvious" ideas.)
In contrast, the point of view urged here is that resampling
provides a powerful tool that researchers and decision-makers
(rather than only statisticians) can use with relatively small
chance of error and with total understanding of the tool, in
contrast to Normal-distribution-based methods which are
understood down to the root by almost no users, no matter how
sophisticated. (Evidence for the statement: Ask a small sample
of users of statistics to write and interpret the formula for the
Gaussian distribution.)
Friedman expresses a similar view. "Eventually, it [he was
referring to the bootstrap, but by implication the comment refers
to all resampling] will take over the field, I think" (Kolata,
1988, p. C1).
One of the virtues of resampling is that it induces users to
invent their own methods. This does not imply keeping people in
ignorance of resampling (and other) methods that have been
invented by others, and surely the learning of that body of
experience will assist them in re-invention. What is sought is
that the user not simply choose among a set of pre-written
templates or formulas and then fill in the unknowns,
because that process is likely to result in an unsound choice of
method. (An additional benefit of re-invention as a method of
study is that people are particularly likely to remember what
they themselves actively invent.)
The true revolution connected with resampling, in my view,
is in the step away from any analytic device in handling a
particular set of data, away from "statistical measures whose
theoretical properties can be analyzed mathematically", as
Diaconis and Efron put it (1983, p. 116). The sample of
resampling methods in my 1969 text takes this step to its logical
extreme. The variety of methods was chosen to illustrate the
power and scope of the general method, and also to stake out the
ground for future discussion.
COMMENTS
1. Resampling methods are not always better than
other methods, nor are they always to be preferred; they can be
more subject to skewness than conventional tests, and there can
be so little information in a sample that adding assumptions
such as Normality may improve reliability.
Nevertheless, I suggest that one should think first of resampling
methods in all or most situations.
Furthermore, a resampling procedure may be the method of
choice even when a more efficient conventional test exists,
because of the higher likelihood that the wrong conventional test
will be used than the wrong resampling test. That is, the
likelihood of "Type 4 error" -- using the wrong test -- is lower
when the user is oriented to resampling, a consideration which I
consider to be of great importance. I urge that we think in
terms of a validity concept - perhaps it should be given a label
such as "statistical utility" - which takes into account the
likelihood that an appropriate test will be used, as well as the
efficiency of the test that is used (assuming it to be appropriate).[5]
The proper way to assess the statistical utility of resampling
versus other methods must be empirical inquiry rather than
esthetic taste cum analytic judgment. And the controlled tests
that bear upon the matter (see Simon, Atkinson, and Shevokas,
1976) find better results for resampling methods, even without
the use of computers.
The test of statistical utility should be with respect to
users, in my view, and not with respect to statisticians. The
notion that there is a skilled statistician with sound scientific
judgment at the elbow of every user, and therefore that the test
of statistical utility should be with respect to statisticians,
seems quite implausible. If others disagree, the matter could
easily be checked by examining a sample of scientific papers in
various disciplines.
2. Insight into the prospects for the promotion of
resampling as the tool of first recourse, in 1969 and now, can be
gained from Efron's remark: "I've taken a tremendous amount of
guff. Statisticians are hard to convince. They tend to be very
conservative in practice." Indeed, he found that resampling
methods met sheer disbelief at first. "When I presented it to
people they said it wouldn't work", says Efron. And even if
people accept its validity, they find reasons to reject it.
"Some said it was too simple. Others said it was too
complicated" (Science, p. 158). Fortunately for the field, he
persevered.
Another source of difficulty for resampling is the
fundamental attitude of the statistics profession toward non-
proof-based methods. As S. Stigler put the matter in a related
connection: "Within the context of post-Newtonian scientific
thought, the only acceptable grounds for the choice of an error
distribution were to show that the curve could be mathematically
derived from an acceptable set of first principles" (1986, p.
110). This may be related to Mosteller's comment quoted above
that the bootstrap (and presumably all resampling) is "anti-
intuitive."
CONCLUSIONS AND SUMMARY
There is solid agreement on the nature of the core
techniques of resampling - stochastic permutation tests and
bootstrap procedures. Both constitute a best-guess universe from
the observed data, and they differ only in whether or not the
drawings are replaced.
There is less agreement about whether such simulation
techniques as goodness-of-fit procedures should be considered
resampling. In my view, they have many crucial characteristics
in common with the core techniques - including the use of the
best-guess-universe concept - and they differ greatly from the
conventional methods in not calculating probabilities by way of
sample-space analysis. Hence they should be considered part of
the same extended set as the core techniques, I argue.
Resampling appropriately includes hypothesis-testing and
confidence intervals, as well as other devices such as
goodness-of-fit.
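Such a simulation goodness-of-fit test can be sketched briefly in Python; the fair-die problem, the data, and the sum-of-squared-deviations discrepancy measure are illustrative assumptions of mine, not from the text. One simulates the hypothesized universe and asks how often chance alone produces a discrepancy as large as the one observed:

```python
import random

# Observed counts for 60 rolls of a die; is it consistent with fairness?
observed = [6, 12, 7, 15, 9, 11]
n = sum(observed)
expected = n / 6

def discrepancy(counts):
    # Sum of squared deviations from the expected count per face.
    return sum((c - expected) ** 2 for c in counts)

obs_stat = discrepancy(observed)

# Simulate the hypothesized (fair-die) universe repeatedly, and tally
# how often its discrepancy matches or exceeds the observed one.
random.seed(3)
trials = 5_000
count = 0
for _ in range(trials):
    sim = [0] * 6
    for _ in range(n):
        sim[random.randrange(6)] += 1
    if discrepancy(sim) >= obs_stat:
        count += 1
print(count / trials)  # simulated p-value
```

No sample-space analysis is done; the probability comes entirely from repeated drawing, which is the shared characteristic argued for above.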
The literature has mostly addressed the use of resampling
methods when conventional methods are not available, either
because assumptions are not easily met or because the problems
are too complex for conventional methods. In contrast, I urge
that they should be the first alternative considered for all
problems in probabilistic statistics (and in probability as
well), though there are some problems for which resampling
methods are inferior to conventional methods. They are practical
tools for users of statistics who are not professional
statisticians, and who all too often fall into confusion and
frustration in using conventional methods which their intuition
cannot follow down to the foundations.
FOOTNOTES
[1]: An example of the continuing belief among many
statisticians that resampling methods should be used when closed-
form methods are not feasible, rather than being the tool of
first resort, may be found in a review by Leger et al.: "The
bootstrap should not be viewed as a replacement for mathematics,
for only with a sound theoretical foundation can resampling
methods be applied safely in practice." (1992, p. 396)
ENDNOTES
1. I am grateful to Peter Bruce for his excellent
suggestions and criticism of two previous drafts of this article.
2. John Pratt pointed out their work when I submitted an
article to JASA, which he was then editing.
3. One could draw only a sub-sample of jackknife
observations, with or without replacement, and consider the
result a resampling test, akin to the relationship between the
Dwass sampling procedure and the Fisher randomization test. But
though sampling is essential for the feasibility of the Fisher
test when the sample grows moderately large, even in these days of
cheap computation, this is not so for the jackknife because of
the much smaller number of possibilities in the complete set.
For other reasons to come, too, the jackknife is not in the
spirit of other tests labeled here as resampling. But the
inclusion or exclusion of the jackknife is not critical to the
discussion, and hence it would be best not to get caught up in
this matter.
4. Efron also uses the term "bootstrap" in fashions other
than the above definition from time to time. For
example, he writes of "bootstrapping the entire process of data
analysis" (1983 article with Diaconis), which suggests that he
identifies the term with all resampling methods including
permutation tests, etc. And in some places he refers to it as a
"method for assigning measures of accuracy to statistical
estimates" (Efron-Tibshirani, p. 10), while elsewhere he includes
hypothesis tests, so either there is no difference in his mind
between those two topics or his definition shifts from time to
time.
5. This point was stressed by Simon, Atkinson, and
Shevokas.
It must be emphasized that the Monte Carlo method as
described here really is intended as an alternative to
conventional analytic methods in actual problem-solving
practice. This method is not a pedagogical device for
improving the teaching of conventional methods. This
is quite different than the past use of the Monte Carlo
method to help teach sampling theory, the binomial
theorem and the central limit theorem. The point that
is usually hardest to convey to teachers of statistics
is that the method suggested here really is a complete
break with conventional thinking, rather than a
supplement to it or an aid in teaching it. That is,
the simple Monte Carlo method described here is
complete in itself for handling most -- perhaps all --
problems in probability and statistics (1976, p. 734,
second italics added here).
This does not include such matters as the design of
experiments and decision analysis. It also would be better to
use the term "problems in compound probability calculation". And
at that time we were not aware of some of the limitations of the
bootstrap and presumably of other resampling tests that have been
uncovered since then.