THE BASIC TECHNIQUES OF RESAMPLING

Julian L. Simon

John Arbuthnot, doctor to Queen Anne of England, began the publication of formal statistical inference in 1710. He observed that more boys than girls are born, which he assumed to be necessary for the survival of the species, and he wished to prove that birth sex is indeed not a 50-50 probability. The records for London showed that male births exceeded female births 82 years in a row. Arbuthnot therefore set forth to (in modern language) test the hypothesis that a universe with a 50-50 probability of producing males could result in 82 successive years of preponderantly male births.

This is a canonical problem. You have some observed "sample" data, and you want to connect them to some specified "population" from which they may have come. The previous sentence was purposely worded vaguely because statistical questions can be stated in many different ways. But in this case statisticians agree on how to proceed: Specify the universe, and compare its behavior against the observed sample. If it is unlikely that a sample as surprising as the observed sample would come from the specified universe, conclude that the sample did not come from that universe.

Arbuthnot used the multiplication rule of Pascal and Fermat to calculate that the probability of such a run - (1/2)^82 - is extremely small. "From whence it follows, that it is Art, not Chance, that governs" - that is, Divine Providence. (His argument is complex and debatable, as statistical inference often is; the mathematics is the easy part, especially when resampling methods are used.)

Please notice that Arbuthnot could have considered the numbers of boys and girls observed in each year, rather than treating each year as a single observation - an even stronger test because of the vastly greater amount of information. Arbuthnot surely did not analyze the data for any or all of the individual years because the calculus of probability was still in its infancy. Luckily, the test Arbuthnot made was more than powerful enough for his purposes. But if instead of 82 years in a row, only (say) 81 or 61 of the 82 years had shown a preponderance of males, Arbuthnot would have lacked the tools for a test (though he knew the binomial and logarithms).

Nowadays one conventionally uses the Gaussian (Normal) approximation to the binomial distribution to produce the desired probability. But that method requires acquaintance with a considerable body of statistical procedure, and it utilizes a formula that almost no one knows and even fewer can explain intuitively. Instead, users simply "plug in" the data to a table which, because it is an arcane mystery, invites misuse and erroneous conclusions.

The experimental resampling method of earlier gamblers could easily have given Arbuthnot a satisfactory answer for (say) 61 of 82 years, however. He had in fact likened the situation to a set of 82 coins. He could simply have tossed such a set repeatedly, and found that almost never would as many as 81 - or even 61 - heads occur. He could then have rested as secure in his conclusion as with the formulaic assessment of the probability of 82 years in a row. And because of the intuitive clarity of the experimental method, one would not be likely to make a misleading error in such a procedure.

By the grace of the computer, such problems can be handled more conveniently today. The self-explanatory commands in Illustration 3 suffice, using the language RESAMPLING STATS and producing the results shown there.
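(The Illustration 3 listing itself is not reproduced in this draft. As a hedged stand-in, here is a short Python sketch of the coin-tossing simulation just described; the 10,000-trial count and all names are illustrative assumptions, not Simon's RESAMPLING STATS program.)

```python
# Hypothetical Python stand-in for the Illustration 3 program:
# if birth sex were a fair coin, how often would at least 61 of
# 82 "years" come up preponderantly male?
import random

TRIALS = 10_000     # arbitrary number of repetitions (an assumption)
YEARS = 82          # years of London birth records
THRESHOLD = 61      # the weaker result discussed in the text

successes = 0
for _ in range(TRIALS):
    # Toss 82 fair "coins": 1 stands for a preponderantly male year.
    heads = sum(random.randint(0, 1) for _ in range(YEARS))
    if heads >= THRESHOLD:
        successes += 1

print(f"Estimated probability of {THRESHOLD} or more male years "
      f"out of {YEARS}: {successes / TRIALS}")
```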
Illustration 3

The intellectual advantage of the resampling method is that though it takes repeated samples from the sample space, it does not require that one know the size of the sample space or of a particular subset of it. To estimate the probability of getting (say) 61 males in 82 births with the binomial formula requires that one calculate the number of permutations of a total of 82 males and females, and the number of those permutations that include 61 or more males. In contrast, with a resampling approach one needs to know only the conditions for producing a single trial yielding a male or a female. This conceptual difference, which will be discussed at greater length below, is the reason that, compared to conventional methods, resampling is likely to have higher "statistical utility" - a compound of efficiency plus the chance that the ordinary scientist or decision-maker will use a correct procedure.

VARIETIES OF RESAMPLING METHODS

A resampling test may be constructed for every case of statistical inference - by definition. Every real-life situation can be modeled with symbols of some sort, and one may experiment with this model to obtain resampling trials. A resampling method should always be appropriate unless there are insufficient data to perform a useful resampling test, in which case a conventional test - which makes up for the absence of observations with an assumed theoretical distribution such as the Normal or Poisson - may produce more accurate results if the universe from which the data are drawn resembles the chosen theoretical distribution. Exploration of the properties of resampling tests is an active field of research at present.

For the main tasks in statistical inference - hypothesis testing and confidence intervals - the appropriate resampling test often is immediately obvious. For example, if one wishes to inquire whether baseball hitters exhibit behavior that fits the notion of a slump, one may simply produce hits and outs with a random-number generator adjusted to the batting average of a player, and then compare the number of simulated consecutive sequences of either hits or outs with the observed numbers for the player (a code sketch of such a slump simulation appears below, after the introduction to the liquor-price study). The procedure is also straightforward for such binomial situations as the Arbuthnot birth-sex case.

Two sorts of procedures are especially well-suited to resampling:

1) A sample of the permutations in Fisher's "exact" test (confusingly, also called a "randomization" test). This is appropriate when the size of the universe is properly assumed to be fixed, as discussed below.

2) The bootstrap procedure. This is appropriate when the size of the universe is properly assumed not to be fixed.

Let's compare the permutation and bootstrap procedures in the context of a case that might be analyzed either way. The discussion will highlight some of the violent disagreements in the philosophy of statistics that the use of resampling methods frequently brings to the surface - one of its great benefits.

In the 1960s I studied the price of liquor in the sixteen "monopoly" states (where the state government owns the retail liquor stores) compared to the twenty-six states in which retail liquor stores are privately owned. (Some states were omitted for technical reasons. The situation and the price pattern have changed radically since then.)
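Before the price data, here is the slump sketch promised above - a hedged Python illustration in which the batting average, the number of at-bats, and the observed run count are all made-up values for illustration, not data from the text:

```python
# Hypothetical sketch of the slump test: simulate a season of hits
# and outs for a .300 hitter and count the runs (maximal consecutive
# sequences of hits or of outs). A real slump tendency would show up
# as fewer runs than chance alone produces.
import random

def count_runs(outcomes):
    """Number of maximal consecutive sequences of identical outcomes."""
    runs = 1
    for prev, cur in zip(outcomes, outcomes[1:]):
        if cur != prev:
            runs += 1
    return runs

AVERAGE = 0.300      # assumed batting average
AT_BATS = 500        # assumed at-bats in a season
TRIALS = 10_000

run_counts = []
for _ in range(TRIALS):
    season = [random.random() < AVERAGE for _ in range(AT_BATS)]
    run_counts.append(count_runs(season))

OBSERVED_RUNS = 280  # made-up observed value for illustration
fewer = sum(1 for r in run_counts if r <= OBSERVED_RUNS)
print(f"Proportion of chance seasons with {OBSERVED_RUNS} or fewer "
      f"runs: {fewer / TRIALS}")
```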
These were the representative 1961 prices of a fifth of Seagram 7 Crown whiskey in the two sets of states:

16 monopoly states: $4.65, $4.55, $4.11, $4.15, $4.20, $4.55, $3.80, $4.00, $4.19, $4.75, $4.74, $4.50, $4.10, $4.00, $5.05, $4.20

26 private-ownership states: $4.82, $5.29, $4.89, $4.95, $4.55, $4.90, $5.25, $5.30, $4.29, $4.85, $4.54, $4.75, $4.85, $4.85, $4.50, $4.75, $4.79, $4.85, $4.79, $4.95, $4.95, $4.75, $5.20, $5.10, $4.80, $4.29

The economic question that underlay the investigation - having both theoretical and policy ramifications - is as follows: Does state ownership affect prices? The empirical question is whether the prices in the two sets of states were systematically different. In statistical terms, we wish to test the hypothesis that there was a difference between the groups of states related to their mode of liquor distribution, against the possibility that the observed $.49 differential in means might well have occurred by happenstance. In other words, we want to know whether the two sub-groups of states differed systematically in their liquor prices, or whether the observed pattern could well have been produced by chance variability.

At first I used a resampling permutation test, as follows: Assuming that the entire universe of possible prices consists of the set of events that were observed, because that is all the information available about the universe, I wrote each of the forty-two observed state prices on a separate card. The shuffled deck simulated a situation in which each state has an equal chance for each price. On the "null hypothesis" that the two groups' prices do not reflect different price-setting mechanisms, but rather differ only by chance, I then examined how often that simulated universe stochastically produces groups with results as different as those observed in 1961. I repeatedly dealt groups of 16 and 26 cards, without replacing the cards, to simulate hypothetical monopoly-state and private-state samples, each time calculating the difference in mean prices.

The probability that the benchmark null-hypothesis universe would produce a difference between groups as large as or larger than observed in 1961 is estimated by how frequently the mean of the group of sixteen randomly-chosen prices from the simulated state-ownership universe is less than (or equal to) the mean of the actual sixteen state-ownership prices. If the simulated difference between the randomly-chosen groups were frequently equal to or greater than the difference observed in 1961, one would not conclude that the observed difference was due to the type of retailing system, because it could well have been due to chance variation.

The computer program in Illustration 4, using the language RESAMPLING STATS, performs the operations described above (MATHEMATICA and APL could be used in much the same fashion).

Illustration 4

The results shown - not even one "success" in 10,000 trials - imply a very small probability that two groups with mean prices as different as those observed would arise by chance if drawn from the universe of 42 observed prices. So we "reject the null hypothesis" and instead find persuasive the proposition that the type of liquor distribution system influences the prices that consumers pay. As I shall discuss later, the logical framework of this resampling version of the permutation test differs greatly from that of the formulaic version, which would have required heavy computation.
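(As with Illustration 3, the Illustration 4 listing is not reproduced in this draft. A hedged Python sketch of the shuffle test just described - the trial count and names again being illustrative choices rather than Simon's program - might run as follows.)

```python
# Hypothetical Python stand-in for the Illustration 4 program:
# shuffle all 42 observed prices, deal groups of 16 and 26 without
# replacement, and see how often the shuffled group means differ by
# as much as the observed $.49.
import random

monopoly = [4.65, 4.55, 4.11, 4.15, 4.20, 4.55, 3.80, 4.00,
            4.19, 4.75, 4.74, 4.50, 4.10, 4.00, 5.05, 4.20]
private = [4.82, 5.29, 4.89, 4.95, 4.55, 4.90, 5.25, 5.30, 4.29,
           4.85, 4.54, 4.75, 4.85, 4.85, 4.50, 4.75, 4.79, 4.85,
           4.79, 4.95, 4.95, 4.75, 5.20, 5.10, 4.80, 4.29]

def mean(xs):
    return sum(xs) / len(xs)

observed_diff = mean(private) - mean(monopoly)   # about $.49

prices = monopoly + private
TRIALS = 10_000
successes = 0
for _ in range(TRIALS):
    random.shuffle(prices)                       # the "shuffled deck"
    group_16, group_26 = prices[:16], prices[16:]
    if mean(group_26) - mean(group_16) >= observed_diff:
        successes += 1

print(f"Trials as extreme as observed: {successes} of {TRIALS}")
```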
The standard conventional alternative would be a Student's t-test, in which the user simply plugs the data into an unintuitive formula and table. And because of the unequal numbers of cases and unequal dispersions in the two samples, an appropriate t-test is far from obvious, whereas resampling is not made more difficult by such realistic complications.

Recently I have concluded that a bootstrap-type test has better theoretical justification than a permutation test in this case, though the two reach almost identical results with a sample this large. The following discussion of which is the more appropriate brings out the underlying natures of the two approaches, and illustrates how resampling raises issues that tend to be buried amidst the technical complexity of the formulaic methods, and hence are seldom discussed in print.

Imagine a class of 42 students, 16 men and 26 women, who come into the room and sit in 42 fixed seats. We measure the distance of each seat to the lecturer, and assign each a rank. The women sit in ranks 1-5, 7-20, etc., and the men in ranks 6, 22, 25-26, etc. You ask: Is there a relationship between sex and ranked distance from the front? Here the permutation procedure that resamples without replacement - as used above with the state liquor prices - quite clearly is appropriate.

Now, what if we work with actual distances from the front? If there are only 42 seats and they are fixed, the permutation test and sampling without replacement again are appropriate. But what if the seats are movable? Consider the possible situation in which one student can choose a position without reference to the others. That is, if the seats are movable, it is not only imaginable that A would be sitting where B now is, with B in A's present seat - as was the case with the fixed chairs - but A could now change distance from the lecturer while all the others remain where they are. Sampling with replacement now is appropriate. (To use a technical term, the cardinal data provide more actual degrees of freedom - more information - than do the ranks.)

Note that (as with the liquor prices) the seat distances do not comprise an infinite population. Rather, we are inquiring whether a) the universe should be considered limited to a given number of elements, or b) it could be considered expandable without change in the probabilities; the latter is a useful definition of "sampling with replacement".

As of 1992, the U.S. state liquor systems seem to me to resemble a non-fixed universe (like non-fixed chairs) even though the actual number of states is presently fixed. The question the research asked was whether the liquor system affects the price of liquor. We can imagine another state being admitted to the union, or one of the existing states changing its system, and pondering how the choice of system will affect the price. And there is no reason to believe that (at least in the short run) the newly-made choice of system would affect the other states' pricing; hence it makes sense to sample with replacement (and use the bootstrap) even though the number of states clearly is not infinite or greatly expandable.

In short, the presence of interaction - a change in one entity causing another entity also to change - implies a finite universe composed of those elements, and the use of a permutation test. Conversely, when one entity can change independently of the others, an infinite universe is implied, and sampling with replacement - the bootstrap test - is indicated.
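To make the with-replacement alternative concrete, here is a hedged Python variant of the earlier permutation sketch; it draws each simulated group with replacement from the pooled 42 prices, as the bootstrap logic just described suggests. It is an illustrative sketch under those assumptions, not a listing from the book:

```python
# Bootstrap (with-replacement) variant of the liquor test: instead of
# dealing from a fixed 42-card deck, each simulated price is drawn
# independently from the pooled prices, as if the universe of prices
# were expandable.
import random

monopoly = [4.65, 4.55, 4.11, 4.15, 4.20, 4.55, 3.80, 4.00,
            4.19, 4.75, 4.74, 4.50, 4.10, 4.00, 5.05, 4.20]
private = [4.82, 5.29, 4.89, 4.95, 4.55, 4.90, 5.25, 5.30, 4.29,
           4.85, 4.54, 4.75, 4.85, 4.85, 4.50, 4.75, 4.79, 4.85,
           4.79, 4.95, 4.95, 4.75, 5.20, 5.10, 4.80, 4.29]
prices = monopoly + private

def mean(xs):
    return sum(xs) / len(xs)

observed_diff = mean(private) - mean(monopoly)

TRIALS = 10_000
successes = 0
for _ in range(TRIALS):
    group_16 = random.choices(prices, k=16)   # with replacement
    group_26 = random.choices(prices, k=26)
    if mean(group_26) - mean(group_16) >= observed_diff:
        successes += 1

print(f"Bootstrap trials as extreme as observed: {successes} of {TRIALS}")
```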
A program to handle the liquor problem with an infinite-universe bootstrap distribution simply substitutes the random sampling command GENERATE for the TAKE command in Illustration 4. The results of the new test are indistinguishable from those in Illustration 4.

Confidence Intervals

So far we have discussed the interpretation of sample data for testing hypotheses. The devices used for the other main theme in statistical inference - the estimation of confidence intervals - are much the same as those used for testing hypotheses. Indeed, the bootstrap method discussed above was originally devised for the estimation of confidence intervals. The bootstrap method may also be used to calculate the appropriate sample size for experiments and surveys, another important topic in statistics.

OTHER RESAMPLING TECHNIQUES

We have so far seen examples of three of the most common resampling methods - binomial, permutation, and bootstrap. These methods may be extended to handle correlation, regression, and tests where there are three or more groups. Indeed, resampling can be used for any other statistic in which one may be interested - for example, statistics based on absolute deviations rather than squared deviations. This flexibility is a great virtue because it frees the statistics user from the limited and oft-confining battery of textbook methods.

ON THE NATURE OF RESAMPLING TESTS

As will be discussed at more length in Chapter 00, resampling is a much simpler intellectual task than the formulaic method because simulation obviates the need to calculate the number of points in the entire sample space. In all but the most elementary problems, where simple permutations and combinations suffice, those calculations require advanced training and delicate judgment. Resampling avoids the complex abstraction of sample-space calculations by substituting particular information - learned from the actual circumstances - about how the elements of the sample are randomly generated in a specific event; the analytic method does not use this information. In the case of the gamblers prior to Galileo, resampling used the (assumed) fact that three fair dice are thrown with an equal chance of each outcome, and the gamblers took advantage of experience with many such events performed one at a time; in contrast, Galileo made no use of the actual stochastic element of the situation, and gained no information from a sample of such trials, but rather replaced all possible sequences by exhaustive computation.

The analytic method for obtaining solutions - using permutation and combination formulas, for example - is not theoretically superior to resampling. Resampling is not "just" a stochastic-simulation approximation to the formulaic method. It is a quite different route to the same endpoint, using different intellectual processes and different sorts of inputs; both resampling and formulaic calculation are shortcuts to estimation of the sample space and its partitions. The much lesser degree of intellectual difficulty is the source of the central advantage of resampling: it improves the probability that the user will arrive at a sound solution to a problem - the ultimate criterion for all except pure mathematicians.

A common objection is that resampling is not "exact" because the results are "only" a sample. Ironically, the basis of all statistics is sample data drawn from actual populations.
Statisticians have only recently managed to win most of their battles against those bureaucrats and social scientists who, out of ignorance of statistics, believed that only a complete census of a country's population, or an examination of every volume in a library, could give satisfactory information about unemployment rates or book sizes. Indeed, samples are sometimes even more accurate than censuses. Yet many of those same statisticians have been skittish about simulated samples of data points taken from the sample space - samples drawn far more randomly than the actual data themselves ever are. They tend to want a complete "census" of the sample space, even when sampling is more likely to arrive at a correct answer because it is intellectually simpler (as with the gamblers and Galileo).

If there is legitimate concern about whether there are enough repetitions in a resampling procedure, the matter can be handled in exactly the same fashion as sample size is handled with respect to the actual data. One may compute the amount of error associated with various numbers of repetitions. And at very low cost in computer time this error may be reduced until it is vanishingly small compared with the sampling error associated with the actual sampling process. (Research on how to do this precisely is needed, however.)
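As a hedged illustration of that last point: the usual binomial approximation puts the standard error of a probability estimated from n resampling repetitions at roughly sqrt(p(1-p)/n), so the repetition error shrinks with the square root of the number of trials. A minimal Python sketch, assuming that approximation applies:

```python
# Approximate standard error of an estimated probability p after
# n resampling repetitions, via the binomial formula sqrt(p(1-p)/n).
import math

def resampling_standard_error(p, n):
    """Rough standard error of a proportion estimated from n trials."""
    return math.sqrt(p * (1 - p) / n)

# The error for an estimated p of 0.05 falls as repetitions increase.
for n in (1_000, 10_000, 100_000):
    print(n, round(resampling_standard_error(0.05, n), 4))
```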