Produces bootstrapped estimates, standard errors and distributions (P.W. Lane).
Options
| PRINT = string token | Controls printed output (estimates, graphs, vcovariance); default esti |
|---|---|
| DATA = variates, factors or texts | Data vectors from which the statistics are to be calculated; no default |
| AUXILIARY = pointers | Further sets of data vectors, each set to be resampled independently |
| ANCILLARY = any type | Other relevant information needed to calculate the statistics |
| NTIMES = scalar | Number of times to resample; default 100 |
| SEED = scalar | Seed for random number generator; default continue from previous generation or use system clock |
| GRAPHICS = string token | Type of graphics (lineprinter, highresolution); default high |
| PROBABILITY = scalar | Probability level for confidence interval; default 0.95 |
| METHOD = string token | What type of bootstrapping to use (random, balance, permute); default rand |
| BLOCKSTRUCTURE = formula | Block structure to use for random permutations |
| CIMETHOD = string token | What type of confidence intervals to provide (bca, percentile); default perc |
| VCOVARIANCE = symmetric matrix | Saves the variance-covariance matrix of the statistics |
Parameters
| LABEL = texts | Texts, each containing a single line, to label the statistics; default 'Statistic' |
|---|---|
| ESTIMATE = scalars | Saves the bootstrap mean for each statistic |
| SE = scalars | Saves the bootstrap standard error for each statistic |
| LOWER = scalars | Saves the bootstrap lower confidence limit for each statistic |
| UPPER = scalars | Saves the bootstrap upper confidence limit for each statistic |
| STATISTIC = variates | Saves the series of bootstrap estimates of each statistic |
| WINDOW = scalars | Graphical window to use for displaying bootstrap distribution for each statistic; default 4 |
| SCREEN = string tokens | Whether to clear graphical frame or draw on top (clear, keep); default clea |
Description
The bootstrap is a method of providing distributional information, such as standard errors, about statistical estimates – without making precise distributional assumptions about the data. It can also provide estimates with reduced bias. This is achieved by “resampling” from the data; that is, generating new data sets by sampling with replacement from the data set being investigated. A good introduction to the bootstrap is given by Efron & Tibshirani (1986); a fuller treatment can be found in Efron & Tibshirani (1993).
The BOOTSTRAP procedure can be used for any statistic or set of statistics that can be calculated by Genstat from one or more data matrices. You need to provide a procedure called RESAMPLE which calculates the statistics from the data, as explained in the Method section. There are also several examples of RESAMPLE in the standard examples, which can be extracted by the commands:
LIBEXAMPLE 'BOOTSTRAP'; EXAMPLE=Ex
PRINT Ex; JUSTIF=left
The options and parameters of RESAMPLE must not be changed. The body of the procedure should store the required statistics in scalars called STATISTIC[1...s], calculating them from the variates, factors and texts called DATA[1...d], where each of s and d can be any positive integer. The EXIT parameter of RESAMPLE should be set to indicate when any of the calculations fail, as can sometimes happen if degenerate data sets are generated (see Example 3).
The data for BOOTSTRAP are provided as a list of vectors (variates, factors or texts) using the DATA option. From this, the procedure will generate new data by resampling from the set of units: all the vectors must have the same length, and each new sample uses the same set of units for all vectors. The procedure RESAMPLE is then called to calculate the statistics.
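The unit-wise resampling described above can be sketched outside Genstat. The following Python fragment is only an illustration under stated assumptions: the function name `resample_units` is hypothetical, the values are the first five pairs of the law-school data used in Example 1, and Python's generator will not reproduce Genstat's random numbers.

```python
import random

def resample_units(vectors, rng):
    """Draw one bootstrap sample; the same unit numbers index every vector."""
    n = len(vectors[0])
    if any(len(v) != n for v in vectors):
        raise ValueError("all data vectors must have the same length")
    # Sample unit numbers with replacement, then apply them to each vector,
    # so that values belonging to one unit stay together.
    units = [rng.randrange(n) for _ in range(n)]
    return [[v[i] for i in units] for v in vectors]

rng = random.Random(77320)
y = [576, 635, 558, 578, 666]          # first five values of Y in Example 1
z = [3.39, 3.30, 2.81, 3.03, 3.44]     # first five values of Z in Example 1
y_star, z_star = resample_units([y, z], rng)
```

Because both vectors are indexed by the same unit numbers, every (y, z) pair in the new sample is a pair that occurred in the original data.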
Extra information required in procedure RESAMPLE to calculate the statistics, which is not to be resampled along with the data matrix, can be passed as a list of data structures using the ANCILLARY option of BOOTSTRAP (see Examples 2 and 3).
The procedure can also deal with statistics calculated from several independent data matrices. For example, the difference in means between two independent samples must be dealt with by resampling independently from each sample, which may have different numbers of observations. In this case, one data matrix is specified as a list of vectors using the DATA option as usual, and the second data matrix is specified as a pointer using the AUXILIARY option. This option may be set to any number of pointers, each storing a list of vectors; resampling is done independently for each set of vectors (see Example 4).
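As a rough illustration of independent resampling from two samples of different sizes, here is a Python sketch (not Genstat code). The names `resample` and `bootstrap_median_difference` are hypothetical, and the short lists are illustrative values rather than the full volcano data of Example 4.

```python
import random
import statistics

def resample(sample, rng):
    """Sample with replacement, keeping the original sample size."""
    return [rng.choice(sample) for _ in sample]

def bootstrap_median_difference(data, auxiliary, ntimes, rng):
    """Resample each set independently and record the difference in medians."""
    diffs = []
    for _ in range(ntimes):
        d = resample(data, rng)        # plays the role of the DATA vectors
        a = resample(auxiliary, rng)   # plays the role of one AUXILIARY set
        diffs.append(statistics.median(d) - statistics.median(a))
    return diffs

rng = random.Random(1)
first = [130, 126, 124, 124, 113, 89]   # illustrative values only
second = [156, 125, 122, 120]           # a different number of observations
diffs = bootstrap_median_difference(first, second, 200, rng)
```

Each set keeps its own size in every replicate, which is why the two samples cannot simply be stacked and resampled as one data matrix.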
The option NTIMES specifies how many times the resampling is carried out. The default value is 100, which has been found by many users of the bootstrap to be sufficient for producing standard errors and bias-reduced estimates. However, the number should be increased to get reliable distributional information: 1000 or more may be needed for reliable 95% confidence limits.
Printed output is controlled by the PRINT option, with settings estimates for the estimates and their standard errors and confidence limits, and vcovariance for the variance-covariance matrix. The graphs setting draws a histogram of each bootstrap distribution. The default setting is just estimates.
A label should be provided for each statistic, using the LABEL parameter; by default, bootstrapping will be done for a single statistic which will be labelled simply as Statistic. The estimates and their standard errors can be saved by the ESTIMATE and SE parameters. Also, a variance-covariance matrix of the estimates can be saved using the VCOVARIANCE option. The number of labels, s say, must match the number of statistics, called STATISTIC[1...s], calculated in your version of the RESAMPLE procedure.
The parameters LOWER and UPPER allow confidence limits for each statistic to be saved, with the probability level specified in the PROBABILITY option (default 0.95, i.e. 95% confidence intervals). By default the intervals are constructed as percentiles of the empirical distribution of the bootstrap estimates. However, provided there are no auxiliary data vectors, you can request bias-corrected and accelerated limits instead by setting option CIMETHOD=bca (see Efron & Tibshirani, 1993, Section 14.3). The full sets of bootstrap estimates can be saved by setting the STATISTIC parameter; each variate will contain n values, where n is the setting of the NTIMES option.
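The default percentile interval can be sketched in Python as follows. This is an assumption-laden illustration: the function name `percentile_limits` is hypothetical, and the linear interpolation between order statistics is just one common quantile convention, which may differ in detail from the quantiles that Genstat computes.

```python
import math

def percentile_limits(estimates, probability=0.95):
    """Lower and upper limits taken as quantiles of the bootstrap estimates."""
    ordered = sorted(estimates)
    n = len(ordered)
    alpha = (1.0 - probability) / 2.0   # tail probability on each side

    def quantile(p):
        # Linear interpolation between neighbouring order statistics
        # (an assumed convention, not necessarily Genstat's).
        pos = p * (n - 1)
        lo = math.floor(pos)
        hi = min(lo + 1, n - 1)
        return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

    return quantile(alpha), quantile(1.0 - alpha)

# With the integers 0..100 standing in for bootstrap estimates, the 90%
# limits fall close to 5 and 95.
lower, upper = percentile_limits(list(range(101)), probability=0.90)
```

With only 100 resamples the 2.5% and 97.5% points rest on very few values in each tail, which is consistent with the advice to use 1000 or more resamples for 95% limits.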
Three methods of bootstrapping are provided. By default, resampling is completely pseudo-random, using Genstat’s random-number generator. The generator can be initialized by setting option SEED, thereby producing reproducible results; otherwise, generation continues from the previous sequence or, if there is none, is initialized from the system clock. The second method is balanced bootstrapping, requested by setting METHOD=balance. In this case, the resampling is constrained to ensure that each unit of the data matrix occurs the same number of times in the complete set of generated samples (see Examples 3 and 4). The third method, specified by METHOD=permute, is simply to permute the units of the data matrix. Note that this method gives no variation in results if the statistics are independent of the order of the data, like the sample mean. However, this method provides permutation tests, a type of randomization test that can be applied to grouped data (see Example 4). When METHOD=permute, you can set the BLOCKSTRUCTURE option to a model formula to define how the randomization is to be done (see the RANDOMIZE directive for details).
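The remark about permutation can be illustrated with a small Python sketch (not Genstat code). The function name `permutation_distribution`, the two-group layout and the data values are all hypothetical; the point is only that a statistic depending on which values fall in which group varies under permutation, whereas the overall mean does not.

```python
import random
import statistics

def permutation_distribution(values, groups, ntimes, rng):
    """Permute values against fixed group labels; record the mean difference."""
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two groups")
    results = []
    for _ in range(ntimes):
        shuffled = list(values)
        rng.shuffle(shuffled)
        first = [v for v, g in zip(shuffled, groups) if g == labels[0]]
        second = [v for v, g in zip(shuffled, groups) if g == labels[1]]
        results.append(statistics.mean(first) - statistics.mean(second))
    return results

rng = random.Random(46921)
values = [12, 15, 11, 19, 22, 25]            # illustrative values only
groups = ["A", "A", "A", "B", "B", "B"]
diffs = permutation_distribution(values, groups, 500, rng)

# The overall mean ignores the order of the data, so permuting cannot change it.
shuffled = list(values)
rng.shuffle(shuffled)
same_mean = statistics.mean(shuffled) == statistics.mean(values)
```

Comparing the observed group difference with the spread of `diffs` is the essence of a permutation test.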
If the graphs setting of the PRINT option is used, the procedure will display the distribution of each set of bootstrap estimates as a histogram. By default, this will be a high-resolution plot on the current device, but the GRAPHICS option can be set to lineprinter to produce a line-printer histogram. In a high-resolution plot, the histogram is enhanced with a smoothed line, giving a clearer indication of the distribution of the statistic. By default, the display for the statistics will appear in graphical window 4, one at a time (this window is set by default to fill the whole graphical frame). But the WINDOW and SCREEN parameters can be set to arrange for concurrent displays of the statistics in differently sized windows.
Options: PRINT, DATA, AUXILIARY, ANCILLARY, NTIMES, SEED, GRAPHICS, PROBABILITY, METHOD, BLOCKSTRUCTURE, CIMETHOD, VCOVARIANCE.
Parameters: LABEL, ESTIMATE, SE, LOWER, UPPER, STATISTIC, WINDOW, SCREEN.
Method
Samples are generated by scaling uniform random numbers produced by the URAND function. For the balanced bootstrap, a list of repeated unit numbers is sorted into random order and used one block at a time. For the permutation test, the RANDOMIZE directive is used to re-order the data at random.
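The balanced scheme described above (repeat the unit numbers, put the list into random order, then use it one block at a time) can be sketched in Python. The function name `balanced_bootstrap_indices` is hypothetical and the sketch mirrors the description only; it is not a transcription of the Genstat source.

```python
import random
from collections import Counter

def balanced_bootstrap_indices(n_units, ntimes, rng):
    """Return ntimes index lists in which each unit occurs ntimes in total."""
    # Repeat every unit number ntimes, then put the whole list in random order.
    pool = list(range(n_units)) * ntimes
    rng.shuffle(pool)
    # Use the shuffled list one block of n_units at a time.
    return [pool[k * n_units:(k + 1) * n_units] for k in range(ntimes)]

rng = random.Random(23845)
samples = balanced_bootstrap_indices(n_units=6, ntimes=10, rng=rng)
counts = Counter(i for sample in samples for i in sample)
```

Individual samples can still contain repeated or missing units; the balance holds only across the complete set, which is what `counts` records.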
BOOTSTRAP needs a subsidiary procedure RESAMPLE to calculate the statistics of interest. RESAMPLE has an option, DATA, which is used to supply the data vectors (variates, factors or texts) from which the statistics are to be calculated. Other relevant information can be supplied through the AUXILIARY and ANCILLARY options, which correspond to the AUXILIARY and ANCILLARY options of BOOTSTRAP itself. There are two parameters: STATISTIC supplies a list of scalars to store the estimates of each statistic, and EXIT supplies a list of scalars, each of which should be set to zero if the corresponding statistic was estimated successfully from the supplied data vectors, or to one if it could not be. If the value of EXIT is not calculated in RESAMPLE, the BOOTSTRAP procedure assumes that the calculations succeeded.
This example shows a version of RESAMPLE which calculates the correlation between two variates.
PROCEDURE [PARAMETER=pointer] 'RESAMPLE'
  OPTION 'DATA',      " (I: variates, factors or texts) data
                        vectors from which to calculate
                        the statistics; no default"\
         'AUXILIARY', " (I: pointers) auxiliary sets of data
                        vectors, each of which is to be
                        resampled independently"\
         'ANCILLARY'; " (I: any type of structure) other
                        relevant information needed to
                        calculate the statistics "\
         MODE=p; TYPE=!t(variate,factor,text),'pointer',*;\
         SET=yes,no,no; LIST=yes; DECLARED=yes; PRESENT=yes
  PARAMETER 'STATISTIC', " (O: scalars) to save the calculated
                           statistics "\
            'EXIT';      " (O: scalars) to save an exit code
                           to indicate failure (EXIT[i]=1) or
                           success (EXIT[i]=0) when calculating
                           each STATISTIC[i]"\
            MODE=p; TYPE='scalar'; SET=yes
  CALC STATISTIC[1] = CORRELATION(DATA[1]; DATA[2])
  &    EXIT[1] = STATISTIC[1]==C('missing')
ENDPROCEDURE
VARIATE [VALUES=576,635,558,578,666,580,555,661,651,605,\
         653,575,545,572,594] Y
&       [VALUES=3.39,3.30,2.81,3.03,3.44,3.07,3.00,3.43,3.36,3.13,\
         3.12,2.74,2.76,2.88,2.96] Z
BOOTSTRAP [DATA=Y,Z; SEED=77320] 'Correlation'
The RESAMPLE procedure is called within a loop, and the statistics that are returned are loaded into variates. If any statistics fail to be calculated, as recorded by the EXIT parameter of RESAMPLE, they are stored as missing values. BOOTSTRAP will then base its estimation on the successful generations, but reports how many failures occurred.
The bootstrap estimates are formed as simple means of the stored variates, and the s.e.s are square roots of the sample variance. The TABULATE directive is used to estimate quantiles from the stored variates, to define confidence limits. The variance-covariance matrix is formed from the statistics using the FSSPM directive.
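The summary step can be sketched in Python for a single statistic. The names are hypothetical, `None` stands in for a Genstat missing value, and the divisor n-1 for the sample variance is an assumption about what "sample variance" means here.

```python
import math

def summarize(estimates):
    """Bootstrap mean, standard error and failure count for one statistic."""
    ok = [e for e in estimates if e is not None]   # successful generations only
    failures = len(estimates) - len(ok)
    mean = sum(ok) / len(ok)
    # Sample variance with divisor n-1 (assumed), then its square root.
    variance = sum((e - mean) ** 2 for e in ok) / (len(ok) - 1)
    return mean, math.sqrt(variance), failures

mean, se, failures = summarize([1.0, 2.0, None, 3.0])
```

Here the failed generation is dropped from the calculations but still counted, in the spirit of the failure report described above.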
The graphical representation uses DHISTOGRAM or HISTOGRAM on the stored variates. The smoothed curves are calculated from the transformed percentages from the histogram: LOGIT(CUM(%)). A smoothing spline is fitted on this scale, by the FIT directive with the SSPLINE function, using 4 d.f. The resulting fitted values are then backtransformed and drawn on the plot with the monotonic setting of the PEN directive.
Action with RESTRICT
If any of the data vectors is restricted, BOOTSTRAP will use only the units that are not restricted for any of the vectors. The data vectors that are passed to the RESAMPLE procedure are all restricted to this identified set of units, but otherwise match the original data vectors. Each set of vectors supplied in pointers in the AUXILIARY option is treated separately in this way.
References
Efron, B. & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1, 54-77.
Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall, London.
See also
Procedures: JACKKNIFE, APERMTEST, CHIPERMTEST, HBOOTSTRAP, RPERMTEST.
Example
CAPTION 'BOOTSTRAP example','1) Bootstrapped correlation.',\
  !t('The data are scores from two tests on new admissions to Law School ',\
  '(Efron, 1981, The Jackknife, the Bootstrap & Other Resampling Plans.',\
  'CBMS Monograph 38, SIAM, Philadelphia); listed in Table 1 of Hinkley',\
  '(1983), Encyclopedia of Statistics, Volume 4, page 282.');\
  STYLE=meta,plain,plain

" Define RESAMPLE to calculate the correlation between the two scores."
PROCEDURE [PARAMETER=pointer] 'RESAMPLE'
  OPTION 'DATA',      " (I: variates, factors or texts) data vectors from
                        which to calculate the statistics; no default"\
         'AUXILIARY', " (I: pointers) auxiliary sets of data vectors, each
                        of which is to be resampled independently"\
         'ANCILLARY'; " (I: any type of structure) other relevant
                        information needed to calculate the statistics "\
         MODE=p; TYPE=!t(variate,factor,text),'pointer',*; SET=yes,no,no;\
         LIST=yes; DECLARED=yes; PRESENT=yes
  PARAMETER 'STATISTIC', " (O: scalars) to save the calculated statistics "\
            'EXIT';      " (O: scalars) to save an exit code to indicate
                           failure (EXIT[i]=1) or success (EXIT[i]=0) when
                           calculating each STATISTIC[i]"\
            MODE=p; TYPE='scalar'; SET=yes
  CALC STATISTIC[1] = CORRELATION(DATA[1]; DATA[2])
  &    EXIT[1] = STATISTIC[1]==C('missing')
ENDPROCEDURE

VARIATE [VALUES=576,635,558,578,666,580,555,661,651,605,653,575,545,572,594] Y
&       [VALUES=3.39,3.30,2.81,3.03,3.44,3.07,3.00,3.43,3.36,3.13,\
         3.12,2.74,2.76,2.88,2.96] Z
BOOTSTRAP [DATA=Y,Z; SEED=77320] 'Correlation'

CAPTION '2) A permutation test.',\
  !t('Five wines are tested in a completely randomized design for',\
  'alcohol content. The variance-ratio for the treatment effect is',\
  'estimated by resampling with random permutation of the observations.')

" Re-define RESAMPLE to calculate the ratio. The treatment factor must be
  passed to the procedure via the ANCILLARY option so that AKEEP can
  extract the treatment sum of squares."
PROCEDURE [PARAMETER=pointer] 'RESAMPLE'
  OPTION 'DATA',      " (I: variates, factors or texts) data vectors from
                        which to calculate the statistics; no default"\
         'AUXILIARY', " (I: pointers) auxiliary sets of data vectors, each
                        of which is to be resampled independently"\
         'ANCILLARY'; " (I: any type of structure) other relevant
                        information needed to calculate the statistics "\
         MODE=p; TYPE=!t(variate,factor,text),'pointer',*; SET=yes,no,no;\
         LIST=yes; DECLARED=yes; PRESENT=yes
  PARAMETER 'STATISTIC', " (O: scalars) to save the calculated statistics "\
            'EXIT';      " (O: scalars) to save an exit code to indicate
                           failure (EXIT[i]=1) or success (EXIT[i]=0) when
                           calculating each STATISTIC[i]"\
            MODE=p; TYPE='scalar'; SET=yes
  ANOVA [PRINT=*] DATA[1]
  AKEEP TERMS=ANCILLARY[1],'*Units*'; SS=sstreat,ssresid; DF=dftreat,dfresid
  CALC STATISTIC[1] = (sstreat/dftreat)/(ssresid/dfresid)
ENDPROCEDURE

FACTOR Wine
READ Wine,%Alcohol; FREP=labels
E 4.931  D 7.263  A 4.857  C 3.361  B 6.871
E 4.141  C 3.164  B 3.012  A 5.668  D 12.185
B 4.223  E 3.323  A 4.668  C 2.686  D 7.776 :
TREATMENT Wine
ANOVA %Alcohol
BOOTSTRAP [DATA=%Alcohol; ANCILLARY=Wine; METHOD=permute; NTIMES=500;\
  PROBABILITY=0.90; SEED=46921] 'Ratio'
CAPTION !t(\
  'The observed variance ratio of 6.41 is well outside the 90% confidence',\
  'interval. A one-sided permutation test at the 95% level therefore',\
  'rejects the hypothesis that the observed treatment differences could',\
  'have arisen by chance from this set of data.')

CAPTION '3) Balanced bootstrap.',\
  !t('Fit parallel exponential curves to the relationship between',\
  'Sugar yield and Soil phosphorus in four years.',\
  'Estimate the asymptotic yields for each year, with standard errors.',\
  '(Note that FITCURVE does not provide these s.e.s.)')
PROCEDURE [PARAMETER=pointer] 'RESAMPLE'
  OPTION 'DATA',      " (I: variates, factors or texts) data vectors from
                        which to calculate the statistics; no default"\
         'AUXILIARY', " (I: pointers) auxiliary sets of data vectors, each
                        of which is to be resampled independently"\
         'ANCILLARY'; " (I: any type of structure) other relevant
                        information needed to calculate the statistics "\
         MODE=p; TYPE=!t(variate,factor,text),'pointer',*; SET=yes,no,no;\
         LIST=yes; DECLARED=yes; PRESENT=yes
  PARAMETER 'STATISTIC', " (O: scalars) to save the calculated statistics "\
            'EXIT';      " (O: scalars) to save an exit code to indicate
                           failure (EXIT[i]=1) or success (EXIT[i]=0) when
                           calculating each STATISTIC[i]"\
            MODE=p; TYPE='scalar'; SET=yes
  CALC y = ANCILLARY[1]+DATA[1]
  MODEL y
  FITCURVE [PRINT=*] ANCILLARY[2,3]
  RKEEP ESTIMATES=est; EXIT=ex
  " Extract asymptotes: first two parameters are rate and range."
  EQUATE [OLDFORMAT=!(-2,4)] OLD=est; NEW=STATISTIC
  " Pass on information about success of fitting"
  EQUATE OLD=ex; NEW=EXIT
ENDPROCEDURE

FACTOR [LEVELS=4; VALUES=16(1...4)] Year
READ Beetwt,%sugar,SoilP
 7.23 18.5  5.4    7.69 18.0  5.4   24.64 20.1  7.8   26.67 19.8  8.0
39.78 19.5 18.0   44.98 19.3 15.6   41.59 19.7 30.4   44.08 19.8 33.8
48.37 19.4 50.4   44.76 19.0 51.0   49.73 18.6 44.0   51.54 18.5 40.2
47.69 19.0 57.2   45.66 19.4 65.0   50.18 18.6 27.0   47.69 18.7 30.0
 8.82 13.8  5.6    1.81 13.9  4.8   15.82 14.5 10.2    9.04 14.0  8.6
24.41 15.0 21.6   22.60 14.1 17.2   26.45 15.2 36.4   20.80 15.3 37.2
28.30 14.2 44.4   22.60 14.7 44.4   14.24 13.5 41.0   35.94 15.6 30.2
25.54 15.8 60.8   27.13 15.6 47.0   31.42 15.6 27.0   34.13 15.4 29.0
19.90 16.1  3.0   20.60 16.0  2.0   34.70 16.7  6.2   35.40 16.4  6.2
46.80 17.1 19.8   40.50 16.9 17.2   43.00 16.9 29.6   48.60 17.1 28.0
47.30 17.0 42.8   41.30 17.1 46.2   44.30 17.0 36.6   47.60 16.6 40.0
45.60 17.0 42.2   44.60 17.0 52.0   44.00 17.2 23.4   40.10 16.6 28.0
14.35 16.1  4.0   14.35 15.5  3.8   26.71 16.6  8.0   25.12 16.4  6.4
33.39 17.2 18.2   33.79 16.2 14.8   36.68 17.0 35.0   33.69 16.8 29.6
34.98 17.0 37.2   35.78 17.0 40.0   42.06 17.2 39.6   38.77 17.3 36.8
40.66 17.3 52.4   37.28 17.2 45.6   34.68 17.3 22.0   32.59 17.2 26.0 :
CALC Sugar = Beetwt * %sugar / 100
MODEL Sugar
FITCURVE [PRINT=model,estimates] SoilP,Year
RKEEP FITTED=Fit
CALCULATE Simplres = Sugar-Fit
CAPTION !t(\
  'Resample the residuals, adding to the fitted values already calculated.',\
  'The calculations require the fitted values, and the explanatory variate',\
  'and factor, but the values of these vectors must not be resampled.',\
  'Use the balanced bootstrap, which ensures that all observations occur',\
  'an equal number of times in the complete set of bootstrap samples.')
BOOTSTRAP [DATA=Simplres; ANCILLARY=Fit,SoilP,Year; METHOD=balance;\
  SEED=23845] 'Year 1','Year 2','Year 3','Year 4'

CAPTION '4) Use of auxiliary data.',\
  !t('Estimate difference in medians between the heights',\
  'of active volcanos in America and in Asia/Oceania',\
  '(see the Guide to Genstat, Part 2, Section 2.1).')
PROCEDURE [PARAMETER=pointer] 'RESAMPLE'
  OPTION 'DATA',      " (I: variates, factors or texts) data vectors from
                        which to calculate the statistics; no default"\
         'AUXILIARY', " (I: pointers) auxiliary sets of data vectors, each
                        of which is to be resampled independently"\
         'ANCILLARY'; " (I: any type of structure) other relevant
                        information needed to calculate the statistics "\
         MODE=p; TYPE=!t(variate,factor,text),'pointer',*; SET=yes,no,no;\
         LIST=yes; DECLARED=yes; PRESENT=yes
  PARAMETER 'STATISTIC', " (O: scalars) to save the calculated statistics "\
            'EXIT';      " (O: scalars) to save an exit code to indicate
                           failure (EXIT[i]=1) or success (EXIT[i]=0) when
                           calculating each STATISTIC[i]"\
            MODE=p; TYPE='scalar'; SET=yes
  CALC STATISTIC[1] = MEDIAN(DATA[1])-MEDIAN(AUXILIARY[1][1])
ENDPROCEDURE
CAPTION !t(\
  'Since the samples are independent, they must be resampled separately.',\
  'The sets of vectors provided by the AUXILIARY option are each subjected',\
  'to separate resampling. The sets must be combined into pointers, to',\
  'allow the possibility of more than one additional set of data.')
VARIATE America,AsiaOcea; VALUES=!(130,126,124,124,113,89,83,77,70,62,58,51,\
  51,42,40,34,199,197,193,185,177,172,157,156,140,102,93,86,36,140,102,100,\
  94,83,83,82,73,67,67,66,60,57,57,53,49,43,43,40,35,35), !(156,125,122,120,\
  112,109,103,100,100,96,95,95,90,83,81,81,81,77,75,75,73,71,71,67,66,66,64,\
  62,60,60,60,59,58,57,56,56,55,54,54,52,52,52,51,50,49,49,48,45,44,44,37,\
  36,36,26,26,24,19,11,10,137,41)
CAPTION 'Calculate and print the difference using the procedure.'
POINTER [VALUE=AsiaOcea] Aux
RESAMPLE [DATA=America; AUXILIARY=Aux] Diff; EXIT=exit
PRINT Diff
CAPTION !t('Produce the bootstrapped estimate and 90% confidence interval;',\
  'using balanced resampling.')
BOOTSTRAP [DATA=America; AUXILIARY=Aux; METHOD=balance; PROBABILITY=0.90]\
  'Difference'

CAPTION '5) Bias-corrected and accelerated confidence limits.',\
  !t('Spatial data test (Efron, B. & Tibshirani, R.J., 1993,',\
  'An Introduction to the Bootstrap, Chapman & Hall, London).')
PROCEDURE [PARAMETER=pointer] 'RESAMPLE'
  OPTION 'DATA',      " (I: variates, factors or texts) data vectors from
                        which to calculate the statistics; no default"\
         'AUXILIARY', " (I: pointers) auxiliary sets of data vectors, each
                        of which is to be resampled independently"\
         'ANCILLARY'; " (I: any type of structure) other relevant
                        information needed to calculate the statistics "\
         MODE=p; TYPE=!t(variate,factor,text),'pointer',*; SET=yes,no,no;\
         LIST=yes; DECLARED=yes; PRESENT=yes
  PARAMETER 'STATISTIC', " (O: scalars) to save the calculated statistics "\
            'EXIT';      " (O: scalars) to save an exit code to indicate
                           failure (EXIT[i]=1) or success (EXIT[i]=0) when
                           calculating each STATISTIC[i]"\
            MODE=p; TYPE='scalar'; SET=yes
  CALC STATISTIC[1] = SUM((DATA[1] - MEAN(DATA[1]))**2) / NOBS(DATA[1])
ENDPROCEDURE
VARIATE [VALUES=48,36,20,29,42,42,20,42,22,41,45,14,6,\
         0,33,28,34,4,32,24,47,41,24,26,30,41] A
&       [VALUES=42,33,16,39,38,36,15,33,20,43,34,22,7,\
         15,34,29,41,13,38,25,27,41,28,14,28,40] B
BOOTSTRAP [DATA=A; NTIMES=2000; SEED=245875; PROBABILITY=0.90; CIMETHOD=bca]\
  'Variance'