BOOTSTRAP procedure

Produces bootstrapped estimates, standard errors and distributions (P.W. Lane).

Options

`PRINT` = string token	Controls printed output (`estimates`, `graphs`, `vcovariance`); default `esti`
`DATA` = variates, factors or texts	Data vectors from which the statistics are to be calculated; no default
`AUXILIARY` = pointers	Further sets of data vectors, each set to be resampled independently
`ANCILLARY` = any type	Other relevant information needed to calculate the statistics
`NTIMES` = scalar	Number of times to resample; default 100
`SEED` = scalar	Seed for random number generator; default continue from previous generation or use system clock
`GRAPHICS` = string token	Type of graphics (`lineprinter`, `highresolution`); default `high`
`PROBABILITY` = scalar	Probability level for confidence interval; default 0.95
`METHOD` = string token	What type of bootstrapping to use (`random`, `balance`, `permute`); default `rand`
`BLOCKSTRUCTURE` = formula	Block structure to use for random permutations
`CIMETHOD` = string token	What type of confidence intervals to provide (`bca`, `percentile`); default `perc`
`VCOVARIANCE` = symmetric matrix	Saves the variance-covariance matrix of the statistics

Parameters

`LABEL` = texts	Texts, each containing a single line, to label the statistics; default `'Statistic'`
`ESTIMATE` = scalars	Saves the bootstrap mean for each statistic
`SE` = scalars	Saves the bootstrap standard error for each statistic
`LOWER` = scalars	Saves the bootstrap lower confidence limit for each statistic
`UPPER` = scalars	Saves the bootstrap upper confidence limit for each statistic
`STATISTIC` = variates	Saves the series of bootstrap estimates of each statistic
`WINDOW` = scalars	Graphical window to use for displaying bootstrap distribution for each statistic; default 4
`SCREEN` = string tokens	Whether to clear graphical frame or draw on top (`clear`, `keep`); default `clea`

Description

The bootstrap is a method of providing distributional information, such as standard errors, about statistical estimates – without making precise distributional assumptions about the data. It can also provide estimates with reduced bias. This is achieved by “resampling” from the data; that is, generating new data sets by sampling with replacement from the data set being investigated. A good introduction to the bootstrap is given by Efron & Tibshirani (1986); a fuller treatment can be found in Efron & Tibshirani (1993).

The BOOTSTRAP procedure can be used for any statistic or set of statistics that can be calculated by Genstat from one or more data matrices. You need to provide a procedure called RESAMPLE which calculates the statistics from the data, as explained in the Method section. There are also several examples of RESAMPLE in the standard examples, which can be extracted by the commands:

LIBEXAMPLE 'BOOTSTRAP'; EXAMPLE=Ex

PRINT Ex; JUSTIF=left

The options and parameters of RESAMPLE must not be changed. The body of the procedure should store the required statistics in scalars called STATISTIC[1...s] using variates, factors and texts called DATA[1...d], where each of s and d can be any positive integer. The EXIT parameter of RESAMPLE should be set to indicate when any of the calculations fail, as can sometimes happen if degenerate data-sets are generated (see Example 3).

The data for BOOTSTRAP are provided as a list of vectors (variates, factors or texts) using the DATA option. From this, the procedure will generate new data by resampling from the set of units: all the vectors must have the same length, and each new sample uses the same set of units for all vectors. The procedure RESAMPLE is then called to calculate the statistics.
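The idea of resampling units rather than individual vectors can be sketched as follows. This is a Python illustration only, not Genstat code and not the procedure's actual implementation; the function name `resample_units` and the shortened data are hypothetical.

```python
import random

def resample_units(vectors, rng):
    """Draw one bootstrap sample, applying the same unit numbers to every vector."""
    n = len(vectors[0])
    if any(len(v) != n for v in vectors):
        raise ValueError("all data vectors must have the same length")
    # Sample n unit numbers with replacement.
    units = [rng.randrange(n) for _ in range(n)]
    # Every vector is indexed by the same units, so rows stay together.
    return [[v[i] for i in units] for v in vectors]

rng = random.Random(77320)
y = [576, 635, 558, 578, 666]        # illustrative subset of values
z = [3.39, 3.30, 2.81, 3.03, 3.44]
new_y, new_z = resample_units([y, z], rng)
print(list(zip(new_y, new_z)))
```

Because both vectors are indexed by the same unit numbers, each (y, z) pair in the new sample is a pair that occurs in the original data, which is what a statistic such as a correlation requires.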

Extra information required in procedure RESAMPLE to calculate the statistics, which is not to be resampled along with the data matrix, can be passed as a list of data structures using the ANCILLARY option of BOOTSTRAP (see Examples 2 and 3).

The procedure can also deal with statistics calculated from several independent data matrices. For example, the difference in means between two independent samples must be dealt with by resampling independently from each sample, which may have different numbers of observations. In this case, one data matrix is specified as a list of vectors using the DATA option as usual, and the second data matrix is specified as a pointer using the AUXILIARY option. This option may be set to any number of pointers, each storing a list of vectors; resampling is done independently for each set of vectors (see Example 4).
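As a rough illustration of independent resampling from two samples of different sizes, the Python sketch below resamples each sample separately before forming a difference in means. It is not Genstat code; the names `resample` and `bootstrap_differences` and the data are assumptions made for the example.

```python
import random
from statistics import mean

def resample(values, rng):
    """Sample len(values) items from values, with replacement."""
    return [rng.choice(values) for _ in values]

def bootstrap_differences(data, auxiliary, ntimes, rng):
    """Difference in means, resampling the two samples independently."""
    stats = []
    for _ in range(ntimes):
        # Each sample keeps its own size and is resampled on its own.
        stats.append(mean(resample(data, rng)) - mean(resample(auxiliary, rng)))
    return stats

rng = random.Random(12345)
sample_1 = [130, 126, 124, 113, 89, 83]      # hypothetical values
sample_2 = [156, 125, 122, 120]              # hypothetical values, different length
diffs = bootstrap_differences(sample_1, sample_2, 100, rng)
print(len(diffs))
```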

The option NTIMES specifies how many times the resampling is carried out. The default value is 100, which has been found by many users of the bootstrap to be sufficient for producing standard errors and bias-reduced estimates. However, the number should be increased to get reliable distributional information: 1000 or more may be needed for reliable 95% confidence limits.

Printed output is controlled by the PRINT option, with settings estimates for the estimates and their standard errors and confidence limits, and vcovariance for the variance-covariance matrix. The graphs setting draws a histogram of the bootstrap distributions. The default setting is just estimates.

A label should be provided for each statistic, using the LABEL parameter; by default, bootstrapping will be done for a single statistic which will be labelled simply as Statistic. The estimates and their standard errors can be saved by the ESTIMATE and SE parameters. Also, a variance-covariance matrix of the estimates can be saved using the VCOVARIANCE option. The number of labels, s say, must match the number of statistics, called STATISTIC[1...s], calculated in your version of the RESAMPLE procedure.

The parameters LOWER and UPPER allow confidence limits for each statistic to be saved, with the probability level specified in the PROBABILITY option (default 0.95, i.e. 95% confidence intervals). By default the intervals are constructed as percentiles of the empirical distribution of the bootstrap estimates. However, provided there are no auxiliary data vectors, you can request bias-corrected and accelerated limits instead by setting option CIMETHOD=bca (see Efron & Tibshirani, 1993, Section 14.3). The full sets of bootstrap estimates can be saved by setting the STATISTIC parameter; each variate will contain n values, where n is the setting of the NTIMES option.
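A percentile interval can be sketched in Python as below. This is an illustration only: the quantile rule used here (linear interpolation between order statistics) is an assumption and may differ from the rule Genstat applies, and the function names are hypothetical.

```python
def quantile(ordered, q):
    """Quantile of an already sorted list, by linear interpolation (assumed rule)."""
    pos = q * (len(ordered) - 1)
    lo = int(pos)                          # pos is non-negative, so this is a floor
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

def percentile_interval(estimates, probability=0.95):
    """Lower and upper limits cutting off equal tails of the bootstrap estimates."""
    ordered = sorted(estimates)
    tail = (1 - probability) / 2
    return quantile(ordered, tail), quantile(ordered, 1 - tail)

lower, upper = percentile_interval(list(range(101)), probability=0.90)
print(lower, upper)
```

With only 100 resamples, a 95% interval of this kind rests on the two or three most extreme estimates in each tail, which is why a larger NTIMES is advisable for confidence limits.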

Three methods of bootstrapping are provided. By default, resampling is completely pseudo-random, using Genstat’s random-number generator. The generator can be initialized by setting option SEED, thereby producing reproducible results; otherwise, it continues from the previous generation or, if there has been none, is initialized from the system clock. The second method is balanced bootstrapping, requested by setting METHOD=balance. In this case, the resampling is constrained to ensure that each unit of the data matrix occurs the same number of times in the complete set of generated samples (see Examples 3 and 4). The third method, specified by METHOD=permute, is simply to permute the units of the data matrix. Note that this method gives no variation in results if the statistics do not depend on the order of the data, as with the sample mean. However, this method provides permutation tests, a type of randomization test that can be applied to grouped data (see Example 2). When METHOD=permute, you can set the BLOCKSTRUCTURE option to a model formula to define how the randomization is to be done (see the RANDOMIZE directive for details).
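The permutation method can be illustrated with a small Python sketch, in which the observations are shuffled while the group labels stay fixed (much as a treatment factor passed through ANCILLARY is not resampled). This is not Genstat code; the function names, the two-group statistic and the data are assumptions chosen for the example.

```python
import random
from statistics import mean

def group_difference(values, groups):
    """Difference between the means of groups 'A' and 'B'."""
    a = [v for v, g in zip(values, groups) if g == "A"]
    b = [v for v, g in zip(values, groups) if g == "B"]
    return mean(a) - mean(b)

def permutation_distribution(values, groups, ntimes, rng):
    """Recalculate the statistic after randomly permuting the observations."""
    stats = []
    for _ in range(ntimes):
        shuffled = values[:]          # copy, so the original order is kept
        rng.shuffle(shuffled)
        stats.append(group_difference(shuffled, groups))
    return stats

rng = random.Random(46921)
values = [4.9, 5.7, 4.7, 7.3, 12.2, 7.8]     # hypothetical observations
groups = ["A", "A", "A", "B", "B", "B"]      # labels are not permuted
observed = group_difference(values, groups)
dist = permutation_distribution(values, groups, 500, rng)
# Proportion of permutations at least as extreme as the observed difference.
print(sum(1 for s in dist if s <= observed) / len(dist))
```

The statistic varies between permutations only because it depends on which observations fall in which group; the overall mean of `values` would be identical for every permutation.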

If the graphs setting of the PRINT option is used, the procedure will display the distribution of each set of bootstrap estimates as a histogram. By default, this will be a high-resolution plot on the current device, but the GRAPHICS option can be set to lineprinter to produce a line-printer histogram. In a high-resolution plot, the histogram is enhanced with a smoothed line, giving a clearer indication of the distribution of the statistic. By default, the display for the statistics will appear in graphical window 4, one at a time (this window is set by default to fill the whole graphical frame). But the WINDOW and SCREEN parameters can be set to arrange for concurrent displays of the statistics in differently sized windows.

Options: PRINT, DATA, AUXILIARY, ANCILLARY, NTIMES, SEED, GRAPHICS, PROBABILITY, METHOD, BLOCKSTRUCTURE, CIMETHOD, VCOVARIANCE.

Parameters: LABEL, ESTIMATE, SE, LOWER, UPPER, STATISTIC, WINDOW, SCREEN.

Method

Samples are generated by scaling uniform random numbers produced by the URAND function. For the balanced bootstrap, a list of repeated unit numbers is sorted into random order and used one block at a time. For the permutation test, the RANDOMIZE directive is used to re-order the data at random.
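The balanced scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure's code: the name `balanced_samples` is hypothetical, and Python's `random.shuffle` stands in for the random ordering step.

```python
import random
from collections import Counter

def balanced_samples(n_units, ntimes, rng):
    """Return ntimes samples of unit numbers in which every unit occurs ntimes overall."""
    # Repeat the full list of unit numbers once per sample, then shuffle the lot.
    pool = list(range(n_units)) * ntimes
    rng.shuffle(pool)
    # Use the shuffled list one block of n_units at a time.
    return [pool[k * n_units:(k + 1) * n_units] for k in range(ntimes)]

samples = balanced_samples(5, 8, random.Random(23845))
counts = Counter(unit for sample in samples for unit in sample)
print(sorted(counts.items()))
```

An individual sample may still contain repeated units; the balance holds only across the complete set of samples.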

BOOTSTRAP needs a subsidiary procedure RESAMPLE to calculate the statistics of interest. RESAMPLE has an option, DATA, which is used to supply the data vectors (variates, factors or texts) from which the statistics are to be calculated. Other relevant information can be supplied through the AUXILIARY and ANCILLARY options, which correspond to the AUXILIARY and ANCILLARY options of BOOTSTRAP itself. There are two parameters: STATISTIC supplies a list of scalars to store the estimates of each statistic, and EXIT supplies a list of scalars, each of which should be set to zero if the corresponding statistic could be estimated successfully from the supplied data vectors, or to one if it could not. If the value of EXIT is not calculated in RESAMPLE, the BOOTSTRAP procedure assumes that the calculations succeeded.

This example shows a version of RESAMPLE which calculates the correlation between two variates.

PROCEDURE [PARAMETER=pointer] 'RESAMPLE'
OPTION    'DATA',      " (I: variates, factors or texts) data vectors from
                         which to calculate the statistics; no default"\
          'AUXILIARY', " (I: pointers) auxiliary sets of data vectors, each
                         of which is to be resampled independently"\
          'ANCILLARY'; " (I: any type of structure) other relevant
                         information needed to calculate the statistics "\
          MODE=p; TYPE=!t(variate,factor,text),'pointer',*;\
          SET=yes,no,no; LIST=yes; DECLARED=yes; PRESENT=yes
PARAMETER 'STATISTIC', " (O: scalars) to save the calculated statistics "\
          'EXIT';      " (O: scalars) to save an exit code to indicate
                         failure (EXIT[i]=1) or success (EXIT[i]=0)
                         when calculating each STATISTIC[i]"\
          MODE=p; TYPE='scalar'; SET=yes
  CALC STATISTIC[1] = CORRELATION(DATA[1]; DATA[2])
  & EXIT[1] = STATISTIC[1]==C('missing')
ENDPROCEDURE

VARIATE [VALUES=576,635,558,578,666,580,555,661,651,605,\
  653,575,545,572,594] Y
& [VALUES=3.39,3.30,2.81,3.03,3.44,3.07,3.00,3.43,3.36,3.13,\
  3.12,2.74,2.76,2.88,2.96] Z
BOOTSTRAP [DATA=Y,Z; SEED=77320] 'Correlation'

The RESAMPLE procedure is called within a loop, and the statistics that are returned are loaded into variates. If any statistics fail to be calculated, as recorded by the EXIT parameter of RESAMPLE, they are stored as missing values. BOOTSTRAP will then base its estimation on the successful generations, but reports how many failures occurred.

The bootstrap estimates are formed as simple means of the stored variates, and the standard errors are the square roots of their sample variances. The TABULATE directive is used to estimate quantiles from the stored variates, to define confidence limits. The variance-covariance matrix is formed from the statistics using the FSSPM directive.
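The summary step can be sketched in Python as below, with `None` standing in for a Genstat missing value recorded after a failed generation. This is an illustration only; the function name `summarize` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def summarize(estimates):
    """Bootstrap mean, standard error and failure count for one statistic."""
    # Failed generations are stored as missing values and left out of the summary.
    ok = [e for e in estimates if e is not None]
    failures = len(estimates) - len(ok)
    # The standard error is the square root of the sample variance (divisor n - 1).
    return mean(ok), sqrt(variance(ok)), failures

estimate, se, failures = summarize([0.78, 0.81, None, 0.74, 0.80])
print(estimate, se, failures)
```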

The graphical representation uses DHISTOGRAM or HISTOGRAM on the stored variates. The smoothed curves are calculated from the transformed percentages from the histogram: LOGIT(CUM(%)). A smoothing spline is fitted on this scale, by the FIT directive with the SSPLINE function, using 4 d.f. The resulting fitted values are then backtransformed and drawn on the plot with the monotonic setting of the PEN directive.

Action with `RESTRICT`

If any of the data vectors is restricted, BOOTSTRAP will use only the units that are not restricted for any of the vectors. The data vectors that are passed to the RESAMPLE procedure are all restricted to this identified set of units, but otherwise match the original data vectors. Each set of vectors supplied in a pointer in the AUXILIARY option is treated separately in this way.

References

Efron, B. & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1, 54-77.

Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall, London.

Example

CAPTION 'BOOTSTRAP example','1) Bootstrapped correlation.',\
  !t('The data are scores from two tests on new admissions to Law School ',\
  '(Efron, 1981, The Jackknife, the Bootstrap & Other Resampling Plans.',\
  'CBMS Monograph 38, SIAM, Philadelphia); listed in Table 1 of Hinkley',\
  '(1983), Encyclopedia of Statistics, Volume 4, page 282.');\
  STYLE=meta,plain,plain
" Define RESAMPLE to calculate the correlation between the two scores."
PROCEDURE [PARAMETER=pointer] 'RESAMPLE'
OPTION    'DATA',      " (I: variates, factors or texts) data vectors from
                         which to calculate the statistics; no default"\
          'AUXILIARY', " (I: pointers) auxiliary sets of data vectors, each
                         of which is to be resampled independently"\
          'ANCILLARY'; " (I: any type of structure) other relevant
                         information needed to calculate the statistics "\
          MODE=p; TYPE=!t(variate,factor,text),'pointer',*; SET=yes,no,no;\ 
          LIST=yes; DECLARED=yes; PRESENT=yes
PARAMETER 'STATISTIC', " (O: scalars) to save the calculated statistics "\
          'EXIT';      " (O: scalars) to save an exit code to indicate
                         failure (EXIT[i]=1) or success (EXIT[i]=0)
                         when calculating each STATISTIC[i]"\
          MODE=p; TYPE='scalar'; SET=yes

  CALC STATISTIC[1] = CORRELATION(DATA[1]; DATA[2])
  & EXIT[1] = STATISTIC[1]==C('missing')

ENDPROCEDURE

VARIATE [VALUES=576,635,558,578,666,580,555,661,651,605,653,575,545,572,594] Y
& [VALUES=3.39,3.30,2.81,3.03,3.44,3.07,3.00,3.43,3.36,3.13,\
  3.12,2.74,2.76,2.88,2.96] Z
BOOTSTRAP [DATA=Y,Z; SEED=77320] 'Correlation'

CAPTION '2) A permutation test.',\
  !t('Five wines are tested in a completely randomized design for',\
  'alcohol content. The variance-ratio for the treatment effect is',\
  'estimated by resampling with random permutation of the observations.')
" Re-define RESAMPLE to calculate the ratio.
  The treatment factor must be passed to the procedure via the ANCILLARY
  option so that AKEEP can extract the treatment sum of squares."
PROCEDURE [PARAMETER=pointer] 'RESAMPLE'
OPTION    'DATA',      " (I: variates, factors or texts) data vectors from
                         which to calculate the statistics; no default"\
          'AUXILIARY', " (I: pointers) auxiliary sets of data vectors, each
                         of which is to be resampled independently"\
          'ANCILLARY'; " (I: any type of structure) other relevant
                         information needed to calculate the statistics "\
          MODE=p; TYPE=!t(variate,factor,text),'pointer',*; SET=yes,no,no; \
          LIST=yes; DECLARED=yes; PRESENT=yes
PARAMETER 'STATISTIC', " (O: scalars) to save the calculated statistics "\
          'EXIT';      " (O: scalars) to save an exit code to indicate
                         failure (EXIT[i]=1) or success (EXIT[i]=0)
                         when calculating each STATISTIC[i]"\
          MODE=p; TYPE='scalar'; SET=yes
  ANOVA [PRINT=*] DATA[1]
  AKEEP TERMS=ANCILLARY[1],'*Units*'; SS=sstreat,ssresid; DF=dftreat,dfresid
  CALC STATISTIC[1] = (sstreat/dftreat)/(ssresid/dfresid)
ENDPROCEDURE

FACTOR Wine
READ Wine,%Alcohol; FREP=labels
E  4.931 D  7.263 A  4.857 C  3.361 B  6.871 E  4.141 C  3.164 B  3.012
A  5.668 D 12.185 B  4.223 E  3.323 A  4.668 C  2.686 D  7.776 :
TREATMENT Wine
ANOVA %Alcohol

BOOTSTRAP [DATA=%Alcohol; ANCILLARY=Wine; METHOD=permute; NTIMES=500;\
  PROBABILITY=0.90; SEED=46921] 'Ratio'

CAPTION !t(\
  'The observed variance ratio of 6.41 is well outside the 90% confidence',\
  'interval. A one-sided permutation test at the 95% level therefore',\
  'rejects the hypothesis that the observed treatment differences could',\
  'have arisen by chance from this set of data.')

CAPTION '3) Balanced bootstrap.',\
  !t('Fit parallel exponential curves to the relationship between',\
  'Sugar yield and Soil phosphorus in four years.',\
  'Estimate the asymptotic yields for each year, with standard errors.',\
  '(Note that FITCURVE does not provide these s.e.s.)')
PROCEDURE [PARAMETER=pointer] 'RESAMPLE'
OPTION    'DATA',      " (I: variates, factors or texts) data vectors from
                         which to calculate the statistics; no default"\
          'AUXILIARY', " (I: pointers) auxiliary sets of data vectors, each
                         of which is to be resampled independently"\
          'ANCILLARY'; " (I: any type of structure) other relevant
                         information needed to calculate the statistics "\
          MODE=p; TYPE=!t(variate,factor,text),'pointer',*; SET=yes,no,no;\ 
          LIST=yes; DECLARED=yes; PRESENT=yes
PARAMETER 'STATISTIC', " (O: scalars) to save the calculated statistics "\
          'EXIT';      " (O: scalars) to save an exit code to indicate
                         failure (EXIT[i]=1) or success (EXIT[i]=0)
                         when calculating each STATISTIC[i]"\
          MODE=p; TYPE='scalar'; SET=yes
  CALC y = ANCILLARY[1]+DATA[1]
  MODEL y
  FITCURVE [PRINT=*] ANCILLARY[2,3]
  RKEEP ESTIMATES=est; EXIT=ex
  " Extract asymptotes: first two parameters are rate and range."
  EQUATE [OLDFORMAT=!(-2,4)] OLD=est; NEW=STATISTIC
  " Pass on information about success of fitting"
  EQUATE OLD=ex; NEW=EXIT
ENDPROCEDURE

FACTOR [LEVELS=4; VALUES=16(1...4)] Year
READ Beetwt,%sugar,SoilP
 7.23 18.5  5.4   7.69 18.0  5.4  24.64 20.1  7.8  26.67 19.8  8.0
39.78 19.5 18.0  44.98 19.3 15.6  41.59 19.7 30.4  44.08 19.8 33.8
48.37 19.4 50.4  44.76 19.0 51.0  49.73 18.6 44.0  51.54 18.5 40.2
47.69 19.0 57.2  45.66 19.4 65.0  50.18 18.6 27.0  47.69 18.7 30.0

 8.82 13.8  5.6   1.81 13.9  4.8  15.82 14.5 10.2   9.04 14.0  8.6
24.41 15.0 21.6  22.60 14.1 17.2  26.45 15.2 36.4  20.80 15.3 37.2
28.30 14.2 44.4  22.60 14.7 44.4  14.24 13.5 41.0  35.94 15.6 30.2
25.54 15.8 60.8  27.13 15.6 47.0  31.42 15.6 27.0  34.13 15.4 29.0

19.90 16.1  3.0  20.60 16.0  2.0  34.70 16.7  6.2  35.40 16.4  6.2
46.80 17.1 19.8  40.50 16.9 17.2  43.00 16.9 29.6  48.60 17.1 28.0
47.30 17.0 42.8  41.30 17.1 46.2  44.30 17.0 36.6  47.60 16.6 40.0
45.60 17.0 42.2  44.60 17.0 52.0  44.00 17.2 23.4  40.10 16.6 28.0

14.35 16.1  4.0  14.35 15.5  3.8  26.71 16.6  8.0  25.12 16.4  6.4
33.39 17.2 18.2  33.79 16.2 14.8  36.68 17.0 35.0  33.69 16.8 29.6
34.98 17.0 37.2  35.78 17.0 40.0  42.06 17.2 39.6  38.77 17.3 36.8
40.66 17.3 52.4  37.28 17.2 45.6  34.68 17.3 22.0  32.59 17.2 26.0 :
CALC Sugar = Beetwt * %sugar / 100
MODEL Sugar
FITCURVE [PRINT=model,estimates] SoilP,Year
RKEEP FITTED=Fit
CALCULATE Simplres = Sugar-Fit
CAPTION !t(\
  'Resample the residuals, adding to the fitted values already calculated.',\
  'The calculations require the fitted values, and the explanatory variate',\
  'and factor, but the values of these vectors must not be resampled.',\
  'Use the balanced bootstrap, which ensures that all observations occur',\
  'an equal number of times in the complete set of bootstrap samples.')
BOOTSTRAP [DATA=Simplres; ANCILLARY=Fit,SoilP,Year; METHOD=balance;\
  SEED=23845] 'Year 1','Year 2','Year 3','Year 4'

CAPTION '4) Use of auxiliary data.',\
  !t('Estimate difference in medians between the heights',\
  'of active volcanos in America and in Asia/Oceania',\
  '(see the Guide to Genstat, Part 2, Section 2.1).')
PROCEDURE [PARAMETER=pointer] 'RESAMPLE'
OPTION    'DATA',      " (I: variates, factors or texts) data vectors from
                         which to calculate the statistics; no default"\
          'AUXILIARY', " (I: pointers) auxiliary sets of data vectors, each
                         of which is to be resampled independently"\
          'ANCILLARY'; " (I: any type of structure) other relevant
                         information needed to calculate the statistics "\
          MODE=p; TYPE=!t(variate,factor,text),'pointer',*; SET=yes,no,no;\ 
          LIST=yes; DECLARED=yes; PRESENT=yes
PARAMETER 'STATISTIC', " (O: scalars) to save the calculated statistics "\
          'EXIT';      " (O: scalars) to save an exit code to indicate
                         failure (EXIT[i]=1) or success (EXIT[i]=0)
                         when calculating each STATISTIC[i]"\
          MODE=p; TYPE='scalar'; SET=yes
  CALC STATISTIC[1] = MEDIAN(DATA[1])-MEDIAN(AUXILIARY[1][1])
ENDPROCEDURE

CAPTION !t(\
  'Since the samples are independent, they must be resampled separately.',\
  'The sets of vectors provided by the AUXILIARY option are each subjected',\
  'to separate resampling. The sets must be combined into pointers, to',\
  'allow the possibility of more than one additional set of data.')

VARIATE America,AsiaOcea; VALUES=!(130,126,124,124,113,89,83,77,70,62,58,51,\
  51,42,40,34,199,197,193,185,177,172,157,156,140,102,93,86,36,140,102,100,\
  94,83,83,82,73,67,67,66,60,57,57,53,49,43,43,40,35,35), !(156,125,122,120,\
  112,109,103,100,100,96,95,95,90,83,81,81,81,77,75,75,73,71,71,67,66,66,64,\
  62,60,60,60,59,58,57,56,56,55,54,54,52,52,52,51,50,49,49,48,45,44,44,37,\
  36,36,26,26,24,19,11,10,137,41)

CAPTION  'Calculate and print the difference using the procedure.'
POINTER  [VALUE=AsiaOcea] Aux
RESAMPLE [DATA=America; AUXILIARY=Aux] Diff; EXIT=exit
PRINT    Diff
CAPTION  !t('Produce the bootstrapped estimate and 90% confidence interval,',\
  'using balanced resampling.')
BOOTSTRAP [DATA=America; AUXILIARY=Aux; METHOD=balance; PROBABILITY=0.90]\
  'Difference'

CAPTION '5) Bias-corrected and accelerated confidence limits.',\
  !t('Spatial data test (Efron, B. & Tibshirani, R.J., 1993,',\
  'An Introduction to the Bootstrap, Chapman & Hall, London).')
PROCEDURE [PARAMETER=pointer] 'RESAMPLE'
OPTION    'DATA',      " (I: variates, factors or texts) data vectors from
                         which to calculate the statistics; no default"\
          'AUXILIARY', " (I: pointers) auxiliary sets of data vectors, each
                         of which is to be resampled independently"\
          'ANCILLARY'; " (I: any type of structure) other relevant
                         information needed to calculate the statistics "\
          MODE=p; TYPE=!t(variate,factor,text),'pointer',*; SET=yes,no,no;\
          LIST=yes; DECLARED=yes; PRESENT=yes
PARAMETER 'STATISTIC', " (O: scalars) to save the calculated statistics "\
          'EXIT';      " (O: scalars) to save an exit code to indicate
                         failure (EXIT[i]=1) or success (EXIT[i]=0)
                         when calculating each STATISTIC[i]"\
          MODE=p; TYPE='scalar'; SET=yes
  CALC STATISTIC[1] = SUM((DATA[1] - MEAN(DATA[1]))**2) / NOBS(DATA[1])
ENDPROCEDURE

VARIATE   [VALUES=48,36,20,29,42,42,20,42,22,41,45,14,6,\
                  0,33,28,34,4,32,24,47,41,24,26,30,41] A
&         [VALUES=42,33,16,39,38,36,15,33,20,43,34,22,7,\
                  15,34,29,41,13,38,25,27,41,28,14,28,40] B

BOOTSTRAP [DATA=A; NTIMES=2000; SEED=245875; PROBABILITY=0.90; CIMETHOD=bca]\
          'Variance'

Updated on August 30, 2019
