SVSTRATIFIED procedure

Analyses stratified random surveys by expansion or ratio raising (S.D. Langton).

Options

`PRINT` = string token	Controls printed output (`summary`, `totals`, `means`, `influence`, `ratios`, `extra`); default `summ`, `tota`, `infl`
`PLOT` = string token	Controls which high-resolution graphs are plotted (`single`, `separate`); default `*` i.e. none
`XMISSING` = string token	Action if x-variable contains missing values (`estimate`, `fault`); default `esti`
`RESTRICTED` = string token	Action with restricted (or filtered) observations (`omit`, `add`); default `omit`
`STRATUMFACTOR` = factor	Stratification factor; default `*` i.e. unstratified
`NINFLUENCE` = scalar	Number of influential points to print; default 10
`METHOD` = string token	Method for ratio analysis (`separate`, `combined`, `classicalcombined`); default `sepa`
`SAVESUMMARY` = string token	Whether to save just the overall summaries instead of those for each stratum (`yes`, `no`); default `no`
`COMBINEDSTRATUM` = scalar	Stratum for which the ratio should be set to the combined ratio estimate; default `*`
`ROWS` = scalars	Number of rows of plot-matrix; default `*` i.e. set automatically depending on number of levels of `STRATUMFACTOR`
`COLUMNS` = scalars	Number of columns of plot-matrix; default `*` i.e. set automatically depending on number of levels of `STRATUMFACTOR`
`NBOOT` = scalar	Number of bootstrap samples to use; default 0
`SEED` = scalar	Seed for random number generator for bootstrap; default 0
`CIPROBABILITY` = scalars	The probability level for the confidence intervals; default 0.95
`CIMETHOD` = string token	Method for forming confidence intervals (`automatic`, `tdistribution`, `percentile`); default `auto`
`COMPACT` = string token	Whether to produce output in a compact (plaintext) format (`yes`, `no`); default `no`

Parameters

`Y` = variates	Response data
`X` = variates	Base data; if unset expansion raising is used
`LABELS` = variates, factors or texts	Structure for labelling influential points
`NUNITS` = tables, scalars or variates	Numbers of units in each stratum in the population
`XTOTALS` = tables, scalars or variates	Population totals of the base data in each stratum
`TOTALS` = tables or scalars	Saves total estimates
`SETOTALS` = tables or scalars	Saves standard errors of estimates
`MEANS` = tables or scalars	Saves mean estimates
`SEMEANS` = tables or scalars	Saves standard errors of mean estimates
`RATIOS` = tables	Saves estimates of ratios
`FITTEDVALUES` = variates	Saves fitted values for the observations
`INFLUENCE` = variates	Saves influence statistics
`LTOTALS` = tables or scalars	Saves lower confidence limit for total
`UTOTALS` = tables or scalars	Saves upper confidence limit for total
`LMEANS` = tables or scalars	Saves lower confidence limit for mean
`UMEANS` = tables or scalars	Saves upper confidence limit for mean
`VARIANCES` = tables or scalars	Saves residual variances in each stratum

Description

SVSTRATIFIED analyses the results from a stratified random survey, either by expansion or ratio raising, and allows detection of outliers. The sample data are supplied in a variate using the Y parameter, and the base data are supplied in the same way using the X parameter. The LABELS parameter can supply a variate, factor or text for labelling individual units in the output. If X is unset or missing, expansion raising is used (i.e. the usual stratified random sampling analysis); within a stratum, however, the units must either all have base data or all lack it. If option XMISSING is set to fault, any missing base data will cause a fault. (Note: stratum is used here in the survey sense, not as in the ANOVA directive: i.e. the units are assumed to be classified into groups, and each group is called a stratum.)

The vectors Y, X and LABELS should usually have one row for each unit in the survey population, with unsampled or non-responding units having a missing value in the Y variate. However, if parameter NUNITS is set, the Y variate may contain only the response data; NUNITS then supplies the number of units in each stratum in the full population. Similarly, if ratio estimation is required with data in this form, XTOTALS should supply the population totals of X in each stratum.
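
As a rough illustration of expansion raising with sample-only data plus population counts, the following Python sketch raises each stratum's sample mean by the number of units in that stratum. The function name and the variance formula (the usual textbook one with a finite population correction) are assumptions for illustration, not a description of the procedure's internal code; the data are those of the first example at the end of this page.

```python
from math import sqrt
from statistics import mean, variance

def expansion_estimate(samples, n_units):
    """Stratified expansion-raised total and a textbook standard error.

    samples: dict mapping stratum -> list of sampled responses
    n_units: dict mapping stratum -> number of units in the population
             (the role played by NUNITS)
    """
    total = 0.0
    var_total = 0.0
    for stratum, y in samples.items():
        N, n = n_units[stratum], len(y)
        total += N * mean(y)              # raise the sample mean to the stratum
        fpc = 1 - n / N                   # finite population correction
        var_total += N**2 * fpc * variance(y) / n
    return total, sqrt(var_total)

samples = {1: [15, 20, 18, 18], 2: [23, 27, 25, 60], 3: [28, 128, 69, 72]}
n_units = {1: 12, 2: 12, 3: 11}
total, se = expansion_estimate(samples, n_units)
print(total, round(se, 2))
```

Here the three stratum totals are 12 × 17.75, 12 × 33.75 and 11 × 74.25.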

The METHOD option specifies which method of ratio estimation to use. The setting separate estimates a ratio for each stratum, whereas the settings combined and classicalcombined assume a common ratio in all strata. The classicalcombined method follows the approach shown in most textbooks, where the estimate for a stratum is given by ∑X × ratio, with the summation taken over all units in the stratum. This approach can produce illogical estimates in some situations (e.g. the estimate may be less than the sum of the responses). The combined method therefore estimates only for the unobserved units and adds this to the sum of the observed responses in the stratum, i.e. ∑Y + ∑X × ratio, where the summation of Y is over sampled (or responding) units and the summation of X is over unsampled units. Option COMBINEDSTRATUM is used with the separate ratio method and allows the ratio in a particular stratum to be reset to the combined ratio value; this can be a useful technique for dealing with the extreme ratios sometimes produced when the sampling fraction in a stratum is very low.
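
The difference between the two common-ratio formulae can be seen in a small Python sketch. The stratum data and the value of the ratio below are hypothetical, chosen so that the classical estimate falls below the sum of the observed responses.

```python
def classical_combined(x_all, ratio):
    """Textbook estimate: ratio applied to the base data of every unit in the stratum."""
    return sum(x_all) * ratio

def combined(y_sampled, x_unsampled, ratio):
    """Observed responses plus the ratio prediction for the unsampled units only."""
    return sum(y_sampled) + sum(x_unsampled) * ratio

# Hypothetical stratum: two sampled units (x and y known), two unsampled units (x only).
x_sampled, y_sampled = [10, 20], [30, 50]
x_unsampled = [5, 5]
ratio = 1.5  # assumed common ratio, e.g. estimated over all strata

classical = classical_combined(x_sampled + x_unsampled, ratio)
adjusted = combined(y_sampled, x_unsampled, ratio)
print(classical, adjusted, sum(y_sampled))
```

With these numbers the classical estimate (40 × 1.5 = 60) is smaller than the 80 already observed in the stratum, whereas the adjusted form (80 + 10 × 1.5 = 95) cannot fall below the observed sum when the ratio and base data are non-negative.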

Printing is controlled via the PRINT option. The default settings are summary, totals and influence; these print a summary of the data, estimated totals and influence statistics, respectively. The setting means produces a table showing the estimated means, whilst ratios produces a low-resolution plot of the confidence limits for the ratio estimates; this can be useful when deciding whether a combined ratio estimate is to be used. The setting extra displays extra information relating to the analysis, including sums and means of the response data and raising factors (weights).

The CIPROBABILITY option sets the probability level used in the calculation of confidence limits for means and totals. The CIMETHOD option controls how confidence limits are formed after bootstrapping: percentile uses simple percentiles of the bootstrapped distribution, whilst tdistribution calculates a standard error from the bootstrapped estimates and then uses the t-distribution to form intervals. The default, automatic, uses the percentile method unless fewer than 400 bootstrap samples have been made.
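
The selection rule and the percentile method might be sketched in Python as below. Two points are assumptions rather than documented behaviour: that automatic falls back to tdistribution when there are fewer than 400 samples (it is the only other setting), and the simple nearest-rank rule used to pick the percentiles.

```python
def resolve_ci_method(cimethod, nboot):
    """Resolve 'automatic' using the 400-sample threshold described above."""
    if cimethod == "automatic":
        # Assumed fallback: the t-distribution method for small numbers of samples.
        return "percentile" if nboot >= 400 else "tdistribution"
    return cimethod

def percentile_interval(boot_estimates, ciprobability=0.95):
    """Nearest-rank percentile limits from the bootstrap estimates (illustrative rule)."""
    ordered = sorted(boot_estimates)
    alpha = (1 - ciprobability) / 2
    last = len(ordered) - 1
    return ordered[round(alpha * last)], ordered[round((1 - alpha) * last)]

# Hypothetical bootstrap estimates 0, 1, ..., 1000.
lower, upper = percentile_interval(list(range(1001)), 0.95)
print(resolve_ci_method("automatic", 1000), lower, upper)
```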

The NINFLUENCE option controls the number of points of high influence printed. The COMPACT option can be used to switch to a compact, plain-text style for the output, designed for printing concise summaries of an analysis. When COMPACT=yes, the information printed depends on the width of the first output channel, with more information being displayed when this can be done without splitting tables.

By default all standard errors and confidence limits are calculated using the conventional approximations. Alternatively, bootstrap methods may be used by setting the NBOOT option to the required number of bootstrap samples. In the case of ratio estimation, the samples are used to form bootstrap estimates of the ratio, which are then applied to the known population totals for X. Bootstrapping is carried out independently in each stratum, using the method described by Särndal et al. (1992, page 442); this involves creating a “pseudopopulation” containing n replicates of each observation, where n is the nearest integer to the expansion raising factor (inverse of inclusion probability) for the stratum. Bootstrap samples of the same size as the original sample are then taken from the pseudopopulation and used to compute the estimates. The SEED option specifies the seed for the random number generator used to construct the bootstrap samples. The default value of zero continues an existing sequence of random numbers or, if the generator has not yet been used in this run of Genstat, initializes the generator automatically.
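
A minimal Python sketch of the pseudopopulation idea for a single stratum under expansion raising follows. The function name is hypothetical, and drawing the bootstrap samples without replacement (mirroring the original design) is an assumption, since the text above does not say how the samples are drawn. Note also that Python's round resolves exact halves to the nearest even integer, which may differ from the procedure's rounding.

```python
import random
from statistics import mean

def bootstrap_stratum_totals(y, n_units, nboot, seed):
    """Bootstrap expansion-raised totals for one stratum via a pseudopopulation."""
    rng = random.Random(seed)
    n = len(y)
    # Replicate each observation "nearest integer to the raising factor" times.
    replicates = round(n_units / n)
    pseudopopulation = [value for value in y for _ in range(replicates)]
    totals = []
    for _ in range(nboot):
        # Assumption: sample without replacement, same size as the original sample.
        resample = rng.sample(pseudopopulation, n)
        totals.append(n_units * mean(resample))
    return totals

# Third stratum of the Orkney oats example: 4 sampled farms out of 11.
totals = bootstrap_stratum_totals([28, 128, 69, 72], n_units=11, nboot=500, seed=1)
print(min(totals), max(totals))
```

The resulting list of totals is what a percentile or t-distribution interval would then be computed from.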

Graphical output is available by setting the PLOT option. The setting single produces a single plot of the response data against X or against the stratum number if X is unset. A fitted line is shown if one of the combined ratio methods is used. The separate setting produces one graph for each stratum, with up to six graphs on each screen. All graphs are plotted on the log scale.

Output can be saved using the parameters TOTALS, SETOTALS, MEANS, SEMEANS, LTOTALS, UTOTALS, LMEANS and UMEANS. These generally save tables classified by the stratification factor but, if option SAVESUMMARY=yes, they instead save scalars containing only the grand total summed over all strata. Ratios can be saved in a table using the RATIOS parameter, whilst the residual variances in each stratum can be saved using VARIANCES; the latter are useful for working out optimal allocation strategies for future surveys. Fitted values and influence statistics may be saved using parameters FITTEDVALUES and INFLUENCE. The fitted value for each unit is its X value multiplied by the appropriate ratio or, where expansion raising is used, the mean Y value for its stratum.
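
The definition of the fitted values can be written out as a short Python sketch. The function and argument names are hypothetical, and missing responses are represented by None.

```python
from statistics import mean

def fitted_values(y, stratum, x=None, ratios=None):
    """Fitted value per unit: x times the stratum ratio, or the stratum mean of y."""
    if x is not None:
        # Ratio raising: base value multiplied by the ratio for the unit's stratum.
        return [xi * ratios[s] for xi, s in zip(x, stratum)]
    # Expansion raising: mean of the observed responses in the unit's stratum.
    observed = {}
    for yi, s in zip(y, stratum):
        if yi is not None:
            observed.setdefault(s, []).append(yi)
    stratum_mean = {s: mean(values) for s, values in observed.items()}
    return [stratum_mean[s] for s in stratum]

expansion_fit = fitted_values([1, 3, None, 10], [1, 1, 1, 2])
ratio_fit = fitted_values([4, 30], [1, 2], x=[10, 20], ratios={1: 0.5, 2: 2.0})
print(expansion_fit, ratio_fit)
```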

Options: PRINT, PLOT, XMISSING, RESTRICTED, STRATUMFACTOR, NINFLUENCE, METHOD, SAVESUMMARY, COMBINEDSTRATUM, ROWS, COLUMNS, NBOOT, SEED, CIPROBABILITY, CIMETHOD, COMPACT.

Parameters: Y, X, LABELS, NUNITS, XTOTALS, TOTALS, SETOTALS, MEANS, SEMEANS, RATIOS, FITTEDVALUES, INFLUENCE, LTOTALS, UTOTALS, LMEANS, UMEANS, VARIANCES.

Method

The methods used are described in most survey analysis textbooks; see, for example, Sampford (1962) or Lehtonen & Pahkinen (1994). Most calculations are carried out using Genstat table structures.

Action with `RESTRICT`

The action with RESTRICT depends on the setting of the RESTRICTED option. By default, restricted units are totally excluded from the analysis. If RESTRICTED is set to add, restricted observations are excluded from the ratio calculations but then added back into the total estimates; this is a technique for dealing with nonrepresentative outliers (see e.g. Lee, 1995), which are believed to be genuine observations but are not representative of the wider population.
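
One plausible reading of the add setting for a single stratum under ratio raising is sketched below in Python. The exact way the procedure adds restricted units back is not spelled out above, so the formula (the ratio is estimated from the unrestricted units, applied to the remaining base total, and the restricted responses are then added as themselves) is an assumption, as are the function name and the data.

```python
def ratio_total_restricted_add(x, y, restricted, x_total):
    """Ratio-raised total with restricted units counted only as themselves."""
    kept = [(xi, yi) for xi, yi, r in zip(x, y, restricted) if not r]
    held = [(xi, yi) for xi, yi, r in zip(x, y, restricted) if r]
    # Ratio estimated from the unrestricted (representative) units only.
    ratio = sum(yi for _, yi in kept) / sum(xi for xi, _ in kept)
    # Base total left to be raised once the restricted units are set aside.
    x_remaining = x_total - sum(xi for xi, _ in held)
    return ratio * x_remaining + sum(yi for _, yi in held)

# Hypothetical sample in which the last unit is a nonrepresentative outlier.
total = ratio_total_restricted_add(
    x=[10, 20, 30, 1000], y=[5, 10, 15, 900],
    restricted=[False, False, False, True], x_total=1200)
print(total)
```

In this example the ratio is 30/60 = 0.5, so the estimate is 0.5 × 200 + 900; the outlier contributes its own response but does not influence the ratio.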

References

Lee, H. (1995). Outliers in Business Surveys. Chapter 26 of Business Survey Methods (ed. Cox, Binder, Chinnappa, Christianson, Colledge & Kott). Wiley, New York.

Lehtonen, R. & Pahkinen, E.J. (1994). Practical Methods for Design and Analysis of Complex Surveys. Wiley, New York.

Sampford, M.R. (1962). An Introduction to Sampling Theory. Oliver & Boyd, London.

Särndal, C.-E., Swensson, B. & Wretman, J. (1992). Model Assisted Survey Sampling. Springer-Verlag, New York.

Example

CAPTION      'SVSTRATIFIED example',\
             'Orkney oats data (Sampford, Table 5.1, page 61).';\
             STYLE=meta,plain
" Firstly stratified random sample, entered with sample data
  only, plus table with population size - see Table 6.1, page 73."
VARIATE      Oats
READ         Oats
15 20 18 18 23 27 25 60 28 128 69 72 :
FACTOR       [LEVELS=3; VALUES=4(1,2,3)] Stratum
TABLE        [CLASS=Stratum; VALUES=12,12,11] N
SVSTRATIFIED [PRINT=summary,totals; STRATUMFACTOR=Stratum] Oats; NUNITS=N

" Secondly ratio analysis - data entered as one row for each farm
  in the population - see page 109."
DELETE       [REDEFINE=yes] Oats
READ         Farm
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
29 30 31 32 33 34 35 :
READ         Crops
50 50 52 58 60 60 62 65 65 68 71 74 78 90 91 92 96 110 140 140 156 156 190
198 209 240 274 300 303 311 324 330 356 410 430 :
READ         Oats
17 17 10 16 6 15 20 18 14 20 24 18 23 0 27 34 25 24 43 48 44 45 60 63 70 28
62 59 66 58 128 38 69 72 103 :
" To form the sample of 5 farms used, replace the others with missing values."
CALCULATE    Oats=MVINSERT(Oats; Farm.NI.!(1,15,23,30,33))
SVSTRATIFIED [PRINT=summary,totals,means] Oats; X=Crops

Updated on March 5, 2019
