Performs hot-deck and model-based imputation for survey data (S.D. Langton).
|Controls printed output (
||Imputation method (
||Method for calculating distances (
||Percentage threshold for matches|
||Absolute threshold for matches|
||Variables to use for distance calculation or factors|
||Ranges to use for distance calculations with each of the
||Provides labels for the cases|
||Seed for random numbers; default 0|
||The variate provides logical (0 or 1) values to indicate whether each unit is to be imputed, alternatively the scalar specifies a number of rows to be selected at random to be imputed to allow the effectiveness of the imputation process to be studied; default
||Logical variate indicating whether each unit can be used as a donor; default
||Regression analysis to use for
||Saves unit numbers of receptor (imputed) cases|
||Saves unit numbers of donor cases|
||Saves the distances for the chosen receptor-donor pairs|
||Structure containing missing values|
||New structures with imputed values|
||Whether to overwrite any existing data for imputed cases (
Survey data frequently contain missing values. When all the information is missing for a sample unit it is generally appropriate to allow for this by modifying the weights, but when only certain variables are missing (item non-response) imputation is often used to fill in the missing values.
SVHOTDECK performs “hot-deck” imputation (see, for example, Korn & Graubard 1999), whereby replacement values are taken from another unit, chosen at random, usually from a list of close matches determined by a distance metric. The procedure can also be used for model-based imputation; in this case the imputed value is taken as the sum of the fitted value from a regression model and a residual chosen at random from another unit. In the description below, “donor” means a unit supplying data to a “receptor”, which is a unit that initially has a missing value.
The data are usually supplied by the
OLDSTRUCTURE parameter, in variates and/or factors, containing missing values. The
NEWSTRUCTURE parameter supplies new variates or factors to contain the values of each
OLDSTRUCTURE variate or factor, but with the missing values replaced by the imputed values. By default, imputation is carried out for any row of data where an
OLDSTRUCTURE contains missing values. Alternatively, the rows to be imputed can be specified by setting option IMPUTE. This can supply a logical variate, containing the value one in the units whose values are to be imputed, and zero elsewhere, or it can supply a scalar specifying a number of rows to be selected at random to be imputed. The scalar setting is useful if you want to study the effectiveness of the imputation process.
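The scalar setting of IMPUTE can be pictured as drawing row numbers at random and flagging them for imputation. The sketch below is written in Python rather than Genstat and is only an illustration: the function name `random_impute_flags` is hypothetical, and how SVHOTDECK itself draws the rows is not described here.

```python
import random

def random_impute_flags(n_units, n_impute, seed=0):
    """Return a 0/1 list flagging n_impute units chosen at random (assumed scheme)."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(n_units), n_impute))
    return [1 if i in chosen else 0 for i in range(n_units)]

# 35 units, as in the example data set; 5 of them are flagged for imputation.
flags = random_impute_flags(n_units=35, n_impute=5, seed=600209)
print(sum(flags))  # 5
```

The resulting list plays the same role as a logical variate supplied through IMPUTE: one in the units to be imputed, zero elsewhere.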
By default, imputed values will be used only to replace the missing values in each
OLDSTRUCTURE, unless the corresponding setting of the
OVERWRITE parameter is
yes. Imputed values are then inserted even if the original value is not missing. This would allow you, for example, to compare real and imputed data in order to check the efficiency of the imputation process. Alternatively, you might set
OVERWRITE=yes for every
OLDSTRUCTURE in order to preserve the correlations between the variables by taking all the values from each donor.
By default, any row of
OLDSTRUCTURE with no missing values may be used as a donor, unless option
DONORS is used to specify a logical variate to indicate the rows that are to act as potential donors.
The DVARIABLES option supplies one or more variables that determine the matching between donors and receptors. In the simplest case, if you set
DVARIABLES to a single factor, the donors are selected at random from units with the same factor value as the receptor (e.g. to replace observations by others from the same stratum). For more complex matching,
DVARIABLES can be set to a list of variates or factors which are then used to determine a distance between each receptor and the potential donors. By default the distance for a
DVARIABLES variate is calculated as
d = |xi – xj| / r
where r is the observed range of the data, but an alternative value of r may be supplied using the
DRANGES option. DRANGES should be set to 1 if no scaling of the distances is required. For a
DVARIABLES factor a simple matching criterion is used, so d = 0 if xi and xj are the same, and d = 1 if they are not.
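As an illustration of the two per-variable distances, here is a minimal Python sketch (not Genstat code, and not the procedure's implementation). The function names and the factor levels are hypothetical; the crop areas are a few values taken from the example data at the end of this page.

```python
def variate_distance(xi, xj, r):
    """Scaled absolute difference d = |xi - xj| / r for a variate."""
    return abs(xi - xj) / r

def factor_distance(xi, xj):
    """Simple matching for a factor: 0 if the levels agree, 1 otherwise."""
    return 0 if xi == xj else 1

crops = [50, 96, 110, 190, 430]       # a few crop areas from the example data
r_default = max(crops) - min(crops)   # observed range of these values

print(variate_distance(96, 110, r_default))  # scaled by the observed range
print(variate_distance(96, 110, 1))          # r = 1, so no scaling: 14.0
print(factor_distance("north", "south"))     # levels differ: 1
```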
Matches are then determined using these distances according to a “minimax” approach, where the best match is the one with the minimum value of the maximum absolute difference between any of the
DVARIABLES. Alternatively you can set the
DMETHOD option to
mean to use the mean of the absolute differences, or to
regression to request that the distances are determined on the basis of predictions from a regression.
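The difference between the minimax and mean criteria can be seen in a small Python sketch (an illustration only, not Genstat code). The per-variable distances and the donor labels A, B and C are made up for the example, and the function names are hypothetical.

```python
def combined_distance(per_variable, method="minimax"):
    """Combine the per-variable distances for one donor into a single distance."""
    if method == "minimax":
        return max(per_variable)                       # largest absolute difference
    if method == "mean":
        return sum(per_variable) / len(per_variable)   # mean absolute difference
    raise ValueError("method must be 'minimax' or 'mean'")

def best_donor(donor_distances, method="minimax"):
    """Return the label of the donor with the smallest combined distance."""
    return min(donor_distances,
               key=lambda k: combined_distance(donor_distances[k], method))

# Hypothetical distances from one receptor to three donors on two variables.
donors = {"A": [0.10, 0.50], "B": [0.35, 0.35], "C": [0.40, 0.45]}
print(best_donor(donors, "minimax"))  # B: its largest difference (0.35) is the smallest
print(best_donor(donors, "mean"))     # A: its mean difference (0.30) is the smallest
```

The two criteria can therefore pick different donors: minimax penalises a donor that is far away on any one variable, whereas the mean allows a close match on one variable to offset a poor match on another.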
The RSAVE option specifies the regression analysis to use when
DMETHOD=regression. The terms in the model must include the
DVARIABLES. If RSAVE is not specified, the most recent regression analysis is used. The calculation of the distances between units is then weighted by the appropriate regression coefficients: for example, if the slope of
x1 is 0.24 and two units have
x1 values of 10 and 20, the distance is
(20 – 10) × 0.24 = 2.4.
DRANGES are ignored when DMETHOD=regression.
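The worked figure above can be reproduced with a short Python sketch (an illustration, not Genstat code). It covers a single variable only; the function name is hypothetical, and taking absolute values so that the distance is non-negative is an assumption.

```python
def regression_distance(xi, xj, slope):
    """Distance between two units on one variable, weighted by its slope (assumed form)."""
    return abs(xi - xj) * abs(slope)

# Slope of 0.24 and x1 values of 10 and 20, as in the text.
print(regression_distance(10, 20, 0.24))  # about 2.4
```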
Conventional hot-deck imputation is the default method. Alternatively, if you set option
SVHOTDECK will do model-based imputation. Note, though, that this cannot be used if
DMETHOD=regression. Model-based imputation uses a regression analysis, specified by the
RSAVE option. If
RSAVE is not specified, the most recent regression analysis is used. The method creates an imputed value by adding the residual of the selected donor to the fitted value for the receptor. This method can be used only if the
OLDSTRUCTURE is the same as the y-variate in the regression.
DVARIABLES will frequently be left unset in this situation, so that the residuals are chosen totally at random. However, in some situations it may be preferable to select residuals from similar units, in which case
DVARIABLES can be used to determine the matching, as with the hot-deck method.
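The idea of model-based imputation, a fitted value plus a residual borrowed from another unit, can be sketched in Python as follows. This is an illustration under stated assumptions, not Genstat code: it fits a simple straight line, chooses the residual totally at random (as when DVARIABLES is unset), and uses hypothetical function names. The data are a few (crop area, oats) pairs from the example at the end of this page, with one oats value treated as missing.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def model_based_impute(x, y, seed=0):
    """Replace None entries of y by fitted value plus a random donor residual."""
    rng = random.Random(seed)
    donors = [i for i, yi in enumerate(y) if yi is not None]
    a, b = fit_line([x[i] for i in donors], [y[i] for i in donors])
    residuals = [y[i] - (a + b * x[i]) for i in donors]
    return [yi if yi is not None else a + b * xi + rng.choice(residuals)
            for xi, yi in zip(x, y)]

crops = [50, 96, 110, 190, 430]
oats = [17, None, 24, 60, 103]   # second value treated as missing
print(model_based_impute(crops, oats, seed=1))
```

Observed values are returned unchanged; only the missing entry is replaced.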
SVHOTDECK will determine the single best match for each unit, where possible. In many cases (e.g. when doing multiple imputation), a donor needs to be selected at random from among the closest matches instead. The
%THRESHOLD option specifies the tolerance to use in these situations: for example, setting
%THRESHOLD to 10 requests that the match is selected at random from amongst the donors with distance up to 10% greater than the minimum distance. The
SEED option specifies the seed for the random numbers that are used for this operation (default 0). Alternatively, if it is desired to specify the distance relative to the minimum in absolute terms, the
THRESHOLD option should be used instead. If both
THRESHOLD and %THRESHOLD are set, both criteria must be met. The
THRESHOLD value is normally interpreted relative to the minimum distance but, if it is set to a negative value, this is taken to mean that a match is selected at random from those with a distance less than the absolute value of the
THRESHOLD. Thus, for example, if
THRESHOLD is set to -0.2 and
DMETHOD=mean, any units with a mean distance of less than 0.2 (after taking into account settings of
DRANGES) from the unit to be imputed are considered matches, and one of these is selected at random. Alternatively, if
THRESHOLD is set to 0.2 and the best match has a distance of, for example, 0.18, any units with a mean distance of less than 0.18 + 0.2 = 0.38 are considered matches, and one of these is selected at random.
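The threshold rules can be summarised in a short Python sketch (an illustration, not Genstat code or the procedure's implementation). The function names are hypothetical, the distances are made up, and the treatment of a distance lying exactly on a limit is an assumption.

```python
import random

def candidate_donors(distances, threshold=None, pct_threshold=None):
    """Return the indices of donors that count as matches under the thresholds."""
    d_min = min(distances)
    limits = []
    if pct_threshold is not None:
        # Within pct_threshold percent of the minimum distance.
        limits.append(d_min * (1 + pct_threshold / 100))
    if threshold is not None:
        # Negative: absolute limit; otherwise: limit relative to the minimum.
        limits.append(abs(threshold) if threshold < 0 else d_min + threshold)
    # If both are set, both criteria must be met, so the tighter limit applies.
    limit = min(limits) if limits else d_min
    return [i for i, d in enumerate(distances) if d <= limit]

def pick_donor(distances, seed=0, **thresholds):
    """Choose one matching donor at random, or None if nothing matches."""
    matches = candidate_donors(distances, **thresholds)
    return random.Random(seed).choice(matches) if matches else None

distances = [0.18, 0.19, 0.30, 0.45]
print(candidate_donors(distances, threshold=0.2))     # [0, 1, 2]: limit is 0.38
print(candidate_donors(distances, threshold=-0.2))    # [0, 1]: limit is 0.2
print(candidate_donors(distances, pct_threshold=10))  # [0, 1]: limit is 0.198
```

With no threshold set, only the closest donor remains, which corresponds to taking the single best match.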
The URECEPTORS and UDONORS options can be used to save the unit numbers of the receptor (imputed) cases and the donor cases, respectively. Note that, if the
IMPUTE option is set, the
OLDSTRUCTURE and NEWSTRUCTURE parameters need not be set. The use of
URECEPTORS and UDONORS then allows more complicated methods of replacement to be used than those provided directly by
SVHOTDECK.
Printed output and plots are controlled by the PRINT option, with the following settings.
||provides information about each match,|
||provides a summary,|
||produces a list of recipients and donors,|
||prints correlations as well as giving a scatter plot of the predictions against the actual data, and|
||gives details of the model used when
For PRINT=check it is necessary to impute for data values that are present. This can be achieved either by specifying these units using
IMPUTE, or by setting
IMPUTE to a scalar, in which case the appropriate number of rows will be selected at random.
SVHOTDECK takes restrictions from any
DVARIABLES vectors. Only unrestricted units are used as either donors or receptors. However, restrictions on
DONORS are ignored.
Korn, E.L. & Graubard, B.I. (1999). Analysis of Health Surveys. Wiley, New York.
Commands for: Survey analysis.
CAPTION   'SVHOTDECK example',\
          'Orkney oats data (Sampford, Table 5.1, page 61).';\
          STYLE=meta,plain
VARIATE   Oats
READ      Farm
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35 :
READ      Crops
 50  50  52  58  60  60  62  65  65  68  71  74  78  90  91  92  96 110
140 140 156 156 190 198 209 240 274 300 303 311 324 330 356 410 430 :
READ      Oats
 17  17  10  16   6  15  20  18  14  20  24  18  23   0  27  34  25  24
 43  48  44  45  60  63  70  28  62  59  66  58 128  38  69  72 103 :
"Insert some missing values to impute"
CALCULATE Oatsmiss = MVINSERT(Oats; Farm.IN.!(17,23,30))
"First nearest match. Set DRANGE to 1 to make distances easy to interpret"
SVHOTDECK [PRINT=summary,list,monitoring; DVARIABLES=Crops; DRANGES=1;\
          SEED=600209] Oatsmiss; NEWSTRUCTURE=Oatsimp1
"now pick at random from those within 20 acres of nearest match on crops"
SVHOTDECK [PRINT=summary,list,monitoring; DVARIABLES=Crops; DRANGES=1;\
          THRESHOLD=20; SEED=12345] Oatsmiss; NEWSTRUCTURE=Oatsimp2
"and at random from those differing in crop area by 20 hectares or less"
SVHOTDECK [PRINT=summary,list,monitoring; DVARIABLES=Crops; DRANGES=1;\
          THRESHOLD=-20; SEED=23456] Oatsmiss; NEWSTRUCTURE=Oatsimp3
PRINT     Farm,Crops,Oats,Oatsmiss,Oatsimp1,Oatsimp2,Oatsimp3;\
          DECIMALS=0; FIELD=9