SVHOTDECK procedure

Performs hot-deck and model-based imputation for survey data (S.D. Langton).

Options

`PRINT` = string token	Controls printed output (`summary`, `monitoring`, `check`, `list`, `regression`); default `summ`
`METHOD` = string token	Imputation method (`hotdeck`, `modelbased`); default `hotd`
`DMETHOD` = string token	Method for calculating distances (`mean`, `minimax`, `regression`); default `mini`
`%THRESHOLD` = scalar	Percentage threshold for matches
`THRESHOLD` = scalar	Absolute threshold for matches
`DVARIABLES` = variates or factors	Variates or factors to use for the distance calculation
`DRANGES` = scalars	Ranges to use for distance calculations with each of the `DVARIABLES`; default `*` uses the observed range
`LABELS` = variate, factor or text	Provides labels for the cases
`SEED` = scalar	Seed for random numbers; default 0
`IMPUTE` = variate or scalar	A variate provides logical (0 or 1) values to indicate whether each unit is to be imputed; alternatively, a scalar specifies a number of rows to be selected at random to be imputed, to allow the effectiveness of the imputation process to be studied; default `*` imputes values for any units where an `OLDSTRUCTURE` contains a missing value
`DONORS` = variate	Logical variate indicating whether each unit can be used as a donor; default `*` implies that all units with complete data for each `OLDSTRUCTURE` are used
`RSAVE` = rsave	Regression analysis to use for `METHOD=modelbased` or `DMETHOD=regression`
`URECEPTORS` = variate	Saves unit numbers of receptor (imputed) cases
`UDONORS` = variate	Saves unit numbers of donor cases
`DISTANCES` = variate	Saves the distances for the chosen receptor-donor pairs

Parameters

`OLDSTRUCTURE` = variates or factors	Structure containing missing values
`NEWSTRUCTURE` = variates or factors	New structures with imputed values
`OVERWRITE` = string tokens	Whether to overwrite any existing data for imputed cases (`yes`, `no`); default `no`

Description

Survey data frequently contain missing values. When all the information is missing for a sample unit it is generally appropriate to allow for this by modifying the weights, but when only certain variables are missing (item non-response) imputation is often used to fill in the missing values. SVHOTDECK performs “hot-deck” imputation (see for example Korn & Graubard 1999) whereby replacement values are taken from another unit, chosen at random, usually from a list of suitable matches determined on the basis of a suitable distance metric. The procedure can also be used for model-based imputation; in this case the imputed value is taken as the sum of the fitted value from a regression model and a residual chosen at random from another unit. In the description below “donor” is used to mean a unit supplying data to a “receptor” that has a missing value initially.

The data are usually supplied by the OLDSTRUCTURE parameter, in variates and/or factors, containing missing values. The NEWSTRUCTURE parameter supplies new variates or factors to contain the values of each OLDSTRUCTURE variate or factor, but with the missing values replaced by the imputed values. By default, imputation is carried out for any row of data where an OLDSTRUCTURE contains missing values. Alternatively, the rows to be imputed can be specified by setting option IMPUTE. This can supply a logical variate, containing the value one in the units whose values are to be imputed, and zero elsewhere, or it can supply a scalar specifying a number of rows to be selected at random to be imputed. The scalar setting is useful if you want to study the effectiveness of the imputation process.

By default, imputed values will be used only to replace the missing values in each OLDSTRUCTURE, unless the corresponding setting of the OVERWRITE parameter is yes. Imputed values are then inserted even if the original value is not missing. This would allow you, for example, to compare real and imputed data in order to check the efficiency of the imputation process. Alternatively, you might set OVERWRITE=yes for every OLDSTRUCTURE in order to preserve the correlations between the variables by taking all the values from each donor.

By default, any row of OLDSTRUCTURE with no missing values may be used as a donor, unless option DONORS is used to specify a logical variate to indicate the rows that are to act as potential donors.
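As a minimal sketch of the DONORS option, the statements below restrict the pool of potential donors. They assume that Farm, Crops and Oatsmiss have been set up as in the Example section; the identifiers Candonate and Oatsimp4, the cut-off of 300 and the seed are hypothetical choices made only for illustration.

```genstat
"Donors: farms with Crops below 300 that are not themselves being imputed"
CALCULATE Candonate = (Crops < 300) .AND. (Farm.NI.!(17,23,30))
SVHOTDECK [PRINT=summary,list; DVARIABLES=Crops; DONORS=Candonate;\
          SEED=45678] Oatsmiss; NEWSTRUCTURE=Oatsimp4
```

With this setting, a receptor such as farm 30 (Crops = 311) can be matched only to farms below the cut-off.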

The DVARIABLES option is used to supply one or more variables to use to determine the matching between donors and receptors. In the simplest case, if you set DVARIABLES to a single factor, the donor for each receptor is selected at random from the potential donors with the same factor value (e.g. to replace observations by others from the same stratum). For more complex matching, DVARIABLES can be set to a list of variates or factors which are then used to determine a distance between each receptor and the potential donors. By default the distance for a DVARIABLES variate is calculated as

d = |x_i – x_j| / r

where r is the observed range of the data, but an alternative value of r may be supplied using the DRANGES option. DRANGES should be set to 1 if no scaling of the distances is required. For a DVARIABLES factor a simple matching criterion is used, so d = 0 if x_i and x_j are the same, and d = 1 if they are not.
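The distance formula can be reproduced by hand. The sketch below assumes the Farm and Crops variates from the Example section and computes distances from farm 17 (Crops = 96) to every farm; the identifiers r, Dscaled and Draw are hypothetical, and this is an illustration of the formula rather than the procedure's own code.

```genstat
"Distances from farm 17 (Crops = 96) to every farm, calculated by hand"
SCALAR    r
CALCULATE r = MAXIMUM(Crops) - MINIMUM(Crops)  "observed range: 430 - 50 = 380"
CALCULATE Dscaled = ABS(Crops - 96) / r        "default scaling by the range"
CALCULATE Draw = ABS(Crops - 96)               "equivalent to DRANGES=1"
PRINT     Farm,Crops,Dscaled,Draw
```

For farm 16 (Crops = 92), for instance, the unscaled distance is 4 and the scaled distance is 4/380, about 0.0105.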

Matches are then determined using these distances according to a “minimax” approach, where the best match is the one with the minimum value of the maximum absolute difference between any of the DVARIABLES. Alternatively you can set the DMETHOD option to mean to use the mean of the absolute differences, or to regression to request that the distances are determined on the basis of predictions from a regression.
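The difference between the minimax and mean criteria can be seen in a small hand calculation. The variates x1 and x2 below are hypothetical data (they are not part of the Example), unit 1 is treated as the receptor, and the observed ranges (10 and 4) are used for scaling.

```genstat
"Hypothetical matching variables; unit 1 (x1 = 10, x2 = 5) is the receptor"
VARIATE   [VALUES=10,12,20,15] x1
VARIATE   [VALUES=5,9,6,5] x2
CALCULATE d1 = ABS(x1 - 10) / 10       "range of x1 is 20 - 10 = 10"
CALCULATE d2 = ABS(x2 - 5) / 4         "range of x2 is 9 - 5 = 4"
"Combine the per-variable distances across the two variates"
CALCULATE Dminimax = VMAXIMA(!P(d1,d2))
CALCULATE Dmean = VMEANS(!P(d1,d2))
PRINT     x1,x2,d1,d2,Dminimax,Dmean
```

Here unit 4 is the best match under both criteria (maximum 0.5, mean 0.25), whereas units 2 and 3 tie under minimax (both 1.0) but differ under the mean (0.6 and 0.625).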

The RSAVE option specifies the regression analysis to use when DMETHOD=regression. The terms in the model must include the DVARIABLES. If RSAVE is not specified, the most recent regression analysis is used. The calculation of the distances between units is then weighted by the appropriate regression coefficients: for example, if the slope of x1 is 0.24 and two units have x1 values of 10 and 20, the distance is

(20 – 10) × 0.24 = 2.4.

DRANGES are ignored when DMETHOD=regression.
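A minimal sketch of regression-based distances follows, assuming Crops and Oatsmiss from the Example section. The regression is fitted immediately beforehand so that SVHOTDECK can pick it up as the most recent analysis (RSAVE is left unset); Oatsimp5 and the seed are hypothetical.

```genstat
"Fit the regression whose coefficients will weight the distances"
MODEL     Oatsmiss
FIT       Crops
SVHOTDECK [PRINT=summary,regression; DMETHOD=regression; DVARIABLES=Crops;\
          SEED=56781] Oatsmiss; NEWSTRUCTURE=Oatsimp5
```

The only term in the model is Crops, which satisfies the requirement that the model terms include the DVARIABLES.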

Conventional hot-deck imputation is the default method. Alternatively, if you set option METHOD=modelbased, SVHOTDECK will do model-based imputation. Note, though, that this cannot be used if DMETHOD=regression. Model-based imputation uses a regression analysis, specified by the RSAVE option. If RSAVE is not specified, the most recent regression analysis is used. The method creates an imputed value by adding the residual of the selected donor to the fitted value for the receptor. This method can be used only if the OLDSTRUCTURE is the same as the y-variate in the regression. DVARIABLES will frequently be left unset in this situation, so that the residuals are chosen completely at random. However, in some situations it may be preferable to select residuals from similar units, in which case DVARIABLES can be used to determine the matching, as with the hot-deck method.
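Model-based imputation might be requested as in the sketch below, which assumes Crops and Oatsmiss from the Example section. Oatsmiss is used both as the y-variate of the regression and as the OLDSTRUCTURE, as the method requires; DVARIABLES is left unset so that residuals are drawn at random. Oatsimp6 and the seed are hypothetical.

```genstat
"Regression supplying the fitted values and residuals"
MODEL     Oatsmiss
FIT       Crops
"Imputed value = fitted value + residual taken from a randomly chosen donor"
SVHOTDECK [PRINT=summary,list; METHOD=modelbased; SEED=67812]\
          Oatsmiss; NEWSTRUCTURE=Oatsimp6
```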

By default, SVHOTDECK will determine the single best match for each unit, where possible. In many cases (e.g. when doing multiple imputation), it is required to select one at random from the closest matches. The %THRESHOLD option specifies the tolerance to use in these situations: for example, setting %THRESHOLD to 10 requests that the match is selected at random from amongst the donors with distance up to 10% greater than the minimum distance. The SEED option specifies the seed for the random numbers that are used for this operation (default 0).

Alternatively, if it is desired to specify the distance relative to the minimum in absolute terms, the THRESHOLD option should be used instead. If both THRESHOLD and %THRESHOLD are set, both criteria must be met. The THRESHOLD value is normally set relative to the minimum distance, but if it is set to a negative value this is taken to mean that a match is selected at random from those with a distance less than the absolute value of the THRESHOLD. Thus, for example, if THRESHOLD is set to -0.2 and DMETHOD=mean, any units with a mean distance of less than 0.2 (after taking into account settings of DRANGES) from the unit to be imputed are considered matches, and one of these is selected at random. Alternatively, if THRESHOLD is set to 0.2 and the best match is for example 0.18, any units with a mean distance of less than 0.18 + 0.2 = 0.38 are considered matches, and one of these is selected at random.
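The three threshold rules can be compared on a small set of hypothetical distances, with the best match at 0.18 as in the text. The sketch only marks which donors would fall inside each candidate set; the variate d and the other identifiers are illustrative, and whether the procedure treats the limits as strict or inclusive at the boundary is an assumption here.

```genstat
"Hypothetical distances from one receptor to four potential donors"
VARIATE   [VALUES=0.18,0.19,0.30,0.45] d
SCALAR    dmin
CALCULATE dmin = MINIMUM(d)            "best match: 0.18"
CALCULATE Pct10 = d <= 1.1*dmin        "%THRESHOLD=10: limit 0.198"
CALCULATE Abs02 = d < dmin + 0.2       "THRESHOLD=0.2: limit 0.38"
CALCULATE Neg02 = d < 0.2              "THRESHOLD=-0.2: limit 0.2"
PRINT     d,Pct10,Abs02,Neg02; DECIMALS=2,0,0,0
```

Donors 1 and 2 qualify under %THRESHOLD=10 and under THRESHOLD=-0.2, while THRESHOLD=0.2 also admits donor 3.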

The URECEPTORS and UDONORS options can be used to save the unit numbers of the receptor (imputed) cases and the donor cases, respectively. Note that, if the IMPUTE option is set, the OLDSTRUCTURE and NEWSTRUCTURE parameters need not be set. The use of URECEPTORS and UDONORS then allows more complicated methods of replacement to be used than those provided directly by SVHOTDECK.
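The sketch below saves the matching without producing any imputed structure, assuming Farm and Crops from the Example section. No OLDSTRUCTURE is given, so the receptors are specified with IMPUTE and, as a precaution, the same farms are excluded from the donors explicitly. The identifiers Toimpute, Candonate, Urec, Udon and Dist are hypothetical.

```genstat
"Receptors are farms 17, 23 and 30; all other farms may donate"
CALCULATE Toimpute = Farm.IN.!(17,23,30)
CALCULATE Candonate = Farm.NI.!(17,23,30)
SVHOTDECK [PRINT=*; DVARIABLES=Crops; IMPUTE=Toimpute; DONORS=Candonate;\
          SEED=78123; URECEPTORS=Urec; UDONORS=Udon; DISTANCES=Dist]
PRINT     Urec,Udon,Dist
```

The saved unit numbers can then be used in your own calculations to copy or combine values from each donor.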

Printed output and plots are controlled by the PRINT option, with the settings:

`monitoring`	provides information about each match,
`summary`	provides a summary,
`list`	produces a list of receptors and donors,
`check`	prints correlations as well as giving a scatter plot of the predictions against the actual data, and
`regression`	gives details of the model used when `DMETHOD` is set to `regression`.

To use check it is necessary to impute for data values that are present. This can be achieved either by specifying these units using IMPUTE, or by setting IMPUTE to a scalar, in which case the appropriate number of rows will be selected at random.
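A check of this kind might be set up as follows, assuming the complete Oats and Crops variates from the Example section. The number of rows (10), the seed and the identifier Oatscheck are hypothetical, and OVERWRITE=yes is included on the assumption that the imputed values should also be stored in the new variate.

```genstat
"Impute 10 randomly chosen farms whose true values are known"
SVHOTDECK [PRINT=summary,check; DVARIABLES=Crops; IMPUTE=10; SEED=81234]\
          Oats; NEWSTRUCTURE=Oatscheck; OVERWRITE=yes
```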

Options: PRINT, METHOD, DMETHOD, %THRESHOLD, THRESHOLD, DVARIABLES, DRANGES, LABELS, SEED, IMPUTE, DONORS, RSAVE, URECEPTORS, UDONORS, DISTANCES.

Parameters: OLDSTRUCTURE, NEWSTRUCTURE, OVERWRITE.

Action with `RESTRICT`

SVHOTDECK takes restrictions from any OLDSTRUCTURE or DVARIABLES vectors. Only unrestricted units are used as either donors or receptors. However, restrictions on IMPUTE and DONORS are ignored.
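For example, the following sketch restricts the analysis to farms with Crops below 300, assuming Crops and Oatsmiss from the Example section; Oatsimp7 and the seed are hypothetical. Farm 30 (Crops = 311) is then excluded, so it is neither imputed nor available as a donor.

```genstat
"Restrict the structure to be imputed, then remove the restriction afterwards"
RESTRICT  Oatsmiss; CONDITION=Crops<300
SVHOTDECK [PRINT=summary,list; DVARIABLES=Crops; SEED=91234]\
          Oatsmiss; NEWSTRUCTURE=Oatsimp7
RESTRICT  Oatsmiss
```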

References

Korn, E.L. & Graubard, B.I. (1999). Analysis of Health Surveys. Wiley, New York.

Example

CAPTION   'SVHOTDECK example',\
          'Orkney oats data (Sampford, Table 5.1, page 61).';\
          STYLE=meta,plain
VARIATE   Oats
READ      Farm
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
29 30 31 32 33 34 35 :
READ      Crops
50 50 52 58 60 60 62 65 65 68 71 74 78 90 91 92 96 110 140 140 156 156 190
198 209 240 274 300 303 311 324 330 356 410 430 :
READ      Oats
17 17 10 16 6 15 20 18 14 20 24 18 23 0 27 34 25 24 43 48 44 45 60 63 70 28
62 59 66 58 128 38 69 72 103 :

"Insert some missing values to impute"
CALCULATE Oatsmiss = MVINSERT(Oats; Farm.IN.!(17,23,30))
"First, the nearest match.  Set DRANGES to 1 to make distances easy to interpret"
SVHOTDECK [PRINT=summary,list,monitoring; DVARIABLES=Crops; DRANGES=1;\
          SEED=600209] Oatsmiss; NEWSTRUCTURE=Oatsimp1
"now pick at random from those within 20 acres of nearest match on crops"
SVHOTDECK [PRINT=summary,list,monitoring; DVARIABLES=Crops; DRANGES=1;\
          THRESHOLD=20; SEED=12345] Oatsmiss; NEWSTRUCTURE=Oatsimp2
"and at random from those differing in crop area by 20 acres or less"
SVHOTDECK [PRINT=summary,list,monitoring; DVARIABLES=Crops; DRANGES=1;\
          THRESHOLD=-20; SEED=23456] Oatsmiss; NEWSTRUCTURE=Oatsimp3
PRINT     Farm,Crops,Oats,Oatsmiss,Oatsimp1,Oatsimp2,Oatsimp3;\
          DECIMALS=0; FIELD=9

Updated on March 5, 2019
