Checks the fit of a linear, generalized linear or nonlinear regression (P.W. Lane, R. Cunningham & C. Donnelly).
Options
PRINT = string tokens | What to print (index, y, residuals, leverages, Cook); default * |
---|---|
RMETHOD = string token | Type of residual to use (deviance, Pearson, simple, deletion); default * i.e. as set in MODEL |
INDEX = variate or factor | Which variable to use as index; default !(1...n) |
ENVELOPE = string token | Type of envelope with Normal and half-Normal plots (none, rough, smooth, asymptotic); default none |
PROBABILITY = scalar | Approximate probability level for envelope; default 0.95 |
NSIMULATIONS = scalar | How many simulations to generate for rough or smooth envelopes; default (1+PROB)/(1-PROB) |
SHADE = string token | Whether to show shaded envelope rather than boundaries (no, yes); default no |
RESIDUALS = variate | To store chosen type of residuals; default * |
LEVERAGES = variate | To store leverages; default * |
COOK = variate | To store modified Cook’s statistics; default * |
GRAPHICS = string token | Type of graphics to use (lineprinter, highresolution); default high |
TITLE = text | Title for graph; default identifier of response |
WINDOW = numbers | Window or series of windows in which to display graphs; default 4, or 5…8 for composite |
SCREEN = string token | Treatment of previous graphics screen (clear, keep); default clea |
SAVE = regression save structure | Specifies which model to check; default * |
Parameters
YSTATISTIC = string tokens | What to display in the graph (residuals, Cook, leverages, absresiduals); default resi |
---|---|
XMETHOD = string tokens | What type of graph (fittedvalues, index, normal, halfnormal, histogram, composite); default comp |
Description
Procedure RCHECK provides “diagnostic” information for checking the fit of regression models. The regression directives themselves make some checks, such as for large residuals and influential points, and give access to simple and standardized residuals and leverages through directive RKEEP. The RCHECK procedure automatically accesses these quantities via RKEEP and in addition can calculate deletion residuals and modified Cook’s statistics. A range of graphs can then be drawn to help check the fit of the regression model. The defaults are intended to provide a sensible display from the simple command RCHECK following the fit of a regression model.
The procedure is controlled by the YSTATISTIC and XMETHOD parameters. These can be set to display various types of residuals, as specified by the RMETHOD option; the default is the setting of this option in the MODEL command in force when the model was fitted. In addition, the absolute residuals, the leverages, or the modified Cook’s statistics can be displayed. Each of these sets of statistics can be plotted against the fitted values or against an index variable; by default, the index just orders the values in the order of the units. The statistics can also be shown as Normal or half-Normal plots, or as a histogram (the Normal plot for absolute residuals being the same as the half-Normal plot). A set of four such plots is displayed as a composite picture: histogram, plot against fitted values, Normal plot and half-Normal plot (with an index plot replacing the Normal plot for absolute residuals). Graphs can be displayed in line-printer style by setting the GRAPHICS option, though some features are not then available.
The chosen type of residuals, the leverages and the modified Cook’s statistics can be printed by setting the PRINT option, or stored in variates using the RESIDUALS, LEVERAGES and COOK options.
Plots of the residuals against fitted values or an index variable are displayed with a smoothed line fitted through the points, to indicate any potential trend.
Normal and half-Normal plots can be enhanced with an “envelope” by setting the ENVELOPE option. The rough setting produces an upper and lower bound for the values, and a median line, produced by simulation. The bounds correspond approximately to individual confidence intervals for each value, with probability as set by the PROBABILITY option (default 95%). The number of simulations by default is the minimum to allow estimation of the required limits: this is (1+PROBABILITY) / (1-PROBABILITY). A larger number of simulations can be requested with the NSIMULATIONS option, to give better estimates at the expense of more computing time. The smooth setting requests that the bounds are smoothed, using a cubic smoothing spline with 4 d.f. The asymptotic setting produces bounds calculated from the asymptotic distribution of Normal order statistics. The envelope for all these settings can be displayed as a shaded region rather than as a set of three lines by setting the SHADE option to yes.
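The default number of simulations described above can be checked with a little arithmetic. The following is a minimal Python sketch, not Genstat code; the function name `default_nsimulations` is hypothetical, and rounding to the nearest integer is an assumption made here to absorb floating-point error (the text does not say how non-integer results are handled).

```python
def default_nsimulations(probability: float = 0.95) -> int:
    """Minimum number of simulations for a given envelope probability.

    Implements (1 + PROBABILITY) / (1 - PROBABILITY) from the text.
    Rounding to the nearest integer is an assumption of this sketch.
    """
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return round((1.0 + probability) / (1.0 - probability))


# The default PROBABILITY of 0.95 gives 39 simulations.
print(default_nsimulations(0.95))
print(default_nsimulations(0.90))
```

With the default probability of 0.95 this gives 39 simulated sets, so the smallest and largest of the simulated values at each position can serve directly as the bounds.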
Envelopes cannot be calculated for nonlinear models or curves, nor for generalized linear models with inverse Normal, negative binomial, geometric, multinomial or calculated distributions. Nor can they be produced for deletion residuals or Cook’s statistics; they are not appropriate for leverages, which have no associated distributional assumption.
The graphical displays can be controlled as usual using the TITLE and SCREEN options. The WINDOW option can be used to select defined windows for high-resolution plots. Otherwise window 4 is used for a single plot or windows 5-8 for composite plots. These are redefined if necessary to fill the frame.
The colours and symbols used in the displays can be controlled by setting the attributes of the following pens with the PEN directive before calling the procedure:
pen 2 | zero lines in fitted-value, Normal and index plots; |
---|---|
pen 3 | points and histogram bars; |
pen 4 | shading of envelopes; |
pen 5 | smooth line in fitted-value and index plots of residuals, and envelope bounds if unshaded. |
The procedure exits if there are fewer than four observations, or fewer than two non-missing standardized residuals.
Options: PRINT, RMETHOD, INDEX, ENVELOPE, NSIMULATIONS, PROBABILITY, SHADE, RESIDUALS, LEVERAGES, COOK, GRAPHICS, TITLE, WINDOW, SCREEN, SAVE.
Parameters: YSTATISTIC, XMETHOD.
Method
Standardized residuals and leverages are accessed using RKEEP from the latest fitted regression model, or from that specified by the SAVE option. Deletion residuals $d_i$ are calculated for linear models as follows:
$$d_i = \frac{r_i}{\sqrt{(n - p - r_i^2) / (n - p - 1)}}$$
where $r_i$ are the standardized residuals, $n$ is the number of observations, and $p$ is the number of parameters in the model. For generalized linear models,
$$d_i = \mathrm{SIGN}(\mathit{rd}_i) \times \sqrt{(1 - l_i) \times \mathit{rd}_i^2 + l_i \times \mathit{rp}_i^2}$$
where $\mathit{rd}_i$ and $\mathit{rp}_i$ are the standardized deviance and Pearson residuals respectively.
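The two deletion-residual formulas above can be written out as a short numerical sketch. This is Python rather than Genstat, and the function names are hypothetical; the inputs are assumed to be the standardized residuals and leverages that RKEEP would supply.

```python
import math


def deletion_residual_linear(r: float, n: int, p: int) -> float:
    """Deletion residual for a linear model.

    r is a standardized residual, n the number of observations and
    p the number of parameters, as defined in the text.
    """
    return r / math.sqrt((n - p - r * r) / (n - p - 1))


def deletion_residual_glm(rd: float, rp: float, leverage: float) -> float:
    """Deletion residual for a generalized linear model.

    rd and rp are the standardized deviance and Pearson residuals.
    The sign is taken from rd (and is zero when rd is zero).
    """
    sign = (rd > 0) - (rd < 0)
    return sign * math.sqrt((1.0 - leverage) * rd * rd + leverage * rp * rp)


# A standardized residual of 2 with n = 17 and p = 2 is inflated slightly.
print(deletion_residual_linear(2.0, n=17, p=2))
# The sign follows the deviance residual.
print(deletion_residual_glm(rd=-3.0, rp=-4.0, leverage=0.5))
```

Note that in the linear case a standardized residual of exactly 1 is left unchanged, while larger residuals are inflated.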
Modified Cook’s statistics $c_i$ are calculated as follows:
$$c_i = \mathrm{ABS}(d_i) \times \sqrt{\frac{(n - p) \times l_i}{p \times (1 - l_i)}}$$
where $l_i$ are the leverages. In Normal plots, the Normal quantiles are calculated as follows:
$$q_i = \mathrm{NED}\left(\frac{i - 0.375}{n + 0.25}\right)$$
while for a half-Normal plot they are given by
$$q_i = \mathrm{NED}\left(0.5 + 0.5 \times \frac{i - 0.375}{n + 0.25}\right)$$
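As an illustration of the formulas above, here is a minimal Python sketch of the modified Cook’s statistic and the plotting quantiles. The function names are hypothetical, and `NormalDist().inv_cdf` from the Python standard library is used as a stand-in for the NED (Normal equivalent deviate) function.

```python
import math
from statistics import NormalDist


def modified_cook(d: float, leverage: float, n: int, p: int) -> float:
    """Modified Cook's statistic from a deletion residual d and a leverage."""
    return abs(d) * math.sqrt((n - p) * leverage / (p * (1.0 - leverage)))


def normal_quantiles(n: int, half: bool = False) -> list:
    """Quantiles for a Normal plot, or a half-Normal plot if half is True."""
    ned = NormalDist().inv_cdf  # stand-in for NED
    probs = [(i - 0.375) / (n + 0.25) for i in range(1, n + 1)]
    if half:
        # Map the probabilities into the upper half of the distribution.
        probs = [0.5 + 0.5 * u for u in probs]
    return [ned(u) for u in probs]


print(modified_cook(d=2.0, leverage=0.5, n=10, p=2))
print(normal_quantiles(5))
print(normal_quantiles(5, half=True))
```

The Normal quantiles are symmetric about zero, whereas the half-Normal quantiles are all positive, which is why the half-Normal plot suits absolute residuals.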
For generalized linear models, fitted values are transformed by an approximate variance-stabilizing transformation before use in graphs:
Poisson, multinomial, negative binomial, geometric | 2 × SQRT(fitted) |
---|---|
binomial, Bernoulli | 2 × ANG(100 × fitted / nbinomial) |
gamma, exponential | LOG(fitted) |
inverse Normal | 1 / fitted |
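The table can be read as a simple lookup from distribution to transformation. The sketch below is Python, not Genstat, with a hypothetical function name and hypothetical distribution labels. It assumes LOG is the natural logarithm, and it deliberately leaves out the binomial and Bernoulli row, because that row depends on Genstat’s ANG function, which is not reproduced here.

```python
import math


def stabilize_fitted(fitted: float, distribution: str) -> float:
    """Approximate variance-stabilizing transformation of a fitted value.

    Covers the rows of the table that need only elementary functions.
    The binomial/Bernoulli row (which uses ANG) is intentionally omitted.
    """
    name = distribution.lower()
    if name in {"poisson", "multinomial", "negative binomial", "geometric"}:
        return 2.0 * math.sqrt(fitted)
    if name in {"gamma", "exponential"}:
        return math.log(fitted)  # natural logarithm (assumed meaning of LOG)
    if name == "inverse normal":
        return 1.0 / fitted
    raise ValueError(f"no transformation sketched for {distribution!r}")


print(stabilize_fitted(4.0, "Poisson"))
print(stabilize_fitted(1.0, "gamma"))
print(stabilize_fitted(4.0, "inverse Normal"))
```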
The smoothed line displayed for fitted-value or index plots is calculated as a straight line if the number n of distinct explanatory values is greater than 3 but no more than 9. For larger n it is a cubic smoothing spline, with 2 d.f. for n>9, 3 for n>34 or 4 for n>59.
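The degrees-of-freedom thresholds for the smoothing spline can be expressed as a small lookup. This is a hedged Python sketch with a hypothetical function name; it covers only the spline thresholds stated above and returns `None` when n is 9 or fewer, where no spline is used.

```python
from typing import Optional


def smoothing_spline_df(n_distinct: int) -> Optional[int]:
    """Degrees of freedom for the smoothing spline, by number of distinct x values.

    Returns None for n_distinct <= 9, where the thresholds in the text
    do not select a spline.
    """
    if n_distinct > 59:
        return 4
    if n_distinct > 34:
        return 3
    if n_distinct > 9:
        return 2
    return None


for n in (9, 10, 35, 60):
    print(n, smoothing_spline_df(n))
```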
For Normal linear models, envelopes are calculated by default from ns sets of Normal random numbers, where
ns = (1 + PROBABILITY) / (1 - PROBABILITY).
If the number of observations is less than 100, the values are transformed using the projection matrix to induce the observed correlation pattern of the data; for larger datasets, no transformation is done. The values are then ordered and the minimum and maximum values determine the envelope boundaries. If ns is set by the NSIMULATIONS option, the boundaries are calculated with the QUANTILES function from the ns values generated for each ordered residual. For generalized linear models, ns sets of values of the response variate are generated from the distribution, with parameters estimated from the current fit. The model is refitted to each set, and the residuals extracted and dealt with as for the transformed Normal values above.
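To make the simulation step concrete, here is a simplified Python sketch of a rough envelope for a Normal linear model. It is not the procedure’s own code: the function name and the `seed` argument are hypothetical, it skips the projection-matrix transformation (as the text describes for larger datasets), and it uses the minimum and maximum over the simulated sets, which corresponds to the default number of simulations rather than to the QUANTILES-based calculation.

```python
import random
import statistics


def rough_envelope(n: int, nsim: int = 39, seed: int = 0):
    """Simulate lower, median and upper envelope lines for n ordered values.

    Simplifications (assumptions of this sketch): independent standard
    Normal draws with no projection-matrix transformation, and bounds
    taken as the minimum and maximum over the nsim simulated sets.
    """
    rng = random.Random(seed)
    # Sort each simulated set so that position k holds its k-th order statistic.
    sims = [sorted(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(nsim)]
    # Regroup so that each column collects one order statistic across all sets.
    columns = list(zip(*sims))
    lower = [min(col) for col in columns]
    median = [statistics.median(col) for col in columns]
    upper = [max(col) for col in columns]
    return lower, median, upper


lower, median, upper = rough_envelope(n=17)
for lo, mid, hi in zip(lower, median, upper):
    print(f"{lo:7.3f} {mid:7.3f} {hi:7.3f}")
```

Each of the three lines is non-decreasing across positions, so plotting them against the Normal quantiles gives the band within which the ordered residuals would be expected to fall.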
Action with RESTRICT
Restrictions applied to vectors used in the regression apply also to the RCHECK procedure. Values of diagnostic quantities are set to missing for all excluded units.
See also
Procedures: RDESTIMATES, RGRAPH, APLOT, DRESIDUALS, VPLOT.
Commands for: Regression analysis.
Example
CAPTION 'RCHECK example',\
  !t('Model atmospheric pressure on boiling point',\
  '(data from Atkinson, 1985, Plots, Transformations & Regression).');\
  STYLE=meta,plain
VARIATE [NVALUES=17] Boil,Pressure
READ Boil,Pressure
194.5 20.79  194.3 20.79  197.9 22.40  198.4 22.67  199.4 23.15
199.9 23.35  200.9 23.89  201.1 23.99  201.4 24.02  201.3 24.01
203.6 25.14  204.6 26.57  209.5 28.49  208.6 27.76  210.7 29.04
211.9 29.88  212.2 30.06 :
CALCULATE LogPressure = 100*LOG10(Pressure)
MODEL LogPressure
FIT Boil
CAPTION '1. Plot composite of four displays of the standardized residuals.'
RCHECK
CAPTION !t('2. Plot simple residuals against boiling point,',\
  'and display a Normal plot of simple residuals.')
RCHECK [RMETHOD=simple; INDEX=Boil] Y=2(residual); XMETHOD=index,Normal
CAPTION !t('3. Display a half-Normal plot with a generated envelope,',\
  'that has been smoothed, and display as a shaded area;',\
  'change colours to give dark blue points on cyan background.')
PEN 3,4; COLOUR='blue','aqua'
RCHECK [ENVELOPE=smooth; SHADE=yes] Y=residual; XMETHOD=Normal
CAPTION '4. Print deletion residual, Cook''s statistic and leverage.'
VARIATE [VALUES=1...17] observe; DECIMALS=0
RCHECK [PRINT=index,residual,leverage,cook; RMETHOD=deletion;\
  INDEX=observe; GRAPHICS=*]