Forms predictions from a linear or generalized linear model.
Options
PRINT = string token | What to print (description, lsd, predictions, se, sed, vcovariance); default desc, pred, se |
---|---|
CHANNEL = scalar | Channel number for output; default * i.e. current output channel |
COMBINATIONS = string token | Which combinations of factors in the current model to include (full, present, estimable); default esti |
ADJUSTMENT = string token | Type of adjustment (marginal, equal); default marg |
WEIGHTS = table | Weights classified by some or all of the factors in the model; default * |
OFFSET = scalar | Value of offset on which to base predictions; default mean of offset variate |
METHOD = string token | Method of forming margin (mean, total); default mean |
ALIASING = string token | How to deal with aliased parameters (fault, ignore); default faul |
BACKTRANSFORM = string token | What back-transformation to apply to the values on the linear scale, before calculating the predicted means (link, none); default link |
SCOPE = string token | Controls whether the variance of predictions is calculated on the basis of forecasting new observations rather than summarizing the data to which the model has been fitted (data, new); default data |
NOMESSAGE = string tokens | Which warning messages to suppress (dispersion, nonlinear); default * |
DISPERSION = scalar | Value of dispersion parameter in calculation of s.e.s; default is as set in the MODEL statement |
DMETHOD = string token | Basis of estimate of dispersion, if not fixed by DISPERSION option (deviance, Pearson); default is as set in the MODEL statement |
NBINOMIAL = scalar | Supplies the total number of trials to be used for prediction with a binomial distribution (providing a value n greater than one allows predictions to be made of the number of “successes” out of n, whereas the value one predicts the proportion of successes); default 1 |
PREDICTIONS = tables or scalars | Saves predictions for each y variate; default * |
SE = tables or scalars | Saves standard errors of predictions for each y variate; default * |
SED = symmetric matrices | Saves standard errors of differences between predictions for each y variate; default * |
LSD = symmetric matrices | Saves least significant differences between predictions for each y variate (models with Normal errors only); default * |
LSDLEVEL = scalar | Significance level (%) to use in the calculation of least significant differences; default 5 |
VCOVARIANCE = symmetric matrices | Saves variance-covariance matrices of predictions for each y variate; default * |
SAVE = identifier | Specifies save structure of model from which to predict; default * i.e. that from latest model fitted |
Parameters
CLASSIFY = vectors | Variates and/or factors to classify table of predictions |
---|---|
LEVELS = variates, scalars or texts | To specify values of variates, levels of factors |
PARALLEL = identifiers | For each vector in the CLASSIFY list, allows you to specify another vector in the CLASSIFY list with which the values of this vector should change in parallel (you then obtain just one dimension in the table of predictions for these vectors) |
NEWFACTOR = identifiers | Identifiers for new factors that are defined when LEVELS are specified |
Description
The PREDICT directive can be used after the FIT directive to summarize the results of the regression, by using the fitted relationship to predict the values of the response variate at particular values of the explanatory variables. CLASSIFY, the first parameter of PREDICT, specifies those variates or factors in the current regression model whose effects you want to summarize. Any variate or factor in the current model that you do not include will be standardized in some way, as described below.
The LEVELS parameter specifies values at which the summaries are to be calculated, for each of the structures in the CLASSIFY list. For factors, you can select some or all of the levels, while for variates you can specify any set of values. A single level or value is represented by a scalar; several levels or values must be combined into a variate (which may of course be unnamed). Alternatively, if the factor has labels, you can use these to select the levels for the summaries by setting LEVELS to a text. A missing value in the LEVELS parameter is taken by Genstat to stand for all the levels of a factor, or for the mean value of a variate.
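The different forms of LEVELS might be used as in the following sketch. The first three statements assume the regression of Pressure on Boiltemp from the Example section has been fitted; Yield and Variety (a factor assumed to have labels Early, Mid and Late) are hypothetical names that do not appear elsewhere on this page.

```genstat
" Variate: a scalar gives one value, an unnamed variate gives several,
  and a missing value (*) stands for the mean of the variate. "
PREDICT Boiltemp; LEVELS=200
PREDICT Boiltemp; LEVELS=!(190,200,210)
PREDICT Boiltemp; LEVELS=*

" Factor (hypothetical model): an unnamed text selects levels by label.
  COMBINATIONS=full is set because, as described under COMBINATIONS,
  subsets of factor levels are ignored with the other settings. "
MODEL Yield
FIT Variety
PREDICT [COMBINATIONS=full] Variety; LEVELS=!T(Early,Late)
```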
The PARALLEL parameter allows you to indicate that a factor or variate should change in parallel to another factor or variate. Both of these should have the same number of values specified by the LEVELS parameter of PREDICT. The predictions are then formed for each corresponding set of values rather than for every combination of these values. For example, suppose we had fitted a quadratic model with explanatory variates X and Xsquared. We could then put
PREDICT Xsquared,X; PARALLEL=X,*;\
LEVELS=!(0,4,16,36,64,100),!(0,2,4,6,8,10)
The PARALLEL parameter specifies that Xsquared should change in parallel to X, so that we obtain predictions only for matching values.
When you specify LEVELS, PREDICT needs to define a new factor to classify that dimension of the table. By default this will be an unnamed factor, but you can use the NEWFACTOR parameter to give it an identifier. The EXTRA attribute of the factor is set to the name of the corresponding factor or variate in the CLASSIFY list; this will then be used to label that dimension of the table of predictions.
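As a brief sketch, the new factor can be named so that it is available after the statement. This assumes the regression of Pressure on Boiltemp from the Example section; Tempfac and Ptab are hypothetical identifiers.

```genstat
" Name the factor that classifies the Boiltemp dimension of the table
  (Tempfac) and save the predictions in a table (Ptab). "
PREDICT [PREDICTIONS=Ptab] Boiltemp; LEVELS=!(190,200,210); \
   NEWFACTOR=Tempfac
PRINT Ptab
```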
You can best understand how Genstat forms predictions by regarding its calculations as consisting of two steps. The first step, referred to below as Step A, is to calculate the full table of predictions, classified by every factor in the current model. For any variate in the model, the predictions are formed at its mean, unless you have specified some other values using the LEVELS parameter; if so, these are then taken as a further classification of the table of predictions. The second step, referred to as Step B, is to average the full table of predictions over the classifications that do not appear in the CLASSIFY parameter: you can control the type of averaging using the COMBINATIONS, ADJUSTMENT and WEIGHTS options. By default, the predictions are made at the mean of any offset variate, but option OFFSET can be used to specify another value at which the predictions should be made instead.
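The effect of the OFFSET option can be sketched with a log-linear model for rates. All identifiers here (Cases, Pop, Logpop, Agegroup) are hypothetical, and the sketch assumes that the offset was supplied through the OFFSET option of MODEL.

```genstat
" Hypothetical data: Cases (counts), Pop (population at risk) and
  Agegroup (factor). The offset is the log of the population. "
CALCULATE Logpop = LOG(Pop)
MODEL [DISTRIBUTION=poisson; LINK=logarithm; OFFSET=Logpop] Cases
FIT Agegroup

" Default: predictions are made at the mean of Logpop. "
PREDICT Agegroup

" Predictions with the offset set to zero, i.e. for Pop = 1. "
PREDICT [OFFSET=0] Agegroup
```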
Printed output is controlled by settings of the PRINT option:
description | describes the standardization policies used when forming the predictions, |
---|---|
predictions | prints the predictions, |
se | produces predictions and standard errors, |
sed | prints standard errors for differences between the predictions, |
lsd | prints least significant differences between the predictions (ordinary linear regression models or generalized linear models with the Normal distribution only), and |
vcovariance | prints the variances and covariances of the predictions. |
By default, descriptions, predictions and standard errors are printed. The standard errors (and sed’s) are relevant for the predictions when considered as means of those data that have been analysed, with the means formed according to the averaging policy defined by the options of PREDICT. The word prediction is used because these are predictions of what the means would have been if the factor levels had been replicated differently in the data; see Lane & Nelder (1982) for more details. The LSDLEVEL option specifies the significance level (%) to use in the calculation of least significant differences (default 5%).
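For instance, the following sketch requests predictions together with standard errors of differences and least significant differences at the 1% level. It assumes an ordinary linear regression; Yield and Variety are hypothetical names for a response variate and a factor.

```genstat
" Hypothetical one-factor model with Normal errors, as required for lsd. "
MODEL Yield
FIT Variety
PREDICT [PRINT=predictions,sed,lsd; LSDLEVEL=1] Variety
```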
By default, the standard errors (and sed’s) are not augmented by any component corresponding to the estimated variability of a new observation. However, you can set option SCOPE=new to request that the variance of predictions should be calculated on the basis of forecasting new observations rather than of summarizing the data to which the model has been fitted. This setting cannot be used if the predictions are to be standardized for the effects of any factors in the model; in other words, all factors in the current model must be listed in the CLASSIFY parameter of the PREDICT statement. In addition, it cannot be used when making predictions from generalized linear models with option BACKTRANSFORM=none, nor with weighted regression. The effect of SCOPE=new is to form variances for each predicted value by combining the variance of the estimated mean value of the prediction (as produced for SCOPE=data) together with the estimated variance of a new observation with the same values of explanatory variates and factors:
“new” variance = “data” variance + (dispersion × variance function)
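The two settings can be compared with a sketch like the one below, which assumes the regression of Pressure on Boiltemp from the Example section has already been fitted.

```genstat
" Standard errors of the predicted means (the default, SCOPE=data). "
PREDICT [PRINT=predictions,se] Boiltemp; LEVELS=!(190,200,210)

" Standard errors for forecasting new observations at the same values. "
PREDICT [PRINT=predictions,se; SCOPE=new] Boiltemp; LEVELS=!(190,200,210)
```

For an ordinary linear regression the variance function is 1, so the “new” variance is the “data” variance plus the estimated dispersion. This is the quantity that the Example section forms by hand as SQRT(sepred**2 + rss/rdf).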
The DISPERSION and DMETHOD options allow you to change the method by which the variance of the distribution of the response values is obtained for calculating the standard errors. These options operate like the corresponding options of MODEL (except that they apply only to the current statement). The default is to use the method as originally defined by the MODEL statement.
The NBINOMIAL option can be used to supply the total number of trials to be used for prediction with a binomial distribution when option BACKTRANSFORM is set to link. If you provide a value n greater than one, Genstat will predict the number of “successes” out of n. The default, NBINOMIAL=1, causes Genstat to predict the proportion of successes.
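A short sketch of the two kinds of prediction follows, using a hypothetical logistic regression in which Dead is the number of successes out of Ntrial at each value of the variate Dose. None of these names come from this page.

```genstat
" Hypothetical binomial model with the logit link. "
MODEL [DISTRIBUTION=binomial; LINK=logit] Dead; NBINOMIAL=Ntrial
FIT Dose

" Default NBINOMIAL=1: predicted proportions of successes. "
PREDICT Dose; LEVELS=!(1,2,4)

" Predicted numbers of successes out of 50 trials. "
PREDICT [NBINOMIAL=50] Dose; LEVELS=!(1,2,4)
```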
You can send the output to another channel, or to a text structure, by setting the CHANNEL option.
The COMBINATIONS option specifies which cells of the full table in Step A are to be filled for averaging in Step B. The default, COMBINATIONS=estimable, uses all the cells other than those that involve parameters that cannot be estimated, for example because of aliasing. Alternatively, you can set COMBINATIONS=present to exclude cells for factor combinations that do not occur in the data, or COMBINATIONS=full to use all the cells. When COMBINATIONS=estimable or COMBINATIONS=present the LEVELS parameter is overruled. Any subsets of factor levels in the LEVELS parameter are ignored, and predictions are formed for all the factor levels that occur in the data or are estimable. Likewise, the full table cannot then be classified by any sets of values of variates; the LEVELS parameter must then supply only single values for variates.
The ADJUSTMENT and WEIGHTS options define how the averaging is done in Step B. Values in the full table produced in Step A are averaged with respect to all those factors that you have not included in the settings of the CLASSIFY parameter. By default, the levels of any such factor are combined with what we call marginal weights: that is, by the number of occurrences of each of its levels in the whole dataset. The ADJUSTMENT and WEIGHTS options allow you to change the weights. The setting ADJUSTMENT=equal specifies that the levels are to be weighted equally. (This corresponds to the default weighting used by VPREDICT.) The WEIGHTS option is more powerful than the ADJUSTMENT option, allowing you to specify an explicit table of weights. This table can be classified by any, or all, of the factors over whose levels the predictions are to be averaged; the levels of remaining factors will be weighted according to the ADJUSTMENT option. Moreover, you can classify the weights by the factors in the CLASSIFY parameter as well, to provide different weightings for different combinations of levels of these factors. If you supply explicit weights in the WEIGHTS option, any setting of the COMBINATIONS option is ignored. You will find explicit weights useful in particular when you have population estimates of the proportions of each level of a factor, proportions which may not be matched well in the available data.
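The three weighting schemes might be requested as in the sketch below. The identifiers Income, Region, Sex and Popwt are hypothetical, Sex is assumed to be a factor with two levels, and the proportions 0.49 and 0.51 are made-up values for illustration only.

```genstat
" Hypothetical model with two factors; predictions are classified by
  Region and averaged over the levels of Sex. "
MODEL Income
FIT Region + Sex

" Default: marginal weights (numbers of occurrences of each level of Sex). "
PREDICT Region

" Equal weights for the levels of Sex. "
PREDICT [ADJUSTMENT=equal] Region

" Explicit weights: illustrative population proportions for Sex. "
TABLE [CLASSIFICATION=Sex; VALUES=0.49,0.51] Popwt
PREDICT [WEIGHTS=Popwt] Region
```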
If a model contains any aliased parameters, predicted values cannot be formed for some cells of the full table without assuming a value for the aliased parameters. With the default setting, COMBINATIONS=estimable, no predictions are formed for these cells. When COMBINATIONS=full, if the aliased parameters simply represent effects of variates that are correlated with other explanatory variables in the model, it may be sufficient just to ignore them. This can be done by setting the ALIASING option to ignore. The aliased parameters are then taken to be zero, and fitted values are calculated for all cells of the table from the remaining parameters in the model. Aliasing can also occur if there are some combinations of factors that do not occur in the data, and here it may be more sensible to set option COMBINATIONS=present so that these cells are all excluded from the calculation of predictions. The final way to overcome aliasing is to supply explicit weights using the WEIGHTS option.
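The alternatives for handling aliased cells can be sketched as follows. The model is hypothetical: Yield, Site and Variety are illustrative names, and some Site by Variety combinations are assumed to be absent from the data.

```genstat
" Hypothetical two-factor model with an interaction. "
MODEL Yield
FIT Site*Variety

" Default (COMBINATIONS=estimable): only estimable cells are used. "
PREDICT Variety

" Exclude cells for factor combinations that do not occur in the data. "
PREDICT [COMBINATIONS=present] Variety

" Use every cell, taking any aliased parameters to be zero. "
PREDICT [COMBINATIONS=full; ALIASING=ignore] Variety
```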
Averaging is usually the appropriate way of combining predicted values over levels of a factor. But sometimes summation is needed, for example in the analysis of counts by log-linear models. You can achieve this by setting the METHOD option to total. The rules about weights and so on still apply. In a generalized linear model, averaging is done by default on the scale of the original response variable, not on the scale transformed by the link function. In other words, linear predictors are formed for all the combinations of factor levels and variate values specified by PREDICT, and are then transformed back to the natural scale by the inverse of the link function before averaging. This back-transformation may be useful when you are reporting results, since the tables from PREDICT can then be interpreted as natural averages of means predicted by the fitted model. You can set option BACKTRANSFORM=none if you want the averaging to be done on the scale of the linear predictor; PREDICT will then form averages and report predictions on the transformed scale.
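A minimal sketch of these options for a log-linear model is given below. Count, Age and Sex are hypothetical identifiers for a response variate of counts and two factors.

```genstat
" Hypothetical log-linear model for counts. "
MODEL [DISTRIBUTION=poisson; LINK=logarithm] Count
FIT Age + Sex

" Default: averages over Sex, formed on the natural (count) scale. "
PREDICT Age

" Totals over Sex instead of averages. "
PREDICT [METHOD=total] Age

" Averages formed and reported on the scale of the linear predictor. "
PREDICT [BACKTRANSFORM=none] Age
```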
PREDICT calculates the standard errors of predictions from iterative models by using first-order approximations that allow for the effect of the link function. Thus you should interpret them only as a rough guide to the variability of individual predictions.
The PREDICTIONS, SE, SED, LSD and VCOVARIANCE options let you save the results of PREDICT as well as, or instead of, printing them.
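As a sketch, results can be saved without printing and then displayed or used later. This assumes a fitted model containing a hypothetical factor Variety; Pred, Sepred, Sedmat and Vcmat are hypothetical identifiers.

```genstat
" Suppress printing and save the results in named structures. "
PREDICT [PRINT=*; PREDICTIONS=Pred; SE=Sepred; SED=Sedmat; \
   VCOVARIANCE=Vcmat] Variety

" Display the saved tables and the symmetric matrix of s.e.d.s. "
PRINT Pred,Sepred; DECIMALS=2
PRINT Sedmat
```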
The SAVE option allows you to specify the regression save structure of the analysis on which the predictions are based. If SAVE is not set, the most recent regression model is used.
The NOMESSAGE option controls printing of messages. The nonlinear setting suppresses messages about the approximate nature of standard errors of predictions in generalized linear models, and the dispersion setting prevents reminders appearing about the basis of the standard errors.
Options: PRINT, CHANNEL, COMBINATIONS, ADJUSTMENT, WEIGHTS, OFFSET, METHOD, ALIASING, BACKTRANSFORM, SCOPE, NOMESSAGE, DISPERSION, DMETHOD, NBINOMIAL, PREDICTIONS, SE, SED, LSD, LSDLEVEL, VCOVARIANCE, SAVE.
Parameters: CLASSIFY, LEVELS, PARALLEL, NEWFACTOR.
Reference
Lane, P.W. & Nelder, J.A. (1982). Analysis of covariance and standardization as instances of prediction. Biometrics, 38, 613-621.
See also
Directives: FIT, RDISPLAY, VPREDICT.
Procedure: HGPREDICT.
Commands for: Regression analysis.
Example
" Example PRED-1: Prediction from simple linear regression
  Attempt to find a linear relationship between the boiling point of
  water and barometric pressure, to allow prediction of pressure and
  thus of altitude."

" Read and display the data."
READ Boiltemp,Pressure
194.5 20.79  194.3 20.79  197.9 22.40  198.4 22.67  199.4 23.15
199.9 23.35  200.9 23.89  201.1 23.99  201.4 24.02  201.3 24.01
203.6 25.14  204.6 26.57  209.5 28.49  208.6 27.76  210.7 29.04
211.9 29.88  212.2 30.06  :
DGRAPH Pressure; Boiltemp

" Regress pressure on boiling point."
MODEL Pressure
FIT Boiltemp

" Predict pressure when boiling point is 190."
PREDICT Boiltemp; LEVEL=190

" Print a chart of predictions for a range of temperatures including
  standard errors of the predicted means and standard errors for
  future observations."
VARIATE [VALUES=190,192...216] temp
PREDICT [PRINT=*; PREDICT=predict; SE=sepred] Boiltemp; LEVEL=temp
RKEEP DEVIANCE=rss; DF=rdf
CALCULATE sefuture = SQRT(sepred**2 + rss/rdf)
PRINT predict,sepred,sefuture; DECIMALS=2