PREDICT directive

Forms predictions from a linear or generalized linear model.

Options

`PRINT` = string token	What to print (`description`, `lsd`, `predictions`, `se`, `sed`, `vcovariance`); default `desc`, `pred`, `se`
`CHANNEL` = scalar	Channel number for output; default `*` i.e. current output channel
`COMBINATIONS` = string token	Which combinations of factors in the current model to include (`full`, `present`, `estimable`); default `esti`
`ADJUSTMENT` = string token	Type of adjustment (`marginal`, `equal`); default `marg`
`WEIGHTS` = table	Weights classified by some or all of the factors in the model; default `*`
`OFFSET` = scalar	Value of offset on which to base predictions; default mean of offset variate
`METHOD` = string token	Method of forming margin (`mean`, `total`); default `mean`
`ALIASING` = string token	How to deal with aliased parameters (`fault`, `ignore`); default `faul`
`BACKTRANSFORM` = string token	What back-transformation to apply to the values on the linear scale, before calculating the predicted means (`link`, `none`); default `link`
`SCOPE` = string token	Controls whether the variance of predictions is calculated on the basis of forecasting new observations rather than summarizing the data to which the model has been fitted (`data`, `new`); default `data`
`NOMESSAGE` = string tokens	Which warning messages to suppress (`dispersion`, `nonlinear`); default `*`
`DISPERSION` = scalar	Value of dispersion parameter in calculation of s.e.s; default is as set in the `MODEL` statement
`DMETHOD` = string token	Basis of estimate of dispersion, if not fixed by `DISPERSION` option (`deviance`, `Pearson`); default is as set in the `MODEL` statement
`NBINOMIAL` = scalar	Supplies the total number of trials to be used for prediction with a binomial distribution (providing a value n greater than one allows predictions to be made of the number of “successes” out of n, whereas the value one predicts the proportion of successes); default 1
`PREDICTIONS` = tables or scalars	Saves predictions for each y variate; default `*`
`SE` = tables or scalars	Saves standard errors of predictions for each y variate; default `*`
`SED` = symmetric matrices	Saves standard errors of differences between predictions for each y variate; default `*`
`LSD` = symmetric matrices	Saves least significant differences between predictions for each y variate (models with Normal errors only); default `*`
`LSDLEVEL` = scalar	Significance level (%) to use in the calculation of least significant differences; default 5
`VCOVARIANCE` = symmetric matrices	Saves variance-covariance matrices of predictions for each y variate; default `*`
`SAVE` = identifier	Specifies save structure of model from which to predict; default `*` i.e. that from latest model fitted

Parameters

`CLASSIFY` = vectors	Variates and/or factors to classify table of predictions
`LEVELS` = variates, scalars or texts	To specify values of variates, levels of factors
`PARALLEL` = identifiers	For each vector in the `CLASSIFY` list, allows you to specify another vector in the `CLASSIFY` list with which the values of this vector should change in parallel (you then obtain just one dimension in the table of predictions for these vectors)
`NEWFACTOR` = identifiers	Identifiers for new factors that are defined when `LEVELS` are specified

Description

The PREDICT directive can be used after the FIT directive to summarize the results of the regression, by using the fitted relationship to predict the values of the response variate at particular values of the explanatory variables. CLASSIFY, the first parameter of PREDICT, specifies those variates or factors in the current regression model whose effects you want to summarize. Any variate or factor in the current model that you do not include will be standardized in some way, as described below.

The LEVELS parameter specifies values at which the summaries are to be calculated, for each of the structures in the CLASSIFY list. For factors, you can select some or all of the levels, while for variates you can specify any set of values. A single level or value is represented by a scalar; several levels or values must be combined into a variate (which may of course be unnamed). Alternatively, if the factor has labels, you can use these to select the levels for the summaries by setting LEVELS to a text. A missing value in the LEVELS parameter is taken by Genstat to stand for all the levels of a factor, or for the mean value of a variate.

The PARALLEL parameter allows you to indicate that a factor or variate should change in parallel to another factor or variate. Both should have the same number of values specified by the LEVELS parameter of PREDICT. The predictions are then formed for each corresponding set of values rather than for every combination of these values. For example, suppose we had fitted a quadratic model with explanatory variates X and Xsquared. We could then put

PREDICT Xsquared,X; PARALLEL=X,*;\
  LEVELS=!(0,4,16,36,64,100),!(0,2,4,6,8,10)

The PARALLEL parameter specifies that Xsquared should change in parallel to X, so that we obtain predictions only for matching values.
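
The difference between crossed and parallel values can be illustrated outside Genstat. The following is a minimal Python sketch (not Genstat code) that reuses the values from the statement above; the names `crossed` and `parallel` are chosen purely for illustration.

```python
from itertools import product

x_levels = [0, 2, 4, 6, 8, 10]
xsq_levels = [0, 4, 16, 36, 64, 100]

# Without PARALLEL: one cell for every combination of the two sets of values.
crossed = list(product(xsq_levels, x_levels))

# With PARALLEL: the values are paired position by position,
# giving a single dimension in the table of predictions.
parallel = list(zip(xsq_levels, x_levels))

print(len(crossed), len(parallel))
# Every parallel pair is a matching (X squared, X) combination.
print(all(xsq == x * x for xsq, x in parallel))
```

Only the paired cells correspond to points on the fitted quadratic, which is why the crossed table would be of little use here.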

When you specify LEVELS, PREDICT needs to define a new factor to classify that dimension of the table. By default this will be an unnamed factor, but you can use the NEWFACTOR parameter to give it an identifier. The EXTRA attribute of the factor is set to the name of the corresponding factor or variate in the CLASSIFY list; this will then be used to label that dimension of the table of predictions.

You can best understand how Genstat forms predictions by regarding its calculations as consisting of two steps. The first step, referred to below as Step A, is to calculate the full table of predictions, classified by every factor in the current model. For any variate in the model, the predictions are formed at its mean, unless you have specified some other values using the LEVELS parameter; if so, these are then taken as a further classification of the table of predictions. The second step, referred to as Step B, is to average the full table of predictions over the classifications that do not appear in the CLASSIFY parameter: you can control the type of averaging using the COMBINATIONS, ADJUSTMENT and WEIGHTS options. By default, the predictions are made at the mean of any offset variate, but option OFFSET can be used to specify another value at which the predictions should be made instead.
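
As a rough illustration of the two steps, the Python sketch below (not Genstat code) builds a full table for a model with two factors and one variate, then averages over the factor that is not being summarized. The coefficients, level names and counts are all hypothetical, and the averaging uses the default marginal weights described later in this section.

```python
# Hypothetical fitted coefficients for a model: constant + A + B + slope * x
constant = 10.0
a_effect = {"a1": 0.0, "a2": 2.0}
b_effect = {"b1": 0.0, "b2": 1.0, "b3": 4.0}
slope = 0.5
x_mean = 6.0

# Hypothetical number of units at each level of B in the data
b_counts = {"b1": 5, "b2": 3, "b3": 2}

# Step A: full table of predictions, classified by every factor,
# with the variate x held at its mean.
full_table = {
    (a, b): constant + a_effect[a] + b_effect[b] + slope * x_mean
    for a in a_effect
    for b in b_effect
}

# Step B: average over B (assumed absent from CLASSIFY), weighting each
# level of B by its number of occurrences in the data.
total = sum(b_counts.values())
predictions = {
    a: sum(full_table[(a, b)] * b_counts[b] for b in b_effect) / total
    for a in a_effect
}
print(predictions)
```

In this sketch the result is a one-way table classified by A alone, corresponding to a CLASSIFY list that contains only A.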

Printed output is controlled by settings of the PRINT option:

`description`	describes the standardization policies used when forming the predictions,
`predictions`	prints the predictions,
`se`	produces predictions and standard errors,
`sed`	prints standard errors for differences between the predictions,
`lsd`	prints least significant differences between the predictions (ordinary linear regression models or generalized linear models with the Normal distribution only), and
`vcovariance`	prints the variance and covariances of the predictions.

By default, descriptions, predictions and standard errors are printed. The standard errors (and sed’s) are relevant for the predictions when considered as means of those data that have been analysed, with the means formed according to the averaging policy defined by the options of PREDICT. The word prediction is used because these are predictions of what the means would have been if the factor levels had been replicated differently in the data; see Lane & Nelder (1982) for more details. The LSDLEVEL option specifies the significance level (%) to use in the calculation of least significant differences (default 5%).

By default, the standard errors (and sed’s) are not augmented by any component corresponding to the estimated variability of a new observation. However, you can set option SCOPE=new to request that the variance of predictions should be calculated on the basis of forecasting new observations rather than of summarizing the data to which the model has been fitted. This setting cannot be used if the predictions are to be standardized for the effects of any factors in the model; in other words, all factors in the current model must be listed in the CLASSIFY parameter of the PREDICT statement. In addition, it cannot be used when making predictions from generalized linear models with option BACKTRANSFORM=none, nor with weighted regression. The effect of SCOPE=new is to form the variance of each predicted value by adding the variance of the estimated mean value of the prediction (as produced for SCOPE=data) to the estimated variance of a new observation with the same values of explanatory variates and factors:

“new” variance = “data” variance + (dispersion × variance function)
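
The formula can be checked by hand. The Python sketch below (not Genstat code) applies it to two hypothetical cases, using the standard variance functions: 1 for the Normal distribution and the mean for the Poisson. The function name `new_se` and all numerical values are assumptions made for illustration.

```python
import math

def new_se(data_se, dispersion, variance_function_value):
    """Standard error for a new observation, following the formula above."""
    return math.sqrt(data_se ** 2 + dispersion * variance_function_value)

# Normal errors: the variance function is 1, and the dispersion is
# typically estimated by the residual mean square (0.16 here).
normal_case = new_se(data_se=0.3, dispersion=0.16, variance_function_value=1.0)
print(normal_case)  # approximately 0.5

# Poisson errors: the variance function equals the predicted mean,
# with the dispersion fixed at 1.
predicted_mean = 4.0
poisson_case = new_se(data_se=0.3, dispersion=1.0,
                      variance_function_value=predicted_mean)
print(poisson_case)
```

In both cases the "new" standard error exceeds the "data" standard error, because a component for the variability of an individual observation has been added.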

The DISPERSION and DMETHOD options allow you to change the method by which the variance of the distribution of the response values is obtained for calculating the standard errors. These options operate like the corresponding options of MODEL (except that they apply only to the current statement). The default is to use the method as originally defined by the MODEL statement.

The NBINOMIAL option can be used to supply the total number of trials to be used for prediction with a binomial distribution when option BACKTRANSFORM is set to link. If you provide a value n greater than one, Genstat will predict the number of “successes” out of n. The default, NBINOMIAL=1, causes Genstat to predict the proportion of successes.

You can send the output to another channel, or to a text structure, by setting the CHANNEL option.

The COMBINATIONS option specifies which cells of the full table in Step A are to be filled for averaging in Step B. The default, COMBINATIONS=estimable, uses all the cells other than those that involve parameters that cannot be estimated, for example because of aliasing. Alternatively, you can set COMBINATIONS=present to exclude cells for factor combinations that do not occur in the data, or COMBINATIONS=full to use all the cells. When COMBINATIONS=estimable or COMBINATIONS=present the LEVELS parameter is overruled. Any subsets of factor levels in the LEVELS parameter are ignored, and predictions are formed for all the factor levels that occur in the data or are estimable. Likewise, the full table cannot then be classified by any sets of values of variates; the LEVELS parameter must then supply only single values for variates.

The ADJUSTMENT and WEIGHTS options define how the averaging is done in Step B. Values in the full table produced in Step A are averaged with respect to all those factors that you have not included in the settings of the CLASSIFY parameter. By default, the levels of any such factor are combined with what we call marginal weights: that is, by the number of occurrences of each of its levels in the whole dataset. The ADJUSTMENT and WEIGHTS options allow you to change the weights. The setting ADJUSTMENT=equal specifies that the levels are to be weighted equally. (This corresponds to the default weighting used by VPREDICT.) The WEIGHTS option is more powerful than the ADJUSTMENT option, allowing you to specify an explicit table of weights. This table can be classified by any, or all, of the factors over whose levels the predictions are to be averaged; the levels of remaining factors will be weighted according to the ADJUSTMENT option. Moreover, you can classify the weights by the factors in the CLASSIFY parameter as well, to provide different weightings for different combinations of levels of these factors. If you supply explicit weights in the WEIGHTS option, any setting of the COMBINATIONS option is ignored. You will find explicit weights useful in particular when you have population estimates of the proportions of each level of a factor – proportions which may not be matched well in the available data.
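
To see how the three weighting policies differ, consider the Python sketch below (not Genstat code). It averages the same hypothetical Step A predictions over the levels of a factor, here called Site, using marginal weights, equal weights, and an explicit table of population proportions. The helper `average_over_factor` and all values are illustrative assumptions.

```python
def average_over_factor(cell_predictions, weights):
    """Weighted average of predictions over the levels of one factor."""
    total = sum(weights[level] for level in cell_predictions)
    weighted_sum = sum(cell_predictions[level] * weights[level]
                       for level in cell_predictions)
    return weighted_sum / total

# Hypothetical Step A predictions for one cell of the CLASSIFY table,
# indexed by the levels of a factor Site that is to be averaged over.
cells = {"north": 12.0, "south": 15.0, "west": 18.0}

# Marginal weights (the default): number of units at each level in the data.
marginal = {"north": 6, "south": 3, "west": 1}
# Equal weights, as with ADJUSTMENT=equal.
equal = {level: 1 for level in cells}
# Explicit weights, as might be supplied in a WEIGHTS table:
# assumed population proportions that the sample does not match.
population = {"north": 0.2, "south": 0.3, "west": 0.5}

by_marginal = average_over_factor(cells, marginal)
by_equal = average_over_factor(cells, equal)
by_population = average_over_factor(cells, population)
print(by_marginal, by_equal, by_population)
```

The sample here is dominated by one level, so the marginal average is pulled towards that level's prediction, whereas the population weights move the average towards the level that is under-represented in the data.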

If a model contains any aliased parameters, predicted values cannot be formed for some cells of the full table without assuming a value for the aliased parameters. With the default setting, COMBINATIONS=estimable, no predictions are formed for these cells. When COMBINATIONS=full, if the aliased parameters simply represent effects of variates that are correlated with other explanatory variables in the model, it may be sufficient just to ignore them. This can be done by setting the ALIASING option to ignore. The aliased parameters are then taken to be zero, and fitted values are calculated for all cells of the table from the remaining parameters in the model. Aliasing can also occur if there are some combinations of factors that do not occur in the data, and here it may be more sensible to set option COMBINATIONS=present so that these cells are all excluded from the calculation of predictions. The final way to overcome aliasing is to supply explicit weights using the WEIGHTS option.

Averaging is usually the appropriate way of combining predicted values over levels of a factor. But sometimes summation is needed, for example in the analysis of counts by log-linear models. You can achieve this by setting the METHOD option to total. The rules about weights and so on still apply. In a generalized linear model, averaging is done by default on the scale of the original response variable, not on the scale transformed by the link function. In other words, linear predictors are formed for all the combinations of factor levels and variate values specified by PREDICT, and are then transformed by the inverse of the link function back to the natural scale before they are averaged. This back-transformation may be useful when you are reporting results, since the tables from PREDICT can then be interpreted as natural averages of means predicted by the fitted model. You can set option BACKTRANSFORM=none if you want the averaging to be done on the scale of the linear predictor; PREDICT will then form averages and report predictions on the transformed scale.
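
The order of back-transformation and averaging matters because the link function is nonlinear. The Python sketch below (not Genstat code) assumes a logit link, two hypothetical linear predictors and equal weights, and compares averaging on the natural scale with averaging on the scale of the linear predictor.

```python
import math

def inverse_logit(eta):
    """Back-transform a linear predictor to a proportion (logit link)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical linear predictors for two levels of a factor to be averaged over
linear_predictors = [-1.0, 3.0]
weights = [0.5, 0.5]

# As with BACKTRANSFORM=link: back-transform each cell, then average.
natural_average = sum(w * inverse_logit(eta)
                      for w, eta in zip(weights, linear_predictors))

# As with BACKTRANSFORM=none: average the linear predictors themselves;
# the result stays on the transformed scale.
link_scale_average = sum(w * eta for w, eta in zip(weights, linear_predictors))

print(natural_average)                    # average of the two proportions
print(link_scale_average)                 # average on the logit scale
print(inverse_logit(link_scale_average))  # differs from natural_average
```

Back-transforming the link-scale average afterwards does not recover the natural-scale average, so the two settings answer different questions.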

PREDICT calculates the standard errors of predictions from iterative models by using first-order approximations that allow for the effect of the link function. Thus you should interpret them only as a rough guide to the variability of individual predictions.

The PREDICTIONS, SE, SED, LSD and VCOVARIANCE options let you save the results of PREDICT as well as, or instead of, printing them.

The SAVE option allows you to specify the regression save structure of the analysis on which the predictions are based. If SAVE is not set, the most recent regression model is used.

The NOMESSAGE option controls printing of messages. The nonlinear setting suppresses messages about the approximate nature of standard errors of predictions in generalized linear models, and the dispersion setting prevents reminders appearing about the basis of the standard errors.

Options: PRINT, CHANNEL, COMBINATIONS, ADJUSTMENT, WEIGHTS, OFFSET, METHOD, ALIASING, BACKTRANSFORM, SCOPE, NOMESSAGE, DISPERSION, DMETHOD, NBINOMIAL, PREDICTIONS, SE, SED, LSD, LSDLEVEL, VCOVARIANCE, SAVE.

Parameters: CLASSIFY, LEVELS, PARALLEL, NEWFACTOR.

Reference

Lane, P.W. & Nelder, J.A. (1982). Analysis of covariance and standardization as instances of prediction. Biometrics, 38, 613-621.

Example

" Example PRED-1: Prediction from simple linear regression

  Attempt to find a linear relationship between the boiling point
  of water and barometric pressure, to allow prediction of pressure
  and thus of altitude."

" Read and display the data."
READ Boiltemp,Pressure
194.5 20.79  194.3 20.79  197.9 22.40  198.4 22.67  199.4 23.15
199.9 23.35  200.9 23.89  201.1 23.99  201.4 24.02  201.3 24.01
203.6 25.14  204.6 26.57  209.5 28.49  208.6 27.76  210.7 29.04
211.9 29.88  212.2 30.06  :
DGRAPH Pressure; Boiltemp

" Regress pressure on boiling point."
MODEL Pressure
FIT Boiltemp

" Predict pressure when boiling point is 190."
PREDICT Boiltemp; LEVEL=190

" Print a chart of predictions for a range of temperatures
  including standard errors of the predicted means and
  standard errors for future observations."
VARIATE [VALUES=190,192...216] temp
PREDICT [PRINT=*; PREDICT=predict; SE=sepred] Boiltemp; LEVEL=temp
RKEEP DEVIANCE=rss; DF=rdf
CALCULATE sefuture = SQRT(sepred**2 + rss/rdf)
PRINT predict,sepred,sefuture; DECIMALS=2
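
For readers who wish to check the arithmetic behind this example, the following Python sketch (not Genstat code) reproduces the calculation using the standard formulae for simple linear regression. It assumes that the standard error of a predicted mean is s × sqrt(1/n + (x − mean of x)² / Sxx), and it forms the standard error for a future observation in the same way as the CALCULATE statement above. Variable and function names are illustrative.

```python
import math

boiltemp = [194.5, 194.3, 197.9, 198.4, 199.4, 199.9, 200.9, 201.1, 201.4,
            201.3, 203.6, 204.6, 209.5, 208.6, 210.7, 211.9, 212.2]
pressure = [20.79, 20.79, 22.40, 22.67, 23.15, 23.35, 23.89, 23.99, 24.02,
            24.01, 25.14, 26.57, 28.49, 27.76, 29.04, 29.88, 30.06]

n = len(boiltemp)
x_mean = sum(boiltemp) / n
y_mean = sum(pressure) / n
sxx = sum((x - x_mean) ** 2 for x in boiltemp)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(boiltemp, pressure))

# Least-squares estimates for the regression of pressure on boiling point
slope = sxy / sxx
intercept = y_mean - slope * x_mean

# Residual mean square: the counterpart of rss/rdf in the Genstat code
rss = sum((y - (intercept + slope * x)) ** 2
          for x, y in zip(boiltemp, pressure))
rdf = n - 2
s2 = rss / rdf

def predict(temp):
    """Return prediction, s.e. of the predicted mean, s.e. for a future value."""
    fit = intercept + slope * temp
    se_mean = math.sqrt(s2 * (1.0 / n + (temp - x_mean) ** 2 / sxx))
    se_future = math.sqrt(se_mean ** 2 + s2)
    return fit, se_mean, se_future

# Same range of temperatures as the variate temp: 190, 192 ... 216
for temp in range(190, 217, 2):
    fit, se_mean, se_future = predict(temp)
    print(f"{temp:5d} {fit:8.2f} {se_mean:8.2f} {se_future:8.2f}")
```

The standard error of the predicted mean grows as the temperature moves away from the mean of the observed boiling points, so it is largest at the extrapolated value of 190.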

Updated on June 19, 2019
