1. Home
2. GEE procedure

# GEE procedure

Fits models to longitudinal data by generalized estimating equations (D.M. Smith & M.G.Kenward).

### Options

`PRINT` = string token What to display (`estimates`, `correlations`, `scalefactor`, `wald`, `monitoring`); default `esti`, `corr`, `scal` Distribution of response (`normal`, `Poisson`, `binomial`, `gamma`, `inversenormal`, `negativebinomial`); default `*` Link function (`identity`, `logarithm`, `logit`, `reciprocal`, `power`, `squareroot`, `probit`, `complementaryloglog`, `logratio`); default `*` Exponent for power link; default -2 Explanatory variates, factors etc How to treat constant (`estimate`, `omit`); default `esti` Limit for expansion of model terms; default 3 Fixed parameter for negative binomial distribution (parameter k as in variance function var = mean + mean2/k); default 1 Parameter for logratio link, in form log(mean / (mean + k)); default as set in `AGGREGATION` option Whether to use quadratic estimation (`used`, `notused`); default `used` How to calculate the scale factor (`fixed`, `constant`, `varytime`); default varies with distribution, `fixed` for Poisson, binomial and negative binomial, `constant` for rest Value for scale factor when `SCALEFACTOR=fixed`; default 1.0 for Poisson and binomial, missing for rest Form of correlation matrix (`independence`, `unstructured`, `exchangeable`, `autoregressive`, `dependence`, `antedependence`); default `*` Order in dependence and ante-dependence form of correlation matrix; default 1 Whether correlation in dependence model changes with time (`no`, `yes`); default `no`

### Parameters

`Y` = variates Response variate for each analysis Denominator in binomial To store fitted values To store residuals Identifier of subjects Identifier of outcomes Variate of counts of no. outcomes Times of repeated measures variate Weight variate Offset variate Structure to save output variables

### Description

`GEE` implements the General Estimating Equation (GEE) methodology of Liang & Zeger (1986) with quadratic estimation for the covariance structure. In the terminology of Liang et al. (1992) the methodology implemented is a form of GEE1. Full details of the implementation are given in Kenward & Smith (1995a). GEE, as implemented here, is a comparatively simple non-likelihood method for fitting marginal models to repeated measurements that can be used when the response has a distribution in the exponential family. This includes the Gaussian distribution, for which the procedure implemented here reduces to a form of the EM algorithm, and then produces exact ML or REML estimates, or a close approximation to these depending on the particular correlation structure chosen. For other distributions the resulting estimates are not maximum likelihood but can be shown to have asymptotic properties familiar from quasi-likelihood, such as consistency and asymptotic normality.

The standard range of generalized linear models (as in procedure `GLM`) can be fitted involving a variety of covariance or correlation structures over the times of the repeated measurements. The standard links and distributions can be chosen by setting the options `DISTRIBUTION`, `LINK`, `EXPONENT`, `AGGREGATION` and `KLOGRATIO`, as in the `MODEL` directive. Non-standard ones require the definition of auxiliary procedures to carry out the necessary calculations (see the Method section). The terms in the fitted model are specified by the `TERMS` option, which may be set to a formula or left unset to fit a null model. The `FACTORIAL` option (default 3) sets a limit on the number of factors and variates in the terms that are fitted, as in the `FIT` directive. The `CONSTANT` option can be used to omit a constant term. Setting the `QUADESTIMATION` option to `used` requests the use of quadratic estimation for the data-based covariance or correlation matrix (see Kenward & Smith 1995a). The `SCALEFACTOR` option specifies the form of scalefactor to be used (fixed to a value specified by the `SFVALUE` option, constant over times of repeated measurements, or varying over times of repeated measurements). The `CRTYPE` option specifies the structure of the covariance or correlation matrix over the times of the repeated measurements. The `ORDER` option specifies the order of the covariance or correlation structures for the dependence and ante-dependence cases, with option `TIMEDEPENDENT` specifying whether the correlation in a dependence structure changes with the time of the repeated measurement.

The `Y` parameter must be set to specify the response variate. For a binomial distribution the `NBINOMIAL` parameter must also be set; this may be either a variate or a scalar (if it is a scalar, `GEE` maps it out automatically to a variate with the same number of values as `Y`). The `SUBJECT` parameter specifies a factor to identify the subjects. Alternatively, where the data consist of outcomes and numbers with those outcomes, the parameter `OUTCOME` must be set to the identifier of the outcome and the parameter `COUNT` to the number with the outcome. The parameter `TIME` must be set to the times of the repeated measurements. The parameters `WEIGHT` and `OFFSET` specify weight and offset variates that may be involved.

The output from the procedure is controlled by the `PRINT` option; by default estimates, their standard errors, covariances or correlations and scalefactors are given. Two sets of standard errors are provided for the estimates. One is the naive estimate which assumes the specified covariance or correlation structure holds. The other is the sandwich estimate which makes no such assumption. When `PRINT=wald`, Wald tests are produced using both sets of standard errors and correlations.

The fitted values and residuals can be obtained by setting the parameters `FITTEDVALUES` and `RESIDUALS`. The residuals are the Pearson residuals as defined in the Guide to the Genstat Command Language, Part 2, Section 3.1.1.

The `SAVE` parameter can save various details of the analysis, in a pointer with the following suffixes and labels:

`1` or `'scalefactors'` scalefactor(s), correlations or covariances, according to the type of model (and labelled appropriately), the estimates of the linear predictor parameters, naive variance-covariance matrix for the estimates, sandwich variance-covariance matrix for the estimates, Wald tests calculated using the naive variance-covariance matrix, and Wald tests calculated using the sandwich variance-covariance matrix.

The algorithms in the procedure have been set up assuming that the data contain a complete set of observations for each subject. Where there are missing values these must be included explicitly (using the missing value symbol `*`) to create a complete set of observations. Missing values are allowed in both the `Y` variate and the explanatory variates in `TERMS`.

In the case of the Gaussian distribution, a working covariance matrix, rather than correlation matrix, is used. This provides considerable simplification within the algorithm.

This is a complicated algorithm and some examples may take a while to run. If necessary, however, you can set option `PRINT=monitoring` to see what is happening.

Options: `PRINT`, `DISTRIBUTION`, `LINK`, `EXPONENT`, `TERMS`, `CONSTANT`, `FACTORIAL`, `AGGREGATION`, `KLOGRATIO`, `QUADESTIMATION`, `SCALEFACTOR`, `SFVALUE`, `CRTYPE`, `ORDER`, `TIMEDEPENDENT`.

Parameters: `Y`, `NBINOMIAL`, `FITTEDVALUES`, `RESIDUALS`, `SUBJECT`, `OUTCOME`, `COUNT`, `TIME`, `WEIGHT`, `OFFSET`, `SAVE`.

### Method

For full details of the method implemented in this procedure see Kenward & Smith (1995a). A generalized linear model is formulated for the marginal distribution of the observations at each time point using an appropriate link function and error distribution. If the repeated measurements could be assumed to be independent, the well-known iterative weighted least squares fitting procedure could be used to obtain ML estimates of the marginal model parameters. However this ignores the dependence among the repeated measurements. Full likelihood is in general very awkward in this setting so, to avoid a formal introduction of dependence into the model, a working correlation matrix is introduced into the iterative procedure, changing the least squares from a weighted to a generalized form. The correlation matrix can be introduced in various ways. It can be held constant throughout the iterative procedure. An example of this is the use of the identity matrix, leading to the so-called independence estimating equations for which the process reduces back to that of fitting a univariate generalized linear model. Alternatively an estimated correlation matrix can be introduced into the algorithm which is updated at each cycle using quadratic estimation: essentially the correlation structure is estimated from the residuals using the equations that would be appropriate were the residuals normally distributed. On convergence consistent estimates of the marginal linear model parameters are obtained and, if the correlation structure chosen is appropriate, then this will be consistently estimated as well. It is not necessary for the correlation structure to be correct for the consistency of the marginal parameter estimates, at least when the correlation structure is fixed; indeed the common choice of independence is almost certain not to be appropriate. However the estimates of precision of the marginal parameter estimates do need to be adjusted to allow for the true correlation structure. This correction is done in the so-called “sandwich” estimator provided by the procedure.

The procedures have been written so that it is possible to fit models other than the standard ones. An important example of such a model is the application of the GEE methodology to ordinal categorical data. This application requires the data to be arranged in a particular form (as cummulative logits) and a particular correlation matrix (specified in `_GEECORRELATION`). The type of analyses are explained in Kenward et al. (1994) and the methodology described in that paper has been duplicated. Further details are given in Kenward & Smith (1995b).

An option (`SCALEFACTOR`) has been included that allows the user to decide whether or not the scale factor is fixed at its independence distributional default, or is estimated from the scaled residuals as in Liang & Zeger (1986), or is treated as a vector varying over time.

`GEE` calls three subsidiary procedures,` _GEECODI`, `_GEECALLIN` and `_GEECALDIS` to assist with the analysis. There is no need for the user to modify these procedures:

    `_GEECODI` prints out the results of the iterative processes; calculates the fitted values and derivatives for various links; calculates the variance function and deviance for various distributions.

There are also four other procedures, which can be re-written or replaced, to cater for further user-defined distributions, links and correlation structures:

    `_GEEINIT` calculates initial estimates of the linear predictor in the generalized linear model; calculates fitted values and derivatives; calculates the variance function and deviance; calculates the correlation matrix and the sandwich matrix involving the residuals. (For the normal distribution the variance-covariance matrices are used not the correlation matrices.)

If the `LINK` option is unset, the procedure will call `_GEEINIT` and `_GEELINK` instead of using those for the various standard link functions. For a logit link function `_GEEINIT` and `_GEELINK` should be defined as follows.

PROCEDURE ‘_GEEINIT’
“Calculation of initial estimate of linear predictor,
PARAMETER NAME = \
‘Y’, “I: variate; response variate”\
‘LINEARPREDICTOR’,”O: variate; linear predictor”\
‘OFFSET’, “I: variate; offset”\
‘NBINOMIAL’; “I: variate; denominator of binomial”\
SET=3(yes),no;TYPE=4(‘variate’); \
COMPATIBLE=*,3(!T(type,nvalues,restriction));\
PRESENT=yes,no,2(yes)

CALC LINEARPREDICTOR = LOG((Y+0.5)/(NBINOMIAL-Y+0.5)) – OFFSET
ENDPROCEDURE

“Calculation of fitted values and derivatives”
PARAMETER NAME = \
‘LINEARPREDICTOR’, “I: variate; linear predictor”\
‘FITTEDVALUES’, “O: variate; estimate of fitted values”\
‘DERIVATIVES’, “O: variate; estimate of derivatives”\
‘OFFSET’, “I: variate; offset”\
‘NBINOMIAL’; “I: variate; denominator of binomial”\
SET=4(yes),no;TYPE=5(‘variate’); \
COMPATIBLE=*,4(!T(type,nvalues,restriction));\
PRESENT=yes,2(no),2(yes)

GETATTRIBUTE [ATTRIBUTE=NVALUES] LINEARPREDICTOR; SAVE=!P(nobs)

CALC FITTEDVALUES = NBINOMIAL/(1+EXP(-LINEARPREDICTOR – OFFSET))
& DERIVATIVES = 1/FITTEDVALUES+1/(NBINOMIAL-FITTEDVALUES)
ENDPROCEDURE

If the DISTRIBUTION option is unset, the procedure will call _GEEDISTRIBUTION instead of using one of the various standard distributions. For a binomial error distribution _GEEDISTRIBUTION should be defined as follows.

PROCEDURE ‘_GEEDISTRIBUTION’
” Calculation of variance function and deviance”
PARAMETER NAME = \
‘Y’, “I: variate; response variate”\
‘FITTEDVALUES’,”I: variate; fitted values”\
‘VARIANCE’, “O: variate; variance”\
‘DEVIANCE’, “O: scalar; total deviance”\
‘NBINOMIAL’; “I: variate; denominator of binomial”\
SET=4(yes),no;TYPE=3(‘variate’),’scalar’,’variate’; \
COMPATIBLE=*,2(!T(type,nvalues,restriction)),*,\
!T(type,nvalues,restriction); \
PRESENT=2(yes),2(no),yes

CALC VARIANCE = FITTEDVALUES*(NBINOMIAL-FITTEDVALUES)/NBINOMIAL
& DEVIANCE = -2*LLB(Y;NBINOMIAL;(FITTEDVALUES/NBINOMIAL))
ENDPROCEDURE

If the CRTYPE option is unset, the procedure will call _GEECORRELATION instead of using one of the various standard correlation models. For the independence model _GEECORRELATION should be defined as follows. Kenward & Smith (1995b) describe how _GEECORRELATION should be set up for analysing repeated ordinal categorical data.

PROCEDURE ‘_GEECORRELATION’
“
Calculation of correlation matrix

For SANDWICH = NO
input is the R matrix as for UNSPECIFIED
output is the desired R matrix.
For SANDWICH = YES
input is the (Y-MU)*T(Y-MU) matrix
output is the desired modified (Y-MU)*T(Y-MU) matrix.

N.B. For the normal distribution both the input and
output R’s should be variance/covariance matrices
not correlation matrices.
“
OPTION NAME = \
‘CONSTANT’, “I: text; how to treat constant (estimate,
omit); default e”\
‘SANDWICH’; “I; text; whether the sandwich central matrix
product or not) (no,yes); default no”\
MODE=2(T); NVALUES=2(1); SET=yes;\
VALUES=!T(ESTIMATE,OMIT),!T(NO,YES); \
DEFAULT=!T(ESTIMATE),!T(NO);

PARAMETER NAME = \
‘CORRELATIONS’,”I/O: matrix; the correlation matrix”\
ESTIMATES’, “I: variate; estimates of parameters in
model”\
‘Y’, “I: variate; response variate”\
‘RESIDUALS’, “I: variate; residuals”\
‘FITTEDVALUES’,”I: variate; fitted values”\
TIME’, “I: variate; times of repeated measures”\
‘MARKER’, “I: factor; identifier of subject or outcome”\
‘DISTRIBUTION’,”I: text; identifier of distribution”\
‘SCALEFACTOR’, “I: text; scalefactor option in use”\
‘SFVALUE’; “I: scalar; value of scalefactor if FIXED”\
SET=10(yes);DECLARED=10(yes); \
TYPE=’symmetric’,5(‘variate’),’factor’,2(‘text’),’scalar’; \
PRESENT=9(yes),no

GETATTRIBUTE [ATTRIBUTE=NVALUES] ESTIMATES; SAVE=!P(ncol)
& [ATTRIBUTE=NROWS] CORRELATIONS; SAVE=!P(ntime)

DIAGONALMATRIX [ROWS=ntime;MODIFY=yes] done,wkdm; \
VALUES=!(#ntime(1)),*

CALC const = ‘ESTIMATE’ .IN. CONSTANT
& sandw = ‘NO’ .IN. SANDWICH

IF sandw

SCALEFACTOR is as in GEE i.e. FIXED means fixed to SFVALUE
CONSTANT means the scalefactor is estimated but constant
across time, and VARYTIME means the scalefactor is estimated
and varies across time.

The variate TIME in this PROCEDURE represents the 1…ntime
distinct times, it is not a FACTOR of length nobs as in GEE.
It is the levels of the parameter TIME of GEE.

IF DISTRIBUTION.EQS.’NORMAL’
IF SCALEFACTOR.NES.’VARYTIME’
IF SCALEFACTOR.EQS.’FIXED’
CALC wkdm = SFVALUE
ELSE
CALC wkdm = TRACE(CORRELATIONS)/ntime
ENDIF
ELSE
CALC wkdm = CORRELATIONS
ENDIF
CALC CORRELATIONS = 0 + wkdm
ELSE
CALC CORRELATIONS = done
ENDIF
ENDIF
ENDPROCEDURE

If `LINK`, `DISTRIBUTION` or `CRTYPE` are unset, but no user routines are given for `_GEEINIT`, `_GEELINK`, `_GEEDISTRIBUTION` and `_GEECORRELATION`, then those given here (for logit link, binomial error distribution and independence) will be used.

### Action with `RESTRICT`

Input structures must not be restricted, and any existing restrictions will be cancelled.

Kenward, M.G., Lesaffre, E. & Molenberghs, G. (1994). An application of maximum likelihood and generalized estimating equations to the analysis of ordinal data from a longitudinal study with cases missing at random. Biometrics, 50, 945-953.

Kenward, M.G. & Smith, D.M. (1995a). Computing the generalized estimating equations for repeated measurements. Genstat Newsletter, 32, 50-62.

Kenward, M.G. & Smith, D.M. (1995b). Computing the generalized estimating equations for repeated ordinal, categorical measurements. Genstat Newsletter, 32, 63-70.

Liang, K.-Y. & Zeger,S.L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13-22.

Liang, K.-Y., Zeger, S.L. & Qaqish, B. (1992). Multivariate regression analyses for categorical data (with discussion). Journal of the Royal Statistical Society, Series B, 54, 3-40.

Procedures: `GLMM`, `HGANALYSE`.

Commands for: Regression analysis, Repeated measurements.

### Example

```CAPTION 'GEE example',!t('Data from Archer',\
'(1987, Fertility & Sterility, 47, 559-564)',\
'about prolactin response to thyroptrin releasing',\
'hormone in women grouped according fertility status;',\
'also see Paik (1992, Biometrics, 48, 19-30).'); STYLE=meta,plain
VARIATE [VALUES=44,45,41,40,72,49,41,31,37,23,15,10,103,79,47,\
39,51,40,32,15,97,98,76,51,59,55,49,36,97,75,\
49,38,88,78,61,43,53,40,29,23,66,35,18,16,60,\
48,32,29,53,47,29,38,111,77,59,58,27,22,18,12,\
51,62,40,37,28,33,20,15,49,39,32,23,59,49,43,\
38,155,126,72,48,82,67,54,44,127,99,58,53,75,59,\
46,29,71,62,49,44,114,110,95,52,172,95,51,43,210,\
156,117,91,100,90,60,50,86,65,57,42,101,93,68,47] Response
&      [VALUES=16,16,16,16,10,10,10,10,7,7,7,7,8,8,8,8,5,5,5,5,\
8,8,8,8,7,7,7,7,9,9,9,9,13,13,13,13,3,3,3,3,\
7,7,7,7,8,8,8,8,17,17,17,17,38,38,38,38,11,11,11,11,\
12,12,12,12,7,7,7,7,7,7,7,7,26,26,26,26,9,9,9,9,\
19,19,19,19,12,12,12,12,20,20,20,20,41,41,41,41,4,4,4,4,\
30,30,30,30,15,15,15,15,36,36,36,36,15,15,15,15,11,11,11,11]\
Baseline
&      [VALUES=(1...4)30] CTime
FACTOR  [LEVELS=30; VALUES=4(1...30)] Woman
&      [LEVELS=3; VALUES=24(1),48(2),48(3)] Group
&      [LEVELS=4; VALUES=(1...4)30] Time