Estimates parameters in Box-Jenkins models for time series.
Options
PRINT = string tokens | What to print (model, summary, estimates, correlations, monitoring); default mode,summ,esti |
---|---|
LIKELIHOOD = string token | Method of likelihood calculation (exact, leastsquares, marginal); default exac |
CONSTANT = string token | How to treat the constant (estimate, fix); default esti |
RECYCLE = string token | Whether to continue from previous estimation (yes, no); default no |
WEIGHTS = variate | Weights; default * |
MVREPLACE = string token | Whether to replace missing values by their estimates (yes, no); default no |
FIX = variate | Defines constraints on parameters (ordered as in each model, tf models first): zeros fix parameters, parameters with equal numbers are constrained to be equal; default * |
METHOD = string token | Whether to carry out full iterative estimation, to carry out just one iterative step, to perform no steps but still give parameter standard deviations, or only to initialize for forecasting by regenerating residuals (full, onestep, zerostep, initialize); default full |
MAXCYCLE = scalar | Maximum number of iterations; default 15 |
TOLERANCE = scalar | Criterion for convergence; default 0.0004 |
SAVE = identifier | To name save structure, or supply save structure with transfer-functions; default * i.e. transfer-functions taken from the latest model |
Parameters
SERIES = variate | Time series to be modelled (output series) |
---|---|
TSM = TSM | Model for output series |
BOXCOXMETHOD = string token | How to treat transformation parameter in output series (fix, estimate); default fix |
RESIDUALS = variate | To save residual series |
Description
The main use of TFIT is to fit parameters to time-series models, although you can also use it to initialize for the TFORECAST directive, even when the model parameters are already known. TFIT was originally called ESTIMATE, but was renamed in Release 14 to emphasize its status as a time-series command. The earlier name (ESTIMATE) was retained to allow previous programs to continue to run, but this may be removed in a future release.
You need to define a TSM structure before using TFIT, to provide the setting for the TSM parameter. You may also wish to give a TRANSFERFUNCTION statement, for example if you wish to specify explanatory variables for regression with ARIMA errors, or to define transfer-function models. In many applications of estimating a univariate ARIMA model, you will need only a simple form of the directive, such as:
TFIT Daylength; TSM=Erp
The SERIES parameter specifies the variate holding the time series data to which the model is to be fitted.
The TSM parameter specifies the ARIMA model that is to be fitted to the time-series data. This TSM must already have been declared and its ORDERS must have been set. If the LAGS parameter of the TSM has been set, the lags must have been given values. However, if the PARAMETERS of the TSM have been set, the parameters variate need not have been declared previously nor given values. When the parameter values are not set, default values are used: these are all zero, except for the transformation parameter, which is set to 1.0 if it is not to be estimated (see BOXCOXMETHOD and FIX below). Any parameter values that you do specify will be used as initial values for the parameters in the model; Genstat replaces any missing values by the default values. If any group of autoregressive or moving-average parameters does not satisfy the required conditions for stationarity or invertibility, all the parameters to be estimated are reset by Genstat to the default values. After TFIT, the parameters of the TSM contain the estimated parameter values.
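The declaration requirements above might be sketched as follows. This is only an illustration: Daylength and Erp are the names used earlier in this description, the ARIMA(1,0,1) orders are chosen arbitrarily, and Par is a hypothetical variate that is neither declared nor given values beforehand, so that the default initial values apply.

```genstat
" Sketch only: orders and the identifier Par are illustrative assumptions "
TSM [MODELTYPE=arima] Erp; ORDERS=!(1,0,1); PARAMETERS=Par
TFIT SERIES=Daylength; TSM=Erp
" After TFIT, the parameters of the TSM hold the estimated values "
PRINT Par
```

The order of the elements within Par is defined by the TSM directive rather than by TFIT, so it should be checked there before individual elements are interpreted.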
The BOXCOXMETHOD parameter allows you to estimate the transformation parameter λ.
The RESIDUALS parameter saves the estimated innovations (or residuals). The residuals are calculated for $t = t_0, \dots, N$, where $t_0 = 1+p+d-q$ for a simple ARIMA model. If $t_0 > 1$, missing values will be inserted for $t = 1, \dots, t_0-1$.
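As a quick check of the formula for $t_0$, the following evaluates it for two simple sets of orders. The orders are chosen here purely for illustration and are not taken from any example in this description.

```latex
\begin{aligned}
(p,d,q) = (2,1,0):&\quad t_0 = 1 + 2 + 1 - 0 = 4,
  \text{ so the residuals for } t = 1, 2, 3 \text{ are missing;}\\
(p,d,q) = (0,1,1):&\quad t_0 = 1 + 0 + 1 - 1 = 1,
  \text{ so no missing values are inserted.}
\end{aligned}
```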
The PRINT option controls printed output. If you specify monitoring, then at each cycle of the iterative process of estimation, Genstat prints the deviance for the current fitted model, together with the current estimates of model parameters. The format is simple with the minimum of description, to let you judge easily how quickly the process is converging. The other settings of PRINT control output at the end of the iterative process. If you specify model, the model is briefly described, giving the identifier of the series and the time-series model, together with the orders of the model. If you specify summary, the deviance of the final model is printed, along with the residual number of degrees of freedom. If you specify estimates, the estimates of the model parameters are printed in a descriptive format, together with their estimated standard errors and reference numbers. If you specify correlations, the correlations between estimates of parameters are printed, with reference numbers to identify the parameters.
The LIKELIHOOD option specifies the criterion that Genstat minimizes to obtain the estimates of the parameters: this is described in detail below. The default setting exact is recommended for most applications.
You can use the CONSTANT option to specify whether Genstat is to estimate the constant term c in the model. If CONSTANT=fix, the constant is held at the value given in the initial parameter values; this need not be zero.
The RECYCLE option allows a previous TFIT statement to continue; this can save computing time. If RECYCLE=yes, the most recent TFIT statement is continued, unless the SAVE option has been set to the save structure from some other TFIT statement. The SERIES and TSM settings are then taken from this previous TFIT statement: Genstat ignores any specified in the current statement. Most of the settings of other parameters and options are carried over from the previous statement, and new values are ignored. However, there are some exceptions. You can change the RESIDUALS variate, you can reset MAXCYCLE to the number of further iterations you require, and you can change the settings of TOLERANCE and PRINT. You can also change the values of the variate in the WEIGHTS option; you can thus get reweighted estimation. You can change the values of the SERIES itself, although you cannot change missing values; if the MVREPLACE option was previously set to yes, you must put the original missing values back into the SERIES variate before the new TFIT statement.
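A continued estimation of the kind described above might look like the following sketch. Firstfit is a hypothetical save-structure identifier and the iteration counts are arbitrary. SERIES and TSM are repeated in the second statement only for readability, since Genstat takes them from the earlier fit when RECYCLE=yes.

```genstat
" First run: a few iterations only, keeping a named save structure "
TFIT [PRINT=summary; MAXCYCLE=5; SAVE=Firstfit] SERIES=Daylength; TSM=Erp
" Continue that run for up to 10 further iterations, changing only PRINT and MAXCYCLE "
TFIT [RECYCLE=yes; SAVE=Firstfit; PRINT=estimates; MAXCYCLE=10] SERIES=Daylength; TSM=Erp
```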
The WEIGHTS option includes in the likelihood a weighted sum-of-squares term
$$\sum_{t=t_0}^{N} w_t a_t^2$$
where $w_t$, $t = 1, \dots, N$, are provided by the WEIGHTS variate. The values of $w_t$ must be strictly positive. If $t_0 < 1$, where $t_0 = 1+d+p-q$, then $w_t$ is taken as 1 for $t < 1$.
The MVREPLACE option allows you to request any missing values in the time-series to be replaced by their estimates after estimation. Genstat will always estimate the missing values, irrespective of the setting of MVREPLACE; so you can also obtain these estimates later from TKEEP.
The FIX option allows you to place simple constraints on parameter values throughout the estimation. The units of the FIX variate correspond to the parameters of the TSM, excluding the innovation variance. The values of the FIX variate are used to define the parameter constraints and must be integers. If an element of the FIX variate is set to 0, the corresponding parameter is constrained to remain at its initial setting. If an element is not 0, and the value is unique in the FIX variate, the parameter is estimated without any special constraint. If two or more values are equal, the corresponding parameters are constrained to be equal throughout the estimation. The number that you give to a parameter by FIX will appear as the reference number of the parameter in the printed model and correlation matrix. This option overrides any setting of CONSTANT and BOXCOXMETHOD.
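The sketch below illustrates how a FIX variate could be laid out for an ARIMA(2,0,0) model. It rests on an assumption not stated in this description: that the parameters, with the innovation variance excluded, are ordered as transformation parameter, constant, then the autoregressive parameters. Ar2 is a hypothetical TSM identifier.

```genstat
" Assumed element order of FIX: lambda, constant, phi1, phi2 "
TSM [MODELTYPE=arima] Ar2; ORDERS=!(2,0,0)
" 0 holds lambda and phi1 at their initial (default) values;
  1 and 2 are unique non-zero values, so the constant and phi2
  are estimated freely and printed with reference numbers 1 and 2 "
TFIT [FIX=!(0,1,0,2)] SERIES=Daylength; TSM=Ar2
```

Under the same assumption, FIX=!(0,1,2,2) would instead constrain the two autoregressive parameters to be equal throughout the estimation.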
The MAXCYCLE option specifies the maximum number of iterations to be performed.
The TOLERANCE option specifies the convergence criterion. Genstat decides that convergence has occurred if the fractional reduction in the deviance in successive iterations is less than the specified value, provided also that the search is not encountering numerical difficulties that force the step length in the parameter space to be severely limited. You can use monitoring to judge whether, for all practical purposes, the iterations have converged. Genstat gives warnings if the specified number of iterations is completed without convergence, or if the search procedure fails to find a reduced value of the deviance despite a very short step length. Such an outcome may be due to complexities in the likelihood function that make the search difficult, but can be due to your specifying too small a value for TOLERANCE.
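For instance, a fit that is slow to converge could be rerun with monitoring switched on and the iteration controls changed. The values 30 and 0.0001 below are arbitrary illustrations, not recommendations.

```genstat
" Print the deviance and current estimates at each cycle,
  allow more iterations and use a tighter convergence criterion "
TFIT [PRINT=monitoring,summary; MAXCYCLE=30; TOLERANCE=0.0001] \
   SERIES=Daylength; TSM=Erp
```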
The SAVE option allows you to save the time-series save structure produced by TFIT. You can use this in further TFIT statements with RECYCLE=yes, or in TFORECAST statements. It can also be used by the TDISPLAY and TKEEP directives. Genstat automatically saves the structure from the most recent TFIT statement, but this is over-written when the next TFIT statement is executed, unless you have used SAVE to give it an identifier of its own. You can access the current time-series save structure by the SPECIAL option of the GET directive, and reset it by the TSAVE option of the SET directive.
The METHOD option has four possible settings. The default setting is full, which gives the usual estimation to convergence or until the maximum number of iterations has been reached.
With the setting METHOD=initialize, TFIT carries out only the residual regeneration steps (that is, calculation of $a_t$ for $t = t_0, \dots, N$) which are needed before TFORECAST can be used. If the model has just been estimated using the default full setting, this is unnecessary. The setting initialize is useful when the time series is supplied with a known model and a minimal amount of calculation is wanted to prepare or initialize for forecasting. None of the model parameters are changed, and no standard errors of parameter estimates are available. Missing values in the series are estimated so this setting provides an efficient way of getting their values when the time series model is known; they can then be obtained using TKEEP. The deviance value is also available from TKEEP. This setting is therefore useful for efficient calculation of deviance values when you want to plot the shape of the deviance as a function of parameter values.
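A minimal sketch of initializing for forecasting with a known model follows. Known is a hypothetical TSM whose PARAMETERS already hold the known values, and Dev and Fcast are hypothetical identifiers. The DEVIANCE parameter of TKEEP and the MAXLEAD and FORECAST options of TFORECAST are written from memory of those directives and are not documented on this page, so they should be treated as assumptions and checked against their own descriptions.

```genstat
" Regenerate residuals only; no parameters are changed "
TFIT [PRINT=*; METHOD=initialize] SERIES=Daylength; TSM=Known
" Assumed TKEEP parameter: the deviance at the supplied parameter values "
TKEEP DEVIANCE=Dev
PRINT Dev
" Assumed TFORECAST options: 12 forecasts saved in Fcast "
TFORECAST [MAXLEAD=12; FORECAST=Fcast]
```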
With the setting METHOD=zerostep the effect is the same as for initialize except that TFIT also calculates the standard errors of the parameters as if they had just been estimated. These can be used together with other quantities available from TKEEP to construct confidence intervals and carry out tests on the parameter values, which remain unchanged except that the innovation variance in the ARIMA model is replaced by its estimate conditional on all other parameters.
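The following sketch shows one way the standard errors from METHOD=zerostep might be turned into approximate intervals. Known is a hypothetical TSM holding the parameter values of interest. The ESTIMATES and SE parameters of TKEEP are assumed here rather than taken from this page, and the multiplier 1.96 is the usual large-sample Normal approximation for a 95% interval.

```genstat
" No iterations: standard errors are computed at the supplied values "
TFIT [PRINT=estimates; METHOD=zerostep] SERIES=Daylength; TSM=Known
" Assumed TKEEP parameters for the estimates and their standard errors "
TKEEP ESTIMATES=Est; SE=Se
CALCULATE Lower = Est - 1.96 * Se
CALCULATE Upper = Est + 1.96 * Se
PRINT Est,Se,Lower,Upper
```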
The setting METHOD=onestep gives the same results as specifying the option MAXCYCLE=1 in TFIT. It is convenient for carrying out quick tests of model parameters.
To explain the LIKELIHOOD option, we need to describe the estimation of ARIMA models in more detail. You may want to skip this if you are doing fairly routine work.
The first step in deriving the likelihood for a simple model is to calculate
$$w_t = \nabla^d y_t - c, \qquad t = 1+d, \dots, N$$
This has a multivariate Normal distribution with dispersion matrix $V\sigma_a^2$, where $V$ depends only on the autoregressive and moving-average parameters. The likelihood is then proportional to
$$\{\sigma_a^{2m}\,|V|\}^{-1/2} \exp\{-w' V^{-1} w / (2\sigma_a^2)\}$$
where $m = N-d$. In practice Genstat evaluates this by using the formula
$$w' V^{-1} w = W + \sum_{t=t_0}^{N} a_t^2 = S$$
where $t_0 = 1+d+p-q$. The term $W$ is a quadratic form in the $p$ values $w_{1+d-q}, \dots, w_{p+d-q}$: it takes account of the starting-value problem for regenerating the innovations $a_t$, and avoids losing information as would happen if the process used only a conditional sum-of-squares function. If $q > 0$, Genstat introduces unobserved values of $w_{1+d-q}, \dots, w_d$ in order to calculate the sum $S$. Genstat uses linear least-squares to calculate these $q$ starting values for $w$, thus minimizing $S$. We shall call them back-forecasts, though if $p > 0$ they are actually computationally convenient linear functions of the proper back-forecasts. We shall call $S$ the sum-of-squares function: it is the sum of the quadratic form and the sum-of-squares term, and is identical to the value expressed by Box & Jenkins (1970) as
$$\sum_{t=-\infty}^{N} a_t^2$$
using infinite back-forecasting; that is, using:
$$W = \sum_{t=-\infty}^{t_0-1} a_t^2$$
The values $a_t$ for $t = t_0, \dots, N$ agree precisely with those of Box and Jenkins.
To clarify all this, consider examples with no differencing; that is, $d = 0$. If $p = 0$ and $q = 1$ then $W = 0$ and $t_0 = 0$, and one back-forecast $w_0$ is introduced. If $p = 1$ and $q = 0$ then $W = (1-\phi_1^2)w_1^2$ and $t_0 = 2$, and no back-forecasts are needed. If $p = q = 1$ then $W = (1-\phi_1^2)w_0^2$ and $t_0 = 1$, and so one back-forecast $w_0$ is needed. In this case the proper back-forecast is in fact $w_0/(1-\theta_1\phi_1)$.
The value of $|V|$ is a by-product of calculating $W$ and the back-forecast. For example, if $p = 0$ and $q = 1$, then
$$|V| = 1 + \theta_1^2 + \dots + \theta_1^{2N}$$
If $p = 1$ and $q = 0$,
$$|V| = 1 / (1 - \phi_1^2)$$
and if $p = q = 1$,
$$|V| = 1 + (\phi_1 - \theta_1)^2 (1 + \theta_1^2 + \dots + \theta_1^{2N-2}) / (1 - \phi_1^2)$$
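The first two determinant formulas can be checked by hand for the smallest non-trivial case, $N = 2$ with $d = 0$. The sketch below assumes the moving-average convention $w_t = a_t - \theta_1 a_{t-1}$ and the autoregressive convention $w_t = \phi_1 w_{t-1} + a_t$, and writes $V$ as the covariance matrix of $(w_1, w_2)$ divided by $\sigma_a^2$.

```latex
\begin{aligned}
\text{MA(1):}\quad
V &= \begin{pmatrix} 1+\theta_1^2 & -\theta_1 \\ -\theta_1 & 1+\theta_1^2 \end{pmatrix},
&|V| &= (1+\theta_1^2)^2 - \theta_1^2 = 1 + \theta_1^2 + \theta_1^4 ,\\[4pt]
\text{AR(1):}\quad
V &= \frac{1}{1-\phi_1^2}\begin{pmatrix} 1 & \phi_1 \\ \phi_1 & 1 \end{pmatrix},
&|V| &= \frac{1-\phi_1^2}{(1-\phi_1^2)^2} = \frac{1}{1-\phi_1^2} .
\end{aligned}
```

Both results agree with the general expressions when $N = 2$: the moving-average sum runs up to $\theta_1^{2N} = \theta_1^4$, and the autoregressive determinant does not depend on $N$.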
Concentrating the likelihood over $\sigma_a^2$ by setting $\sigma_a^2 = S/m$ yields a value proportional to $\{|V|^{1/m} S\}^{-m/2}$.
The default setting of the LIKELIHOOD option is exact. In this case the concentrated likelihood is maximized, by minimizing the quantity
$$D = |V|^{1/m} S$$
which is called the deviance.
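The concentration step can be spelled out as follows, writing $L$ for the likelihood given earlier with $w' V^{-1} w = S$. This is a sketch of the standard argument rather than a description of Genstat's internal calculations.

```latex
\begin{aligned}
\log L(\sigma_a^2) &= \text{const} - \tfrac{m}{2}\log\sigma_a^2
   - \tfrac{1}{2}\log|V| - \frac{S}{2\sigma_a^2}, \\
\frac{\partial \log L}{\partial \sigma_a^2}
   &= -\frac{m}{2\sigma_a^2} + \frac{S}{2\sigma_a^4} = 0
   \quad\Longrightarrow\quad \hat\sigma_a^2 = \frac{S}{m}, \\
L(\hat\sigma_a^2) &\propto \left(\frac{S}{m}\right)^{-m/2} |V|^{-1/2} e^{-m/2}
   \;\propto\; \{\,|V|^{1/m} S\,\}^{-m/2} = D^{-m/2}.
\end{aligned}
```

Maximizing the concentrated likelihood is therefore the same as minimizing $D$, and $-2\log L(\hat\sigma_a^2)$ equals $m\log D$ plus a constant that does not depend on the model parameters.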
The setting leastsquares specifies that Genstat is to minimize only the sum-of-squares term $S$. This criterion corresponds to the back-forecasting sum-of-squares used by Box & Jenkins (1970), and will in many cases give estimates close to those of the exact likelihood. However, some discrepancy arises if the series is short or the model is close to the invertibility boundary. This is because of limitations on the back-forecasting procedure, as described in the algorithms of Box & Jenkins (1970). The deviance value $D$ that Genstat prints is, with this setting, simply $S$.
When you use exact likelihood, the factor $|V|^{1/m}$ reduces bias in the estimates of the parameters; you would get bias if you used leastsquares instead. However, $|V|^{1/m}$ is generally close to one, unless the series is short or the model is either seasonal or close to the boundaries of invertibility or stationarity. The leastsquares setting is therefore adequate for most long, non-seasonal sets of data; using it may reduce the computation time by up to 50%. When you specify that Genstat is to estimate the parameter λ of the Box-Cox transformation, Genstat also includes the Jacobian of the transformation in the likelihood function. The result is an extra factor $G^{-2(\lambda-1)}$ in the definition of the deviance, $G$ being the geometric mean of the data,
$$G = \Bigl(\prod_{t=1}^{N} y_t\Bigr)^{1/N}$$
Note that this is not included unless λ is being estimated, even if λ≠1.
You can treat differences in $N\log(D)$ as a chi-square variable in order to test nested models: this is supported by asymptotic theory, and by experience with models that have moderately large sample sizes. Similarly, you can select between different models by using $N\log(D)+2k$ as an information criterion, $k$ being the number of estimated parameters. But both of these test procedures are questionable if the estimated models are close to the boundaries of invertibility or stationarity. Provided all the models that are being compared have the same orders of differencing, with the differenced series being of length $m$, it is recommended that $m\log(D)$ be used rather than $N\log(D)$ in these tests since $m\log(D)$ is precisely minus two multiplied by the log-likelihood as defined above.
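A comparison of two nested models along these lines might be sketched as below. All identifiers are hypothetical: Y is the series, Small and Big are TSMs with the same differencing (assumed here to be a single difference, d=1), and Big has extra parameters. The DEVIANCE parameter of TKEEP is assumed rather than documented on this page.

```genstat
" Fit the smaller model and keep its deviance "
TFIT [PRINT=summary] SERIES=Y; TSM=Small
TKEEP DEVIANCE=Dsmall
" Fit the larger model and keep its deviance "
TFIT [PRINT=summary] SERIES=Y; TSM=Big
TKEEP DEVIANCE=Dbig
" m = N - d, assuming d = 1 for both models "
CALCULATE m = NVALUES(Y) - 1
" Difference in m*log(D), to be referred to chi-square with
  degrees of freedom equal to the number of extra parameters "
CALCULATE Chisq = m * (LOG(Dsmall) - LOG(Dbig))
PRINT Chisq
```

The same two deviances give the information criterion for each model by adding twice its number of estimated parameters to m*LOG(D).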
The setting marginal is relevant mainly when TFIT is used for regression with ARIMA errors. (This requires a TRANSFERFUNCTION statement beforehand to specify the explanatory variables.) The likelihood for the model is defined as that of the univariate error series $e_t$ which is defined in general by
$$e_t = y_t - b_1 x_{1,t} - \dots - b_m x_{m,t}$$
(the $x_i$ being $m$ explanatory variables). The constant term therefore appears in the model after any differencing of $e_t$; for example
$$\nabla e_t = c + (1 - \theta_1 B)\,a_t$$
You can get bias in the estimates of the parameters of an ARIMA model because the regression is estimated at the same time. You can guard against this by specifying LIKELIHOOD=marginal. This can be particularly important if the series are short or if you use many explanatory variables (Tunnicliffe Wilson 1989). The deviance is now defined as
$$D = S\,\bigl(|X' V^{-1} X|\,|V|\bigr)^{1/m}$$
where $m$ is reduced by the number of regressors (including the constant term) and the columns of $X$ are the differenced explanatory series: the other terms are as in the exact likelihood.
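A regression with ARIMA errors fitted by marginal likelihood might be set up as in the sketch below. Y, X1 and X2 are hypothetical variates, Err is a hypothetical TSM for the error series with arbitrarily chosen orders, and the form of the TRANSFERFUNCTION statement (explanatory variates supplied through a SERIES parameter, with no transfer-function models) is an assumption to be checked against that directive's own description.

```genstat
" Assumed form: explanatory variables for a simple regression "
TRANSFERFUNCTION SERIES=X1,X2
" Error model: differenced series with one moving-average parameter "
TSM [MODELTYPE=arima] Err; ORDERS=!(0,1,1)
" Marginal likelihood guards against bias from estimating the regression "
TFIT [LIKELIHOOD=marginal] SERIES=Y; TSM=Err
```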
You can use the marginal setting also for univariate ARIMA modelling, when the constant term is the only explanatory term. Furthermore, Genstat deals with missing values in the response variate by doing a regression on indicator variates; these too are included in the $X$ matrix. However, you cannot use marginal likelihood and estimate a transformation parameter in either the transfer-function model or an ARIMA model. Neither can you use it if you set the FIX option in TFIT. In these cases Genstat automatically resets the LIKELIHOOD option to exact.
At every iteration with the setting LIKELIHOOD=marginal, the regression coefficients are the maximum-likelihood estimates conditional upon the estimated values of the parameters of the ARIMA model: these are also the generalized least-squares estimates, conditioned in the same way. This is so even if MAXCYCLE=0; that is, the coefficients of the regression are re-estimated even at iteration 0. Therefore you must not use the marginal setting with the option METHOD=initialize to initialize for TFORECAST. You can compare deviance values that were obtained using marginal likelihood only for models with the same explanatory variables and the same differencing structure in the error model.
Options: PRINT, LIKELIHOOD, CONSTANT, RECYCLE, WEIGHTS, MVREPLACE, FIX, METHOD, MAXCYCLE, TOLERANCE, SAVE.
Parameters: SERIES, TSM, BOXCOXMETHOD, RESIDUALS.
Action with RESTRICT
The SERIES variate can be restricted, but this must be to a contiguous set of units.
References
Box, G.E.P. & Jenkins, G.M. (1970). Time Series Analysis, Forecasting and Control. Holden-Day, San Francisco.
Tunnicliffe Wilson, G. (1989). On the use of marginal likelihood in time-series model estimation. Journal of the Royal Statistical Society, Series B, 51, 15-27.
See also
Directives: TSM, FTSM, TRANSFERFUNCTION, TDISPLAY, TFILTER, TFORECAST, TKEEP, TSUMMARIZE, CORRELATE, FOURIER.
Procedures: BJESTIMATE, BJFORECAST, BJIDENTIFY, MOVINGAVERAGE, PERIODTEST, PREWHITEN, REPPERIODOGRAM, SMOOTHSPECTRUM.
Commands for: Time series.
Example
" Example TFIT-1: Fitting a seasonal ARIMA model"
VARIATE time; VALUES=!(1...120)
FILEREAD [NAME='%gendir%/examples/TFIT-1.DAT'] apt
" Display the correlation structure of the logged data"
CALCULATE lapt = LOG(apt)
BJIDENTIFY [GRAPHICS=high; WINDOWS=!(5,6,7,8)] lapt
" Calculate the autocorrelations of the differences and seasonally differenced series"
CALCULATE ddslapt = DIFFERENCE(DIFFERENCE(lapt; 12); 1)
CORRELATE [PRINT=auto; MAXLAG=48] ddslapt; AUTO=ddsr
" Define a model for the series: IMA(1) (that is, a model with a single
  moving-average parameter applied to the differences of the series)
  plus a seasonal IMA(1) component"
TSM [MODELTYPE=arima] airpass; ORDERS=!((0,1,1)2,12)
" Form preliminary estimates of the parameters, using a log transformation
  (BOXCOX=0 is equivalent to log)"
FTSM [PRINT=model] airpass; ddsr; BOXCOX=0
" Get the best estimates, fixing the constant"
TFIT [CONSTANT=fix] SERIES=apt; TSM=airpass
" Graph the residuals against time"
TKEEP RESID=resids
DGRAPH [WINDOW=3; KEYWINDOW=0; TITLE='Residuals vs Time'] resids; time
" Test the independence of the residuals"
CORRELATE [GRAPH=auto; MAXLAG=48] resids; TEST=S
PRINT 'Test statistic for independence of the residuals',S