CORRELATE directive

Forms correlations between variates, autocorrelations of variates, and lagged cross-correlations between variates.

Options

`PRINT` = string tokens	What to print (`correlations`, `autocorrelations`, `partialcorrelations`, `crosscorrelations`); default `*`
`GRAPH` = string tokens	What to display with graphs (`autocorrelations`, `partialcorrelations`, `crosscorrelations`); default `*`
`MAXLAG` = scalar	Maximum lag for results; default `*` i.e. value inferred from variates to save results
`CORRELATIONS` = symmetric matrix	Stores the correlations between the variates specified by the `SERIES` parameter

Parameters

`SERIES` = variates	Variates from which to form correlations
`LAGGEDSERIES` = variates	Series to be lagged to form crosscorrelations with first series
`AUTOCORRELATIONS` = variates	To save autocorrelations, or to provide them to form partial autocorrelations if `SERIES`=*
`PARTIALCORRELATIONS` = variates	To save partial autocorrelations
`CROSSCORRELATIONS` = variates	To save crosscorrelations
`TEST` = scalars	To save test statistics
`VARIANCES` = variates	To save prediction error variances
`COEFFICIENTS` = variates or matrices	To save prediction coefficients: in a variate to keep only those for the maximum lag, or in a matrix to keep the coefficients for all lags up to the maximum

Description

The most straightforward use of the CORRELATE directive is to calculate correlation coefficients between a set of variates. For example, the following statement displays the correlations between the variates Age, Height and Weight as a lower-triangular matrix.

CORRELATE [PRINT=correlations; CORRELATIONS=Corr]\
  Age,Height,Weight

The correlations are also saved in the symmetric matrix Corr using the CORRELATIONS option. Note that, if there are missing values, CORRELATE uses only those units where none of the variates is missing.
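The treatment of missing values described above can be illustrated outside Genstat. The following is a minimal Python sketch, not Genstat's implementation: the function name `complete_case_correlations`, the use of `None` for a missing value, and the sample data are all assumptions made for illustration.

```python
import math

def complete_case_correlations(columns):
    """Pearson correlations using only units where no variate is missing.

    `columns` maps a variate name to a list of values; None marks a
    missing value (an illustrative convention, not Genstat's).
    Returns a dict keyed by (name_a, name_b).
    """
    names = list(columns)
    n_units = len(columns[names[0]])
    # Keep a unit only if every variate has a value for it.
    keep = [i for i in range(n_units)
            if all(columns[name][i] is not None for name in names)]
    data = {name: [columns[name][i] for i in keep] for name in names}
    means = {name: sum(v) / len(v) for name, v in data.items()}

    def pair_corr(a, b):
        sab = sum((p - means[a]) * (q - means[b])
                  for p, q in zip(data[a], data[b]))
        saa = sum((p - means[a]) ** 2 for p in data[a])
        sbb = sum((q - means[b]) ** 2 for q in data[b])
        return sab / math.sqrt(saa * sbb)

    return {(a, b): pair_corr(a, b) for a in names for b in names}

# Made-up data: units 3 and 4 each have a missing value and are dropped
# for every pair, including Age and Weight where both values are present.
example = {
    "Age":    [20, 25, 30, None, 40],
    "Height": [160, 165, None, 175, 180],
    "Weight": [55, 60, 70, 72, 80],
}
result = complete_case_correlations(example)
```

Note that a unit is dropped for all pairs of variates as soon as any one variate is missing, so every entry of the matrix is based on the same set of units.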

CORRELATE can also be used to obtain autocorrelations of a time series, that is, the correlations between values in the series lagged by particular time intervals. The set of autocorrelations for all possible lags is the autocorrelation function. You can derive the partial autocorrelation function from these. To look at the relationship between two series, you should use the cross-correlation function between one series and the other lagged by the various intervals.

The sample autocorrelation function of a series can be displayed either as a table of numbers or as a graph, called a correlogram. In either case, you must specify the maximum lag for which the autocorrelation is to be calculated, say m. You can do this either by setting the MAXLAG option to m, or by pre-defining the length of a variate to be m+1 and including it in the AUTOCORRELATIONS parameter to store the calculated values. Genstat includes the autocorrelation at lag 0 in the autocorrelation function; this is always unity. The formula used for the sample autocorrelation at lag k is

r_k = (1 – k/n) × C_k / C₀

where

C_k = (1 / n_k) ∑_{t = 1 … n–k} (y_t – mean(y)) (y_{t+k} – mean(y))

The number n_k is the number of terms included in the sum. The series can contain missing values, but the calculation excludes any product that involves any missing values at all. You can restrict a series, but the restricted set must consist of a contiguous set of units. Thus, you can look at the autocorrelation function derived from just the first section of a series, or from just the last section, or from a section in the middle; but you cannot use restriction to exclude a section from the middle of the series, or to exclude just individual observations.
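As a rough illustration of the formula for r_k, here is a small Python sketch. It is not Genstat code and makes two assumptions that the text does not state explicitly: n is taken as the full length of the series, and mean(y) is taken over the non-missing values. The function name and the `None` convention for missing values are hypothetical.

```python
def sample_autocorrelations(y, max_lag):
    """r_k = (1 - k/n) * C_k / C_0 for k = 0..max_lag; None marks missing."""
    n = len(y)  # assumption: n counts every unit in the series
    present = [v for v in y if v is not None]
    mean = sum(present) / len(present)  # assumption: mean of non-missing values

    def c(k):
        # Products over t = 1..n-k, skipping any product that involves
        # a missing value; len(terms) plays the role of n_k.
        terms = [(y[t] - mean) * (y[t + k] - mean)
                 for t in range(n - k)
                 if y[t] is not None and y[t + k] is not None]
        return sum(terms) / len(terms)

    c0 = c(0)
    return [(1 - k / n) * c(k) / c0 for k in range(max_lag + 1)]

series = [1.0, 2.0, 3.0, 4.0, 5.0]
acf = sample_autocorrelations(series, max_lag=2)
# acf is approximately [1.0, 0.4, -0.1]
```

When there are no missing values, n_k = n–k, so the factor (1 – k/n) cancels the divisor and r_k reduces to the familiar ratio of the lagged sum of products to the sum of squares. The sketch raises a division error if every product at some lag involves a missing value.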

The AUTOCORRELATIONS parameter allows you to save the calculated autocorrelations. If you want to display a correlogram in a different form from the standard one produced by the GRAPH option, you must save the autocorrelations and plot them explicitly using either the GRAPH or DGRAPH directives. You will then need to define the variate of lags from 0 to m.

The TEST parameter of CORRELATE allows you to save a statistic that can be used to test the hypothesis that the true autocorrelation is zero for positive lags. It is defined as

S = n ∑_{k = 1 … m} r_k²

Provided n (the number of data values) is large and m (the maximum lag) is much smaller than n, then under the null hypothesis, the statistic has a chi-square distribution with m degrees of freedom. Thus, a large value provides evidence of autocorrelation in a time series.
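The arithmetic of the statistic S can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and the input values are far too few for the chi-square approximation to apply, so they serve only to check the calculation.

```python
def autocorrelation_test_statistic(acf, n):
    """S = n * sum of r_k**2 for k = 1..m.

    `acf` holds r_0, r_1, ..., r_m, so the lag-0 value is skipped.
    """
    return n * sum(r * r for r in acf[1:])

acf = [1.0, 0.4, -0.1]   # r_0, r_1, r_2, so m = 2
s = autocorrelation_test_statistic(acf, n=5)
# s is approximately 0.85, to be compared with chi-square on m = 2 d.f.
```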

You can calculate autocorrelation functions for several series in one statement by specifying several variates with the SERIES parameter.

Genstat forms partial autocorrelations from an autocorrelation function. The value at lag k is defined as

corr( y_t, y_{t–k} | y_{t–1}, y_{t–2}, …, y_{t–k+1} )

representing the excess correlation between values separated by k timepoints that is not accounted for by the intermediate points; it is denoted by φ_{k,k} because it is also the value of the last in the set of coefficients in the autoregressive prediction equation:

y_t = c + φ_{k,1} y_{t–1} + … + φ_{k,k} y_{t–k} + e_{k,t}

Genstat calculates these coefficients recursively for k=1…m by

φ_{k,k} = ( r_k – φ_{k–1,1} r_{k–1} – … – φ_{k–1,k–1} r_1 ) / v_{k–1}

φ_{k,j} = φ_{k–1,j} – φ_{k,k} φ_{k–1,k–j},  j = 1 … k–1

v_k = v_{k–1} (1 – φ_{k,k}²)

It starts with v₀=1, the quantity v_k being the kth order prediction error variance ratio

variance(e_{k,t}) / variance(y_t).
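The recursion above can be written out as a short Python sketch. It follows the three update formulas directly, starting from v_0 = 1; the function name and the layout of the returned lists are assumptions made for illustration, and the sketch does not reproduce Genstat's checks for invalid autocorrelations.

```python
def partial_autocorrelations(r, max_lag):
    """Apply the recursion to autocorrelations r[0..max_lag] (r[0] = 1).

    Returns (pacf, variances, coefficients) where pacf[k] is phi_{k,k}
    (with pacf[0] = 1 by convention), variances[k] is v_k, and
    coefficients[k] is [1, phi_{k,1}, ..., phi_{k,k}].
    """
    pacf = [1.0]
    variances = [1.0]        # v_0 = 1
    coefficients = [[1.0]]
    prev = []                # phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, max_lag + 1):
        # phi_{k,k} = (r_k - sum_j phi_{k-1,j} r_{k-j}) / v_{k-1}
        num = r[k] - sum(prev[j - 1] * r[k - j] for j in range(1, k))
        phi_kk = num / variances[k - 1]
        # phi_{k,j} = phi_{k-1,j} - phi_{k,k} phi_{k-1,k-j}, j = 1..k-1
        current = [prev[j - 1] - phi_kk * prev[k - j - 1]
                   for j in range(1, k)]
        current.append(phi_kk)
        # v_k = v_{k-1} (1 - phi_{k,k}^2)
        variances.append(variances[k - 1] * (1 - phi_kk ** 2))
        pacf.append(phi_kk)
        coefficients.append([1.0] + current)
        prev = current
    return pacf, variances, coefficients

# Autocorrelations of a first-order autoregressive pattern, r_k = 0.5**k.
r = [1.0, 0.5, 0.25, 0.125]
pacf, variances, coefficients = partial_autocorrelations(r, max_lag=3)
```

For this input the partial autocorrelations beyond lag 1 are zero and the variance ratio stays at 0.75 after the first step, which is what one expects when a single lagged value accounts for all of the autocorrelation.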

Partial correlations provide a valuable alternative way of displaying the autocorrelation structure of a series. You can display the partial autocorrelation function either as a table of numbers, or as a graph. Two methods are available for doing this. You can supply the series using the SERIES parameter, in which case the autocorrelations are formed first, automatically, and the partial autocorrelations are then derived from them. Alternatively, you can set SERIES=*, and provide the autocorrelations using the AUTOCORRELATIONS parameter. You can specify the maximum lag, either by setting the MAXLAG option, or by pre-defining the length of a variate specified for either the AUTOCORRELATIONS or the PARTIALCORRELATIONS parameter.

You can save the partial autocorrelation function using the PARTIALCORRELATIONS parameter. You can set the VARIANCES and COEFFICIENTS parameters to variates to save the prediction-error variances v_0 … v_m, and the prediction coefficients 1, φ_{m,1} … φ_{m,m} for the maximum lag m. Genstat sets the first coefficient to 1, and also the first element of the partial autocorrelation sequence to 1: you should find this to be a useful convention for the lag 0 values. Alternatively, if the COEFFICIENTS parameter is set to a matrix structure, the rows of this matrix will be used to save the prediction coefficients for all the orders up to the maximum lag.

CORRELATE will print a warning if you include missing values in an autocorrelation function that you have supplied, or if for some other reason the autocorrelations are invalid. In particular, if a partial autocorrelation value is obtained outside the range (-1, 1), Genstat will truncate the sequence at the previous lag.

You can calculate cross-correlations between two series by specifying one series with the SERIES parameter and the other with the LAGGEDSERIES parameter. You must define the maximum lag, as for autocorrelations, and you can again plot or tabulate the resulting function. Missing values are allowed, as for autocorrelations. Genstat calculates the sample cross-correlation between the first series x_t and the lagged series y_t at lag k using:

r_k = (1 – k/n) C_k / (s_x s_y)

where

C_k = (1 / n_k) ∑_{t = 1 … n–k} (x_t – mean(x)) (y_{t+k} – mean(y))

The series x_t and y_t may be of different lengths. The summation includes all possible terms, but excludes any product containing missing values; the number n_k is the number of terms included in the sum. The values mean(x) and mean(y) are the sample means, and s_x, s_y are the sample standard deviations. The number n is the minimum of the number of values of x and of y, excluding missing values. You can restrict either series to a set of contiguous units: if both are restricted, their restrictions must match.
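The cross-correlation formula can be sketched in Python in the same spirit. Several details here are assumptions rather than documented Genstat behaviour: the standard deviations use the divisor n (so that correlating a series with itself gives back its autocorrelations), the means are taken over non-missing values, and only non-negative lags are computed. The function name and the `None` convention for missing values are hypothetical.

```python
import math

def sample_cross_correlations(x, y, max_lag):
    """r_k = (1 - k/n) * C_k / (s_x * s_y) for k = 0..max_lag."""
    xs = [v for v in x if v is not None]
    ys = [v for v in y if v is not None]
    n = min(len(xs), len(ys))  # minimum count of non-missing values
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Assumption: standard deviations computed with divisor n, not n-1.
    s_x = math.sqrt(sum((v - mean_x) ** 2 for v in xs) / len(xs))
    s_y = math.sqrt(sum((v - mean_y) ** 2 for v in ys) / len(ys))

    def c(k):
        # All available products x_t * y_{t+k}, skipping missing values;
        # len(terms) plays the role of n_k.
        terms = [(x[t] - mean_x) * (y[t + k] - mean_y)
                 for t in range(min(len(x), len(y) - k))
                 if x[t] is not None and y[t + k] is not None]
        return sum(terms) / len(terms)

    return [(1 - k / n) * c(k) / (s_x * s_y) for k in range(max_lag + 1)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
ccf = sample_cross_correlations(x, x, max_lag=2)
# With both series equal, ccf is approximately [1.0, 0.4, -0.1]
```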

You can save the cross-correlation function using the CROSSCORRELATIONS parameter. You can also save a test statistic using the TEST parameter; it is used in the same way as the statistic for autocorrelations, to test for lack of lagged cross-correlation in one direction of the relationship between two series. However, the test is valid only if each of the series has a zero autocorrelation function. Cross-correlations take precedence in the storage: if you request both autocorrelations and cross-correlations in a single CORRELATE statement, the stored test statistic will relate to the cross-correlations, and that for the autocorrelations will not be stored.

Options: PRINT, GRAPH, MAXLAG, CORRELATIONS.

Parameters: SERIES, LAGGEDSERIES, AUTOCORRELATIONS, PARTIALCORRELATIONS, CROSSCORRELATIONS, TEST, VARIANCES, COEFFICIENTS.

Action with `RESTRICT`

You can restrict the units involved in the calculation of the correlations by restricting either the SERIES variate, or the LAGGEDSERIES variate (if present). For the calculation of autocorrelations, partial-correlations or cross-correlations, the restriction must define a contiguous set of units. If SERIES and LAGGEDSERIES are both restricted, they must be restricted in exactly the same way.

Example

" Example CORR-1: Calculate the acf and pacf of a series"

FILEREAD [NAME='%gendir%/examples/CORR-1.DAT'] Y
CALCULATE n = NVALUES(Y)
VARIATE [VALUES=1...n] X1

" Form the differenced and doubly differenced series"
CALCULATE Dy,Ddy = DIFF(Y,Dy; 1)

" Display the acf and pacf of the series"
VARIATE [VAL=0...50] Lag
UNIT Lag
CORRELATE [MAX=50; GRAPH=a,p] Y

" Print and save the acf and pacf of the diff and 2-diff series"
CORRELATE [MAX=50; PRINT=a,p] SERIES=Dy,Ddy; AUTO=Acfdy,Acfddy;\
  PARTIAL=Pacfdy,Pacfddy

Updated on June 20, 2019
