Fits a Generalized Pareto distribution to the observations in a variate above a given threshold. (D.B. Baird)
The Generalized Pareto distribution asymptotically models the conditional upper tail of any distribution with a maximum-stable distribution. The conditional expectation of the observations that exceed a given threshold is modelled. The choice of a suitable threshold is an important part of the modelling of the tail. The threshold needs to be chosen large enough, so that the asymptotic approximation to the tail is good, but small enough so that there is enough information to estimate the parameters reliably (i.e. enough observations lie above the threshold.) The mean residual life plot and parameter stability plots can be used to help in the choice of the threshold to be used. A time varying threshold can also be used, to account for non-stationary sequences. Then if the excess over the threshold is stationary, the Generalized Pareto distribution can be successfully used.
The modelling of tail is used to answered questions about how often in the future values larger than a certain value may occur. By modelling the distribution, if the assumptions are true, then a smaller data series may be extrapolated to an event occurring more rarely than the length of the series. For example, if you want to estimate a 1 in 100 year event, but only have 40 years of data, you must extrapolate. This approach also provides some smoothing of the distribution so that the information is taken from all values, and not just the most extreme few events that have occurred in a short series. The value which is exceeded 1 in n years is known as the return level, and 1/n is the return probability.
It is assumed that the observations are identically and independently distributed, and the process generating the observations is stationary. Departures from these assumptions, for example, that there is a trend or correlation over the series of observations, can affect the validity of the analysis. Dependence can be allowed for in the analysis by setting the DEPENDENCE option.
The Generalized Pareto distribution has two parameters, a scale parameter, sigma, and the shape parameter, eta. The Generalized Pareto cumulative distribution function (CDF) is given as:
G(x) = 1 - [1 + eta*x/sigma]**(-1/eta), for {x: 1 + eta*x/sigma > 0}.
For eta = 0, this is undefined, but in the limit, as eta tends to 0, this gives the usual Exponential CDF of:
G(x) = 1 - exp[-x/sigma], for x > 0.
The lower tail can also be modelled by changing the sign of the variate.
Options
PRINT = strings | What to print (model, estimates, tests, fittedvalues, monitoring, all); default – (model, estimates, tests) |
---|---|
PLOT = strings | Which graphs to plot, (reslife, stability, qq, pp, etaprofile, density, returns, rprofile, clusters, all); default, (qq, returns, density) |
DEPENDENCE = string | Whether to allow for dependencies (yes, no) in the series; default no. If yes, data will be clustered, and the clusters replaced by the maximum value in the cluster |
ENVELOPE = string | Exclude confidence envelopes on plots (yes, no); default, yes |
CIMETHOD = string | How to calculate confidence limits (profilelikelihood, quadratic); default – quadratic |
ALPHA = scalar | Significance Level on confidence bands; default 0.95 |
ETA = scalar | Value of eta to use in the distribution; default * – estimate eta by maximum likelihood |
TITLE = text | Text containing the title for each plot (each row is a title) |
WINDOW = scalar | Which window to plot the graphs in; default 3 |
Parameters
DATA = variate | Variate containing Data to fit the GEV distribution to |
---|---|
THRESHOLD = scalar or variate | Variate or scalar giving threshold over which the exceedances are to be modelled |
PERIODLENGTH = scalar | The number of units to group together, when interpreting or estimating the return probabilities given in PROBABILITY |
GAPLENGTH = scalar | The number of units that must be below the threshold for a new cluster to be started, when DEPENDENCE=yes |
ESTIMATES = variate | Estimated values for MU,SIGMA and ETA and TREND and GROUPS parameters |
SE = variate | Standard errors for MU,SIGMA and ETA and TREND and GROUPS parameters |
RETURNS = variate | If SET for input, the values to estimate the return probabilities for, otherwise the estimated return periods for values in PROBABILITY |
PROBABILITY = variate | If SET for input, the values to estimate the return levels for, otherwise the estimated probabilities for values in RETURNS |
LOWER = variate | Lower limit of ALPHA confidence interval for either the estimated values of RETURNS or PROBABILITY depending which values have been input |
UPPER = variate | Upper limit of ALPHA confidence interval for either the estimated values of RETURNS or PROBABILITY depending which values have been input |
LOGLIKELIHOOD = scalar | Log-likelihood of fitted distribution |
EXIT = scalar | The exit code from FITNONLINEAR used to obtain the maximum likelihood estimates |
Options: PRINT, PLOT, DEPENDENCE, ENVELOPE, CIMETHOD, ALPHA, ETA, TITLE, WINDOW
Parameters: DATA, THRESHOLD, PERIODLENGTH, GAPLENGTH, ESTIMATES, SE, RETURNS, PROBABILITY, LOWER, UPPER, LOGLIKELIHOOD, EXIT
Description
The GPARETO procedure fits the Generalized Pareto distribution by maximum likelihood.
The PRINT option controls what results are displayed from the analysis (model, estimates, tests, fittedvalues, monitoring). Setting PRINT=all will display all available results. The model setting gives the function for the CDF and details of the model being fitted; the estimates gives the estimated values for mu, sigma and eta; the tests setting gives a likelihood ratio test of eta being zero and a goodness of fit test for data following the Generalized Pareto distribution; the fittedvalues setting a table of the observed return levels and the fitted values for each of these, and the monitoring setting gives a trace of the parameters and likelihood from any profile likelihoods being estimated (which can be quite slow).
The PLOT option controls which graphs are displayed from the analysis (reslife, stability, qq, pp, etaprofile, density, returns, rprofile, clusters). Setting PLOT=all will generate all available graphs. The reslife setting produces the mean residual life plot which is a plot of the mean of the values above a given threshold minus the threshold, over the range of possible threshold values. If the tail follows a Generalized Pareto distribution, then above some threshold level where the asymptotics begin to apply, the mean residual life graph should be roughly linear. The stability setting produces a plot of eta and an adjusted form of sigma (sigma* = sigma + eta*threshold) for a range of possible threshold values. These parameters should be constant above some threshold level where the asymptotics begin to apply. Thus use of these two plots can assist in selecting the threshold value. The quantile-quantile (qq) and probability-probability (pp) graphs are produced using the DPROBABILITY directive. The etaprofile and rprofile settings give profile likelihood plots for eta and the return levels respectively. The density setting gives a combined histogram, dot plot and probability density function on the same graph. The returns setting gives a plot of the data and estimated return levels against the return period. The return period is the reciprocal of the return probability (an n year return period = a 1 in year return probability). The clusters setting produces a graph of theta, the reciprocal of the mean cluster size for a range of thresholds. If there is a large departure between the observed value of theta and its expected value this may be a indication that there is dependence/correlation in the series. Normally as the threshold is increased, the level of dependence in the series decreases. The TITLE option provides titles for the selected plots. Each line in the specified text structure provides a title for the plots. The WINDOW option specifies the FRAME to plot the graphs within.
The THRESHOLD parameter is used to set the threshold, above which the tail is modelled. This may be a variate for a time dependent threshold or scalar for a constant value. For example, with a series of temperatures over a year, a time dependent threshold may be a periodic curve giving the expected value of a quantile over the year. If you are not certain of the threshold to use, a missing value, *, can be entered, in which case the analysis will produce the mean residual life and stability plots for sigma and eta to help you in your selection of the threshold. Where a variable threshold is used, all return levels will be interpreted or calculated as the excess over the threshold.
The DEPENDENCE option can be set to yes when the series you are modelling has positive auto-correlation or dependencies between adjacent elements. The presence of dependence can bias the estimates of return values or probabilities. When this occurs, the values above the threshold will tend to occur in clusters. This bias can be overcome by replacing clusters of values above the threshold with a single value, the maximum of the cluster. The division between clusters is defined using the GAPLENGTH parameter. Units above the threshold are grouped together into a single cluster, until the gap between values above the threshold exceeds the specified gap size. For example the series: 7, 9, 1, 9, 5, 2, 7, 8, 7, 4, 3 using a threshold of 6, has the status of being above the threshold of: 1,1,0,1,0,0,1,1,1,0,0 and the clusters above the threshold would be (7,9), (9), (7,8,7) using a gap size of 1, giving the series 9,9,7 when the maximum of the clusters is taken. If a gap size of 2 was used, the first two clusters would coalesce to give cluster of (7,9,9) as the gap between the two 9s is only one unit (< 2), giving the series 9,8 when the maxima are taken. When DEPENDENCE=yes, the value Theta is printed by the analysis. Theta is the reciprocal of the mean cluster size. A value of theta close to 1 indicates all observations over the threshold occur in isolation, and a small value of theta indicates that the observations exceeding the threshold are strongly clustered together.
The ENVELOPE option can be set to control whether confidence bands are include on the graphs. The ALPHA option controls the confidence limits used (default 95%) for the bands and for the estimated return levels, probabilities and profile likelihoods. The CIMETHOD option controls whether approximate or profilelikelihood confidence limits are estimated for the return levels. If profile likelihood limits are chosen a profile likelihood curve for the return levels is calculated. Profile likelihood confidence limits can be very slow to calculate, but are more accurate. The PLOT=rprofile setting is only active if CIMETHOD=exact. Profile likelihood confidence limits are never available for estimated return probabilities.
The ETA option can be used to force a given value of eta to be used for the distribution. A common setting for this would be ETA=0 which forces the Exponential distribution to be used.
The EXIT parameter allows the exit code from FITNONLINEAR to be returned in a scalar so that you can check whether the estimation has converged. The estimated parameters and their standard errors can be saved in the structures set for ESTIMATES and SE respectively.
The RETURNS and PROBABILITIES parameters allow you to specify values that you want the corresponding return probabilities or levels estimated for, respectively. If the units for return probabilities are not on a per observation basis, the PERIODLENGTH parameter can be set to indicate how many observations are to be grouped for each period. For example, PERIODLENGTH=365 would group days into years, so that return periods would be expressed in years rather than days. If the values in the variate for RETURNS are set, then the estimated return probabilities are given in PROBABILITIES and the lower and upper confidence limits for the return probabilities are in given in LOWER and UPPER respectively. If the values in the variate for PROBABILITIES are set, then the estimated return levels are given in RETURNS and the lower and upper confidence limits for the return levels are in given in LOWER and UPPER respectively.
Method
The log-likelihood function is set up using EXPRESSION statements and maximum likelihood estimates formed by using FITNONLINEAR. Firstly, an optimal value of eta is found by forming a profile likelihood for values of eta varying over a range of values, and then the full likelihood is maximised using all parameters, starting from the optimum found from the profile likelihood. To form profile likelihood confidence limits, the likelihood is reparameterised using the return level as a parameter, and a profile likelihood formed by varying the return level, and optimising over the remaining parameters.
Action with RESTRICT
Restricted units are left out of the analysis.
References
- Coles, Stuart (2001). An introduction to statistical modelling of extreme values. Springer-Verlag: London.
- Reiss R and Thomas M (2001). Statistical Analysis of Extreme Values, 2nd edition. Birkhauser: Basel.
Example
The file RainFall.GSH contains daily rainfall (in mm) for a site in Southern England from 1914 to 1962. The distribution of positive rainfalls is shown in the histogram below.
A suitable threshold must be chosen for the analysis of this data. By examining the mean residual life plot and the parameter stability plot of sigma, it was decided that a value of 30 seemed reasonable, as the mean residual life plot was approximately linear for a threshold > 30 and sigma was stable for values of a threshold > 30 (see the graphs below).
Genstat Code
SPLOAD '%GENDIR%\\EXAMPLES\\RainFall.GSH' "Fit a Generalized Pareto Distribution to tail of rainfall > 30 mm" GPARETO [PRINT=model,estimates,tests; PLOT=reslife,stability,clusters,qq,\ etaprofile,density,returns; DEPENDENCE=no; CIMETHOD=profile;\ ENVELOPE=yes; ALPHA=0.95; WINDOW=3]\ DATA=RainFall; THRESHOLD=30; PERIODLENGTH=365; PROBABILITY=0.05
Output
Generalized Pareto Distribution: CDF(x) = 1 - (Eta*(x-t)/Sigma]**(-1/Eta)), Eta0 = 1 - EXP(-(x-t)/Sigma), Eta==0 for x > t - the threshold Threshold = 30 Proportion > Threshold = 0.00867 *** Estimates of GPareto parameters *** estimate "s.e." Sigma 7.440 1.355 Eta 0.1845 0.1431 Maximum Log-Likelihood = -485.09 Maximum value of G. Pareto Distribution is Infinite (Eta >= 0) Significance Test that Eta = 0 (ie RainFall follows an Exponential distribution) Likelihood Ratio test statistic: 4.600 Chi-Square Probability of test: 0.0320 Goodness of Fit Test for RainFall following a Generalized Pareto distribution ----------------------------------------------------------------------------- Critical values of test statistics (MARGINAL tests) --------------------------------------------------------- Significance level ------------------------------------- Test statistic 15% 10% 5% 2.5% 1% --------------------------------------------------------- Anderson-Darling 0.576 0.656 0.787 0.918 1.092 Cramer-von Mises 0.091 0.104 0.126 0.148 0.178 Watson 0.085 0.096 0.116 0.136 0.163 --------------------------------------------------------- --------------------------------------------------------------- Test statistic --------------------------------- Type of Anderson- Cramer- test Variate(s) Darling von Mises Watson --------------------------------------------------------------- Marginal 1 0.447 0.046 0.043 --------------------------------------------------------------- ?, *, ** indicate significance at 10%, 5% and 1% levels respectively 95.0 % Profile Likelihood Interval for Eta: ( 0.01367 0.4154 ) 95.0 % Profile Likelihood Intervals for Return Periods --------------------------------------------------------- Probability Return Period Level Lower Upper 0.05000 20.00 76.36 65.18 102.9
Graphs
The mean residual life plot shows approximate linearity above a threshold of 30. The decline at the end of the series is due to only the last few data points, and can be safely ignored | |
The parameter estimate stability plot for sigma shows a consistent estimate for sigma above a threshold of 15. The stability plot for eta (not shown) is consistent across all values of the threshold. Thus for any threshold above 15, the fitted distribution will be approximately the same | |
The reciprocal of the mean cluster size, theta, is reasonably close to its expected value, so it does not look as if there is dependence in the series | |
All the data points lie roughly along the line in the Q-Q plot with no points lying outside the 95% confidence bands | |
The value of eta lies within the confidence interval (0.01, 0.41). The value of eta is significantly different from 0 at the 5% level, as 0 lies outside the 95% confidence limits | |
The fitted density shows a nice fit to the observed data | |
This graph gives the return periods in years (as the PERIODLENGTH was given as 365). Approximate confidence limits for the return periods can be read off the bands | |
The profile likelihood for the 1 in 20 year return level shows an asymmetrical confidence limit, with the under limit being much wider than the lower limit. The profile likelihood 95% confidence limit is around the estimated return level of 73.3 is (65, 103) |
See also
- Fit a Generalized Pareto Distribution
- GRGPARETO for generating random GPARETO deviates
- Fit a Generalized Extreme Value Distribution
- GEV