
SVMFIT procedure

Fits a support vector machine (D. B. Baird).

Options

PRINT = string tokens Printed output from the analysis (summary, predictions, allocations, debug); default summ, alloc
SVMTYPE = string token Type of support vector machine to fit (svc, svr, nusvc, nusvr, lsvc, lsvr, lcs, svm1); default svc
KERNEL = string token Type of kernel to use (linear, polynomial, radialbasis, sigmoid); default radi
PENALTY = scalar or variate Penalty or cost for points on the wrong side of the boundary; default 1
GAMMA = scalar or variate Gamma parameter for types with non-linear kernels; default 1
NU = scalar or variate Nu parameter for types nusvc, nusvr, and svm1; default 0.5
EPSILON = scalar or variate Epsilon parameter for types svr and lsvr; default 0.1
BIAS = scalar Bias for allocations to groups for types lsvc and lsvr; default -1 i.e. no bias
DEGREE = scalar Degree for polynomial kernel; default 3
CONSTANTVALUE = scalar Constant for polynomial or sigmoid kernel; default 0
LOWER = scalar or variate Lower limit for scaling data variates; default -1
UPPER = scalar or variate Upper limit for scaling data variates; default 1
SCALING = string token Type of scaling to use (none, uniform, given); default unif
NOSHRINK = string token Whether to suppress the shrinkage of attributes to exclude unused ones (no, yes); default no
OPTMETHOD = string token Whether to optimize probabilities or allocations (allocations, probabilities); default allo
REGULARIZATIONMETHOD = string token Regularization method for SVMTYPE = lsvc or lsvr (l1, l2); default l2
LOSSMETHOD = string token Loss method for SVMTYPE = lsvc or lsvr (logistic, l1, l2); default logi
DUALMETHOD = string token Whether to use the dual algorithm for SVMTYPE = lsvc or lsvr (yes, no); default no
NCROSSVALIDATIONGROUPS = scalar Number of groups for cross-validation; default 10
SEED = scalar Seed for random number generation; default 0
TOLERANCE = scalar Tolerance for termination criterion; default 0.001
WORKSPACE = scalar Size of workspace needed for data; default is to calculate this from the number of observations and variates

Parameters

Y = factors or variates Define the groupings for the units in each training set, or the y-variate to be predicted via regression, with missing values in the units to be allocated or predicted
X = pointers Each pointer contains a set of explanatory variates or factors
WEIGHTS = variates Weights to multiply penalties for each group when SVMTYPE = svc, nusvc, lsvc or lcs
PREDICTIONS = factors or variates Saves allocations to groups or predictions from regression
ERRORRATE = scalars, variates or matrices Saves the error rate for the combinations of parameters specified for the support vector machine
OPTPENALTY = scalars Saves the optimal value of penalty parameter
OPTGAMMA = scalars Saves the optimal value of gamma parameter
OPTNU = scalars Saves the optimal value of nu parameter
OPTEPSILON = scalars Saves the optimal value of epsilon parameter
OPTERRORRATE = scalars Saves the minimum error rate
SCALE = texts or pointers Saves the scaling used for the X variates, in a file if a text is given, or otherwise in a pointer to a pair of variates
SAVEFILE = texts File in which to save the model, for use by SVMPREDICT

Description

SVMFIT fits a support vector machine (Cortes & Vapnik 1995), which defines multivariate boundaries to separate groups, or predict values. It provides a Genstat interface to the libraries LIBSVM (Chang & Lin 2001) and LIBLINEAR (Fan et al. 2008), which are made available subject to the conditions listed in the Method section.

Unlike linear discriminant analysis, a support vector machine assumes no statistical model for the distribution of individuals within a group. The method is thus less affected by outliers. The method chooses boundaries to maximize the separation between groups. The name arises because the boundaries are defined by a small set of data points, known as the support vectors. If individuals lie on the wrong side of the boundary, the distance from the boundary, multiplied by a penalty, is added to the separation criterion.

The type of support vector machine to fit is specified by the SVMTYPE option, with settings:

    svc a multi-class support vector classifier with a range of kernels for discriminating between groups;
    svr support vector regression with a range of kernels for predicting the values of a y-variate as in a regression;
    nusvc Nu classification – a multi-class support vector classifier with a range of kernels for discriminating between groups with a parameter NU that controls the fraction of support vectors used;
    nusvr Nu regression – support vector regression with a range of kernels for predicting the values of a y-variate as in a regression with a parameter NU that controls the fraction of support vectors used;
    lsvc Fast linear classification – a fast regularized linear support vector classifier for discriminating between groups;
    lsvr Fast linear regression – a fast regularized linear support vector regression for predicting the values of a y-variate as in a regression;
    lcs a fast linear support vector machine for discriminating between groups using the approach of Crammer & Singer (2000), where a direct method for training multi-class predictors is used, rather than dividing the multi-class classification into a set of binary classifications; and
    svm1 Consistent group SVM – a support vector machine which attempts to identify a consistent group of observations.

The shape of the boundary is controlled by the KERNEL option which specifies the metric used to measure distance between multi-dimensional points u and v. The settings are:

    linear the linear function u′v (the inner product of u and v);
    polynomial the polynomial function γ (u′v + c)^d;
    radialbasis the radial basis function exp(−γ |u − v|²); and
    sigmoid the sigmoid function tanh(γ u′v + c).

With a linear kernel, the boundaries are multi-dimensional planes. For the other types they are curved surfaces. The kernel is ignored for SVMTYPE=lsvc, lsvr and lcs, as these always use a linear kernel.

The data set is supplied in a pointer of explanatory variates or factors, specified by the X parameter, and a response variate or factor specified by the Y parameter. The Y parameter need not be set if SVMTYPE=svm1, as this searches for a consistent group of individuals in the data set, ignoring the Y parameter. Explanatory factors are converted to variates, using the levels of the factor concerned. Any unit with a missing value in an explanatory variate takes a zero value for that attribute. With the default (uniform) scaling, this puts the unit at the centre of the range of the variate concerned. Units can also be excluded from the analysis by restricting the factor or variates; any such restrictions must be consistent.

The response factor specifies the pre-defined groupings of the units from which the allocation is derived (the “training set”); the units to be allocated by the analysis have missing values for Y. A response variate supplies training values for a regression-type support vector machine (requested by SVMTYPE settings svr, nusvr and lsvr). Units to be predicted by the regression have missing values in the y-variate.
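As a sketch of how units to be predicted are marked, the following hypothetical job withholds roughly a fifth of the Los Angeles Ozone observations by setting their y-values to missing, fits the regression on the rest, and then lists the withheld observations beside their predictions. The names Holdout and TrainOzone and the seed given to URAND are illustrative assumptions; the data set and pointer are those used in the Example section.

```genstat
SPLOAD    [PRINT=*] '%DATA%/Ozone.gsh'; ISAVE=Data
SUBSET    [Ozone /= !s(*)] Data[]
POINTER   [VALUES=Data[1,2,(5...10)]] OZVars
" Holdout is 1 for units to be predicted and 0 for training units
  (hypothetical name; about 20% of units are withheld)."
CALCULATE Holdout = URAND(240871; NVALUES(Ozone)) < 0.2
" Copy the response, with missing values in the withheld units."
CALCULATE TrainOzone = MVINSERT(Ozone; Holdout)
SVMFIT    [PRINT=summary; SVMTYPE=svr; SEED=562011] Y=TrainOzone;\
          X=OZVars; PREDICTIONS=POzone
" List the withheld observations beside their predictions."
RESTRICT  Ozone,POzone; CONDITION=Holdout
PRINT     Ozone,POzone
RESTRICT  Ozone,POzone
```

The restriction is applied only after the fit, so that the variates supplied to SVMFIT are all unrestricted.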

The support vector machine solutions depend on the scale of the attributes. It is usually recommended that all attributes are put on the same scale, so that they all have the same influence. This is controlled by the SCALING option, with settings:

    none the attributes are used as supplied, with no scaling;
    uniform all the attributes are centred, and scaled to have the same minimum and maximum (default); and
    given the variates are scaled using the LOWER and UPPER options.

The LOWER and UPPER options can be set to a scalar, to apply a uniform scaling, where all the variates are given the same minimum (LOWER) and maximum (UPPER) value; alternatively, they can be variates specifying the minimum and maximum value for each variate, respectively.
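For illustration, the two forms of the LOWER and UPPER options might be used as follows with the Iris data from the Example section. The limits chosen here are arbitrary assumptions; the variate form supplies one value for each of the four explanatory variates in the pointer.

```genstat
SPLOAD  [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" Scale every variate to the same range, 0 to 1."
SVMFIT  [PRINT=summary,allocations; SCALING=given; LOWER=0; UPPER=1;\
        SEED=726454] Y=Species; X=Var
" Give each variate its own range (illustrative values only)."
SVMFIT  [PRINT=summary,allocations; SCALING=given; LOWER=!(0,0,-1,-1);\
        UPPER=!(1,1,1,1); SEED=726454] Y=Species; X=Var
```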

The PENALTY option defines the penalty that is applied to the sum of distances for the points on the wrong side of the boundary when calculating the optimal boundaries; default 1. Larger values apply more weight to points that are on the wrong side of the discrimination boundaries, and can be investigated to optimize performance. However, linear support vector machines are generally insensitive to the choice of the penalty. The WEIGHTS parameter can be used to change the penalty for mis-assigning a case to a particular group, and should be a variate with the same length as the number of levels in Y. The penalty for each group is then the corresponding value of PENALTY*WEIGHTS.
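A minimal sketch of group-specific penalties, using the Iris data from the Example section. The penalty of 10 and the weights are illustrative assumptions, chosen to show a case where mis-assigning units from the third level of Species is treated as twice as costly as for the other two levels (giving penalties of 10, 10 and 20).

```genstat
SPLOAD  [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" One weight for each of the three levels of Species."
SVMFIT  [PRINT=summary,allocations; PENALTY=10; SEED=726454]\
        Y=Species; X=Var; WEIGHTS=!(1,1,2)
```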

The GAMMA option (γ in the equations for the kernels) controls the smoothness of the boundary for non-linear kernels, with larger values giving a rougher surface.

With SVMTYPE=nusvc and nusvr, the parameter NU controls the fraction of support vectors used; default 0.5. NU is a lower bound on the proportion of units that become support vectors (and an upper bound on the proportion of margin errors), so smaller values of NU allow fewer support vectors to be used, giving a sparser solution that may be more robust and thus perform better in future prediction.

With the regression cases SVMTYPE=svr and lsvr, the parameter EPSILON controls the sensitivity of the loss function being optimized; default 0.1.

A range of values for PENALTY, GAMMA, NU or EPSILON is usually tried, to optimize the discrimination between groups or the predictions of the y-variate. Each of these options also accepts a variate, in which case all the values in the variate are tried and the one that minimizes the error rate is selected. Up to two of these options can be set to variates at once; a grid of error rates is then calculated using every combination of the two sets of values, and the optimal combination is selected. If three or more of these options are set to variates, a warning is given, and only the first values of the third and fourth variates are used.

When KERNEL=polynomial, the DEGREE option defines the degree of the polynomial (d in the equation for the polynomial kernel). The CONSTANTVALUE option gives the constant (c in the equations for the kernels), for KERNEL=polynomial and sigmoid.
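As an illustration, a quadratic kernel could be requested as below for the Iris data from the Example section. The values given for DEGREE, CONSTANTVALUE and GAMMA are arbitrary assumptions rather than recommendations.

```genstat
SPLOAD  [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" Polynomial kernel with d = 2 and c = 1."
SVMFIT  [PRINT=summary,allocations; KERNEL=polynomial; DEGREE=2;\
        CONSTANTVALUE=1; GAMMA=0.5; SEED=726454] Y=Species; X=Var
```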

The TOLERANCE option supplies a small positive value that controls the precision used for the termination criterion. Decreasing this may provide a better solution, but will increase the time taken until convergence.

The NOSHRINK option controls whether unnecessary attributes are dropped from the fitting process; by default, these are dropped, thus increasing the speed to find a solution when there are many iterations (e.g. when TOLERANCE has been made smaller). If few iterations are required to find a solution, it may be faster to set NOSHRINK=yes.

The OPTMETHOD option controls the criterion that is optimized when the SVMTYPE is set to svc, svr, nusvc or nusvr, with settings:

    allocations for the accuracy of allocating individuals to groups; or
    probabilities for the sum of the probabilities of allocating an individual to the correct group.

The SVMTYPE settings lsvc, lsvr and lcs fit regularized linear support vector machines using the algorithms in the LIBLINEAR library of Fan et al. (2008). These are much faster than the default algorithm, allowing much bigger data sets to be analysed. The REGULARIZATIONMETHOD, LOSSMETHOD and DUALMETHOD options specify which LIBLINEAR algorithm is used for SVMTYPE settings lsvc and lsvr.

The REGULARIZATIONMETHOD option allows you to create sparser sets of support vectors, with the L1 setting giving a smaller set of support vectors than L2. The LOSSMETHOD option controls the loss function being minimized: the L2 setting minimizes the sum of the squared distances of points on the wrong side of the boundary, the L1 setting minimizes the sum of the distances, and the logistic setting uses a logistic regression loss function. Setting option DUALMETHOD=yes may be faster when there are a large number of attributes. Not all combinations of REGULARIZATIONMETHOD, LOSSMETHOD and DUALMETHOD options are available.
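A sketch of a fast linear classifier for the Iris data from the Example section, using L2 regularization with an L2 loss function and the dual algorithm. This combination exists in LIBLINEAR itself, but it is an assumption here that SVMFIT makes it available; if it does not, another combination of the three options would need to be tried.

```genstat
SPLOAD  [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" Regularized linear classifier; the KERNEL option is ignored here."
SVMFIT  [PRINT=summary,allocations; SVMTYPE=lsvc;\
        REGULARIZATIONMETHOD=l2; LOSSMETHOD=l2; DUALMETHOD=yes;\
        SEED=143038] Y=Species; X=Var
```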

When SVMTYPE=lsvc, you can use the BIAS option to attempt to achieve a better discrimination between groups. When BIAS is set to a non-negative value, an extra constant attribute is added to the end of each individual. This extra attribute is given a weight that controls the origin of the separating hyper-plane (the origin is where all attributes have a value of 0). A BIAS of 0 forces the separating hyper-plane to go through the origin, and a non-zero value moves the plane away from the origin. The BIAS thus acts as a tuning parameter that changes the hyper-plane’s origin. A range of values can be investigated, to try to improve the discrimination.
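One way to investigate a range of BIAS values is to loop over them, as in this sketch for the Iris data from the Example section. The three values tried and the names Bvalue and Err are illustrative assumptions. The same non-zero seed is given on every pass, with the intention that each fit is assessed on the same cross-validation groups.

```genstat
SPLOAD  [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" Fit a linear classifier for each bias, saving the error rate."
FOR Bvalue=0,1,10
  SVMFIT [PRINT=*; SVMTYPE=lsvc; BIAS=Bvalue; SEED=143038] Y=Species;\
         X=Var; ERRORRATE=Err
  PRINT  Bvalue,Err
ENDFOR
```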

Printed output is controlled by the option PRINT with settings:

    summary tables giving the number of units in each group with a complete set of observations;
    allocations tables of counts of allocations; and
    debug details of the parameters set when calling the libraries.

The error rate is worked out by cross-validation, which works by randomly splitting the units into a number of groups specified by the NCROSSVALIDATIONGROUPS option. It then omits each of the groups, in turn, and predicts how the omitted units are allocated to the discrimination groups.

The SEED option provides the seed for the random numbers used for allocating individuals to the cross-validation groups. The default value of 0 continues an existing sequence of random numbers. If none have been used in the current Genstat job, it initializes the seed automatically using the computer clock.

The WORKSPACE option can be set if the problem requires more memory than the default settings.

Results from the analysis can be saved using the parameters PREDICTIONS, ERRORRATE, OPTPENALTY, OPTGAMMA, OPTNU, OPTEPSILON and OPTERRORRATE. The structures specified for these parameters need not be declared in advance. If one of the options PENALTY, GAMMA, NU or EPSILON has been set to a variate, ERRORRATE will be a variate indexed by that variate. Alternatively, if two of these options have been set to variates, ERRORRATE will be a matrix with rows and columns indexed by those variates. The OPT parameters contain the values of the parameters that give the minimum error rate (returned in OPTERRORRATE).
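The following sketch saves and prints the results of a search over a grid of PENALTY and GAMMA values for the Iris data from the Example section, using five cross-validation groups. The grid values and the names ErrGrid, BestPenalty, BestGamma and MinError are illustrative assumptions.

```genstat
SPLOAD  [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" Two options are set to variates, so ERRORRATE saves a matrix."
SVMFIT  [PRINT=summary; PENALTY=!(1,10,100); GAMMA=!(0.1,0.5,1);\
        NCROSSVALIDATIONGROUPS=5; SEED=726454] Y=Species; X=Var;\
        ERRORRATE=ErrGrid; OPTPENALTY=BestPenalty; OPTGAMMA=BestGamma;\
        OPTERRORRATE=MinError
PRINT   ErrGrid
PRINT   BestPenalty,BestGamma,MinError
```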

The support vector machine model can be saved in an external file, using the SAVEFILE parameter, so that it can be used later with SVMPREDICT. As the scaling on the attributes must be the same in future data sets, the scaling can be saved with the SCALE parameter. This can supply either a filename (ending in .gsh) to keep these permanently, or a pointer so that these can be applied to the attributes used in SVMPREDICT later in the same program. The file or pointer contains two variates, which give the slope and intercept (in that order) for the linear transform applied to each attribute.
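A sketch of both ways of saving the scaling, for the Iris data from the Example section. The file names are hypothetical, and the extension used for the model file is an assumption, as only the .gsh ending of the scaling file is specified above.

```genstat
SPLOAD  [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" Save the model and its scaling in files, for use in a later job."
SVMFIT  [PRINT=summary; SEED=726454] Y=Species; X=Var;\
        SCALE='IrisScale.gsh'; SAVEFILE='IrisSVM.sav'
" Or keep the scaling in a pointer, for use later in this job."
SVMFIT  [PRINT=summary; SEED=726454] Y=Species; X=Var;\
        SCALE=IrisScale; SAVEFILE='IrisSVM.sav'
" The two variates hold the slope and intercept for each attribute."
PRINT   IrisScale[]
```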

Options: PRINT, SVMTYPE, KERNEL, PENALTY, GAMMA, NU, EPSILON, BIAS, DEGREE, CONSTANTVALUE, LOWER, UPPER, SCALING, NOSHRINK, OPTMETHOD, REGULARIZATIONMETHOD, LOSSMETHOD, DUALMETHOD, NCROSSVALIDATIONGROUPS, SEED, TOLERANCE, WORKSPACE.

Parameters: Y, X, WEIGHTS, PREDICTIONS, ERRORRATE, OPTPENALTY, OPTGAMMA, OPTNU, OPTEPSILON, OPTERRORRATE, SCALE, SAVEFILE.

Method

SVMFIT provides a Genstat interface to the C++ libraries LIBSVM (Chang & Lin 2001) and LIBLINEAR (Fan et al. 2008), which have been compiled into the GenSVM dynamic link library. A user guide by Hsu et al. (2003) gives details on their use.

LIBSVM is provided subject to the following copyright notice.

Copyright © 2000-2014 Chih-Chung Chang and Chih-Jen Lin. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1.   Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2.   Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3.   Neither name of copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

This software is provided by the copyright holders and contributors “as is” and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the regents or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.

LIBLINEAR is provided subject to the following copyright notice.

Copyright © 2007-2013 The LIBLINEAR Project. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1.   Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2.   Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3.   Neither name of copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

This software is provided by the copyright holders and contributors “as is” and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the regents or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.

Action with RESTRICT

The input variates and factor may be restricted. The restrictions must be identical.

References

Cortes, C. & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.

URL: http://link.springer.com/article/10.1007%2FBF00994018

Chang, C.C. & Lin, C.J. (2001). LIBSVM: A library for support vector machines.

URL: http://www.csie.ntu.edu.tw/~cjlin/libsvm

Crammer, K. & Singer, Y. (2000). On learnability and design of output codes for multi-class problems. In Computational Learning Theory, 35-46.

Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R. & Lin, C.J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 1871-1874.

URL: http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf

Hsu, C.W., Chang, C.C. & Lin, C.J. (2003). A practical guide to support vector classification. (Technical report). Department of Computer Science and Information Engineering, National Taiwan University.

URL: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

See also

Directive: CVA.

Procedures: SVMPREDICT, DISCRIMINATE, QDISCRIMINATE, SDISCRIMINATE.

Example

CAPTION 'SVMFIT for classification: Fisher Iris data'; STYLE=meta
SPLOAD  [PRINT=*] '%DATA%/Iris.gsh'
POINTER [VALUES=Sepal_Length,Sepal_Width,Petal_Length,Petal_Width] Var
" Default - radialbasis kernel with scaling."
SVMFIT  [PRINT=summary,allocations; SEED=726454] Y=Species; X=Var
" Unscaled with linear kernel."
SVMFIT  [PRINT=summary,allocations; KERNEL=linear; SCALING=none;\
        SEED=143038] Y=Species; X=Var

CAPTION 'SVMFIT for regression: Los Angeles Ozone data'; STYLE=meta
SPLOAD  [PRINT=*] '%DATA%/Ozone.gsh'; ISAVE=Data
SUBSET  [Ozone /= !s(*)] Data[]
POINTER [VALUES=Data[1,2,(5...10)]] OZVars
" Find optimal values for penalty and gamma."
SVMFIT  [PRINT=summary; SVMTYPE=svr; PENALTY=!(1,10,100,500,1000);\
        GAMMA=!(0.05,0.1,0.2,0.4); SEED=562011] Y=Ozone; X=OZVars;\
        PREDICTIONS=POzone
DGRAPH  [TITLE='Los Angeles Ozone levels 1976 ~{epsilon}-regression';\
        KEY=0;WIND=3] Y=POzone; X=Ozone
Updated on June 18, 2019
