Consider a linear auto-regressive model for time-series, where $y_{t+1}$ is a linear function of $y_t$ and $y_{t-1}$:
$$y_{t+1} = w_1 + w_2 y_t + w_3 y_{t-1}.$$
This writes $y_{t+1} = w^T x_t$, with the ‘‘feature vectors’’
$$x_t := (1, \; y_t, \; y_{t-1}).$$
We can fit this model based on historical data via least-squares:
$$\min_w \;\; \sum_{t=2}^{T} \left( y_{t+1} - w^T x_t \right)^2 .$$
The associated prediction rule is
$$\hat{y}_{T+1} = w^T x_T = w_1 + w_2 y_T + w_3 y_{T-1}.$$
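To make this concrete, here is a minimal numpy sketch of the fit; the series `y` below is synthetic, and the two-lag model above is assumed.

```python
import numpy as np

# Synthetic series standing in for historical data y_1, ..., y_{T+1}.
rng = np.random.default_rng(0)
y = np.cumsum(rng.standard_normal(200))

# Feature vectors x_t = (1, y_t, y_{t-1}) stacked as rows; targets are y_{t+1}.
X = np.column_stack([np.ones(len(y) - 2), y[1:-1], y[:-2]])
targets = y[2:]

# Least-squares fit of the weights w.
w, *_ = np.linalg.lstsq(X, targets, rcond=None)

# Prediction rule: y_hat_{T+1} = w^T x_T.
x_T = np.array([1.0, y[-1], y[-2]])
y_hat = w @ x_T
print(y_hat)
```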
We can introduce a non-linear version, where $y_{t+1}$ is a quadratic function of $y_t$ and $y_{t-1}$:
$$y_{t+1} = w_1 + w_2 y_t + w_3 y_{t-1} + w_4 y_t^2 + w_5 y_t y_{t-1} + w_6 y_{t-1}^2.$$
This writes $y_{t+1} = w^T \phi(x_t)$, with the augmented feature vectors
$$\phi(x_t) := (1, \; y_t, \; y_{t-1}, \; y_t^2, \; y_t y_{t-1}, \; y_{t-1}^2).$$
It appears that the size of the least-squares problem grows quickly with the degree of the feature vectors. How can we solve it in a computationally efficient manner?
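To see the growth concretely, here is a small sketch; the helper `poly_features` and the example vector are purely illustrative. It enumerates all monomials of a feature vector up to a given degree, and shows that the dimension, and hence the size of the least-squares problem, grows combinatorially with the degree.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(x, degree):
    """All monomials of the entries of x up to the given degree (constant included)."""
    feats = [1.0]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), d):
            feats.append(np.prod(x[list(idx)]))
    return np.array(feats)

x = np.array([1.0, 2.0, 3.0])           # a feature vector of dimension 3
for d in (1, 2, 4, 8):
    print(d, poly_features(x, d).size)  # 4, 10, 35, 165: grows combinatorially with d
```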
We exploit a simple fact: in the least-squares problem
$$\min_w \; \|X^T w - y\|_2^2, \qquad X := [x_1, \ldots, x_m] \in \mathbb{R}^{n \times m},$$
the optimal $w$ lies in the span of the data points $x_1, \ldots, x_m$:
$$w = X v$$
for some vector $v \in \mathbb{R}^m$. Indeed, from the fundamental theorem of linear algebra, every $w \in \mathbb{R}^n$ can be written as the sum of two orthogonal vectors:
$$w = X v + r,$$
where $X^T r = 0$ (that is, $r$ is in the nullspace of $X^T$), so $r$ does not affect the residual $X^T w - y$ and can be dropped.
Hence the least-squares problem depends only on $K := X^T X$:
$$\min_v \; \|K v - y\|_2^2 .$$
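A minimal numpy sketch of this reduction, on synthetic data with $n \gg m$ (the sizes and variable names are illustrative): the $m$-variable problem in $v$ recovers the same solution as the $n$-variable problem in $w$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 500, 40                       # feature dimension n much larger than number of points m
X = rng.standard_normal((n, m))      # columns x_1, ..., x_m are the data points
y = rng.standard_normal(m)

# Direct least-squares in w (n variables): min_w ||X^T w - y||^2
w_direct, *_ = np.linalg.lstsq(X.T, y, rcond=None)

# Kernelized version: K = X^T X, solve min_v ||K v - y||^2 (m variables), then w = X v.
K = X.T @ X
v, *_ = np.linalg.lstsq(K, y, rcond=None)
w_kernel = X @ v

print(np.allclose(w_direct, w_kernel))   # same (minimum-norm) solution, with m variables instead of n
```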
The matrix $K = X^T X$ is called the kernel matrix. It is equal to the Gram matrix of the data points $x_1, \ldots, x_m$, with entries $K_{ij} = x_i^T x_j$, and provides a measure of similarity between them, in the sense that if they all have the same (unit) Euclidean norm, then $K_{ij}$ is the cosine of the angle between $x_i$ and $x_j$.
The prediction rule depends only on the scalar products between a new point $x$ and the data points $x_1, \ldots, x_m$:
$$\hat{y} = w^T x = v^T X^T x = \sum_{i=1}^m v_i \, (x_i^T x).$$
Once $K$ is formed (this takes $O(m^2 n)$ operations), the training problem has only $m$ variables. When $n \gg m$, this leads to a dramatic reduction in problem size.
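Continuing the sketch above, the prediction at a new point only needs the scalar products $x_i^T x$:

```python
# Continuing the previous sketch: predicting at a new point x uses only the products x_i^T x.
x_new = rng.standard_normal(n)
y_hat = sum(v[i] * (X[:, i] @ x_new) for i in range(m))   # sum_i v_i (x_i^T x)
print(np.isclose(y_hat, w_kernel @ x_new))                # matches w^T x
```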
In the nonlinear case, we simply replace the feature vectors $x_i$ by some ‘‘augmented’’ feature vectors $\phi(x_i)$, with $\phi$ a non-linear mapping.
This leads to the modified kernel matrix
$$K_{ij} = \phi(x_i)^T \phi(x_j), \quad 1 \le i, j \le m.$$
The kernel function associated with the mapping $\phi$ is
$$k(x, z) := \phi(x)^T \phi(z).$$
It provides information about the metric in the feature space, e.g.:
$$\|\phi(x) - \phi(z)\|_2^2 = k(x, x) - 2 k(x, z) + k(z, z).$$
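For instance, with a small explicit quadratic mapping (chosen here purely for illustration), the feature-space distance can be computed from kernel values alone, as this sketch checks:

```python
import numpy as np

def phi(x):
    # A small explicit (illustrative) feature map: quadratic monomials of x in R^2.
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])

def k(x, z):
    # Kernel function associated with the mapping phi.
    return phi(x) @ phi(z)

x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])

# Squared distance in feature space, computed two ways.
lhs = np.sum((phi(x) - phi(z)) ** 2)
rhs = k(x, x) - 2 * k(x, z) + k(z, z)
print(np.isclose(lhs, rhs))   # True
```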
The computational effort involved in
solving the training problem;
making a prediction,
depends only on our ability to quickly evaluate such scalar products. We can't choose the kernel function $k$ arbitrarily; it has to satisfy $k(x, z) = \phi(x)^T \phi(z)$ for some mapping $\phi$.
A variety of kernels are available. Some are adapted to the structure of data, for example text or images. Here are a few popular choices.
Consider first the polynomial kernel. Given two vectors $x, z \in \mathbb{R}^2$, we have
$$(1 + x^T z)^2 = \phi(x)^T \phi(z), \quad \text{where } \phi(x) := (1, \; \sqrt{2}\, x_1, \; \sqrt{2}\, x_2, \; x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2).$$
More generally, when $\phi(x)$ is the vector formed with all the products between the components of $x \in \mathbb{R}^n$, up to degree $d$, with the right coefficients inserted (like the factor $\sqrt{2}$ in the above definition of $\phi$), then for any two vectors $x, z$,
$$\phi(x)^T \phi(z) = (1 + x^T z)^d .$$
The computational effort of evaluating $(1 + x^T z)^d$ grows linearly in $n$.
This represents a dramatic reduction in computational effort compared to the ‘‘brute force’’ approach:
form $\phi(x)$, $\phi(z)$;
evaluate $\phi(x)^T \phi(z)$.
In the brute-force approach the computational effort grows as $O(n^d)$.
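The $d = 2$ case can be checked numerically; the sketch below (with illustrative vectors in $\mathbb{R}^2$) evaluates the kernel directly, at a cost linear in the dimension, and verifies that it matches the explicit feature map with the $\sqrt{2}$ coefficients:

```python
import numpy as np

def poly_kernel(x, z, d):
    # Degree-d polynomial kernel: evaluation cost is O(n), regardless of d.
    return (1.0 + x @ z) ** d

def phi2(x):
    # Explicit degree-2 feature map for x in R^2, with the sqrt(2) coefficients.
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, s * x1 * x2, x2**2])

x = np.array([0.7, -1.2])
z = np.array([0.1, 2.5])

print(np.isclose(phi2(x) @ phi2(z), poly_kernel(x, z, 2)))   # the identity holds for d = 2
```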
Another popular choice is the Gaussian kernel
$$k(x, z) = \exp\left( - \frac{\|x - z\|_2^2}{2 \sigma^2} \right),$$
where $\sigma$ is a scale parameter. This allows points that are too far apart to be effectively ignored. It corresponds to a non-linear mapping $\phi$ to an infinite-dimensional feature space.
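A short sketch of the Gaussian kernel (the test points and scale values are illustrative), showing how the scale parameter controls which points are effectively ignored:

```python
import numpy as np

def gaussian_kernel(x, z, sigma):
    # Gaussian (RBF) kernel with scale parameter sigma.
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x, z = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(gaussian_kernel(x, z, sigma=1.0))   # ~ 0: distant points are effectively ignored
print(gaussian_kernel(x, z, sigma=10.0))  # ~ 0.88: larger scale, wider influence
```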
Kernels need to be chosen by the user.
The choice is not always obvious; Gaussian or polynomial kernels are popular.
We control over-fitting via cross-validation (to choose, say, the scale parameter of the Gaussian kernel, or the degree of the polynomial kernel).
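As a sketch of this idea, here is a simplified hold-out version (a single train/validation split rather than full cross-validation; the data, the candidate scales, and the helper `gaussian_gram` are all hypothetical) that selects the scale parameter of a Gaussian kernel by validation error:

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    # Matrix of Gaussian kernel values k(a_i, b_j) between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Hypothetical data: rows of X are data points.
rng = np.random.default_rng(2)
X = rng.standard_normal((80, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(80)

train, val = np.arange(60), np.arange(60, 80)
best = None
for sigma in (0.1, 0.3, 1.0, 3.0, 10.0):                     # candidate scale parameters
    K = gaussian_gram(X[train], X[train], sigma)
    v, *_ = np.linalg.lstsq(K, y[train], rcond=None)          # kernel least-squares fit
    y_hat = gaussian_gram(X[val], X[train], sigma) @ v        # kernel prediction rule
    err = np.mean((y_hat - y[val]) ** 2)
    if best is None or err < best[1]:
        best = (sigma, err)
print("selected sigma:", best[0])
```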