Intro to Gaussian Processes

machine-learning
statistics
Author: Christian Testa

Published: January 16, 2021

In this article I’m going to try to provide some intuition around how multivariate Gaussian distributions and Gaussian process models can be useful for modeling smooth functions.

A helpful definition

A random variable \(X\) in \(d\) dimensions is multivariate normally distributed, i.e. \(X \sim N_d(\mu, \Sigma)\), if and only if \(\sum_i \alpha_i X_i \sim N(\mu', \sigma'^2)\) for any choice of weights \(\{\alpha_i\}\) with \(i\) taken across any subset of \(\{1, \ldots, d\}\), for some \(\mu'\) and \(\sigma'\). In other words, the sum or any linear combination of any subset of a multivariate normal distribution’s components is itself normally distributed.
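As a quick sanity check of this characterization, here is a minimal sketch in R (the mean vector, covariance matrix, and weights are arbitrary illustrative choices, and MASS::mvrnorm is assumed for sampling) comparing a linear combination of the components of a multivariate normal draw to the univariate normal it should follow:

library(MASS)

set.seed(1)
mu <- c(0, 1, -1)                            # arbitrary mean vector
Sigma <- matrix(c(1.0, 0.5, 0.2,
                  0.5, 1.0, 0.3,
                  0.2, 0.3, 1.0), nrow = 3)  # arbitrary covariance matrix
a <- c(2, -1, 0.5)                           # arbitrary weights

draws <- mvrnorm(1e5, mu = mu, Sigma = Sigma)
z <- as.vector(draws %*% a)                  # linear combination of the components

# the implied univariate normal has mean a'mu and variance a' Sigma a
c(empirical = mean(z), implied = sum(a * mu))
c(empirical = var(z), implied = drop(t(a) %*% Sigma %*% a))
qqnorm(z); qqline(z)                         # should look approximately linear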

Understanding the Role of the Covariance Matrix

It’s useful to consider the extremal situations first.

Here’s a draw from a multivariate normal distribution where the mean is 0 and the covariance matrix is all 1s:

The covariance matrix being full of ones forces all elements of the vector to be exactly the same.
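A draw like this could be generated with something along these lines (a minimal sketch assuming MASS::mvrnorm; the dimension of 50 is an arbitrary choice, and mvrnorm tolerates the rank-one, positive semi-definite covariance):

library(MASS)

d <- 50
Sigma_ones <- matrix(1, nrow = d, ncol = d)  # covariance matrix of all 1s

set.seed(2)
x <- mvrnorm(1, mu = rep(0, d), Sigma = Sigma_ones)
plot(x, type = "l", ylab = "value")  # a flat line: every entry is the same
range(x)                             # confirms the entries are (numerically) identical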

The other extreme is when the covariance matrix is 1 on the diagonal entries and 0 everywhere else. Here’s a draw from that distribution, again with mean 0:

This results in a very noisy vector which is still roughly centered at 0.
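A draw like this amounts to independent standard normal entries; a minimal sketch:

library(MASS)

d <- 50
set.seed(3)
x <- mvrnorm(1, mu = rep(0, d), Sigma = diag(d))  # equivalent to rnorm(d)
plot(x, type = "l", ylab = "value")               # independent noise around 0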

Another easy covariance matrix to specify is where the off-diagonal elements are all some fixed number, say .999:

This results in a noisy vector, but where all the entries are correlated with each other, resulting in reduced deviation within the components of the draw.
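A sketch of this case, again assuming MASS::mvrnorm, with 1s on the diagonal and 0.999 everywhere else:

library(MASS)

d <- 50
Sigma_corr <- matrix(0.999, nrow = d, ncol = d)
diag(Sigma_corr) <- 1  # unit variances, 0.999 covariances

set.seed(4)
x <- mvrnorm(1, mu = rep(0, d), Sigma = Sigma_corr)
plot(x, type = "l", ylab = "value")  # noisy, but the entries move nearly in lockstep
sd(x)                                # much smaller spread within the draw than under the identity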

Using Gaussian Kernels

Now that we understand how the multivariate normal distribution behaves for a few example covariance matrices, we can think about more practical situations, such as when the covariance matrix is defined based on a kernel function like the Gaussian kernel.

The Gaussian kernel is defined as \(K(x_1, x_2) = \exp\left(-\frac{(x_1 - x_2)^2}{\sigma^2}\right)\) for some choice of smoothing parameter \(\sigma\).

Given \(X = [0, 0.1, 0.2, \ldots, 10]\) we can define our covariance matrix as \([K(X_i, X_j)]_{i,j}\). Essentially this results in a covariance matrix where neighboring (and nearby) components are expected to be highly correlated while more distant components are expected to have less correlation.

First, here’s an example with a relatively large value of \(\sigma\):

If we reduce \(\sigma\) to 0.1 we get more wiggles:
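Draws like the two above could be generated with a sketch along these lines (the kernel function implements the formula above; \(\sigma = 1\) stands in for the larger smoothing parameter in the first plot, and the small jitter added to the diagonal is just for numerical stability):

library(MASS)

gaussian_kernel <- function(x1, x2, sigma) exp(-(x1 - x2)^2 / sigma^2)

X_grid <- seq(0, 10, by = 0.1)  # the X = [0, 0.1, 0.2, ..., 10] grid from above
draw_gp <- function(sigma) {
  K <- outer(X_grid, X_grid, gaussian_kernel, sigma = sigma)
  K <- K + diag(1e-9, length(X_grid))  # small jitter for numerical stability
  mvrnorm(1, mu = rep(0, length(X_grid)), Sigma = K)
}

set.seed(5)
draw_smooth <- draw_gp(sigma = 1)    # larger sigma: smoother draw
draw_wiggly <- draw_gp(sigma = 0.1)  # smaller sigma: wigglier draw

plot(X_grid, draw_smooth, type = "l",
     ylim = range(draw_smooth, draw_wiggly), ylab = "value")
lines(X_grid, draw_wiggly, col = "red")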

Function Approximation using MVNs and the Gaussian Kernel

It’s useful to note that throughout the Bayesian statistics literature one can find frequent references to “Gaussian Processes,” which place a multivariate normal prior on the function values being modeled. Gaussian Process models have made quite a splash in machine learning and statistics as a flexible means of modeling smooth surfaces. Gaussian Process models also sometimes go by the name of “kriging,” which commonly comes up in geospatial statistics.

To create an example scenario in which we can apply some Gaussian Process modeling, let’s define a wiggly function from which we will take some sample observations. Then we will fit the data using a Gaussian Process model and see how the model compares to the true function. We won’t take model assessment too seriously here, but we will qualitatively compare the Gaussian Process model’s performance to other common techniques like cubic splines and locally weighted smoothing (loess).

h_func <- function(x) { cos(x^2) + x }  # the "true" wiggly function
n <- 25
X <- runif(n, 0, 4)         # sample locations
h_vals <- h_func(X)         # observed (noise-free) values
h_mean <- mean(h_vals)
h_vals <- h_vals - h_mean   # center the observations around 0

Note that we’re centering the data so that it’s compatible with our later assumption that \(\mu = 0\) for our Gaussian Process model.

Now we specify the Gaussian Process model below using Stan.

data {
  int<lower=1> N1;      // number of observed points
  real x1[N1];          // observed locations
  vector[N1] y1;        // observed values
  int<lower=1> N2;      // number of prediction points
  real x2[N2];          // prediction locations
}
transformed data {
  real delta = 1e-9;    // jitter added to the diagonal for numerical stability
  int<lower=1> N = N1 + N2;
  real x[N];            // observed and prediction locations, concatenated
  for (n1 in 1:N1) x[n1] = x1[n1];
  for (n2 in 1:N2) x[N1 + n2] = x2[n2];
}
parameters {
  real<lower=0> rho;    // length-scale of the covariance kernel
  real<lower=0> alpha;  // marginal standard deviation
  real<lower=0> sigma;  // observation noise standard deviation
  vector[N] eta;        // standard normal draws for the non-centered parameterization
}
transformed parameters {
  vector[N] f;
  {
    matrix[N, N] L_K;
    matrix[N, N] K = cov_exp_quad(x, alpha, rho);

    // diagonal elements
    for (n in 1:N)
      K[n, n] = K[n, n] + delta;

    L_K = cholesky_decompose(K);
    f = L_K * eta;  // implies f ~ multi_normal(0, K)
  }
}
model {
  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();
  eta ~ std_normal();

  y1 ~ normal(f[1:N1], sigma);
}
generated quantities {
  vector[N2] y2;
  for (n2 in 1:N2)
    y2[n2] = f[N1 + n2];  // latent function values at the prediction points (no observation noise)
}
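To actually fit the model, one might use something along the lines of the sketch below; the file name “gp.stan”, the prediction grid, and the sampler settings are illustrative assumptions rather than the exact code used for the results discussed next.

library(rstan)

# X, h_vals, and h_mean come from the earlier data-generating chunk;
# "gp.stan" is an assumed file name for the Stan program above
x2 <- seq(0, 5, by = 0.05)  # locations at which to predict

stan_data <- list(
  N1 = length(X), x1 = X, y1 = h_vals,
  N2 = length(x2), x2 = x2
)

fit <- stan(file = "gp.stan", data = stan_data,
            chains = 4, iter = 2000, seed = 123)

# posterior mean of the latent function at the prediction locations,
# adding back the mean we subtracted when centering the data
y2_hat <- colMeans(rstan::extract(fit)$y2) + h_mean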

We can see the model does fairly well within the [0, 3.75] range, but it is definitely not great for extrapolating, since the true function does not decay to 0 outside the observed area while the zero-mean GP predictions revert toward 0.

We can compare this to how cubic splines perform:
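A comparison along these lines might be set up with something like the following sketch (using a natural cubic spline basis via splines::ns; df = 8 is an arbitrary illustrative choice rather than the setting used for the plot):

library(splines)

# X, h_vals, h_mean, and h_func come from the earlier chunks
spline_fit <- lm(h_vals ~ ns(X, df = 8))

x_grid <- seq(0, 4, by = 0.05)
spline_pred <- predict(spline_fit, newdata = data.frame(X = x_grid)) + h_mean

plot(x_grid, h_func(x_grid), type = "l", ylab = "h(x)")  # true function
lines(x_grid, spline_pred, col = "blue")                 # cubic spline fit
points(X, h_vals + h_mean)                               # observations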

The cubic spline regression seems to underperform compared to the Gaussian Process model in some of the more sparsely observed areas within the [0, 3.75] range.

And then with loess:
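A loess comparison might look like this sketch (using the default span; predictions are restricted to the observed range since loess does not extrapolate):

# X, h_vals, h_mean, and h_func come from the earlier chunks
loess_fit <- loess(h_vals ~ X)

x_grid <- seq(min(X), max(X), length.out = 200)
loess_pred <- predict(loess_fit, newdata = data.frame(X = x_grid)) + h_mean

plot(x_grid, h_func(x_grid), type = "l", ylab = "h(x)")  # true function
lines(x_grid, loess_pred, col = "darkgreen")             # loess fit
points(X, h_vals + h_mean)                               # observations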

Loess seems to apply a lot more smoothing and less interpolation than the cubic spline and Gaussian Process model approaches. It would be interesting to see whether this property means that loess performs better on noisier data.

Next Steps

All in all, I think the examples here provide a helpful intuition about the basic behavior of multivariate normal distributions and how they can be used to define Gaussian Process models in 1 dimension.

I think building further intuition about how these models perform when faced with noisy data and with higher-dimensional inputs would be valuable and enlightening.

References

Gramacy, Robert B. Surrogates: Gaussian Process Modeling, Design and Optimization for the Applied Sciences, Chapter 5. https://bookdown.org/rbg/surrogates/chap5.html

Stan User’s Guide, Section 10.3: Fitting a Gaussian Process. https://mc-stan.org/docs/2_19/stan-users-guide/fit-gp-section.html