Effect of constraining the nonlinear correction term in the inflow GP

# Copyright (C) 2018 Juan Pablo Carbajal
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

# Author: Juan Pablo Carbajal <ajuanpi+dev@gmail.com>
# Created: 2018-11-25

Introduction

The goal of this script is to show the effect of constraining a nonlinear correction of a linear model. The linear model has the following structure:

$$ y_l = y_o + w_1 x_1 + w_2 x_2 + \xi$$

where $x_1$ and $x_2$ are the logarithms of the number of users (NUsers) and of the container volume (CoVol), respectively. The output variable $y_l$ is the linear model's approximation of the logarithm of the observed container emptying frequency (the inverse of SEmptyW). The last term $\xi$ is an independent t-distributed noise term with zero mean; its variance is subject to optimization during the regression.
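
The noise model used below is gpml's Student-t likelihood likT, which is parameterized by $\log(\nu - 1)$ and $\log(s_n)$. For reference, the variance of such scaled Student-t noise is finite only for $\nu > 2$ and equals

$$ \mathrm{Var}(\xi) = \frac{\nu \, s_n^2}{\nu - 2} $$

This is the quantity reported as the variance of the noise in the regression loop below.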

This model does not capture all the variability in the data set, hence one can propose a nonparametric nonlinear correction term:

$$ \hat{y} = y_l + \epsilon f(x_1, \ldots, x_5) $$

The correction depends on the two variables already mentioned and on three additional ones. This extra term reduces the noise term, i.e. its variance becomes smaller. However, if the intensity of the correction ($\epsilon$) is not constrained, the new model can drive the noise variance to zero, i.e. it can interpolate the given data. Therefore, to avoid this unrealistic overfitting, two measures can be taken:

  1. Constrain the variance of the noise term from below.
  2. Constrain the amplitude of the correction ($\epsilon$) from above.

The first measure makes sense when we have an estimate of the observation noise. The second measure is operative: it allows us to learn which variables the correction uses to reduce the noise variance, i.e. to improve the goodness of fit.

In the following study we use both measures. The bound on the noise variance is estimated from the data and kept fixed. The amplitude of the correction is set to several values, and for each value we identify the variables that the correction uses to improve the regression.

Dependencies

pkg load gpml

Load data

The data we will use is the subset corresponding to septic tanks in Kampala. We use CoTyp to select this subset.

Xname = {'NUsers','CoVol', 'CoAge', 'TrVol', 'IC', 'CoTyp'};
Yname = 'SEmptyW';
[X Y isXcat Xcat_str] = dataset (Xname, Yname, 'Kampala');
Xcat = X(:, isXcat);
ncat = size (Xcat, 2);
X = X(:,!isXcat);
Xname_cat = Xname(isXcat);
Xname = Xname(!isXcat);

% Indexes to partition output using the categorical variable
idx_cat = categorypartition (Xcat);
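
The functions dataset and categorypartition are helpers of this project, not part of gpml. Assuming that categorypartition simply groups row indices by the unique rows of the categorical columns, a minimal sketch of its behavior is:

% Sketch (assumption) of categorypartition: one cell of row indices per
% unique categorical value.
function idx = categorypartition_sketch (Xcat)
  [~, ~, j] = unique (Xcat, 'rows');   % category label of each row
  idx = arrayfun (@(k) find (j == k), 1:max (j), 'UniformOutput', false);
endfunction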

Select the second category, corresponding to 'Septic tank'.

X              = log10 (X(idx_cat{2},:));
[X Xmean Xstd] = zscore (X);
Y              = -log10 (Y(idx_cat{2})); % -log Period = log Freq

% Indexes of variables used for mean function
[~, imean] = ismember ({'NUsers', 'CoVol'}, Xname);

GP regressor

First we specify the regressor structure, i.e. linear model plus a nonparametric correction term.

linear_mean = {@meanSum, {@meanLinear, @meanConst}}; % w'*x + c
meanf       = {@meanMask, imean, linear_mean};       % linear mean acts only on NUsers, CoVol
covf        = {@covMaternard, 1};                    % Matern (d=1) covariance with ARD
likf        = {@likT};                               % Student-t likelihood

N     = length (Y);  % number of samples
numax = log (N - 1); % log of the maximum nu parameter for t-distribution
dimX  = columns (X); % dimension of input

hyp0.lik  = [numax; 0];                 % t-distributed noise
hyp0.mean = [X(:,imean) ones(N,1)] \ Y; % mean function == linear regression
hyp0.cov  = zeros(dimX + 1, 1);         % ARD lengths and variance set to 1

The noise in the measured emptying period is bounded between 1 day and 100 weeks.

Ferror = sort (log (abs ([-log10(1/7) -log10(100)])));
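% Worked values, assuming the period is measured in weeks: 1 day = 1/7 week,
% so the magnitude of the error on y = -log10 (period) lies between
% |log10 (1/7)| ~ 0.845 and |log10 (100)| = 2, giving
% Ferror = log ([0.845 2]) ~ [-0.168 0.693].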
% The nu parameter of the t-distribution is constrained to avoid
% amplification of the variance.
prior.lik = {{@priorSmoothBox1, log(3-1), numax, 40}, {}};
prior.lik{end} = {@priorSmoothBox1, Ferror(1)-numax, Ferror(2)-numax, 40};

% The covariance amplitude is fixed to the desired value;
% priorDelta excludes it from the optimization
prior.cov      = cell (dimX + 1, 1);
prior.cov{end} = {@priorDelta, 0};

% Inference of hyper-parameters is done using the Variational Bayes method
infe = {@infPrior, @infVB, prior};

% Group all arguments
args = {infe, meanf, covf, likf, X, Y};
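
The helper train_gp used in the loop below also belongs to this project, not to gpml. Assuming it wraps gpml's standard hyperparameter optimization, a minimal sketch is:

% Sketch (assumption) of train_gp: minimize the negative log marginal
% likelihood over the hyperparameters with gpml's minimize.
function hyp = train_gp_sketch (args, hyp0)
  hyp = minimize (hyp0, @gp, -200, args{:}); % at most 200 function evaluations
endfunction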

Define the vector of values for the correction amplitude. A value of 0.14 gives a maximum correction that is about 10% of the value of the linear model, i.e. a coefficient of variation of ~10%.

nC   = 20;
MaxC = sort ([linspace(5e-2, 0.3, nC).'; 0.14]);
nC   = length (MaxC);

Now we regress the data with each value of the correction amplitude.

% Verbosity defaults to false; define the variable verbose in the command line to override
% Make sure verbose is false when generating a html report with publish
if ~exist ('verbose', 'var')
  verbose = false;
endif

if !exist('HYP', 'var')
  % Pre-allocate
  HYP = ARG = cell (nC, 1);
  SStot = sumsq (Y - mean (Y)); % prop. to variance, used in goodness of fit
  Rsq = zeros (nC, 2);

  for i = nC:-1:1
    maxcov = log (MaxC(i));
    % set prior: parameter for priorDelta in covariance amplitude
    args{1}{3}.cov{end}{2} = maxcov;
    % set hyp0 as initial guess, and covariance amplitude
    hyp                          = hyp0;
    hyp.cov(end)                 = maxcov;
    hyp                          = train_gp (args, hyp, [], verbose);
    [Yhat dYhat2 mf df2 lp post] = gp (hyp, args{:}, X);

    printf ('--\nCovariance amplitude: %.2f\n', MaxC(i));

    % Coefficient of variation
    yl = feval (meanf{:}, hyp.mean, X);                    % linear model alone
    cf = feval(covf{:}, hyp.cov, X) * post.alpha;          % correction function
    [mincv maxcv] = bounds (abs (cf ./ yl));
    printf ('Bounds coeff variation (%%): %.1f %.1f\n', mincv * 100, maxcv * 100);

    % Variance of the noise
    nu    = exp (hyp.lik(1)) + 1;
    sn    = exp (hyp.lik(2));
    noise = nu * sn^2 / (nu - 2);
    printf ('Variance of noise: %.2f\n', noise);

    % Compute coefficient of determination
    Rsq(i, :) = 1 - [sumsq(Y - Yhat) sum(dYhat2)] ./ SStot;
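    % Column 1 is the usual R^2 of the point predictions; column 2 (the
    % 'variance ratio' printed below) compares the predictive variance
    % with the total variance of the data.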
    printf ('Goodness of fit: %.2f\nVariance ratio: %.2f\n', Rsq(i,:));

    % Likelihood of the model given the data
    printf ('Negative log marginal likelihood: %.2f\n', gp (hyp, args{:}));

    % Store results for further plotting
    HYP{i} = hyp;
    ARG{i} = args;
  endfor % over correction amplitudes
endif
--
Covariance amplitude: 0.30
Bounds coeff variation (%): 2.7 96.7
Variance of noise: 0.00
Goodness of fit: 1.00
Variance ratio: 0.99
Negative log marginal likelihood: 70.01
--
Covariance amplitude: 0.29
Bounds coeff variation (%): 2.7 96.4
Variance of noise: 0.00
Goodness of fit: 1.00
Variance ratio: 0.99
Negative log marginal likelihood: 75.57
--
Covariance amplitude: 0.27
Bounds coeff variation (%): 2.7 96.0
Variance of noise: 0.00
Goodness of fit: 1.00
Variance ratio: 0.99
Negative log marginal likelihood: 82.05
--
Covariance amplitude: 0.26
Bounds coeff variation (%): 0.6 38.6
Variance of noise: 0.01
Goodness of fit: 0.70
Variance ratio: 0.96
Negative log marginal likelihood: 84.65
--
Covariance amplitude: 0.25
Bounds coeff variation (%): 0.7 38.1
Variance of noise: 0.01
Goodness of fit: 0.67
Variance ratio: 0.96
Negative log marginal likelihood: 87.90
--
Covariance amplitude: 0.23
Bounds coeff variation (%): 0.2 41.5
Variance of noise: 0.01
Goodness of fit: 0.66
Variance ratio: 0.96
Negative log marginal likelihood: 87.20
--
Covariance amplitude: 0.22
Bounds coeff variation (%): 0.1 40.6
Variance of noise: 0.01
Goodness of fit: 0.64
Variance ratio: 0.96
Negative log marginal likelihood: 91.36
--
Covariance amplitude: 0.21
Bounds coeff variation (%): 1.1 39.2
Variance of noise: 0.01
Goodness of fit: 0.62
Variance ratio: 0.96
Negative log marginal likelihood: 96.33
--
Covariance amplitude: 0.19
Bounds coeff variation (%): 0.4 22.3
Variance of noise: 0.04
Goodness of fit: 0.52
Variance ratio: 0.89
Negative log marginal likelihood: 101.27
--
Covariance amplitude: 0.18
Bounds coeff variation (%): 0.3 17.7
Variance of noise: 0.06
Goodness of fit: 0.47
Variance ratio: 0.86
Negative log marginal likelihood: 102.52
--
Covariance amplitude: 0.17
Bounds coeff variation (%): 0.2 15.2
Variance of noise: 0.07
Goodness of fit: 0.43
Variance ratio: 0.84
Negative log marginal likelihood: 103.45
--
Covariance amplitude: 0.16
Bounds coeff variation (%): 0.1 13.1
Variance of noise: 0.07
Goodness of fit: 0.39
Variance ratio: 0.83
Negative log marginal likelihood: 104.19
--
Covariance amplitude: 0.14
Bounds coeff variation (%): 0.0 11.2
Variance of noise: 0.08
Goodness of fit: 0.36
Variance ratio: 0.82
Negative log marginal likelihood: 104.79
--
Covariance amplitude: 0.14
Bounds coeff variation (%): 0.1 10.9
Variance of noise: 0.08
Goodness of fit: 0.35
Variance ratio: 0.82
Negative log marginal likelihood: 104.87
--
Covariance amplitude: 0.13
Bounds coeff variation (%): 0.2 9.5
Variance of noise: 0.09
Goodness of fit: 0.33
Variance ratio: 0.81
Negative log marginal likelihood: 105.27
--
Covariance amplitude: 0.12
Bounds coeff variation (%): 0.3 7.8
Variance of noise: 0.09
Goodness of fit: 0.30
Variance ratio: 0.81
Negative log marginal likelihood: 105.69
--
Covariance amplitude: 0.10
Bounds coeff variation (%): 0.4 6.2
Variance of noise: 0.10
Goodness of fit: 0.28
Variance ratio: 0.80
Negative log marginal likelihood: 106.02
--
Covariance amplitude: 0.09
Bounds coeff variation (%): 0.4 4.8
Variance of noise: 0.10
Goodness of fit: 0.25
Variance ratio: 0.80
Negative log marginal likelihood: 106.30
--
Covariance amplitude: 0.08
Bounds coeff variation (%): 0.3 3.5
Variance of noise: 0.10
Goodness of fit: 0.23
Variance ratio: 0.80
Negative log marginal likelihood: 106.52
--
Covariance amplitude: 0.06
Bounds coeff variation (%): 0.2 2.4
Variance of noise: 0.10
Goodness of fit: 0.22
Variance ratio: 0.80
Negative log marginal likelihood: 106.70
--
Covariance amplitude: 0.05
Bounds coeff variation (%): 0.1 1.5
Variance of noise: 0.10
Goodness of fit: 0.20
Variance ratio: 0.80
Negative log marginal likelihood: 106.84

Plot results

This plot shows the performance of the model for different values of the correction amplitude. As mentioned before, the bigger the amplitude, the higher the goodness of fit (likely due to overfitting).

figure (1); clf
  plot (MaxC, Rsq(:,1), '-o');
  xlabel ('Covariance amplitude')
  ylabel ('Goodness of fit')
  ylim ([min(Rsq(:,1)), 1])
  xlim (MaxC([1 end]))
  grid on
s_inflow_constrained_correction-1.png

The following plot shows the relative relevance of each variable in the correction term. These weights change rapidly and the results are not very stable for lower values of the amplitude, which might be due to the small size of the dataset.
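
The relevance of input $k$ is taken to be the inverse of its ARD length-scale $\ell_k$ (stored in hyp.cov as $\log \ell_k$), normalized over the inputs; the snippet below computes

$$ r_k = \frac{\ell_k^{-1}}{\sum_j \ell_j^{-1}} $$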

RelC = cell2mat (cellfun (@(x)exp (-x.cov(1:end-1)), HYP, 'unif', 0).');
RelC = RelC ./ sum (RelC);
figure (2); clf
  plot (MaxC, RelC, '-o');
  %plot (Rsq(:,1), RelC, '-o');
  legend (Xname)
  axis tight
  xlabel ('Covariance amplitude')
  %xlabel ('Goodness of fit')
  ylabel ('Relative relevance c.f.')
s_inflow_constrained_correction-2.png

The following plot shows the relative weights of the linear model.

RelM = cell2mat (cellfun (@(x)x.mean(1:end-1), HYP, 'unif', 0).');
RelM = RelM ./ sum (abs (RelM));
figure (3); clf
  plot (MaxC, RelM, '-o');
  %plot (Rsq(:,1), RelM, '-o');
  legend (Xname{imean})
  line (xlim (), [0 0]); % zero reference line
  axis tight
  xlabel ('Covariance amplitude')
  %xlabel ('Goodness of fit')
  ylabel ('Relative relevance m.f.')
s_inflow_constrained_correction-3.png