Gaussian process regression TS from Hanoi households


# Copyright (C) 2018 Juan Pablo Carbajal
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <>.

# Author: Juan Pablo Carbajal <>
# Created: 2018-01-09


pkg load gpml

Load data

Xname contains the variables used as predictor Yname contains the output variable

Xname = {'NUsers','CoVol', 'SludgeAge', 'Vpumped'};
Yname = 'TS';
[X Y] = dataset (Xname, Yname, 'Hanoi');

GP regressor

We use the data directly, without non-linear mappings. We only re-scale the inputs such that they scales are comparable.

$$ x_i = \frac{X_i - \bar{X}_i}{\sigma_{X_i}} $$

y           = Y;
x           = X;
x           = zscore (x);
assert (all(isfinite(x)))

Variables that are particularly not useful for prediction are removed from the mean and covariance functions, respectively. imean and icov are used as a selection mask. The indices inside these variables refer to positions in Xname

imean       = setdiff (1:length(Xname), 4); % remove Vpumped from mean
icov        = setdiff (1:length(Xname), 2); % remove CoVol from mean
N           = length (y);

Since we do not have a prior model structure hyper-parameters are difficult to constraint. For the exploration of the hyper-parameters we split the data in training and test sets. After good parameters are found, we use the whole dataset for training. This hopefully increases the significance of the regressed parameters.

%comment and clear _idx_ to have a test set.
idx = struct ('trn', 1:N, 'tst', []);
if !exist ('idx', 'var')
  N   = length (x);
  idx = struct ('trn', [], 'tst', []);
  [~,~,idx.tst,idx.trn] = splitdata ([x y], 0.2);
  % add min and max values
  [~,imin] = min (y);
  [~,imax] = max (y);
  idx.tst = sort (setdiff (idx.tst, [imin imax]));
  idx.trn = sort ([idx.trn imin imax]);

% Verbosity is true, define the variable verbose in the command line to override
% Make sure verbose is false when generating a html report with publish
if !exist ('verbose', 'var')
  verbose = false;
if !exist('hyp', 'var')
  hyp = [];

The GP structure is defined in the function TSgp.m, refer to it to know more details.

% Train
[hyp args] = TSgp (x(idx.trn,:), y(idx.trn), {imean, icov}, hyp, ...
  [log(1) log(10)], log(1e1), verbose);
% Test
[y_, dy2_, f_, df2_, lp, post] = gp (hyp, args{:}, x, y);
dy_ = 1.96 * sqrt (dy2_);

Summary of the results

The coefficient of variation is computed as the ratio between the predictive standard deviation and the predictive mean.

$$ c(\vec{x}) = \frac{\sigma_y(\vec{x})}{\bar{y}(\vec{x})} $$

It is used to quantify the amount of correction. Since we do not have a prior model for TS, this correction indicates nonlinear terms that could be used to improve prediction.

report_gp (hyp, args);
** Reports of results
Negative log marginal likelihod: 177.29
Mean function parameters
	-2.57	-9.47	4.97	30.63
Min of inputs
	-2.01	-1.24	-1.83
Max of inputs
	2.64	3.47	1.70
Bounds mean fun: -6.78 49.70
Covariance amplitude: 90.26
Bounds cov fun: -14.98 17.50
Bounds coeff variation (%): 0.89 105.68
Deviations: 11.55 11.55

Plot results

These plots illustrate the performance of the model

yname = 'TS [mg/L]';
plotresults_gp (1, hyp, args, X(idx.trn,:), Y(idx.trn,:), Xname, {'',Yname});
  ylim ([0 ylim()(2)]);
if !isempty (idx.tst)
  hold on
  errorbar (y(idx.tst), y_(idx.tst), dy_(idx.tst), '~or')
  hold off
  axis tight

This plot shows the relative weight of each variable in the mean function and the relevance in the covariance function.

plothypARD (4, hyp, Xname,{imean, icov});