Gaussian process regression TS from Kampala separated by origin category


# Copyright (C) 2018 Juan Pablo Carbajal
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <>.

# Author: Juan Pablo Carbajal <>
# Created: 2018-01-09

pkg load gpml

Gaussian Process regression of TS

We load continuous site variables to build a model for TS. We also load the categorical variable container type to segment the dataset. The latter is then removed from the input variables Xname. Yname contains the output variable to be predicted.

Xname = {'NUsers','CoVol', 'SEmptyW', 'IC', 'CoTyp'};
Yname = 'TS';
[X Y isXcat Xcat_str] = dataset (Xname, Yname, 'Kampala');
Xcat = X(:, isXcat);
ncat = size (Xcat, 2);
X = X(:,!isXcat);
Xname_cat = Xname(isXcat);
Xname = Xname(!isXcat);

% Indexes to partition output using the categorical variable
idx_cat = categorypartition (Xcat);
cotyp = Xcat_str.(Xname_cat{1}); %{'Pit latrine', 'Septic tank'};

ICOV = IMEAN = struct();
IMEAN.(cotyp{1}) = setdiff (1:length(Xname), []);
IMEAN.(cotyp{2}) = setdiff (1:length(Xname), []);
ICOV.(cotyp{1}) = setdiff (1:length(Xname), []);
ICOV.(cotyp{2}) = setdiff (1:length(Xname), []);

GP regressor

We use the data directly, without non-linear mappings. We only re-scale the inputs such that they scales are comparable.

$$ x_i = \frac{X_i - \bar{X}_i}{\sigma_{X_i}} $$

The GP structure is defined in the function TSgp.m, refer to it to know more details.

X_      = zscore (X);
[N dim] = size (X_);
assert (all(isfinite(X_)))

if !exist('HYP', 'var')
  HYP = ARG = struct();

% Verbosity is true, define the variable verbose in the command line to override
% Make sure verbose is false when generating a html report with publish
if ~exist ('verbose', 'var')
  verbose = false;

% Loop over origin categories and build a model for each
for icat = 1:2
  cat_name = cotyp{icat};

  x = X_(idx_cat{icat},:);
  y = Y(idx_cat{icat});

  if !isfield (HYP, cat_name)
    hyp = [];
    hyp = HYP.(cat_name);

  % Choose hyper-parameter constraints for each category
  TSerror = [log(1) log(10)]; % log of the error bounds, here 1-10 mg/L
  switch cat_name
    case 'Pit latrine'
      maxcov = log (1.5e1);
    case 'Septic tank'
      maxcov = log (1e1);
  ifun = {IMEAN.(cat_name), ICOV.(cat_name)};
  [hyp args] = TSgp (x, y, ifun, hyp, TSerror, maxcov, verbose);

  % Store results for further plotting
  y_data.(cat_name) = y;
  HYP.(cat_name)    = hyp;
  ARG.(cat_name)    = args;
  XX.(cat_name)     = X(idx_cat{icat},:);
  YY.(cat_name)     = Y(idx_cat{icat},:);
endfor % over categories

Summary of the results

The coefficient of variation is computed as the ratio between the predictive standard deviation and the predictive mean.

$$ c(\vec{x}) = \frac{\sigma_y(\vec{x})}{\bar{y}(\vec{x})} $$

It is used to quantify the amount of correction.

Since we do not have a prior model for TS, this correction indicates nonlinear terms that could be used to improve prediction.

for icat = 1:2
  cat_name = cotyp{icat};
  printf ('\n-- %s --\n', cat_name);
  report_gp (HYP.(cat_name), ARG.(cat_name));
-- Pit latrine --
** Reports of results
Negative log marginal likelihod: 250.03
Mean function parameters
	0.40	-13.56	22.35	1.48	37.06
Min of inputs
	-0.97	-0.56	-0.65	-1.43
Max of inputs
	5.15	3.35	1.86	0.66
Bounds mean fun: -21.75 83.76
Covariance amplitude: 224.51
Bounds cov fun: -24.41 66.15
Bounds coeff variation (%): 1.37 191.77
Deviations: 12.51 12.51
-- Septic tank --
** Reports of results
Negative log marginal likelihod: 164.58
Mean function parameters
	-2.61	-1.68	5.18	-1.51	17.25
Min of inputs
	-1.10	-0.54	-0.68	-0.39
Max of inputs
	1.96	7.38	4.73	1.71
Bounds mean fun: -2.16 40.86
Covariance amplitude: 104.07
Bounds cov fun: -20.11 50.07
Bounds coeff variation (%): 8.22 178.97
Deviations: 12.16 12.16

Plot results

These plots illustrate the performance of the model

yname = 'TS [mg/L]';
plotresults_gp (1, HYP, ARG, XX, YY, Xname, {'', yname});

This plot shows the relative weight of each variable in the mean function and the relevance in the covariance function.

plothypARD (7, HYP.(cotyp{1}), Xname, {IMEAN.(cotyp{1}), ICOV.(cotyp{1})});
subplot (2,1,1); title (cotyp{1})
plothypARD (8, HYP.(cotyp{2}), Xname, {IMEAN.(cotyp{2}), ICOV.(cotyp{2})});
subplot (2,1,1); title (cotyp{2})