Gaussian process regression inflow from Kampala separated by origin category


# Copyright (C) 2018 Juan Pablo Carbajal
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <>.

# Author: Juan Pablo Carbajal <>
# Created: 2018-01-09

pkg load gpml

Load data

We load continuous site variables to build a model for the inverse of the emptying period SEmptyW. We also load the categorical variable container type to segment the dataset. The latter is then removed from the input variables Xname. Yname contains the output variable to be predicted.

Since the origin category OrCat can only be a cause of the observed variables we believe that we do not introduce bias in this way, in any case we are avoiding confounding.

Xname = {'NUsers','CoVol', 'CoAge', 'TrVol', 'IC', 'CoTyp'};
Yname = 'SEmptyW';
[X Y isXcat Xcat_str] = dataset (Xname, Yname, 'Kampala');
Xcat = X(:, isXcat);
ncat = size (Xcat, 2);
X = X(:,!isXcat);
Xname_cat = Xname(isXcat);
Xname = Xname(!isXcat);

% Indexes to partition output using the categorical variable
idx_cat = categorypartition (Xcat);
cotyp = Xcat_str.(Xname_cat{1}); %{'Pit latrine', 'Septic tank'};

% Indexes of variables used for mean function
[~, imean] = ismember ({'NUsers', 'CoVol'}, Xname);

GP regressor

All the regression is performed on logarithmic transformed variables. We take the negative logarithm of SEmptyW to get the frequency. After we are in the space where the regression will take place we normalize the input variables to put them all in similar scales:

$$ y = -\log(Y) $$ $$ x_i = log (X_i) $$ $$ x_i = \frac{x_i - \tilde{x}_i}{\alpha_{x_i}} $$

The GP structure is defined in the function inflowgp.m, refer to it to know more details.

if !exist('HYP', 'var')
  HYP = ARG = struct();

% Verbosity is true, define the variable verbose in the command line to override
% Make sure verbose is false when generating a html report with publish
if ~exist ('verbose', 'var')
  verbose = false;

% Loop over origin categories and build a model for each
for icat = 1:2
  cat_name = cotyp{icat};

  x = log10 (X(idx_cat{icat},:));
  x = x - median (x);
  x = x ./ mean (abs (x));
  y = -log10 (Y(idx_cat{icat})); % -log Period = log Freq

  if !isfield (HYP, cat_name)
    hyp = [];
    hyp = HYP.(cat_name);

  % Choose hyper-parameter constraints for each category
  % log of the error bounds: 1/7-100 week
  Ferror = sort (log (abs ([-log10(1/7) -log10(100)])));
  switch cat_name
    case 'Pit latrine'
      maxcov = log (0.06); % Max correction should be about 10% of mean
    case 'Septic tank'
      maxcov = log (0.122); % Max correction should be about 10% of mean
  [hyp args] = inflowgp (x, y, imean, hyp, Ferror, maxcov, verbose);

  % Store results for further plotting
  y_data.(cat_name) = y;
  HYP.(cat_name)    = hyp;
  ARG.(cat_name)    = args;
  XX.(cat_name)     = X(idx_cat{icat},:);
  YY.(cat_name)     = 1./Y(idx_cat{icat},:);
endfor % over categories

Summary of results

The coefficient of variation is computed as the ratio between the predictive standard deviation and the predictive mean.

$$ c(\vec{x}) = \frac{\sigma_y(\vec{x})}{\bar{y}(\vec{x})} $$

It is used to quantify the amount of correction.

Since for emptying frequency we have a prior model, the correction was constrained to produce a maximum coefficient of variation of about 10%.

for icat = 1:2
  cat_name = cotyp{icat};
  printf ('\n-- %s --\n', cat_name);
  report_gp (HYP.(cat_name), ARG.(cat_name), @(x)10.^(x));
-- Pit latrine --
** Reports of results
Negative log marginal likelihod: 102.94
Mean function parameters
	0.15	0.03	-1.61
Min of inputs
	-2.38	-2.72
Max of inputs
	2.78	2.72
Bounds mean fun: -1.95 -1.18
Covariance amplitude: 0.00
Bounds cov fun: -0.12 0.09
Bounds coeff variation (%): 0.29 10.54
t-distribution: 3.06 0.15
Deviations: 0.26 1.81
Corr coeff: 0.54
-- Septic tank --
** Reports of results
Negative log marginal likelihod: 108.88
Mean function parameters
	0.35	-0.10	-1.93
Min of inputs
	-2.59	-2.11
Max of inputs
	2.59	4.12
Bounds mean fun: -2.69 -1.02
Covariance amplitude: 0.02
Bounds cov fun: -0.17 0.11
Bounds coeff variation (%): 0.13 10.01
t-distribution: 3.04 0.17
Deviations: 0.31 2.05
Corr coeff: 0.58

Plot results

These plots illustrate the performance of the model

yname = 'Frequency [1/week]';
xname = {'# users', 'Containment V.', 'Containment age', ...
  'Income level'};
plotresults_gp (1, HYP, ARG, XX, YY, xname, {'log10', yname}, @(x)10.^(x));
for fig =3:4
  h = get(figure(fig), 'children');
  for i=1:length(h)
    set (h(i), 'yscale', 'log', 'ygrid', 'on');
    set (h(i), 'xscale', 'log', 'xgrid', 'on');
    axis tight
    xticks ([]); xticks ("auto"); % force recalculation of ticks
    yticks ([]); yticks ("auto"); % force recalculation of ticks

This plot shows the relative weight of each variable in the mean function and the relevance in the covariance function.

plothypARD (7, HYP.(cotyp{1}), xname, imean);
subplot (2,1,1); title (cotyp{1})
plothypARD (8, HYP.(cotyp{2}), xname, imean);
subplot (2,1,1); title (cotyp{2})