Gaussian process regression inflow from households


# Copyright (C) 2018 Juan Pablo Carbajal
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <>.

# Author: Juan Pablo Carbajal <>
# Created: 2018-01-09


pkg load gpml

Load data

We load continuous site variables to build a model for the inverse of the emptying period for the households in both, Hanoi and Kampala. Xname contains the input variables. Yname contains the output variable to be predicted.

The model uses the input variables present in both cities. CoAge is ignored, otherwise we will have to drop most of Hanoi data, because for many entries CoAge is equal to SludgeAge.

Xname = {'NUsers', 'CoVol', 'TrVol', 'OrCat'};
Yname = {'SEmptyW', 'SludgeAge'};
City  = {'Kampala', 'Hanoi'};

% Loop over cities
for c=1:2
  [tmpX tmpY isXcat Xcat_str] = dataset (Xname, Yname{c}, City{c});
  Xcat    = tmpX(:, isXcat);
  idx_cat = categorypartition (Xcat);

  # Select only Households
  [tf,i] = ismember ({'Household', 'Multiple Household'}, Xcat_str.OrCat);
  if length(i(tf)) > 1
    idx_household = cat (idx_cat(i(tf)){:});
    idx_household = idx_cat{i(tf)};

  X{c} = tmpX(idx_household, !isXcat);
  Y{c} = tmpY(idx_household);
Xname = Xname(!isXcat);
Y{2} = Y{2} *  52.1429; # convert years to weeks in Hanoi data

Merge the data from both cities

Yhousehold = cell2mat (Y.');
Xhousehold = cell2mat (X.');

GP regressor

All the regression is performed on logarithmic transformed variables. We take the negative logarithm of the emptying period to get the frequency.

After we are in the space where the regression will take place we normalize the input variables to put them all in similar scales:

$$ y = -\log(Y) $$ $$ x_i = log (X_i) $$ $$ x_i = \frac{x_i - \tilde{x}_i}{\alpha_{x_i}} $$

y  = -log10 (Yhousehold); % -log Period = log Freq
x  = log10 (Xhousehold);
x  = x - median (x);
x  = x ./ mean (abs (x));

assert (all(isfinite(x)))
assert (all(isfinite(y)))

[~, imean] = ismember ({'NUsers', 'CoVol'}, Xname);

if !exist ('hyp', 'var')
  hyp = [];

% Verbosity is true, define the variable verbose in the command line to override
% Make sure verbose is false when generating a html report with publish
if ~exist ('verbose', 'var')
  verbose = false;

% log of the error bounds: 1/7-100 week
Ferror = sort (log (abs ([-log10(1/7) -log10(100)])));
maxcov = log (0.085);               % Max correction should be about 10% of mean

The GP structure is defined in the function inflowgp.m, refer to it to know more details.

[hyp args] = inflowgp (x, y, imean, hyp, Ferror, maxcov, verbose);

Summary of results

The coefficient of variation is computed as the ratio between the predictive standard deviation and the predictive mean.

$$ c(\vec{x}) = \frac{\sigma_y(\vec{x})}{\bar{y}(\vec{x})} $$

It is used to quantify the amount of correction.

Since for emptying frequency we have a prior model, the correction was constrained to produce a maximum coefficient of variation of about 10%.

report_gp (hyp, args, @(x)10.^x);
** Reports of results
Negative log marginal likelihod: 198.08
Mean function parameters
	0.38	0.07	-2.02
Min of inputs
	-2.00	-1.95
Max of inputs
	2.86	4.02
Bounds mean fun: -2.90 -0.96
Covariance amplitude: 0.01
Bounds cov fun: -0.24 0.11
Bounds coeff variation (%): 0.02 10.20
t-distribution: 3.17 0.24
Deviations: 0.41 2.57
Corr coeff: 0.80

Plot results

These plots illustrate the performance of the model

yname = 'Frequency [1/week]';
xname = {'# users', 'Containment V.', 'Truck V.'};
plotresults_gp (2, hyp, args, Xhousehold, 1./Yhousehold, xname, ...
    {'log10', yname}, @(x)10.^(x));
h = get(figure(4), 'children');
for i=1:length(h)
  set (h(i), 'yscale', 'log', 'ygrid', 'on');
  set (h(i), 'xscale', 'log', 'xgrid', 'on');
  axis tight
  xticks ([]); xticks ("auto"); % force recalculation of ticks
  yticks ([]); yticks ("auto"); % force recalculation of ticks
warning: Non-positive limit for logarithmic axis ignored

This plot shows the relative weight of each variable in the mean function and the relevance in the covariance function.

plothypARD (6, hyp, xname, imean);

Plot data from both cities

Plot emptying frequency vs. each variable in the mean function independently

figure (5), clf
  np = 15;
  sp = {3:2:2*np, 4:2:2*np};
  for i=1:2
    h1 = loglog (X{1}(:,i), 1./Y{1}, 'o','markerfacecolor','auto');
    hold on
    h2 = loglog (X{2}(:,i), 1./Y{2}, 'x');
    axis tight
    grid on
    xlabel (xname{i})
    if i == 1
      ylabel (yname);
    hold off
  subplot(np,2,[1 2])
  axis off
  legend([h1,h2], City,'Location','North','Orientation','Horizontal');