# Gaussian process regression of inflow from Hanoi households

# Copyright (C) 2018 Juan Pablo Carbajal
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

# Author: Juan Pablo Carbajal <ajuanpi+dev@gmail.com>
# Created: 2018-01-09

## Dependencies

We load continuous site variables to build a model for the inverse of the emptying period SludgeAge. Xname contains the input variables. Yname contains the output variable to be predicted.

CoAge is ignored; otherwise we would have to drop most of the data, because for many entries CoAge is equal to SludgeAge.

Xname = {'NUsers','CoVol', 'TrVol', 'WaterV', 'Vpumped'};
Yname = 'SludgeAge';
[X Y] = dataset (Xname, Yname, 'Hanoi');

Y *= 52.1429; % convert years to weeks

## GP regressor

All regression is performed on log-transformed variables. We take the negative logarithm of SludgeAge to obtain the logarithm of the frequency. Since some of the input variables contain zeros, we add one to those variables before taking the logarithm. Once in the space where the regression takes place, we normalize the input variables to bring them to similar scales, where $\tilde{x}_i$ is the median of $x_i$ and $\alpha_{x_i}$ is the mean absolute value of the centered variable:

$$y = -\log_{10}(Y)$$
$$x_i = \log_{10}(X_i)$$
$$x_i = \frac{x_i - \tilde{x}_i}{\alpha_{x_i}}$$

y           = -log10 (Y); % -log Period = log Freq
[~, imean]  = ismember ({'NUsers', 'CoVol'}, Xname);
iother      = setdiff (1:numel(Xname), imean);
x           = X;
x(:,imean)  = log10 (x(:,imean));
x(:,iother) = log10 (x(:,iother)+1); % data has zeros
x           = x - median (x);
x           = x ./ mean (abs (x));
assert (all (isfinite (x)))
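
The transformation can be checked on synthetic data. The following is a minimal sketch, assuming a small made-up matrix Xdemo (not the Hanoi data) whose second column contains a zero; the names Xdemo, xdemo, idx_log and idx_log1 are hypothetical. It repeats the three steps and prints the column medians and mean absolute values, which should come out as zero and one respectively.

```octave
% Hypothetical data: column 1 is strictly positive, column 2 contains a zero
Xdemo    = [2 0; 4 1.5; 8 3; 5 0.5; 3 2];
idx_log  = 1;   % columns transformed with log10 (x)
idx_log1 = 2;   % columns transformed with log10 (x + 1)

xdemo             = Xdemo;
xdemo(:,idx_log)  = log10 (xdemo(:,idx_log));
xdemo(:,idx_log1) = log10 (xdemo(:,idx_log1) + 1);   % avoids log10 (0)
xdemo             = xdemo - median (xdemo);          % center on the column median
xdemo             = xdemo ./ mean (abs (xdemo));     % scale by the mean absolute value

disp (median (xdemo));       % column medians after normalization
disp (mean (abs (xdemo)));   % column mean absolute values after normalization
```

Centering on the median and scaling by the mean absolute value, rather than using the mean and the standard deviation, makes the normalization less sensitive to outliers.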

if !exist ('hyp', 'var')
hyp = [];
endif

% Verbosity defaults to false; define the variable verbose in the command line to override.
% Make sure verbose is false when generating an HTML report with publish.
if ~exist ('verbose', 'var')
verbose = false;
endif

% log of the error bounds: periods from 1/7 week (one day) to 100 weeks
Ferror = sort (log (abs ([-log10(1/7) -log10(100)])));
maxcov = log (0.075);               % Max correction should be about 10% of mean
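
As a worked check of these values, reading the code literally (natural logarithm of the absolute value of the base-10 logarithm of each period bound):

$$-\log_{10}(1/7) = \log_{10}(7) \approx 0.845, \qquad -\log_{10}(100) = -2$$
$$\ln(0.845) \approx -0.168, \qquad \ln(|-2|) = \ln(2) \approx 0.693$$
$$\ln(0.075) \approx -2.590$$

After sorting, Ferror is approximately [-0.168 0.693] and maxcov is approximately -2.590. How these values are used inside inflowgp.m is not shown here.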

The GP structure is defined in the function inflowgp.m; refer to it for more details.

[hyp args] = inflowgp (x, y, imean, hyp, Ferror, maxcov, verbose);

## Summary of results

The coefficient of variation is computed as the ratio between the predictive standard deviation and the predictive mean.

$$c(\vec{x}) = \frac{\sigma_y(\vec{x})}{\bar{y}(\vec{x})}$$

It is used to quantify the amount of correction.

Because a prior model exists for the emptying frequency, the correction was constrained to produce a maximum coefficient of variation of about 10%.
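
The ratio above can be illustrated with a minimal sketch. The predictive means and standard deviations below are made-up values, not model output, and the variable names are hypothetical. Taking the absolute value of the mean is an assumption made here because the predictions live in log10 space, where the mean is negative; the actual computation in report_gp may differ.

```octave
% Hypothetical predictive mean and standard deviation in log10 space
ymean = [-2.75; -2.60; -2.52];
ystd  = [ 0.01;  0.12;  0.24];

% Coefficient of variation in percent.
% Assumption: abs () is used because the log10 mean is negative.
cv = 100 * ystd ./ abs (ymean);

printf ('Bounds coeff variation (%%): %.2f %.2f\n', min (cv), max (cv));
```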

report_gp (hyp, args, @(x)10.^x);
** Reports of results
Negative log marginal likelihod: 27.28
Mean function parameters
-0.05	0.01	-2.66
Min of inputs
-2.99	-2.24
Max of inputs
2.99	4.48
Bounds mean fun: -2.77 -2.51
Covariance amplitude: 0.01
Bounds cov fun: -0.24 0.21
Bounds coeff variation (%): 0.34 9.66
t-distribution: 3.06 0.05
Deviations: 0.10 1.25
Corr coeff: 0.76
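
To read the report in physical units, the bounds of the mean function can be mapped back to emptying periods. The sketch below assumes that the "Bounds mean fun" values are base-10 logarithms of the frequency in 1/week, which is consistent with y = -log10 (Y) with Y in weeks; the variable names are hypothetical.

```octave
weeks_per_year = 52.1429;        % same conversion factor applied to Y
logf_bounds    = [-2.77 -2.51];  % "Bounds mean fun" copied from the report

freq_bounds  = 10 .^ logf_bounds;             % frequency [1/week]
period_weeks = 1 ./ freq_bounds;              % emptying period [weeks]
period_years = period_weeks / weeks_per_year; % emptying period [years]

printf ('Period: %.0f to %.0f weeks (%.1f to %.1f years)\n', ...
        period_weeks(2), period_weeks(1), period_years(2), period_years(1));
```

Under that reading, the mean function spans emptying periods of roughly 324 to 589 weeks, that is, about 6 to 11 years.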

## Plot results

These plots illustrate the performance of the model.

yname = 'Frequency [1/week]';
xname = {'# users', 'Containment V.', 'Truck V.', ...
'Water added V.', 'Sludge emptied V.'};
plotresults_gp (1, hyp, args, X, 1./Y, xname, {'log10', yname}, @(x)10.^(x));
figure (2);
title ('Septic tank');

This plot shows the relative weight of each variable in the mean function and its relevance in the covariance function.

plothypARD (4, hyp, xname, imean);