# Copyright (C) 2018 Juan Pablo Carbajal # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see <http://www.gnu.org/licenses/>. # Author: Juan Pablo Carbajal <ajuanpi+dev@gmail.com> # Created: 2018-01-09 pkg load gpml

We load continuous site variables to build a model for the inverse of the
emptying period `SEmptyW`

.
We also load the categorical variable container type to segment the dataset.
The latter is then removed from the input variables `Xname`

.
`Yname`

contains the output variable to be predicted.

Since the origin category `OrCat`

can only be a cause of the observed variables
we believe that we do not introduce bias in this way, in any case we are
avoiding confounding.

Xname = {'NUsers','CoVol', 'CoAge', 'TrVol', 'IC', 'CoTyp'}; Yname = 'SEmptyW'; [X Y isXcat Xcat_str] = dataset (Xname, Yname, 'Kampala'); Xcat = X(:, isXcat); ncat = size (Xcat, 2); X = X(:,!isXcat); Xname_cat = Xname(isXcat); Xname = Xname(!isXcat); % Indexes to partition output using the categorical variable idx_cat = categorypartition (Xcat); cotyp = Xcat_str.(Xname_cat{1}); %{'Pit latrine', 'Septic tank'}; % Indexes of variables used for mean function [~, imean] = ismember ({'NUsers', 'CoVol'}, Xname);

All the regression is performed on logarithmic transformed variables.
We take the negative logarithm of `SEmptyW`

to get the frequency.
After we are in the space where the regression will take place we normalize
the input variables to put them all in similar scales:

$$ y = -\log(Y) $$ $$ x_i = log (X_i) $$ $$ x_i = \frac{x_i - \tilde{x}_i}{\alpha_{x_i}} $$

The GP structure is defined in the function
`inflowgp.m`

,
refer to it to know more details.

if !exist('HYP', 'var') HYP = ARG = struct(); endif % Verbosity is true, define the variable verbose in the command line to override % Make sure verbose is false when generating a html report with publish if ~exist ('verbose', 'var') verbose = false; endif % Loop over origin categories and build a model for each for icat = 1:2 cat_name = cotyp{icat}; x = log10 (X(idx_cat{icat},:)); x = x - median (x); x = x ./ mean (abs (x)); y = -log10 (Y(idx_cat{icat})); % -log Period = log Freq if !isfield (HYP, cat_name) hyp = []; else hyp = HYP.(cat_name); endif % Choose hyper-parameter constraints for each category % log of the error bounds: 1/7-100 week Ferror = sort (log (abs ([-log10(1/7) -log10(100)]))); switch cat_name case 'Pit latrine' maxcov = log (0.06); % Max correction should be about 10% of mean case 'Septic tank' maxcov = log (0.122); % Max correction should be about 10% of mean endswitch [hyp args] = inflowgp (x, y, imean, hyp, Ferror, maxcov, verbose); % Store results for further plotting y_data.(cat_name) = y; HYP.(cat_name) = hyp; ARG.(cat_name) = args; XX.(cat_name) = X(idx_cat{icat},:); YY.(cat_name) = 1./Y(idx_cat{icat},:); endfor % over categories

The coefficient of variation is computed as the ratio between the predictive standard deviation and the predictive mean.

$$ c(\vec{x}) = \frac{\sigma_y(\vec{x})}{\bar{y}(\vec{x})} $$

It is used to quantify the amount of correction.

Since for emptying frequency we have a prior model, the correction was constrained to produce a maximum coefficient of variation of about 10%.

for icat = 1:2 cat_name = cotyp{icat}; printf ('\n-- %s --\n', cat_name); report_gp (HYP.(cat_name), ARG.(cat_name), @(x)10.^(x)); endfor

-- Pit latrine -- ** Reports of results Negative log marginal likelihod: 102.94 Mean function parameters 0.15 0.03 -1.61 Min of inputs -2.38 -2.72 Max of inputs 2.78 2.72 Bounds mean fun: -1.95 -1.18 Covariance amplitude: 0.00 Bounds cov fun: -0.12 0.09 Bounds coeff variation (%): 0.29 10.54 t-distribution: 3.06 0.15 Deviations: 0.26 1.81 Corr coeff: 0.54 -- Septic tank -- ** Reports of results Negative log marginal likelihod: 108.88 Mean function parameters 0.35 -0.10 -1.93 Min of inputs -2.59 -2.11 Max of inputs 2.59 4.12 Bounds mean fun: -2.69 -1.02 Covariance amplitude: 0.02 Bounds cov fun: -0.17 0.11 Bounds coeff variation (%): 0.13 10.01 t-distribution: 3.04 0.17 Deviations: 0.31 2.05 Corr coeff: 0.58

These plots illustrate the performance of the model

yname = 'Frequency [1/week]'; xname = {'# users', 'Containment V.', 'Containment age', ... 'Income level'}; plotresults_gp (1, HYP, ARG, XX, YY, xname, {'log10', yname}, @(x)10.^(x)); for fig =3:4 h = get(figure(fig), 'children'); for i=1:length(h) axes(h(i)); set (h(i), 'yscale', 'log', 'ygrid', 'on'); set (h(i), 'xscale', 'log', 'xgrid', 'on'); axis tight xticks ([]); xticks ("auto"); % force recalculation of ticks yticks ([]); yticks ("auto"); % force recalculation of ticks drawnow endfor endfor

This plot shows the relative weight of each variable in the mean function and the relevance in the covariance function.

plothypARD (7, HYP.(cotyp{1}), xname, imean); subplot (2,1,1); title (cotyp{1}) plothypARD (8, HYP.(cotyp{2}), xname, imean); subplot (2,1,1); title (cotyp{2})