# Copyright (C) 2018 Juan Pablo Carbajal # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see <http://www.gnu.org/licenses/>. # Author: Juan Pablo Carbajal <ajuanpi+dev@gmail.com> # Created: 2018-01-09 pkg load gpml

We load continuous site variables to build a model for the inverse of the
emptying period `SEmptyW`

.
We also load the categorical variable container type to segment the dataset.
The latter is then removed from the input variables `Xname`

.
`Yname`

contains the output variable to be predicted.

Since the origin category `OrCat`

can only be a cause of the observed variables
we believe that we do not introduce bias in this way, in any case we are
avoiding confounding.

Xname = {'NUsers','CoVol', 'CoAge', 'TrVol', 'IC', 'CoTyp'}; Yname = 'SEmptyW'; [X Y isXcat Xcat_str] = dataset (Xname, Yname, 'Kampala'); Xcat = X(:, isXcat); ncat = size (Xcat, 2); X = X(:,!isXcat); Xname_cat = Xname(isXcat); Xname = Xname(!isXcat); % Indexes to partition output using the categorical variable idx_cat = categorypartition (Xcat); cotyp = Xcat_str.(Xname_cat{1}); %{'Pit latrine', 'Septic tank'}; % Indexes of variables used for mean function [~, imean] = ismember ({'NUsers', 'CoVol'}, Xname);

All the regression is performed on logarithmic transformed variables.
We take the negative logarithm of `SEmptyW`

to get the frequency.
After we are in the space where the regression will take place we normalize
the input variables to put them all in similar scales:

$$ y = -\log(Y) $$ $$ x_i = log (X_i) $$ $$ x_i = \frac{x_i - \bar{x}_i}{\sigma_{x_i}} $$

The GP structure is defined in the function
`inflowgp.m`

,
refer to it to know more details.

X_ = log10 (X); X_ = zscore (X_); assert (all(isfinite(X_))) if !exist('HYP', 'var') HYP = ARG = struct(); endif % Verbosity is true, define the variable verbose in the command line to override % Make sure verbose is false when generating a html report with publish if ~exist ('verbose', 'var') verbose = false; endif % Loop over origin categories and build a model for each for icat = 1:2 cat_name = cotyp{icat}; x = X_(idx_cat{icat},:); y = -log10 (Y(idx_cat{icat})); % -log Period = log Freq if !isfield (HYP, cat_name) hyp = []; else hyp = HYP.(cat_name); endif % Choose hyper-parameter constraints for each category % log of the error bounds: 1/7-100 week Ferror = sort (log (abs ([-log10(1/7) -log10(100)]))); switch cat_name case 'Pit latrine' maxcov = log (0.068); % Max correction should be about 10% of mean case 'Septic tank' maxcov = log (0.14); % Max correction should be about 10% of mean endswitch [hyp args] = inflowgp (x, y, imean, hyp, Ferror, maxcov, verbose); % Store results for further plotting y_data.(cat_name) = y; HYP.(cat_name) = hyp; ARG.(cat_name) = args; XX.(cat_name) = X(idx_cat{icat},:); YY.(cat_name) = 1./Y(idx_cat{icat},:); endfor % over categories

The coefficient of variation is computed as the ratio between the predictive standard deviation and the predictive mean.

$$ c(\vec{x}) = \frac{\sigma_y(\vec{x})}{\bar{y}(\vec{x})} $$

It is used to quantify the amount of correction.

Since for emptying frequency we have a prior model, the correction was constrained to produce a maximum coefficient of variation of about 10%.

for icat = 1:2 cat_name = cotyp{icat}; printf ('\n-- %s --\n', cat_name); report_gp (HYP.(cat_name), ARG.(cat_name), @(x)10.^(x)); endfor

-- Pit latrine -- ** Reports of results Negative log marginal likelihod: 102.90 Mean function parameters 0.20 0.04 -1.61 Min of inputs -1.77 -2.22 Max of inputs 2.10 1.66 Bounds mean fun: -1.95 -1.17 Covariance amplitude: 0.00 Bounds cov fun: -0.12 0.09 Bounds coeff variation (%): 0.30 10.64 t-distribution: 3.06 0.15 Deviations: 0.26 1.81 Corr coeff: 0.54 -- Septic tank -- ** Reports of results Negative log marginal likelihod: 109.43 Mean function parameters 0.38 -0.12 -1.73 Min of inputs -2.70 -1.38 Max of inputs 1.89 3.56 Bounds mean fun: -2.66 -1.04 Covariance amplitude: 0.02 Bounds cov fun: -0.15 0.16 Bounds coeff variation (%): 0.00 10.23 t-distribution: 3.04 0.18 Deviations: 0.32 2.09 Corr coeff: 0.57

These plots illustrate the performance of the model

yname = 'Frequency [1/week]'; plotresults_gp (1, HYP, ARG, XX, YY, Xname, {'log10', yname}, @(x)10.^(x)); for fig =3:4 h = get(figure(fig), 'children'); for i=1:length(h) axes(h(i)); set (h(i), 'yscale', 'log', 'ygrid', 'on'); set (h(i), 'xscale', 'log', 'xgrid', 'on'); axis tight xticks ([]); xticks ("auto"); % force recalculation of ticks yticks ([]); yticks ("auto"); % force recalculation of ticks drawnow endfor endfor

This plot shows the relative weight of each variable in the mean function and the relevance in the covariance function.

plothypARD (7, HYP.(cotyp{1}), Xname, imean); subplot (2,1,1); title (cotyp{1}) plothypARD (8, HYP.(cotyp{2}), Xname, imean); subplot (2,1,1); title (cotyp{2})