# Copyright (C) 2018 Juan Pablo Carbajal
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

# Author: Juan Pablo Carbajal <ajuanpi+dev@gmail.com>
# Created: 2018-01-09

pkg load gpml

We load continuous site variables to build a model for the inverse of the
emptying period for households in both Hanoi and Kampala.
`Xname` contains the input variables.
`Yname` contains the output variable to be predicted.

The model uses the input variables present in both cities.
`CoAge` is ignored; otherwise we would have to drop most of the Hanoi data,
because for many entries `CoAge` is equal to `SludgeAge`.

Xname = {'NUsers', 'CoVol', 'TrVol', 'OrCat'};
Yname = {'SEmptyW', 'SludgeAge'};
City  = {'Kampala', 'Hanoi'};

% Loop over cities
for c=1:2
  [tmpX tmpY isXcat Xcat_str] = dataset (Xname, Yname{c}, City{c});
  Xcat    = tmpX(:, isXcat);
  idx_cat = categorypartition (Xcat);

  # Select only Households
  [tf,i] = ismember ({'Household', 'Multiple Household'}, Xcat_str.OrCat);
  if length(i(tf)) > 1
    idx_household = cat (idx_cat(i(tf)){:});
  else
    idx_household = idx_cat{i(tf)};
  endif

  X{c} = tmpX(idx_household, !isXcat);
  Y{c} = tmpY(idx_household);
endfor
Xname = Xname(!isXcat);

Y{2} = Y{2} * 52.1429; # convert years to weeks in Hanoi data
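The factor 52.1429 used to convert the Hanoi data from years to weeks agrees with 365/7 to four decimals. The following is a minimal check of that arithmetic; the variable names and the sample ages are illustrative and do not come from the original script.

```octave
% Illustrative check of the years-to-weeks factor (names are hypothetical)
weeks_per_year = 365 / 7;             % days in a non-leap year over days per week
factor_used    = 52.1429;             % constant used in the script
printf ("365/7 = %.4f\n", weeks_per_year);

age_years = [0.5; 1; 2];              % made-up sludge ages in years
age_weeks = age_years * factor_used;  % same conversion as applied to Y{2}
disp (age_weeks.')
```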

Merge the data from both cities

Yhousehold = cell2mat (Y.');
Xhousehold = cell2mat (X.');
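As a sketch of what this merge does, the toy example below stacks two per-city blocks with different numbers of rows. The values are made up; only the use of `cell2mat` on the transposed cell mirrors the script.

```octave
% Toy data standing in for the per-city cells X and Y (values are made up)
X = {[1 10; 2 20], [3 30; 4 40; 5 50]};  % 1x2 cell: 2 and 3 rows, 2 columns each
Y = {[0.1; 0.2], [0.3; 0.4; 0.5]};       % matching column vectors

% Transposing the cell to 2x1 makes cell2mat stack the blocks vertically
Xall = cell2mat (X.');
Yall = cell2mat (Y.');

disp (size (Xall))
disp (size (Yall))
```

Without the transpose, `cell2mat` would try to place the blocks side by side, which fails when the cities have different numbers of rows.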

All regression is performed on log-transformed variables. We take the negative logarithm of the emptying period to get the logarithm of the frequency.

Once we are in the space where the regression takes place, we normalize the input variables to put them all on similar scales:

$$ y = -\log_{10}(Y) $$
$$ x_i = \log_{10}(X_i) $$
$$ x_i = \frac{x_i - \bar{x}_i}{\sigma_{x_i}} $$
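The transformation can be tried on a few made-up values. This sketch standardizes the columns by hand, which is assumed here to be equivalent to what `zscore` does by default; the data are hypothetical.

```octave
% Made-up positive data: emptying period Y (weeks) and two inputs X
Y = [10; 50; 200; 400];
X = [2 1; 4 2; 6 3; 8 10];

y = -log10 (Y);                  % negative log period = log frequency
x = log10 (X);
x = (x - mean (x)) ./ std (x);   % column-wise standardization

disp (mean (x))                  % close to 0 for each column
disp (std (x))                   % close to 1 for each column
```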

y = -log10 (Yhousehold); % -log Period = log Freq
x = log10 (Xhousehold);
x = zscore (x);
assert (all(isfinite(x)))
assert (all(isfinite(y)))

[~, imean] = ismember ({'NUsers', 'CoVol'}, Xname);

if !exist ('hyp', 'var')
  hyp = [];
endif

% Verbosity is off by default, define the variable verbose in the command line to override
% Make sure verbose is false when generating a html report with publish
if ~exist ('verbose', 'var')
  verbose = false;
endif

% log of the error bounds: 1/7-100 week
Ferror = sort (log (abs ([-log10(1/7) -log10(100)])));
maxcov = log (0.08); % Max correction should be about 10% of mean
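The two bound expressions are compact, so it can help to evaluate them step by step. The sketch below only reproduces the arithmetic; how `inflowgp` uses these values is not shown here, and the intermediate names `period_bounds` and `logfreq` are introduced for illustration.

```octave
% Reproduce the bounds computed in the script, step by step
period_bounds = [1/7 100];               % emptying period limits in weeks
logfreq       = -log10 (period_bounds);  % log10 frequency at those limits
Ferror        = sort (log (abs (logfreq)));
maxcov        = log (0.08);

printf ("log10 frequency at bounds: %.4f %.4f\n", logfreq);
printf ("Ferror: %.4f %.4f\n", Ferror);
printf ("maxcov: %.4f, exp(maxcov) = %.2f\n", maxcov, exp (maxcov));
```

Note that `abs` discards the sign of the log frequencies before the natural logarithm is taken, so `Ferror` holds the logarithms of their magnitudes.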

The GP structure is defined in the function `inflowgp.m`;
refer to it for more details.

[hyp args] = inflowgp (x, y, imean, hyp, Ferror, maxcov, verbose);

The coefficient of variation is computed as the ratio between the predictive standard deviation and the predictive mean.

$$ c(\vec{x}) = \frac{\sigma_y(\vec{x})}{\bar{y}(\vec{x})} $$

It is used to quantify the amount of correction.

Since for emptying frequency we have a prior model, the correction was constrained to produce a maximum coefficient of variation of about 10%.
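As a minimal sketch of this ratio, the example below computes the coefficient of variation for a few hypothetical predictive means and standard deviations and compares it with an assumed 10% ceiling. None of these numbers come from the fitted model.

```octave
% Hypothetical predictive mean and standard deviation at four inputs
ymean = [2.0; 1.5; 1.0; 2.5];
ysd   = [0.05; 0.12; 0.09; 0.10];

cv = ysd ./ ymean;          % c(x) = sigma_y(x) / ybar(x)
printf ("CV range (%%): %.2f %.2f\n", 100 * min (cv), 100 * max (cv));

cv_max = 0.10;              % assumed 10% ceiling, for illustration only
ok = all (cv <= cv_max);    % true when every point stays under the ceiling
```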

report_gp (hyp, args, @(x)10.^x);

** Reports of results
Negative log marginal likelihod: 196.46
Mean function parameters
 0.43 0.08 -1.98
Min of inputs
 -1.77 -1.72
Max of inputs
 2.43 3.18
Bounds mean fun: -2.85 -0.97
Covariance amplitude: 0.01
Bounds cov fun: -0.26 0.08
Bounds coeff variation (%): 0.30 11.37
t-distribution: 3.14 0.24
Deviations: 0.41 2.54
Corr coeff: 0.79
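The handle `@(x)10.^x` passed to `report_gp` maps values from the regression space back to frequencies. Assuming the "Bounds mean fun" values are in log10 frequency units (this is an assumption, the report does not state the units), they can be converted by hand as follows; the variable names are illustrative.

```octave
% Back-transform the reported mean-function bounds with the same map as above
backtransform = @(x) 10.^x;        % log10 frequency -> frequency [1/week]
logf_bounds   = [-2.85 -0.97];     % "Bounds mean fun" from the report
freq          = backtransform (logf_bounds);
period        = 1 ./ freq;         % emptying period in weeks

printf ("frequency [1/week]: %.4f %.4f\n", freq);
printf ("period [weeks]:     %.1f %.1f\n", period);
```

Under that assumption, the mean function spans emptying periods from roughly 9 weeks to roughly 700 weeks.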

These plots illustrate the performance of the model

yname = 'Frequency [1/week]';
plotresults_gp (2, hyp, args, Xhousehold, 1./Yhousehold, Xname, ...
                {'log10', yname}, @(x)10.^(x));
h = get(figure(4), 'children');
for i=1:length(h)
  axes(h(i));
  set (h(i), 'yscale', 'log', 'ygrid', 'on');
  set (h(i), 'xscale', 'log', 'xgrid', 'on');
  axis tight
  xticks ([]); xticks ("auto"); % force recalculation of ticks
  yticks ([]); yticks ("auto"); % force recalculation of ticks
  drawnow
endfor

warning: Non-positive limit for logarithmic axis ignored

This plot shows the relative weight of each variable in the mean function and the relevance in the covariance function.

plothypARD (6, hyp, Xname, imean);

Plot emptying frequency vs. each variable in the mean function independently

figure (5), clf
np = 15;
sp = {3:2:2*np, 4:2:2*np};
for i=1:2
  subplot(np,2,sp{i})
  h1 = loglog (X{1}(:,i), 1./Y{1}, 'o','markerfacecolor','auto');
  hold on
  h2 = loglog (X{2}(:,i), 1./Y{2}, 'x');
  axis tight
  grid on
  xlabel (Xname{i})
  if i == 1
    ylabel (yname);
  endif
  hold off
endfor
subplot(np,2,[1 2])
title('Households');
axis off
legend([h1,h2], City,'Location','North','Orientation','Horizontal');