# Copyright (C) 2018 Juan Pablo Carbajal # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see <http://www.gnu.org/licenses/>. # Author: Juan Pablo Carbajal <ajuanpi+dev@gmail.com> # Created: 2018-01-09 pkg load gpml

We load continuous site variables to build a model for TS.
We also load the categorical variable container type to segment the dataset.
The latter is then removed from the input variables *Xname*.
*Yname* contains the output variable to be predicted.

Xname = {'NUsers','CoVol', 'SEmptyW', 'IC', 'CoTyp'}; Yname = 'TS'; [X Y isXcat Xcat_str] = dataset (Xname, Yname, 'Kampala'); Xcat = X(:, isXcat); ncat = size (Xcat, 2); X = X(:,!isXcat); Xname_cat = Xname(isXcat); Xname = Xname(!isXcat); % Indexes to partition output using the categorical variable idx_cat = categorypartition (Xcat); cotyp = Xcat_str.(Xname_cat{1}); %{'Pit latrine', 'Septic tank'}; ICOV = IMEAN = struct(); IMEAN.(cotyp{1}) = setdiff (1:length(Xname), []); IMEAN.(cotyp{2}) = setdiff (1:length(Xname), []); ICOV.(cotyp{1}) = setdiff (1:length(Xname), []); ICOV.(cotyp{2}) = setdiff (1:length(Xname), []);

We use the data directly, without non-linear mappings. We only re-scale the inputs such that they scales are comparable.

$$ x_i = \frac{X_i - \bar{X}_i}{\sigma_{X_i}} $$

The GP structure is defined in the function
*TSgp.m*,
refer to it to know more details.

X_ = zscore (X); [N dim] = size (X_); assert (all(isfinite(X_))) if !exist('HYP', 'var') HYP = ARG = struct(); endif % Verbosity is true, define the variable verbose in the command line to override % Make sure verbose is false when generating a html report with publish if ~exist ('verbose', 'var') verbose = false; endif % Loop over origin categories and build a model for each for icat = 1:2 cat_name = cotyp{icat}; x = X_(idx_cat{icat},:); y = Y(idx_cat{icat}); if !isfield (HYP, cat_name) hyp = []; else hyp = HYP.(cat_name); endif % Choose hyper-parameter constraints for each category TSerror = [log(1) log(10)]; % log of the error bounds, here 1-10 mg/L switch cat_name case 'Pit latrine' maxcov = log (1.5e1); case 'Septic tank' maxcov = log (1e1); endswitch ifun = {IMEAN.(cat_name), ICOV.(cat_name)}; [hyp args] = TSgp (x, y, ifun, hyp, TSerror, maxcov, verbose); % Store results for further plotting y_data.(cat_name) = y; HYP.(cat_name) = hyp; ARG.(cat_name) = args; XX.(cat_name) = X(idx_cat{icat},:); YY.(cat_name) = Y(idx_cat{icat},:); endfor % over categories

The coefficient of variation is computed as the ratio between the predictive standard deviation and the predictive mean.

$$ c(\vec{x}) = \frac{\sigma_y(\vec{x})}{\bar{y}(\vec{x})} $$

It is used to quantify the amount of correction.

Since we do not have a prior model for *TS*, this correction indicates
nonlinear terms that could be used to improve prediction.

for icat = 1:2 cat_name = cotyp{icat}; printf ('\n-- %s --\n', cat_name); report_gp (HYP.(cat_name), ARG.(cat_name)); endfor

-- Pit latrine -- ** Reports of results Negative log marginal likelihod: 250.03 Mean function parameters 0.40 -13.56 22.35 1.48 37.06 Min of inputs -0.97 -0.56 -0.65 -1.43 Max of inputs 5.15 3.35 1.86 0.66 Bounds mean fun: -21.75 83.76 Covariance amplitude: 224.51 Bounds cov fun: -24.41 66.15 Bounds coeff variation (%): 1.37 191.77 Deviations: 12.51 12.51 -- Septic tank -- ** Reports of results Negative log marginal likelihod: 164.58 Mean function parameters -2.61 -1.68 5.18 -1.51 17.25 Min of inputs -1.10 -0.54 -0.68 -0.39 Max of inputs 1.96 7.38 4.73 1.71 Bounds mean fun: -2.16 40.86 Covariance amplitude: 104.07 Bounds cov fun: -20.11 50.07 Bounds coeff variation (%): 8.22 178.97 Deviations: 12.16 12.16

These plots illustrate the performance of the model

yname = 'TS [mg/L]'; plotresults_gp (1, HYP, ARG, XX, YY, Xname, {'', yname});

This plot shows the relative weight of each variable in the mean function and the relevance in the covariance function.

plothypARD (7, HYP.(cotyp{1}), Xname, {IMEAN.(cotyp{1}), ICOV.(cotyp{1})}); subplot (2,1,1); title (cotyp{1}) plothypARD (8, HYP.(cotyp{2}), Xname, {IMEAN.(cotyp{2}), ICOV.(cotyp{2})}); subplot (2,1,1); title (cotyp{2})