# Copyright (C) 2018 Juan Pablo Carbajal # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see <http://www.gnu.org/licenses/>. # Author: Juan Pablo Carbajal <ajuanpi+dev@gmail.com> # Created: 2018-01-09 pkg load gpml

We load continuous site variables to build a model for TS.
We also load the categorical variable container type to segment the dataset.
The latter is then removed from the input variables `Xname`

.
`Yname`

contains the output variable to be predicted.

Xname = {'NUsers','CoVol', 'SEmptyW', 'IC', 'CoTyp'}; Yname = 'TS'; [X Y isXcat Xcat_str] = dataset (Xname, Yname, 'Kampala'); Xcat = X(:, isXcat); ncat = size (Xcat, 2); X = X(:,!isXcat); Xname_cat = Xname(isXcat); Xname = Xname(!isXcat); % Indexes to partition output using the categorical variable idx_cat = categorypartition (Xcat); cotyp = Xcat_str.(Xname_cat{1}); %{'Pit latrine', 'Septic tank'}; ICOV = IMEAN = struct(); IMEAN.(cotyp{1}) = setdiff (1:length(Xname), []); IMEAN.(cotyp{2}) = setdiff (1:length(Xname), []); ICOV.(cotyp{1}) = setdiff (1:length(Xname), []); ICOV.(cotyp{2}) = setdiff (1:length(Xname), []);

if !exist('HYP', 'var') HYP = ARG = struct(); endif % Verbosity is true, define the variable verbose in the command line to override % Make sure verbose is false when generating a html report with publish if ~exist ('verbose', 'var') verbose = false; endif % Loop over origin categories and build a model for each for icat = 1:2 cat_name = cotyp{icat}; % We use the data directly, without non-linear mappings. % We only center to the median and rescale with the mean absolute deviation % from the median % % $$ x_i = \frac{X_i - \tilde{X}_i}{\alpha_{X_i}} $$ % % The GP structure is defined in the function % <https://bitbucket.org/KaKiLa/fsludge_pub/src/tip/mfiles/TSgp.m _TSgp.m_>, % refer to it to know more details. x = X(idx_cat{icat},:); x = ( x - median (x) ); x = x ./ mean (abs (x)); assert (all (isfinite (x))) y = Y(idx_cat{icat}); if !isfield (HYP, cat_name) hyp = []; else hyp = HYP.(cat_name); endif % Choose hyper-parameter constraints for each category TSerror = [log(1) log(10)]; % log of the error bounds, here 1-10 mg/L switch cat_name case 'Pit latrine' maxcov = log (1.5e1); case 'Septic tank' maxcov = log (1e1); endswitch ifun = {IMEAN.(cat_name), ICOV.(cat_name)}; [hyp args] = TSgp (x, y, ifun, hyp, TSerror, maxcov, verbose); % Store results for further plotting y_data.(cat_name) = y; HYP.(cat_name) = hyp; ARG.(cat_name) = args; XX.(cat_name) = X(idx_cat{icat},:); YY.(cat_name) = Y(idx_cat{icat},:); endfor % over categories

The coefficient of variation is computed as the ratio between the predictive standard deviation and the predictive mean.

$$ c(\vec{x}) = \frac{\sigma_y(\vec{x})}{\bar{y}(\vec{x})} $$

It is used to quantify the amount of correction.

Since we do not have a prior model for `TS`

, this correction indicates
nonlinear terms that could be used to improve prediction.

for icat = 1:2 cat_name = cotyp{icat}; printf ('\n-- %s --\n', cat_name); report_gp (HYP.(cat_name), ARG.(cat_name)); endfor

-- Pit latrine -- ** Reports of results Negative log marginal likelihod: 250.05 Mean function parameters 0.31 -4.35 6.21 0.43 31.00 Min of inputs -0.99 -0.69 -0.75 -3.38 Max of inputs 7.61 11.47 8.29 3.38 Bounds mean fun: -21.69 83.64 Covariance amplitude: 224.50 Bounds cov fun: -24.37 66.23 Bounds coeff variation (%): 1.59 192.48 Deviations: 12.51 12.51 Corr coeff: 0.98 -- Septic tank -- ** Reports of results Negative log marginal likelihod: 164.58 Mean function parameters -1.60 -1.00 4.93 -1.25 16.18 Min of inputs -0.84 -0.68 -0.66 -2.54 Max of inputs 4.18 12.60 5.02 0.00 Bounds mean fun: -2.16 40.86 Covariance amplitude: 104.07 Bounds cov fun: -20.11 50.07 Bounds coeff variation (%): 8.22 178.97 Deviations: 12.16 12.16 Corr coeff: 0.96

These plots illustrate the performance of the model

yname = 'TS [mg/L]'; plotresults_gp (1, HYP, ARG, XX, YY, Xname, {'', yname});

This plot shows the relative weight of each variable in the mean function and the relevance in the covariance function.

plothypARD (7, HYP.(cotyp{1}), Xname, {IMEAN.(cotyp{1}), ICOV.(cotyp{1})}); subplot (2,1,1); title (cotyp{1}) plothypARD (8, HYP.(cotyp{2}), Xname, {IMEAN.(cotyp{2}), ICOV.(cotyp{2})}); subplot (2,1,1); title (cotyp{2})