# Copyright (C) 2018 Juan Pablo Carbajal
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

# Author: Juan Pablo Carbajal <ajuanpi+dev@gmail.com>
# Created: 2018-01-09

pkg load gpml

We load continuous site variables to build a model for the inverse of the
emptying period `SludgeAge`.
`Xname` contains the input variables.
`Yname` contains the output variable to be predicted.

`CoAge` is ignored; otherwise we would have to drop most of the data,
because for many entries `CoAge` is equal to `SludgeAge`.

Xname = {'NUsers', 'CoVol', 'TrVol', 'WaterV', 'Vpumped'};
Yname = 'SludgeAge';
[X Y] = dataset (Xname, Yname, 'Hanoi');
Y *= 52.1429; % convert years to weeks

All the regression is performed on log-transformed variables.
We take the negative logarithm of `SludgeAge` to get the logarithm of the frequency.
Since some of the input variables contain zeros, we add one to those variables
before taking the logarithm.
Once we are in the space where the regression takes place, we normalize
the input variables to put them on similar scales, centering each one on its
median and dividing by its mean absolute deviation from the median:

$$ y = -\log_{10}(Y) $$
$$ x_i = \log_{10}(X_i) $$
$$ x_i = \frac{x_i - \tilde{x}_i}{\alpha_{x_i}} $$
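The transformations above can be tried on made-up data. The sketch below uses a hypothetical matrix `Xtoy` (not part of the Hanoi dataset) and assumes, as the script does, that $\tilde{x}_i$ is the column median and $\alpha_{x_i}$ is the mean absolute value of the centered column. It checks that each normalized column ends up with zero median and unit mean absolute value.

```octave
% Hypothetical toy inputs: 5 sites, 2 variables; the second column contains a zero
Xtoy = [ 2  0;
         4 10;
         8  3;
        16  1;
        32  7];

% Plain log10 for the strictly positive column,
% log10 (x + 1) for the column that contains zeros
xtoy      = zeros (size (Xtoy));
xtoy(:,1) = log10 (Xtoy(:,1));
xtoy(:,2) = log10 (Xtoy(:,2) + 1);

% Center each column on its median, then scale by its mean absolute value
xtoy = xtoy - median (xtoy);
xtoy = xtoy ./ mean (abs (xtoy));

% Each column should now have zero median and unit mean absolute value
col_median  = median (xtoy);
col_meanabs = mean (abs (xtoy));
printf ('median:   %s\n', mat2str (col_median, 3));
printf ('mean abs: %s\n', mat2str (col_meanabs, 3));
```

Using the median and the mean absolute deviation instead of the mean and standard deviation makes the scaling less sensitive to a few extreme sites.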

y = -log10 (Y); % -log Period = log Freq
[~, imean] = ismember ({'NUsers', 'CoVol'}, Xname);
iother = setdiff (1:numel(Xname), imean);
x = X;
x(:,imean) = log10 (x(:,imean));
x(:,iother) = log10 (x(:,iother)+1); % data has zeros
x = x - median (x);
x = x ./ mean (abs (x));
assert (all (isfinite (x)))

if !exist ('hyp', 'var')
  hyp = [];
endif

% Verbosity is off by default; define the variable verbose in the command line to override
% Make sure verbose is false when generating a html report with publish
if !exist ('verbose', 'var')
  verbose = false;
endif

% log of the error bounds: 1/7-100 week
Ferror = sort (log (abs ([-log10(1/7) -log10(100)])));
maxcov = log (0.075); % Max correction should be about 10% of mean
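The expression for `Ferror` is dense, so the following sketch evaluates it step by step. It only uses the numbers that appear in the code above; the intermediate variable names are introduced here for illustration, and how `inflowgp` interprets the result is not covered.

```octave
% Error bounds on the period, in weeks: 1/7 week (one day) and 100 weeks
period_bounds = [1/7 100];

% Negative log10 of the period is the log10 of the frequency
logfreq = -log10 (period_bounds);

% Magnitudes in log10 units, then natural log, sorted in increasing order
Ferror_steps = sort (log (abs (logfreq)));

% The one-line expression used in the script gives the same result
Ferror_script = sort (log (abs ([-log10(1/7) -log10(100)])));

printf ('log10 frequency: %s\n', mat2str (logfreq, 4));
printf ('Ferror:          %s\n', mat2str (Ferror_steps, 4));
```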

The GP structure is defined in the function `inflowgp.m`;
refer to it for more details.

[hyp args] = inflowgp (x, y, imean, hyp, Ferror, maxcov, verbose);

The coefficient of variation is computed as the ratio between the predictive standard deviation and the predictive mean.

$$ c(\vec{x}) = \frac{\sigma_y(\vec{x})}{\bar{y}(\vec{x})} $$

It is used to quantify the amount of correction.

Since we have a prior model for the emptying frequency, the correction was constrained to produce a maximum coefficient of variation of about 10%.
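As a minimal sketch of the formula above, the coefficient of variation can be computed element-wise from a predictive mean and a predictive standard deviation. The numbers below are made up for illustration and are not output of the model; `ymu` and `ys` are hypothetical names, and taking the magnitude of the mean is an assumption of this sketch.

```octave
% Hypothetical predictive mean and standard deviation at three query points
ymu = [-2.70 -2.60 -2.55];
ys  = [ 0.05  0.13  0.30];

% Coefficient of variation: predictive std over predictive mean.
% The mean is negative here (log10 of a frequency below 1), so this sketch
% takes its magnitude to report a positive ratio (an assumption).
c = ys ./ abs (ymu);

% Flag points where the correction exceeds roughly 10% of the mean
exceeds = c > 0.10;
printf ('c (%%): %s\n', mat2str (100 * c, 3));
printf ('exceeds 10%%: %s\n', mat2str (exceeds));
```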

report_gp (hyp, args, @(x)10.^x);

** Reports of results
Negative log marginal likelihod: 27.28
Mean function parameters -0.05 0.01 -2.66
Min of inputs -2.99 -2.24
Max of inputs 2.99 4.48
Bounds mean fun: -2.77 -2.51
Covariance amplitude: 0.01
Bounds cov fun: -0.24 0.21
Bounds coeff variation (%): 0.34 9.66
t-distribution: 3.06 0.05
Deviations: 0.10 1.25
Corr coeff: 0.76

These plots illustrate the performance of the model.

yname = 'Frequency [1/week]';
xname = {'# users', 'Containment V.', 'Truck V.', ...
         'Water added V.', 'Sludge emptied V.'};
plotresults_gp (1, hyp, args, X, 1./Y, xname, {'log10', yname}, @(x)10.^(x));
figure (2); title ('Septic tank');

This plot shows the relative weight of each variable in the mean function and the relevance in the covariance function.

plothypARD (4, hyp, xname, imean);