```octave
# Copyright (C) 2018 Juan Pablo Carbajal
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

# Author: Juan Pablo Carbajal <ajuanpi+dev@gmail.com>
# Created: 2018-01-09
```

```octave
pkg load gpml
```

*Xname* contains the variables used as predictors;
*Yname* contains the output variable.

```octave
Xname = {'NUsers', 'CoVol', 'SludgeAge', 'Vpumped'};
Yname = 'TS';
[X Y] = dataset (Xname, Yname, 'Hanoi');
```

We use the data directly, without non-linear mappings. We only re-scale the inputs so that their scales are comparable.

$$ x_i = \frac{X_i - \bar{X}_i}{\sigma_{X_i}} $$

```octave
y = Y;
x = X;
x = zscore (x);
assert (all (isfinite (x)))
```
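For clarity, a minimal sketch of what the standardization above does, written out term by term as in the formula; it assumes the statistics package's `zscore` and uses made-up example data:

```octave
% Hypothetical example data, two columns with very different scales
Xe = [1 10; 2 20; 3 30; 4 40];

% Explicit standardization: subtract the column mean, divide by the
% column standard deviation (sample std, N-1 normalization)
xe = (Xe - mean (Xe)) ./ std (Xe);

% This should match zscore with its default options
assert (xe, zscore (Xe), 1e-12)
```

After standardization every column has zero mean and unit standard deviation, so no input dominates the covariance function merely because of its units.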

Variables that are not useful for prediction are removed from the
mean and covariance functions, respectively.
*imean* and *icov* are used as selection masks. The indices inside these variables
refer to positions in *Xname*.

```octave
imean = setdiff (1:length (Xname), 4); % remove Vpumped from mean
icov  = setdiff (1:length (Xname), 2); % remove CoVol from covariance
N     = length (y);
```

Since we do not have a prior model structure, the hyper-parameters are difficult to constrain. To explore the hyper-parameters we split the data into training and test sets. Once good parameters are found, we use the whole dataset for training. This hopefully increases the significance of the regressed parameters.

```octave
% Comment the next line and clear idx to get a test set.
idx = struct ('trn', 1:N, 'tst', []);
if !exist ('idx', 'var')
  N   = length (x);
  idx = struct ('trn', [], 'tst', []);
  [~, ~, idx.tst, idx.trn] = splitdata ([x y], 0.2);
  % add min and max values
  [~, imin] = min (y);
  [~, imax] = max (y);
  idx.tst = sort (setdiff (idx.tst, [imin imax]));
  idx.trn = sort ([idx.trn imin imax]);
endif

% Verbose output is off by default; define the variable verbose in the
% command line to override.
% Make sure verbose is false when generating a html report with publish.
if !exist ('verbose', 'var')
  verbose = false;
endif

if !exist ('hyp', 'var')
  hyp = [];
endif
```

The GP structure is defined in the function
*TSgp.m*; refer to it for more details.

```octave
% Train
[hyp args] = TSgp (x(idx.trn,:), y(idx.trn), {imean, icov}, hyp, ...
                   [log(1) log(10)], log(1e1), verbose);
% Test
[y_, dy2_, f_, df2_, lp, post] = gp (hyp, args{:}, x, y);
dy_ = 1.96 * sqrt (dy2_);
```

The coefficient of variation is computed as the ratio between the predictive standard deviation and the predictive mean.

$$ c(\vec{x}) = \frac{\sigma_y(\vec{x})}{\bar{y}(\vec{x})} $$

It is used to quantify the amount of correction.
Since we do not have a prior model for *TS*, this correction indicates
nonlinear terms that could be used to improve prediction.
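As a minimal sketch of the formula above, the coefficient of variation is just the element-wise ratio of predictive standard deviation to predictive mean. The values of `y_` and `dy2_` below are hypothetical stand-ins for the outputs of the `gp` call above:

```octave
% Hypothetical predictive mean and predictive variance at three inputs
y_   = [10.0; 20.0; 5.0];    % predictive mean (example values)
dy2_ = [ 1.0;  4.0; 9.0];    % predictive variance (example values)

% c(x) = sigma_y(x) / mean_y(x), element-wise
c = sqrt (dy2_) ./ abs (y_);

disp (100 * c')   % in percent: 10 10 60
```

Inputs where `c` is large are regions where the GP correction dominates the mean function, hinting at missing nonlinear structure there.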

```octave
report_gp (hyp, args);
```

```
** Reports of results
Negative log marginal likelihood: 177.29
Mean function parameters    -2.57 -9.47 4.97 30.63
Min of inputs               -2.01 -1.24 -1.83
Max of inputs                2.64  3.47  1.70
Bounds mean fun:            -6.78 49.70
Covariance amplitude:        90.26
Bounds cov fun:             -14.98 17.50
Bounds coeff variation (%):  0.89 105.68
Deviations:                  11.55 11.55
```

These plots illustrate the performance of the model.

```octave
yname = 'TS [mg/L]';
plotresults_gp (1, hyp, args, X(idx.trn,:), Y(idx.trn,:), Xname, {'', Yname});
figure (2)
ylim ([0 ylim()(2)]);
if !isempty (idx.tst)
  hold on
  errorbar (y(idx.tst), y_(idx.tst), dy_(idx.tst), '~or')
  hold off
  axis tight
endif
```

This plot shows the relative weight of each variable in the mean function and its relevance in the covariance function.

```octave
plothypARD (4, hyp, Xname, {imean, icov});
```