```
# Copyright (C) 2018 Juan Pablo Carbajal
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

# Author: Juan Pablo Carbajal <ajuanpi+dev@gmail.com>
# Created: 2018-11-25
```

The goal of this script is to show the effect of constraining a nonlinear correction to a linear model. The linear model has the following structure:

$$ y_l = y_o + w_1 x_1 + w_2 x_2 + \xi$$

where $x_1$ and $x_2$ are the logarithms of the number of users (`NUsers`) and of the
container volume (`CoVol`), respectively.
The output variable $y_l$ is the approximation of the logarithm of the observed
container emptying frequency (the inverse of `SEmptyW`), given only by the linear model.
The last term $\xi$ is an independent t-distributed noise term with zero mean.
The variance of the noise term is subject to optimization during the regression.
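As a side note, the variance of a zero-mean Student-t noise term with $\nu$ degrees of freedom and scale $\sigma$ is $\nu \sigma^2 / (\nu - 2)$, which is the expression the script uses later to report the noise variance. A minimal sketch in Python (illustrative only; `t_noise_variance` is a hypothetical helper, not part of the original Octave script):

```python
def t_noise_variance(nu, sigma):
    # Variance of a Student-t variable with nu degrees of freedom and
    # scale sigma: nu * sigma^2 / (nu - 2). Finite only for nu > 2,
    # which is why the regression bounds nu during optimization.
    if nu <= 2:
        raise ValueError("variance is undefined for nu <= 2")
    return nu * sigma**2 / (nu - 2)

print(t_noise_variance(5, 0.2))  # 5 * 0.04 / 3, about 0.0667
```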

This model does not capture all the variability in the data set, hence one can propose a nonparametric nonlinear correction term:

$$ \hat{y} = y_l + \epsilon f(x_1, \ldots, x_5) $$

The correction depends on the two variables already mentioned and on three additional ones. This extra term reduces the noise term, i.e. it yields a smaller variance. However, if the intensity of the correction ($\epsilon$) is not constrained, this new model can reduce the noise variance to zero, i.e. it can interpolate the given data. Therefore, to avoid this unrealistic overfitting, two measures can be taken:

- Constrain the variance of the noise term from below.
- Constrain the amplitude of the correction ($\epsilon$) from above.

The first measure makes sense when we have an estimate of the observation noise. The second measure is operational: it allows us to learn which variables would be useful to reduce the noise variance, i.e. to improve the goodness of the fit.

In the following study we use both measures. The bound on the noise variance is estimated from the data and is fixed. The amplitude of the correction is set to different values, and we identify the variables that the correction uses to improve the regression.

```
pkg load gpml
```

The data we will use is the subset corresponding to septic tanks in Kampala.
We use `CoTyp` to select this subset.

```
Xname = {'NUsers', 'CoVol', 'CoAge', 'TrVol', 'IC', 'CoTyp'};
Yname = 'SEmptyW';
[X Y isXcat Xcat_str] = dataset (Xname, Yname, 'Kampala');

Xcat = X(:, isXcat);
ncat = size (Xcat, 2);
X    = X(:, !isXcat);

Xname_cat = Xname(isXcat);
Xname     = Xname(!isXcat);

% Indexes to partition output using the categorical variable
idx_cat = categorypartition (Xcat);
```

Select the second category, corresponding to `'Septic tank'`.

```
X = log10 (X(idx_cat{2},:));
[X Xmean Xstd] = zscore (X);
Y = -log10 (Y(idx_cat{2})); % -log Period = log Freq

% Indexes of variables used for mean function
[~, imean] = ismember ({'NUsers', 'CoVol'}, Xname);
```
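For readers unfamiliar with Octave's `zscore`, the preprocessing above (a log10 transform followed by column-wise standardization to zero mean and unit standard deviation) can be sketched in Python/NumPy. This snippet uses made-up data and a hypothetical helper name; it is not part of the original script.

```python
import numpy as np

def log_zscore(X):
    # Log10-transform, then standardize each column, mimicking
    # Octave's zscore(log10(X)).
    L = np.log10(X)
    mu = L.mean(axis=0)
    sd = L.std(axis=0, ddof=1)  # sample standard deviation
    return (L - mu) / sd, mu, sd

X = np.array([[10.0, 2.0], [100.0, 4.0], [1000.0, 8.0]])
Z, mu, sd = log_zscore(X)
print(Z[:, 0])  # [-1, 0, 1]: each decade becomes one unit after scaling
```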

First we specify the regressor structure, i.e. linear model plus a nonparametric correction term.

```
linear_mean = {@meanSum, {@meanLinear, @meanConst}};
meanf = {@meanMask, imean, linear_mean};
covf  = {@covMaternard, 1};
likf  = {@likT};

N     = length (Y);   % number of samples
numax = log (N - 1);  % log of the maximum nu parameter for t-distribution
dimX  = columns (X);  % dimension of input

hyp0.lik  = [numax; 0];                      % t-distributed noise
hyp0.mean = [X(:,imean) ones(N,1)] \ Y;      % mean function == linear regression
hyp0.cov  = zeros (dimX + 1, 1);             % ARD lengths and variance set to 1
```
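The initial guess for the mean-function weights, `hyp0.mean = [X(:,imean) ones(N,1)] \ Y`, is an ordinary least-squares fit of the linear model. An equivalent computation in Python/NumPy, on synthetic noiseless data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))              # two predictors, playing the role of x1, x2
Y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + 2.0   # noiseless linear data
A = np.column_stack([X, np.ones(len(Y))]) # design matrix [X, 1], as in the Octave code
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(coef)  # approximately [1.5, -0.7, 2.0]: weights w1, w2 and offset y_o
```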

The noise in the measured emptying period is bounded between 1 day and 100 weeks.

```
Ferror = sort (log (abs ([-log10(1/7) -log10(100)])));

% The nu parameter of the t-distribution is constrained to avoid
% amplification of the variance.
prior.lik      = {{@priorSmoothBox1, log(3-1), numax, 40}, {}};
prior.lik{end} = {@priorSmoothBox1, Ferror(1)-numax, Ferror(2)-numax, 40};

% The covariance intensity is set to desired value
prior.cov      = cell (dimX + 1, 1);
prior.cov{end} = {@priorDelta, 0};

% Inference of hyper-parameters is done using the Variational Bayes method
infe = {@infPrior, @infVB, prior};

% Group all arguments
args = {infe, meanf, covf, likf, X, Y};
```
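The two bounds in `Ferror` come from mapping emptying-period errors of 1 day (1/7 of a week) and of 100 weeks onto the $-\log_{10}$ scale of the output. An illustrative check in Python (not part of the original script):

```python
import numpy as np

# Error magnitudes of the output y = -log10(period in weeks) for a
# period of 1/7 week (1 day) and of 100 weeks.
errors = np.array([-np.log10(1 / 7), -np.log10(100)])  # [log10(7), -2]
Ferror = np.sort(np.log(np.abs(errors)))
print(Ferror)  # approximately [-0.17, 0.69]
```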

Define the vector of values for the correction amplitude. A value of 0.14 gives a maximum correction that is about 10% of the value of the linear model, i.e. a coefficient of variation of ~10%.

```
nC = 20;
MaxC = sort ([linspace(5e-2, 0.3, nC).'; 0.14]);
nC = length (MaxC);
```

Now we regress the data with each value of the correction amplitude.

```
% Default verbosity is false; define the variable verbose in the command line to override.
% Make sure verbose is false when generating a html report with publish.
if ~exist ('verbose', 'var')
  verbose = false;
endif

if !exist ('HYP', 'var')
  % Pre-allocate
  HYP = ARG = cell (nC, 1);
  SStot = sumsq (Y - mean (Y)); % prop. to variance, used in goodness of fit
  Rsq = zeros (nC, 2);

  for i = nC:-1:1
    maxcov = log (MaxC(i));

    % set prior: parameter for priorDelta in covariance amplitude
    args{1}{3}.cov{end}{2} = maxcov;

    % set hyp0 as initial guess, and covariance amplitude
    hyp = hyp0;
    hyp.cov(end) = maxcov;

    hyp = train_gp (args, hyp, [], verbose);
    [Yhat dYhat2 mf df2 lp post] = gp (hyp, args{:}, X);

    printf ('--\nCovariance amplitude: %.2f\n', MaxC(i));

    % Coefficient of variation
    yl = feval (meanf{:}, hyp.mean, X);             % linear model alone
    cf = feval (covf{:}, hyp.cov, X) * post.alpha;  % correction function
    [mincv maxcv] = bounds (abs (cf ./ yl));
    printf ('Bounds coeff variation (%%): %.1f %.1f\n', mincv * 100, maxcv * 100);

    % Variance of the noise
    nu    = exp (hyp.lik(1)) + 1;
    sn    = exp (hyp.lik(2));
    noise = nu * sn^2 / (nu - 2);
    printf ('Variance of noise: %.2f\n', noise);

    % Compute coefficient of determination
    Rsq(i, :) = 1 - [sumsq(Y - Yhat) sum(dYhat2)] ./ SStot;
    printf ('Goodness of fit: %.2f\nVariance ratio: %.2f\n', Rsq(i,:));

    % Likelihood of model given data
    printf ('Negative log marginal likelihood: %.2f\n', gp (hyp, args{:}));

    % Store results for further plotting
    HYP{i} = hyp;
    ARG{i} = args;
  endfor % over amplitude values
endif
```

The loop prints the following summary, one row per amplitude value:

| Covariance amplitude | Coeff. variation min (%) | Coeff. variation max (%) | Variance of noise | Goodness of fit | Variance ratio | Neg. log marginal likelihood |
|---|---|---|---|---|---|---|
| 0.30 | 2.7 | 96.7 | 0.00 | 1.00 | 0.99 | 70.01 |
| 0.29 | 2.7 | 96.4 | 0.00 | 1.00 | 0.99 | 75.57 |
| 0.27 | 2.7 | 96.0 | 0.00 | 1.00 | 0.99 | 82.05 |
| 0.26 | 0.6 | 38.6 | 0.01 | 0.70 | 0.96 | 84.65 |
| 0.25 | 0.7 | 38.1 | 0.01 | 0.67 | 0.96 | 87.90 |
| 0.23 | 0.2 | 41.5 | 0.01 | 0.66 | 0.96 | 87.20 |
| 0.22 | 0.1 | 40.6 | 0.01 | 0.64 | 0.96 | 91.36 |
| 0.21 | 1.1 | 39.2 | 0.01 | 0.62 | 0.96 | 96.33 |
| 0.19 | 0.4 | 22.3 | 0.04 | 0.52 | 0.89 | 101.27 |
| 0.18 | 0.3 | 17.7 | 0.06 | 0.47 | 0.86 | 102.52 |
| 0.17 | 0.2 | 15.2 | 0.07 | 0.43 | 0.84 | 103.45 |
| 0.16 | 0.1 | 13.1 | 0.07 | 0.39 | 0.83 | 104.19 |
| 0.14 | 0.0 | 11.2 | 0.08 | 0.36 | 0.82 | 104.79 |
| 0.14 | 0.1 | 10.9 | 0.08 | 0.35 | 0.82 | 104.87 |
| 0.13 | 0.2 | 9.5 | 0.09 | 0.33 | 0.81 | 105.27 |
| 0.12 | 0.3 | 7.8 | 0.09 | 0.30 | 0.81 | 105.69 |
| 0.10 | 0.4 | 6.2 | 0.10 | 0.28 | 0.80 | 106.02 |
| 0.09 | 0.4 | 4.8 | 0.10 | 0.25 | 0.80 | 106.30 |
| 0.08 | 0.3 | 3.5 | 0.10 | 0.23 | 0.80 | 106.52 |
| 0.06 | 0.2 | 2.4 | 0.10 | 0.22 | 0.80 | 106.70 |
| 0.05 | 0.1 | 1.5 | 0.10 | 0.20 | 0.80 | 106.84 |
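The goodness-of-fit values printed above are coefficients of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, matching the `Rsq` computation in the loop. A minimal Python sketch of that formula (illustrative, not from the original script):

```python
import numpy as np

def goodness_of_fit(Y, Yhat):
    # Coefficient of determination R^2 = 1 - SS_res / SS_tot.
    ss_res = np.sum((Y - Yhat) ** 2)
    ss_tot = np.sum((Y - np.mean(Y)) ** 2)
    return 1.0 - ss_res / ss_tot

Y = np.array([1.0, 2.0, 3.0, 4.0])
print(goodness_of_fit(Y, Y))                     # 1.0: perfect fit
print(goodness_of_fit(Y, np.full(4, Y.mean())))  # 0.0: no better than the mean
```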

This plot shows the performance of the model for different values of the correction amplitude. As mentioned before, the bigger the amplitude, the higher the goodness of fit (likely due to overfitting).

```
figure (1); clf
plot (MaxC, Rsq(:,1), '-o');
xlabel ('Covariance amplitude')
ylabel ('Goodness of fit')
ylim ([min(Rsq(:,1)), 1])
xlim (MaxC([1 end]))
grid on
```

The following plot shows the relative relevance of each variable in the correction term. These weights change rapidly, and the results are not very stable for lower values of the amplitude. This might be due to the small size of the dataset.

```
RelC = cell2mat (cellfun (@(x) exp (-x.cov(1:end-1)), HYP, 'unif', 0).');
RelC = RelC ./ sum (RelC);

figure (2); clf
plot (MaxC, RelC, '-o');
%plot (Rsq(:,1), RelC, '-o');
legend (Xname)
axis tight
xlabel ('Covariance amplitude')
%xlabel ('Goodness of fit')
ylabel ('Relative relevance c.f.')
```
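In the ARD covariance, the entries of `hyp.cov(1:end-1)` are log length scales; a variable with a shorter length scale influences the correction more strongly, so `exp(-log_lengthscale)` serves as a relevance score. The normalization above is equivalent to this Python sketch with hypothetical values:

```python
import numpy as np

log_ell = np.array([0.0, 1.0, 2.0])  # hypothetical log length scales
rel = np.exp(-log_ell)               # inverse length scales as relevance scores
rel = rel / rel.sum()                # normalize to relative relevance
print(rel)  # the smallest length scale gets the largest weight
```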

The following plot shows the relative weights of the linear model.

```
RelM = cell2mat (cellfun (@(x) x.mean(1:end-1), HYP, 'unif', 0).');
RelM = RelM ./ sum (abs (RelM));

figure (3); clf
plot (MaxC, RelM, '-o');
%plot (Rsq(:,1), RelM, '-o');
legend (Xname{imean})
line (xlim (), [0 0]); % horizontal line at zero
axis tight
xlabel ('Covariance amplitude')
%xlabel ('Goodness of fit')
ylabel ('Relative relevance m.f.')
```