Gaussian Process Regression (sklearn implementation)#

Gaussian Process Regression model (using sklearn).

Set mlr_model_type: gpr_sklearn in the recipe to use this MLR model.

Classes:

AdvancedGaussianProcessRegressor([kernel, ...])

Expand sklearn.gaussian_process.GaussianProcessRegressor.

SklearnGPRModel(input_datasets, **kwargs)

Gaussian Process Regression model (sklearn implementation).

class esmvaltool.diag_scripts.mlr.models.gpr_sklearn.AdvancedGaussianProcessRegressor(kernel=None, *, alpha=1e-10, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, normalize_y=False, copy_X_train=True, n_targets=None, random_state=None)[source]#

Bases: GaussianProcessRegressor

Expand sklearn.gaussian_process.GaussianProcessRegressor.

Methods:

fit(X, y)

Fit Gaussian process regression model.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

log_marginal_likelihood([theta, ...])

Return log-marginal likelihood of theta for training data.

predict(x_data[, return_var, return_cov])

Expand predict() to accept return_var.

sample_y(X[, n_samples, random_state])

Draw samples from Gaussian process and evaluate at X.

score(X, y[, sample_weight])

Return the coefficient of determination of the prediction.

set_params(**params)

Set the parameters of this estimator.

set_predict_request(*[, return_cov, ...])

Request metadata passed to the predict method.

set_score_request(*[, sample_weight])

Request metadata passed to the score method.

fit(X, y)#

Fit Gaussian process regression model.

Parameters:
  • X (array-like of shape (n_samples, n_features) or list of object) – Feature vectors or other representations of training data.

  • y (array-like of shape (n_samples,) or (n_samples, n_targets)) – Target values.

Returns:

self – GaussianProcessRegressor class instance.

Return type:

object
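
A minimal usage sketch, assuming the ESMValTool diagnostic-script package is importable; the toy data and kernel choice are illustrative only and not taken from the module itself:

    import numpy as np
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    from esmvaltool.diag_scripts.mlr.models.gpr_sklearn import (
        AdvancedGaussianProcessRegressor,
    )

    # Toy 1-D regression problem (illustrative only).
    rng = np.random.default_rng(0)
    x_train = rng.uniform(0.0, 10.0, size=(50, 1))
    y_train = np.sin(x_train).ravel() + 0.1 * rng.normal(size=50)

    # Any sklearn.gaussian_process.kernels object can be used here.
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    gpr = AdvancedGaussianProcessRegressor(kernel=kernel, random_state=42)
    gpr.fit(x_train, y_train)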

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

log_marginal_likelihood(theta=None, eval_gradient=False, clone_kernel=True)#

Return log-marginal likelihood of theta for training data.

Parameters:
  • theta (array-like of shape (n_kernel_params,), default=None) – Kernel hyperparameters for which the log-marginal likelihood is evaluated. If None, the precomputed log_marginal_likelihood of self.kernel_.theta is returned.

  • eval_gradient (bool, default=False) – If True, the gradient of the log-marginal likelihood with respect to the kernel hyperparameters at position theta is returned additionally. If True, theta must not be None.

  • clone_kernel (bool, default=True) – If True, the kernel attribute is copied. If False, the kernel attribute is modified, but may result in a performance improvement.

Returns:

  • log_likelihood (float) – Log-marginal likelihood of theta for training data.

  • log_likelihood_gradient (ndarray of shape (n_kernel_params,), optional) – Gradient of the log-marginal likelihood with respect to the kernel hyperparameters at position theta. Only returned when eval_gradient is True.
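
Continuing the sketch from fit() above, the log-marginal likelihood behaves exactly as in the parent sklearn class:

    # At the fitted hyperparameters (theta=None uses the precomputed value).
    lml = gpr.log_marginal_likelihood()

    # At an explicit theta, additionally returning the gradient.
    lml_theta, lml_grad = gpr.log_marginal_likelihood(
        theta=gpr.kernel_.theta, eval_gradient=True
    )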

predict(x_data, return_var=False, return_cov=False)[source]#

Expand predict() to accept return_var.
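
A short continuation of the sketch above; by analogy with the parent class's return_std option, return_var is assumed to return the predictive variance alongside the mean (the exact return format is an assumption here):

    x_query = np.linspace(0.0, 10.0, 100).reshape(-1, 1)

    # Plain mean prediction (inherited behaviour).
    y_mean = gpr.predict(x_query)

    # Mean plus predictive variance via the added return_var flag
    # (assumed to mirror the parent class's return_std mechanism).
    y_mean, y_var = gpr.predict(x_query, return_var=True)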

sample_y(X, n_samples=1, random_state=0)#

Draw samples from Gaussian process and evaluate at X.

Parameters:
  • X (array-like of shape (n_samples_X, n_features) or list of object) – Query points where the GP is evaluated.

  • n_samples (int, default=1) – Number of samples drawn from the Gaussian process per query point.

  • random_state (int, RandomState instance or None, default=0) – Determines random number generation to randomly draw samples. Pass an int for reproducible results across multiple function calls. See Glossary.

Returns:

y_samples – Values of n_samples samples drawn from Gaussian process and evaluated at query points.

Return type:

ndarray of shape (n_samples_X, n_samples), or (n_samples_X, n_targets, n_samples)
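
Continuing the sketch above, posterior samples are drawn as in the parent class:

    # Five posterior samples per query point; shape (n_samples_X, 5).
    y_samples = gpr.sample_y(x_query, n_samples=5, random_state=0)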

score(X, y, sample_weight=None)#

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – \(R^2\) of self.predict(X) w.r.t. y.

Return type:

float

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
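
Continuing the sketch above, a quick check of the fit quality on the training data:

    # Coefficient of determination (1.0 corresponds to a perfect fit).
    r2 = gpr.score(x_train, y_train)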

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance
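
Continuing the sketch above: nested parameters use the double-underscore convention, and for the Sum kernel built earlier the operands are exposed under sklearn's generic names k1 and k2:

    gpr.set_params(
        alpha=1e-8,                    # top-level estimator parameter
        kernel__k1__length_scale=2.0,  # RBF part of the Sum kernel
        kernel__k2__noise_level=0.05,  # WhiteKernel part of the Sum kernel
    )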

set_predict_request(*, return_cov: bool | None | str = '$UNCHANGED$', return_var: bool | None | str = '$UNCHANGED$', x_data: bool | None | str = '$UNCHANGED$') → AdvancedGaussianProcessRegressor#

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • return_cov (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for return_cov parameter in predict.

  • return_var (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for return_var parameter in predict.

  • x_data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for x_data parameter in predict.

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → AdvancedGaussianProcessRegressor#

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

class esmvaltool.diag_scripts.mlr.models.gpr_sklearn.SklearnGPRModel(input_datasets, **kwargs)[source]#

Bases: MLRModel

Gaussian Process Regression model (sklearn implementation).
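
A hypothetical end-to-end sketch of how this class is typically driven from an ESMValTool diagnostic script; input_datasets is a placeholder for the dataset metadata assembled from the recipe, and the chosen methods are illustrative, not exhaustive:

    from esmvaltool.diag_scripts.mlr.models.gpr_sklearn import SklearnGPRModel

    # ``input_datasets`` is a hypothetical placeholder; in a real diagnostic
    # it is built from the recipe/settings passed to the script.
    model = SklearnGPRModel(input_datasets)

    model.fit()
    model.print_kernel_info()
    model.print_regression_metrics()
    model.plot_prediction_errors()
    model.predict()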

Attributes:

categorical_features

Categorical features.

data

Input data of the MLR model.

features

Features of the input data.

features_after_preprocessing

Features of the input data after preprocessing.

features_types

Types of the features.

features_units

Units of the features.

fit_kwargs

Keyword arguments for fit().

group_attributes

Group attributes of the input data.

label

Label of the input data.

label_units

Units of the label.

mlr_model_type

MLR model type.

numerical_features

Numerical features.

parameters

Parameters of the complete MLR model pipeline.

random_state

Random state instance.

Methods:

create(mlr_model_type, *args, **kwargs)

Create desired MLR model subclass (factory method).

efecv(**kwargs)

Perform exhaustive feature elimination using cross-validation.

export_prediction_data([filename])

Export all prediction data contained in self._data.

export_training_data([filename])

Export all training data contained in self._data.

fit()

Fit MLR model.

get_ancestors([label, features, ...])

Return ancestor files.

get_data_frame(data_type[, impute_nans])

Return data frame of specified type.

get_x_array(data_type[, impute_nans])

Return x data of specific type.

get_y_array(data_type[, impute_nans])

Return y data of specific type.

grid_search_cv(param_grid, **kwargs)

Perform exhaustive parameter search using cross-validation.

plot_1d_model([filename, n_points])

Plot lineplot that represents the MLR model.

plot_partial_dependences([filename])

Plot partial dependences for every feature.

plot_prediction_errors([filename])

Plot predicted vs. true values.

plot_residuals([filename])

Plot residuals of training and test (if available) data.

plot_residuals_distribution([filename])

Plot distribution of residuals of training and test data (KDE).

plot_residuals_histogram([filename])

Plot histogram of residuals of training and test data.

plot_scatterplots([filename])

Plot scatterplots label vs. feature for every feature.

predict([save_mlr_model_error, ...])

Perform prediction using the MLR model(s) and write *.nc files.

print_correlation_matrices()

Print correlation matrices for all datasets.

print_kernel_info()

Print information of the fitted kernel of the GPR model.

print_regression_metrics([logo])

Print all available regression metrics for training data.

register_mlr_model(mlr_model_type)

Add MLR model (subclass of this class) (decorator).

reset_pipeline()

Reset regressor pipeline.

rfecv(**kwargs)

Perform recursive feature elimination using cross-validation.

test_normality_of_residuals()

Perform Shapiro-Wilk test for normality of residuals.

update_parameters(**params)

Update parameters of the whole pipeline.

property categorical_features#

Categorical features.

Type:

numpy.ndarray

classmethod create(mlr_model_type, *args, **kwargs)#

Create desired MLR model subclass (factory method).

property data#

Input data of the MLR model.

Type:

dict

efecv(**kwargs)#

Perform exhaustive feature elimination using cross-validation.

Parameters:

**kwargs (keyword arguments, optional) – Additional options for esmvaltool.diag_scripts.mlr.custom_sklearn.cross_val_score_weighted().

export_prediction_data(filename=None)#

Export all prediction data contained in self._data.

Parameters:

filename (str, optional (default: '{data_type}_{pred_name}.csv')) – Name of the exported files.

export_training_data(filename=None)#

Export all training data contained in self._data.

Parameters:

filename (str, optional (default: '{data_type}.csv')) – Name of the exported files.

property features#

Features of the input data.

Type:

numpy.ndarray

property features_after_preprocessing#

Features of the input data after preprocessing.

Type:

numpy.ndarray

property features_types#

Types of the features.

Type:

pandas.Series

property features_units#

Units of the features.

Type:

pandas.Series

fit()#

Fit MLR model.

Note

Keyword arguments cannot be specified for this function since they might alter features_after_preprocessing. Use the keyword argument fit_kwargs during class initialization instead.

property fit_kwargs#

Keyword arguments for fit().

Type:

dict

get_ancestors(label=True, features=None, prediction_names=None, prediction_reference=False)#

Return ancestor files.

Parameters:
  • label (bool, optional (default: True)) – Return label files.

  • features (list of str, optional (default: None)) – Features for which files should be returned. If None, return files for all features.

  • prediction_names (list of str, optional (default: None)) – Prediction names for which files should be returned. If None, return files for all prediction names.

  • prediction_reference (bool, optional (default: False)) – Return prediction_reference files if available for given prediction_names.

Returns:

Ancestor files.

Return type:

list of str

Raises:

ValueError – Invalid feature or prediction_name given.

get_data_frame(data_type, impute_nans=False)#

Return data frame of specified type.

Parameters:
  • data_type (str) – Data type to be returned. Must be one of 'all', 'train' or 'test'.

  • impute_nans (bool, optional (default: False)) – Impute nans if desired.

Returns:

Desired data.

Return type:

pandas.DataFrame

Raises:

TypeError – data_type is invalid or data does not exist (e.g. test data is not set).
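
Continuing the sketch above (a fitted SklearnGPRModel called model), the stored data can be inspected directly:

    # Training data as a pandas.DataFrame (NaNs kept as-is).
    df_train = model.get_data_frame('train')

    # All data with NaNs imputed.
    df_all = model.get_data_frame('all', impute_nans=True)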

get_x_array(data_type, impute_nans=False)#

Return x data of specific type.

Parameters:
  • data_type (str) – Data type to be returned. Must be one of 'all', 'train' or 'test'.

  • impute_nans (bool, optional (default: False)) – Impute nans if desired.

Returns:

Desired data.

Return type:

numpy.ndarray

Raises:

TypeError – data_type is invalid or data does not exist (e.g. test data is not set).

get_y_array(data_type, impute_nans=False)#

Return y data of specific type.

Parameters:
  • data_type (str) – Data type to be returned. Must be one of 'all', 'train' or 'test'.

  • impute_nans (bool, optional (default: False)) – Impute nans if desired.

Returns:

Desired data.

Return type:

numpy.ndarray

Raises:

TypeError – data_type is invalid or data does not exist (e.g. test data is not set).

grid_search_cv(param_grid, **kwargs)#

Perform exhaustive parameter search using cross-validation.

Parameters:
  • param_grid (dict or list of dict) – Parameter names (keys) and ranges (values) for the search. Have to be given for each step of the pipeline separated by two underscores, i.e. s__p is the parameter p for step s.

  • **kwargs (keyword arguments, optional) – Additional options for sklearn.model_selection.GridSearchCV.

Raises:

ValueError – Final regressor does not supply the attributes best_estimator_ or best_params_.
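
A hedged sketch of a hyperparameter search; the parameter key below is hypothetical and must be replaced by a key that actually occurs in the parameters property of the model:

    # Inspect the available pipeline parameters first.
    print(model.parameters)

    # Hypothetical ``s__p``-style key; extra keyword arguments are passed on
    # to sklearn.model_selection.GridSearchCV.
    param_grid = {'final__regressor__alpha': [1e-10, 1e-7, 1e-5]}
    model.grid_search_cv(param_grid, cv=5)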

property group_attributes#

Group attributes of the input data.

Type:

numpy.ndarray

property label#

Label of the input data.

Type:

str

property label_units#

Units of the label.

Type:

str

property mlr_model_type#

MLR model type.

Type:

str

property numerical_features#

Numerical features.

Type:

numpy.ndarray

property parameters#

Parameters of the complete MLR model pipeline.

Type:

dict

plot_1d_model(filename=None, n_points=1000)#

Plot lineplot that represents the MLR model.

Note

This only works for a model with a single feature.

Parameters:
  • filename (str, optional (default: '1d_mlr_model')) – Name of the plot file.

  • n_points (int, optional (default: 1000)) – Number of sampled points for the single feature (using linear spacing between minimum and maximum value).

Raises:

sklearn.exceptions.NotFittedError – MLR model is not fitted.

plot_partial_dependences(filename=None)#

Plot partial dependences for every feature.

Parameters:

filename (str, optional (default: 'partial_dependece_{feature}')) – Name of the plot file.

Raises:

sklearn.exceptions.NotFittedError – MLR model is not fitted.

plot_prediction_errors(filename=None)#

Plot predicted vs. true values.

Parameters:

filename (str, optional (default: 'prediction_errors')) – Name of the plot file.

Raises:

sklearn.exceptions.NotFittedError – MLR model is not fitted.

plot_residuals(filename=None)#

Plot residuals of training and test (if available) data.

Parameters:

filename (str, optional (default: 'residuals')) – Name of the plot file.

Raises:

sklearn.exceptions.NotFittedError – MLR model is not fitted.

plot_residuals_distribution(filename=None)#

Plot distribution of residuals of training and test data (KDE).

Parameters:

filename (str, optional (default: 'residuals_distribution')) – Name of the plot file.

Raises:

sklearn.exceptions.NotFittedError – MLR model is not fitted.

plot_residuals_histogram(filename=None)#

Plot histogram of residuals of training and test data.

Parameters:

filename (str, optional (default: 'residuals_histogram')) – Name of the plot file.

Raises:

sklearn.exceptions.NotFittedError – MLR model is not fitted.

plot_scatterplots(filename=None)#

Plot scatterplots label vs. feature for every feature.

Parameters:

filename (str, optional (default: 'scatterplot_{feature}')) – Name of the plot file.

Raises:

sklearn.exceptions.NotFittedError – MLR model is not fitted.

predict(save_mlr_model_error=None, save_lime_importance=False, save_propagated_errors=False, **kwargs)#

Perform prediction using the MLR model(s) and write *.nc files.

Parameters:
  • save_mlr_model_error (str or int, optional) – Additionally saves estimated squared MLR model error. This error represents the uncertainty of the prediction caused by the MLR model itself and not by errors in the prediction input data (errors in that will be considered by including datasets with var_type set to prediction_input_error and setting save_propagated_errors to True). If the option is set to 'test', the (constant) error is estimated as RMSEP using a (hold-out) test data set. Only possible if test data is available, i.e. the option test_size is not set to False during class initialization. If the option is set to 'logo', the (constant) error is estimated as RMSEP using leave-one-group-out cross-validation using the group_attributes. Only possible if group_datasets_by_attributes is given. If the option is set to an integer n (!= 0), the (constant) error is estimated as RMSEP using n-fold cross-validation.

  • save_lime_importance (bool, optional (default: False)) – Additionally saves local feature importance given by LIME (Local Interpretable Model-agnostic Explanations).

  • save_propagated_errors (bool, optional (default: False)) – Additionally saves propagated errors from prediction_input_error datasets. Only possible when these are available.

  • **kwargs (keyword arguments, optional) – Additional options for the final regressors predict() function.

Raises:

sklearn.exceptions.NotFittedError – MLR model is not fitted.

print_correlation_matrices()#

Print correlation matrices for all datasets.

print_kernel_info()[source]#

Print information of the fitted kernel of the GPR model.

print_regression_metrics(logo=False)#

Print all available regression metrics for training data.

Parameters:

logo (bool, optional (default: False)) – Print regression metrics using sklearn.model_selection.LeaveOneGroupOut cross-validation. Only possible when group_datasets_by_attributes was given during class initialization.

property random_state#

Random state instance.

Type:

numpy.random.RandomState

classmethod register_mlr_model(mlr_model_type)#

Add MLR model (subclass of this class) (decorator).

reset_pipeline()#

Reset regressor pipeline.

rfecv(**kwargs)#

Perform recursive feature elimination using cross-validation.

Note

This only works for final estimators that provide information about feature importance either through a coef_ attribute or through a feature_importances_ attribute.

Parameters:

**kwargs (keyword arguments, optional) – Additional options for sklearn.feature_selection.RFECV.

Raises:

RuntimeError – Final estimator does not provide coef_ or feature_importances_ attribute.

test_normality_of_residuals()#

Perform Shapiro-Wilk test for normality of residuals.

Raises:

sklearn.exceptions.NotFittedError – MLR model is not fitted.

update_parameters(**params)#

Update parameters of the whole pipeline.

Note

Parameter names have to be given for each step of the pipeline separated by two underscores, i.e. s__p is the parameter p for step s.

Parameters:

**params (keyword arguments, optional) – Parameters for the pipeline which should be updated.

Raises:

ValueError – Invalid parameter for pipeline given.
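
As with grid_search_cv(), keys follow the s__p convention; the name below is hypothetical and should be taken from the parameters property:

    # Hypothetical pipeline parameter name (see model.parameters).
    model.update_parameters(final__regressor__alpha=1e-8)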