bolero.optimizer.CREPSOptimizer

class bolero.optimizer.CREPSOptimizer(initial_params=None, variance=None, covariance=None, epsilon=2.0, min_eta=1e-08, train_freq=25, n_samples_per_update=100, context_features=None, gamma=0.0001, bounds=None, log_to_file=False, log_to_stdout=False, random_state=None, **kwargs)[source]

Contextual Relative Entropy Policy Search.

Use C-REPS as a black-box contextual optimizer: it learns an upper-level distribution \(\pi(\boldsymbol{\theta}|\boldsymbol{s})\) that selects weights \(\boldsymbol{\theta}\) for the objective function. At the moment, \(\pi(\boldsymbol{\theta}|\boldsymbol{s})\) is assumed to be a multivariate Gaussian distribution whose mean is a linear function of nonlinear features of the context. C-REPS constrains the learning updates such that the KL divergence between successive distributions stays below the threshold \(\epsilon\).
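Stated roughly, and omitting the feature-matching constraints of the full formulation in [1], each policy update solves a constrained problem of the form

\[\max_{\pi} \int \mu(\boldsymbol{s}) \int \pi(\boldsymbol{\theta}|\boldsymbol{s}) \, \mathcal{R}(\boldsymbol{s}, \boldsymbol{\theta}) \, d\boldsymbol{\theta} \, d\boldsymbol{s} \quad \text{s.t.} \quad \int \mu(\boldsymbol{s}) \, D_{\mathrm{KL}}\!\left(\pi(\boldsymbol{\theta}|\boldsymbol{s}) \,\|\, q(\boldsymbol{\theta}|\boldsymbol{s})\right) d\boldsymbol{s} \le \epsilon,\]

where \(\mu(\boldsymbol{s})\) denotes the context distribution, \(\mathcal{R}(\boldsymbol{s}, \boldsymbol{\theta})\) the expected return for weights \(\boldsymbol{\theta}\) in context \(\boldsymbol{s}\), and \(q(\boldsymbol{\theta}|\boldsymbol{s})\) the previous (sampling) distribution.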

This contextual version of REPSOptimizer inherits its properties from the original algorithm. More information on the algorithm can be found in the original publication [1].
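A minimal construction sketch using only the constructor arguments documented below; the values are illustrative, not recommendations:

    from bolero.optimizer import CREPSOptimizer

    # Illustrative settings; epsilon, train_freq and context_features are
    # typically tuned per task.
    opt = CREPSOptimizer(epsilon=2.0, train_freq=25, n_samples_per_update=100,
                         context_features="quadratic", random_state=0)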

Parameters:
initial_params : array-like, shape (n_params,)

Initial parameter vector.

variance : float, optional (default: 1.0)

Initial exploration variance.

covariance : array-like, optional (default: None)

Either a diagonal (with shape (n_params,)) or a full covariance matrix (with shape (n_params, n_params)). A full covariance can contain information about the correlation of variables.

epsilon : float, optional (default: 2.0)

Maximum Kullback-Leibler divergence of two successive policy distributions.

min_eta : float, optional (default: 1e-8)

Minimum eta; a value of 0 would result in numerical problems.

train_freq : int, optional (default: 25)

Number of rollouts between policy updates.

n_samples_per_update : int, optional (default: 100)

Number of samples that will be used to update a policy.

context_features : string or callable, optional (default: None)

(Nonlinear) feature transformation for the context. Possible options are ‘constant’, ‘linear’, ‘affine’, ‘quadratic’, ‘cubic’, or you can write a custom function that computes a transformation of the context. This makes a linear upper-level policy capable of representing nonlinear functions; see the sketch after this parameter list for an illustration of such a feature expansion.

gamma : float, optional (default: 1e-4)

Regularization parameter. Should be removed in the future.

bounds : array-like, shape (n_params, 2), optional (default: None)

Upper and lower bounds for each parameter.

log_to_file : boolean or string, optional (default: False)

Log results to the given file; it will be located in $BL_LOG_PATH.

log_to_stdout : boolean, optional (default: False)

Log to standard output.

random_state : int, optional

Seed for the random number generator.
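As an illustration of the feature transformation referred to by context_features, a quadratic expansion of a context vector could look like the following sketch. This is not bolero's internal implementation, and the exact signature expected from a custom callable should be checked against the bolero source:

    import numpy as np

    def quadratic_context_features(context):
        # Illustrative only: constant, linear and pairwise quadratic terms.
        context = np.asarray(context)
        quadratic_terms = np.outer(context, context)[np.triu_indices(len(context))]
        return np.hstack(([1.0], context, quadratic_terms))

    # For a 2D context [s1, s2] this yields [1, s1, s2, s1*s1, s1*s2, s2*s2].

With such features the mean of the Gaussian upper-level policy, although linear in the features, becomes a nonlinear function of the context.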

References

[1] Kupcsik, A.; Deisenroth, M.P.; Peters, J.; Loh, A.P.; Vadakkepat, P.; Neumann, G. Model-based contextual policy search for data-efficient generalization of robot skills. Artificial Intelligence 247, 2017.
__init__(initial_params=None, variance=None, covariance=None, epsilon=2.0, min_eta=1e-08, train_freq=25, n_samples_per_update=100, context_features=None, gamma=0.0001, bounds=None, log_to_file=False, log_to_stdout=False, random_state=None, **kwargs)[source]
best_policy()[source]

Return current best estimate of contextual policy.

Returns:
policy : UpperLevelPolicy

Best estimate of upper-level policy

get_args()

Get parameters for this estimator.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

get_desired_context()[source]

C-REPS does not actively select the context.

Returns:
context : None

C-REPS does not have any preference

get_next_parameters(params, explore=True)[source]

Get next individual/parameter vector for evaluation.

Parameters:
params : array_like, shape (n_params,)

Parameter vector; it will be modified in place.

explore : bool, optional (default: True)

Whether we want to turn exploration on for the next evaluation

init(n_params, n_context_dims)[source]

Initialize optimizer.

Parameters:
n_params : int

number of parameters

n_context_dims : int

number of dimensions of the context space

is_behavior_learning_done()[source]

Check if the optimization is finished.

Returns:
finished : bool

Is the learning of a behavior finished?

set_context(context)[source]

Set context of next evaluation.

Parameters:
context : array-like, shape (n_context_dims,)

The context in which the next rollout will be performed

set_evaluation_feedback(rewards)[source]

Set feedback for the parameter vector.

Parameters:
rewards : list of float

Feedback for each step or for the whole episode, depending on the problem.
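A minimal sketch of the interaction loop implied by the methods above, assuming an episodic setting; the objective function and all numeric values are purely illustrative:

    import numpy as np
    from bolero.optimizer import CREPSOptimizer

    def objective(params, context):
        # Hypothetical contextual objective: the reward is highest when the
        # parameters match the context.
        return -np.sum((params - context) ** 2)

    n_params, n_context_dims = 2, 2
    opt = CREPSOptimizer(epsilon=2.0, train_freq=25, n_samples_per_update=100,
                         context_features="quadratic", random_state=0)
    opt.init(n_params, n_context_dims)

    params = np.empty(n_params)
    rng = np.random.RandomState(0)
    for _ in range(500):
        context = rng.uniform(-1.0, 1.0, n_context_dims)
        opt.set_context(context)
        opt.get_next_parameters(params)  # fills 'params' in place
        opt.set_evaluation_feedback([objective(params, context)])
        if opt.is_behavior_learning_done():
            break

    policy = opt.best_policy()  # best estimate of the upper-level policy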