bolero.optimizer.CREPSOptimizer

class bolero.optimizer.CREPSOptimizer(initial_params=None, variance=None, covariance=None, epsilon=2.0, min_eta=1e-08, train_freq=25, n_samples_per_update=100, context_features=None, gamma=0.0001, bounds=None, log_to_file=False, log_to_stdout=False, random_state=None, **kwargs)[source]

Contextual Relative Entropy Policy Search.

Use C-REPS as a black-box contextual optimizer: it learns an upper-level distribution \(\pi(\boldsymbol{\theta}|\boldsymbol{s})\) that selects weights \(\boldsymbol{\theta}\) for the objective function. At the moment, \(\pi(\boldsymbol{\theta}|\boldsymbol{s})\) is assumed to be a multivariate Gaussian distribution whose mean is a linear function of nonlinear features of the context. C-REPS constrains the learning updates such that the KL divergence between successive distributions stays below the threshold \(\epsilon\).
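Stated roughly, and omitting the feature-matching constraints of the full formulation in [1], each policy update solves a constrained problem of the form

\[\max_{\pi} \int \mu(\boldsymbol{s}) \int \pi(\boldsymbol{\theta}|\boldsymbol{s}) \, \mathcal{R}(\boldsymbol{s}, \boldsymbol{\theta}) \, d\boldsymbol{\theta} \, d\boldsymbol{s} \quad \text{s.t.} \quad \int \mu(\boldsymbol{s}) \, D_{\mathrm{KL}}\!\left(\pi(\boldsymbol{\theta}|\boldsymbol{s}) \,\|\, q(\boldsymbol{\theta}|\boldsymbol{s})\right) d\boldsymbol{s} \le \epsilon,\]

where \(\mu(\boldsymbol{s})\) denotes the context distribution, \(\mathcal{R}(\boldsymbol{s}, \boldsymbol{\theta})\) the expected return for weights \(\boldsymbol{\theta}\) in context \(\boldsymbol{s}\), and \(q(\boldsymbol{\theta}|\boldsymbol{s})\) the previous (sampling) distribution.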

This contextual version of REPSOptimizer inherits its properties from the original algorithm. More information on the algorithm can be found in the original publication [1].
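A minimal construction sketch using only the constructor arguments documented below; the values are illustrative, not recommendations:

    from bolero.optimizer import CREPSOptimizer

    # Illustrative settings; epsilon, train_freq and context_features are
    # typically tuned per task.
    opt = CREPSOptimizer(epsilon=2.0, train_freq=25, n_samples_per_update=100,
                         context_features="quadratic", random_state=0)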

Parameters:
initial_params : array-like, shape (n_params,)

Initial parameter vector.

variance : float, optional (default: 1.0)

Initial exploration variance.

covariance : array-like, optional (default: None)

Either a diagonal (with shape (n_params,)) or a full covariance matrix (with shape (n_params, n_params)). A full covariance can contain information about the correlation of variables.

epsilon : float, optional (default: 2.0)

Maximum Kullback-Leibler divergence of two successive policy distributions.

min_eta : float, optional (default: 1e-8)

Minimum eta; a value of 0 would result in numerical problems.

train_freq : int, optional (default: 25)

Number of rollouts between policy updates.

n_samples_per_update : int, optional (default: 100)

Number of samples that will be used to update a policy.

context_features : string or callable, optional (default: None)

(Nonlinear) feature transformation for the context. Possible options are ‘constant’, ‘linear’, ‘affine’, ‘quadratic’, ‘cubic’, or you can write a custom function that computes a transformation of the context. This makes a linear upper-level policy capable of representing nonlinear functions; see the sketch after this parameter list for an illustration of such a feature expansion.

gamma : float, optional (default: 1e-4)

Regularization parameter. Should be removed in the future.

bounds : array-like, shape (n_params, 2), optional (default: None)

Upper and lower bounds for each parameter.

log_to_file : boolean or string, optional (default: False)

Log results to the given file; it will be located in $BL_LOG_PATH.

log_to_stdout : boolean, optional (default: False)

Log to standard output.

random_state : int, optional

Seed for the random number generator.
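As an illustration of the feature transformation referred to by context_features, a quadratic expansion of a context vector could look like the following sketch. This is not bolero's internal implementation, and the exact signature expected from a custom callable should be checked against the bolero source:

    import numpy as np

    def quadratic_context_features(context):
        # Illustrative only: constant, linear and pairwise quadratic terms.
        context = np.asarray(context)
        quadratic_terms = np.outer(context, context)[np.triu_indices(len(context))]
        return np.hstack(([1.0], context, quadratic_terms))

    # For a 2D context [s1, s2] this yields [1, s1, s2, s1*s1, s1*s2, s2*s2].

With such features the mean of the Gaussian upper-level policy, although linear in the features, becomes a nonlinear function of the context.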

References

[1] Kupcsik, A.; Deisenroth, M.P.; Peters, J.; Loh, A.P.; Vadakkepat, P.; Neumann, G. Model-based contextual policy search for data-efficient generalization of robot skills. Artificial Intelligence 247, 2017.
__init__(initial_params=None, variance=None, covariance=None, epsilon=2.0, min_eta=1e-08, train_freq=25, n_samples_per_update=100, context_features=None, gamma=0.0001, bounds=None, log_to_file=False, log_to_stdout=False, random_state=None, **kwargs)[source]
best_policy()[source]

Return current best estimate of contextual policy.

Returns:
policy : UpperLevelPolicy

Best estimate of upper-level policy

get_args()

Get parameters for this estimator.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

get_desired_context()[source]

C-REPS does not actively select the context.

Returns:
context : None

C-REPS does not have any preference

get_next_parameters(params, explore=True)[source]

Get next individual/parameter vector for evaluation.

Parameters:
params : array_like, shape (n_params,)

Parameter vector; it will be modified in place.

explore : bool, optional (default: True)

Whether we want to turn exploration on for the next evaluation

init(n_params, n_context_dims)[source]

Initialize optimizer.

Parameters:
n_params : int

number of parameters

n_context_dims : int

number of dimensions of the context space

is_behavior_learning_done()[source]

Check if the optimization is finished.

Returns:
finished : bool

Is the learning of a behavior finished?

set_context(context)[source]

Set context of next evaluation.

Parameters:
context : array-like, shape (n_context_dims,)

The context in which the next rollout will be performed

set_evaluation_feedback(rewards)[source]

Set feedback for the parameter vector.

Parameters:
rewards : list of float

Feedback for each step or for the whole episode, depending on the problem.
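A minimal sketch of the interaction loop implied by the methods above, assuming an episodic setting; the objective function and all numeric values are purely illustrative:

    import numpy as np
    from bolero.optimizer import CREPSOptimizer

    def objective(params, context):
        # Hypothetical contextual objective: the reward is highest when the
        # parameters match the context.
        return -np.sum((params - context) ** 2)

    n_params, n_context_dims = 2, 2
    opt = CREPSOptimizer(epsilon=2.0, train_freq=25, n_samples_per_update=100,
                         context_features="quadratic", random_state=0)
    opt.init(n_params, n_context_dims)

    params = np.empty(n_params)
    rng = np.random.RandomState(0)
    for _ in range(500):
        context = rng.uniform(-1.0, 1.0, n_context_dims)
        opt.set_context(context)
        opt.get_next_parameters(params)  # fills 'params' in place
        opt.set_evaluation_feedback([objective(params, context)])
        if opt.is_behavior_learning_done():
            break

    policy = opt.best_policy()  # best estimate of the upper-level policy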