
bolero.optimizer.REPSOptimizer

class bolero.optimizer.REPSOptimizer(initial_params=None, variance=1.0, covariance=None, epsilon=2.0, min_eta=1e-08, train_freq=25, n_samples_per_update=100, bounds=None, log_to_file=False, log_to_stdout=False, random_state=None)[source]

Relative Entropy Policy Search (REPS) as Optimizer.

Use REPS as a black-box optimizer: learn an upper-level distribution \(\pi(\boldsymbol{\theta})\) that selects the weights \(\boldsymbol{\theta}\) passed to the objective function. At the moment, \(\pi(\boldsymbol{\theta})\) is assumed to be a multivariate Gaussian distribution whose mean and covariance (which governs exploration) are learned. REPS constrains each learning update so that the KL divergence between the old and the new distribution stays below a threshold epsilon. More details can be found in the original publication [1].

Abdolmaleki et al. [2] state that “the episodic REPS algorithm uses a sample based approximation of the KL-bound, which needs a lot of samples in order to be accurate. Moreover, a typical problem of REPS is that the entropy of the search distribution decreases too quickly, resulting in premature convergence.”
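In essence, each episodic update turns the returns of the sampled parameter vectors into weights: a temperature eta is chosen by minimizing the REPS dual so that the KL bound epsilon holds, and the Gaussian search distribution is then refit by weighted maximum likelihood. The following is only an illustrative sketch of that weighting step, not bolero's internal implementation; the helper name reps_weights and the use of scipy.optimize are assumptions made for the example.

import numpy as np
from scipy.optimize import minimize

def reps_weights(returns, epsilon=2.0, min_eta=1e-8):
    """Compute REPS sample weights for one update (illustrative sketch)."""
    R = np.asarray(returns, dtype=float)
    R = R - R.max()  # shift returns for numerical stability

    def dual(eta):
        # REPS dual: g(eta) = eta * epsilon + eta * log(mean(exp(R / eta)))
        eta = max(float(eta[0]), min_eta)
        return eta * epsilon + eta * np.log(np.mean(np.exp(R / eta)))

    res = minimize(dual, x0=np.array([1.0]), bounds=[(min_eta, None)],
                   method="L-BFGS-B")
    eta = max(float(res.x[0]), min_eta)
    d = np.exp(R / eta)
    return d / d.sum()  # normalized weights for the weighted ML refit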

Parameters:
initial_params : array, shape = (num_params,), optional (default: zeros)

Initial parameter vector.

variance : float, optional (default: 1)

Initial exploration variance.

covariance : array-like, optional (default: I)

Either a diagonal (with shape (n_params,)) or a full covariance matrix (with shape (n_params, n_params)). A full covariance can contain information about the correlation of variables.

epsilon : float > 0.0, optional (default: 2)

Maximum KL divergence allowed between the old and the new search distribution during an update.

train_freq : int > 0, optional (default: 25)

Training frequency: the number of rollouts between REPS updates of the policy parameters. Defaults to 25 rollouts.

min_eta : float, optional (default: 1e-8)

Minimum value for the Lagrange multiplier eta; a value of 0 would cause numerical problems.

n_samples_per_update : int, optional (default: 100)

Number of samples that will be used to update a policy.

bounds : array-like, shape (n_params, 2), optional (default: None)

Upper and lower bounds for each parameter.

log_to_file : optional, boolean or string (default: False)

Log results to the given file, which will be located in $BL_LOG_PATH.

log_to_stdout : optional, boolean (default: False)

Log to standard output

random_state : optional, int

Seed for the random number generator.
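For illustration, a configuration with a diagonal initial covariance could look as follows; the concrete values (and the assumed (lower, upper) row order of bounds) are made up for this example.

import numpy as np
from bolero.optimizer import REPSOptimizer

opt = REPSOptimizer(
    initial_params=np.zeros(3),               # start at the origin
    covariance=np.array([1.0, 0.5, 0.1]),     # diagonal covariance, shape (n_params,)
    epsilon=1.0,                              # tighter KL bound per update
    train_freq=25,                            # update after every 25 rollouts
    bounds=np.array([[-1.0, 1.0]] * 3),       # one row per parameter (order assumed (lower, upper))
    random_state=0,
)
opt.init(3)                                   # dimensionality must match initial_params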

References

[1] Peters, J.; Muelling, K.; Altun, Y. Relative Entropy Policy Search. Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
[2] Abdolmaleki, A.; Lioutikov, R.; Lau, N.; Paulo Reis, L.; Peters, J.; Neumann, G. Model-Based Relative Entropy Stochastic Search. Advances in Neural Information Processing Systems 28, 2015.
__init__(initial_params=None, variance=1.0, covariance=None, epsilon=2.0, min_eta=1e-08, train_freq=25, n_samples_per_update=100, bounds=None, log_to_file=False, log_to_stdout=False, random_state=None)[source]
get_args()

Get parameters for this estimator.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

get_best_parameters()[source]

Get the best parameters.

Returns:
best_params : array-like, shape (n_params,)

Best parameters

get_next_parameters(params, explore=True)[source]

Return parameter vector that shall be evaluated next.

Parameters:
params : array-like, shape = (n_params,)

The selected parameters will be written into this as a side-effect.

explore : bool

Whether exploration in parameter selection is enabled

init(n_params)[source]

Initialize optimizer.

Parameters:
n_params : int

Number of parameters.

is_behavior_learning_done()[source]

Check if the optimization is finished.

Returns:
finished : bool

Is the learning of a behavior finished?

set_evaluation_feedback(feedbacks)[source]

Inform the optimizer of the outcome of a rollout with the current weights.
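A minimal standalone optimization loop with the interface documented above might look like this; the quadratic objective and the number of rollouts are chosen only for illustration, and the feedback is assumed to be a reward that the optimizer maximizes.

import numpy as np
from bolero.optimizer import REPSOptimizer

def objective(x):
    return -np.sum(x ** 2)  # toy objective: maximum at x = 0

n_params = 3
opt = REPSOptimizer(variance=1.0, random_state=0)
opt.init(n_params)

params = np.empty(n_params)
for _ in range(500):
    opt.get_next_parameters(params)              # writes the next sample into params
    opt.set_evaluation_feedback([objective(params)])
    if opt.is_behavior_learning_done():
        break

print(opt.get_best_parameters())                 # best parameter vector found so far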