
bolero.optimizer.REPSOptimizer

class bolero.optimizer.REPSOptimizer(initial_params=None, variance=1.0, covariance=None, epsilon=2.0, min_eta=1e-08, train_freq=25, n_samples_per_update=100, bounds=None, log_to_file=False, log_to_stdout=False, random_state=None)[source]

Relative Entropy Policy Search (REPS) as Optimizer.

Use REPS as a black-box optimizer: learn an upper-level distribution \(\pi(\boldsymbol{\theta})\) that selects the weights \(\boldsymbol{\theta}\) passed to the objective function. At the moment, \(\pi(\boldsymbol{\theta})\) is assumed to be a multivariate Gaussian distribution whose mean and covariance (which governs exploration) are learned. REPS constrains each learning update so that the KL divergence between the old and the new distribution stays below a threshold epsilon. More details can be found in the original publication [1].

Abdolmaleki et al. [2] state that “the episodic REPS algorithm uses a sample based approximation of the KL-bound, which needs a lot of samples in order to be accurate. Moreover, a typical problem of REPS is that the entropy of the search distribution decreases too quickly, resulting in premature convergence.”
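In essence, each episodic update turns the returns of the sampled parameter vectors into weights: a temperature eta is chosen by minimizing the REPS dual so that the KL bound epsilon holds, and the Gaussian search distribution is then refit by weighted maximum likelihood. The following is only an illustrative sketch of that weighting step, not bolero's internal implementation; the helper name reps_weights and the use of scipy.optimize are assumptions made for the example.

import numpy as np
from scipy.optimize import minimize

def reps_weights(returns, epsilon=2.0, min_eta=1e-8):
    """Compute REPS sample weights for one update (illustrative sketch)."""
    R = np.asarray(returns, dtype=float)
    R = R - R.max()  # shift returns for numerical stability

    def dual(eta):
        # REPS dual: g(eta) = eta * epsilon + eta * log(mean(exp(R / eta)))
        eta = max(float(eta[0]), min_eta)
        return eta * epsilon + eta * np.log(np.mean(np.exp(R / eta)))

    res = minimize(dual, x0=np.array([1.0]), bounds=[(min_eta, None)],
                   method="L-BFGS-B")
    eta = max(float(res.x[0]), min_eta)
    d = np.exp(R / eta)
    return d / d.sum()  # normalized weights for the weighted ML refit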

Parameters:
initial_params : array, shape = (num_params,), optional (default: zeros)

Initial parameter vector.

variance : float, optional (default: 1)

Initial exploration variance.

covariance : array-like, optional (default: I)

Either a diagonal (with shape (n_params,)) or a full covariance matrix (with shape (n_params, n_params)). A full covariance can contain information about the correlation of variables.

epsilon : float > 0.0, optional (default: 2)

Maximum KL divergence allowed between the old and the new search distribution during an update.

train_freq : int > 0, optional (default: 25)

Training frequency: the number of rollouts between REPS updates of the policy parameters. Defaults to 25 rollouts.

min_eta : float, optional (default: 1e-8)

Minimum value for the Lagrange multiplier eta; a value of 0 would cause numerical problems.

n_samples_per_update : int, optional (default: 100)

Number of samples that will be used to update a policy.

bounds : array-like, shape (n_params, 2), optional (default: None)

Upper and lower bounds for each parameter.

log_to_file : optional, boolean or string (default: False)

Log results to the given file, which will be located in $BL_LOG_PATH.

log_to_stdout : optional, boolean (default: False)

Log to standard output

random_state : optional, int

Seed for the random number generator.
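For illustration, a configuration with a diagonal initial covariance could look as follows; the concrete values (and the assumed (lower, upper) row order of bounds) are made up for this example.

import numpy as np
from bolero.optimizer import REPSOptimizer

opt = REPSOptimizer(
    initial_params=np.zeros(3),               # start at the origin
    covariance=np.array([1.0, 0.5, 0.1]),     # diagonal covariance, shape (n_params,)
    epsilon=1.0,                              # tighter KL bound per update
    train_freq=25,                            # update after every 25 rollouts
    bounds=np.array([[-1.0, 1.0]] * 3),       # one row per parameter (order assumed (lower, upper))
    random_state=0,
)
opt.init(3)                                   # dimensionality must match initial_params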

References

[1] Peters, J.; Muelling, K.; Altun, Y. Relative Entropy Policy Search. Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
[2] Abdolmaleki, A.; Lioutikov, R.; Lau, N.; Paulo Reis, L.; Peters, J.; Neumann, G. Model-Based Relative Entropy Stochastic Search. Advances in Neural Information Processing Systems 28, 2015.
__init__(initial_params=None, variance=1.0, covariance=None, epsilon=2.0, min_eta=1e-08, train_freq=25, n_samples_per_update=100, bounds=None, log_to_file=False, log_to_stdout=False, random_state=None)[source]
get_args()

Get parameters for this estimator.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

get_best_parameters()[source]

Get the best parameters.

Returns:
best_params : array-like, shape (n_params,)

Best parameters

get_next_parameters(params, explore=True)[source]

Return parameter vector that shall be evaluated next.

Parameters:
params : array-like, shape = (n_params,)

The selected parameters will be written into this as a side-effect.

explore : bool

Whether exploration in parameter selection is enabled

init(n_params)[source]

Initialize optimizer.

Parameters:
n_params : int

Number of parameters.

is_behavior_learning_done()[source]

Check if the optimization is finished.

Returns:
finished : bool

Is the learning of a behavior finished?

set_evaluation_feedback(feedbacks)[source]

Inform the optimizer of the outcome of a rollout with the current weights.
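A minimal standalone optimization loop with the interface documented above might look like this; the quadratic objective and the number of rollouts are chosen only for illustration, and the feedback is assumed to be a reward that the optimizer maximizes.

import numpy as np
from bolero.optimizer import REPSOptimizer

def objective(x):
    return -np.sum(x ** 2)  # toy objective: maximum at x = 0

n_params = 3
opt = REPSOptimizer(variance=1.0, random_state=0)
opt.init(n_params)

params = np.empty(n_params)
for _ in range(500):
    opt.get_next_parameters(params)              # writes the next sample into params
    opt.set_evaluation_feedback([objective(params)])
    if opt.is_behavior_learning_done():
        break

print(opt.get_best_parameters())                 # best parameter vector found so far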