Policy Embeddings for Guided Optimization and Diversity
This talk introduces a new perspective on reinforcement learning: representing a policy as a probability distribution over the behavior it induces. This behavioral embedding can be any meaningful description of behavior, such as state-visitation frequencies or action distributions (as used in policy gradient methods). First, I propose measuring the distance between embeddings in this latent behavioral space using the dual formulation of the Wasserstein distance (WD). With this approach, we can learn score functions over policy behaviors that in turn guide policy optimization towards desired behaviors or away from undesired ones. Next, I introduce a novel approach to measuring the diversity of a population of agents, inspired by Determinantal Point Processes (DPPs). Finally, I will discuss some exciting future directions that make use of this perspective, which I hope will inspire new research.
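To give a flavor of the Wasserstein-distance idea, here is a minimal sketch. By the Kantorovich-Rubinstein duality, the 1-Wasserstein distance is the supremum over 1-Lipschitz functions f of E_p[f] - E_q[f]; any fixed 1-Lipschitz "score function" therefore yields a lower bound, and the talk's approach learns such a score over behavioral embeddings. The specific embedding samples and the choice f(x) = x below are hypothetical illustrations, compared against the exact 1-D distance:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Hypothetical 1-D behavioral embedding samples from two policies
# (e.g. a single state-visitation feature per trajectory).
rng = np.random.default_rng(0)
p = rng.normal(0.0, 1.0, size=2000)  # samples from policy 1's behavior
q = rng.normal(2.0, 1.0, size=2000)  # samples from policy 2's behavior

# f(x) = x is 1-Lipschitz, so |E_p[f] - E_q[f]| lower-bounds W1(p, q).
dual_lower_bound = abs(p.mean() - q.mean())
exact = wasserstein_distance(p, q)  # exact 1-D primal computation
print(dual_lower_bound <= exact + 1e-9)  # True: the dual bound holds
```

In the learned setting, f is parameterized (e.g. by a neural network) and trained to maximize the dual objective, after which it can score arbitrary behaviors and steer optimization towards or away from them.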
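The DPP-inspired diversity measure can likewise be sketched: score a population by the determinant of a similarity kernel over its members' behavioral embeddings, which grows as behaviors spread apart and collapses towards zero for near-duplicates. The RBF kernel and bandwidth below are assumptions for illustration, not the talk's specific choice:

```python
import numpy as np

def dpp_diversity(embeddings, sigma=1.0):
    """Population diversity as det of an RBF kernel over behavioral
    embeddings (sketch; kernel choice is a hypothetical assumption).
    embeddings: array of shape (n_agents, embedding_dim)."""
    sq_dists = np.sum(
        (embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=-1
    )
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return np.linalg.det(K)

# A population of distinct behaviors scores higher than near-clones.
distinct = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
clones = np.array([[0.0, 0.0], [0.01, 0.0], [0.0, 0.01]])
print(dpp_diversity(distinct) > dpp_diversity(clones))  # True
```

Because the determinant rewards geometric volume in the embedding space, adding an agent whose behavior duplicates an existing one contributes almost nothing, which is exactly the repulsion property DPPs are known for.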