## Nived Rajaraman (UC Berkeley)
## Title and Abstract
Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand whether the difficulty of the problem scales as the planning horizon of the problem increases. I will talk about a recent work by Wang et al that provides a partial answer towards this question, which is surprisingly negative! In a COLT 2018 open problem, Jiang and Agarwal conjectured that, for tabular, episodic reinforcement learning problems, there exists a sample complexity lower bound which exhibits a polynomial dependence on the horizon - a conjecture which is consistent with all known sample complexity upper bounds. This work refutes this conjecture, proving that tabular, episodic reinforcement learning is possible with a sample complexity that scales only logarithmically with the planning horizon. In other words, when the values are appropriately normalized (to lie in the unit interval), this results shows that long horizon RL is not necessarily more difficult than short horizon RL, at least in a minimax sense. The analysis introduces two ideas: (i) the construction of an ε-net for optimal policies whose log-covering number scales only logarithmically with the planning horizon, and (ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all policies in a given policy class using sample complexity that scales with the log-covering number of the given policy class. In this talk, I will not get too much into the technical details, and instead spend some efforts in trying to unravel the intuition behind why such a result should in fact not be impossible |