Suppose we have some [[trajectory dataset]]
(e.g. lots of previous users of a platform or video game).
How to find an (approximately) [[optimal policy]] *without* access to the environment?
see also [[imitation learning]] when the data comes from an expert;
see [[transfer learning in reinforcement learning]] for [[covariate shift]] issue
another way is to treat rl as a [[stochastic process|time series]] / [[self supervised reinforcement learning]] task
# sources
[[Sergey Levine]] does a lot of work on this
[Reinforcement Learning with Large Datasets: a Path to Resourceful Autonomous Agents - YouTube](https://youtu.be/D74DRVWLS7A)