for a given policy $\pi$,
how to approximate its [[Q function|action-value function]] $Q[\pi]$
using [[on policy]] data, i.e. a trajectory $\tau \sim \pi$?
this is the *approximate policy evaluation* aka *value approximation* problem.
^problem
(here we learn $Q$ but the reasoning for $V$ is essentially identical)
[[supervised]] learning (ie [[regression]]) is a useful tool for [[function approximation]].
choose some [[parameter]]ized [[prediction rule|prediction rule class]] $\hat{Q} : \Theta \to (\mathcal{S} \times \mathcal{A} \to \mathbb{R})$
and fit it by solving a [[residual square sum|residual-sum-of-squares]] [[regression]] problem:
$
\arg\min_{ \theta \in \Theta } \frac{1}{2H} \sum_{h=0}^{H-1} (q^{*}_{h}(\tau) - \hat{Q}[\theta](s_{h}, a_{h}))^{2}
$
^regression
where $q^{*}_{h}(\tau)$, the [[value target]],
is some [[point estimator]] of $Q[\pi](s_{h}, a_{h})$ computed from $\tau$.
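a minimal sketch in JAX of the [[#^regression]] fit, assuming a linear prediction rule over hand-designed features $\phi(s, a)$ and the Monte Carlo return-to-go as the value target (all names here are illustrative, not from any particular library):
```python
import jax
import jax.numpy as jnp

def q_hat(theta, phi_sa):
    # linear prediction rule: Q_hat[theta](s, a) = <theta, phi(s, a)>
    return jnp.dot(theta, phi_sa)

def mc_targets(rewards, gamma):
    # q*_h(tau) = sum_{k >= h} gamma^(k - h) r_k, the discounted
    # return-to-go: an unbiased point estimator of Q[pi](s_h, a_h)
    # when the trajectory is sampled on-policy
    def step(g, r):
        g = r + gamma * g
        return g, g
    _, targets = jax.lax.scan(step, 0.0, rewards, reverse=True)
    return targets

def regression_loss(theta, phis, targets):
    # (1 / 2H) * sum_h (q*_h(tau) - Q_hat[theta](s_h, a_h))^2
    preds = jax.vmap(q_hat, in_axes=(None, 0))(theta, phis)
    return 0.5 * jnp.mean((targets - preds) ** 2)
```
any off-the-shelf minimizer of `regression_loss` then solves the [[#^regression]] problem for this class.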
one possible [[optimization algorithm|fitting method]] is [[gradient descent]], applied to one term (one step $h$) at a time:
$
\theta \gets \theta + \eta \Big( q^{*}_{h}(\tau) - \hat{Q}[\theta](s_{h}, a_{h}) \Big) \nabla \Big[\hat Q[\cdot](s_{h}, a_{h})\Big](\theta)
$
^update
where $\eta$ is some [[learning rate]].
note the scalar coefficient $q^{*}_{h}(\tau) - \hat{Q}[\theta](s_{h}, a_{h})$ is the [[temporal difference error]].
this update rule leads to many [[temporal difference learning]] algorithms.
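concretely, a minimal sketch of the [[#^update]] rule for a single step $h$, continuing the linear `q_hat` sketch above (again illustrative; this is one-sample gradient descent on the half-squared error, whose negative gradient is exactly the TD-error-weighted gradient shown):
```python
def update(theta, phi_sa, target, eta):
    # temporal difference error: q*_h(tau) - Q_hat[theta](s_h, a_h)
    td_error = target - q_hat(theta, phi_sa)
    # gradient of theta |-> Q_hat[theta](s_h, a_h) at theta;
    # for the linear rule this is just phi(s_h, a_h)
    grad_q = jax.grad(q_hat)(theta, phi_sa)
    return theta + eta * td_error * grad_q
```
plugging in a bootstrapped target such as $r_{h} + \gamma \hat{Q}[\theta](s_{h+1}, a_{h+1})$ in place of the Monte Carlo return recovers SARSA-style [[temporal difference learning]].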
# [[optimization|realizable]]
suppose $q^{*}_{h}(\tau)$ is [[bias|unbiased]]:
$
\mathop{\mathbb{E}}_{ \tau \sim \rho[\pi] }[q^{*}_{h}(\tau) \mid s_{h}, a_{h}] = Q[\pi](s_{h}, a_{h}).
$
then since [[conditional expectation minimizes mean squared error]],
the minimizer of the expected [[#^regression]] objective is $\hat{Q}[\theta^{*}] = Q[\pi]$
(provided the [[prediction rule|prediction rule class]] we have chosen is [[optimization|realizable]], i.e. contains $Q[\pi]$).
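to spell out the step, condition on $(s_{h}, a_{h})$ and expand the expected squared error (the cross term vanishes by unbiasedness):
$
\mathop{\mathbb{E}}_{ \tau \sim \rho[\pi] }\Big[ \big( q^{*}_{h}(\tau) - \hat{Q}[\theta](s_{h}, a_{h}) \big)^{2} \,\Big|\, s_{h}, a_{h} \Big]
= \mathop{\mathbb{E}}_{ \tau \sim \rho[\pi] }\Big[ \big( q^{*}_{h}(\tau) - Q[\pi](s_{h}, a_{h}) \big)^{2} \,\Big|\, s_{h}, a_{h} \Big]
+ \big( Q[\pi](s_{h}, a_{h}) - \hat{Q}[\theta](s_{h}, a_{h}) \big)^{2}
$
the first term (the variance of the [[value target]]) does not depend on $\theta$, so minimizing the expected objective drives $\hat{Q}[\theta]$ toward $Q[\pi]$ at every visited $(s_{h}, a_{h})$.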
# sources
[[2018SuttonBartoReinforcementLearningIntroduction]] ch 6.6
[Introducing Q-Learning - Hugging Face Deep RL Course](https://huggingface.co/learn/deep-rl-course/unit2/q-learning?fw=pt)
[[2013MnihEtAlPlayingAtariDeep|Playing atari with deep reinforcement learning]].
[Reinforcement Learning (DQN) Tutorial — PyTorch Tutorials](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html)
[[2025GoogleDeepMindRlax]]