for a given policy $\pi$,
how to approximate its [[Q function|action-value function]] $Q[\pi]$
using [[on policy]] data, i.e. a trajectory $\tau \sim \pi$?
this is the *approximate policy evaluation* aka *value approximation* problem.
^problem
(here we learn $Q$ but the reasoning for $V$ is essentially identical)
[[supervised]] learning (ie [[regression]]) is a useful tool for [[function approximation]].
choose some [[parameter]]ized [[prediction rule|prediction rule class]] $\hat{Q} : \Theta \to (\mathcal{S} \times \mathcal{A} \to \mathbb{R})$
and fit it by solving a [[residual square sum|residual-sum-of-squares]] [[regression]] problem:
$
\arg\min_{ \theta \in \Theta } \frac{1}{2H} \sum_{h=0}^{H-1} (q^{*}_{h}(\tau) - \hat{Q}[\theta](s_{h}, a_{h}))^{2}
$
^regression
where $q^{*}_{h}(\tau)$, the [[value target]],
is some [[point estimator]] of $Q[\pi](s_{h}, a_{h})$ computed from $\tau$.
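a minimal sketch in JAX of the [[#^regression]] fit, assuming a linear prediction rule over hand-designed features $\phi(s, a)$ and the Monte Carlo return-to-go as the value target (all names here are illustrative, not from any particular library):
```python
import jax
import jax.numpy as jnp

def q_hat(theta, phi_sa):
    # linear prediction rule: Q_hat[theta](s, a) = <theta, phi(s, a)>
    return jnp.dot(theta, phi_sa)

def mc_targets(rewards, gamma):
    # q*_h(tau) = sum_{k >= h} gamma^(k - h) r_k, the discounted
    # return-to-go: an unbiased point estimator of Q[pi](s_h, a_h)
    # when the trajectory is sampled on-policy
    def step(g, r):
        g = r + gamma * g
        return g, g
    _, targets = jax.lax.scan(step, 0.0, rewards, reverse=True)
    return targets

def regression_loss(theta, phis, targets):
    # (1 / 2H) * sum_h (q*_h(tau) - Q_hat[theta](s_h, a_h))^2
    preds = jax.vmap(q_hat, in_axes=(None, 0))(theta, phis)
    return 0.5 * jnp.mean((targets - preds) ** 2)
```
any off-the-shelf minimizer of `regression_loss` then solves the [[#^regression]] problem for this class.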
one possible [[optimization algorithm|fitting method]] is [[gradient descent]], applied to one term (one step $h$) at a time:
$
\theta \gets \theta + \eta \Big( q^{*}_{h}(\tau) - \hat{Q}[\theta](s_{h}, a_{h}) \Big) \nabla \Big[\hat Q[\cdot](s_{h}, a_{h})\Big](\theta)
$
^update
where $\eta$ is some [[learning rate]].
note the scalar coefficient $q^{*}_{h}(\tau) - \hat{Q}[\theta](s_{h}, a_{h})$ is the [[temporal difference error]].
this update rule leads to many [[temporal difference learning]] algorithms.
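concretely, a minimal sketch of the [[#^update]] rule for a single step $h$, continuing the linear `q_hat` sketch above (again illustrative; this is one-sample gradient descent on the half-squared error, whose negative gradient is exactly the TD-error-weighted gradient shown):
```python
def update(theta, phi_sa, target, eta):
    # temporal difference error: q*_h(tau) - Q_hat[theta](s_h, a_h)
    td_error = target - q_hat(theta, phi_sa)
    # gradient of theta |-> Q_hat[theta](s_h, a_h) at theta;
    # for the linear rule this is just phi(s_h, a_h)
    grad_q = jax.grad(q_hat)(theta, phi_sa)
    return theta + eta * td_error * grad_q
```
plugging in a bootstrapped target such as $r_{h} + \gamma \hat{Q}[\theta](s_{h+1}, a_{h+1})$ in place of the Monte Carlo return recovers SARSA-style [[temporal difference learning]].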
# [[optimization|realizable]]
suppose $q^{*}_{h}(\tau)$ is [[bias|unbiased]]:
$
\mathop{\mathbb{E}}_{ \tau \sim \rho[\pi] }[q^{*}_{h}(\tau) \mid s_{h}, a_{h}] = Q[\pi](s_{h}, a_{h}).
$
then since [[conditional expectation minimizes mean squared error]],
the minimizer of the expected [[#^regression]] objective is $\hat{Q}[\theta^{*}] = Q[\pi]$
(provided the [[prediction rule|prediction rule class]] we have chosen is [[optimization|realizable]], i.e. contains $Q[\pi]$).
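to spell out the step, condition on $(s_{h}, a_{h})$ and expand the expected squared error (the cross term vanishes by unbiasedness):
$
\mathop{\mathbb{E}}_{ \tau \sim \rho[\pi] }\Big[ \big( q^{*}_{h}(\tau) - \hat{Q}[\theta](s_{h}, a_{h}) \big)^{2} \,\Big|\, s_{h}, a_{h} \Big]
= \mathop{\mathbb{E}}_{ \tau \sim \rho[\pi] }\Big[ \big( q^{*}_{h}(\tau) - Q[\pi](s_{h}, a_{h}) \big)^{2} \,\Big|\, s_{h}, a_{h} \Big]
+ \big( Q[\pi](s_{h}, a_{h}) - \hat{Q}[\theta](s_{h}, a_{h}) \big)^{2}
$
the first term (the variance of the [[value target]]) does not depend on $\theta$, so minimizing the expected objective drives $\hat{Q}[\theta]$ toward $Q[\pi]$ at every visited $(s_{h}, a_{h})$.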
# sources
[[2018SuttonBartoReinforcementLearningIntroduction]] ch 6.6
[Introducing Q-Learning - Hugging Face Deep RL Course](https://huggingface.co/learn/deep-rl-course/unit2/q-learning?fw=pt)
[[2013MnihEtAlPlayingAtariDeep|Playing atari with deep reinforcement learning]].
[Reinforcement Learning (DQN) Tutorial — PyTorch Tutorials](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html)
[[2025GoogleDeepMindRlax]]