In the code, the state is already processed by rollout_k_disks, so the residual target will always be 0. Why do you predict it with a separate head?