Environment
Normal
Forbidden
Target
Reward Configuration
Select the penalty for forbidden states:
Note: Changing this will affect the results in Optimality, Q-Learning, and TD Linear modules.
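As a rough sketch of why the penalty affects the other modules, the reward can be treated as a function of the cell type entered. The function name `reward`, the default values, and the zero reward for normal cells are all assumptions for illustration; the page does not state the actual numbers.

```python
def reward(cell_type, forbidden_penalty=-1.0, target_reward=1.0):
    """Return the reward for entering a cell (hypothetical values).

    cell_type is one of "Normal", "Forbidden", "Target", matching the
    legend above. forbidden_penalty stands in for the selectable penalty.
    """
    if cell_type == "Forbidden":
        return forbidden_penalty
    if cell_type == "Target":
        return target_reward
    if cell_type == "Normal":
        return 0.0  # assumed: normal cells carry no reward
    raise ValueError(f"unknown cell type: {cell_type}")


# A harsher penalty changes the reward every value-based method sees.
mild = reward("Forbidden", forbidden_penalty=-1.0)
harsh = reward("Forbidden", forbidden_penalty=-10.0)
```

Because every value estimate is built from these rewards, changing the penalty changes which paths through forbidden cells are worth taking.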
Policy Configuration
Click on cells to cycle actions (this change only affects the "Evaluate Policy" button further down this page):
Optimality Equation
Convergence Analysis (PI vs Truncated PI)
Step-by-Step Visualization
Policy
Value Function
Q-Learning
Final Policy
Final Value
TD Linear
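Note: A higher Fourier order gives more accurate results but also costs more computation. The results shown here are precomputed, starting from a random initial policy.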
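To illustrate the accuracy and cost trade-off in the note above, here is a minimal sketch of a Fourier cosine feature vector for a two-dimensional grid state. It assumes the state is given as coordinates already scaled to [0, 1]; the function name `fourier_features` is hypothetical and is not taken from this page's implementation.

```python
import itertools
import math


def fourier_features(x, y, order):
    """Fourier cosine features for a 2-D state with x, y in [0, 1].

    One feature per coefficient pair (c1, c2) with 0 <= c1, c2 <= order,
    so the vector has (order + 1) ** 2 entries.
    """
    return [
        math.cos(math.pi * (c1 * x + c2 * y))
        for c1, c2 in itertools.product(range(order + 1), repeat=2)
    ]


# The number of features, and hence of weights to learn, grows
# quadratically with the order in two dimensions.
sizes = {order: len(fourier_features(0.5, 0.5, order)) for order in (1, 2, 3)}
```

With a linear value estimate, the approximation is the dot product of this vector with a weight vector of the same length, so a higher order means more weights to update at every TD step.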