This issue is about the `train` method in `pufferl.py`.
All the data (observations, rewards, actions, terminals, etc) have the shape (segments, horizon, ...).
Horizon is usually 64. The data stores 64 states, but only 63 transitions.
This means `rewards[:, 0]` and `terminals[:, 0]` are never used, which is a tiny waste, but more importantly `advantages[:, -1]` is always 0 (before the normalisation), and `mb_returns[:, -1]` is always equal to `mb_values[:, -1]`.
So for the last sample of every segment, the pg_loss is ~0 (not exactly 0 because of the advantage normalisation), the entropy loss makes the policy more random, and the value-function loss is conservative, i.e. it pushes the value function towards its own previous prediction.
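To make the claim concrete, here is a minimal NumPy sketch of a backward GAE pass under the convention described above (index `t` holds the state at step `t`, so `rewards[t]`/`terminals[t]` describe the transition *into* state `t`). This is an illustration of the indexing issue, not PufferLib's actual code; all names and shapes are assumptions.

```python
import numpy as np

# Hypothetical buffers shaped (segments, horizon): index t is the state at
# step t, so rewards[:, t] / terminals[:, t] belong to the transition INTO t.
segments, horizon, gamma, lam = 2, 64, 0.99, 0.95
rng = np.random.default_rng(0)
rewards = rng.normal(size=(segments, horizon))
values = rng.normal(size=(segments, horizon))
terminals = np.zeros((segments, horizon))

advantages = np.zeros_like(rewards)
last_gae = np.zeros(segments)
# Only horizon - 1 transitions exist, so the loop stops at horizon - 2:
# advantages[:, -1] is never written and stays 0.
for t in reversed(range(horizon - 1)):
    # Transition t -> t+1 reads rewards[:, t+1] / terminals[:, t+1];
    # rewards[:, 0] and terminals[:, 0] are never read at all.
    nonterminal = 1.0 - terminals[:, t + 1]
    delta = rewards[:, t + 1] + gamma * values[:, t + 1] * nonterminal - values[:, t]
    last_gae = delta + gamma * lam * nonterminal * last_gae
    advantages[:, t] = last_gae

returns = advantages + values
assert np.all(advantages[:, -1] == 0)           # last advantage is always 0
assert np.all(returns[:, -1] == values[:, -1])  # last return equals the value
```

Both assertions hold for any reward/value data, which is exactly why the last sample of each segment contributes no policy-gradient signal and a degenerate value target.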
I think the typical way to fix this is to store 64 transitions instead of 64 states.
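For reference, a hedged sketch of the transition-based layout: store `horizon` full transitions `(obs[t], action[t], reward[t], terminal[t])` plus one bootstrap value for the state after the last transition, so every stored sample gets a real advantage. Names, shapes, and the `bootstrap_value` variable are hypothetical, not PufferLib's actual API.

```python
import numpy as np

# Hypothetical fix: rewards[:, t] / terminals[:, t] now belong to the action
# taken at step t, and one extra value estimate V(s_H) bootstraps the tail.
segments, horizon, gamma, lam = 2, 64, 0.99, 0.95
rng = np.random.default_rng(0)
rewards = rng.normal(size=(segments, horizon))
values = rng.normal(size=(segments, horizon))
terminals = np.zeros((segments, horizon))
bootstrap_value = rng.normal(size=segments)  # V(s_H), one extra forward pass

advantages = np.zeros_like(rewards)
last_gae = np.zeros(segments)
# The loop now covers all `horizon` transitions, including the last one.
for t in reversed(range(horizon)):
    next_value = bootstrap_value if t == horizon - 1 else values[:, t + 1]
    nonterminal = 1.0 - terminals[:, t]
    delta = rewards[:, t] + gamma * next_value * nonterminal - values[:, t]
    last_gae = delta + gamma * lam * nonterminal * last_gae
    advantages[:, t] = last_gae

# Every sample, including t = horizon - 1, now has a proper GAE target.
```

The cost is one extra value-network forward pass per rollout segment, in exchange for none of the stored data being wasted.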