This issue is about the `train` method in `pufferl.py`.
All the data (observations, rewards, actions, terminals, etc) have the shape (segments, horizon, ...).
Horizon is usually 64. The data stores 64 states, but only 63 transitions.
This means `rewards[:, 0]` and `terminals[:, 0]` are never used, which is a tiny waste, but more importantly `advantages[:, -1]` is always 0 (before the normalisation), and `mb_returns[:, -1]` is always equal to `mb_values[:, -1]`.
So for the last sample of every segment, the pg_loss is ~0 (not exactly 0 because of the advantage normalisation), the entropy loss makes the policy more random, and the value-function loss is conservative, i.e. it pushes the value function towards its own previous prediction.
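To make the claim concrete, here is a minimal NumPy sketch of a backward GAE pass under the convention described above (index `t` holds the state at step `t`, so `rewards[t]`/`terminals[t]` describe the transition *into* state `t`). This is an illustration of the indexing issue, not PufferLib's actual code; all names and shapes are assumptions.

```python
import numpy as np

# Hypothetical buffers shaped (segments, horizon): index t is the state at
# step t, so rewards[:, t] / terminals[:, t] belong to the transition INTO t.
segments, horizon, gamma, lam = 2, 64, 0.99, 0.95
rng = np.random.default_rng(0)
rewards = rng.normal(size=(segments, horizon))
values = rng.normal(size=(segments, horizon))
terminals = np.zeros((segments, horizon))

advantages = np.zeros_like(rewards)
last_gae = np.zeros(segments)
# Only horizon - 1 transitions exist, so the loop stops at horizon - 2:
# advantages[:, -1] is never written and stays 0.
for t in reversed(range(horizon - 1)):
    # Transition t -> t+1 reads rewards[:, t+1] / terminals[:, t+1];
    # rewards[:, 0] and terminals[:, 0] are never read at all.
    nonterminal = 1.0 - terminals[:, t + 1]
    delta = rewards[:, t + 1] + gamma * values[:, t + 1] * nonterminal - values[:, t]
    last_gae = delta + gamma * lam * nonterminal * last_gae
    advantages[:, t] = last_gae

returns = advantages + values
assert np.all(advantages[:, -1] == 0)           # last advantage is always 0
assert np.all(returns[:, -1] == values[:, -1])  # last return equals the value
```

Both assertions hold for any reward/value data, which is exactly why the last sample of each segment contributes no policy-gradient signal and a degenerate value target.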
I think the typical way to fix this is to store 64 transitions instead of 64 states.
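For reference, a hedged sketch of the transition-based layout: store `horizon` full transitions `(obs[t], action[t], reward[t], terminal[t])` plus one bootstrap value for the state after the last transition, so every stored sample gets a real advantage. Names, shapes, and the `bootstrap_value` variable are hypothetical, not PufferLib's actual API.

```python
import numpy as np

# Hypothetical fix: rewards[:, t] / terminals[:, t] now belong to the action
# taken at step t, and one extra value estimate V(s_H) bootstraps the tail.
segments, horizon, gamma, lam = 2, 64, 0.99, 0.95
rng = np.random.default_rng(0)
rewards = rng.normal(size=(segments, horizon))
values = rng.normal(size=(segments, horizon))
terminals = np.zeros((segments, horizon))
bootstrap_value = rng.normal(size=segments)  # V(s_H), one extra forward pass

advantages = np.zeros_like(rewards)
last_gae = np.zeros(segments)
# The loop now covers all `horizon` transitions, including the last one.
for t in reversed(range(horizon)):
    next_value = bootstrap_value if t == horizon - 1 else values[:, t + 1]
    nonterminal = 1.0 - terminals[:, t]
    delta = rewards[:, t] + gamma * next_value * nonterminal - values[:, t]
    last_gae = delta + gamma * lam * nonterminal * last_gae
    advantages[:, t] = last_gae

# Every sample, including t = horizon - 1, now has a proper GAE target.
```

The cost is one extra value-network forward pass per rollout segment, in exchange for none of the stored data being wasted.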