Commit 95b82c8
drop episodes with 0 advantages or truncated (#580)
Summary:
Episodes whose rewards are all 0 or all 1 do not help learning, because their advantages would be 0. Also, episodes whose generations are truncated by max_res_tokens would mostly get a reward of 0 unnecessarily, since most answers come at the end of the response.
Dropping these episodes gives the trainer better batches to learn from (at the cost of sampling efficiency).
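
A minimal sketch of the filtering idea described above, not the code from this commit. It assumes advantages are obtained by normalizing rewards within a group of episodes sampled for the same prompt, and that truncated episodes still count toward the group statistics; the commit message does not confirm either. `Episode`, `filter_group`, and all field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Episode:
    # Hypothetical container; field names are illustrative, not from the commit.
    reward: float
    truncated: bool  # True if generation stopped at the response-token limit
    advantage: float = 0.0


def filter_group(episodes: list[Episode], eps: float = 1e-4) -> list[Episode]:
    """Compute group-normalized advantages, then drop episodes with no signal."""
    if not episodes:
        return []
    rewards = [ep.reward for ep in episodes]
    mu, sigma = mean(rewards), pstdev(rewards)
    for ep in episodes:
        # Assumption: advantage = (reward - group mean) / (group std + eps).
        ep.advantage = (ep.reward - mu) / (sigma + eps)
    # Keep only episodes that finished and have a non-zero advantage.
    return [
        ep for ep in episodes
        if not ep.truncated and abs(ep.advantage) > 1e-6
    ]
```

When every reward in a group is identical, each advantage is exactly 0 and the whole group is dropped, which is the sampling-efficiency cost the summary mentions.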
{F1983571844}
{F1983571853}
Reviewed By: casteryh
Differential Revision: D872436211
parent 5daec1b commit 95b82c8
1 file changed, +23 -1 lines changed

| Original file lines | Diff lines | Change |
|---|---|---|
| 31-37 | 31-36 | 1 line removed (original line 34) |
| 250-255 | 249-259 | 5 lines added (diff lines 252-256) |
| 396-401 | 400-423 | 18 lines added (diff lines 403-420) |