Commit 1ab4353
document default parameters for streaming diloco (#1308)
Summary:
Document why the default parameters for streaming DiLoCo are set the
way they are.
Test Plan:
```
$ NGPU=2 ./run_train.sh --fault_tolerance.enable --fault_tolerance.group_size=1 --fault_tolerance.semi_sync_method=diloco --fault_tolerance.sync_steps=2 --fault_tolerance.replica_id=0 --fault_tolerance.fragment_sync_delay=1 --fault_tolerance.fragment_update_alpha=0.0
[rank0]:[titan] 2025-06-16 09:39:08,893 - root - INFO - Model llama3 debugmodel size: 6,270,208 total parameters
[rank0]:[titan] 2025-06-16 09:39:08,894 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[titan] 2025-06-16 09:39:08,952 - root - INFO - Applied FSDP to the model
[rank0]:[titan] 2025-06-16 09:39:09,375 - root - WARNING - Peak flops undefined for: NVIDIA PG509-210, fallback to A100
[rank0]:[titan] 2025-06-16 09:39:09,376 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
[rank0]:[titan] 2025-06-16 09:39:09,376 - root - INFO - CUDA memory usage for model: 0.03GiB(0.04%)
[rank0]:[titan] 2025-06-16 09:39:09,377 - root - INFO - Trainer is initialized with local batch size 8, global batch size 16, gradient accumulation steps 1, sequence length 2048, total steps 10 (warmup 2).
[rank0]:[titan] 2025-06-16 09:39:09,377 - root - INFO - Training starts at step 1.
[rank0]:[titan] 2025-06-16 09:39:10,325 - root - INFO - step: 1 loss: 8.1934 memory: 1.26GiB(1.59%) tps: 11,442 tflops: 0.82 mfu: 0.26%
[rank0]:[titan] 2025-06-16 09:39:10,325 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-06-16 09:39:10,431 - root - INFO - step: 2 loss: 8.1507 memory: 1.35GiB(1.71%) tps: 154,916 tflops: 11.14 mfu: 3.57%
[rank0]:[titan] 2025-06-16 09:39:10,524 - root - INFO - step: 3 loss: 8.0737 memory: 1.35GiB(1.71%) tps: 177,405 tflops: 12.76 mfu: 4.09%
[rank0]:[titan] 2025-06-16 09:39:10,623 - root - INFO - step: 4 loss: 7.8865 memory: 1.35GiB(1.71%) tps: 167,289 tflops: 12.03 mfu: 3.86%
[rank0]:[titan] 2025-06-16 09:39:10,714 - root - INFO - step: 5 loss: 7.7620 memory: 1.35GiB(1.71%) tps: 179,656 tflops: 12.92 mfu: 4.14%
[rank0]:[titan] 2025-06-16 09:39:10,808 - root - INFO - step: 6 loss: 7.5449 memory: 1.35GiB(1.71%) tps: 175,901 tflops: 12.65 mfu: 4.05%
[rank0]:[titan] 2025-06-16 09:39:10,911 - root - INFO - step: 7 loss: 7.3452 memory: 1.35GiB(1.71%) tps: 159,859 tflops: 11.49 mfu: 3.68%
[rank0]:[titan] 2025-06-16 09:39:11,005 - root - INFO - step: 8 loss: 7.2973 memory: 1.35GiB(1.71%) tps: 175,980 tflops: 12.65 mfu: 4.06%
[rank0]:[titan] 2025-06-16 09:39:11,096 - root - INFO - step: 9 loss: 7.1333 memory: 1.35GiB(1.71%) tps: 179,903 tflops: 12.94 mfu: 4.15%
[rank0]:[titan] 2025-06-16 09:39:11,186 - root - INFO - step: 10 loss: 7.0747 memory: 1.35GiB(1.71%) tps: 184,628 tflops: 13.28 mfu: 4.26%
[rank0]:[titan] 2025-06-16 09:39:11,186 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:[titan] 2025-06-16 09:39:13,186 - root - INFO - Training completed
[rank0]:[titan] 2025-06-16 09:39:13,489 - root - INFO - Process group destroyed.
```
1 parent aae7323 commit 1ab4353
1 file changed: +14 -0 lines changed

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 588-590 | 588-590 | context |
| | 591-595 | + (5 lines added) |
| 591-593 | 596-598 | context |
| 597-599 | 602-604 | context |
| | 605-609 | + (5 lines added) |
| 600-606 | 610-616 | context |
| | 617-620 | + (4 lines added) |
| 607-609 | 621-623 | context |