Xgboost Dask #11782
Kim-Jongchan started this conversation in General
Is XGBoost with Dask truly scalable in practice, especially regarding dataset size relative to RAM capacity?

I'm testing distributed training with xgboost.dask (as in the sketch below) and noticed that during DaskDMatrix initialization the entire dataset appears to be loaded into memory across the workers. This seems to limit scalability: even with a Dask cluster, the dataset size cannot exceed total cluster RAM by much.

My understanding was that Dask + XGBoost would allow "out-of-core" or streaming-like training behavior, but it looks like the data still needs to be materialized in memory first.

So, a few sub-questions:

- Does DaskDMatrix or DaskQuantileDMatrix actually load all data into memory on each worker before training starts?
- Is there any truly out-of-core training mode supported (similar to XGBoost's single-machine external memory mode)?
- In practical terms, how large can a dataset be (relative to the total RAM across workers) before Dask-XGBoost starts failing with memory errors?
- Are there known best practices for training on datasets larger than memory, e.g. chunked training, Arrow-based streaming, or saving intermediate histograms?
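For concreteness, here is a minimal sketch of the kind of setup I'm describing. The scheduler address is a placeholder and the random Dask arrays stand in for the real dataset; I use DaskQuantileDMatrix since, as I understand it, it builds the quantized representation directly and so needs less peak memory than DaskDMatrix, though the partitions still have to fit in cluster RAM.

```python
from dask.distributed import Client
import dask.array as da
import xgboost as xgb

# Placeholder scheduler address; replace with your own cluster.
client = Client("tcp://scheduler:8786")

# Synthetic stand-in for the real dataset: lazily chunked Dask arrays.
X = da.random.random((10_000_000, 50), chunks=(100_000, 50))
y = da.random.random(10_000_000, chunks=100_000)

# Builds the quantized (histogram) representation from the partitions.
dtrain = xgb.dask.DaskQuantileDMatrix(client, X, y)

output = xgb.dask.train(
    client,
    {"tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]
```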
Replies: 1 comment

Yes, it needs to load the dataset into main memory. It's horizontal scaling only (using more workers). External memory does support distributed training, just not through Dask; please find the demo in the documentation.
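For reference, the external memory interface is driven by xgboost.DataIter: you feed the data one batch at a time and XGBoost spills a quantized page cache to disk. Below is a minimal single-machine sketch under some assumptions; the file list, the load_batch helper, and the cache directory are placeholders, not part of the library. The distributed variant is what the demo in the documentation covers.

```python
import os
import xgboost as xgb

class BatchIterator(xgb.DataIter):
    """Yields the dataset one batch at a time; XGBoost caches pages on disk."""

    def __init__(self, file_paths, cache_dir="./cache"):
        self._file_paths = file_paths
        self._idx = 0
        # cache_prefix tells XGBoost where to write the on-disk page cache.
        super().__init__(cache_prefix=os.path.join(cache_dir, "xgb"))

    def next(self, input_data):
        # Return False when there are no more batches in this epoch.
        if self._idx == len(self._file_paths):
            return False
        X, y = load_batch(self._file_paths[self._idx])  # placeholder loader
        input_data(data=X, label=y)
        self._idx += 1
        return True

    def reset(self):
        self._idx = 0

it = BatchIterator(["part-0.npz", "part-1.npz", "part-2.npz"])

# Constructing a DMatrix from a DataIter takes the external memory code path:
# batches are streamed in and spilled to the cache instead of held in RAM.
Xy = xgb.DMatrix(it)
booster = xgb.train({"tree_method": "hist"}, Xy, num_boost_round=100)
```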