Xgboost Dask #11782
Kim-Jongchan started this conversation in General
Is XGBoost with Dask truly scalable in practice, especially regarding dataset size relative to RAM capacity?

I'm testing distributed training with xgboost.dask (as in the sketch below) and noticed that during DaskDMatrix initialization the entire dataset appears to be loaded into memory across the workers. This seems to limit scalability: even with a Dask cluster, the dataset size cannot exceed total cluster RAM by much.

My understanding was that Dask + XGBoost would allow "out-of-core" or streaming-like training behavior, but it looks like the data still needs to be materialized in memory first.

So, a few sub-questions:

- Does DaskDMatrix or DaskQuantileDMatrix actually load all data into memory on each worker before training starts?
- Is there any truly out-of-core training mode supported (similar to XGBoost's single-machine external memory mode)?
- In practical terms, how large can a dataset be (relative to the total RAM across workers) before Dask-XGBoost starts failing with memory errors?
- Are there known best practices for training on datasets larger than memory, e.g. chunked training, Arrow-based streaming, or saving intermediate histograms?
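For concreteness, here is a minimal sketch of the kind of setup I'm describing. The scheduler address is a placeholder and the random Dask arrays stand in for the real dataset; I use DaskQuantileDMatrix since, as I understand it, it builds the quantized representation directly and so needs less peak memory than DaskDMatrix, though the partitions still have to fit in cluster RAM.

```python
from dask.distributed import Client
import dask.array as da
import xgboost as xgb

# Placeholder scheduler address; replace with your own cluster.
client = Client("tcp://scheduler:8786")

# Synthetic stand-in for the real dataset: lazily chunked Dask arrays.
X = da.random.random((10_000_000, 50), chunks=(100_000, 50))
y = da.random.random(10_000_000, chunks=100_000)

# Builds the quantized (histogram) representation from the partitions.
dtrain = xgb.dask.DaskQuantileDMatrix(client, X, y)

output = xgb.dask.train(
    client,
    {"tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]
```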
Replies: 1 comment

Yes, it needs to load the dataset into main memory. It's horizontal scaling only (using more workers). External memory does support distributed training, just not through Dask; please find the demo in the documentation.
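For reference, the external memory interface is driven by xgboost.DataIter: you feed the data one batch at a time and XGBoost spills a quantized page cache to disk. Below is a minimal single-machine sketch under some assumptions; the file list, the load_batch helper, and the cache directory are placeholders, not part of the library. The distributed variant is what the demo in the documentation covers.

```python
import os
import xgboost as xgb

class BatchIterator(xgb.DataIter):
    """Yields the dataset one batch at a time; XGBoost caches pages on disk."""

    def __init__(self, file_paths, cache_dir="./cache"):
        self._file_paths = file_paths
        self._idx = 0
        # cache_prefix tells XGBoost where to write the on-disk page cache.
        super().__init__(cache_prefix=os.path.join(cache_dir, "xgb"))

    def next(self, input_data):
        # Return False when there are no more batches in this epoch.
        if self._idx == len(self._file_paths):
            return False
        X, y = load_batch(self._file_paths[self._idx])  # placeholder loader
        input_data(data=X, label=y)
        self._idx += 1
        return True

    def reset(self):
        self._idx = 0

it = BatchIterator(["part-0.npz", "part-1.npz", "part-2.npz"])

# Constructing a DMatrix from a DataIter takes the external memory code path:
# batches are streamed in and spilled to the cache instead of held in RAM.
Xy = xgb.DMatrix(it)
booster = xgb.train({"tree_method": "hist"}, Xy, num_boost_round=100)
```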