Optimizing ML pipeline #11409
Hi,

My pipeline setup:

Is there a way to make this process faster? To skip some steps?

Potential ideas:

- I would try to avoid pandas in step 3 and use …
- Is there a way to load data from Spark memory or HDFS directly to …?
Replies: 2 comments
Not a Spark expert. But here are a few things that might be useful (see the sketch after this list):

- Use `QuantileDMatrix` for the `dtrain` (training dataset) instead of `DMatrix` when using the `hist` tree method (default).
- Don't use `DMatrix` to load files; use pandas to load parquet or numpy to load its data, then pass them to XGBoost.
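A minimal sketch of both points. The file name `train.parquet` and the `label` column are placeholders, not something from this thread:

```python
import pandas as pd
import xgboost as xgb

# Load the parquet file with pandas instead of pointing DMatrix at a text file.
df = pd.read_parquet("train.parquet")  # placeholder path
X = df.drop(columns=["label"])
y = df["label"]

# QuantileDMatrix builds the histogram bins directly from the input,
# which is cheaper than a full DMatrix when training with the (default)
# hist tree method.
dtrain = xgb.QuantileDMatrix(X, label=y)

booster = xgb.train({"tree_method": "hist"}, dtrain, num_boost_round=100)
```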
In addition to the above comment, you can consider using the (py)spark interface of XGBoost. However, getting rid of the text-based inputs should be the first step before trying anything else.
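A rough sketch of what the `xgboost.spark` estimator interface looks like, assuming a parquet dataset on HDFS with a `label` column (both are assumptions, not details from the question):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier

spark = SparkSession.builder.getOrCreate()

# Read the training data straight from parquet on HDFS, no pandas round-trip.
df = spark.read.parquet("hdfs:///data/train.parquet")  # placeholder path

# The Spark estimator expects a single vector features column.
assembler = VectorAssembler(
    inputCols=[c for c in df.columns if c != "label"],
    outputCol="features",
)
train_df = assembler.transform(df)

# Train distributed XGBoost directly on the Spark DataFrame.
clf = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=4,
    tree_method="hist",
)
model = clf.fit(train_df)
```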