-
Notifications
You must be signed in to change notification settings - Fork 600
Labels
Description
ML::HDBSCAN::detail::Condense::build_condensed_hierarchy becomes a bottleneck within end-to-end HDBSCAN runs after optimizing the kNN construction (use NN Descent for kNN).
The below is the breakdown of running HDBSCAN on subsamples of the Appliances Amazon review dataset (1.8M x 768).
It can be seen that especially when we optimize the MR (mutual reachability) kNN step NN Descent, the build_condensed_hierarchy takes up a very large portion of the e2e time.
If the tree turns out to be lopsided or very deep (which can happen depending on the data distribution), or if the branching factor is small at the top of the tree, this can take up a lot of time. Below is the breakdown for a make_blobs synthetic dataset.
