Skip to content

Optimize build_condensed_hierarchy #7377

@jinsolp

Description

@jinsolp

ML::HDBSCAN::detail::Condense::build_condensed_hierarchy becomes a bottleneck within end-to-end HDBSCAN runs after optimizing the kNN construction (use NN Descent for kNN).

The below is the breakdown of running HDBSCAN on subsamples of the Appliances Amazon review dataset (1.8M x 768).
It can be seen that especially when we optimize the MR (mutual reachability) kNN step NN Descent, the build_condensed_hierarchy takes up a very large portion of the e2e time.

Image

If the tree turns out to be lopsided or very deep (which can happen depending on the data distribution), or if the branching factor is small at the top of the tree, this can take up a lot of time. Below is the breakdown for a make_blobs synthetic dataset.

Image

Metadata

Metadata

Assignees

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions