Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
74 changes: 74 additions & 0 deletions MIGRATION_GUIDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -1387,6 +1387,80 @@ Notes:
`ProposalGenerator` and provide your own `_build` returning a C++
proposal-generator object if you need to extend the set.

### Agglomerative Cluster Policies

`bioimage_cpp.graph.agglomeration` provides hierarchical agglomerative
clustering driven by a small set of policy classes, matching the policies
in `nifty.graph.agglo`. Each policy is a max-heap-style driver (smaller
edge indicator = stronger merge candidate, matching nifty's convention)
with policy-specific priority computation, merge rule, and stopping
criterion. All policies accept any `UndirectedGraph` subclass —
`RegionAdjacencyGraph`, `GridGraph2D`/`GridGraph3D` included.

Nifty:

```python
import nifty.graph.agglo as nagglo

# Hierarchical, edge-weighted clustering.
policy = nagglo.edgeWeightedClusterPolicy(
graph=graph,
edgeIndicators=edge_indicators,
edgeSizes=edge_sizes,
nodeSizes=node_sizes,
numberOfNodesStop=number_of_clusters_stop,
sizeRegularizer=0.5,
)
labels = nagglo.agglomerativeClustering(policy).run().result()
```

bioimage-cpp:

```python
labels = bic.graph.agglomeration.EdgeWeightedClusterPolicy(
num_clusters_stop=number_of_clusters_stop,
size_regularizer=0.5,
).optimize(graph, edge_indicators, edge_sizes=edge_sizes, node_sizes=node_sizes)
```

Mapping:

| Nifty | bioimage-cpp |
| --- | --- |
| `edgeWeightedClusterPolicy(...)` | `EdgeWeightedClusterPolicy(num_clusters_stop=, size_regularizer=).optimize(graph, edge_indicators, edge_sizes=, node_sizes=)` |
| `nodeAndEdgeWeightedClusterPolicy(...)` | `NodeAndEdgeWeightedClusterPolicy(num_clusters_stop=, size_regularizer=, beta=).optimize(graph, edge_indicators, node_features, edge_sizes=, node_sizes=)` |
| `malaClusterPolicy(...)` | `MalaClusterPolicy(num_bins=, bin_min=, bin_max=, num_clusters_stop=, num_edges_stop=, threshold=).optimize(graph, edge_indicators)` |
| `gaspClusterPolicy(...)` (signed weights + linkage) | `GaspClusterPolicy(num_clusters_stop=, linkage=).optimize(graph, edge_weights, edge_sizes=, is_mergeable=)` |

`GaspClusterPolicy` linkage strings map to the rules in Bailoni et al.'s
GASP framework: `"sum"`, `"mean"`, `"max"`, `"min"`, `"abs_max"`,
`"mutex_watershed"`. The `mutex_watershed` linkage treats a negative
heap-top weight as a cannot-link constraint; the others apply the chosen
linkage update without imposing hard constraints from signs. The
optional `is_mergeable` mask marks edges that should be used only to
install cluster-level cannot-link constraints.

Differences from nifty:

- `optimize` returns dense `uint64` node labels directly. Nifty exposes a
separate driver (`agglomerativeClustering(policy).run().result()`); the
underlying loop is the same.
- Both `float32` and `float64` inputs are accepted; computation runs in
`float64` internally.
- Tie-breaks follow the deterministic order of edge ids returned by
`UndirectedGraph`, which may differ from nifty's. On inputs where many
edges share the same indicator value, this combines with the
hierarchical agglomeration's positive feedback loop (each tied merge
changes node sizes, which changes the harmonic size factor `sFac`,
which changes future priorities) to give cascading divergence. On the
external multicut problem sample C/medium, where 86% of indicator
values are non-unique, perturbing the indicators of a single bic run by
1e-9 random noise can change the final partition's adjusted Rand index
vs. its own unperturbed output by ~0.5 (the algorithm is chaotically
sensitive to tie-breaking under non-zero `size_regularizer`). Both
partitions are valid clusterings; partition agreement (VI, ARI) is the
appropriate comparison metric, not label equality.

### Projecting RAG Node Labels to Pixels

Nifty projects scalar node data back to pixels with
Expand Down
307 changes: 307 additions & 0 deletions development/graph/agglomeration/PERFORMANCE_NOTES.md

Large diffs are not rendered by default.

171 changes: 171 additions & 0 deletions development/graph/agglomeration/_compatibility.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,171 @@
"""Benchmark scaffolding for comparing bioimage-cpp agglomeration policies
against the corresponding ``nifty.graph.agglo`` implementations.

Loads the external multicut problem (a generic edge list + costs) and
reinterprets the costs as boundary indicators (after a sigmoid) so the
policies have something realistic to chew on. Reports median runtime over
``--repeats`` invocations and partition agreement (variation of information
and adjusted Rand index) between the two implementations.
"""

from __future__ import annotations

import argparse
from statistics import median
from time import perf_counter
from typing import Callable

import numpy as np


def parser(description: str) -> argparse.ArgumentParser:
arg_parser = argparse.ArgumentParser(description=description)
arg_parser.add_argument(
"--sample",
default="A",
choices=["A", "B", "C"],
help="Multicut problem sample to load.",
)
arg_parser.add_argument(
"--size",
default="small",
choices=["small", "medium"],
help="Multicut problem size to load (small ~ 60k nodes, medium ~ 700k).",
)
arg_parser.add_argument(
"--repeats",
type=int,
default=3,
help="Number of timed repeats per implementation.",
)
arg_parser.add_argument(
"--num-clusters-stop",
type=int,
default=200,
help="Stop when this many clusters remain (must be > 1 to keep both "
"implementations from collapsing the whole graph).",
)
arg_parser.add_argument(
"--size-regularizer",
type=float,
default=0.5,
help="Size regulariser exponent for the edge-weighted policies.",
)
arg_parser.add_argument(
"--threshold",
type=float,
default=0.5,
help="Threshold for the MALA policy.",
)
arg_parser.add_argument(
"--timeout",
type=float,
default=120.0,
help="Download timeout in seconds if the external problem is not cached.",
)
return arg_parser


def load_problem(sample: str = "A", size: str = "small", *, timeout: float = 120.0):
"""Load a multicut problem and derive indicator / weight arrays.

Returns ``(bic_graph, nifty_graph, indicators, signed_weights, uv_ids)``
where ``indicators`` are in ``[0, 1]`` (boundary strength) and
``signed_weights`` keeps the original multicut sign (positive = attract).
"""
import bioimage_cpp as bic
import nifty.graph as ng

uv_ids, costs = bic.graph.multicut.load_multicut_problem_data(
sample, size, timeout=timeout
)
n_nodes = int(uv_ids.max()) + 1

bic_graph = bic.graph.UndirectedGraph.from_edges(n_nodes, uv_ids)
nifty_graph = ng.undirectedGraph(n_nodes)
nifty_graph.insertEdges(uv_ids)

# Multicut costs are signed log-odds (positive = attractive, large
# magnitude = certain). Map to a boundary-strength indicator in [0, 1]
# via a sigmoid of the negated cost so 'large positive cost' becomes
# 'small indicator' (weak boundary), matching nifty's convention.
indicators = 1.0 / (1.0 + np.exp(np.asarray(costs, dtype=np.float64)))
indicators = np.ascontiguousarray(indicators.astype(np.float64))
# Signed weights for GASP: keep the multicut sign directly.
signed_weights = np.ascontiguousarray(np.asarray(costs, dtype=np.float64))
return bic_graph, nifty_graph, indicators, signed_weights, uv_ids


def time_call(function: Callable[[], np.ndarray], repeats: int):
timings = []
result = None
for _ in range(repeats):
start = perf_counter()
result = function()
timings.append(perf_counter() - start)
assert result is not None
return timings, result


def variation_of_information(labels_a: np.ndarray, labels_b: np.ndarray) -> float:
labels_a = np.asarray(labels_a).astype(np.int64)
labels_b = np.asarray(labels_b).astype(np.int64)
n = labels_a.size
if n == 0:
return 0.0
_, a_inv, a_counts = np.unique(labels_a, return_inverse=True, return_counts=True)
_, b_inv, b_counts = np.unique(labels_b, return_inverse=True, return_counts=True)
pa = a_counts / n
pb = b_counts / n
contingency = np.zeros((a_counts.size, b_counts.size), dtype=np.float64)
np.add.at(contingency, (a_inv, b_inv), 1.0)
contingency /= n
with np.errstate(divide="ignore", invalid="ignore"):
ha = -np.sum(pa * np.log(pa, where=pa > 0))
hb = -np.sum(pb * np.log(pb, where=pb > 0))
joint = -np.sum(
contingency * np.log(contingency, where=contingency > 0)
)
mutual_info = ha + hb - joint
return float(2.0 * joint - ha - hb - 2.0 * mutual_info + ha + hb)


def adjusted_rand(labels_a: np.ndarray, labels_b: np.ndarray) -> float:
try:
from sklearn.metrics import adjusted_rand_score
except ImportError:
return float("nan")
return float(adjusted_rand_score(labels_a, labels_b))


def report(
name: str,
bic_timings,
nifty_timings,
bic_labels,
nifty_labels,
n_nodes,
n_edges,
*,
sample: str | None = None,
size: str | None = None,
):
vi = variation_of_information(bic_labels, nifty_labels)
ari = adjusted_rand(bic_labels, nifty_labels)
bic_clusters = int(np.unique(bic_labels).size)
nifty_clusters = int(np.unique(nifty_labels).size)
bic_med = median(bic_timings)
nifty_med = median(nifty_timings)
speedup = nifty_med / bic_med if bic_med > 0 else float("nan")
suffix = ""
if sample is not None and size is not None:
suffix = f" [sample {sample} / {size}]"
print(f"policy: {name}{suffix}")
print(f"nodes: {n_nodes}, edges: {n_edges}")
print(f"bioimage_cpp clusters: {bic_clusters}")
print(f"nifty clusters: {nifty_clusters}")
print(f"bioimage_cpp median runtime [s]: {bic_med:.6f}")
print(f"nifty median runtime [s]: {nifty_med:.6f}")
print(f"speedup (nifty / bioimage_cpp): {speedup:.2f}x")
print(f"variation of information: {vi:.6f}")
print(f"adjusted Rand index: {ari:.6f}")
65 changes: 65 additions & 0 deletions development/graph/agglomeration/check_edge_weighted.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
"""Compare bioimage-cpp and nifty edge-weighted agglomerative clustering."""

from __future__ import annotations

import numpy as np

import bioimage_cpp as bic

from _compatibility import load_problem, parser, report, time_call


def main() -> None:
args = parser(__doc__ or "").parse_args()
bic_graph, nifty_graph, indicators, _, _ = load_problem(
args.sample, args.size, timeout=args.timeout
)
n_edges = int(bic_graph.number_of_edges)
n_nodes = int(bic_graph.number_of_nodes)
edge_sizes = np.ones(n_edges, dtype=np.float64)
node_sizes = np.ones(n_nodes, dtype=np.float64)

import nifty.graph.agglo as nagglo

def run_bic() -> np.ndarray:
return bic.graph.agglomeration.EdgeWeightedClusterPolicy(
num_clusters_stop=args.num_clusters_stop,
size_regularizer=args.size_regularizer,
).optimize(
bic_graph,
indicators,
edge_sizes=edge_sizes,
node_sizes=node_sizes,
)

def run_nifty() -> np.ndarray:
policy = nagglo.edgeWeightedClusterPolicy(
graph=nifty_graph,
edgeIndicators=indicators.astype(np.float32),
edgeSizes=edge_sizes.astype(np.float32),
nodeSizes=node_sizes.astype(np.float32),
numberOfNodesStop=args.num_clusters_stop,
sizeRegularizer=args.size_regularizer,
)
clustering = nagglo.agglomerativeClustering(policy)
clustering.run()
return np.asarray(clustering.result(), dtype=np.uint64)

bic_timings, bic_labels = time_call(run_bic, args.repeats)
nifty_timings, nifty_labels = time_call(run_nifty, args.repeats)

report(
"edge_weighted",
bic_timings,
nifty_timings,
bic_labels,
nifty_labels,
n_nodes,
n_edges,
sample=args.sample,
size=args.size,
)


if __name__ == "__main__":
main()
Loading
Loading