1 change: 1 addition & 0 deletions tsml_eval/publications/clustering/__init__.py
@@ -0,0 +1 @@
"""Files for clustering publications."""
109 changes: 109 additions & 0 deletions tsml_eval/publications/clustering/kasba/README.md
@@ -0,0 +1,109 @@
# 📘 KASBA: k-means Accelerated Stochastic Subgradient Barycentre Averaging
**Official Repository for the KASBA Time Series Clustering Paper**

This repository accompanies the paper:

> **Rock the KASBA: Blazingly Fast and Accurate Time Series Clustering**
>
> https://arxiv.org/abs/2411.17838

KASBA is a $k$-means clustering algorithm for time series clustering (TSCL) that uses the Move-Split-Merge (MSM) elastic distance at every stage of clustering. It applies a randomised stochastic subgradient descent to find barycentre centroids, links each stage of clustering to accelerate convergence, and exploits the metric property of the MSM distance to avoid a large proportion of distance calculations. It is a versatile and scalable clusterer designed for real-world TSCL applications, and it lets practitioners balance runtime against clustering performance when similarity is best measured by an elastic distance.
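
To illustrate how a metric distance can avoid distance calculations, here is a minimal sketch of triangle-inequality pruning in the assignment step of $k$-means. It is not KASBA's implementation: Euclidean distance stands in for MSM, and the function names are assumptions made for this example. The rule used is that if `d(c_best, c_j) >= 2 * d(x, c_best)`, then `c_j` cannot be closer to `x` than `c_best`, so `d(x, c_j)` need not be computed.

```python
import math


def euclidean(a, b):
    """Stand-in metric; KASBA itself uses MSM, which is not implemented here."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(points, centres, dist=euclidean):
    """Assign each point to its nearest centre, skipping provably farther ones."""
    k = len(centres)
    # Centre-to-centre distances are computed once per assignment pass.
    centre_dist = [
        [dist(centres[i], centres[j]) for j in range(k)] for i in range(k)
    ]
    labels, skipped = [], 0
    for p in points:
        best, best_d = 0, dist(p, centres[0])
        for j in range(1, k):
            # Triangle inequality: d(p, c_j) >= d(c_best, c_j) - d(p, c_best).
            # If d(c_best, c_j) >= 2 * d(p, c_best), c_j cannot be closer.
            if centre_dist[best][j] >= 2.0 * best_d:
                skipped += 1
                continue
            d = dist(p, centres[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


def assign_brute_force(points, centres, dist=euclidean):
    """Reference assignment that computes every point-to-centre distance."""
    labels = []
    for p in points:
        dists = [dist(p, c) for c in centres]
        labels.append(dists.index(min(dists)))
    return labels
```

The pruned assignment returns the same labels as the brute-force version. The saving matters most when each distance call is expensive, as is the case for elastic distances.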

KASBA delivers state-of-the-art clustering performance while achieving 1–3 orders of magnitude speedups over existing elastic distance–based k-means algorithms.

This repository contains the exact model configurations, experiment scripts, and visualisation tools used to produce the results in the paper.

---

## 📁 Repository Structure

    kasba/
    ├── README.md                     # This file
    ├── __init__.py
    ├── _utils.py                     # Internal utilities used across the project
    ├── _model_configuration.py       # Definitions of all models and configurations used in experiments
    ├── _experiment_script.py         # Script used to run experiments on datasets
    ├── kasba.ipynb                   # Notebook demonstrating how to run KASBA
    ├── result_visualisation.ipynb    # Notebook for generating CD diagrams, MCM plots, etc.
    └── results/                      # Raw CSV result files used in the paper
        ├── combined                  # Subfolder for combined results
        │   ├── k-shape-compare       # Results for Section 5.4
        │   └── section-5.1           # Results for Section 5.1
        └── train-test                # Subfolder for train and test results
            ├── section-5.1           # Results for Section 5.1
            ├── section-5.2           # Results for Section 5.2
            └── section-5.3           # Results for Section 5.3


## 🚀 Getting Started

### Install dependencies

From the root of tsml-eval, create and activate a virtual environment, then install the package in editable mode:

    python3 -m venv venv
    source venv/bin/activate
    pip install -e .

Until a new aeon release is available, you also need to install a specific
branch of aeon. Run the following commands:

    pip uninstall aeon
    pip install git+https://github.com/aeon-toolkit/aeon@kasba-results#egg=aeon

Note: The project uses aeon, numpy, matplotlib, and other standard scientific Python packages.

---

## 🧪 Running KASBA

Minimal example from the kasba.ipynb notebook:

    from kasba import KASBA
    from aeon.datasets import load_dataset

    X, y = load_dataset("GunPoint")

    model = KASBA(
        n_clusters=2,
        distance="msm",
        distance_params={
            "c": 1.0
        },
    )

    labels = model.fit_predict(X)

The notebook demonstrates:

- How to use KASBA with different elastic distances
- How to cluster multivariate or unequal-length time series
- How to run multiple initialisations
- How to inspect convergence behaviour

---

## 📊 Reproducing Figures (CD & MCM)

Use the result_visualisation.ipynb notebook to generate:

- Critical Difference diagrams
- Model Comparison Matrices
- Ranking curves and statistical tests
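
Critical Difference diagrams are built on each method's average rank across datasets. As a rough illustration of that ranking step only (this is not the notebook's code, and the function name and input layout are assumptions made for this sketch), average ranks with tied methods sharing the mean rank can be computed as:

```python
def average_ranks(scores, higher_is_better=True):
    """Return the mean rank of each method across datasets (rank 1 = best).

    ``scores`` maps dataset name -> {method name -> score}. Tied methods
    share the mean of the ranks they span.
    """
    totals = {}
    for per_method in scores.values():
        ordered = sorted(
            per_method.items(),
            key=lambda kv: kv[1],
            reverse=higher_is_better,
        )
        i = 0
        while i < len(ordered):
            # Extend j to cover every method tied with the one at position i.
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            # Positions i..j (0-based) span ranks i+1..j+1; share their mean.
            shared_rank = (i + 1 + j + 1) / 2.0
            for name, _ in ordered[i : j + 1]:
                totals[name] = totals.get(name, 0.0) + shared_rank
            i = j + 1
    n_datasets = len(scores)
    return {name: total / n_datasets for name, total in totals.items()}
```

Passing `higher_is_better=False` ranks methods on measures where smaller values are better, such as runtime.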

---

## 📜 Citation

If you use KASBA in academic work, please cite the paper:

C. Holder, A. Bagnall, Rock the kasba: Blazingly fast and accurate time
series clustering, arXiv preprint arXiv:2411.17838 (2024)

(A full BibTeX entry will be added once the paper is published.)

---

## 🤝 Contact

For questions or queries please open an issue on tsml-eval.
1 change: 1 addition & 0 deletions tsml_eval/publications/clustering/kasba/__init__.py
@@ -0,0 +1 @@
"""Files for Rock the KASBA."""
149 changes: 149 additions & 0 deletions tsml_eval/publications/clustering/kasba/_experiment_script.py
@@ -0,0 +1,149 @@
import sys

import numpy as np

from tsml_eval.experiments import (
    run_clustering_experiment as tsml_clustering_experiment,
)
from tsml_eval.publications.clustering.kasba._model_configuration import (
    EXPERIMENT_MODELS,
)
from tsml_eval.publications.clustering.kasba._utils import (
    _parse_command_line_bool,
    check_experiment_results_exist,
    load_dataset_from_file,
)


def run_threaded_clustering_experiment(
    dataset: str,
    clusterer_name: str,
    dataset_path: str,
    results_path: str,
    combine_test_train: bool,
    resample_id: int,
):
    """Run a single clustering experiment and write its results to file.

    The experiment is skipped if results for this model, dataset, and resample
    already exist under ``results_path``.

    Parameters
    ----------
    dataset : str
        Dataset name.
    clusterer_name : str
        Key into ``EXPERIMENT_MODELS`` selecting the clusterer to run.
    dataset_path : str
        Path to the dataset.
    results_path : str
        Path to the results.
    combine_test_train : bool
        Boolean indicating if data should be combined for test and train.
    resample_id : int
        Integer indicating the resample id. Also used as the clusterer's
        ``random_state``.

    Returns
    -------
    str or None
        A skip message if results already exist, otherwise None.
    """
    if clusterer_name not in EXPERIMENT_MODELS:
        raise ValueError(f"Unknown clusterer_name '{clusterer_name}'")

    # Skip if results already exist
    if check_experiment_results_exist(
        model_name=clusterer_name,
        dataset=dataset,
        combine_test_train=combine_test_train,
        path_to_results=results_path,
        resample_id=resample_id,
    ):
        return (
            f"[SKIP] {clusterer_name} (resample {resample_id}): "
            f"results already exist."
        )

    X_train, y_train, X_test, y_test = load_dataset_from_file(
        dataset,
        dataset_path,
        normalize=True,
        combine_test_train=combine_test_train,
        resample_id=0,
    )
    # The number of clusters is taken from the number of distinct class labels.
    n_clusters = np.unique(y_train).size

    factory = EXPERIMENT_MODELS[clusterer_name]
    clusterer = factory(
        n_clusters=n_clusters,
        random_state=resample_id,
        n_jobs=1,
    )

    tsml_clustering_experiment(
        X_train=X_train,
        y_train=y_train,
        clusterer=clusterer,
        results_path=results_path,
        X_test=X_test,
        y_test=y_test,
        n_clusters=n_clusters,
        clusterer_name=clusterer_name,
        dataset_name=dataset,
        resample_id=resample_id,
        data_transforms=None,
        build_test_file=not combine_test_train,
        build_train_file=True,
        benchmark_time=True,
    )
    print(f"[DONE] {clusterer_name} (resample {resample_id})")


# Boolean to toggle if running locally or via command line.
RUN_LOCALLY = True

if __name__ == "__main__":
    """NOTE: To run with command line arguments, set RUN_LOCALLY to False."""
    if RUN_LOCALLY:
        print("RUNNING WITH TEST CONFIG")

        dataset = "GunPoint"
        clusterer_name = "KASBA"
        combine_test_train = True

        dataset_path = (
            "/Users/chrisholder/Documents/Research/datasets/UCR/Univariate_ts"
        )
        results_path = "/Users/chrisholder/projects/kasba-experiments/full_results"
        run_threaded_clustering_experiment(
            dataset=dataset,
            clusterer_name=clusterer_name,
            dataset_path=dataset_path,
            results_path=results_path,
            combine_test_train=combine_test_train,
            resample_id=0,
        )

    else:
        # Expect exactly five positional arguments after the script name.
        if len(sys.argv) != 6:
            print(
                "Usage: python _experiment_script.py "
                "<dataset> <clusterer_name> <dataset_path> <results_path> "
                "<combine_test_train>"
            )
            sys.exit(1)

        dataset = str(sys.argv[1])
        clusterer_name = str(sys.argv[2])
        dataset_path = str(sys.argv[3])
        results_path = str(sys.argv[4])
        combine_test_train = _parse_command_line_bool(sys.argv[5])

        run_threaded_clustering_experiment(
            dataset=dataset,
            clusterer_name=clusterer_name,
            dataset_path=dataset_path,
            results_path=results_path,
            combine_test_train=combine_test_train,
            resample_id=1,
        )
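
The command-line branch of the script reads five positional arguments from `sys.argv` and passes a fixed `resample_id`. As a hedged sketch of an alternative, the same arguments could be parsed with `argparse`. Here `parse_bool` is a hypothetical stand-in for `_parse_command_line_bool` (whose implementation is not shown in this diff), and the `--resample-id` option is an assumption added for illustration, not part of the PR.

```python
import argparse


def parse_bool(value: str) -> bool:
    """Hypothetical stand-in for _parse_command_line_bool (not shown in the diff)."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the script's five positional arguments."""
    parser = argparse.ArgumentParser(description="Run one clustering experiment.")
    parser.add_argument("dataset")
    parser.add_argument("clusterer_name")
    parser.add_argument("dataset_path")
    parser.add_argument("results_path")
    parser.add_argument("combine_test_train", type=parse_bool)
    # Assumption: expose the resample id as an option instead of hard-coding it.
    parser.add_argument("--resample-id", type=int, default=0)
    return parser
```

With this layout, `build_parser().parse_args()` would produce the values to pass to `run_threaded_clustering_experiment`, and `argparse` would generate the usage message automatically.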