Representative Sampling with Realisation-Dependent Stopping

A Python implementation of a realisation-dependent Bayesian stopping algorithm for representative sampling from empirical distributions and datasets.

This repository provides:

a reusable implementation of the stopping algorithm;
helper functions for empirical sampling and empirical CDFs;
Monte Carlo diagnostics for repeated stopping runs;
two examples:
1. sampling from an empirical Student-t distribution;
2. selecting representative rows from a dataset.

What this repository does

The goal of the method is to select a smaller representative sample from a larger empirical distribution or dataset.

In each iteration, one sample is drawn randomly from the distribution, and the stopping algorithm determines when enough samples have been collected. The output is a subset that can be used for downstream analysis, modelling, or manual inspection while remaining representative of the original distribution.

The algorithm provides a way to control the stopping tolerance, which affects how many samples are selected. See The choice of epsilon.

The algorithm also provides a way to track and evaluate representativeness of the selected samples. See The choice of the statistic function g(x).
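As a rough illustration of the loop described above, the sketch below draws one value per iteration and stops once a simple criterion is met. The stopping rule is a stand-in chosen for illustration (the approximate 95% confidence half-width of the running mean of g(x) falling below epsilon); it is not the realisation-dependent Bayesian rule implemented in this repository. The names `sample_until_stable` and `min_samples`, and the Gaussian placeholder data, are hypothetical.

```python
import math
import random


def sample_until_stable(population, statistic_fn, epsilon, min_samples=30, seed=0):
    """Draw one value per iteration until a stand-in stopping rule fires.

    Stand-in rule (an assumption, NOT the repository's Bayesian criterion):
    stop once 1.96 * s / sqrt(n) <= epsilon, where s is the running
    standard deviation of statistic_fn over the values drawn so far.
    """
    rng = random.Random(seed)
    selected = []
    n, mean, m2 = 0, 0.0, 0.0  # Welford running mean and sum of squared deviations
    while n < len(population):  # never draw more values than the population holds
        x = rng.choice(population)  # one random draw (with replacement)
        selected.append(x)
        g = statistic_fn(x)
        n += 1
        delta = g - mean
        mean += delta / n
        m2 += delta * (g - mean)
        if n >= min_samples:
            half_width = 1.96 * math.sqrt(m2 / (n - 1) / n)
            if half_width <= epsilon:
                break
    return selected, mean


# Placeholder population: 5000 standard-normal values (not the Student-t example data).
data_rng = random.Random(1)
population = [data_rng.gauss(0.0, 1.0) for _ in range(5000)]

selected, estimate = sample_until_stable(population, lambda x: x, epsilon=0.1)
print(len(selected), round(estimate, 4))
```

Under this stand-in rule, a smaller epsilon requires a narrower half-width and therefore more draws before the loop stops.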

Repository structure

bayesian-representative-sampling/
│
├── src/
│   └── bayes_rep_sampling/
│       ├── __init__.py
│       ├── stopping.py
│       ├── empirical.py
│       ├── diagnostics.py
│       └── plotting.py
├── examples/
│   ├── 01_student_t_empirical_sampling.py
│   └── 02_dataset_representative_sampling.py
├── data/
├── docs/
│   ├── figures/
│   └── animations/
├── outputs/
├── README.md
└── pyproject.toml

Installation

git clone https://github.com/emilyykchan/bayesian-representative-sampling.git
cd bayesian-representative-sampling

python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -r requirements.txt

Examples

Example 1: empirical Student-t sampling

This example samples from an empirical Student-t distribution and also runs 500 Monte Carlo repetitions to show the distribution of the final stopped sample size.

Run:

python examples/01_student_t_empirical_sampling.py

Example 1 output

Example 1 uses epsilon=0.1 and g(x)=x. The run stopped at a sample size of n=391 with a final estimate of \hat{g}_n=-0.0887. For comparison, the full empirical sample contains 5000 values and has a full-data mean of -0.0141.

The Student-t example generates two plots.

Output plot: Selected samples overlaid on the empirical Student-t distribution

This plot compares the full empirical distribution with the subset selected by the stopping algorithm.
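The same comparison can also be made numerically through empirical CDFs. The sketch below is one possible way to do this, using only the standard library; `make_ecdf` and `max_ecdf_gap` are hypothetical names and are not claimed to match the helpers in `empirical.py`. The data are Gaussian placeholders, and the subset is a plain random subsample rather than the output of the stopping algorithm.

```python
import bisect
import random


def make_ecdf(values):
    """Return a function x -> fraction of values less than or equal to x."""
    ordered = sorted(values)
    n = len(ordered)
    return lambda x: bisect.bisect_right(ordered, x) / n


def max_ecdf_gap(full, subset):
    """Largest absolute ECDF difference, evaluated at the full-data points."""
    full_ecdf = make_ecdf(full)
    subset_ecdf = make_ecdf(subset)
    return max(abs(full_ecdf(x) - subset_ecdf(x)) for x in full)


# Placeholder data: the subset is drawn from the full sample.
rng = random.Random(0)
full = [rng.gauss(0.0, 1.0) for _ in range(5000)]
subset = rng.sample(full, 400)

gap = max_ecdf_gap(full, subset)
print(round(gap, 4))
```

A gap close to 0 indicates that the two empirical CDFs nearly coincide; a gap close to 1 indicates that they barely overlap.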

Diagnostic plot: Distribution of stopped sample sizes over 500 Monte Carlo repetitions

This plot shows how many samples were selected before stopping across 500 repeated runs.
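A diagnostic of this kind can be sketched as a loop over independent runs that records each stopped sample size. In the sketch below, the stopping rule is a simple stand-in (the approximate 95% confidence half-width of the running mean falling below epsilon), not the Bayesian rule used by the repository. The names `stopped_size` and `summarise_stopped_sizes` are hypothetical, and the data are Gaussian placeholders.

```python
import math
import random
import statistics


def stopped_size(population, epsilon, seed, min_samples=30):
    """Run one stand-in stopping run and return the number of draws taken."""
    rng = random.Random(seed)
    n, mean, m2 = 0, 0.0, 0.0  # Welford running mean and sum of squared deviations
    while n < len(population):
        x = rng.choice(population)
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        # Stand-in rule (an assumption): stop when the half-width is small enough.
        if n >= min_samples and 1.96 * math.sqrt(m2 / (n - 1) / n) <= epsilon:
            break
    return n


def summarise_stopped_sizes(population, epsilon, n_reps):
    """Repeat the run with seeds 0..n_reps-1 and summarise the stopped sizes."""
    sizes = [stopped_size(population, epsilon, seed) for seed in range(n_reps)]
    return {
        "repetitions": n_reps,
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
        "min": min(sizes),
        "max": max(sizes),
    }


data_rng = random.Random(1)
population = [data_rng.gauss(0.0, 1.0) for _ in range(5000)]
summary = summarise_stopped_sizes(population, epsilon=0.1, n_reps=500)
print(summary)
```

Using a distinct seed per repetition keeps each run reproducible while letting the stopped sample size vary from run to run.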

Example 2: representative rows from a dataset

This example applies the same algorithm to a dataset and exports the selected representative rows.

Run:

python examples/02_dataset_representative_sampling.py

The choice of `epsilon`

epsilon controls the stopping tolerance.

smaller epsilon usually leads to more samples before stopping;
larger epsilon usually leads to fewer samples before stopping.

In the included examples, we use:

epsilon = 0.2

Always report the value of epsilon when presenting results.

The choice of the statistic function `g(x)`

The function g(x) determines the quantity that is tracked during sampling and therefore helps define how the representative sample is evaluated. In the paper's notation, this is g(x); in this implementation, it is statistic_fn.

In practice, g(x) should be chosen according to the feature of the distribution that matters most for the application.

Examples:

g(x) = x
tracks the first moment and is useful when evaluating how well the selected sample preserves the mean.
g(x) = x**2
gives more emphasis to larger magnitudes and may be useful when variability or tail behaviour matters.
application-specific transformations may also be used if a particular summary of the sampled variable is important.

After sampling, the selected subset should be evaluated against the original data using appropriate summary statistics, such as:

mean;
median;
standard deviation;
relative error;
standardised error.
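The comparison listed above can be sketched as a small report function. The name `compare_to_full` is hypothetical, and the definitions of relative error and standardised error used here are assumptions, since this README does not define them; adjust them to match the convention you report.

```python
import math
import statistics


def compare_to_full(full, subset, statistic_fn=lambda x: x):
    """Compare summary statistics of statistic_fn on the subset and the full data."""
    g_full = [statistic_fn(x) for x in full]
    g_sub = [statistic_fn(x) for x in subset]
    full_mean = statistics.mean(g_full)
    sub_mean = statistics.mean(g_sub)
    error = sub_mean - full_mean
    sub_std = statistics.stdev(g_sub)
    return {
        "full_mean": full_mean,
        "subset_mean": sub_mean,
        "full_median": statistics.median(g_full),
        "subset_median": statistics.median(g_sub),
        "full_std": statistics.stdev(g_full),
        "subset_std": sub_std,
        # Assumed definition: absolute error relative to the full-data mean.
        "relative_error": abs(error) / abs(full_mean) if full_mean != 0 else math.inf,
        # Assumed definition: error in units of the subset's standard error.
        "standardised_error": error / (sub_std / math.sqrt(len(g_sub))),
    }


report = compare_to_full([1, 2, 3, 4, 5, 6], [2, 4, 6])
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

With this definition, the relative error becomes unstable when the full-data mean is close to zero, so a standardised error is often the more informative of the two in that case.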

In the Student-t example, we use:

statistic_fn = lambda x: x

and evaluate how well the selected sample preserves the mean of the empirical distribution.

Citation

If you use this repository, please cite both this software implementation and the original method paper.

Chan, E. Y. K. Bayesian Representative Sampling. GitHub repository, 2026.

Quinn A, Kárný M. Learning for non-stationary Dirichlet processes.
International Journal of Adaptive Control and Signal Processing.
2007;21(10):827–855.

License

This project is released under the BSD 3-Clause License.
