Representative Sampling with Realisation-Dependent Stopping

A Python implementation of a realisation-dependent Bayesian stopping algorithm for representative sampling from empirical distributions and datasets.

This repository provides:

  • a reusable implementation of the stopping algorithm;
  • helper functions for empirical sampling and empirical CDFs;
  • Monte Carlo diagnostics for repeated stopping runs;
  • two examples:
    1. sampling from an empirical Student-t distribution;
    2. selecting representative rows from a dataset.

Student-t sampling animation

What this repository does

The goal of the method is to select a smaller representative sample from a larger empirical distribution or dataset.

In each iteration, one sample is drawn randomly from the distribution, and the stopping algorithm determines when enough samples have been collected. The output is a subset that can be used for downstream analysis, modelling, or manual inspection while remaining representative of the original distribution.
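The loop described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `sequential_sample`, its parameters, and the stopping test (standard error of the running mean of g(x) falling below epsilon) are assumptions that stand in for the Bayesian stopping rule. Draws are assumed to be made with replacement.

```python
import numpy as np


def sequential_sample(data, statistic_fn=lambda x: x, epsilon=0.1,
                      min_samples=10, max_samples=None, seed=0):
    """Draw one value per iteration until a stand-in stopping test fires.

    Hypothetical helper: the stopping test used here is NOT the
    realisation-dependent Bayesian rule, only a simple placeholder.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    max_samples = len(data) if max_samples is None else max_samples
    min_samples = max(min_samples, 2)  # a variance needs at least two draws

    selected = []
    mean, m2 = 0.0, 0.0  # Welford running mean and sum of squared deviations
    for n in range(1, max_samples + 1):
        x = data[rng.integers(len(data))]  # one random draw, with replacement
        selected.append(x)

        g = statistic_fn(x)
        delta = g - mean
        mean += delta / n
        m2 += delta * (g - mean)

        if n >= min_samples:
            std_err = np.sqrt(m2 / (n - 1) / n)
            if std_err < epsilon:  # placeholder stopping criterion
                break

    return np.array(selected), mean


# Empirical Student-t population, similar in spirit to Example 1.
population = np.random.default_rng(1).standard_t(df=5, size=5000)
subset, g_hat = sequential_sample(population, epsilon=0.1)
print(len(subset), g_hat)
```

The running mean is updated incrementally so that the stopping test costs constant time per draw, whatever criterion is eventually plugged in.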

The algorithm provides a way to control the stopping tolerance, which affects how many samples are selected. See The choice of epsilon.

The algorithm also provides a way to track and evaluate the representativeness of the selected samples. See The choice of the statistic function g(x).

Repository structure

bayesian-representative-sampling/
│
├── src/
│   └── bayes_rep_sampling/
│       ├── __init__.py
│       ├── stopping.py
│       ├── empirical.py
│       ├── diagnostics.py
│       └── plotting.py
├── examples/
│   ├── 01_student_t_empirical_sampling.py
│   └── 02_dataset_representative_sampling.py
├── data/
├── docs/
│   ├── figures/
│   └── animations/
├── outputs/
├── README.md
└── pyproject.toml

Installation

git clone https://github.com/emilyykchan/bayesian-representative-sampling.git
cd bayesian-representative-sampling

python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -r requirements.txt

Examples

Example 1: empirical Student-t sampling

This example samples from an empirical Student-t distribution and also runs 500 Monte Carlo repetitions to show the distribution of the final stopped sample size.

Run:

python examples/01_student_t_empirical_sampling.py

Example 1 output

Example 1 uses epsilon=0.1 and g(x)=x. The run stopped at a sample size of n=391 with a final estimate \hat{g}_n=-0.0887. For comparison, the full empirical sample contains 5000 values and has a mean of -0.0141.

The Student-t example generates two plots.

Output plot: Selected samples overlaid on the empirical Student-t distribution

This plot compares the full empirical distribution with the subset selected by the stopping algorithm.

Selected sample overlay
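A numerical counterpart to the overlay plot is the largest gap between the two empirical CDFs. The sketch below is an assumption-laden illustration: `ecdf` is a hypothetical helper written for this example (the repository's own helper in empirical.py may have a different name and signature), and the subset is drawn uniformly at random rather than by the stopping algorithm.

```python
import numpy as np


def ecdf(sample, grid):
    """Empirical CDF of `sample` evaluated at each point of `grid`."""
    sample = np.sort(np.asarray(sample, dtype=float))
    # side="right" counts the sample values that are <= each grid point
    return np.searchsorted(sample, grid, side="right") / len(sample)


rng = np.random.default_rng(0)
full = rng.standard_t(df=5, size=5000)               # full empirical sample
subset = rng.choice(full, size=400, replace=False)   # stand-in for the selected subset

# The subset is contained in the full sample, so evaluating both ECDFs
# at the full sample's points captures the largest gap between them.
grid = np.sort(full)
max_gap = np.max(np.abs(ecdf(full, grid) - ecdf(subset, grid)))
print(f"largest ECDF gap: {max_gap:.4f}")
```

A small gap indicates that the subset follows the shape of the full distribution closely, including in the tails where a histogram overlay is hard to read.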

Diagnostic plot: Distribution of stopped sample sizes over 500 Monte Carlo repetitions

This plot shows how many samples were selected before stopping across 500 repeated runs.

Stopping distribution

Example 2: representative rows from a dataset

This example applies the same algorithm to a dataset and exports the selected representative rows.

Run:

python examples/02_dataset_representative_sampling.py

The choice of epsilon

epsilon controls the stopping tolerance.

  • smaller epsilon usually leads to more samples before stopping;
  • larger epsilon usually leads to fewer samples before stopping.

In the included examples, we use:

epsilon = 0.2

The value of epsilon should always be reported when presenting results.

The choice of the statistic function g(x)

The function g(x) determines the quantity that is tracked during sampling and therefore helps define how the representative sample is evaluated. In the paper notation, this is g(x); in this implementation, it is statistic_fn.

In practice, g(x) should be chosen according to the feature of the distribution that matters most for the application.

Examples:

  • g(x) = x
    tracks the first moment and is useful when evaluating how well the selected sample preserves the mean.

  • g(x) = x**2
    gives more emphasis to larger magnitudes and may be useful when variability or tail behaviour matters.

  • application-specific transformations may also be used if a particular summary of the sampled variable is important.

After sampling, the selected subset should be evaluated against the original data using appropriate summary statistics, such as:

  • mean;
  • median;
  • standard deviation;
  • relative error;
  • standardised error.
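The comparison listed above can be sketched as a small helper. The name `compare_summaries`, the assumption that statistic_fn accepts a NumPy array, and the definitions of relative error and standardised error used here are illustrative choices, not taken from the repository.

```python
import numpy as np


def compare_summaries(full, subset, statistic_fn=lambda x: x):
    """Compare summary statistics of g(x) on the subset and on the full data.

    Hypothetical helper. Assumed definitions:
      relative error     = (subset mean - full mean) / |full mean|
      standardised error = (subset mean - full mean)
                           / (full std / sqrt(subset size))
    """
    g_full = statistic_fn(np.asarray(full, dtype=float))
    g_sub = statistic_fn(np.asarray(subset, dtype=float))

    mean_full, mean_sub = g_full.mean(), g_sub.mean()
    error = mean_sub - mean_full
    std_full = g_full.std(ddof=1)

    return {
        "mean_full": mean_full,
        "mean_subset": mean_sub,
        "median_full": np.median(g_full),
        "median_subset": np.median(g_sub),
        "std_full": std_full,
        "std_subset": g_sub.std(ddof=1),
        # undefined when the full-data mean is exactly zero
        "relative_error": error / abs(mean_full) if mean_full != 0 else float("nan"),
        "standardised_error": error / (std_full / np.sqrt(len(g_sub))),
    }


report = compare_summaries([1, 2, 3, 4, 5], [4, 5])
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Relative error becomes unstable when the full-data mean is close to zero, as it is for a centred Student-t sample, so the standardised error is often the more informative of the two in that setting.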

In the Student-t example, we use:

statistic_fn = lambda x: x

and evaluate how well the selected sample preserves the mean of the empirical distribution.

Citation

If you use this repository, please cite both this software implementation and the original method paper.

Chan, E. Y. K. Bayesian Representative Sampling. GitHub repository, 2026.
Quinn A, Kárný M. Learning for non-stationary Dirichlet processes.
International Journal of Adaptive Control and Signal Processing.
2007;21(10):827–855.

License

This project is released under the BSD 3-Clause License.
