A Python implementation of a realisation-dependent Bayesian stopping algorithm for representative sampling from empirical distributions and datasets.
This repository provides:
- a reusable implementation of the stopping algorithm;
- helper functions for empirical sampling and empirical CDFs;
- Monte Carlo diagnostics for repeated stopping runs;
- two examples:
  - sampling from an empirical Student-t distribution;
  - selecting representative rows from a dataset.
The goal of the method is to select a smaller representative sample from a larger empirical distribution or dataset.
In each iteration, one sample is drawn randomly from the distribution, and the stopping algorithm determines when enough samples have been collected. The output is a subset that can be used for downstream analysis, modelling, or manual inspection while remaining representative of the original distribution.
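The loop described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `sequential_sample`, its parameters, drawing with replacement, and above all the stopping rule are assumptions. A simple standard-error rule stands in for the Bayesian criterion, which is not reproduced here.

```python
import numpy as np


def sequential_sample(data, epsilon, statistic_fn=lambda x: x,
                      min_samples=10, seed=0):
    """Draw one value per iteration until a stand-in stopping rule fires.

    ASSUMPTION: the rule below (standard error of the running mean of
    g(x) falls below epsilon) is a placeholder, not the Bayesian rule
    implemented in this repository.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    selected_idx = []
    g_values = []
    max_samples = len(data)  # hard cap so the loop always terminates
    while len(selected_idx) < max_samples:
        # One random draw per iteration (with replacement: an assumption).
        i = int(rng.integers(len(data)))
        selected_idx.append(i)
        g_values.append(statistic_fn(data[i]))
        n = len(g_values)
        if n >= min_samples:
            std_err = np.std(g_values, ddof=1) / np.sqrt(n)
            if std_err < epsilon:
                break
    return np.array(selected_idx), float(np.mean(g_values))


# Toy empirical distribution: 5000 Student-t draws (df=5 is an assumption).
population = np.random.default_rng(42).standard_t(df=5, size=5000)
idx, g_hat = sequential_sample(population, epsilon=0.1)
print("stopped sample size:", len(idx))
print("final estimate:", g_hat)
```

The returned indices identify the selected subset, so the same pattern would apply to dataset rows by indexing the rows instead of scalar values.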
The stopping tolerance epsilon controls how many samples are selected; see "The choice of epsilon".
The statistic function g(x) determines how the representativeness of the selected samples is tracked and evaluated; see "The choice of the statistic function g(x)".
bayesian-representative-sampling/
│
├── src/
│   └── bayes_rep_sampling/
│       ├── __init__.py
│       ├── stopping.py
│       ├── empirical.py
│       ├── diagnostics.py
│       └── plotting.py
├── examples/
│   ├── 01_student_t_empirical_sampling.py
│   └── 02_dataset_representative_sampling.py
├── data/
├── docs/
│   ├── figures/
│   └── animations/
├── outputs/
├── README.md
└── pyproject.toml
git clone https://github.com/emilyykchan/bayesian-representative-sampling.git
cd bayesian-representative-sampling
python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -r requirements.txt

This example samples from an empirical Student-t distribution and also runs 500 Monte Carlo repetitions to show the distribution of the final stopped sample size.
Run:
python examples/01_student_t_empirical_sampling.py

Example 1 uses epsilon = 0.1 and g(x) = x.
This run stopped at a sample size of n = 391 with final estimate \hat{g}_n = -0.0887. The full empirical sample has 5000 values and a full-data mean of -0.0141, so the estimate differs from the full-data mean by 0.0746 in absolute value.
The Student-t example generates two plots.
Output plot: Selected samples overlaid on the empirical Student-t distribution
This plot compares the full empirical distribution with the subset selected by the stopping algorithm.
Diagnostic plot: Distribution of stopped sample sizes over 500 Monte Carlo repetitions
This plot shows how many samples were selected before stopping across 500 repeated runs.
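A diagnostic of this kind can be sketched as below. The names `stopped_size` and `monte_carlo_stopped_sizes` are hypothetical, and the stopping rule is again a stand-in (standard error of the running mean below epsilon) rather than the repository's Bayesian criterion; only the repeat-and-summarise pattern is the point.

```python
import numpy as np


def stopped_size(data, epsilon, rng, min_samples=10):
    """Run one stopping run and return the number of draws used.

    ASSUMPTION: stand-in rule (standard error of the running mean
    below epsilon), not the repository's Bayesian criterion.
    """
    draws = rng.choice(data, size=len(data), replace=True)
    n = np.arange(2, len(draws) + 1)
    csum = np.cumsum(draws)[1:]
    csum_sq = np.cumsum(draws ** 2)[1:]
    # Running sample variance from cumulative sums (ddof=1).
    var = (csum_sq - csum ** 2 / n) / (n - 1)
    std_err = np.sqrt(np.maximum(var, 0.0) / n)
    hits = np.flatnonzero((std_err < epsilon) & (n >= min_samples))
    return int(n[hits[0]]) if hits.size else len(draws)


def monte_carlo_stopped_sizes(data, epsilon, n_repetitions=500, seed=0):
    """Repeat the stopping run and collect the stopped sample sizes."""
    rng = np.random.default_rng(seed)
    return np.array([stopped_size(data, epsilon, rng)
                     for _ in range(n_repetitions)])


population = np.random.default_rng(1).standard_t(df=5, size=5000)
sizes = monte_carlo_stopped_sizes(population, epsilon=0.1)
print("median stopped size:", np.median(sizes))
print("5th and 95th percentiles:", np.percentile(sizes, [5, 95]))
```

A histogram of `sizes` would give the kind of diagnostic plot described above.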
This example applies the same algorithm to a dataset and exports the selected representative rows.
Run:
python examples/02_dataset_representative_sampling.py

epsilon controls the stopping tolerance:
- a smaller epsilon usually leads to more samples before stopping;
- a larger epsilon usually leads to fewer samples before stopping.
In the included examples, we use:
epsilon = 0.2

The value of epsilon should always be reported when presenting results.
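For rough intuition about how the tolerance drives sample size, consider the classical (non-Bayesian) requirement that the standard error sigma / sqrt(n) of a sample mean fall below epsilon, which gives n >= (sigma / epsilon)^2. The sketch below uses that rule of thumb only; it is not the stopping criterion used in this repository, and `approx_samples_needed` is a hypothetical helper.

```python
import math


def approx_samples_needed(sigma, epsilon):
    """Smallest n with sigma / sqrt(n) <= epsilon.

    ASSUMPTION: classical standard-error rule of thumb, shown only to
    illustrate the direction and rough scale of the epsilon effect.
    """
    return math.ceil((sigma / epsilon) ** 2)


for eps in (0.05, 0.1, 0.2):
    print(f"epsilon={eps}: about {approx_samples_needed(1.0, eps)} samples")
```

Under this rule of thumb, halving epsilon roughly quadruples the number of samples, which is consistent with the qualitative behaviour described above.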
The function g(x) determines the quantity that is tracked during sampling and therefore helps define how the representative sample is evaluated. In the paper's notation, this is g(x); in this implementation, it is statistic_fn.
In practice, g(x) should be chosen according to the feature of the distribution that matters most for the application.
Examples:
- g(x) = x tracks the first moment and is useful when evaluating how well the selected sample preserves the mean.
- g(x) = x**2 gives more emphasis to larger magnitudes and may be useful when variability or tail behaviour matters.
- Application-specific transformations may also be used if a particular summary of the sampled variable is important.
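To make the role of g(x) concrete, the sketch below tracks a running estimate for two choices of statistic function. It assumes that the tracked estimate is the running average of g over the draws so far; the helper `running_estimate` is hypothetical and not part of the package.

```python
import numpy as np


def running_estimate(samples, statistic_fn):
    """Running average of g(x) after each draw.

    ASSUMPTION: the tracked quantity is (1/n) * sum of g(x_i); the
    repository's estimator may be defined differently.
    """
    g = statistic_fn(np.asarray(samples, dtype=float))
    return np.cumsum(g) / np.arange(1, len(g) + 1)


samples = np.array([1.0, -2.0, 3.0, 0.0])
mean_track = running_estimate(samples, lambda x: x)                # g(x) = x
second_moment_track = running_estimate(samples, lambda x: x ** 2)  # g(x) = x**2
print(mean_track)
print(second_moment_track)
```

The second track reacts more strongly to the large-magnitude draws (-2.0 and 3.0), which is why g(x) = x**2 is the more natural choice when spread or tails matter.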
After sampling, the selected subset should be evaluated against the original data using appropriate summary statistics, such as:
- mean;
- median;
- standard deviation;
- relative error;
- standardised error.
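A comparison along these lines might look like the sketch below. The function `compare_to_full` is hypothetical, and the definitions of relative error and standardised error are common conventions assumed here for illustration; the repository may define them differently.

```python
import numpy as np


def compare_to_full(selected, full):
    """Compare a selected subset with the full data on simple summaries.

    ASSUMPTION: the error definitions below are illustrative conventions,
    not necessarily those used in this repository.
    """
    selected = np.asarray(selected, dtype=float)
    full = np.asarray(full, dtype=float)
    diff = selected.mean() - full.mean()
    return {
        # Each pair is (selected subset, full data).
        "mean": (selected.mean(), full.mean()),
        "median": (np.median(selected), np.median(full)),
        "std": (selected.std(ddof=1), full.std(ddof=1)),
        # Relative error of the subset mean (undefined if the full mean is 0).
        "relative_error": abs(diff) / abs(full.mean()),
        # Difference in means, in units of the standard error at this subset size.
        "standardised_error": diff / (full.std(ddof=1) / np.sqrt(len(selected))),
    }


full = np.arange(1.0, 11.0)            # toy data: 1, 2, ..., 10
selected = np.array([2.0, 5.0, 8.0])   # toy subset
report = compare_to_full(selected, full)
for name, value in report.items():
    print(name, value)
```

When the full-data mean is close to zero, as in the Student-t example, the relative error becomes unstable, so the standardised error or the absolute difference is usually the more informative summary.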
In the Student-t example, we use:
statistic_fn = lambda x: x

and evaluate how well the selected sample preserves the mean of the empirical distribution.
If you use this repository, please cite both this software implementation and the original method paper.
Chan, E. Y. K. Bayesian Representative Sampling. GitHub repository, 2026.
Quinn A, Kárný M. Learning for non-stationary Dirichlet processes. International Journal of Adaptive Control and Signal Processing. 2007;21(10):827–855.
This project is released under the BSD 3-Clause License.


