This repository contains the source code for our SIGMOD 2025 demonstration of Fainder, a fast and accurate index for distribution-aware dataset search. The demo consists of two components:
- Frontend: Web-based user interface for interacting with the search engine.
- Backend: Responsible for query parsing, optimization, and execution (including percentile predicates).
The repository is structured as follows:
```
fainder-demo/
├── backend   # main component for query parsing, optimization, and execution
├── docs      # documentation about system design and implementation
├── scripts   # scripts for installing and starting components
└── ui        # user interface
```

Our system uses environment variables to configure its components. You can export these variables in your shell or create a `.env` file in the directory from which you start the components. The following variables are available (variables without a default must be set):
```sh
# General Backend
DATA_DIR= # Directory containing dataset collections
COLLECTION_NAME= # Name of the dataset collection (subdirectory in DATA_DIR)
CROISSANT_DIR=croissant # Subdirectory containing the Croissant files of a collection
EMBEDDING_DIR=embeddings # Subdirectory containing an HNSW index with column names
FAINDER_DIR=fainder # Subdirectory containing Fainder indices for a collection
TANTIVY_DIR=tantivy # Subdirectory containing a keyword index for a collection
METADATA_FILE=metadata.json # JSON file with metadata about a collection
DATASET_SLUG=kaggleRef # Document field with a unique dataset identifier
CROISSANT_STORE_TYPE=dict # Croissant store implementation (dict or file)
CROISSANT_CACHE_SIZE=128 # Size of the Croissant store cache (only relevant for the file store)

# Engine
QUERY_CACHE_SIZE=128 # Maximum number of query results to cache
MIN_USABILITY_SCORE=0.0 # Minimum usability threshold for query results
RANK_BY_USABILITY=True # Boolean to enable/disable ranking by usability
EXECUTOR_TYPE=simple # Query executor implementation (simple, prefiltering, threaded, or threaded_prefiltering)
MAX_WORKERS=os.cpu_count() # Number of threads for parallel execution

# Fainder
FAINDER_N_CLUSTERS=50 # Number of index clusters
FAINDER_BIN_BUDGET=1000 # Bin/storage budget
FAINDER_ALPHA=1.0 # Float value for additive smoothing
FAINDER_TRANSFORM=None # None, standard, robust, quantile, or power
FAINDER_CLUSTER_ALGORITHM=kmeans # kmeans, hdbscan, or agglomerative
FAINDER_DEFAULT=default # Name of the default Fainder configuration
FAINDER_CHUNK_LAYOUT=round_robin # Chunk layout for Fainder indices (round_robin or sequential)
FAINDER_NUM_WORKERS=os.cpu_count() - 1 # Number of threads for exact Fainder index execution
FAINDER_NUM_CHUNKS=os.cpu_count() - 1 # Number of chunks for Fainder indices

# Similarity Search / Embeddings
USE_EMBEDDINGS=True # Boolean to enable/disable embeddings
EMBEDDING_MODEL=all-MiniLM-L6-v2 # Name of the embedding model on Hugging Face
EMBEDDING_BATCH_SIZE=32 # Batch size for embedding generation (during indexing)
HNSW_EF_CONSTRUCTION=400 # Construction parameter for HNSW
HNSW_N_BIDIRECTIONAL_LINKS=64 # Number of bidirectional links for HNSW
HNSW_EF=50 # Search parameter for HNSW

# Frontend
NUXT_API_BASE=http://localhost:8000 # Backend API base URL

# Misc
LOG_LEVEL=INFO # Logging level (TRACE, DEBUG, INFO, WARNING, or ERROR)
```

To use our prototype yourself, you only need to bring a collection of Croissant files enriched with statistical information. See our documentation on the Croissant schema extensions that we define as part of the demo. To reproduce the dataset collection used in our paper, see the dataset-scrapers repository.
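As a minimal example, a `.env` file only needs the variables without defaults; all values here are placeholders:

```shell
# Minimal .env: DATA_DIR and COLLECTION_NAME have no default and must be set
DATA_DIR=./data
COLLECTION_NAME=demo
# Variables that are omitted keep their defaults; override as needed, e.g.:
# LOG_LEVEL=DEBUG
```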
All index data structures are generated automatically by the backend. To that end, place your Croissant files into a folder and set `DATA_DIR` and `COLLECTION_NAME` accordingly (see above; we recommend `./data/<collection_name>/croissant` if you want to use the Docker setup).
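For instance, the recommended layout can be prepared like this (the collection name `demo` and the source path of the Croissant files are placeholders):

```shell
# Create the recommended collection layout for the Docker setup
mkdir -p data/demo/croissant

# Copy one enriched Croissant file per dataset into the folder, e.g.:
# cp /path/to/croissant-files/*.json data/demo/croissant/

# Point the backend at the collection
export DATA_DIR=./data
export COLLECTION_NAME=demo
```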
The backend automatically generates the necessary index files for Fainder, HNSW, and Tantivy if the respective folders do not exist. To recreate the indices, delete the folders and restart the application, or call the `/update_indices` endpoint.
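For example, to force a rebuild for a collection named `demo` (a placeholder) that uses the default subdirectory names; the HTTP method of `/update_indices` and the default backend URL are assumptions:

```shell
# Remove the generated index folders for the collection
rm -rf data/demo/fainder data/demo/embeddings data/demo/tantivy

# Then restart the backend, or trigger a rebuild via the API, e.g.:
# curl -X POST http://localhost:8000/update_indices
```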
You need a recent version of Docker, including Docker Compose 2.22 or later, to run the demo.
Build and start the demo:
```sh
docker compose up --build
```

To stop the containers, press Ctrl+C or run:

```sh
docker compose down
```

For local development, you need:

- Python 3.11-3.13
- Node.js 18 or greater
- A Python package manager (e.g., `pip` or `uv`)
- A Node.js package manager (e.g., `npm`)
We recommend using uv to manage the development environment of the
backend component. You just have to run:

```sh
scripts/install.sh
```

**Note:** The pre-commit configuration expects that you installed the Python dependencies in a virtual environment at `backend/.venv`. If you use a different location, you have to adjust the configuration accordingly.
`eslint` and `vue-tsc` are currently not integrated into the pre-commit hooks. Therefore, you should run `npm run lint` and `npm run typecheck` before committing UI changes.
If you want to use Docker for development, you can use the following command to start the components in development mode:
```sh
COMPOSE_BAKE=true FASTAPI_MODE=dev NUXT_MODE=dev docker compose up --build --watch
```