Sorting experiments to compare DuckDB's sorting implementation with other systems.
Random integer data is generated with `python3 randints.py` in `data/randints`.
TPC-DS data is generated with `python3 dsdgen.py` in `data/tpcds` (requires the `duckdb` Python package).
Warning: generating SF300 takes a long time and requires ~300 GB of disk space!
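The contents of `randints.py` are not reproduced here; purely as an illustration (the row count, value range, seed, and output file name below are assumptions, not the actual script), a random-integer generator might look like:

```python
# Hypothetical sketch of a random-integer data generator in the spirit
# of randints.py; row count, value range, and file name are assumptions.
import random

def generate_randints(path, n=1000, seed=42):
    rng = random.Random(seed)  # fixed seed for reproducible data
    with open(path, "w") as f:
        for _ in range(n):
            f.write(f"{rng.randint(0, 2**31 - 1)}\n")

generate_randints("randints_demo.csv")
```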
Queries are generated with `python3 generate_queries.py` under `queries/randints`, `queries/tpcds/catalog_sales`, and `queries/tpcds/customer`.
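`generate_queries.py` is likewise not shown here; as a rough sketch of what sort-benchmark query generation can look like (the table name, column names, and file layout are assumptions, not the actual script), one could emit one `ORDER BY` query per growing prefix of sort keys:

```python
# Hypothetical sketch of sort-query generation; table/column names and
# the output layout are assumptions, not the actual generate_queries.py.
import os

def write_sort_queries(out_dir, table, columns):
    os.makedirs(out_dir, exist_ok=True)
    # One ORDER BY query per prefix of the column list, so sorting cost
    # can be compared as the number of sort keys grows.
    for i in range(1, len(columns) + 1):
        keys = ", ".join(columns[:i])
        with open(os.path.join(out_dir, f"q{i:02d}.sql"), "w") as f:
            f.write(f"SELECT * FROM {table} ORDER BY {keys};\n")

write_sort_queries("queries_demo", "randints", ["i1", "i2", "i3"])
```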
DuckDB was compared with four other systems: ClickHouse, HyPer, Pandas, and SQLite.
The DuckDB CLI is required, as well as the Python package. Installation details for both can be found at https://duckdb.org.
ClickHouse needs to be compiled; see the READMEs under systems/clickhouse and systems/clickhouse/clickhouse_client.
ClickHouse's Python client is also required:

```shell
python3 -m pip install clickhouse-driver
```

We manually extracted HyPer from the Tableau binary. We are unable to disclose how we got this to work.
Pandas is installed simply with pip:

```shell
python3 -m pip install pandas
```

SQLite comes pre-installed with most Python installations. SQLite's CLI is also required; it can be found in most package managers (if not pre-installed in the OS already).
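As a toy illustration of the kind of measurement the benchmark scripts perform (this is not the actual harness — the table size, schema, and in-memory database here are made up), a sort query can be timed against SQLite from Python using the standard-library `sqlite3` module:

```python
# Toy sketch of timing a sort query in SQLite; not the actual benchmark
# harness — table size, schema, and the in-memory DB are assumptions.
import random
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE randints (i INTEGER)")
rng = random.Random(0)
con.executemany(
    "INSERT INTO randints VALUES (?)",
    ((rng.randint(0, 2**31 - 1),) for _ in range(10_000)),
)

start = time.perf_counter()
rows = con.execute("SELECT i FROM randints ORDER BY i").fetchall()
elapsed = time.perf_counter() - start
print(f"sorted {len(rows)} rows in {elapsed:.4f}s")
```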
To run the experiments, first set the appropriate values in `pathvar.sh`, which include the path to this directory and the location of the `duckdb` binary executable.
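The variable names below are assumptions — check `pathvar.sh` itself for the ones it actually expects — but a typical configuration might look like:

```shell
# Hypothetical pathvar.sh contents; the actual variable names may differ.
export EXPERIMENT_DIR="$HOME/sort-experiments"          # path to this directory
export DUCKDB_BIN="$HOME/duckdb/build/release/duckdb"   # DuckDB CLI executable
```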
The experiments are run with `./run.sh`.
This script can be modified to run only the scale factors or systems of your choice.
Running the experiments creates `results.csv` with query timings under `results/<system>/...`, along with `<query_name>.sql` files in the same directory to mark that a query has completed.
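The timing files can be aggregated with a few lines of standard-library Python. Note that the column names used here (`query`, `time`) are assumptions — inspect `results.csv` for the actual layout:

```python
# Sketch of aggregating the query timings; the column names "query" and
# "time" are assumptions — check results.csv for the actual header.
import csv
import statistics
from collections import defaultdict

def median_timings(path):
    timings = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            timings[row["query"]].append(float(row["time"]))
    # Median per query smooths out per-run noise across repetitions.
    return {q: statistics.median(ts) for q, ts in timings.items()}
```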
Plots were created in a Jupyter notebook. The relevant Python packages are installed with:

```shell
python3 -m pip install notebook matplotlib seaborn
```

Run `jupyter-notebook` (or `jupyter notebook`, depending on the OS) in the `plots` folder, then select the `plots.ipynb` notebook and run all cells to create the plots.