CuQuantum.jl

Julia bindings for the NVIDIA cuQuantum SDK, providing GPU-accelerated density matrix simulation for open quantum systems.

Motivation

Standard Lindblad solvers materialize the Liouvillian as a sparse $D^2 \times D^2$ matrix and compute $\mathcal{L}[\rho]$ via sparse matrix-vector multiply. For $M$ coupled cavities with Fock truncation $d$, the Hilbert space dimension is $D = d^M$. At $M=8$, $D=6{,}561$ and the Liouvillian has 43 million rows — requiring >77 GB just for the sparse matrix.

CuQuantum.jl wraps NVIDIA's cuDensityMat library, which decomposes $\mathcal{L}[\rho]$ into tensor network contractions over small per-mode operators. It never forms the full superoperator, enabling simulation of systems where sparse approaches are infeasible.

When to use cuDensityMat

Regime	Hilbert dim	Best approach
$M \le 4$	$\le 81$	CPU sparse (QuantumToolbox.jl)
$M = 5\text{--}6$	$\le 729$	GPU cuSPARSE SpMV (fastest per-action)
$M \ge 8$	$\ge 6{,}561$	cuDensityMat (only option)

cuDensityMat also supports native batching — many density matrices evolved with different parameters in a single kernel launch — and backward differentiation for parameter gradients, both essential for quantum optimal control.

Features

Lindblad master equation — time-dependent Hamiltonians with dissipation
Time-dependent callbacks — CPU scalar and tensor callbacks for driven systems
Backward differentiation — parameter gradients via VJP (single-GPU)
Native batching — parallel evolution of many states with different parameters
Expectation values — $\text{Tr}(O\rho)$ for arbitrary operators
MPI/NCCL distributed — multi-GPU forward-pass computation
Tensor network contraction — never materializes the full superoperator

Installation

using Pkg
Pkg.add(url="https://github.com/harmoniqs/CuQuantum.jl")

Requires a CUDA-capable GPU (Turing or newer) and Julia 1.10+.

Benchmarks

All benchmarks use $M$ coupled cavities with Fock truncation $d=3$, Kerr nonlinearity + all-to-all coupling + photon loss. See Methodology for the full problem specification.

Single $\mathcal{L}[\rho]$ Action — A100

Time for one evaluation of the Liouvillian action (median, after warmup).

M	D	CuQuantum.jl	cuSPARSE SpMV	Sparse feasible?
2	9	0.27 ms	0.039 ms	yes
4	81	1.22 ms	0.048 ms	yes
6	729	6.45 ms	0.90 ms	yes
8	6,561	620 ms	—	no (>77 GB)
9	19,683	6,742 ms	—	no

Full 100-step RK4 Simulation — A100

M	D	CuQuantum.jl	cuSPARSE RK4	QT.jl CPU
4	81	0.46 s	0.018 s	0.021 s
6	729	2.62 s	0.32 s	5.89 s
8	6,561	250 s	—	—

Batched Evolution — A100

Many density matrices evolved with different coupling strengths in one kernel launch. This is the parameter sweep use case for quantum optimal control.

M	D	Batch	cuDM batched	cuDM sequential	Speedup
2	9	64	0.45 ms	20.6 ms	46×
2	9	256	0.38 ms	70.1 ms	186×
4	81	64	2.79 ms	70.1 ms	25×
4	81	256	8.05 ms	280.7 ms	35×
6	729	64	297 ms	497 ms	1.7×
6	729	256	1,146 ms	1,990 ms	1.7×

At small system sizes ($M=2\text{--}4$), native batching amortizes kernel launch overhead by up to 186×. At $M=6$, the per-action cost dominates and batching gives ~1.7× speedup.

Cross-Framework Comparison — Tesla T4

M	D	CuQuantum.jl	Python CuPy	JAX ext	QT.jl CPU	QuTiP
2	9	0.22 s	1.36 s	0.38 s	0.0003 s	0.005 s
4	81	1.48 s	1.01 s	1.08 s	0.021 s	0.043 s
6	729	60.0 s	64.0 s	64.4 s	6.67 s	12.3 s

Julia wrapper is 7–8% faster than the Python (CuPy) and JAX wrappers over the same C library, due to lower dispatch overhead via direct ccall.

Wrapper Overhead

Wrapper	Single $\mathcal{L}[\rho]$ at $M=6$ (T4)
CuQuantum.jl (Julia)	149 ms
Python CuPy	159 ms (+7%)
JAX extension	161 ms (+8%)

Methodology

All benchmarks use the same physical system:

$$H = \sum_m \chi, n_m(n_m - 1) + \sum_{n \ne m} \kappa, a_n^\dagger a_m$$

$$\mathcal{L}[\rho] = -i[H, \rho] + \gamma \sum_m \left( a_m \rho a_m^\dagger - \tfrac{1}{2}{n_m, \rho} \right)$$

Fock truncation: $d = 3$ per cavity
Parameters: $\chi = 2\pi \times 0.2$, $\kappa = 2\pi \times 0.1$, $\gamma = 0.01$
Initial state: $|1,0,\ldots,0\rangle\langle 1,0,\ldots,0|$
Integration: 100-step RK4 with $dt = 0.01$ (cuDensityMat, cuSPARSE) or DP5 adaptive with $\text{atol}=10^{-8}$, $\text{rtol}=10^{-6}$ (QuantumToolbox.jl, QuTiP)
Timing: wall-clock, excludes JIT/compilation warmup
Validation: CPU and GPU produce bitwise-identical results for static Liouvillian at $t=0$

Hardware

	A100 benchmarks	T4 benchmarks
GPU	A100-SXM4-40GB (2 TB/s, 9.7 TFLOPS FP64)	Tesla T4 (300 GB/s, 0.25 TFLOPS FP64)
CPU	12 vCPUs, 85 GB RAM	8 vCPUs, 30 GB RAM
Instance	GCE a2-highgpu-1g	GCE n1-standard-8
CUDA	12.8, Driver 570.211.01	12.8, Driver 570.211.01

Software

Component	Version
Julia	1.12.5
CUDA.jl	5.8.x
cuQuantum SDK (JLL)	25.11.0
QuantumToolbox.jl	latest
QuTiP	5.2.3
JAX	0.6.2
cuquantum-python-jax	0.0.3 (built from source)

Reproduction

Benchmark scripts are in benchmark/comparison/ and benchmark/batched/. See the comparison README for detailed setup instructions including GCE instance creation, Julia/Python installation, and the JAX extension source build.

Documentation

Full documentation: harmoniqs.github.io/CuQuantum.jl

License

MIT — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
.github		.github
benchmark		benchmark
docs		docs
examples		examples
gen		gen
src		src
test		test
.gitignore		.gitignore
LICENSE		LICENSE
Project.toml		Project.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CuQuantum.jl

Motivation

When to use cuDensityMat

Features

Installation

Benchmarks

Single $\mathcal{L}[\rho]$ Action — A100

Full 100-step RK4 Simulation — A100

Batched Evolution — A100

Cross-Framework Comparison — Tesla T4

Wrapper Overhead

Methodology

Hardware

Software

Reproduction

Documentation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CuQuantum.jl

Motivation

When to use cuDensityMat

Features

Installation

Benchmarks

Single $\mathcal{L}[\rho]$ Action — A100

Full 100-step RK4 Simulation — A100

Batched Evolution — A100

Cross-Framework Comparison — Tesla T4

Wrapper Overhead

Methodology

Hardware

Software

Reproduction

Documentation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages