feat: add Qureg checkpointing via ADIOS2 (#747) #780

Open
ashmitjsg wants to merge 4 commits into QuEST-Kit:devel from ashmitjsg:feat/qureg-checkpointing-747
@ashmitjsg

Summary

Implements Qureg checkpointing (issue #747): two new API functions for writing a Qureg to disk and restoring it later.

void  saveQuregToFile(Qureg qureg, const char* fn);
Qureg createQuregFromFile(const char* fn);

This is useful for long-running HPC jobs vulnerable to timeout or failure - an evolving Qureg can be periodically written to disk and resumed in a later process.

Design

Following the approach suggested in the issue, checkpointing is built upon ADIOS2 and gated behind a new CMake option ENABLE_CHECKPOINTING (OFF by default), so the ADIOS2 dependency is only required when the feature is requested.

  • What is saved: only the Qureg dimension (numQubits, isDensityMatrix), the full amplitude set, and a precision marker (sizeof(qreal)). Incidental deployment fields (multithreading, GPU-acceleration, distribution) are not saved, nor are derivable fields (numAmps, etc). A Qureg saved by one deployment can therefore be restored by any other: createQuregFromFile() creates the Qureg with automatically chosen deployments, and the precision marker guards against restoring into a mismatched-precision build.
  • Amplitudes: stored as an ADIOS2 global array of interleaved (real, imag) reals (reinterpret_cast of the contiguous qcomp buffer), keeping the format agnostic to precision and to ADIOS2's complex-type support.
  • Memory / deployments: each node writes/reads only its local slice (start = 2·rank·numAmpsPerNode, count = 2·numAmpsPerNode) of the global array, so the implementation streams without excessive memory and is distributed-ready. GPU-resident state is synced to host before writing (syncQuregFromGpu) and back after reading (syncQuregToGpu).
  • Not compiled: calling either function in a build without ENABLE_CHECKPOINTING raises a clear validation error (rather than failing to link), via validate_quregCheckpointingIsCompiled.
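
To illustrate the interleaved-reals layout described in the Amplitudes bullet, here is a minimal, self-contained sketch. It does not use QuEST or ADIOS2: real_t and amp_t are hypothetical stand-ins for qreal and qcomp, assuming the C++ build's qcomp is a std::complex type, and the helper names are invented for this example.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using real_t = double;                // hypothetical stand-in for QuEST's qreal
using amp_t  = std::complex<real_t>;  // hypothetical stand-in for QuEST's qcomp

// View a contiguous complex buffer as interleaved (real, imag) reals without
// copying. The C++ standard guarantees this layout for arrays of std::complex<T>.
const real_t* asInterleavedReals(const std::vector<amp_t>& amps) {
    return reinterpret_cast<const real_t*>(amps.data());
}

// Rebuild complex amplitudes from an interleaved buffer of reals.
std::vector<amp_t> fromInterleavedReals(const std::vector<real_t>& reals) {
    assert(reals.size() % 2 == 0);  // every amplitude contributes two reals
    std::vector<amp_t> amps(reals.size() / 2);
    for (std::size_t i = 0; i < amps.size(); ++i)
        amps[i] = amp_t(reals[2 * i], reals[2 * i + 1]);
    return amps;
}
```

Because the view is a plain array of reals, the stored variable only needs a real element type, which is the property the PR relies on to avoid depending on ADIOS2's complex-type support.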

The new API functions live in the existing C-and-C++-agnostic partition of qureg.h (they pass no qcomp by value, so remain C-ABI-safe).

Scope

This first pass targets, and is verified on, single-node CPU builds (the ADIOS2 build used here has MPI off). The code is written deployment-agnostically against the global-array abstraction, so enabling distribution (rebuild ADIOS2 + QuEST with MPI) and GPU should work unchanged; I'm happy to extend/verify those if preferred.

Testing

  • New tests/unit/checkpoint.cpp (guarded by ENABLE_CHECKPOINTING): statevector and density-matrix round-trips assert that the restored Qureg matches in dimension and amplitudes. ./tests/tests "[checkpoint]" → all assertions pass (CPU and CPU+OMP).
  • Standalone round-trip check also confirms bit-exact restoration (maxAmpDiff = 0) for both statevector and density matrix.
  • Build verified: cmake .. -D ENABLE_CHECKPOINTING=ON -D CMAKE_PREFIX_PATH=$HOME/.local → clean compile + link against adios2::cxx.
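
The round-trip check above can be sketched without QuEST or ADIOS2. The example below is an assumption-laden stand-in, not the PR's test: it swaps the ADIOS2 BP5 file for a raw binary stream, and State, saveState, loadState and roundTripDiff are hypothetical names. It writes the same three pieces of metadata (dimension, density-matrix flag, precision marker) followed by the amplitudes, then reports the largest amplitude difference after restoring.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <vector>

using real_t = double;                // hypothetical stand-in for qreal
using amp_t  = std::complex<real_t>;  // hypothetical stand-in for qcomp

struct State {
    std::int32_t numQubits;
    std::int32_t isDensityMatrix;
    std::vector<amp_t> amps;
};

// Write metadata, then the raw amplitude bytes (a stand-in for the ADIOS2 write).
void saveState(std::ostream& out, const State& s) {
    std::int32_t precision = static_cast<std::int32_t>(sizeof(real_t));
    out.write(reinterpret_cast<const char*>(&s.numQubits), sizeof s.numQubits);
    out.write(reinterpret_cast<const char*>(&s.isDensityMatrix), sizeof s.isDensityMatrix);
    out.write(reinterpret_cast<const char*>(&precision), sizeof precision);
    out.write(reinterpret_cast<const char*>(s.amps.data()),
              static_cast<std::streamsize>(s.amps.size() * sizeof(amp_t)));
}

// Read metadata, refuse a mismatched precision, then read the amplitudes.
State loadState(std::istream& in) {
    State s{};
    std::int32_t precision = 0;
    in.read(reinterpret_cast<char*>(&s.numQubits), sizeof s.numQubits);
    in.read(reinterpret_cast<char*>(&s.isDensityMatrix), sizeof s.isDensityMatrix);
    in.read(reinterpret_cast<char*>(&precision), sizeof precision);
    if (!in || s.numQubits < 0 || s.numQubits > 15)  // small cap keeps the demo safe
        throw std::runtime_error("truncated or implausible header");
    if (precision != static_cast<std::int32_t>(sizeof(real_t)))
        throw std::runtime_error("file was written with a different precision");
    // A density matrix on n qubits holds 2^(2n) amplitudes; a statevector holds 2^n.
    std::size_t numAmps = std::size_t{1} << (s.isDensityMatrix ? 2 * s.numQubits : s.numQubits);
    s.amps.resize(numAmps);
    in.read(reinterpret_cast<char*>(s.amps.data()),
            static_cast<std::streamsize>(numAmps * sizeof(amp_t)));
    if (!in) throw std::runtime_error("truncated amplitudes");
    return s;
}

// Largest absolute difference between corresponding amplitudes.
real_t maxAmpDiff(const std::vector<amp_t>& a, const std::vector<amp_t>& b) {
    assert(a.size() == b.size());
    real_t worst = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        worst = std::max(worst, std::abs(a[i] - b[i]));
    return worst;
}

// Save to an in-memory stream, restore, and report the worst amplitude error.
real_t roundTripDiff(const State& s) {
    std::stringstream buf(std::ios::in | std::ios::out | std::ios::binary);
    saveState(buf, s);
    State restored = loadState(buf);
    return maxAmpDiff(s.amps, restored.amps);
}
```

Since the bytes are copied unchanged, a correct round trip in this sketch yields a difference of exactly zero, mirroring the maxAmpDiff = 0 result reported for the real implementation.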

Build

cmake .. -D ENABLE_CHECKPOINTING=ON -D CMAKE_PREFIX_PATH=/path/to/adios2/prefix
cmake --build . --parallel

Documented in docs/compile.md (new "Checkpointing" section).

Notes / open questions

  • File format is ADIOS2 BP5 (a .bp directory). Happy to adjust naming/engine conventions.
  • I kept the saved metadata minimal (dimension + precision). If you'd like additional provenance (e.g. a format version field for forward-compatibility), I can add it.

AI-Assisted Contribution Disclosure

I used an AI assistant (Claude) to help explore the QuEST architecture, discuss the design, and review the code and tests. I traced the Qureg struct, the qureg.cpp / validation.cpp patterns, and the amplitude-access and sync routines myself, made the implementation decisions (interleaved-reals storage to dodge ADIOS2's lack of a long-double-complex type; the global-array slice scheme for memory-efficient, distribution-ready I/O; gating via a validation error rather than a link error; where the API and validation belong), and verified all behaviour locally with the tests above plus bit-exact round-trip and build checks. I can explain and stand behind every line.

Adds saveQuregToFile() and createQuregFromFile() to write a Qureg to disk
and restore it later, behind the optional CMake flag ENABLE_CHECKPOINTING
(which requires ADIOS2). The file records only the Qureg dimension (numQubits,
isDensityMatrix) and its amplitudes - never the incidental deployment fields,
nor derivable fields like numAmps - so a Qureg may be restored under a
different deployment than it was saved with.

Amplitudes are written as an ADIOS2 global array of interleaved (real, imag)
reals, with each node contributing only its local slice, so the implementation
streams without excessive memory and is distributed- and GPU-ready: GPU state
is synced to host before writing and back after reading, and the global-array
selection lets any node count read back its own portion.
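
As a rough illustration of the per-node slice arithmetic (start = 2·rank·numAmpsPerNode, count = 2·numAmpsPerNode), the sketch below computes each node's window into the global array of reals. Slice and getLocalSlice are hypothetical names, and the code assumes amplitudes divide evenly between nodes; it is not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>

struct Slice {
    std::uint64_t start;  // offset into the global array of interleaved reals
    std::uint64_t count;  // number of reals owned by this node
};

// Window of the global array owned by `rank`, assuming the amplitudes are
// split evenly between `numNodes` nodes.
Slice getLocalSlice(int numQubits, bool isDensityMatrix,
                    std::uint64_t numNodes, std::uint64_t rank) {
    // A density matrix on n qubits holds 2^(2n) amplitudes; a statevector holds 2^n.
    std::uint64_t numAmps = std::uint64_t{1} << (isDensityMatrix ? 2 * numQubits : numQubits);
    assert(numNodes > 0 && rank < numNodes && numAmps % numNodes == 0);
    std::uint64_t numAmpsPerNode = numAmps / numNodes;
    // Each amplitude occupies two reals (real, imag), hence the factors of 2.
    return Slice{ 2 * rank * numAmpsPerNode, 2 * numAmpsPerNode };
}
```

The windows depend only on the saved dimension and on the reader's own node count, which is why a file written by one node count can in principle be read back by another: the slices for any valid node count tile the same global array without gaps or overlap.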

Also adds a validation error when the API is called in a build without
checkpointing, reports isCheckpointingCompiled in the environment info
(alongside isOmpCompiled, isGpuCompiled, etc), a guarded Catch2 test
(tests/unit/checkpoint.cpp) exercising statevector and density-matrix
round-trips, and documents the build flag in docs/compile.md.
@ashmitjsg ashmitjsg changed the base branch from main to devel June 4, 2026 20:07
Comment thread quest/src/core/validation.cpp Outdated
Comment on lines +2007 to +2011
#ifdef ENABLE_CHECKPOINTING
bool isCompiled = true;
#else
bool isCompiled = false;
#endif
Member

For this file to see the ENABLE_CHECKPOINTING (and other) preprocessors, you must include

#include "quest/include/config.h"

Presently, the macro is undefined in this file, so the #else branch is taken, setting isCompiled=false and making this validation always trigger. This makes me suspect you did not compile and run the tests yourself before submitting this PR. Please see test instructions here

@TysonRayJones
Member

For ease of testing, I've made QuEST's build download adios2 when not locally installed

@TysonRayJones
Member

TysonRayJones commented Jun 5, 2026

Looks like the CI is successfully downloading adios2, but then some jobs fail to compile! Strangely, the logs (like this one) don't show an error - compilation just stops. Meanwhile, other jobs on the same platforms (like this one) compile fine!

Quite irksome, but anyways suggests you should have a go at running your new unit tests yourself, as guided here. The new CMake provision to download adios2 should make that easier. You can compile the tests and run them locally, even with MPI and few processors, via:

mpirun --oversubscribe -np 8 ./tests/tests "[checkpoint]"

I'll dig into your implementation once we're assured it compiles and works!

…INTING)

  QuEST defines all compile-time feature macros centrally in config.h
  (generated from config.h.in). The checkpointing flag was instead passed
  as a raw target_compile_definitions, so validation.cpp (which doesn't
  include config.h) saw it undefined and always reported 'not compiled'
  under the project's normal build path.

  Add #cmakedefine01 QUEST_COMPILE_CHECKPOINTING to config.h.in, set it
  from the ENABLE_CHECKPOINTING option, link ADIOS2 to the QuEST target,
  and switch the sources/tests to #include config.h + #if
  QUEST_COMPILE_CHECKPOINTING. Remove the per-target compile-definition
  hacks.

  Verified: ON build -> config.h has =1 and tests/tests '[checkpoint]'
  passes (CPU, CPU+OMP); default OFF build has =0 and compiles without
  ADIOS2.
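
The switch from #ifdef to #if matters once the macro is always defined as 0 or 1. The sketch below is a stand-alone illustration with made-up DEMO_ macro names in place of the generated config.h; it shows how the two directives disagree for a macro defined to 0, and how #if treats a never-defined name. The throwing helper only imitates the idea of a runtime validation error and is not QuEST's validation code.

```cpp
#include <cassert>
#include <stdexcept>

// Stand-in for a generated config.h: a 0/1 macro is always defined.
#define DEMO_COMPILE_CHECKPOINTING 0

// #ifdef asks "is it defined at all?", so a macro defined to 0 still passes.
constexpr bool viaIfdef =
#ifdef DEMO_COMPILE_CHECKPOINTING
    true;
#else
    false;
#endif

// #if asks "is its value non-zero?", which is what a 0/1 macro needs.
constexpr bool viaIf =
#if DEMO_COMPILE_CHECKPOINTING
    true;
#else
    false;
#endif

// A name that was never defined evaluates to 0 inside #if.
constexpr bool undefinedViaIf =
#if DEMO_NEVER_DEFINED
    true;
#else
    false;
#endif

static_assert(viaIfdef, "#ifdef reports a macro defined to 0 as enabled");
static_assert(!viaIf, "#if reports a macro defined to 0 as disabled");
static_assert(!undefinedViaIf, "#if treats an undefined macro as 0");

// Hypothetical runtime guard: fail with a clear message instead of at link time.
void demoValidateCheckpointingIsCompiled() {
    if (!viaIf)
        throw std::runtime_error("checkpointing was not enabled in this build");
}
```

Under this convention, any file that tests the macro must include the header that defines it; otherwise #if silently sees 0, which matches the always-triggering validation flagged in review.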
@ashmitjsg
Author

@TysonRayJones You're right, thanks for catching this. I did build and run the tests locally before opening - but via cmake -DENABLE_CHECKPOINTING=ON, which set the macro as a raw compile definition on the QuEST target. Since validation.cpp compiles into that single target, the -D reached it in my build, so the #ifdef saw it and the [checkpoint] round-trip tests passed. The problem is that's the wrong mechanism: it only works for that specific cmake invocation and bypasses QuEST's convention, where compile macros are the single source of truth in config.h (and validation.cpp doesn't include config.h, so by the proper path the macro is undefined and defaults to 0). My mistake - I should have wired it through the config system from the start.

Fixed: added #cmakedefine01 QUEST_COMPILE_CHECKPOINTING to config.h.in, set it from the CMake option, dropped the direct compile-definition, and switched the sources to #include "quest/include/config.h" + #if QUEST_COMPILE_CHECKPOINTING. I also linked ADIOS2 to the QuEST target (completing the FetchContent block).

Verified locally:

  • cmake -DENABLE_CHECKPOINTING=ON -DQUEST_BUILD_TESTS=ON -> generated config.h has #define QUEST_COMPILE_CHECKPOINTING 1; ./tests/tests "[checkpoint]" passes (6 assertions, 1 test case) across the CPU and CPU+OMP deployments.
  • default build (OFF) -> config.h has QUEST_COMPILE_CHECKPOINTING 0 and compiles cleanly without ADIOS2.

This was a single-node (non-MPI) build, so I ran the tests serially rather than under mpirun. Sorry for the churn.

Still to do - I'll take these on next:

  • CI compile stalls. I believe the "compilation just stops with no error" on some jobs is the runner being OOM-killed during the ADIOS2 FetchContent build (its templates are memory-heavy when compiled in parallel). I'll look into reducing that pressure - e.g. building ADIOS2 with fewer parallel jobs and/or trimming more of its optional components in the FetchContent block - and confirm against the jobs that currently fail.
  • Distributed (MPI) test. My local build is single-node (no MPI here), so I've only verified the serial CPU / CPU+OMP path. I'll set up MPI locally and run mpirun --oversubscribe -np 8 ./tests/tests "[checkpoint]" to exercise the per-rank slice logic (numAmpsPerNode / rank offsets) end-to-end.
  • GPU path. The GPU sync (syncQuregFromGpu / syncQuregToGpu) is in place but I haven't been able to test on a GPU; I'll verify it (or call it out explicitly as untested) so the scope is clear.

I'll report back here with results on each. Thanks for your patience while I get this to a properly-tested state.

2 participants