Skip to content

Fix Python/C++ interop bugs exposed by PR #213 tests#315

Closed
Copilot wants to merge 13 commits intomainfrom
copilot/fix-complaint-from-pr-313
Closed

Fix Python/C++ interop bugs exposed by PR #213 tests#315
Copilot wants to merge 13 commits intomainfrom
copilot/fix-complaint-from-pr-313

Conversation

Copy link
Contributor

Copilot AI commented Feb 11, 2026

PR #213 introduced Python vector/list tests that exposed multiple latent bugs in the Python plugin's reference counting, error handling, and exception safety. PR #313 attempted a fix but was reverted after wlav identified that its XINCREF/XDECREF approach leaked references. This PR cherry-picks the genuine bug fixes from #313 and adds additional hardening.

Fixes applied

  • callv() / call() symmetry (modulewrap.cpp): callv() was not using lifeline_transform() to unwrap PhlexLifeline arguments, unlike call(). Both paths are now symmetric.

  • VECTOR_CONVERTER null sentinel (modulewrap.cpp): Error paths returned (intptr_t)nullptr, which PyObject_CallFunctionObjArgs interprets as the argument-list terminator, silently truncating arguments. Now throws std::runtime_error with Python error state extracted via msg_from_py_error().

  • ll_new null deref (lifelinewrap.cpp): On tp_alloc failure, fell through to dereference null. Now returns nullptr.

  • declared_transform exception safety (declared_transform.hpp): stores_.insert() creates a null cache entry; if the subsequent call() throws, the null persists and causes SEGFAULT on cache reuse. Wrapped in try/catch to erase stale entries.

  • lifeline_transform null guard (modulewrap.cpp): Added defensive throws when input intptr_t is 0 or PhlexLifeline's m_view is null, preventing silent argument truncation.

  • Borrowed reference protection (modulewrap.cpp): lifeline_transform() returns a borrowed m_view reference. Now stored in a local array with balanced INCREF/DECREF around the Python call to keep it alive during execution.

  • PhlexLifeline GC flag removal (lifelinewrap.cpp): Removed Py_TPFLAGS_HAVE_GC — PhlexLifeline doesn't participate in reference cycles, so GC tracking was unnecessary overhead. Removed associated ll_traverse, ll_clear, and PyObject_GC_UnTrack.

Status: incomplete

The py:vectypes SEGFAULT is not yet fixed. Extensive investigation (GDB, TSAN, diagnostic tracing, 100-iteration stress tests) confirmed the root cause is use-after-free of PyObjects stored as raw intptr_t in product_store. Removing all Py_DECREF calls eliminates the crash (0/100 failures vs ~50%), confirming the ownership model is fundamentally broken: intptr_t has no destructor, so freed PyObject addresses persist as dangling pointers in the declared_transform cache.

The fix requires either wrapping intptr_t in a RAII type with ref-counting semantics, or restructuring ownership so only one consumer DECREFs each PyObject.

Documentation

Added python-refcounting.md documenting the intended one-to-one ownership model and known limitations.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • cli.github.com
    • Triggering command: /usr/lib/apt/methods/https /usr/lib/apt/methods/https 6530858abe47310e4bca10731f385e45c10/bc8ae198dc87ab971a701b8ae79a29c999e2a4e5d7f2993f�� /type_deduction.--eh-frame-hdr ++ 1556�� -DSPDLOG_FMT_EXT-export-dynamic rapper as ED_LIB -DSPDLOG_lsb_release -isystem c10/log.json bug:/opt/spack-e/lib/x86_64-linux-gnu/crti.o -DBO�� -debug -DBOOST_JSON_DYN_LINK s/phlex-ci/.spack-env/view/lib OOST_JSON_NO_LIB/usr/bin/md5sum /include .dir/MC_truth_algorithm.cpp.o ck-environments/phlex-ci/.spack (dns block)

If you need me to access, download, or install something from one of these locations, you can either:


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

Copilot AI changed the title [WIP] Address complaint from PR 313 Re-apply genuine bug fixes from reverted PR #313, without incorrect refcount changes Feb 11, 2026
Copilot AI requested a review from greenc-FNAL February 11, 2026 23:28
Copilot AI changed the title Re-apply genuine bug fixes from reverted PR #313, without incorrect refcount changes Re-apply real bug fixes from reverted PR #313, add corrected refcount docs Feb 11, 2026
@greenc-FNAL
Copy link
Contributor

@phlexbot format

@github-actions
Copy link
Contributor

No automatic markdownlint fixes were necessary.

@github-actions
Copy link
Contributor

No automatic cmake-format fixes were necessary.

@github-actions
Copy link
Contributor

No automatic clang-format fixes were necessary.

@github-actions
Copy link
Contributor

No automatic jsonnetfmt fixes were necessary.

@greenc-FNAL greenc-FNAL marked this pull request as ready for review February 11, 2026 23:37
Copilot AI review requested due to automatic review settings February 11, 2026 23:37
@greenc-FNAL greenc-FNAL requested review from knoepfel and wlav February 11, 2026 23:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Re-applies three targeted correctness fixes to the Phlex Python plugin after PR #313’s refcount changes were reverted, and adds a standalone reference-counting design note intended to prevent future leaks/crashes when declared_transform caching is involved.

Changes:

  • Fix py_callback::callv() to apply lifeline_transform() (matching call()), ensuring PhlexLifeline-wrapped numpy views are unwrapped before calling Python.
  • Make VECTOR_CONVERTER error paths throw instead of returning (intptr_t)nullptr, avoiding PyObject_CallFunctionObjArgs argument truncation via sentinel nulls.
  • Fix ll_new to return nullptr on tp_alloc failure (avoid null dereference), and add python-refcounting.md describing the intended ownership model.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
plugins/python/src/modulewrap.cpp Fixes callv() lifeline unwrapping; changes vector converters to throw on error instead of returning null.
plugins/python/src/lifelinewrap.cpp Prevents null deref in ll_new by returning nullptr on allocation failure.
plugins/python/src/python-refcounting.md Documents the intended reference-counting/ownership model and key pitfalls for the Python plugin.

@codecov
Copy link

codecov bot commented Feb 11, 2026

Codecov Report

❌ Patch coverage is 63.63636% with 16 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
plugins/python/src/modulewrap.cpp 53.12% 12 Missing and 3 partials ⚠️
plugins/python/src/lifelinewrap.cpp 0.00% 1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (63.63%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

@@            Coverage Diff             @@
##             main     #315      +/-   ##
==========================================
- Coverage   80.29%   80.14%   -0.16%     
==========================================
  Files         126      126              
  Lines        3065     3092      +27     
  Branches      545      551       +6     
==========================================
+ Hits         2461     2478      +17     
- Misses        383      387       +4     
- Partials      221      227       +6     
Flag Coverage Δ
unittests 80.14% <63.63%> (-0.16%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
phlex/core/declared_transform.hpp 92.00% <100.00%> (+0.69%) ⬆️
plugins/python/src/lifelinewrap.cpp 77.77% <0.00%> (+51.11%) ⬆️
plugins/python/src/modulewrap.cpp 73.99% <53.12%> (-1.50%) ⬇️

... and 3 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b4db88e...51eff36. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot AI changed the title Re-apply real bug fixes from reverted PR #313, add corrected refcount docs Re-apply genuine bug fixes from reverted PR #313 with exception safety Feb 12, 2026
Copilot AI changed the title Re-apply genuine bug fixes from reverted PR #313 with exception safety Re-apply genuine bug fixes from reverted PR #313, with exception safety and correct refcount docs Feb 12, 2026
@greenc-FNAL
Copy link
Contributor

@phlexbot clang-fix

@github-actions
Copy link
Contributor

Automatic clang-format fixes pushed (commit 3bd2483).
⚠️ Note: Some issues may require manual review and fixing.

@greenc-FNAL
Copy link
Contributor

Review the full CodeQL report for details.

@github-actions
Copy link
Contributor

No automatic cmake-format fixes were necessary.

@github-actions
Copy link
Contributor

No automatic clang-format fixes were necessary.

Copilot AI changed the title Fix Python refcounting for intermittent py:veclists/vectypes test failures Fix intermittent SEGFAULT in Python vector-type tests Feb 12, 2026
@greenc-FNAL
Copy link
Contributor

@phlexbot format

@github-actions
Copy link
Contributor

No automatic markdownlint fixes were necessary.

@github-actions
Copy link
Contributor

No automatic jsonnetfmt fixes were necessary.

@github-actions
Copy link
Contributor

No automatic cmake-format fixes were necessary.

@github-actions
Copy link
Contributor

No automatic clang-format fixes were necessary.

Copilot AI changed the title Fix intermittent SEGFAULT in Python vector-type tests Fix Python/C++ interop bugs exposed by PR #213 tests Feb 12, 2026

`lifeline_transform()` unwraps `PhlexLifeline` objects to extract the
numpy array view (`m_view`). The returned pointer is a borrowed
reference from the lifeline object. Because this borrowed reference
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this is possible? There's a hard reference on the lifeline itself, which in turn holds a hard reference on the m_view. Only if there's a bug in the callee could it be invalidated.


### PhlexLifeline Must Not Use Py_TPFLAGS_HAVE_GC

`PhlexLifeline` does not participate in reference cycles — its
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unlikely yes, since the view is freshly created for the purposes of the lifeline, but the GC doesn't run unless it holds the GIL, so I don't see the issue here.

(net +0) and only protects borrowed views during the Python call.
Adding unbalanced INCREF/DECREF introduces leaks.

3. **Do not return `(intptr_t)nullptr` from converters.** It acts as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More importantly, the C++ exception is needed to enter the normal phlex error handling.

greenc-FNAL and others added 13 commits February 12, 2026 14:10
- Bypass caching issues
Apply only the genuine bug fixes from PR 313, without the reference
counting changes that wlav correctly identified as introducing leaks:

1. Make callv() use lifeline_transform() symmetrically with call()
2. Replace null returns with exceptions in VECTOR_CONVERTER error paths
   (null intptr_t acts as PyObject_CallFunctionObjArgs sentinel)
3. Fix ll_new to return nullptr on allocation failure instead of
   falling through to null dereference

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
The previous PR #313 document described an incorrect model with
XINCREF/XDECREF wrapping. This version documents the correct
one-to-one ownership model: each converter creates a reference,
each consumer DECREFs it exactly once.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
…reuse

When stores_.insert() creates an entry but the subsequent call() throws,
the null product_store_ptr persists in the cache. On cache reuse, this
null pointer is sent downstream, causing a SEGFAULT.

Fix: wrap the call and product store creation in try/catch, erasing the
stale entry on exception before re-throwing.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
When PyArray_SimpleNewFromData or PhlexLifeline_Type.tp_new fails,
use msg_from_py_error() to extract and clear the Python error state
before throwing std::runtime_error. This prevents stale PyErr_Occurred()
state from confusing later Python C API usage and includes the Python
diagnostic text in the thrown exception message.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Prevent null PyObject* from being silently passed as the argument-list
sentinel to PyObject_CallFunctionObjArgs, which causes Python functions
to receive fewer arguments than expected. Instead, throw with a
diagnostic message identifying whether the null came from the intptr_t
arg itself or from a PhlexLifeline with null m_view.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Separate lifeline_transform() into a pre-call phase that stores
transformed args in a local array and INCREFs them, and a post-call
phase that DECREFs them. This protects borrowed m_view references from
PhlexLifeline objects during the Python call.

The overall reference counting remains one-to-one: each input is
consumed by decref_all() after the call. The INCREF/DECREF pair is
balanced (net +0) and only serves to keep the borrowed m_view alive
during PyObject_CallFunctionObjArgs.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
PhlexLifeline does not participate in reference cycles: m_view points
to a numpy array (not GC-tracked) and nothing references the lifeline
from tracked Python objects.

With Py_TPFLAGS_HAVE_GC, PyType_GenericAlloc tracked every freshly
allocated PhlexLifeline in the garbage collector, making it eligible
for GC cycle detection.  During cycle detection the collector visits
every tracked object -- potentially while the object is still being
initialized in ll_new or while a TBB worker thread is concurrently
dereferencing m_view.  Removing the flag avoids this class of race
entirely and eliminates a spurious PyObject_GC_UnTrack call on
destruction.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
@greenc-FNAL greenc-FNAL force-pushed the copilot/fix-complaint-from-pr-313 branch from 5ce56f5 to 382884b Compare February 12, 2026 20:14
@greenc-FNAL greenc-FNAL deleted the copilot/fix-complaint-from-pr-313 branch February 17, 2026 17:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants