Fix Python/C++ interop bugs exposed by PR #213 tests#315
Fix Python/C++ interop bugs exposed by PR #213 tests#315
Conversation
|
@phlexbot format |
|
No automatic markdownlint fixes were necessary. |
|
No automatic cmake-format fixes were necessary. |
|
No automatic clang-format fixes were necessary. |
|
No automatic jsonnetfmt fixes were necessary. |
There was a problem hiding this comment.
Pull request overview
Re-applies three targeted correctness fixes to the Phlex Python plugin after PR #313’s refcount changes were reverted, and adds a standalone reference-counting design note intended to prevent future leaks/crashes when declared_transform caching is involved.
Changes:
- Fix
py_callback::callv()to applylifeline_transform()(matchingcall()), ensuringPhlexLifeline-wrapped numpy views are unwrapped before calling Python. - Make
VECTOR_CONVERTERerror paths throw instead of returning(intptr_t)nullptr, avoidingPyObject_CallFunctionObjArgsargument truncation via sentinel nulls. - Fix
ll_newto returnnullptrontp_allocfailure (avoid null dereference), and addpython-refcounting.mddescribing the intended ownership model.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| plugins/python/src/modulewrap.cpp | Fixes callv() lifeline unwrapping; changes vector converters to throw on error instead of returning null. |
| plugins/python/src/lifelinewrap.cpp | Prevents null deref in ll_new by returning nullptr on allocation failure. |
| plugins/python/src/python-refcounting.md | Documents the intended reference-counting/ownership model and key pitfalls for the Python plugin. |
Codecov Report❌ Patch coverage is
❌ Your patch check has failed because the patch coverage (63.63%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. @@ Coverage Diff @@
## main #315 +/- ##
==========================================
- Coverage 80.29% 80.14% -0.16%
==========================================
Files 126 126
Lines 3065 3092 +27
Branches 545 551 +6
==========================================
+ Hits 2461 2478 +17
- Misses 383 387 +4
- Partials 221 227 +6
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 3 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
|
@phlexbot clang-fix |
|
Automatic clang-format fixes pushed (commit 3bd2483). |
|
Review the full CodeQL report for details. |
|
No automatic cmake-format fixes were necessary. |
|
No automatic clang-format fixes were necessary. |
|
@phlexbot format |
|
No automatic markdownlint fixes were necessary. |
|
No automatic jsonnetfmt fixes were necessary. |
|
No automatic cmake-format fixes were necessary. |
|
No automatic clang-format fixes were necessary. |
|
|
||
| `lifeline_transform()` unwraps `PhlexLifeline` objects to extract the | ||
| numpy array view (`m_view`). The returned pointer is a borrowed | ||
| reference from the lifeline object. Because this borrowed reference |
There was a problem hiding this comment.
I don't see how this is possible? There's a hard reference on the lifeline itself, which in turn holds a hard reference on the m_view. Only if there's a bug in the callee could it be invalidated.
|
|
||
| ### PhlexLifeline Must Not Use Py_TPFLAGS_HAVE_GC | ||
|
|
||
| `PhlexLifeline` does not participate in reference cycles — its |
There was a problem hiding this comment.
It's unlikely yes, since the view is freshly created for the purposes of the lifeline, but the GC doesn't run unless it holds the GIL, so I don't see the issue here.
| (net +0) and only protects borrowed views during the Python call. | ||
| Adding unbalanced INCREF/DECREF introduces leaks. | ||
|
|
||
| 3. **Do not return `(intptr_t)nullptr` from converters.** It acts as |
There was a problem hiding this comment.
More importantly, the C++ exception is needed to enter the normal phlex error handling.
- Bypass caching issues
Apply only the genuine bug fixes from PR 313, without the reference counting changes that wlav correctly identified as introducing leaks: 1. Make callv() use lifeline_transform() symmetrically with call() 2. Replace null returns with exceptions in VECTOR_CONVERTER error paths (null intptr_t acts as PyObject_CallFunctionObjArgs sentinel) 3. Fix ll_new to return nullptr on allocation failure instead of falling through to null dereference Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
The previous PR #313 document described an incorrect model with XINCREF/XDECREF wrapping. This version documents the correct one-to-one ownership model: each converter creates a reference, each consumer DECREFs it exactly once. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
…reuse When stores_.insert() creates an entry but the subsequent call() throws, the null product_store_ptr persists in the cache. On cache reuse, this null pointer is sent downstream, causing a SEGFAULT. Fix: wrap the call and product store creation in try/catch, erasing the stale entry on exception before re-throwing. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
When PyArray_SimpleNewFromData or PhlexLifeline_Type.tp_new fails, use msg_from_py_error() to extract and clear the Python error state before throwing std::runtime_error. This prevents stale PyErr_Occurred() state from confusing later Python C API usage and includes the Python diagnostic text in the thrown exception message. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Prevent null PyObject* from being silently passed as the argument-list sentinel to PyObject_CallFunctionObjArgs, which causes Python functions to receive fewer arguments than expected. Instead, throw with a diagnostic message identifying whether the null came from the intptr_t arg itself or from a PhlexLifeline with null m_view. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Separate lifeline_transform() into a pre-call phase that stores transformed args in a local array and INCREFs them, and a post-call phase that DECREFs them. This protects borrowed m_view references from PhlexLifeline objects during the Python call. The overall reference counting remains one-to-one: each input is consumed by decref_all() after the call. The INCREF/DECREF pair is balanced (net +0) and only serves to keep the borrowed m_view alive during PyObject_CallFunctionObjArgs. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
PhlexLifeline does not participate in reference cycles: m_view points to a numpy array (not GC-tracked) and nothing references the lifeline from tracked Python objects. With Py_TPFLAGS_HAVE_GC, PyType_GenericAlloc tracked every freshly allocated PhlexLifeline in the garbage collector, making it eligible for GC cycle detection. During cycle detection the collector visits every tracked object -- potentially while the object is still being initialized in ll_new or while a TBB worker thread is concurrently dereferencing m_view. Removing the flag avoids this class of race entirely and eliminates a spurious PyObject_GC_UnTrack call on destruction. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
5ce56f5 to
382884b
Compare
PR #213 introduced Python vector/list tests that exposed multiple latent bugs in the Python plugin's reference counting, error handling, and exception safety. PR #313 attempted a fix but was reverted after wlav identified that its XINCREF/XDECREF approach leaked references. This PR cherry-picks the genuine bug fixes from #313 and adds additional hardening.
Fixes applied
callv()/call()symmetry (modulewrap.cpp):callv()was not usinglifeline_transform()to unwrap PhlexLifeline arguments, unlikecall(). Both paths are now symmetric.VECTOR_CONVERTERnull sentinel (modulewrap.cpp): Error paths returned(intptr_t)nullptr, whichPyObject_CallFunctionObjArgsinterprets as the argument-list terminator, silently truncating arguments. Now throwsstd::runtime_errorwith Python error state extracted viamsg_from_py_error().ll_newnull deref (lifelinewrap.cpp): Ontp_allocfailure, fell through to dereference null. Now returnsnullptr.declared_transformexception safety (declared_transform.hpp):stores_.insert()creates a null cache entry; if the subsequentcall()throws, the null persists and causes SEGFAULT on cache reuse. Wrapped in try/catch to erase stale entries.lifeline_transformnull guard (modulewrap.cpp): Added defensive throws when inputintptr_tis 0 or PhlexLifeline'sm_viewis null, preventing silent argument truncation.Borrowed reference protection (
modulewrap.cpp):lifeline_transform()returns a borrowedm_viewreference. Now stored in a local array with balanced INCREF/DECREF around the Python call to keep it alive during execution.PhlexLifeline GC flag removal (
lifelinewrap.cpp): RemovedPy_TPFLAGS_HAVE_GC— PhlexLifeline doesn't participate in reference cycles, so GC tracking was unnecessary overhead. Removed associatedll_traverse,ll_clear, andPyObject_GC_UnTrack.Status: incomplete
The
py:vectypesSEGFAULT is not yet fixed. Extensive investigation (GDB, TSAN, diagnostic tracing, 100-iteration stress tests) confirmed the root cause is use-after-free of PyObjects stored as rawintptr_tinproduct_store. Removing allPy_DECREFcalls eliminates the crash (0/100 failures vs ~50%), confirming the ownership model is fundamentally broken:intptr_thas no destructor, so freed PyObject addresses persist as dangling pointers in thedeclared_transformcache.The fix requires either wrapping
intptr_tin a RAII type with ref-counting semantics, or restructuring ownership so only one consumer DECREFs each PyObject.Documentation
Added
python-refcounting.mddocumenting the intended one-to-one ownership model and known limitations.Warning
Firewall rules blocked me from connecting to one or more addresses (expand for details)
I tried to connect to the following addresses, but was blocked by firewall rules:
cli.github.com/usr/lib/apt/methods/https /usr/lib/apt/methods/https 6530858abe47310e4bca10731f385e45c10/bc8ae198dc87ab971a701b8ae79a29c999e2a4e5d7f2993f�� /type_deduction.--eh-frame-hdr ++ 1556�� -DSPDLOG_FMT_EXT-export-dynamic rapper as ED_LIB -DSPDLOG_lsb_release -isystem c10/log.json bug:/opt/spack-e/lib/x86_64-linux-gnu/crti.o -DBO�� -debug -DBOOST_JSON_DYN_LINK s/phlex-ci/.spack-env/view/lib OOST_JSON_NO_LIB/usr/bin/md5sum /include .dir/MC_truth_algorithm.cpp.o ck-environments/phlex-ci/.spack(dns block)If you need me to access, download, or install something from one of these locations, you can either:
💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.