-
-
Notifications
You must be signed in to change notification settings - Fork 200
fix: split inproc handler thread #1446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
* use `std::nothrow` `new` consistently to keep exception-free semantics for allocation * rename static crashpad_handler to have no module-public prefix * use `nullptr` for arguments where we previously used 0 to clarify that those are pointers * eliminate the `memset()` of the `crashpad_state_t` initialization since it now contains non-trivially constructable fields (`std::atomic`) and replace it with `new` and an empty value initializer.
…ld, since libraries like libunwind.a might be packaged without PIC.
…ms with architecture prefixes (32-bit Linux)
…stack also ensure to get the first frame harmonize libunwind usage
…eader in the libunwind walker for Linux and log as much as possible to understand where the actual crash happens
…in unmapped memory
…-trace from an arbitrary frame-pointer
…nd running the deferred code directly inside the signal handler. Nothing changes for them.
…phore on the return channel and let the OS block and wait. Also check the return value of startup_handler_thread in the initialization and propagate the failure.
…rancy guard * up to now, we've been serializing signal handling even though we didn't know whether it was a runtime signal or one we should be handling * this meant that we blocked all our critical sections during a managed exception * it also meant that we blocked any concurrent managed exceptions * it also meant that we introduced a race window during the time when we chained, because incoming signal on other threads would have gotten next in line, before we even completed the current signal handler by moving it completely outside our synchronization we truly chain at start and don't interfere until we know we must.
| } | ||
| if (sentry__atomic_fetch(&g_handler_should_exit)) { | ||
| break; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Handler thread exits without processing crash
The handler thread checks g_handler_should_exit immediately after waking from the semaphore, before checking g_handler_has_work. If shutdown is initiated after the signal handler signals the semaphore but before the handler thread processes the work flag, the crash event will be lost because the thread exits without processing it. The same issue exists on UNIX at lines 833-835. The check for g_handler_should_exit needs to happen after verifying and processing any pending work to ensure crashes are never dropped during shutdown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am open to discussion about this. I am a big fan of letting the shutdown request overrule any others. In this case, it is unlikely they will happen at the same time, but it would have to be either-or. As such, it isn't really a bug, but rather a policy decision.
| # endif | ||
|
|
||
| # ifdef SENTRY_PLATFORM_UNIX | ||
| sentry__enter_signal_handler(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Concurrent signal handlers corrupt shared handler state
On UNIX, dispatch_ucontext calls sentry__leave_signal_handler() at line 1064 before waiting for the handler thread to complete. This allows a second signal to arrive and call sentry__enter_signal_handler() successfully, then proceed to overwrite the global g_handler_state structure at lines 1042-1061 while the handler thread is still reading from it at line 853. The single global g_handler_state variable has no synchronization protecting concurrent access between multiple signal handlers and the handler thread, leading to potential data corruption when multiple crashes occur in quick succession across different threads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct, and was a recurring topic during the development of the changes in this PR. I first wanted to have feedback on how to proceed with the rest. The solution here will be a two-stage blocking mechanism, which I have successfully experimented with in previous commits. However, since the signal handler blocking must also support the other backend handlers, I wanted to have a first review.
|
@jpnurmi: I primarily added you regarding the chain-at-start handler strategy. The most significant change in that regard is that we no longer block anything when chaining at the start (see 6b6e545 for details). @vaind: I primarily added you here because I know you consume |
This is a more elaborate, long-term fix to getsentry/sentry-java#4830 than #1444.
It also finishes the work done here: #1088
And fixes the issues raised here: #1353
and here: #906
So, while the driver for this PR is a downstream issue that exposes the signal-unsafety of some parts of the current
inprocimplementation, it also addresses a much broader range of concerns that regularly affectinprocusers on all platforms.At a high level, it introduces a separate handler thread for
inproc, which the signal handler (orUEFon Windows) wakes after it exchanges crash context data.The idea is that we minimize signal handler/UEF to do the least amount of syscall stuff (or at least the subset documented in the signal-safety man-page), while the handler thread can execute functions outside that range (with limitations, since thread sync and heap allocations are still problematic). This allows us to reuse
stdiofunctionality like formatters without running squarely into UB territory or having to rewrite all utilities to async-signal-safe versions, as in #1444.There are a few considerable changes to mention:
backtrace()or any unwinder that runs from the "current" instruction address is entirely useless (ignoring the fact thatbacktrace()was always signal-unsafe to begin with, which itself was the source of crashes, hangs or just empty stack traces).inproc, which we already partially acknowledged in Using libunwind for mac, since backtrace do not expect thread context… #1088 and fix: support musl on Linux #1233.libunwind(thenognuimplementation, not thellvmone, which is a pure C++ exception unwinder), which is a breaking change (at least in the sense that users now require an additional dependency at build and runtime). This means that the "general" Linux usage is now the same as with themusl libcenvironments.arm64andx86-64, and use the system-providedlibunwindfor the default stack-trace from a call-site. It turned out that the system-providedlibunwindwasn't safe enough to use in the context of the signal handler (either led to hangs or had issues with escaping the trampoline). This means users can now useinprocon macOS again (if their code is compiled without omitting frame pointers, which is always the case by default on macOS).Further improvements/fixes (summarizing the 30 commits, which I didn't want to squash):
libunwind-based unwinder modules now also validate retrieved ucontext pointers against memory mapping (for Linux and macOS)__syncfunctions and replaced them with__atomic(especially the signal handler blocking logic and the spinlock)newwithstd::nothrowthroughout the affected backend code (including the initialization ofcrashpad_state_t, which still usedmallocandmemsetalthough it hasstd::atomicmembers)TODOs:
x86-64stackwalker formacOS(and clean up the code)libbacktracefallback at all and how to handle it.inprocinprocnot working onmacOSlibunwinddependency on Linux