Fix background crashes #3313
base: master
Conversation
@jb55, I changed my mind about this approach and converted this PR back to a draft. Please do not merge it yet. I started trying to recreate the crash more consistently via an automated test, both to prevent future regressions and to make sure this fix is rigorous and thorough. Even though I am confident that I understand the root cause, recreating it in a controlled environment has proven quite difficult (the time window for the crash to occur is very small and requires many things to happen in a particular sequence). Moreover, after spending some time trying to create automated tests to recreate this, I realized that perhaps we should fix this at a different level. I am now starting to investigate and brainstorm how I can fix it at that level.
Force-pushed from b238051 to 6f80a17
@jb55, so far I was able to bring the repro rate down from about 4 crashes out of 7 test runs (4/7) to 1 crash out of 16 test runs (1/16) with the fixes in this PR. Note: each "test run" performs about 10K iterations of simultaneous queries and NostrDB close/open calls. Unfortunately I still see occasional edge cases where it crashes with a memory error. Before I move on to the next steps, what do you think of the general approach and the concurrency design decisions taken here? Do you have any insights or ideas on ways we can design these code paths to be more thread-safe in general?
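For reference, a minimal sketch of what one such run of simultaneous queries racing close/open cycles could look like. This is not the actual test in this PR; ToyNdb and every other name here is made up:

```swift
import Foundation

// Illustrative stand-in for a database handle with an "open" flag.
// None of these names come from the damus codebase.
final class ToyNdb: @unchecked Sendable {
    private let lock = NSLock()
    private var isOpen = true

    func query() -> Bool {
        lock.lock(); defer { lock.unlock() }
        return isOpen               // a real query would touch nostrdb memory here
    }

    func close()  { lock.lock(); isOpen = false; lock.unlock() }
    func reopen() { lock.lock(); isOpen = true;  lock.unlock() }
}

let db = ToyNdb()

// One "test run": several tasks issue queries as fast as they can while
// another task keeps cycling the database closed and open underneath them.
await withTaskGroup(of: Void.self) { group in
    for _ in 0..<8 {
        group.addTask {
            for _ in 0..<10_000 { _ = db.query() }
        }
    }
    group.addTask {
        for _ in 0..<10_000 {
            db.close()
            db.reopen()
        }
    }
}
```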
Does this review pass your laugh test, @danieldaquino? Given the goal of preventing new streams from sneaking in during the short close window …

Once those are fixed, consider these design tweaks:
On Mon, Nov 17, 2025 at 06:21:12PM -0800, Daniel D’Aquino wrote:
danieldaquino left a comment (damus-io/damus#3313)
Unfortunately I still see occasional edge cases where it still crashes with a memory error. Before I move on with the next steps, what do you think of the general approach and concurrency design decisions taken here? Do you have any insights or ideas on ways we can design these code paths to be more thread-safe in general?
sorry I have been travelling, i'll try to take a look soon on my flight
@jb55, quick update: the remaining crash I am seeing during the test may be a different crash with a separate cause (even though it looks similar). I will try to reproduce the remaining crash with the Address Sanitizer turned on to get more details. In any case, whenever you have a chance, please let me know your thoughts on the approach taken!
Those are unused, and they cause awkward implementations when different error types need to be used.

Signed-off-by: Daniel D’Aquino <[email protected]>
…un to close

We have mechanisms in place to close NostrDB streams when the database needs to close; however, there is a short time window where those streams are closing down but the database still has its "open" property set to `true`, which means that new NostrDB streams may open. If that happens, those new streams will still be active when NostrDB gets closed down, potentially causing memory crashes.

This was found by inspecting several crash logs and noticing that:

- most of the `ndb.close` calls are coming from the general backgrounding task (not the last-resort backgrounding task), where all old tasks are guaranteed to have closed (we wait for all of them to close before proceeding to closing NostrDB).
- the stack traces of the crashed threads show that, in most cases, the streams crash while they are in the query stage (which means that those must have been very recently opened).

The issue was mitigated by signalling that NostrDB has closed (without actually closing it) before cancelling any streaming tasks and officially closing NostrDB. This way, new NostrDB streaming tasks will notice that the database is closed and will wait for it to reopen.

No changelog entry is needed as this issue was introduced after our last public release.

Changelog-None
Signed-off-by: Daniel D’Aquino <[email protected]>
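For illustration, the ordering this commit describes could look roughly like the sketch below. All names here (isOpen, streamTasks, closeUnderlyingDatabase) are invented, not the actual Ndb.swift identifiers:

```swift
import Foundation

final class ToyNdbLifecycle {
    private(set) var isOpen = true                    // what new streams consult
    private var streamTasks: [Task<Void, Never>] = []

    func closeForBackgrounding() async {
        // 1. Signal closure first, so a stream that is just about to start
        //    sees isOpen == false and waits for a reopen instead of running.
        isOpen = false

        // 2. Cancel the streams that are already running.
        streamTasks.forEach { $0.cancel() }

        // 3. Wait for every stream to actually finish before touching the C handle.
        for task in streamTasks { await task.value }
        streamTasks.removeAll()

        // 4. Only now is it safe to close the underlying database.
        closeUnderlyingDatabase()
    }

    private func closeUnderlyingDatabase() {
        // placeholder for the real nostrdb close call
    }
}
```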
Signed-off-by: Daniel D’Aquino <[email protected]>
…rder

This adds a sync mechanism in Ndb.swift to coordinate certain usage of nostrdb.c calls with the need to close nostrdb due to app lifecycle requirements. Furthermore, it fixes the order of operations when re-opening NostrDB, to avoid race conditions where a query uses an older Ndb generation.

This sync mechanism allows multiple queries to happen simultaneously (from the Swift side), while preventing ndb from closing during such usages. It does so while keeping the Ndb interface sync and nonisolated, which keeps the API easy to use from Swift/SwiftUI and allows parallel operations to occur.

If Swift actors were to be used (e.g. creating an NdbActor), the Ndb.swift interface would change in such a way that it would propagate the need for several changes throughout the codebase, including loading logic in some ViewModels. Furthermore, it would likely decrease performance by forcing Ndb.swift operations to run sequentially when they could run in parallel.

A changelog is not needed since the background crashes were not widespread in the last public release.

Changelog-None
Closes: damus-io#3245
Signed-off-by: Daniel D’Aquino <[email protected]>
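As an illustration of the trade-off this commit describes (parallel queries, exclusive close, no actor), here is a sketch using a concurrent DispatchQueue with barrier writes. It is not the code in this PR, and all names are invented:

```swift
import Foundation

final class ToyNdbGate {
    // Concurrent queue: plain sync reads may run in parallel with each other,
    // while .barrier blocks make close/reopen exclusive.
    private let gate = DispatchQueue(label: "toy.ndb.gate", attributes: .concurrent)
    private var isOpen = true

    /// Synchronous, nonisolated query path: callers do not need `await`,
    /// and independent queries can still execute in parallel.
    func withOpenNdb<T>(_ body: () -> T) -> T? {
        gate.sync { () -> T? in
            guard isOpen else { return nil }   // database closed: caller backs off
            return body()                      // safe: close cannot run concurrently
        }
    }

    func close() {
        gate.sync(flags: .barrier) {           // waits for in-flight queries to drain
            isOpen = false
            // close the underlying nostrdb handle here
        }
    }

    func reopen() {
        gate.sync(flags: .barrier) {
            // reopen the underlying nostrdb handle here
            isOpen = true
        }
    }
}
```

The barrier close blocks until in-flight reads finish, mirroring the "wait for all of them to close before closing NostrDB" behaviour described in the commits.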
Force-pushed from 6f80a17 to 17e2721
By analyzing the crash with Address Sanitizer enabled, I was able to find the cause of the remaining crashes arising in roughly 1 out of 16 test runs (i.e. 1 out of 160K cycles of concurrent queries and ndb open/close operations). Those were TOCTOU (time-of-check to time-of-use) race conditions around some of the ndb call sites.

I have addressed this issue by adding those checks inside the sync mechanism. I reran the test with the new fix for 20 iterations (i.e. about 200K cycles of concurrent queries and ndb open/close operations) and did not run into any crashes this time. I am not sure I am 100% happy with my design yet, though; perhaps it is not elegant enough, or maybe it's just the fact that I am only using it in a few high-risk functions and not globally. I think I need to sleep on it. @jb55, please let me know if you have any comments on the design of the syncing mechanism and its application.
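For illustration, the TOCTOU shape being described looks roughly like the sketch below (invented names, not the PR's code): checking the open flag in one critical section and using the handle in another leaves a window for a close to land in between; doing the check inside the same critical section as the use removes that window.

```swift
import Foundation

final class ToyHandle {
    private let lock = NSLock()
    private var isOpen = true

    // Time of check: the flag is read in its own critical section...
    var open: Bool {
        lock.lock(); defer { lock.unlock() }
        return isOpen
    }

    // ...time of use: by now another thread may already have closed the handle.
    func use<T>(_ body: () -> T) -> T {
        lock.lock(); defer { lock.unlock() }
        return body()
    }

    // Fixed shape: check and use share one critical section, so there is no
    // window in which a close can land between them.
    func useIfOpen<T>(_ body: () -> T) -> T? {
        lock.lock(); defer { lock.unlock() }
        guard isOpen else { return nil }
        return body()
    }

    func close() {
        lock.lock(); defer { lock.unlock() }
        isOpen = false
    }
}

// Racy call pattern (time-of-check vs. time-of-use):
//   if handle.open { handle.use { ... } }
// Safe call pattern:
//   handle.useIfOpen { ... }
```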
I included this in an experimental TestFlight build, 1.16 (1215), for some longer-term use testing.
Summary
We have mechanisms in place to close NostrDB streams when the database needs to close; however, there is a short time window where those streams are closing down but the database still has its "open" property set to `true`, which means that new NostrDB streams may open. If that happens, those new streams will still be active when NostrDB gets closed down, causing memory crashes.

This was found by inspecting several crash logs and noticing that:

- most of the `ndb.close` calls are coming from the general backgrounding task (not the last-resort backgrounding task), where all old tasks are guaranteed to have closed (we wait for all of them to close before proceeding to closing NostrDB).

The issue was fixed by signalling that NostrDB has closed (without actually closing it) before cancelling any streaming tasks and officially closing NostrDB. This way, new NostrDB streaming tasks will notice that the database is closed and will wait for it to reopen.
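As an illustration of the "notice the database is closed and wait for it to reopen" behaviour, here is a sketch with invented names, using a simple polling wait in place of whatever signalling the real implementation uses:

```swift
import Foundation

actor ToyOpenState {
    private var isOpen = true

    func markClosed() { isOpen = false }
    func markOpen()   { isOpen = true }

    /// Called by a new streaming task before it touches nostrdb: returns
    /// immediately if the database is open, otherwise suspends until a reopen.
    /// (A real implementation would use a proper signal instead of polling.)
    func waitUntilOpen() async {
        while !isOpen {
            try? await Task.sleep(nanoseconds: 50_000_000)   // check again in 50 ms
        }
    }
}

// A streaming task would then do something like:
//   await openState.waitUntilOpen()
//   // ... start the NostrDB stream ...
```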
No changelog entry is needed as this issue was introduced after our last public release.
Changelog-None
Closes: #3245
Checklist
Standard PR Checklist
Added Closes: or Fixes: tags in the commit messages wherever applicable, or made sure those are not needed. See Submitting patches.

Test report
TBD, I will run this for a few days on my phone to see if the background crash occurs.
Results:
Other notes
[Please provide any other information that you think is relevant to this PR.]