Conversation

@danieldaquino
Collaborator

Summary

We have mechanisms in place to close NostrDB streams when the database needs to close; however, there is a short time window where those streams are closing down but the database still has its "open" property set to true, which means that new NostrDB streams may open. If that happens, those new streams will still be active when NostrDB gets closed down, causing memory crashes.

This was found by inspecting several crash logs and noticing that:

  • most of the ndb.close calls come from the general backgrounding task (not the last-resort backgrounding task), where all old tasks are guaranteed to have closed (we wait for all of them to close before proceeding to close NostrDB).
  • the stack traces of the crashed threads show that, in most cases, the streams crash while they are in the query stage (which means those streams must have been opened very recently).

The issue was fixed by signalling that NostrDB has closed (without actually closing it) before cancelling any streaming tasks and officially closing NostrDB. This way, new NostrDB streaming tasks will notice that the database is closed and will wait for it to reopen.
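
To make the ordering concrete, here is a minimal sketch of the shutdown sequence described above, assuming hypothetical names (StreamCoordinator, openStreamIfPossible, closeNdb); the actual Damus implementation keeps a synchronous, non-actor Ndb API and differs in detail:

```swift
import Foundation

// Hypothetical sketch of the shutdown ordering described above. The names
// (StreamCoordinator, openStreamIfPossible, closeNdb) are illustrative only;
// the actual Damus code keeps a synchronous, non-actor Ndb API.
actor StreamCoordinator {
    private(set) var isClosed = false
    private var streamTasks: [Task<Void, Never>] = []

    /// New streams check the flag first; once it is set they refuse to start
    /// and (in the real app) would instead wait for the database to reopen.
    func openStreamIfPossible(_ work: @escaping @Sendable () async -> Void) {
        guard !isClosed else { return }
        streamTasks.append(Task { await work() })
    }

    /// Backgrounding path: signal "closed" *before* cancelling anything, so a
    /// stream that is just about to open sees a closed database instead of
    /// racing the teardown.
    func backgroundShutdown(closeNdb: @Sendable () -> Void) async {
        isClosed = true                               // 1. signal closed
        for task in streamTasks { task.cancel() }     // 2. cancel running streams
        for task in streamTasks { await task.value }  //    and wait for them
        streamTasks.removeAll()
        closeNdb()                                    // 3. only now close NostrDB/LMDB
    }
}
```

The important part is step 1: flipping the flag before any cancellation closes the window in which a brand-new stream could still observe an "open" database and outlive the close.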

No changelog entry is needed as this issue was introduced after our last public release.

Changelog-None
Closes: #3245

Checklist

Standard PR Checklist

  • I have read (or I am familiar with) the Contribution Guidelines
  • I have tested the changes in this PR
  • I have profiled the changes to ensure there are no performance regressions, or I do not need to profile the changes.
  • I have opened or referred to an existing github issue related to this change.
  • My PR is either small, or I have split it into smaller logical commits that are easier to review
  • I have added the signoff line to all my commits. See Signing off your work
  • I have added appropriate changelog entries for the changes in this PR. See Adding changelog entries
    • I do not need to add a changelog entry. Reason: Issue was introduced after our last public release.
  • I have added appropriate Closes: or Fixes: tags in the commit messages wherever applicable, or made sure those are not needed. See Submitting patches

Test report

TBD; I will run this for a few days on my phone to see whether the background crash occurs.

Results:

  • PASS
  • Partial PASS
    • Details: [Please provide details of the partial pass]

Other notes

[Please provide any other information that you think is relevant to this PR.]

@danieldaquino danieldaquino requested a review from jb55 November 4, 2025 00:11
@danieldaquino danieldaquino changed the title from "Prevent new NostrDB streaming tasks from opening when NostrDB has begun to close" to "Fix background crashes" on Nov 4, 2025
@danieldaquino danieldaquino marked this pull request as draft November 6, 2025 22:08
@danieldaquino
Collaborator Author

@jb55, I changed my mind about this approach, and converted this PR back to a draft. Please do not merge it yet.

I started trying to recreate the crash more consistently via an automated test, both to prevent future regressions and to make sure this fix is rigorous and thorough.

Even though I am confident that I understand the root cause, recreating this in a controlled environment has proven to be quite difficult (the time window for the crash to occur is very small and requires many things to happen in a particular sequence).

Moreover, after spending some time trying to create automated tests to recreate this, I realized that perhaps we should fix this at the Ndb.swift or nostrdb.c level — considering that we want NostrDB to become a library. Otherwise other users of the NostrDB Swift library can end up running into this.

I am now starting to investigate and brainstorm how I can fix it at that level.

@danieldaquino danieldaquino force-pushed the gh-3245 branch 4 times, most recently from b238051 to 6f80a17 on November 18, 2025 01:17
@danieldaquino
Collaborator Author

@jb55, so far I have been able to bring the repro rate down from about 4 crashes in 7 test runs (4/7) to 1 crash in 16 test runs (1/16) with the fixes in this PR.

Note: Each "test run" runs about 10K iterations of simultaneous queries and NostrDB close/open calls.
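
For reference, a sketch of what one such test run looks like structurally; the FakeNdb actor and the function below are stand-ins under assumed names, not the actual test code, and they do not reproduce the real race:

```swift
import Foundation

// Hypothetical sketch of one "test run": ~10K iterations of queries racing
// against close/open calls, with both loops awaited so the test does not
// outlive its own state. FakeNdb is a stand-in, not the real Ndb type.
actor FakeNdb {
    private var isOpen = true
    func query() -> Int { isOpen ? 1 : 0 }               // pretend to touch LMDB
    func closeAndReopen() { isOpen = false; isOpen = true }
}

func stressOpenClose(iterations: Int = 10_000) async {
    let ndb = FakeNdb()
    await withTaskGroup(of: Void.self) { group in
        group.addTask {                                   // one task hammers queries...
            for _ in 0..<iterations { _ = await ndb.query() }
        }
        group.addTask {                                   // ...another closes/reopens
            for _ in 0..<iterations { await ndb.closeAndReopen() }
        }
        await group.waitForAll()                          // both loops finish before returning
    }
}
```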

Unfortunately, I still see occasional edge cases where it crashes with a memory error. Before I move on to the next steps, what do you think of the general approach and concurrency design decisions taken here? Do you have any insights or ideas on ways we can design these code paths to be more thread-safe in general?

@alltheseas
Collaborator

Does this review pass your laugh test, @danieldaquino?

Given the goal (preventing new streams from sneaking in during the short "closing" window), I'd still address the earlier findings before evaluating the overall design:

  • markClosed() currently flips is_closed to true, but close() immediately returns when is_closed is true. As written, your proposed "signal closed before actually closing" flow just neuters the real close and leaves the LMDB handle open. If you want two phases (soft-close vs. teardown), split the state (e.g., is_acceptingWork vs. is_destroyed) instead of reusing is_closed for both.
  • The force-try around waitUntilNdbCanClose means any long-running reader (>2 s) still crashes the app, so the new guard rail doesn't actually prevent the crash you're chasing.
  • The lock must be marked open for every initializer (including Ndb(ndb:) / Ndb.empty), or readers will time out immediately, making it impossible to exercise the race you're trying to mitigate.
  • The "10K iterations" stress test currently runs two unsynchronized loops and never awaits the detached task. It keeps mutating state beyond the test lifetime and captures non-Sendable objects in @Sendable closures, so it's not a reliable regression signal yet.

Once those are fixed, consider these design tweaks:

  1. Model the lifecycle explicitly: e.g., enum State { acceptingWork, draining, destroyed } protected by the mutex. Readers check state == .acceptingWork, backgrounding switches to .draining, and close() only tears down once state reaches .destroyed (see the sketch after this list).
  2. Replace the try!/timeouts with a two-phase approach: a short wait to let active readers drain, followed by an unprotected close if the app is about to be suspended. That keeps background timeouts from turning into fatal crashes.
  3. Push the "reopen/close coordination" into Ndb itself: expose an async API that handles the drain/close/reopen sequence so NostrNetworkManager doesn't have to juggle semaphore semantics.
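
A minimal sketch of points 1 and 2 combined, assuming a hypothetical NdbLifecycle type (none of these names exist in Damus today):

```swift
import Foundation

// A minimal sketch of the explicit-lifecycle idea from points 1 and 2 above.
// NdbLifecycle and its method names are hypothetical, not existing Damus APIs.
final class NdbLifecycle {
    enum State { case acceptingWork, draining, destroyed }

    private let condition = NSCondition()
    private var state: State = .acceptingWork
    private var activeReaders = 0

    /// Readers only start work while the lifecycle is accepting it.
    func beginRead() -> Bool {
        condition.lock()
        defer { condition.unlock() }
        guard state == .acceptingWork else { return false }
        activeReaders += 1
        return true
    }

    func endRead() {
        condition.lock()
        activeReaders -= 1
        condition.broadcast()   // wake a close() waiting for readers to drain
        condition.unlock()
    }

    /// Backgrounding path: stop accepting work, wait (up to a deadline) for
    /// in-flight readers to drain, then tear down. The deadline keeps a stuck
    /// reader from turning the background task into a fatal crash.
    func drainAndClose(deadline: TimeInterval, teardown: () -> Void) {
        condition.lock()
        guard state == .acceptingWork else { condition.unlock(); return }
        state = .draining
        let limit = Date().addingTimeInterval(deadline)
        while activeReaders > 0 && Date() < limit {
            _ = condition.wait(until: limit)
        }
        state = .destroyed
        condition.unlock()
        teardown()   // close LMDB exactly once, after a best-effort drain
    }
}
```

Readers that get false from beginRead() would then wait for a reopen instead of touching a dead handle.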

@jb55
Collaborator

jb55 commented Nov 19, 2025 via email

> On Mon, Nov 17, 2025 at 06:21:12PM -0800, Daniel D’Aquino wrote:
> danieldaquino left a comment (damus-io/damus#3313)
> Unfortunately I still see occasional edge cases where it still crashes with a memory error. Before I move on with the next steps, what do you think of the general approach and concurrency design decisions taken here? Do you have any insights or ideas on ways we can design these code paths to be more thread-safe in general?

Sorry, I have been travelling. I'll try to take a look soon on my flight.

@danieldaquino
Collaborator Author

@jb55, quick update: the remaining crash I am seeing during the test may be a different crash with a separate cause (despite the fact that it looks similar). I will try to reproduce it with Address Sanitizer turned ON to get more details.

In any case, whenever you have a chance, please let me know your thoughts on the approach taken!

Those are unused, and they cause awkward implementations when different error types need to be used.

Signed-off-by: Daniel D’Aquino <[email protected]>
Prevent new NostrDB streaming tasks from opening when NostrDB has begun to close

We have mechanisms in place to close NostrDB streams when the database needs to
close; however, there is a short time window where those streams are
closing down but the database still has its "open" property set to `true`,
which means that new NostrDB streams may open. If that happens, those
new streams will still be active when NostrDB gets closed down,
potentially causing memory crashes.

This was found by inspecting several crash logs and noticing that:
- most of the `ndb.close` calls are coming from the general
  backgrounding task (not the last resort backgrounding task),
  where all old tasks are guaranteed to have closed (we wait for all of
  them to close before proceeding to closing NostrDB).
- the stack traces of the crashed threads show that, in most cases, the
  stream crashes while they are in the query stage (which means that
  those must have been very recently opened).

The issue was mitigated by signalling that NostrDB has closed (without
actually closing it) before cancelling any streaming tasks and officially
closing NostrDB. This way, new NostrDB streaming tasks will notice that
the database is closed and will wait for it to reopen.

No changelog entry is needed as this issue was introduced after our last public
release.

Changelog-None
Signed-off-by: Daniel D’Aquino <[email protected]>
…rder

This adds a sync mechanism in Ndb.swift to coordinate certain usage of
nostrdb.c calls and the need to close nostrdb due to app lifecycle
requirements. Furthermore, it fixes the order of operations when
re-opening NostrDB, to avoid race conditions where a query uses an older
Ndb generation.

This sync mechanism allows multiple queries to happen simultaneously
(from the Swift-side), while preventing ndb from simultaneously closing
during such usages. It also does that while keeping the Ndb interface
sync and nonisolated, which keeps the API easy to use from
Swift/SwiftUI and allows for parallel operations to occur.

If Swift Actors were to be used (e.g. creating an NdbActor), the Ndb.swift
interface would change in such a way that it would propagate the need for
several changes throughout the codebase, including loading logic in
some ViewModels. Furthermore, it would likely decrease performance by
forcing Ndb.swift operations to run sequentially when they could run in
parallel.

A changelog is not needed since the background crashes were not widely
spread in the last public release.

Changelog-None
Closes: damus-io#3245
Signed-off-by: Daniel D’Aquino <[email protected]>
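
For illustration, a rough sketch of the shape of coordination that commit describes (many concurrent readers, exclusive close/reopen, a generation bump on reopen); the names and the concurrent-queue-with-barrier approach are assumptions, not the actual Ndb.swift code:

```swift
import Foundation

// Rough sketch (assumed names) of the coordination described in the commit:
// queries run concurrently with each other, but never concurrently with a
// close()/reopen(), and every reopen bumps a generation counter so stale
// handles can be detected.
final class NdbGuard {
    private let queue = DispatchQueue(label: "ndb.guard", attributes: .concurrent)
    private var isOpen = true
    private var generation: UInt64 = 0

    /// Run a query-like operation; returns nil if the database is closed.
    func withOpenNdb<T>(_ body: (UInt64) -> T) -> T? {
        return queue.sync { () -> T? in
            guard isOpen else { return nil }   // closed: caller waits for reopen
            return body(generation)            // safe: close() cannot run right now
        }
    }

    /// Exclusive (barrier) access: waits for all in-flight readers to finish.
    func close(_ teardown: () -> Void) {
        queue.sync(flags: .barrier) {
            guard isOpen else { return }
            teardown()                         // tear down ndb/LMDB exactly once
            isOpen = false
        }
    }

    /// Reopen and bump the generation so older handles can notice they are stale.
    func reopen(_ setup: () -> Void) {
        queue.sync(flags: .barrier) {
            guard !isOpen else { return }
            setup()
            generation += 1
            isOpen = true
        }
    }
}
```
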
@danieldaquino
Collaborator Author

By analyzing the crash with Address Sanitizer enabled, I was able to find the cause of the remaining crashes arising in roughly 1 out of 16 test runs (i.e. 1 out of 160K cycles of concurrent queries and ndb open/close operations).

Those were TOCTOU (time-of-check to time-of-use) race conditions in certain ndb users (query and NdbTxn.deinit). Rough sequence:

  1. Thread A: Function checks whether ndb is open and whether the current ndb generation number matches its own. Both checks pass, so the function continues.
  2. Thread B: Ndb.close() acquires semaphore lock to close NostrDB, and closes it.
  3. Thread A: Tries to acquire semaphore lock to keep NostrDB open. Function waits to acquire lock.
  4. Thread B: Quickly reopens NostrDB (a new generation), freeing the semaphore lock for Ndb users.
  5. Thread A: Acquires semaphore lock to keep NostrDB open before timing out. Function proceeds with its operations.
  6. Thread A: Function does not check that the NostrDB generation has changed, and uses ndb C objects from the previous generation that have already been deallocated, causing a use-after-free heap memory error and crashing.

I have addressed this issue by adding those checks inside the keepNdbOpen closures, where the semaphore lock guarantees that NostrDB will not restart underneath them.
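
A simplified sketch of that pattern; keepNdbOpen, NdbState, and NdbCoordinator are illustrative names, and the plain NSLock here stands in for the real semaphore that admits multiple concurrent readers:

```swift
import Foundation

// Hypothetical sketch of the re-check described above. keepNdbOpen, NdbState,
// and NdbCoordinator are illustrative names, not the actual Damus API, and a
// plain NSLock is used for brevity (the real mechanism allows concurrent readers).
struct NdbState {
    var isOpen: Bool
    var generation: UInt64
}

final class NdbCoordinator {
    private let lock = NSLock()
    private var state = NdbState(isOpen: true, generation: 0)

    /// Holds the lock for the duration of `body`, so NostrDB cannot be
    /// closed or reopened while the closure runs.
    func keepNdbOpen<T>(_ body: (NdbState) -> T?) -> T? {
        lock.lock()
        defer { lock.unlock() }
        return body(state)
    }

    /// Close and immediately reopen ndb, invalidating the old generation.
    func closeAndReopen() {
        lock.lock()
        defer { lock.unlock() }
        state.isOpen = false
        // ... tear down and reinitialize the underlying ndb here ...
        state.generation += 1
        state.isOpen = true
    }
}

// The crucial part of the fix: the open/generation check happens *inside*
// the locked closure, where ndb is guaranteed not to restart under us.
func runQuery(on ndb: NdbCoordinator, expectedGeneration: UInt64) -> [String]? {
    return ndb.keepNdbOpen { state in
        guard state.isOpen, state.generation == expectedGeneration else {
            // Our cached pointers belong to an older, already-freed generation.
            return nil
        }
        // Safe to use the C objects of this generation here.
        return []
    }
}
```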

I reran the test with the new fix for 20 test runs (i.e. about 200K cycles of concurrent queries and ndb open/close operations) and did not run into any crashes this time.
Damus: 17e2721

I am not sure I am 100% happy with my design yet, though; perhaps it is not elegant enough, or maybe it's just that I am only using it in a few high-risk functions rather than globally. I think I need to sleep on it. @jb55, please let me know if you have any comments on the design of the syncing mechanism and its application.

@danieldaquino
Collaborator Author

I included this in an experimental TestFlight build, 1.16 (1215), for some longer-term testing.
