Conversation

@danieldaquino
Collaborator

Summary

We have mechanisms in place to close NostrDB streams when the database needs to close; however, there is a short time window where those streams are closing down but the database still has its "open" property set to true, which means that new NostrDB streams may open. If that happens, those new streams will still be active when NostrDB gets closed down, causing memory crashes.

This was found by inspecting several crash logs and noticing that:

  • most of the ndb.close calls come from the general backgrounding task (not the last-resort backgrounding task), where all old tasks are guaranteed to have closed (we wait for all of them to close before proceeding to close NostrDB).
  • the stack traces of the crashed threads show that, in most cases, the streams crash while they are in the query stage (which means those streams must have been opened very recently).

The issue was fixed by signalling that NostrDB has closed (without actually closing it) before cancelling any streaming tasks and officially closing NostrDB. This way, new NostrDB streaming tasks will notice that the database is closed and will wait for it to reopen.
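
To make the ordering concrete, here is a minimal sketch of the shutdown sequence described above, assuming hypothetical names (StreamCoordinator, openStreamIfPossible, closeNdb); the actual Damus implementation keeps a synchronous, non-actor Ndb API and differs in detail:

```swift
import Foundation

// Hypothetical sketch of the shutdown ordering described above. The names
// (StreamCoordinator, openStreamIfPossible, closeNdb) are illustrative only;
// the actual Damus code keeps a synchronous, non-actor Ndb API.
actor StreamCoordinator {
    private(set) var isClosed = false
    private var streamTasks: [Task<Void, Never>] = []

    /// New streams check the flag first; once it is set they refuse to start
    /// and (in the real app) would instead wait for the database to reopen.
    func openStreamIfPossible(_ work: @escaping @Sendable () async -> Void) {
        guard !isClosed else { return }
        streamTasks.append(Task { await work() })
    }

    /// Backgrounding path: signal "closed" *before* cancelling anything, so a
    /// stream that is just about to open sees a closed database instead of
    /// racing the teardown.
    func backgroundShutdown(closeNdb: @Sendable () -> Void) async {
        isClosed = true                               // 1. signal closed
        for task in streamTasks { task.cancel() }     // 2. cancel running streams
        for task in streamTasks { await task.value }  //    and wait for them
        streamTasks.removeAll()
        closeNdb()                                    // 3. only now close NostrDB/LMDB
    }
}
```

The important part is step 1: flipping the flag before any cancellation closes the window in which a brand-new stream could still observe an "open" database and outlive the close.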

No changelog entry is needed as this issue was introduced after our last public release.

Changelog-None
Closes: #3245

Checklist

Standard PR Checklist

  • I have read (or I am familiar with) the Contribution Guidelines
  • I have tested the changes in this PR
  • I have profiled the changes to ensure there are no performance regressions, or I do not need to profile the changes.
  • I have opened or referred to an existing github issue related to this change.
  • My PR is either small, or I have split it into smaller logical commits that are easier to review
  • I have added the signoff line to all my commits. See Signing off your work
  • I have added appropriate changelog entries for the changes in this PR. See Adding changelog entries
    • I do not need to add a changelog entry. Reason: Issue was introduced after our last public release.
  • I have added appropriate Closes: or Fixes: tags in the commit messages wherever applicable, or made sure those are not needed. See Submitting patches

Test report

TBD; I will run this for a few days on my phone to see whether the background crash occurs.

Results:

  • PASS
  • Partial PASS
    • Details: [Please provide details of the partial pass]

Other notes

[Please provide any other information that you think is relevant to this PR.]

@danieldaquino danieldaquino requested a review from jb55 November 4, 2025 00:11
@danieldaquino danieldaquino changed the title from "Prevent new NostrDB streaming tasks from opening when NostrDB has begun to close" to "Fix background crashes" on Nov 4, 2025
@danieldaquino danieldaquino marked this pull request as draft November 6, 2025 22:08
@danieldaquino
Collaborator Author

@jb55, I changed my mind about this approach, and converted this PR back to a draft. Please do not merge it yet.

I started trying to recreate the crash more consistently via an automated test, both to prevent future regressions and to make sure this fix is rigorous and thorough.

Even though I am confident that I understand the root cause, recreating this in a controlled environment has proven to be quite difficult (the time window for the crash to occur is very small and requires many things to happen in a particular sequence).

Moreover, after spending some time trying to create automated tests to recreate this, I realized that perhaps we should fix this at the Ndb.swift or nostrdb.c level — considering that we want NostrDB to become a library. Otherwise other users of the NostrDB Swift library can end up running into this.

I am now starting to investigate and brainstorm how I can fix it at that level.

@danieldaquino danieldaquino force-pushed the gh-3245 branch 4 times, most recently from b238051 to 6f80a17 on November 18, 2025 01:17
@danieldaquino
Collaborator Author

@jb55, so far I have been able to bring the repro rate down from about 4 crashes in 7 test runs (4/7) to 1 crash in 16 test runs (1/16) with the fixes in this PR.

Note: Each "test run" runs about 10K iterations of simultaneous queries and NostrDB close/open calls.
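
For reference, a sketch of what one such test run looks like structurally; the FakeNdb actor and the function below are stand-ins under assumed names, not the actual test code, and they do not reproduce the real race:

```swift
import Foundation

// Hypothetical sketch of one "test run": ~10K iterations of queries racing
// against close/open calls, with both loops awaited so the test does not
// outlive its own state. FakeNdb is a stand-in, not the real Ndb type.
actor FakeNdb {
    private var isOpen = true
    func query() -> Int { isOpen ? 1 : 0 }               // pretend to touch LMDB
    func closeAndReopen() { isOpen = false; isOpen = true }
}

func stressOpenClose(iterations: Int = 10_000) async {
    let ndb = FakeNdb()
    await withTaskGroup(of: Void.self) { group in
        group.addTask {                                   // one task hammers queries...
            for _ in 0..<iterations { _ = await ndb.query() }
        }
        group.addTask {                                   // ...another closes/reopens
            for _ in 0..<iterations { await ndb.closeAndReopen() }
        }
        await group.waitForAll()                          // both loops finish before returning
    }
}
```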

Unfortunately, I still see occasional edge cases where it crashes with a memory error. Before I move on to the next steps, what do you think of the general approach and concurrency design decisions taken here? Do you have any insights or ideas on ways we can design these code paths to be more thread-safe in general?

@alltheseas
Collaborator

Does this review pass your laugh test, @danieldaquino?

Given the goal (preventing new streams from sneaking in during the short "closing" window), I'd still address the earlier findings before evaluating the overall design:

  • markClosed() currently flips is_closed to true, but close() immediately returns when is_closed is true. As written, your proposed "signal closed before actually closing" flow just neuters the real close and leaves the LMDB handle open. If you want two phases (soft-close vs. teardown), split the state (e.g., is_acceptingWork vs. is_destroyed) instead of reusing is_closed for both.
  • The force-try around waitUntilNdbCanClose means any long-running reader (>2 s) still crashes the app, so the new guard rail doesn't actually prevent the crash you're chasing.
  • The lock must be marked open for every initializer (including Ndb(ndb:) / Ndb.empty), or readers will time out immediately, making it impossible to exercise the race you're trying to mitigate.
  • The "10K iterations" stress test currently runs two unsynchronized loops and never awaits the detached task. It keeps mutating state beyond the test lifetime and captures non-Sendable objects in @Sendable closures, so it's not a reliable regression signal yet.

Once those are fixed, consider these design tweaks:

  1. Model the lifecycle explicitly: e.g., enum State { acceptingWork, draining, destroyed } protected by the mutex. Readers check state == .acceptingWork, backgrounding switches to .draining, and close() only tears down once state reaches .destroyed (see the sketch after this list).
  2. Replace the try!/timeouts with a two-phase approach: a short wait to let active readers drain, followed by an unprotected close if the app is about to be suspended. That keeps background timeouts from turning into fatal crashes.
  3. Push the "reopen/close coordination" into Ndb itself: expose an async API that handles the drain/close/reopen sequence so NostrNetworkManager doesn't have to juggle semaphore semantics.
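
A minimal sketch of points 1 and 2 combined, assuming a hypothetical NdbLifecycle type (none of these names exist in Damus today):

```swift
import Foundation

// A minimal sketch of the explicit-lifecycle idea from points 1 and 2 above.
// NdbLifecycle and its method names are hypothetical, not existing Damus APIs.
final class NdbLifecycle {
    enum State { case acceptingWork, draining, destroyed }

    private let condition = NSCondition()
    private var state: State = .acceptingWork
    private var activeReaders = 0

    /// Readers only start work while the lifecycle is accepting it.
    func beginRead() -> Bool {
        condition.lock()
        defer { condition.unlock() }
        guard state == .acceptingWork else { return false }
        activeReaders += 1
        return true
    }

    func endRead() {
        condition.lock()
        activeReaders -= 1
        condition.broadcast()   // wake a close() waiting for readers to drain
        condition.unlock()
    }

    /// Backgrounding path: stop accepting work, wait (up to a deadline) for
    /// in-flight readers to drain, then tear down. The deadline keeps a stuck
    /// reader from turning the background task into a fatal crash.
    func drainAndClose(deadline: TimeInterval, teardown: () -> Void) {
        condition.lock()
        guard state == .acceptingWork else { condition.unlock(); return }
        state = .draining
        let limit = Date().addingTimeInterval(deadline)
        while activeReaders > 0 && Date() < limit {
            _ = condition.wait(until: limit)
        }
        state = .destroyed
        condition.unlock()
        teardown()   // close LMDB exactly once, after a best-effort drain
    }
}
```

Readers that get false from beginRead() would then wait for a reopen instead of touching a dead handle.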

@jb55
Collaborator

jb55 commented Nov 19, 2025 via email

> On Mon, Nov 17, 2025 at 06:21:12PM -0800, Daniel D’Aquino wrote:
> danieldaquino left a comment (damus-io/damus#3313)
> Unfortunately I still see occasional edge cases where it still crashes with a memory error. Before I move on with the next steps, what do you think of the general approach and concurrency design decisions taken here? Do you have any insights or ideas on ways we can design these code paths to be more thread-safe in general?

Sorry, I have been travelling. I'll try to take a look soon on my flight.

@danieldaquino
Collaborator Author

@jb55, quick update: the remaining crash I am seeing during the test may be a different crash with a separate cause (despite the fact that it looks similar). I will try to reproduce it with Address Sanitizer turned ON to get more details.

In any case, whenever you have a chance, please let me know your thoughts on the approach taken!

Those are unused, and they cause awkward implementations when different error types need to be used.

Signed-off-by: Daniel D’Aquino <[email protected]>
Prevent new NostrDB streaming tasks from opening when NostrDB has begun to close

We have mechanisms in place to close NostrDB streams when the database needs to
close; however, there is a short time window where those streams are
closing down but the database still has its "open" property set to `true`,
which means that new NostrDB streams may open. If that happens, those
new streams will still be active when NostrDB gets closed down,
potentially causing memory crashes.

This was found by inspecting several crash logs and noticing that:
- most of the `ndb.close` calls are coming from the general
  backgrounding task (not the last resort backgrounding task),
  where all old tasks are guaranteed to have closed (we wait for all of
  them to close before proceeding to closing NostrDB).
- the stack traces of the crashed threads show that, in most cases, the
  stream crashes while they are in the query stage (which means that
  those must have been very recently opened).

The issue was mitigated by signalling that NostrDB has closed (without
actually closing it) before cancelling any streaming tasks and officially
closing NostrDB. This way, new NostrDB streaming tasks will notice that
the database is closed and will wait for it to reopen.

No changelog entry is needed as this issue was introduced after our last public
release.

Changelog-None
Signed-off-by: Daniel D’Aquino <[email protected]>
…rder

This adds a sync mechanism in Ndb.swift to coordinate certain usage of
nostrdb.c calls and the need to close nostrdb due to app lifecycle
requirements. Furthermore, it fixes the order of operations when
re-opening NostrDB, to avoid race conditions where a query uses an older
Ndb generation.

This sync mechanism allows multiple queries to happen simultaneously
(from the Swift-side), while preventing ndb from simultaneously closing
during such usages. It also does that while keeping the Ndb interface
sync and nonisolated, which keeps the API easy to use from
Swift/SwiftUI and allows for parallel operations to occur.

If Swift Actors were to be used (e.g. creating an NdbActor), the Ndb.swift
interface would change in such a way that it would propagate the need for
several changes throughout the codebase, including loading logic in
some ViewModels. Furthermore, it would likely decrease performance by
forcing Ndb.swift operations to run sequentially when they could run in
parallel.

A changelog is not needed since the background crashes were not widely
spread in the last public release.

Changelog-None
Closes: damus-io#3245
Signed-off-by: Daniel D’Aquino <[email protected]>
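
For illustration, a rough sketch of the shape of coordination that commit describes (many concurrent readers, exclusive close/reopen, a generation bump on reopen); the names and the concurrent-queue-with-barrier approach are assumptions, not the actual Ndb.swift code:

```swift
import Foundation

// Rough sketch (assumed names) of the coordination described in the commit:
// queries run concurrently with each other, but never concurrently with a
// close()/reopen(), and every reopen bumps a generation counter so stale
// handles can be detected.
final class NdbGuard {
    private let queue = DispatchQueue(label: "ndb.guard", attributes: .concurrent)
    private var isOpen = true
    private var generation: UInt64 = 0

    /// Run a query-like operation; returns nil if the database is closed.
    func withOpenNdb<T>(_ body: (UInt64) -> T) -> T? {
        return queue.sync { () -> T? in
            guard isOpen else { return nil }   // closed: caller waits for reopen
            return body(generation)            // safe: close() cannot run right now
        }
    }

    /// Exclusive (barrier) access: waits for all in-flight readers to finish.
    func close(_ teardown: () -> Void) {
        queue.sync(flags: .barrier) {
            guard isOpen else { return }
            teardown()                         // tear down ndb/LMDB exactly once
            isOpen = false
        }
    }

    /// Reopen and bump the generation so older handles can notice they are stale.
    func reopen(_ setup: () -> Void) {
        queue.sync(flags: .barrier) {
            guard !isOpen else { return }
            setup()
            generation += 1
            isOpen = true
        }
    }
}
```
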
@danieldaquino
Collaborator Author

By analyzing the crash with Address Sanitizer enabled, I was able to find the cause of the remaining crashes arising in roughly 1 out of 16 test runs (i.e. 1 out of 160K cycles of concurrent queries and ndb open/close operations).

Those were TOCTOU (time-of-check to time-of-use) race conditions in certain ndb users (query and NdbTxn.deinit). Rough sequence:

  1. Thread A: Function checks whether ndb is open and whether the current ndb generation number matches its own. Both checks pass, so the function continues.
  2. Thread B: Ndb.close() acquires semaphore lock to close NostrDB, and closes it.
  3. Thread A: Tries to acquire semaphore lock to keep NostrDB open. Function waits to acquire lock.
  4. Thread B: Quickly reopens NostrDB (a new generation), freeing the semaphore lock for Ndb users.
  5. Thread A: Acquires semaphore lock to keep NostrDB open before timing out. Function proceeds with its operations.
  6. Thread A: Function does not check that the NostrDB generation has changed, and uses ndb C objects from the previous generation that have already been deallocated, causing a use-after-free heap memory error and crashing.

I have addressed this issue by adding those checks inside the keepNdbOpen closures, where the semaphore lock guarantees that NostrDB will not restart underneath them.
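
A simplified sketch of that pattern; keepNdbOpen, NdbState, and NdbCoordinator are illustrative names, and the plain NSLock here stands in for the real semaphore that admits multiple concurrent readers:

```swift
import Foundation

// Hypothetical sketch of the re-check described above. keepNdbOpen, NdbState,
// and NdbCoordinator are illustrative names, not the actual Damus API, and a
// plain NSLock is used for brevity (the real mechanism allows concurrent readers).
struct NdbState {
    var isOpen: Bool
    var generation: UInt64
}

final class NdbCoordinator {
    private let lock = NSLock()
    private var state = NdbState(isOpen: true, generation: 0)

    /// Holds the lock for the duration of `body`, so NostrDB cannot be
    /// closed or reopened while the closure runs.
    func keepNdbOpen<T>(_ body: (NdbState) -> T?) -> T? {
        lock.lock()
        defer { lock.unlock() }
        return body(state)
    }

    /// Close and immediately reopen ndb, invalidating the old generation.
    func closeAndReopen() {
        lock.lock()
        defer { lock.unlock() }
        state.isOpen = false
        // ... tear down and reinitialize the underlying ndb here ...
        state.generation += 1
        state.isOpen = true
    }
}

// The crucial part of the fix: the open/generation check happens *inside*
// the locked closure, where ndb is guaranteed not to restart under us.
func runQuery(on ndb: NdbCoordinator, expectedGeneration: UInt64) -> [String]? {
    return ndb.keepNdbOpen { state in
        guard state.isOpen, state.generation == expectedGeneration else {
            // Our cached pointers belong to an older, already-freed generation.
            return nil
        }
        // Safe to use the C objects of this generation here.
        return []
    }
}
```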

I reran the test with the new fix for 20 test runs (i.e. about 200K cycles of concurrent queries and ndb open/close operations) and did not run into any crashes this time.
Damus: 17e2721

I am not sure I am 100% happy with my design yet, though; perhaps it is not elegant enough, or maybe it's just that I am only using it in a few high-risk functions rather than globally. I think I need to sleep on it. @jb55, please let me know if you have any comments on the design of the syncing mechanism and its application.

@danieldaquino
Collaborator Author

I included this in an experimental TestFlight build, 1.16 (1215), for some longer-term testing.
