Skip to content

Conversation

@falkoschindler
Copy link
Contributor

@falkoschindler falkoschindler commented Nov 8, 2025

Motivation

This PR finally solves the KeyError described in #5119, which has been silenced in #5090 without understanding the root cause. As it turned out, Client.prune_instances falsely treated reconnecting clients as stale and deleted them. After the client gave up waiting for a reconnect, it got deleted as second time.

Implementation

This PR

  1. checks if there are client._delete_tasks, and
  2. removes the condition if self.id in Client.instances from Lots of KeyError after Redis connection close fix #5090.

This should fix the root cause and stops hiding KeyErrors in case clients are still deleted multiple times.

We could also make delete() idempotent so that it can be called repeatedly. But as this bug shows, it's better to crash if an assumption doesn't hold than to hide it with too many precautions.

Progress

  • I chose a meaningful title that completes the sentence: "If applied, this PR will..."
  • The implementation is complete.
  • Pytests have been added.
  • test_user_storage_is_pruned is failing. Why?
  • Documentation is not necessary.

@falkoschindler falkoschindler added this to the 3.3 milestone Nov 8, 2025
@falkoschindler falkoschindler requested a review from evnchn November 8, 2025 22:24
@falkoschindler falkoschindler added bug Type/scope: A problem that needs fixing review Status: PR is open and needs review 🟠 major Priority: Important, but not urgent labels Nov 8, 2025
@falkoschindler falkoschindler linked an issue Nov 8, 2025 that may be closed by this pull request
3 tasks
@evnchn
Copy link
Collaborator

evnchn commented Nov 9, 2025

I'm thinking of a test for this.

Meanwhile I'm thinking also, how can we speed up the debugging should it come next time with improvements in the codebase and tooling?

  • Client.prune_instances is "meant to find clients that only requested the HTTP response and never created a websocket connection" is non-obvious from first sight and the docstring "Prune stale clients" would imply that it is responsible for pruning all stale clients.
    • We should update that. I'm proposing "Prune stale clients which never connected via websocket. Those which did, the _delete_tasks ensures that they self-destruct"
    • Possibly we can refactor the if-condition at line 405-408 to if never_socket(client) and is_stale(client): by defining two helper functions?
  • For VSCode, is there any way to mark @property such as has_socket_connection in a different color?

@evnchn
Copy link
Collaborator

evnchn commented Nov 9, 2025

14a6954 mirrors case 1 in #5119 (comment).

Can't manage to do case 2 though, but case 1 is more representative of the original issue (KeyError in the logs) anyways.

Copy link
Collaborator

@evnchn evnchn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests should pass (I already see test_client_lifecycle.py passing, if you're unsure give it 20 minutes from now). Let's merge this fast.

Bonus items in #5422 (comment) should be non-blocking can be addressed anytime afterwards.

@evnchn
Copy link
Collaborator

evnchn commented Nov 9, 2025

Question: Why the dual-mode?

  • Client to be destroyed by Client.prune_instances before socket connection.
  • Client self-destroy after socket connection.

Can we populate a delete task in _delete_task on time-0 so that client always self-destroy? Then we can get rid of Client.prune_instances entirely.

@falkoschindler
Copy link
Contributor Author

@evnchn Interesting idea to get rid of the pruning loop! I'm just a bit concerned about the amount of code change to get it working. Before the first handshake we have no document_id, so we don't have an obvious key for populating the dictionaries. But in principle it should be possible.

By the way, I wonder if we actually need document IDs now that we don't have shared clients anymore (ignoring PR #5421). So maybe there is even more room for simplification.

Anyway, let's leave this PR as it is for now, allowing to merge this critical bugfix. We can try refactoring the client lifecycle in a separate PR.

@evnchn
Copy link
Collaborator

evnchn commented Nov 9, 2025

document_id is public API so axing it will come at 4.0 aka later (TM), or never, if #5421 catches on.

Other than that, code simplifications are always welcome.

In fact I wonder whether getting rid of the 60-second client scanning can get rid of some periodic lag spikes, which I have sometimes noticed with NiceGUI apps 🤔

@falkoschindler
Copy link
Contributor Author

document_id is public API

As far as I can tell it's only exchanged between client and server, but not stored as a public member in client. And all dictionaries, that use it as a key, are private.

@falkoschindler falkoschindler merged commit 9555de9 into main Nov 9, 2025
9 checks passed
@falkoschindler falkoschindler deleted the client-deletion branch November 9, 2025 11:15
@evnchn
Copy link
Collaborator

evnchn commented Nov 9, 2025

But we have https://github.com/zauberzeug/nicegui/blob/main/nicegui%2Fstatic%2Fnicegui.js#L327

Not underscore or anything, straight up in window

@falkoschindler
Copy link
Contributor Author

Ah ok, I didn't consider the JavaScript world as public. At least NiceGUI is trying to hide it from the user as much as possible. But at the same time we encourage to run_javascript etc., so it is debatable for sure.

@evnchn
Copy link
Collaborator

evnchn commented Nov 9, 2025

Hmm why is _cancel_delete_task associated with each document_id?

#5421 is going to be very unreliable as a result, since:

  1. document A connect
  2. document A disconnects
  3. document B connect, but does not cancel the self-destruct task stored in self._delete_tasks caused by (2)
  4. document B client deleted in the middle of active client usage

What gives? Possibly due to auto-index client (which never dies) we have flawed-logic from the getgo? 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Type/scope: A problem that needs fixing 🟠 major Priority: Important, but not urgent review Status: PR is open and needs review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Still seeing KeyError: 'XXX.....' after 2.24.0 bug fix

3 participants