crash on add_to_keep_alive_queue and add_to_active_queue

# Crash: `ink_release_assert` in `UnixNetVConnection::add_to_keep_alive_queue()` / `add_to_active_queue()` — NetHandler queue mutated off the owner thread (HTTP/1.1 and HTTP/2)

## Summary

On a forward-proxy deployment doing TLS bump (MITM) + caching of decrypted HTTPS objects, `traffic_server` aborts intermittently with an `ink_release_assert` fired from `UnixNetVConnection::add_to_keep_alive_queue()` or `UnixNetVConnection::add_to_active_queue()`:

```
BUG: It must have acquired the NetHandler's lock before doing anything on keep_alive_queue.
BUG: It must have acquired the NetHandler's lock before doing anything on active_queue.
```

Root cause: the per-thread NetHandler queues (`active_queue` / `keep_alive_queue`) are mutated from a thread that is **not** the owner thread of the underlying `NetVConnection` (`vc->thread`). The `MUTEX_TRY_LOCK(nh->mutex, ...)` in those functions fails because the owner thread currently holds `nh->mutex`, and the code aborts instead of deferring/redirecting the operation to the owner thread.

This reproduces for **both HTTP/1.1 and HTTP/2**, and is **not** correlated with high load — it happens under normal, low traffic (< 1000 rps, ~7–8k concurrent connections, 32 ET_NET threads on a 32-core box; CPU only ~200–300%).

## Version

- Apache Traffic Server **10.1.2** (build Jun 24 2026), Linux x86_64, OpenSSL 1.1.1k, jemalloc.

## Configuration (relevant bits)

```yaml
records:
  exec_thread:
    autoconfig: { enabled: 1, scale: 1.0 }
    limit: 32                       # 32 ET_NET threads on a 32-core box
  http:
    server_ports: >
      3080:ip-in=0.0.0.0:proto=http;http2
      3443:ip-in=127.0.0.1:ssl:proto=http;http2
      3180:ip-in=0.0.0.0:tr-in:proto=http;http2:pp
      3444:ip-in=0.0.0.0:ssl:proto=http;http2:tr-in:pp
```

Forward proxy: client `CONNECT` on 3080 → local TLS listener 3443 → `certifier` plugin generates per-SNI certs (MITM) → decrypted HTTP objects are cached. HTTP/2 enabled on all inbound ports; PROXY protocol enabled on 3180/3444.

## Crash stacks (3 occurrences, same root cause)

### (A) HTTP/2, teardown path — `release_stream` → keep_alive_queue

```
_ink_assert
UnixNetVConnection::add_to_keep_alive_queue()
Http2ConnectionState::release_stream()
Http2Stream::~Http2Stream()
Http2Stream::terminate_if_possible()
Http2Stream::transaction_done()
HttpSM::kill_this()
HttpSM::main_handler()
HttpTunnel::main_handler()
CacheVC::openWriteMain(int, Event*)          <-- driven by cache-write completion
EThread::process_event()
EThread::execute_regular()
```

### (B) HTTP/1.1, teardown path — `transaction_done` → keep_alive_queue

```
_ink_assert
UnixNetVConnection::add_to_keep_alive_queue()
Http1ClientTransaction::transaction_done()
HttpSM::kill_this()
HttpSM::main_handler()
HttpTunnel::main_handler()
CacheVC::openWriteMain(int, Event*)          <-- driven by cache-write completion
EThread::process_event()
EThread::process_queue()
EThread::execute_regular()
```

### (C) HTTP/2, setup path — `create_stream` → active_queue

```
_ink_assert
UnixNetVConnection::add_to_active_queue()
Http2ConnectionState::create_stream()
Http2ConnectionState::rcv_headers_frame()
Http2ConnectionState::rcv_frame()
Http2CommonSession::do_complete_frame_read()
Http2CommonSession::do_process_frame_read()
Http2CommonSession::state_start_frame_read()
Http2ClientSession::main_event_handler()
EThread::process_event()
EThread::execute_regular()                   <-- dispatched from the event queue, not the net poll
```

The failing frame in all three:

```cpp
// UnixNetVConnection.cc
void UnixNetVConnection::add_to_keep_alive_queue() {
  MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
  if (lock.is_locked()) { nh->add_to_keep_alive_queue(this); }
  else { ink_release_assert(!"BUG: ... keep_alive_queue."); }   // <-- abort
}
bool UnixNetVConnection::add_to_active_queue() {
  MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
  if (lock.is_locked()) { return nh->add_to_active_queue(this); }
  else { ink_release_assert(!"BUG: ... active_queue."); }       // <-- abort
}
```
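The try-lock-then-abort shape can be reproduced outside ATS. The following is a toy sketch, not ATS code: `ToyNetHandler`, `try_add_to_keep_alive` and `attempt_while_owner_holds_lock` are hypothetical names, and `std::recursive_mutex` is only assumed to behave like the re-entrant handler lock. It models the situation in this report, where the owner thread holds the handler lock at the moment the call is made.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Toy stand-in for a NetHandler: a re-entrant lock plus a queue-size counter.
// Hypothetical type, used only to illustrate the locking behaviour.
struct ToyNetHandler {
  std::recursive_mutex mutex;
  int keep_alive_size = 0;
};

// Same shape as add_to_keep_alive_queue(): try the lock, never block.
// Returns false where the real code would hit ink_release_assert.
bool try_add_to_keep_alive(ToyNetHandler &nh) {
  std::unique_lock<std::recursive_mutex> lock(nh.mutex, std::try_to_lock);
  if (!lock.owns_lock()) {
    return false;
  }
  assert(lock.owns_lock());  // the counter is only touched with the lock held
  ++nh.keep_alive_size;
  return true;
}

// The calling thread plays the owner and holds nh.mutex for the whole call.
// The attempt is made either on that same thread or on a foreign thread.
bool attempt_while_owner_holds_lock(ToyNetHandler &nh, bool on_owner_thread) {
  std::lock_guard<std::recursive_mutex> owner_hold(nh.mutex);
  if (on_owner_thread) {
    return try_add_to_keep_alive(nh);  // re-entrant acquire by the holder
  }
  bool ok = true;
  std::thread foreign([&] { ok = try_add_to_keep_alive(nh); });
  foreign.join();  // the owner still holds the lock while the foreign thread runs
  return ok;
}
```

In this model the call made on the owner thread is expected to succeed, while the call made on the foreign thread fails the try-lock, which is the branch where the real functions abort.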

## Root cause analysis

1. Each ET_NET thread owns one `NetHandler`; at `initialize_thread_for_net()`, `nh->thread = thread` (bound once, never changes) and `nh->mutex = new_ProxyMutex()`. The `active_queue` / `keep_alive_queue` and their size counters are **per-thread private data**, meant to be mutated only on `nh->thread` (== the netvc owner thread `O`, i.e. `vc->thread`).

2. `add_to_*_queue()` uses `MUTEX_TRY_LOCK(nh->mutex)` + `ink_release_assert` as a **thread-affinity contract check**, not as an optimistic lock with retry. On the owner thread the lock is already held, so the re-entrant try-lock succeeds; on a foreign thread `T` the try-lock fails because `O` currently holds the lock. The abort therefore means "this ran on the wrong thread", not "transient lock contention". Retrying until the lock is acquired would be wrong: mutating another thread's per-thread queue and counters is unsafe even if the lock were obtained.

3. **HTTP/1.1** partially guards this via `ProxyTransaction::adjust_thread()` (which reschedules onto the fixed `vc->thread`), called from `HttpSM::state_cache_open_write()`. But the cache-write **completion** teardown path (`CacheVC::openWriteMain` → `HttpTunnel` → `HttpSM::kill_this` → `Http1ClientTransaction::transaction_done` → `Http1ClientSession::release` → `add_to_keep_alive_queue`) does **not** pass through `state_cache_open_write` again, so it can run on a non-owner thread → stack (B).

4. **HTTP/2** aligns stream processing to `Http2Stream::_thread` (the stream's creating thread) via `_switch_thread_if_not_on_right_thread()`, which is **independent** of the underlying session netvc's owner thread. When teardown (`release_stream`) or setup (`create_stream`) touches the **session netvc's** queue while running on the stream thread `T ≠ O`, it aborts → stacks (A) and (C).

5. **Why intermittent:** most frame processing is delivered on the owner thread via `NetHandler::process_ready_list` (holding `nh->mutex`), so `T == O`. But when a frame is not fully available, `Http2CommonSession::state_complete_frame_read()` reschedules via:
   ```cpp
   this->_reenable_event =
       this->get_mutex()->thread_holding->schedule_in(this->get_proxy_session(),
                                                       HRTIME_MSECONDS(1),
                                                       HTTP2_SESSION_EVENT_REENABLE, vio);
   ```
   i.e. onto **whatever thread currently holds the session mutex**, not the netvc owner thread. Resuming there later runs `create_stream` off-thread → stack (C). Note stack (C) is dispatched from `EThread::execute_regular()` via the event queue (not the net poll path), consistent with a scheduled reenable event.
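The difference between the two reschedule targets in point 5 can be illustrated with a small model. This is a sketch under stated assumptions, not ATS code: `ToyProxyMutex`, `ToySession`, `Targets` and `on_partial_frame` are hypothetical names, and the only property borrowed from the report is that the mutex records which thread currently holds it while the owner thread is fixed.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Toy stand-in for a mutex that records its current holder (hypothetical type).
struct ToyProxyMutex {
  std::mutex m;
  std::thread::id thread_holding;  // meaningful only while the mutex is held
  void lock() {
    m.lock();
    thread_holding = std::this_thread::get_id();
  }
  void unlock() {
    thread_holding = std::thread::id();
    m.unlock();
  }
};

struct ToySession {
  ToyProxyMutex mutex;
  std::thread::id owner;  // models vc->thread: fixed for the connection's life
};

struct Targets {
  std::thread::id holder;  // target picked via "thread_holding"
  std::thread::id owner;   // target an owner-pinned reschedule would pick
};

// Models the handler seeing a partial frame: it runs with the session mutex
// held and must decide where the reenable event should be scheduled.
Targets on_partial_frame(ToySession &s) {
  std::lock_guard<ToyProxyMutex> guard(s.mutex);
  assert(s.mutex.thread_holding == std::this_thread::get_id());
  return {s.mutex.thread_holding, s.owner};
}
```

When `on_partial_frame` runs on the owner thread the two targets coincide. When it runs on any other thread, the holder-based target is that other thread, so the resumed work lands off the owner.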

## Why this is rarely reported

The `ink_release_assert` was introduced in 2019 (#4379 / #4787) to turn silent cross-thread races into explicit aborts and has been unchanged since. Over 8.x/9.x the common paths were aligned one-by-one (`adjust_thread`, #5120, #6843, …), while a general fix (#5950) was abandoned. The remaining unguarded paths require an uncommon combination — forward proxy + TLS bump + caching of decrypted HTTPS + PROXY protocol + HTTP/2 + fragmented frames triggering the reenable reschedule — which typical reverse-proxy/CDN deployments don't hit.

## Related issues / PRs

- #4379 / #4787 — introduced the asserts on keep_alive/active queue thread-safety.
- #5943 — ATS 9 NetHandler lock abort.
- #5950 — "Fix assert due to unheld nh->mutex" (abandoned).
- #4504 — Crash on `Http2ConnectionState::release_stream()`.
- #5120 / #6843 — reschedule work back onto the original/owner thread (the correct pattern).

## Suggested fix

**(1) Correct / upstream-style:** ensure these queue operations always run on `vc->thread`. Before calling `add_to_*_queue()` on the session netvc, if `this_ethread() != vc->thread`, reschedule onto the **fixed** `vc->thread` (like `adjust_thread()`, using the owner thread rather than the transient mutex holder). On the teardown path the netvc must stay alive until the rescheduled operation has run, so its lifetime needs care.
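The pattern in option (1) can be sketched as "run inline on the owner, otherwise hand off to the owner's event queue". The code below is a self-contained toy, not a patch: `ToyEThread`, `ToyNetVC` and `add_to_keep_alive_on_owner` are hypothetical names, the event loop is a minimal stand-in for an `EThread`, and the sketch simply assumes the connection outlives the scheduled operation, which the real teardown path would have to guarantee.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>

// Minimal single-threaded event loop (hypothetical stand-in for an EThread).
class ToyEThread {
public:
  ToyEThread() : worker_([this] { run(); }) {}
  ~ToyEThread() {
    schedule([this] { stop_ = true; });  // stop flag is set on the worker itself
    worker_.join();
  }
  void schedule(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      queue_.push_back(std::move(task));
    }
    ready_.notify_one();
  }
  std::thread::id id() const { return worker_.get_id(); }

private:
  void run() {
    while (!stop_) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        task = std::move(queue_.front());
        queue_.pop_front();
      }
      task();
    }
  }
  std::mutex mutex_;
  std::condition_variable ready_;
  std::deque<std::function<void()>> queue_;
  bool stop_ = false;   // read and written only on the worker thread
  std::thread worker_;  // declared last so the other members exist before it starts
};

// Toy connection: `owner` models vc->thread, the counter models a per-thread queue size.
struct ToyNetVC {
  explicit ToyNetVC(ToyEThread *o) : owner(o) {}
  ToyEThread *owner;
  int keep_alive_size = 0;
  std::thread::id last_mutator;
};

// Run the queue operation inline when already on the owner thread, otherwise
// schedule it there. The future becomes ready once the operation has run.
// Assumption: `vc` stays alive until then.
std::future<void> add_to_keep_alive_on_owner(ToyNetVC &vc) {
  auto done = std::make_shared<std::promise<void>>();
  std::future<void> finished = done->get_future();
  auto op = [&vc, done] {
    assert(std::this_thread::get_id() == vc.owner->id());  // affinity invariant
    ++vc.keep_alive_size;
    vc.last_mutator = std::this_thread::get_id();
    done->set_value();
  };
  if (std::this_thread::get_id() == vc.owner->id()) {
    op();
  } else {
    vc.owner->schedule(op);
  }
  return finished;
}
```

The design choice worth noting is that the target is always the fixed owner stored on the connection, never the thread that happens to be running or holding a mutex at the time of the call.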

**(2) Minimal / defensive (what we run in production):** guard at the lowest level so an off-thread call degrades gracefully instead of aborting:

```cpp
void UnixNetVConnection::add_to_keep_alive_queue() {
  if (this->thread != this_ethread()) return;         // not owner thread: skip; InactivityCop still reclaims it
  MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
  if (lock.is_locked()) nh->add_to_keep_alive_queue(this);
  else ink_release_assert(!"BUG: ... keep_alive_queue.");
}
bool UnixNetVConnection::add_to_active_queue() {
  if (this->thread != this_ethread()) return true;    // not owner thread: skip accounting but ALLOW the stream
  bool result = false;
  MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
  if (lock.is_locked()) result = nh->add_to_active_queue(this);
  else ink_release_assert(!"BUG: ... active_queue.");
  return result;
}
```

Key point: `add_to_active_queue()` **must return `true`** on the off-thread path — its return value gates `create_stream()`, and returning `false` would wrongly refuse valid streams as "maxed out active connections". Connection lifetime is unaffected: reclamation is done by `InactivityCop` over `open_list`, independent of these queues. The only cost is that a rare off-thread connection is not accounted in the per-thread active/keep-alive LRU bookkeeping and may bypass the per-thread active-connection limit (negligible at 8k/40k).
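The return-value point can be illustrated with a toy model of the guard. This is a sketch, not the production patch: `ToyNetHandler`, `ToyNetVC`, `add_to_active_queue_guarded`, the simplified `create_stream` and the limit of 2 are all hypothetical, and the assumption that the handler refuses additions once a per-thread limit is reached is taken from the "maxed out active connections" behaviour described above.

```cpp
#include <cassert>
#include <thread>

// Toy per-thread handler with an active-connection limit (hypothetical type).
struct ToyNetHandler {
  int active_size = 0;
  int max_active = 2;
};

// Toy connection bound to an owner thread and that thread's handler.
struct ToyNetVC {
  std::thread::id owner;
  ToyNetHandler *nh;
};

// Option (2): an off-owner call skips the bookkeeping and reports success,
// so the caller is not told the connection limit was hit.
bool add_to_active_queue_guarded(ToyNetVC &vc) {
  assert(vc.nh != nullptr);
  if (vc.owner != std::this_thread::get_id()) {
    return true;  // not the owner thread: no accounting, stream still allowed
  }
  if (vc.nh->active_size >= vc.nh->max_active) {
    return false;  // owner thread and genuinely at the limit
  }
  ++vc.nh->active_size;
  return true;
}

// Models the gating described above: a false return refuses the stream.
bool create_stream(ToyNetVC &vc) {
  return add_to_active_queue_guarded(vc);
}
```

On the owner thread the limit is still enforced. A call from another thread is accepted without changing `active_size`, which is the accounting gap and limit bypass described above.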

## Workaround status in production

Running with the defensive guard (option 2) on `add_to_keep_alive_queue()` and `add_to_active_queue()` only. Crashes stopped; no connection leaks observed. The two `remove_from_*_queue()` variants have the same assert but were not observed to fire in production, so they are left unchanged for now.


crash on add_to_keep_alive_queue and add_to_active_queue #13358
