crash on add_to_keep_alive_queue and add_to_active_queue #13358

Description

@wuxcer

Crash: ink_release_assert in UnixNetVConnection::add_to_keep_alive_queue() / add_to_active_queue() — NetHandler queue mutated off the owner thread (HTTP/1.1 and HTTP/2)

Summary

On a forward-proxy deployment doing TLS bump (MITM) + caching of decrypted HTTPS objects, traffic_server aborts intermittently with an ink_release_assert fired from UnixNetVConnection::add_to_keep_alive_queue() or UnixNetVConnection::add_to_active_queue():

BUG: It must have acquired the NetHandler's lock before doing anything on keep_alive_queue.
BUG: It must have acquired the NetHandler's lock before doing anything on active_queue.

Root cause: the per-thread NetHandler queues (active_queue / keep_alive_queue) are mutated from a thread that is not the owner thread of the underlying NetVConnection (vc->thread). The MUTEX_TRY_LOCK(nh->mutex, ...) in those functions fails because the owner thread currently holds nh->mutex, and the code aborts instead of deferring/redirecting the operation to the owner thread.

This reproduces for both HTTP/1.1 and HTTP/2, and is not correlated with high load — it happens under normal, low traffic (< 1000 rps, ~7–8k concurrent connections, 32 ET_NET threads on a 32-core box; CPU only ~200–300%).

Version

  • Apache Traffic Server 10.1.2 (build Jun 24 2026), Linux x86_64, OpenSSL 1.1.1k, jemalloc.

Configuration (relevant bits)

records:
  exec_thread:
    autoconfig: { enabled: 1, scale: 1.0 }
    limit: 32                       # 32 ET_NET threads on a 32-core box
  http:
    server_ports: >
      3080:ip-in=0.0.0.0:proto=http;http2
      3443:ip-in=127.0.0.1:ssl:proto=http;http2
      3180:ip-in=0.0.0.0:tr-in:proto=http;http2:pp
      3444:ip-in=0.0.0.0:ssl:proto=http;http2:tr-in:pp

Forward proxy: client CONNECT on 3080 → local TLS listener 3443 → certifier plugin generates per-SNI certs (MITM) → decrypted HTTP objects are cached. HTTP/2 enabled on all inbound ports; PROXY protocol enabled on 3180/3444.

Crash stacks (3 occurrences, same root cause)

(A) HTTP/2, teardown path — release_stream → keep_alive_queue

_ink_assert
UnixNetVConnection::add_to_keep_alive_queue()
Http2ConnectionState::release_stream()
Http2Stream::~Http2Stream()
Http2Stream::terminate_if_possible()
Http2Stream::transaction_done()
HttpSM::kill_this()
HttpSM::main_handler()
HttpTunnel::main_handler()
CacheVC::openWriteMain(int, Event*)          <-- driven by cache-write completion
EThread::process_event()
EThread::execute_regular()

(B) HTTP/1.1, teardown path — transaction_done → keep_alive_queue

_ink_assert
UnixNetVConnection::add_to_keep_alive_queue()
Http1ClientTransaction::transaction_done()
HttpSM::kill_this()
HttpSM::main_handler()
HttpTunnel::main_handler()
CacheVC::openWriteMain(int, Event*)          <-- driven by cache-write completion
EThread::process_event()
EThread::process_queue()
EThread::execute_regular()

(C) HTTP/2, setup path — create_stream → active_queue

_ink_assert
UnixNetVConnection::add_to_active_queue()
Http2ConnectionState::create_stream()
Http2ConnectionState::rcv_headers_frame()
Http2ConnectionState::rcv_frame()
Http2CommonSession::do_complete_frame_read()
Http2CommonSession::do_process_frame_read()
Http2CommonSession::state_start_frame_read()
Http2ClientSession::main_event_handler()
EThread::process_event()
EThread::execute_regular()                   <-- dispatched from the event queue, not the net poll

The failing frame in all three:

// UnixNetVConnection.cc
void UnixNetVConnection::add_to_keep_alive_queue() {
  MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
  if (lock.is_locked()) { nh->add_to_keep_alive_queue(this); }
  else { ink_release_assert(!"BUG: ... keep_alive_queue."); }   // <-- abort
}
bool UnixNetVConnection::add_to_active_queue() {
  MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
  if (lock.is_locked()) { return nh->add_to_active_queue(this); }
  else { ink_release_assert(!"BUG: ... active_queue."); }       // <-- abort
}

Root cause analysis

  1. Each ET_NET thread owns one NetHandler; at initialize_thread_for_net(), nh->thread = thread (bound once, never changes) and nh->mutex = new_ProxyMutex(). The active_queue / keep_alive_queue and their size counters are per-thread private data, meant to be mutated only on nh->thread (== the netvc owner thread O, i.e. vc->thread).

  2. add_to_*_queue() uses MUTEX_TRY_LOCK(nh->mutex) + ink_release_assert as a thread-affinity contract check, not as optimistic lock retry: on the owner thread the lock is already held (reentrant → success); a foreign thread T fails the try-lock because O currently holds it. The abort therefore means "this ran on the wrong thread", not "transient lock contention". Retrying to acquire the lock would be wrong — mutating another thread's per-thread queue/counters is unsafe even if the lock were obtained.

  3. HTTP/1.1 partially guards this via ProxyTransaction::adjust_thread() (which reschedules onto the fixed vc->thread), called from HttpSM::state_cache_open_write(). But the cache-write completion teardown path (CacheVC::openWriteMain → HttpTunnel → HttpSM::kill_this → Http1ClientTransaction::transaction_done → Http1ClientSession::release → add_to_keep_alive_queue) does not pass through state_cache_open_write again, so it can run on a non-owner thread → stack (B).

  4. HTTP/2 aligns stream processing to Http2Stream::_thread (the stream's creating thread) via _switch_thread_if_not_on_right_thread(), which is independent from the underlying session netvc's owner thread. When teardown (release_stream) or setup (create_stream) touches the session netvc's queue while running on the stream thread T ≠ O, it aborts → stacks (A) and (C).

  5. Why intermittent: most frame processing is delivered on the owner thread via NetHandler::process_ready_list (holding nh->mutex), so T == O. But when a frame is not fully available, Http2CommonSession::state_complete_frame_read() reschedules via:

    this->_reenable_event =
        this->get_mutex()->thread_holding->schedule_in(this->get_proxy_session(),
                                                        HRTIME_MSECONDS(1),
                                                        HTTP2_SESSION_EVENT_REENABLE, vio);

    i.e. onto whatever thread currently holds the session mutex, not the netvc owner thread. Resuming there later runs create_stream off-thread → stack (C). Note stack (C) is dispatched from EThread::execute_regular() via the event queue (not the net poll path), consistent with a scheduled reenable event.
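
The affinity-check behaviour described in point 2 can be reproduced outside of ATS. The following is a minimal sketch, assuming std::recursive_mutex is an acceptable stand-in for the re-entrant ProxyMutex; ModelNetHandler, try_add_to_keep_alive_queue, AffinityResult and demo_affinity_check are hypothetical names, not ATS code. It only shows why the try-lock succeeds on the thread that already holds the handler lock and fails on any other thread.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-in for NetHandler: one re-entrant mutex guarding
// per-thread queue state. This is a model, not the ATS class.
struct ModelNetHandler {
  std::recursive_mutex mutex;
  int keep_alive_queue_size = 0;
};

// Same shape as add_to_keep_alive_queue(): try-lock, mutate on success.
// Returns false where ATS would hit ink_release_assert, so the outcome
// can be observed instead of aborting.
bool try_add_to_keep_alive_queue(ModelNetHandler &nh) {
  std::unique_lock<std::recursive_mutex> lock(nh.mutex, std::try_to_lock);
  if (!lock.owns_lock()) {
    return false;
  }
  ++nh.keep_alive_queue_size;
  return true;
}

struct AffinityResult {
  bool owner_ok;
  bool foreign_ok;
  int queue_size;
};

AffinityResult demo_affinity_check() {
  ModelNetHandler nh;
  AffinityResult r{};
  // The calling thread plays the owner and holds the handler lock,
  // as the net poll loop does while it processes ready connections.
  std::lock_guard<std::recursive_mutex> held(nh.mutex);
  r.owner_ok = try_add_to_keep_alive_queue(nh); // re-entrant acquire succeeds
  // A foreign thread attempts the same call while the owner holds the lock.
  std::thread foreign([&] { r.foreign_ok = try_add_to_keep_alive_queue(nh); });
  foreign.join();
  r.queue_size = nh.keep_alive_queue_size;
  return r;
}
```

In this model the owner's call succeeds and the foreign call fails without touching the counter, which matches the reading that the assert signals "wrong thread" rather than transient contention.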

Why this is rarely reported

The ink_release_assert was introduced in 2019 (#4379 / #4787) to turn silent cross-thread races into explicit aborts and has been unchanged since. Over 8.x/9.x the common paths were aligned one-by-one (adjust_thread, #5120, #6843, …), while a general fix (#5950) was abandoned. The remaining unguarded paths require an uncommon combination — forward proxy + TLS bump + caching of decrypted HTTPS + PROXY protocol + HTTP/2 + fragmented frames triggering the reenable reschedule — which typical reverse-proxy/CDN deployments don't hit.

Related issues / PRs

Suggested fix

(1) Correct / upstream-style: ensure these queue operations always run on vc->thread. Before calling add_to_*_queue() on the session netvc, if this_ethread() != vc->thread, reschedule onto the fixed vc->thread (like adjust_thread(), using the owner thread rather than the transient mutex holder). Lifetime of the netvc on the teardown path needs care.
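
The shape of option (1) can be sketched in standalone C++. This is an illustration under stated assumptions, not a patch: OwnerThread is a hypothetical stand-in for an ET_NET thread with its own event queue, and run_on_owner and demo_reschedule are invented names. The point is only the decision "run inline if already on the owner thread, otherwise defer to it".

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical stand-in for an ET_NET thread that drains its own task queue.
class OwnerThread {
public:
  OwnerThread() : worker_([this] { loop(); }) {}
  ~OwnerThread() {
    {
      std::lock_guard<std::mutex> g(m_);
      stop_ = true;
    }
    cv_.notify_one();
    worker_.join(); // loop() drains pending tasks before it exits
  }
  std::thread::id id() const { return worker_.get_id(); }

  // Queue a task to run later on the owner thread.
  void schedule(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> g(m_);
      tasks_.push_back(std::move(task));
    }
    cv_.notify_one();
  }

private:
  void loop() {
    std::unique_lock<std::mutex> lk(m_);
    for (;;) {
      cv_.wait(lk, [this] { return stop_ || !tasks_.empty(); });
      if (tasks_.empty()) {
        return; // stop requested and nothing left to run
      }
      auto task = std::move(tasks_.front());
      tasks_.pop_front();
      lk.unlock(); // do not hold the queue lock while running the task
      task();
      lk.lock();
    }
  }
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> tasks_;
  bool stop_ = false;
  std::thread worker_; // declared last so the members above exist before loop() starts
};

// Run op on the owner thread: inline if already there, deferred otherwise.
// Returns true if op ran inline.
bool run_on_owner(OwnerThread &owner, std::function<void()> op) {
  if (std::this_thread::get_id() == owner.id()) {
    op();
    return true;
  }
  owner.schedule(std::move(op));
  return false;
}

struct RescheduleResult {
  bool ran_inline;
  bool ran_on_owner;
};

RescheduleResult demo_reschedule() {
  RescheduleResult r{};
  std::thread::id ran_on;
  std::thread::id owner_id;
  {
    OwnerThread owner;
    owner_id = owner.id();
    // Called from a non-owner thread, so the queue operation is deferred.
    r.ran_inline = run_on_owner(owner, [&] { ran_on = std::this_thread::get_id(); });
  } // destructor drains the queue and joins the worker
  r.ran_on_owner = (ran_on == owner_id);
  return r;
}
```

The deferred closure here captures locals that outlive the worker. In the real teardown path the equivalent state is the netvc itself, which is why its lifetime needs care before any such deferral is attempted.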

(2) Minimal / defensive (what we run in production): guard at the lowest level so an off-thread call degrades gracefully instead of aborting:

void UnixNetVConnection::add_to_keep_alive_queue() {
  if (this->thread != this_ethread()) return;         // not owner thread: skip; InactivityCop still reclaims it
  MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
  if (lock.is_locked()) nh->add_to_keep_alive_queue(this);
  else ink_release_assert(!"BUG: ... keep_alive_queue.");
}
bool UnixNetVConnection::add_to_active_queue() {
  if (this->thread != this_ethread()) return true;    // not owner thread: skip accounting but ALLOW the stream
  bool result = false;
  MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
  if (lock.is_locked()) result = nh->add_to_active_queue(this);
  else ink_release_assert(!"BUG: ... active_queue.");
  return result;
}

Key point: add_to_active_queue() must return true on the off-thread path — its return value gates create_stream(), and returning false would wrongly refuse valid streams as "maxed out active connections". Connection lifetime is unaffected: reclamation is done by InactivityCop over open_list, independent of these queues. The only cost is that a rare off-thread connection is not accounted in the per-thread active/keep-alive LRU bookkeeping and may bypass the per-thread active-connection limit (negligible at 8k/40k).
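
The effect of the off-thread return value can be shown with a small standalone model. This is a sketch only: ModelVC, model_add_to_active_queue, model_create_stream, GateResult and demo_off_thread_gate are hypothetical names, and the limit check is a simplification of whatever the real NetHandler does.

```cpp
#include <cassert>
#include <thread>

// Hypothetical model of a session netvc with per-thread active accounting.
struct ModelVC {
  std::thread::id owner; // stands in for vc->thread
  int active_count = 0;  // stands in for the per-thread active queue size
  int max_active   = 1;  // stands in for the per-thread active limit
};

// off_thread_result is the value the guard returns when the caller is not
// the owner thread; the two candidates discussed above are true and false.
bool model_add_to_active_queue(ModelVC &vc, bool off_thread_result) {
  if (std::this_thread::get_id() != vc.owner) {
    return off_thread_result; // skip accounting entirely
  }
  if (vc.active_count >= vc.max_active) {
    return false; // genuinely over the limit
  }
  ++vc.active_count;
  return true;
}

// The return value gates stream creation, as described for create_stream().
bool model_create_stream(ModelVC &vc, bool off_thread_result) {
  return model_add_to_active_queue(vc, off_thread_result);
}

struct GateResult {
  bool accepted_when_true;
  bool accepted_when_false;
  int active_count;
};

GateResult demo_off_thread_gate() {
  ModelVC vc;
  vc.owner = std::this_thread::get_id(); // the calling thread plays the owner
  GateResult r{};
  // Both calls are made from a foreign thread, so neither is accounted.
  std::thread foreign([&] {
    r.accepted_when_true  = model_create_stream(vc, true);
    r.accepted_when_false = model_create_stream(vc, false);
  });
  foreign.join();
  r.active_count = vc.active_count;
  return r;
}
```

With the guard returning true the off-thread stream is accepted but never counted, which is the accounting gap noted above; with false the same valid stream would be refused even though the limit was not reached.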

Workaround status in production

Running with the defensive guard (option 2) on add_to_keep_alive_queue() and add_to_active_queue() only. Crashes stopped; no connection leaks observed. The two remove_from_*_queue() variants have the same assert but were not observed to fire in production, so they are left unchanged for now.
