Crash: ink_release_assert in UnixNetVConnection::add_to_keep_alive_queue() / add_to_active_queue() — NetHandler queue mutated off the owner thread (HTTP/1.1 and HTTP/2)
Summary
On a forward-proxy deployment doing TLS bump (MITM) + caching of decrypted HTTPS objects, traffic_server aborts intermittently with an ink_release_assert fired from UnixNetVConnection::add_to_keep_alive_queue() or UnixNetVConnection::add_to_active_queue():
BUG: It must have acquired the NetHandler's lock before doing anything on keep_alive_queue.
BUG: It must have acquired the NetHandler's lock before doing anything on active_queue.
Root cause: the per-thread NetHandler queues (active_queue / keep_alive_queue) are mutated from a thread that is not the owner thread of the underlying NetVConnection (vc->thread). The MUTEX_TRY_LOCK(nh->mutex, ...) in those functions fails because the owner thread currently holds nh->mutex, and the code aborts instead of deferring/redirecting the operation to the owner thread.
This reproduces for both HTTP/1.1 and HTTP/2, and is not correlated with high load — it happens under normal, low traffic (< 1000 rps, ~7–8k concurrent connections, 32 ET_NET threads on a 32-core box; CPU only ~200–300%).
Version
- Apache Traffic Server 10.1.2 (build Jun 24 2026), Linux x86_64, OpenSSL 1.1.1k, jemalloc.
Configuration (relevant bits)
records:
  exec_thread:
    autoconfig: { enabled: 1, scale: 1.0 }
    limit: 32 # 32 ET_NET threads on a 32-core box
  http:
    server_ports: >
      3080:ip-in=0.0.0.0:proto=http;http2
      3443:ip-in=127.0.0.1:ssl:proto=http;http2
      3180:ip-in=0.0.0.0:tr-in:proto=http;http2:pp
      3444:ip-in=0.0.0.0:ssl:proto=http;http2:tr-in:pp
Forward proxy: client CONNECT on 3080 → local TLS listener 3443 → certifier plugin generates per-SNI certs (MITM) → decrypted HTTP objects are cached. HTTP/2 enabled on all inbound ports; PROXY protocol enabled on 3180/3444.
Crash stacks (3 occurrences, same root cause)
(A) HTTP/2, teardown path — release_stream → keep_alive_queue
_ink_assert
UnixNetVConnection::add_to_keep_alive_queue()
Http2ConnectionState::release_stream()
Http2Stream::~Http2Stream()
Http2Stream::terminate_if_possible()
Http2Stream::transaction_done()
HttpSM::kill_this()
HttpSM::main_handler()
HttpTunnel::main_handler()
CacheVC::openWriteMain(int, Event*) <-- driven by cache-write completion
EThread::process_event()
EThread::execute_regular()
(B) HTTP/1.1, teardown path — transaction_done → keep_alive_queue
_ink_assert
UnixNetVConnection::add_to_keep_alive_queue()
Http1ClientTransaction::transaction_done()
HttpSM::kill_this()
HttpSM::main_handler()
HttpTunnel::main_handler()
CacheVC::openWriteMain(int, Event*) <-- driven by cache-write completion
EThread::process_event()
EThread::process_queue()
EThread::execute_regular()
(C) HTTP/2, setup path — create_stream → active_queue
_ink_assert
UnixNetVConnection::add_to_active_queue()
Http2ConnectionState::create_stream()
Http2ConnectionState::rcv_headers_frame()
Http2ConnectionState::rcv_frame()
Http2CommonSession::do_complete_frame_read()
Http2CommonSession::do_process_frame_read()
Http2CommonSession::state_start_frame_read()
Http2ClientSession::main_event_handler()
EThread::process_event()
EThread::execute_regular() <-- dispatched from the event queue, not the net poll
The failing frame in all three:
// UnixNetVConnection.cc
void UnixNetVConnection::add_to_keep_alive_queue() {
  MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
  if (lock.is_locked()) {
    nh->add_to_keep_alive_queue(this);
  } else {
    ink_release_assert(!"BUG: ... keep_alive_queue."); // <-- abort
  }
}
bool UnixNetVConnection::add_to_active_queue() {
  MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
  if (lock.is_locked()) {
    return nh->add_to_active_queue(this);
  } else {
    ink_release_assert(!"BUG: ... active_queue."); // <-- abort
  }
}
Root cause analysis
- Each ET_NET thread owns one NetHandler; at initialize_thread_for_net(), nh->thread = thread (bound once, never changes) and nh->mutex = new_ProxyMutex(). The active_queue / keep_alive_queue and their size counters are per-thread private data, meant to be mutated only on nh->thread (== the netvc owner thread O, i.e. vc->thread).
- add_to_*_queue() uses MUTEX_TRY_LOCK(nh->mutex) + ink_release_assert as a thread-affinity contract check, not as an optimistic lock with retry. On the owner thread the lock is already held, so the re-entrant try-lock succeeds; a foreign thread T fails the try-lock because O currently holds it. The abort therefore means "this ran on the wrong thread", not "transient lock contention". Retrying the lock would be wrong: mutating another thread's per-thread queue/counters is unsafe even if the lock were obtained.
- HTTP/1.1 partially guards this via ProxyTransaction::adjust_thread() (which reschedules onto the fixed vc->thread), called from HttpSM::state_cache_open_write(). But the cache-write completion teardown path (CacheVC::openWriteMain → HttpTunnel → HttpSM::kill_this → Http1ClientTransaction::transaction_done → Http1ClientSession::release → add_to_keep_alive_queue) does not pass through state_cache_open_write again, so it can run on a non-owner thread → stack (B).
- HTTP/2 aligns stream processing to Http2Stream::_thread (the stream's creating thread) via _switch_thread_if_not_on_right_thread(), which is independent from the underlying session netvc's owner thread. When teardown (release_stream) or setup (create_stream) touches the session netvc's queue while running on the stream thread T ≠ O, it aborts → stacks (A) and (C).
- Why intermittent: most frame processing is delivered on the owner thread via NetHandler::process_ready_list (holding nh->mutex), so T == O. But when a frame is not fully available, Http2CommonSession::state_complete_frame_read() reschedules via:
this->_reenable_event =
  this->get_mutex()->thread_holding->schedule_in(this->get_proxy_session(),
                                                 HRTIME_MSECONDS(1),
                                                 HTTP2_SESSION_EVENT_REENABLE, vio);
i.e. onto whatever thread currently holds the session mutex, not the netvc owner thread. Resuming there later runs create_stream off-thread → stack (C). Note stack (C) is dispatched from EThread::execute_regular() via the event queue (not the net poll path), consistent with a scheduled reenable event.
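The affinity-check behaviour described in the analysis above can be reproduced outside ATS. The following is a minimal sketch in standard C++, not ATS code: ModelNetHandler, model_add_to_keep_alive_queue and run_affinity_demo are hypothetical names, std::recursive_mutex is assumed to be a reasonable stand-in for the re-entrant nh->mutex, and the function returns false at the point where the real code would hit ink_release_assert.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a NetHandler: one re-entrant mutex plus a
// per-thread queue counter that only the owner thread should touch.
struct ModelNetHandler {
  std::recursive_mutex mutex;
  int keep_alive_queue_size = 0;
};

// Same shape as add_to_keep_alive_queue(): try-lock, then mutate.
// Returns false where the real code would abort via ink_release_assert.
bool model_add_to_keep_alive_queue(ModelNetHandler &nh) {
  std::unique_lock<std::recursive_mutex> lock(nh.mutex, std::try_to_lock);
  if (!lock.owns_lock()) {
    return false; // caller is not the thread currently holding the lock
  }
  ++nh.keep_alive_queue_size;
  return true;
}

struct AffinityResult {
  bool owner_ok   = false; // outcome when called on the lock-holding thread
  bool foreign_ok = false; // outcome when called on a different thread
};

AffinityResult run_affinity_demo() {
  ModelNetHandler nh;
  AffinityResult r;

  // The calling thread plays the owner thread O: it holds the mutex for the
  // whole demo, as the owner does while it processes its net events.
  std::lock_guard<std::recursive_mutex> held(nh.mutex);
  r.owner_ok = model_add_to_keep_alive_queue(nh); // re-entrant try-lock

  // A second thread plays the foreign thread T while O still holds the lock.
  std::thread foreign([&nh, &r] { r.foreign_ok = model_add_to_keep_alive_queue(nh); });
  foreign.join();

  // Only successful calls touched the counter.
  assert(nh.keep_alive_queue_size == int(r.owner_ok) + int(r.foreign_ok));
  return r;
}
```

Calling run_affinity_demo() is expected to report success for the owner and failure for the foreign thread, however many times the foreign thread retries while the owner still holds the lock. That is why the assert reads as "wrong thread" rather than "contention".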
Why this is rarely reported
The ink_release_assert was introduced in 2019 (#4379 / #4787) to turn silent cross-thread races into explicit aborts and has been unchanged since. Over 8.x/9.x the common paths were aligned one-by-one (adjust_thread, #5120, #6843, …), while a general fix (#5950) was abandoned. The remaining unguarded paths require an uncommon combination — forward proxy + TLS bump + caching of decrypted HTTPS + PROXY protocol + HTTP/2 + fragmented frames triggering the reenable reschedule — which typical reverse-proxy/CDN deployments don't hit.
Related issues / PRs
Suggested fix
(1) Correct / upstream-style: ensure these queue operations always run on vc->thread. Before calling add_to_*_queue() on the session netvc, if this_ethread() != vc->thread, reschedule onto the fixed vc->thread (like adjust_thread(), using the owner thread rather than the transient mutex holder). Lifetime of the netvc on the teardown path needs care.
(2) Minimal / defensive (what we run in production): guard at the lowest level so an off-thread call degrades gracefully instead of aborting:
void UnixNetVConnection::add_to_keep_alive_queue() {
  if (this->thread != this_ethread()) return; // not owner thread: skip; InactivityCop still reclaims it
  MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
  if (lock.is_locked()) nh->add_to_keep_alive_queue(this);
  else ink_release_assert(!"BUG: ... keep_alive_queue.");
}
bool UnixNetVConnection::add_to_active_queue() {
  if (this->thread != this_ethread()) return true; // not owner thread: skip accounting but ALLOW the stream
  bool result = false;
  MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
  if (lock.is_locked()) result = nh->add_to_active_queue(this);
  else ink_release_assert(!"BUG: ... active_queue.");
  return result;
}
Key point: add_to_active_queue() must return true on the off-thread path — its return value gates create_stream(), and returning false would wrongly refuse valid streams as "maxed out active connections". Connection lifetime is unaffected: reclamation is done by InactivityCop over open_list, independent of these queues. The only cost is that a rare off-thread connection is not accounted in the per-thread active/keep-alive LRU bookkeeping and may bypass the per-thread active-connection limit (negligible at 8k/40k).
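For comparison, the idea behind option (1) is to redirect the queue operation to the owner thread instead of skipping it. This can be sketched in standard C++. It is a simplified model under stated assumptions, not ATS code and not a proposed patch: ModelThread, ModelNetVC, add_to_keep_alive_queue_on_owner and run_redirect_demo are hypothetical names, and a mutex-protected mailbox stands in for scheduling an event on vc->thread.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical model of an event thread: its id plus a mailbox of deferred work.
struct ModelThread {
  std::thread::id id;
  std::mutex mailbox_mutex;
  std::vector<std::function<void()>> mailbox;

  // May be called from any thread.
  void post(std::function<void()> fn) {
    std::lock_guard<std::mutex> guard(mailbox_mutex);
    mailbox.push_back(std::move(fn));
  }

  // Must be called by the owner thread itself; runs all deferred work.
  void drain() {
    assert(std::this_thread::get_id() == id);
    std::vector<std::function<void()>> work;
    {
      std::lock_guard<std::mutex> guard(mailbox_mutex);
      work.swap(mailbox);
    }
    for (auto &fn : work) {
      fn();
    }
  }
};

// Hypothetical model of a netvc with a fixed owner thread.
struct ModelNetVC {
  ModelThread *thread = nullptr; // fixed owner, like vc->thread
  bool in_keep_alive_queue = false;
  std::thread::id queued_on; // thread that performed the queue update
};

// Runs the queue update inline on the owner thread, otherwise defers it to the
// owner's mailbox. Returns true if it ran inline. The deferred closure captures
// vc by reference, so vc must outlive it (the lifetime concern noted above).
bool add_to_keep_alive_queue_on_owner(ModelNetVC &vc) {
  auto op = [&vc] {
    vc.in_keep_alive_queue = true;
    vc.queued_on = std::this_thread::get_id();
  };
  if (std::this_thread::get_id() == vc.thread->id) {
    op();
    return true;
  }
  vc.thread->post(op);
  return false;
}

struct RedirectResult {
  bool ran_inline = false;
  bool queued_before_drain = false;
  bool queued_after_drain = false;
  bool ran_on_owner = false;
};

RedirectResult run_redirect_demo() {
  ModelThread owner;
  owner.id = std::this_thread::get_id(); // the calling thread plays the owner
  ModelNetVC vc;
  vc.thread = &owner;

  RedirectResult r;
  std::thread foreign([&vc, &r] { r.ran_inline = add_to_keep_alive_queue_on_owner(vc); });
  foreign.join();

  r.queued_before_drain = vc.in_keep_alive_queue; // nothing mutated off-thread
  owner.drain();                                  // owner picks up the deferred update
  r.queued_after_drain = vc.in_keep_alive_queue;
  r.ran_on_owner = (vc.queued_on == owner.id);
  return r;
}
```

In this model the foreign thread never touches the queue state; the update only becomes visible once the owner drains its mailbox. The trade-off compared with option (2) is that the accounting stays exact, at the cost of keeping the netvc alive until the deferred operation has run.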
Workaround status in production
Running with the defensive guard (option 2) on add_to_keep_alive_queue() and add_to_active_queue() only. Crashes stopped; no connection leaks observed. The two remove_from_*_queue() variants have the same assert but were not observed to fire in production, so they are left unchanged for now.