7 changes: 7 additions & 0 deletions .cirrus.tasks.yml
@@ -514,12 +514,19 @@ task:
# code not being exercised much. Thus specify a very small segment size
# here. Use a non-power-of-two segment size, given we currently allow
# that.
# --enable-wait-event-timing is tacked on to this entry so the timing
# build path (including the expected output at
# src/test/regress/expected/wait_event_timing.out) actually gets
# exercised by CI; without it, only the stub alt output
# wait_event_timing_1.out is consumed and any regression in the
# timing-enabled code is invisible to upstream.
configure_script: |
su postgres <<-EOF
set -e
./configure \
--enable-cassert --enable-injection-points --enable-debug \
--enable-tap-tests --enable-nls \
--enable-wait-event-timing \
--with-segsize-blocks=6 \
--with-libnuma \
--with-liburing \
32 changes: 32 additions & 0 deletions configure
@@ -774,6 +774,7 @@ CC
enable_injection_points
PG_TEST_EXTRA
enable_tap_tests
enable_wait_event_timing
enable_dtrace
DTRACEFLAGS
DTRACE
@@ -850,6 +851,7 @@ enable_debug
enable_profiling
enable_coverage
enable_dtrace
enable_wait_event_timing
enable_tap_tests
enable_injection_points
with_blocksize
@@ -1551,6 +1553,8 @@ Optional Features:
--enable-profiling build with profiling enabled
--enable-coverage build with coverage testing instrumentation
--enable-dtrace build with DTrace support
--enable-wait-event-timing
build with wait event timing instrumentation
--enable-tap-tests enable TAP tests (requires Perl and IPC::Run)
--enable-injection-points
enable injection points (for testing)
@@ -3632,6 +3636,34 @@ fi



#
# --enable-wait-event-timing adds wait event timing instrumentation
#


# Check whether --enable-wait-event-timing was given.
if test "${enable_wait_event_timing+set}" = set; then :
enableval=$enable_wait_event_timing;
case $enableval in
yes)

$as_echo "#define USE_WAIT_EVENT_TIMING 1" >>confdefs.h

;;
no)
:
;;
*)
as_fn_error $? "no argument expected for --enable-wait-event-timing option" "$LINENO" 5
;;
esac

else
enable_wait_event_timing=no

fi



#
# TAP tests
8 changes: 8 additions & 0 deletions configure.ac
@@ -225,6 +225,14 @@ fi
AC_SUBST(DTRACEFLAGS)])
AC_SUBST(enable_dtrace)

#
# --enable-wait-event-timing adds wait event timing instrumentation
#
PGAC_ARG_BOOL(enable, wait-event-timing, no,
[build with wait event timing instrumentation],
[AC_DEFINE([USE_WAIT_EVENT_TIMING], 1,
[Define to 1 to build with wait event timing. (--enable-wait-event-timing)])])

#
# TAP tests
#
203 changes: 203 additions & 0 deletions doc/src/sgml/config.sgml
@@ -9110,6 +9110,209 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</listitem>
</varlistentry>

<varlistentry id="guc-wait-event-capture" xreflabel="wait_event_capture">
<term><varname>wait_event_capture</varname> (<type>enum</type>)
<indexterm>
<primary><varname>wait_event_capture</varname> configuration parameter</primary>
</indexterm>
</term>
<listitem>
<para>
Controls collection of wait event instrumentation data. Requires
the server to be compiled with
<option>--enable-wait-event-timing</option>. Possible values are
<literal>off</literal>, <literal>stats</literal>, and
<literal>trace</literal>; each level is a strict superset of the
previous one.
</para>
<para>
At <literal>stats</literal>, the server records per-backend wait
event statistics (counts, total and average durations, log2
histograms) visible in the
<link linkend="monitoring-pg-stat-wait-event-timing-view">
<structname>pg_stat_wait_event_timing</structname></link> view.
Two <function>clock_gettime()</function> calls are added around
every wait event transition, costing approximately
40&ndash;100&nbsp;ns each on modern hardware.
</para>
<para>
At <literal>trace</literal>, the server additionally records every
individual wait event into a per-session ring buffer (~4&nbsp;MB of
DSA per backend, allocated lazily on first enable), exposed via the
<link linkend="monitoring-pg-backend-wait-event-trace-view">
<structname>pg_backend_wait_event_trace</structname></link> view.
Each record carries either a wait event or a query-attribution
marker; consumers reconstruct which query owns which wait by
interleaving the two streams.
</para>
<para>
Two marker families are emitted into the ring:
<itemizedlist>
<listitem>
<para>
<literal>ExecStart</literal>/<literal>ExecEnd</literal> markers
bracket every executor invocation
(<function>ExecutorStart</function>/<function>ExecutorEnd</function>).
They are the primary attribution signal: every executable
statement, including those run inside parallel workers and
pipelined extended-protocol messages, is bracketed. Emission
requires <xref linkend="guc-compute-query-id"/> to produce a
non-zero <structfield>query_id</structfield>; otherwise the
markers are silently skipped. They are <emphasis>not</emphasis>
gated on <varname>track_activities</varname>.
</para>
</listitem>
<listitem>
<para>
<literal>QueryStart</literal>/<literal>QueryEnd</literal> markers
fire at top-level query identifier transitions and at the
transition to idle, providing inter-statement boundaries that
the executor markers cannot (e.g. the
<literal>ClientRead</literal> wait between statements). They
require both <xref linkend="guc-track-activities"/> and
<xref linkend="guc-compute-query-id"/> to be enabled.
</para>
</listitem>
</itemizedlist>
A <literal>WARNING</literal> is logged at the time
<varname>wait_event_capture</varname> is set to <literal>trace</literal>
if either prerequisite is missing.
</para>
<para>
The default is <literal>off</literal>. Only superusers and users
with the appropriate <literal>SET</literal> privilege can change
this setting.
</para>
<para>
The setting is gated to superuser by default because
<literal>trace</literal> mode allocates approximately 4&nbsp;MB
of dynamic shared memory per backend that enables it; an
unprivileged role enabling trace on every connection in a
large pool could consume substantial cluster-wide memory.
Read access to the resulting statistics is controlled
separately by membership in the
<link linkend="predefined-roles"><literal>pg_read_all_stats</literal></link>
role (which the <literal>pg_monitor</literal> role inherits),
so a monitoring operator can typically read
<structname>pg_stat_wait_event_timing</structname> but cannot
toggle <varname>wait_event_capture</varname> itself.
</para>
<para>
To delegate the ability to change this setting to a
non-superuser role &mdash; for example, the
<literal>pg_monitor</literal> role in environments where the
cluster owner is not the operator on call &mdash; use the
standard PostgreSQL <command>GRANT SET ON PARAMETER</command>
mechanism:
<programlisting>
GRANT SET ON PARAMETER wait_event_capture TO pg_monitor;
</programlisting>
After this, any role that has the <literal>pg_monitor</literal>
role membership can run
<command>SET wait_event_capture = stats</command> (or
<literal>= trace</literal>) for its own session. The grant is
per-installation policy rather than baked into the GUC, so
managed-PostgreSQL environments and self-hosted clusters can
choose independently whether monitoring roles should be able to
flip this on.
</para>
</listitem>
</varlistentry>
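
The interleaving described for <literal>trace</literal> mode can be sketched in a few lines of Python. This is only an illustration of the idea, not the view's actual interface: the tuple layout, the `attribute_waits` name, and the sample event names are assumptions made for the example, and only the `ExecStart`/`ExecEnd` marker family is modelled.

```python
from collections import defaultdict

# Hypothetical record shape (NOT the real view columns), in ring order:
#   ("ExecStart", query_id)
#   ("ExecEnd", query_id)
#   ("Wait", event_name, duration_us)


def attribute_waits(records):
    """Sum wait time per query_id from one backend's ordered trace."""
    totals = defaultdict(int)
    stack = []  # open ExecStart markers, innermost last
    for rec in records:
        kind = rec[0]
        if kind == "ExecStart":
            stack.append(rec[1])
        elif kind == "ExecEnd":
            # Close the innermost matching marker. An ExecEnd whose
            # ExecStart was already overwritten by ring wrap is ignored.
            if rec[1] in stack:
                while stack.pop() != rec[1]:
                    pass
        elif kind == "Wait":
            # Waits outside any executor bracket get the key None,
            # e.g. a ClientRead between statements.
            owner = stack[-1] if stack else None
            totals[owner] += rec[2]
    return dict(totals)


if __name__ == "__main__":
    trace = [
        ("ExecStart", 101),
        ("Wait", "DataFileRead", 120),
        ("ExecEnd", 101),
        ("Wait", "ClientRead", 5000),
        ("ExecStart", 202),
        ("Wait", "WALWrite", 80),
        ("ExecEnd", 202),
    ]
    print(attribute_waits(trace))
```

The stack handles nested executor invocations by charging each wait to the innermost open statement. A consumer that also wants inter-statement boundaries would fold in the `QueryStart`/`QueryEnd` markers the same way.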

<varlistentry id="guc-wait-event-timing-max-tranches" xreflabel="wait_event_timing_max_tranches">
<term><varname>wait_event_timing_max_tranches</varname> (<type>integer</type>)
<indexterm>
<primary><varname>wait_event_timing_max_tranches</varname> configuration parameter</primary>
</indexterm>
</term>
<listitem>
<para>
Sets the maximum number of distinct LWLock tranches whose timing
is recorded individually per backend. PostgreSQL maintains a
per-backend hash table that maps each tranche the backend
encounters to its histogram bucket; once the table fills, further
tranches encountered by that backend are counted against
<structfield>lwlock_overflow_count</structfield> in
<link linkend="monitoring-pg-stat-wait-event-timing-overflow-view">
<structname>pg_stat_wait_event_timing_overflow</structname></link>
and are not individually timed. The table is sized at server
start, and this parameter has no effect on builds compiled without
<option>--enable-wait-event-timing</option>. The default is
<literal>192</literal>; raise it if your installation loads many
extensions that register their own LWLock tranches and you
observe non-zero
<structfield>lwlock_overflow_count</structfield>.
</para>
<para>
The shared-memory cost is per-backend and proportional to this
setting. Each entry is approximately 152&nbsp;bytes (an
LWLock-timing histogram), and the slot table that resolves
tranche IDs adds another 4&nbsp;bytes per slot. The slot count is
twice this value, rounded up to a power of two.
At the default of 192 entries (512 slots) the per-backend overhead is
roughly 31&nbsp;KB; at 512 entries (1024 slots) roughly
80&nbsp;KB. The total cluster-wide cost is paid only when the
first backend in the cluster sets
<xref linkend="guc-wait-event-capture"/> to a non-<literal>off</literal>
value, and remains allocated for the postmaster's lifetime
regardless of subsequent GUC changes. Builds compiled without
<option>--enable-wait-event-timing</option> pay zero memory for
this setting.
</para>
<para>
This parameter can only be set at server start.
</para>
</listitem>
</varlistentry>
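
The per-backend cost quoted for <varname>wait_event_timing_max_tranches</varname> can be reproduced with a short calculation. The sketch below takes the 152-byte entry size and 4-byte slot size from the text as approximations; the function names are made up for the example and do not correspond to anything in the patch.

```python
ENTRY_BYTES = 152  # approximate size of one LWLock-timing histogram entry
SLOT_BYTES = 4     # size of one slot in the tranche-ID lookup table


def slot_count(max_tranches: int) -> int:
    """Smallest power of two that is >= 2 * max_tranches."""
    return 1 << (2 * max_tranches - 1).bit_length()


def per_backend_bytes(max_tranches: int) -> int:
    """Approximate per-backend overhead: entries plus slot table."""
    return max_tranches * ENTRY_BYTES + slot_count(max_tranches) * SLOT_BYTES


if __name__ == "__main__":
    for n in (192, 512):
        print(n, slot_count(n), per_backend_bytes(n) / 1024)
```

With these assumptions, 192 entries map to 512 slots and about 30.5 KB, and 512 entries map to 1024 slots and exactly 80 KB, which matches the rounded figures in the paragraph.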

<varlistentry id="guc-wait-event-trace-ring-size-kb" xreflabel="wait_event_trace_ring_size_kb">
<term><varname>wait_event_trace_ring_size_kb</varname> (<type>integer</type>)
<indexterm>
<primary><varname>wait_event_trace_ring_size_kb</varname> configuration parameter</primary>
</indexterm>
</term>
<listitem>
<para>
Sets the per-backend size, in kilobytes, of the wait-event-trace
ring buffer allocated when a session sets
<xref linkend="guc-wait-event-capture"/> to
<literal>trace</literal>. The value must be a power of two. It is
fixed at server start (<literal>PGC_POSTMASTER</literal>), so all
rings in a given postmaster run have the same size. This parameter has
no effect on builds compiled without
<option>--enable-wait-event-timing</option>.
</para>
<para>
Each record is 32 bytes, so the record count is the kilobyte
value times 32. The default of <literal>4096</literal> KB
(= 131072 records, ~4&nbsp;MB) gives roughly 0.5&ndash;1
second of retention at peak wait-event rates of 200K/s.
Larger values give longer retention before the FIFO wrap
overwrites the oldest records; smaller values reduce
per-backend memory at high <varname>max_connections</varname>.
Allowed range is <literal>8</literal>&ndash;<literal>32768</literal>
KB (256 records to ~1 million records per ring).
</para>
<para>
Worst-case total memory is approximately
<varname>max_connections</varname> *
<varname>wait_event_trace_ring_size_kb</varname>, allocated
lazily from a cluster-wide DSA only as backends enable
<varname>wait_event_capture</varname> = <literal>trace</literal>.
Memory is reclaimed when backends exit and their slots are
recycled, or explicitly via
<function>pg_stat_clear_orphaned_wait_event_rings</function>.
</para>
<para>
This parameter can only be set at server start.
</para>
</listitem>
</varlistentry>
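
The ring-size arithmetic for <varname>wait_event_trace_ring_size_kb</varname> is easy to check by hand or with a small helper. The sketch below assumes the 32-byte record size, the power-of-two rule, and the 8 to 32768 KB range stated in the text; the function names are hypothetical and exist only for this illustration.

```python
RECORD_BYTES = 32  # record size stated in the documentation text


def ring_records(ring_size_kb: int) -> int:
    """Number of trace records that fit in one ring of the given size."""
    if not 8 <= ring_size_kb <= 32768:
        raise ValueError("outside the documented range of 8 to 32768 KB")
    if ring_size_kb & (ring_size_kb - 1):
        raise ValueError("ring size must be a power of two")
    return ring_size_kb * 1024 // RECORD_BYTES


def retention_seconds(ring_size_kb: int, events_per_second: float) -> float:
    """Time until the oldest record is overwritten at a steady event rate."""
    return ring_records(ring_size_kb) / events_per_second


def worst_case_total_kb(max_connections: int, ring_size_kb: int) -> int:
    """Upper bound if every connection enables trace mode."""
    return max_connections * ring_size_kb


if __name__ == "__main__":
    print(ring_records(4096))
    print(retention_seconds(4096, 200_000))
    print(worst_case_total_kb(100, 4096))
```

At the default of 4096 KB and 200K events per second this gives about 0.66 seconds of retention, inside the 0.5 to 1 second range quoted above. The worst-case helper is only an upper bound, since rings are allocated lazily.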

<varlistentry id="guc-track-functions" xreflabel="track_functions">
<term><varname>track_functions</varname> (<type>enum</type>)
<indexterm>
27 changes: 27 additions & 0 deletions doc/src/sgml/installation.sgml
@@ -1594,6 +1594,33 @@ build-postgresql:
</listitem>
</varlistentry>

<varlistentry id="configure-option-enable-wait-event-timing">
<term><option>--enable-wait-event-timing</option></term>
<listitem>
<para>
Compiles in per-backend wait event timing instrumentation.
When enabled, every call to
<function>pgstat_report_wait_start()</function>/<function>pgstat_report_wait_end()</function>
records the wait duration and accumulates per-event statistics
(count, total time, histogram) in shared memory.
The overhead is two <function>clock_gettime(CLOCK_MONOTONIC)</function>
calls per wait event transition (~40&ndash;100&nbsp;ns via VDSO).
When not compiled in, the <varname>wait_event_capture</varname>
GUC still exists but only accepts <literal>off</literal>, and the
SQL functions return empty result sets.
The compile flag allocates approximately 120&nbsp;KB of shared
memory per backend slot for timing statistics (regardless of GUC
setting). At <varname>max_connections</varname>&nbsp;=&nbsp;200
this is roughly 26&nbsp;MB; at 1000 it is roughly 120&nbsp;MB.
Trace ring buffers are allocated lazily via DSA only when
<varname>wait_event_capture</varname> is set to
<literal>trace</literal> (~4&nbsp;MB per traced backend).
See <xref linkend="guc-wait-event-capture"/> for the runtime
control.
</para>
</listitem>
</varlistentry>

<varlistentry id="configure-option-enable-tap-tests">
<term><option>--enable-tap-tests</option></term>
<listitem>