Skip to content

fix: valgrind on ARM#21

Open
not-matthias wants to merge 7 commits into
masterfrom
cod-2985-arm-flamegraph-failures
Open

fix: valgrind on ARM#21
not-matthias wants to merge 7 commits into
masterfrom
cod-2985-arm-flamegraph-failures

Conversation

@not-matthias

@not-matthias not-matthias commented Jun 30, 2026

Copy link
Copy Markdown
Member

FYI: fde493a is 100% Claude generated and only used for debugging

@codspeed-hq

codspeed-hq Bot commented Jun 30, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

✅ 30 untouched benchmarks
⏩ 100 skipped benchmarks1


Comparing cod-2985-arm-flamegraph-failures (0a322a7) with master (ce9d871)

Open in CodSpeed

Footnotes

  1. 100 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@not-matthias not-matthias force-pushed the cod-2985-arm-flamegraph-failures branch 4 times, most recently from 441b2f0 to 188e66b Compare July 2, 2026 17:04
The AArch64 B{L} decoder tagged the whole opcode group as Ijk_Call,
but only BL (bit 31 = 1, writes the link register) is a call; a plain
B (bit 31 = 0) is an ordinary unconditional branch.

Mislabelling B as a call made Callgrind treat every branch to a
function epilogue or tail target as a call. At -O0 a conditional like
`return n < 2 ? n : fib(...)` compiles the base case to `b <epilogue>`,
so each base case was counted as a recursive call -- inflating
recursive/cyclic call graphs and inventing phantom self-edges on arm64
(e.g. fib recursion 64 -> 98; mutual is_even/is_odd gaining self-loops).

Align plain B with B.cond and the register-indirect JMP, which already
use Ijk_Boring. Fixes the callgrind-utils recursion/mutual snapshot
failures.
@not-matthias not-matthias force-pushed the cod-2985-arm-flamegraph-failures branch 2 times, most recently from 5e34b22 to 771904a Compare July 3, 2026 08:54
@not-matthias not-matthias marked this pull request as ready for review July 3, 2026 08:55
@greptile-apps

greptile-apps Bot commented Jul 3, 2026

Copy link
Copy Markdown

Greptile Summary

This PR fixes Callgrind's call-graph accuracy on AArch64 through a set of coordinated changes across VEX, coregrind, and callgrind, and adds a new callgrind-utils Rust crate for parsing, rendering, and testing Callgrind output.

  • VEX fix (root cause): Plain B instructions were emitted as Ijk_Call instead of Ijk_Boring; only BL/BLR write the link register and are real calls. This single-line fix eliminates phantom recursive clones and corrupted call graphs on arm64.
  • ARM64 shadow stack (callstack.c + fn.c): push_call_stack now records the live guest X30 register as the return address (instead of statically computing "PC after the call instruction"), and _dl_tlsdesc_* TLS-descriptor trampolines are marked as PLT-skip-equivalent so they never surface as named graph nodes.
  • Frame-pointer fallback (m_stacktrace.c): Adds AAPCS64 X29-chain unwinding when CFI is absent (JIT pages, hand-written asm), mirroring the existing amd64 %rbp path; fixes truncated seed stacks at CALLGRIND_START_INSTRUMENTATION on arm64.
  • NIX_DEBUG_INFO_DIRS (readelf.c): Resolves separate debug files from Nix store paths before the FHS /usr/lib/debug fallback.

Confidence Score: 4/5

The core arm64 fixes are well-reasoned and backed by targeted snapshot tests; the new Rust crate is comprehensive. The two style-only findings are both in new code and carry no correctness risk.

All findings are style-level. The VEX B/BL reclassification is clearly correct, the X30 return-address tracking is carefully gated on the right conditions, and the frame-pointer fallback mirrors the proven amd64 path with appropriate guards. The only open question is a missing inline explanation for the next_pc == 1 terminator, which the changelog documents but the code does not. Nothing here is likely to cause incorrect behavior at runtime.

coregrind/m_stacktrace.c (unexplained 1 == next_pc guard), callgrind-utils/src/flamegraph.rs and callgrind-utils/src/parser/mod.rs (multi-clause boolean style)

Important Files Changed

Filename Overview
VEX/priv/guest_arm64_toIR.c Corrects B vs BL classification: plain B now emits Ijk_Boring instead of Ijk_Call, fixing phantom recursive call graph edges on arm64. Fix is minimal, correct, and well-commented.
callgrind/callstack.c ARM64-gated change reads the live guest X30 register as the recorded return address for real calls, replacing the static 'PC after call instruction' calculation that breaks across skipped-code splices. Logic is sound; the conditional on skip
callgrind/fn.c ARM64-gated: marks dl_tlsdesc* resolver trampolines as PLT-skip-equivalent, preventing them from leaking as named nodes when called from obj-skipped code.
coregrind/m_stacktrace.c Adds AAPCS64 frame-pointer chain fallback (X29/{saved-X29,saved-X30}) when CFI lookup fails, mirroring the amd64 %rbp path. Guards are well-chosen; the 1 == next_pc terminator lacks an inline explanation.
coregrind/m_debuginfo/readelf.c Adds NIX_DEBUG_INFO_DIRS colon-separated search before /usr/lib/debug fallback. The new try_buildid_dir helper correctly handles empty dirlen, edge-case colons, and memory ownership.
callgrind-utils/src/parser/mod.rs New Callgrind .out parser handling name compression, three independent ID spaces, inline fi/fe transitions, cfl/cfi alias, part/thread boundaries, and bare-cfn discard. One multi-clause boolean condition should be named per team style rule.
callgrind-utils/src/flamegraph.rs DFS-based folded-stack generator with cycle detection via on_path guard and proportional budget pruning. Logic is sound; one multi-clause guard should be extracted to a named variable per team style rule.
callgrind-utils/src/model.rs Clean data model with deduplication, aggregation of call_count and inclusive_cost across duplicate edges, and binary-search node lookup. from_parts is well-structured.
callgrind-utils/build.rs Build script drives configure+make for in-repo Callgrind, tracking only top-level callgrind/*.c/h to avoid re-triggering on test output files. CAPSTONE_DIR assertion gives a clear error on unconfigured trees.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["AArch64 BL/B instruction\n(VEX guest_arm64_toIR.c)"] -->|"bLink=1 → Ijk_Call\nbLink=0 → Ijk_Boring"| B["Callgrind dispatch\n(setup_bbcc)"]
    B -->|"jk_Call + real call"| C["push_call_stack\nRecord guest X30 as ret_addr\n(callstack.c ARM64 path)"]
    B -->|"jk_Call + splice or skip"| D["push_call_stack\nRecord ret_addr = 0\n(emulated/spliced frame)"]
    C --> E["Shadow call stack\n(ret_addr = architectural RA)"]
    D --> E
    E -->|"RET detected"| F["Pop matching frames\n(same-SP run, ret_addr match)"]
    G["CFI lookup\n(VG_use_CF_info)"] -->|"succeeds"| H["Emit IP with -1 bias"]
    G -->|"fails (JIT/asm)"| I["AAPCS64 FP chain fallback\nX29 → {saved X29, saved X30}\n(m_stacktrace.c)"]
    I --> H
    H --> J["Stack seed at\nCALLGRIND_START_INSTRUMENTATION"]
    J -->|"IP+1 → exact return targets"| E
    K["_dl_tlsdesc_* resolver\n(fn.c)"] -->|"fn->skip = skip_plt"| L["Transparent trampoline\n(never a named node)"]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A["AArch64 BL/B instruction\n(VEX guest_arm64_toIR.c)"] -->|"bLink=1 → Ijk_Call\nbLink=0 → Ijk_Boring"| B["Callgrind dispatch\n(setup_bbcc)"]
    B -->|"jk_Call + real call"| C["push_call_stack\nRecord guest X30 as ret_addr\n(callstack.c ARM64 path)"]
    B -->|"jk_Call + splice or skip"| D["push_call_stack\nRecord ret_addr = 0\n(emulated/spliced frame)"]
    C --> E["Shadow call stack\n(ret_addr = architectural RA)"]
    D --> E
    E -->|"RET detected"| F["Pop matching frames\n(same-SP run, ret_addr match)"]
    G["CFI lookup\n(VG_use_CF_info)"] -->|"succeeds"| H["Emit IP with -1 bias"]
    G -->|"fails (JIT/asm)"| I["AAPCS64 FP chain fallback\nX29 → {saved X29, saved X30}\n(m_stacktrace.c)"]
    I --> H
    H --> J["Stack seed at\nCALLGRIND_START_INSTRUMENTATION"]
    J -->|"IP+1 → exact return targets"| E
    K["_dl_tlsdesc_* resolver\n(fn.c)"] -->|"fn->skip = skip_plt"| L["Transparent trampoline\n(never a named node)"]
Loading

Reviews (1): Last reviewed commit: "feat(callgrind-utils): add perf_map symb..." | Re-trigger Greptile

Comment thread coregrind/m_stacktrace.c
Comment thread callgrind-utils/src/flamegraph.rs Outdated
Comment thread callgrind-utils/src/parser/mod.rs Outdated
@not-matthias not-matthias force-pushed the cod-2985-arm-flamegraph-failures branch from 771904a to bb9053e Compare July 3, 2026 09:12
On arm64, bl/blr write the return address to X30 and SP does not move
across a call/return pair, so unlike on x86 the return detector cannot
fall back on SP progress and depends entirely on each frame's recorded
ret_addr. When a call entered skipped code (a libc PLT hop), the
skipped->nonskipped jump was pushed with setup_bbcc's spliced
'nonskipped' source BB, whose last jump is the very call that created
the skip frame below: the emulated frame duplicated that frame's
statically computed return address, the callee's single ret popped only
the top entry, and the leaked equal-SP skip frame starved the pop
budget of the next same-SP return. Misclassified returns were then
re-promoted into phantom calls back into the live caller, cloning
non-recursive functions as bogus 'N recursion levels
(complex_fractal_benchmark'2) and misattributing follow-up work
("free calls X").

Record the guest X30 -- the architectural return target -- for frames
entered by a real call, and record ret_addr = 0 for emulated/spliced
pushes so the return matcher absorbs them down the same-SP run and pops
the group at the frame of the real call underneath, which also restores
the pre-call nonskipped state. x86 keeps the static computation and is
behaviorally unchanged.

Verified by the arm64_plt_phantom_recursion / arm64_free_tailcall_phantom
fixture snapshots (flipped to the correct shapes), the structural guards
in arm64_fractal_alloc_no_free_misattribution, the no-phantom-clone
assertion in rust_fixture_full_trace, and an unchanged in-tree
'vg_regtest callgrind' pass/fail set.
VG_(get_StackTrace) on arm64 was CFI-only, so Callgrind's OFF->ON
shadow-stack seeding stopped at the first frame without unwind info --
notably CPython's -X perf JIT trampolines, which have no FDEs but keep
the AAPCS64 frame chain alive (that is their design: perf/samply walk
them by FP). The truncated seed left the seeded context stack one entry
deep after the innermost frame popped; bbcc.c's underflow check then
misread the fn-stack base sentinel as a signal marker on every return,
and handleUnderflow fabricated named nodes for obj-skipped interpreter
functions with inverted, full-cost edges (_ctypes_callproc ->
PyCFuncPtr_call -> _TAIL_CALL_* -> ... as the graph root). x86_64 never
hit this because its unwinder already has the %rbp fallback.

Follow the frame records {saved X29, saved X30} when CFI fails, with
guards: record in-stack and 8-aligned, SP must progress, next record
strictly higher (saved X29 == 0 accepted as chain terminator), pc 0/1
stops the walk. Caller IPs keep the CFI path's -1 bias.

Callgrind's seeder correspondingly records ret_addr = ips[frame+1] + 1,
undoing that bias so the arm64 return matcher (exact-X30 matching, no
SP movement on bl/ret) can pop seeded frames.

An A/B run confirmed raising CLG_RECON_MAX_FRAMES alone does not help:
the seed was CFI-truncated at ~8 frames, nowhere near the 256 cap.

Regression coverage: callgrind-utils/tests/objskip_seed_underflow.rs, a
minimal fixture whose asm trampoline maintains FP but has no CFI and
which starts instrumentation two obj-skipped frames deep; it asserts no
skipped frame leaks into the folded graph and that the workload parents
under the trampoline.
Every TLSDESC __thread access blr's into the dynamic linker's resolver,
which rets straight back into the middle of the accessing function. When
the access is made from an obj-skipped object (CPython under
pytest-codspeed), the skipped->nonskipped splice pushed the resolver
frame with ret_addr 0; its mid-function return could never match, the
RET-w/o-CALL promotion re-entered the skipped object with nonskipped
pointing at the resolver, and skipped cost plus call edges piled up
under _dl_tlsdesc_return -- pulling nearly whole Python flamegraphs
under that node, plus inverted return-direction edges.

Mark _dl_tlsdesc_* skipped (gated on --skip-plt, arm64 only), the same
transparent-trampoline class as _dl_runtime_resolve; pop_on_jump cannot
apply since these exit via a plain ret to a non-entry address. Skip
pushes are never spliced and record the architectural X30, so the
return pops cleanly and cost keeps flowing to the real non-skipped
caller.

Regression: callgrind-utils/tests/arm64_tls_access.rs (plain +
--obj-skip runs of the shared-lib TLS fixture); both tests fail on the
unfixed tool.
find_debug_file() only searched the /usr/lib/debug tree, which does not
exist on NixOS: Nix ships separate debug outputs under their own store
paths. Factor the build-id .build-id/xx/yyyy.debug probe into
try_buildid_dir() and honour NIX_DEBUG_INFO_DIRS -- the established
colon-separated convention also used by the nixpkgs gdb/lldb wrappers --
so split debug info resolves inside the dev-shell.
…megraph output

New Rust crate (edition 2024) that reads a Callgrind .out profile and
extracts call-graph topology (costs/addresses ignored), serializing to
canonical index-ref JSON for stable cross-platform callgraph diffing and
to folded/flamegraph stacks.

Node identity is the {object,file,function} tuple so same-named statics
stay distinct. Edges are emitted only on calls= lines; name compression
across the three ID spaces, the cfl/cfi alias, inline fi/fe
callee-context inheritance, and multi-part merge are handled. A redaction
pass strips volatile addresses/paths so snapshots stay portable.

Fixtures are compiled and profiled by the in-repo Callgrind through an
rstest harness (vg-in-place, --instr-atstart=no plus client requests
keep loader/libc frames out): base graphs (recursion, chain, diamond,
mutual), arm64 unwind fixtures (tail calls, recursion, TLS access,
alloc/free cycles, longjmp, phantom recursion), the objskip seeding
underflow regression, and the arm64 TLS-descriptor regression.
Folded-stack snapshots and structural assertions; clippy and rustfmt
clean.
…frames

Add the `perf_map` module: `CallGraph::symbolize_perf_map` resolves Callgrind's anonymous `0x...` JIT nodes to `py::<qualname>:<file>` via CPython's `/tmp/perf-<pid>.map`, written under `python3 -X perf`. Add the `fractal.py` fixture and `python_fractal_callgraph` test that profiles it live and snapshots the folded stacks and canonical JSON.
@not-matthias not-matthias force-pushed the cod-2985-arm-flamegraph-failures branch from 9db3f68 to 0a322a7 Compare July 3, 2026 12:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant