fix: valgrind on ARM#21
Conversation
Merging this PR will not alter performance
Comparing Footnotes
|
441b2f0 to
188e66b
Compare
The AArch64 B{L} decoder tagged the whole opcode group as Ijk_Call,
but only BL (bit 31 = 1, writes the link register) is a call; a plain
B (bit 31 = 0) is an ordinary unconditional branch.
Mislabelling B as a call made Callgrind treat every branch to a
function epilogue or tail target as a call. At -O0 a conditional like
`return n < 2 ? n : fib(...)` compiles the base case to `b <epilogue>`,
so each base case was counted as a recursive call -- inflating
recursive/cyclic call graphs and inventing phantom self-edges on arm64
(e.g. fib recursion 64 -> 98; mutual is_even/is_odd gaining self-loops).
Align plain B with B.cond and the register-indirect JMP, which already
use Ijk_Boring. Fixes the callgrind-utils recursion/mutual snapshot
failures.
5e34b22 to
771904a
Compare
Greptile SummaryThis PR fixes Callgrind's call-graph accuracy on AArch64 through a set of coordinated changes across VEX, coregrind, and callgrind, and adds a new
Confidence Score: 4/5The core arm64 fixes are well-reasoned and backed by targeted snapshot tests; the new Rust crate is comprehensive. The two style-only findings are both in new code and carry no correctness risk. All findings are style-level. The VEX B/BL reclassification is clearly correct, the X30 return-address tracking is carefully gated on the right conditions, and the frame-pointer fallback mirrors the proven amd64 path with appropriate guards. The only open question is a missing inline explanation for the next_pc == 1 terminator, which the changelog documents but the code does not. Nothing here is likely to cause incorrect behavior at runtime. coregrind/m_stacktrace.c (unexplained 1 == next_pc guard), callgrind-utils/src/flamegraph.rs and callgrind-utils/src/parser/mod.rs (multi-clause boolean style) Important Files Changed
|
771904a to
bb9053e
Compare
On arm64, bl/blr write the return address to X30 and SP does not move
across a call/return pair, so unlike on x86 the return detector cannot
fall back on SP progress and depends entirely on each frame's recorded
ret_addr. When a call entered skipped code (a libc PLT hop), the
skipped->nonskipped jump was pushed with setup_bbcc's spliced
'nonskipped' source BB, whose last jump is the very call that created
the skip frame below: the emulated frame duplicated that frame's
statically computed return address, the callee's single ret popped only
the top entry, and the leaked equal-SP skip frame starved the pop
budget of the next same-SP return. Misclassified returns were then
re-promoted into phantom calls back into the live caller, cloning
non-recursive functions as bogus 'N recursion levels
(complex_fractal_benchmark'2) and misattributing follow-up work
("free calls X").
Record the guest X30 -- the architectural return target -- for frames
entered by a real call, and record ret_addr = 0 for emulated/spliced
pushes so the return matcher absorbs them down the same-SP run and pops
the group at the frame of the real call underneath, which also restores
the pre-call nonskipped state. x86 keeps the static computation and is
behaviorally unchanged.
Verified by the arm64_plt_phantom_recursion / arm64_free_tailcall_phantom
fixture snapshots (flipped to the correct shapes), the structural guards
in arm64_fractal_alloc_no_free_misattribution, the no-phantom-clone
assertion in rust_fixture_full_trace, and an unchanged in-tree
'vg_regtest callgrind' pass/fail set.
VG_(get_StackTrace) on arm64 was CFI-only, so Callgrind's OFF->ON
shadow-stack seeding stopped at the first frame without unwind info --
notably CPython's -X perf JIT trampolines, which have no FDEs but keep
the AAPCS64 frame chain alive (that is their design: perf/samply walk
them by FP). The truncated seed left the seeded context stack one entry
deep after the innermost frame popped; bbcc.c's underflow check then
misread the fn-stack base sentinel as a signal marker on every return,
and handleUnderflow fabricated named nodes for obj-skipped interpreter
functions with inverted, full-cost edges (_ctypes_callproc ->
PyCFuncPtr_call -> _TAIL_CALL_* -> ... as the graph root). x86_64 never
hit this because its unwinder already has the %rbp fallback.
Follow the frame records {saved X29, saved X30} when CFI fails, with
guards: record in-stack and 8-aligned, SP must progress, next record
strictly higher (saved X29 == 0 accepted as chain terminator), pc 0/1
stops the walk. Caller IPs keep the CFI path's -1 bias.
Callgrind's seeder correspondingly records ret_addr = ips[frame+1] + 1,
undoing that bias so the arm64 return matcher (exact-X30 matching, no
SP movement on bl/ret) can pop seeded frames.
An A/B run confirmed raising CLG_RECON_MAX_FRAMES alone does not help:
the seed was CFI-truncated at ~8 frames, nowhere near the 256 cap.
Regression coverage: callgrind-utils/tests/objskip_seed_underflow.rs, a
minimal fixture whose asm trampoline maintains FP but has no CFI and
which starts instrumentation two obj-skipped frames deep; it asserts no
skipped frame leaks into the folded graph and that the workload parents
under the trampoline.
Every TLSDESC __thread access blr's into the dynamic linker's resolver, which rets straight back into the middle of the accessing function. When the access is made from an obj-skipped object (CPython under pytest-codspeed), the skipped->nonskipped splice pushed the resolver frame with ret_addr 0; its mid-function return could never match, the RET-w/o-CALL promotion re-entered the skipped object with nonskipped pointing at the resolver, and skipped cost plus call edges piled up under _dl_tlsdesc_return -- pulling nearly whole Python flamegraphs under that node, plus inverted return-direction edges. Mark _dl_tlsdesc_* skipped (gated on --skip-plt, arm64 only), the same transparent-trampoline class as _dl_runtime_resolve; pop_on_jump cannot apply since these exit via a plain ret to a non-entry address. Skip pushes are never spliced and record the architectural X30, so the return pops cleanly and cost keeps flowing to the real non-skipped caller. Regression: callgrind-utils/tests/arm64_tls_access.rs (plain + --obj-skip runs of the shared-lib TLS fixture); both tests fail on the unfixed tool.
find_debug_file() only searched the /usr/lib/debug tree, which does not exist on NixOS: Nix ships separate debug outputs under their own store paths. Factor the build-id .build-id/xx/yyyy.debug probe into try_buildid_dir() and honour NIX_DEBUG_INFO_DIRS -- the established colon-separated convention also used by the nixpkgs gdb/lldb wrappers -- so split debug info resolves inside the dev-shell.
…megraph output
New Rust crate (edition 2024) that reads a Callgrind .out profile and
extracts call-graph topology (costs/addresses ignored), serializing to
canonical index-ref JSON for stable cross-platform callgraph diffing and
to folded/flamegraph stacks.
Node identity is the {object,file,function} tuple so same-named statics
stay distinct. Edges are emitted only on calls= lines; name compression
across the three ID spaces, the cfl/cfi alias, inline fi/fe
callee-context inheritance, and multi-part merge are handled. A redaction
pass strips volatile addresses/paths so snapshots stay portable.
Fixtures are compiled and profiled by the in-repo Callgrind through an
rstest harness (vg-in-place, --instr-atstart=no plus client requests
keep loader/libc frames out): base graphs (recursion, chain, diamond,
mutual), arm64 unwind fixtures (tail calls, recursion, TLS access,
alloc/free cycles, longjmp, phantom recursion), the objskip seeding
underflow regression, and the arm64 TLS-descriptor regression.
Folded-stack snapshots and structural assertions; clippy and rustfmt
clean.
…frames Add the `perf_map` module: `CallGraph::symbolize_perf_map` resolves Callgrind's anonymous `0x...` JIT nodes to `py::<qualname>:<file>` via CPython's `/tmp/perf-<pid>.map`, written under `python3 -X perf`. Add the `fractal.py` fixture and `python_fractal_callgraph` test that profiles it live and snapshots the folded stacks and canonical JSON.
9db3f68 to
0a322a7
Compare
FYI: fde493a is 100% Claude generated and only used for debugging