fix: valgrind on ARM by not-matthias · Pull Request #21 · CodSpeedHQ/valgrind-codspeed

not-matthias · 2026-06-30T17:25:03Z

FYI: fde493a is 100% Claude generated and only used for debugging

codspeed-hq · 2026-06-30T17:33:31Z

Merging this PR will not alter performance

✅ 30 untouched benchmarks
⏩ 100 skipped benchmarks¹

_{Comparing cod-2985-arm-flamegraph-failures (0a322a7) with master (ce9d871)}

100 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

The AArch64 B{L} decoder tagged the whole opcode group as Ijk_Call, but only BL (bit 31 = 1, writes the link register) is a call; a plain B (bit 31 = 0) is an ordinary unconditional branch. Mislabelling B as a call made Callgrind treat every branch to a function epilogue or tail target as a call. At -O0 a conditional like `return n < 2 ? n : fib(...)` compiles the base case to `b <epilogue>`, so each base case was counted as a recursive call -- inflating recursive/cyclic call graphs and inventing phantom self-edges on arm64 (e.g. fib recursion 64 -> 98; mutual is_even/is_odd gaining self-loops). Align plain B with B.cond and the register-indirect JMP, which already use Ijk_Boring. Fixes the callgrind-utils recursion/mutual snapshot failures.

greptile-apps · 2026-07-03T09:01:41Z

Greptile Summary

This PR fixes Callgrind's call-graph accuracy on AArch64 through a set of coordinated changes across VEX, coregrind, and callgrind, and adds a new callgrind-utils Rust crate for parsing, rendering, and testing Callgrind output.

VEX fix (root cause): Plain B instructions were emitted as Ijk_Call instead of Ijk_Boring; only BL/BLR write the link register and are real calls. This single-line fix eliminates phantom recursive clones and corrupted call graphs on arm64.
ARM64 shadow stack (callstack.c + fn.c): push_call_stack now records the live guest X30 register as the return address (instead of statically computing "PC after the call instruction"), and _dl_tlsdesc_* TLS-descriptor trampolines are marked as PLT-skip-equivalent so they never surface as named graph nodes.
Frame-pointer fallback (m_stacktrace.c): Adds AAPCS64 X29-chain unwinding when CFI is absent (JIT pages, hand-written asm), mirroring the existing amd64 %rbp path; fixes truncated seed stacks at CALLGRIND_START_INSTRUMENTATION on arm64.
NIX_DEBUG_INFO_DIRS (readelf.c): Resolves separate debug files from Nix store paths before the FHS /usr/lib/debug fallback.

Confidence Score: 4/5

The core arm64 fixes are well-reasoned and backed by targeted snapshot tests; the new Rust crate is comprehensive. The two style-only findings are both in new code and carry no correctness risk.

All findings are style-level. The VEX B/BL reclassification is clearly correct, the X30 return-address tracking is carefully gated on the right conditions, and the frame-pointer fallback mirrors the proven amd64 path with appropriate guards. The only open question is a missing inline explanation for the next_pc == 1 terminator, which the changelog documents but the code does not. Nothing here is likely to cause incorrect behavior at runtime.

coregrind/m_stacktrace.c (unexplained 1 == next_pc guard), callgrind-utils/src/flamegraph.rs and callgrind-utils/src/parser/mod.rs (multi-clause boolean style)

Important Files Changed

Filename	Overview
VEX/priv/guest_arm64_toIR.c	Corrects B vs BL classification: plain B now emits Ijk_Boring instead of Ijk_Call, fixing phantom recursive call graph edges on arm64. Fix is minimal, correct, and well-commented.
callgrind/callstack.c	ARM64-gated change reads the live guest X30 register as the recorded return address for real calls, replacing the static 'PC after call instruction' calculation that breaks across skipped-code splices. Logic is sound; the conditional on skip
callgrind/fn.c	ARM64-gated: marks dl_tlsdesc* resolver trampolines as PLT-skip-equivalent, preventing them from leaking as named nodes when called from obj-skipped code.
coregrind/m_stacktrace.c	Adds AAPCS64 frame-pointer chain fallback (X29/{saved-X29,saved-X30}) when CFI lookup fails, mirroring the amd64 %rbp path. Guards are well-chosen; the 1 == next_pc terminator lacks an inline explanation.
coregrind/m_debuginfo/readelf.c	Adds NIX_DEBUG_INFO_DIRS colon-separated search before /usr/lib/debug fallback. The new try_buildid_dir helper correctly handles empty dirlen, edge-case colons, and memory ownership.
callgrind-utils/src/parser/mod.rs	New Callgrind .out parser handling name compression, three independent ID spaces, inline fi/fe transitions, cfl/cfi alias, part/thread boundaries, and bare-cfn discard. One multi-clause boolean condition should be named per team style rule.
callgrind-utils/src/flamegraph.rs	DFS-based folded-stack generator with cycle detection via on_path guard and proportional budget pruning. Logic is sound; one multi-clause guard should be extracted to a named variable per team style rule.
callgrind-utils/src/model.rs	Clean data model with deduplication, aggregation of call_count and inclusive_cost across duplicate edges, and binary-search node lookup. from_parts is well-structured.
callgrind-utils/build.rs	Build script drives configure+make for in-repo Callgrind, tracking only top-level callgrind/*.c/h to avoid re-triggering on test output files. CAPSTONE_DIR assertion gives a clear error on unconfigured trees.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["AArch64 BL/B instruction\n(VEX guest_arm64_toIR.c)"] -->|"bLink=1 → Ijk_Call\nbLink=0 → Ijk_Boring"| B["Callgrind dispatch\n(setup_bbcc)"]
    B -->|"jk_Call + real call"| C["push_call_stack\nRecord guest X30 as ret_addr\n(callstack.c ARM64 path)"]
    B -->|"jk_Call + splice or skip"| D["push_call_stack\nRecord ret_addr = 0\n(emulated/spliced frame)"]
    C --> E["Shadow call stack\n(ret_addr = architectural RA)"]
    D --> E
    E -->|"RET detected"| F["Pop matching frames\n(same-SP run, ret_addr match)"]
    G["CFI lookup\n(VG_use_CF_info)"] -->|"succeeds"| H["Emit IP with -1 bias"]
    G -->|"fails (JIT/asm)"| I["AAPCS64 FP chain fallback\nX29 → {saved X29, saved X30}\n(m_stacktrace.c)"]
    I --> H
    H --> J["Stack seed at\nCALLGRIND_START_INSTRUMENTATION"]
    J -->|"IP+1 → exact return targets"| E
    K["_dl_tlsdesc_* resolver\n(fn.c)"] -->|"fn->skip = skip_plt"| L["Transparent trampoline\n(never a named node)"]

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A["AArch64 BL/B instruction\n(VEX guest_arm64_toIR.c)"] -->|"bLink=1 → Ijk_Call\nbLink=0 → Ijk_Boring"| B["Callgrind dispatch\n(setup_bbcc)"]
    B -->|"jk_Call + real call"| C["push_call_stack\nRecord guest X30 as ret_addr\n(callstack.c ARM64 path)"]
    B -->|"jk_Call + splice or skip"| D["push_call_stack\nRecord ret_addr = 0\n(emulated/spliced frame)"]
    C --> E["Shadow call stack\n(ret_addr = architectural RA)"]
    D --> E
    E -->|"RET detected"| F["Pop matching frames\n(same-SP run, ret_addr match)"]
    G["CFI lookup\n(VG_use_CF_info)"] -->|"succeeds"| H["Emit IP with -1 bias"]
    G -->|"fails (JIT/asm)"| I["AAPCS64 FP chain fallback\nX29 → {saved X29, saved X30}\n(m_stacktrace.c)"]
    I --> H
    H --> J["Stack seed at\nCALLGRIND_START_INSTRUMENTATION"]
    J -->|"IP+1 → exact return targets"| E
    K["_dl_tlsdesc_* resolver\n(fn.c)"] -->|"fn->skip = skip_plt"| L["Transparent trampoline\n(never a named node)"]

_{Reviews (1): Last reviewed commit: "feat(callgrind-utils): add perf_map symb..." | Re-trigger Greptile}

On arm64, bl/blr write the return address to X30 and SP does not move across a call/return pair, so unlike on x86 the return detector cannot fall back on SP progress and depends entirely on each frame's recorded ret_addr. When a call entered skipped code (a libc PLT hop), the skipped->nonskipped jump was pushed with setup_bbcc's spliced 'nonskipped' source BB, whose last jump is the very call that created the skip frame below: the emulated frame duplicated that frame's statically computed return address, the callee's single ret popped only the top entry, and the leaked equal-SP skip frame starved the pop budget of the next same-SP return. Misclassified returns were then re-promoted into phantom calls back into the live caller, cloning non-recursive functions as bogus 'N recursion levels (complex_fractal_benchmark'2) and misattributing follow-up work ("free calls X"). Record the guest X30 -- the architectural return target -- for frames entered by a real call, and record ret_addr = 0 for emulated/spliced pushes so the return matcher absorbs them down the same-SP run and pops the group at the frame of the real call underneath, which also restores the pre-call nonskipped state. x86 keeps the static computation and is behaviorally unchanged. Verified by the arm64_plt_phantom_recursion / arm64_free_tailcall_phantom fixture snapshots (flipped to the correct shapes), the structural guards in arm64_fractal_alloc_no_free_misattribution, the no-phantom-clone assertion in rust_fixture_full_trace, and an unchanged in-tree 'vg_regtest callgrind' pass/fail set.

VG_(get_StackTrace) on arm64 was CFI-only, so Callgrind's OFF->ON shadow-stack seeding stopped at the first frame without unwind info -- notably CPython's -X perf JIT trampolines, which have no FDEs but keep the AAPCS64 frame chain alive (that is their design: perf/samply walk them by FP). The truncated seed left the seeded context stack one entry deep after the innermost frame popped; bbcc.c's underflow check then misread the fn-stack base sentinel as a signal marker on every return, and handleUnderflow fabricated named nodes for obj-skipped interpreter functions with inverted, full-cost edges (_ctypes_callproc -> PyCFuncPtr_call -> _TAIL_CALL_* -> ... as the graph root). x86_64 never hit this because its unwinder already has the %rbp fallback. Follow the frame records {saved X29, saved X30} when CFI fails, with guards: record in-stack and 8-aligned, SP must progress, next record strictly higher (saved X29 == 0 accepted as chain terminator), pc 0/1 stops the walk. Caller IPs keep the CFI path's -1 bias. Callgrind's seeder correspondingly records ret_addr = ips[frame+1] + 1, undoing that bias so the arm64 return matcher (exact-X30 matching, no SP movement on bl/ret) can pop seeded frames. An A/B run confirmed raising CLG_RECON_MAX_FRAMES alone does not help: the seed was CFI-truncated at ~8 frames, nowhere near the 256 cap. Regression coverage: callgrind-utils/tests/objskip_seed_underflow.rs, a minimal fixture whose asm trampoline maintains FP but has no CFI and which starts instrumentation two obj-skipped frames deep; it asserts no skipped frame leaks into the folded graph and that the workload parents under the trampoline.

Every TLSDESC __thread access blr's into the dynamic linker's resolver, which rets straight back into the middle of the accessing function. When the access is made from an obj-skipped object (CPython under pytest-codspeed), the skipped->nonskipped splice pushed the resolver frame with ret_addr 0; its mid-function return could never match, the RET-w/o-CALL promotion re-entered the skipped object with nonskipped pointing at the resolver, and skipped cost plus call edges piled up under _dl_tlsdesc_return -- pulling nearly whole Python flamegraphs under that node, plus inverted return-direction edges. Mark _dl_tlsdesc_* skipped (gated on --skip-plt, arm64 only), the same transparent-trampoline class as _dl_runtime_resolve; pop_on_jump cannot apply since these exit via a plain ret to a non-entry address. Skip pushes are never spliced and record the architectural X30, so the return pops cleanly and cost keeps flowing to the real non-skipped caller. Regression: callgrind-utils/tests/arm64_tls_access.rs (plain + --obj-skip runs of the shared-lib TLS fixture); both tests fail on the unfixed tool.

find_debug_file() only searched the /usr/lib/debug tree, which does not exist on NixOS: Nix ships separate debug outputs under their own store paths. Factor the build-id .build-id/xx/yyyy.debug probe into try_buildid_dir() and honour NIX_DEBUG_INFO_DIRS -- the established colon-separated convention also used by the nixpkgs gdb/lldb wrappers -- so split debug info resolves inside the dev-shell.

…megraph output New Rust crate (edition 2024) that reads a Callgrind .out profile and extracts call-graph topology (costs/addresses ignored), serializing to canonical index-ref JSON for stable cross-platform callgraph diffing and to folded/flamegraph stacks. Node identity is the {object,file,function} tuple so same-named statics stay distinct. Edges are emitted only on calls= lines; name compression across the three ID spaces, the cfl/cfi alias, inline fi/fe callee-context inheritance, and multi-part merge are handled. A redaction pass strips volatile addresses/paths so snapshots stay portable. Fixtures are compiled and profiled by the in-repo Callgrind through an rstest harness (vg-in-place, --instr-atstart=no plus client requests keep loader/libc frames out): base graphs (recursion, chain, diamond, mutual), arm64 unwind fixtures (tail calls, recursion, TLS access, alloc/free cycles, longjmp, phantom recursion), the objskip seeding underflow regression, and the arm64 TLS-descriptor regression. Folded-stack snapshots and structural assertions; clippy and rustfmt clean.

…frames Add the `perf_map` module: `CallGraph::symbolize_perf_map` resolves Callgrind's anonymous `0x...` JIT nodes to `py::<qualname>:<file>` via CPython's `/tmp/perf-<pid>.map`, written under `python3 -X perf`. Add the `fractal.py` fixture and `python_fractal_callgraph` test that profiles it live and snapshots the folded stacks and canonical JSON.

not-matthias force-pushed the cod-2985-arm-flamegraph-failures branch 4 times, most recently from 441b2f0 to 188e66b Compare July 2, 2026 17:04

not-matthias force-pushed the cod-2985-arm-flamegraph-failures branch 2 times, most recently from 5e34b22 to 771904a Compare July 3, 2026 08:54

not-matthias marked this pull request as ready for review July 3, 2026 08:55

not-matthias mentioned this pull request Jul 3, 2026

fix(callgrind): correct AArch64/PPC call-return detection for SP-equal frames (COD-2985) #20

Closed

greptile-apps Bot reviewed Jul 3, 2026

View reviewed changes

Comment thread coregrind/m_stacktrace.c

Comment thread callgrind-utils/src/flamegraph.rs Outdated

Comment thread callgrind-utils/src/parser/mod.rs Outdated

not-matthias force-pushed the cod-2985-arm-flamegraph-failures branch from 771904a to bb9053e Compare July 3, 2026 09:12

not-matthias added 6 commits July 3, 2026 13:53

not-matthias force-pushed the cod-2985-arm-flamegraph-failures branch from 9db3f68 to 0a322a7 Compare July 3, 2026 12:02

not-matthias requested a review from GuillaumeLagrange July 3, 2026 12:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: valgrind on ARM#21

fix: valgrind on ARM#21
not-matthias wants to merge 7 commits into
masterfrom
cod-2985-arm-flamegraph-failures

not-matthias commented Jun 30, 2026 •

edited

Loading

Uh oh!

codspeed-hq Bot commented Jun 30, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Jul 3, 2026

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

not-matthias commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codspeed-hq Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Footnotes

Uh oh!

greptile-apps Bot commented Jul 3, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

not-matthias commented Jun 30, 2026 •

edited

Loading

codspeed-hq Bot commented Jun 30, 2026 •

edited

Loading