[Buffers] Counter Buffers for Space Optimization when Latency Balancing #813
Open
ziadomalik wants to merge 13 commits into main
Collaborator
Author
You were indeed correct, I still had that extra one-cycle bug, fixed it!
Member
rebase + squash the commits?
chore: resolve merge conflicts
chore: memops are joins if no LSQ connection
chore: remove stall debug asserts in export-rtl
chore: fix HandshakePlaceBuffers
chore: add clarifying comments
chore: format
chore: format
chore: add algorithm back + export-rtl stalls (to remove later)
chore(buffer-definition): bad approach, todo
feat(buffers): fully functioning counter buffer
chore: use most uptodate `build.sh`
chore: move to constraints to constraint db, use better datatypes for determinism, cleanup structure
fix: remove hacky `namespace boost`
fix: data/aig
chore: format + data/aig
chore: manually apply the formatting to satisfy clang-format
Delete include/dynamatic/Transforms/ResourceSharing/Crush.h (this doesn't exist on main)
chore: rollback export-rtl + data/aig
chore: fix extra cycle bug in hdl
ade4088 to 5678a70
Jiahui17 requested changes on Apr 20, 2026
7733fde to cf7f9c4
Jiahui17 approved these changes on Apr 21, 2026
The FPGA24 paper, which aims to balance latency and occupancy in a dataflow circuit, centers its optimization on a new type of buffer that can hold a token for `n` cycles. This saves the space we previously consumed by placing `n` buffers (one for each cycle of latency).

Summary of the Changes
- `HandshakeOps.td`: Created `COUNTER_BUFFER`, a variant of `BufferOp` with two key attributes: [...]
- `HandshakeOps.cpp`: [...] `numSlots == 1` and `dvLatency >= 1`.
- `FPGA24Buffers.cpp`: [...] (`L_c`). [...] (`N_c`). The `extractResult` method translates these into physical buffers:
  - `K = min(L_c, N_c)` counter buffers are placed in series.
  - `dvLatency = floor(L_c / K)` (with remainder distributed). [...] `L_c`.
  - If `N_c > K`, additional `FIFO_BREAK_NONE` slots provide storage in case we need more occupancy than there are buffers to carry latency.
- `HandshakePlaceBuffers.cpp`: [...]
- `HandshakeToHW.cpp`: [...] `DV_LATENCY` to the HW parameters.
- Updated the RTL config (`rtl-config-verilog.json`).
- Modified `buffers.py` and `counter_buffer.py` to emit the HDL. (See below)

RTL Architecture
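The `extractResult` translation described in the summary of changes can be sketched in Python as follows. This is only an illustration of the arithmetic, not the code in `FPGA24Buffers.cpp`: the names `plan_channel` and `ChannelPlan` are hypothetical, the rule that the first buffers absorb the remainder is an assumption (the PR only says the remainder is distributed), and so is the count of `N_c - K` extra `FIFO_BREAK_NONE` slots.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChannelPlan:
    """Hypothetical result type: what gets placed on one channel."""
    counter_latencies: List[int]  # dvLatency of each counter buffer, in series
    extra_slots: int              # additional FIFO_BREAK_NONE slots


def plan_channel(l_c: int, n_c: int) -> ChannelPlan:
    """Split a channel's latency L_c and occupancy N_c into physical buffers."""
    if l_c < 0 or n_c < 0:
        raise ValueError("L_c and N_c must be non-negative")
    if l_c > 0 and n_c == 0:
        # ASSUMPTION: the solver never asks for latency without any slot.
        raise ValueError("latency requested on a channel with no slots")

    k = min(l_c, n_c)  # number of counter buffers placed in series
    if k == 0:
        # No latency to carry: only plain storage slots (if any).
        return ChannelPlan([], n_c)

    base, rem = divmod(l_c, k)  # base == floor(L_c / K), always >= 1 here
    # ASSUMPTION: the first `rem` buffers each take one extra cycle,
    # so the per-buffer latencies add up to exactly L_c.
    latencies = [base + 1] * rem + [base] * (k - rem)

    # ASSUMPTION: occupancy beyond K is covered by N_c - K plain slots.
    return ChannelPlan(latencies, n_c - k)


if __name__ == "__main__":
    print(plan_channel(7, 3))
    print(plan_channel(2, 5))
```

In this sketch every counter buffer ends up with `dvLatency >= 1`, since `K <= L_c`, which is consistent with the `dvLatency >= 1` condition mentioned for `HandshakeOps.cpp`.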
Counter buffer: 1 x bitwidth data register + ceil(log2(dvLatency)) counter bits + 1 busy flip-flop

State Transition Diagram
Note on the `DONE` state: The counter buffer must support back-to-back tokens without a dead cycle. This mirrors the shift register's `ins_ready = ~outs_valid | outs_ready`. The buffer signals readiness not only when idle, but also in the same cycle in which its output is being consumed. Without this, the buffer sits idle for one cycle between tokens, a cycle in which it could already be accepting a new token, halving throughput. This was the root cause of the 2x latency regression (1005 -> 2004 on `fir`).
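The effect of accepting a token in the same cycle the output is consumed can be illustrated with a small cycle-level model. This is a simplified Python sketch, not the generated HDL. The function `cycles_to_drain` is hypothetical. It assumes a single counter buffer fed by a source that always has a token and drained by a sink that is always ready, and it models the `DONE` state as "busy with the counter at zero".

```python
def cycles_to_drain(num_tokens: int, dv_latency: int,
                    accept_on_consume: bool) -> int:
    """Count cycles until `num_tokens` tokens have left one counter buffer.

    ASSUMPTIONS: the source always offers a token while any remain, the
    sink is always ready, and DONE means busy with the counter at zero.
    """
    busy = False   # the single busy flip-flop
    count = 0      # remaining cycles before the token becomes valid
    sent = 0       # tokens accepted at the input
    received = 0   # tokens consumed at the output
    cycle = 0

    while received < num_tokens:
        cycle += 1
        outs_valid = busy and count == 0       # DONE state
        outs_ready = True                      # sink never stalls
        ins_valid = sent < num_tokens
        consumed = outs_valid and outs_ready

        # Ready when idle; optionally also while the output is leaving.
        ins_ready = (not busy) or (accept_on_consume and consumed)

        # Clock edge: update the registers.
        if consumed:
            received += 1
            busy = False
        if ins_valid and ins_ready:
            sent += 1
            busy = True
            count = dv_latency - 1
        elif busy and count > 0:
            count -= 1

    return cycle


if __name__ == "__main__":
    print(cycles_to_drain(1000, 1, accept_on_consume=True))
    print(cycles_to_drain(1000, 1, accept_on_consume=False))
```

Under these assumptions and with `dv_latency = 1`, the model takes about one cycle per token when it accepts on consume and about two cycles per token when it is ready only while idle. That is the same kind of 2x gap as the regression described above, although the model does not reproduce the `fir` numbers themselves.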