
[Buffers] Counter Buffers for Space Optimization when Latency Balancing#813

Open
ziadomalik wants to merge 13 commits into main from exp/ziad/counter-buffer

@ziadomalik
Collaborator

The FPGA24 paper on latency and occupancy balancing of dataflow circuits builds its optimization around a new type of buffer that can hold a token for n cycles. This PR adds that buffer, which saves the area we previously spent on chaining n buffers (one per cycle of latency).

Summary of the Changes

  1. In HandshakeOps.td:
    Created COUNTER_BUFFER, a variant of BufferOp with two key attributes:
  • numSlots = 1 (always stores exactly one token)
  • New: dvLatency >= 1 (the delay in clock cycles)
  2. In HandshakeOps.cpp:
  • The verifier enforces numSlots == 1 and dvLatency >= 1.
  3. In FPGA24Buffers.cpp:
  • LP1 computes the required extra latency per channel (L_c).
  • LP2 computes the optimal buffer placement to satisfy occupancy (N_c).
  • The extractResult method translates these into physical buffers:
    • K = min(L_c, N_c) counter buffers are placed in series.
    • Each gets dvLatency = floor(L_c / K), with the remainder distributed among them.
    • Their delays must sum to L_c.
    • If N_c > K, additional FIFO_BREAK_NONE slots are added to provide storage for the case where the channel needs more occupancy than there are counter buffers carrying the latency.
  4. In HandshakePlaceBuffers.cpp:
  • Each counter buffer in counterBufferLatencies becomes a BufferOp with numSlots=1 and its specific dvLatency.
  5. In HandshakeToHW.cpp:
  • Added DV_LATENCY to the HW parameters.
  6. Updated the RTL config (rtl-config-verilog.json).
  7. Modified buffers.py and counter_buffer.py to emit the HDL (see below).
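
To make the extractResult step concrete, here is a minimal Python sketch of how L_c and N_c could be turned into per-buffer delays. It is not the code in FPGA24Buffers.cpp: the function name `split_latency` is hypothetical, giving the remainder to the first buffers is only one possible way of distributing it, the number of extra slots is assumed to be N_c - K, and the sketch assumes N_c >= 1 whenever L_c >= 1.

```python
def split_latency(l_c, n_c):
    """Return (counter_latencies, extra_slots) for one channel.

    l_c: extra latency required on the channel (from LP1)
    n_c: occupancy required on the channel (from LP2)
    """
    assert l_c >= 0 and n_c >= 0
    if l_c == 0:
        # No latency to carry: no counter buffers, only plain slots.
        return [], n_c
    # ASSUMPTION: the LP results never ask for latency without any slot.
    assert n_c >= 1, "latency requested on a channel without any slot"

    k = min(l_c, n_c)            # number of counter buffers in series
    base, rem = divmod(l_c, k)   # floor(L_c / K) and the leftover cycles
    # ASSUMPTION: the first `rem` buffers each take one extra cycle.
    latencies = [base + 1] * rem + [base] * (k - rem)
    # ASSUMPTION: the remaining occupancy is covered by N_c - K plain slots.
    extra_slots = n_c - k
    return latencies, extra_slots


if __name__ == "__main__":
    print(split_latency(7, 3))   # latency dominates: three counter buffers
    print(split_latency(2, 5))   # occupancy dominates: extra plain slots
```

Because K <= L_c, every delay produced this way is at least 1, which is what the verifier in HandshakeOps.cpp requires, and the delays sum to L_c by construction.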

RTL Architecture

Counter buffer: 1 x bitwidth data register + ceil(log2(dvLatency)) counter bits + 1 busy flip-flop
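
As a rough illustration of the saving, the sketch below counts flip-flops using the formula above and compares it with a chain of n one-slot buffers. The function names are hypothetical, and the cost of the chain (one data register plus one valid bit per buffer) is an assumption made for this comparison, not a number taken from the PR.

```python
def counter_buffer_bits(bitwidth, dv_latency):
    """Flip-flops in one counter buffer: data + counter + busy."""
    assert bitwidth >= 0 and dv_latency >= 1
    # (d - 1).bit_length() equals ceil(log2(d)) for every integer d >= 1
    # and avoids floating-point rounding.
    counter_bits = (dv_latency - 1).bit_length()
    return bitwidth + counter_bits + 1


def buffer_chain_bits(bitwidth, n):
    """Flip-flops in a chain of n one-slot buffers.

    ASSUMPTION: each buffer holds `bitwidth` data bits plus one valid bit.
    """
    assert bitwidth >= 0 and n >= 0
    return n * (bitwidth + 1)


if __name__ == "__main__":
    for d in (1, 2, 8, 32):
        print(d, counter_buffer_bits(32, d), buffer_chain_bits(32, d))
```

Note that for dvLatency = 1 the formula gives zero counter bits, so the counter buffer then costs the same number of flip-flops as a single one-slot buffer under this assumption.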

State Transition Diagram

| State | busy | counter | ins_ready | outs_valid | Transition |
|---|---|---|---|---|---|
| IDLE | 0 | - | 1 | 0 | ins_valid -> COUNTING |
| COUNTING | 1 | > 0 | 0 | 0 | Decrement counter each cycle |
| DONE | 1 | 0 | outs_ready | 1 | outs_ready & ins_valid -> COUNTING (reload); outs_ready & !ins_valid -> IDLE |

Note on the DONE state:
The counter buffer must support back-to-back tokens without a dead cycle.

assign ins_ready = ~busy | (done & outs_ready);

This mirrors the shift register's ins_ready = ~outs_valid | outs_ready. The buffer signals readiness not only when idle, but also in the same cycle in which its output is being consumed. Without the second term, the buffer spends one cycle between tokens in which it could already be accepting the next one, which halves throughput. This was the root cause of the 2x latency regression (1005 -> 2004 on fir).
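
The effect of that term can be reproduced with a small cycle-level model. This is a sketch of the state table above, not the generated HDL: the function name `tokens_delivered` is hypothetical, and the model assumes the counter is reloaded with dvLatency - 1 on every accepted token, so a buffer with dvLatency = 1 goes straight to DONE.

```python
def tokens_delivered(dv_latency, cycles, bypass_ready=True):
    """Count tokens leaving the buffer over `cycles` clock cycles.

    The source is always valid and the sink is always ready.
    bypass_ready=True  models ins_ready = ~busy | (done & outs_ready)
    bypass_ready=False models ins_ready = ~busy
    """
    assert dv_latency >= 1
    busy, counter, delivered = False, 0, 0
    for _ in range(cycles):
        ins_valid, outs_ready = True, True
        done = busy and counter == 0
        outs_valid = done
        ins_ready = (not busy) or (bypass_ready and done and outs_ready)
        in_fire = ins_valid and ins_ready
        out_fire = outs_valid and outs_ready
        if out_fire:
            delivered += 1
        # Next-state logic, evaluated on the clock edge.
        if in_fire:
            busy = True
            counter = dv_latency - 1   # ASSUMPTION: reload value
        elif out_fire:
            busy = False               # DONE -> IDLE
        elif busy and counter > 0:
            counter -= 1               # COUNTING
    return delivered


if __name__ == "__main__":
    print(tokens_delivered(1, 100, bypass_ready=True))
    print(tokens_delivered(1, 100, bypass_ready=False))
```

With dvLatency = 1 and 100 simulated cycles, this model delivers 99 tokens with the bypass term and 50 without it, which matches the halved throughput described above. For larger delays the penalty in this model is one extra cycle per token rather than a factor of two.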

@ziadomalik ziadomalik requested a review from Jiahui17 March 26, 2026 13:04
@Jiahui17
Member

-- a one-slot buffer that holds the token for at least
-- LATENCY number of cycles. note that when LATENCY = 1,
-- the II of this unit is 1/2 (since it has only a
-- single slot)

-- IEEE packages used below; data_array is assumed to come from a
-- project package that is not shown in this comment
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use ieee.math_real.all;

entity delayer is
	generic (
		INPUTS        : integer;
		OUTPUTS       : integer;
		DATA_SIZE_IN  : integer;
		DATA_SIZE_OUT : integer;
		LATENCY       : integer := 4
	);
	port (
		clk, rst      : in  std_logic;
		dataInArray   : in  data_array(INPUTS - 1 downto 0)(DATA_SIZE_IN - 1 downto 0);
		dataOutArray  : out data_array(0 downto 0)(DATA_SIZE_OUT - 1 downto 0);
		pValidArray   : in  std_logic_vector(INPUTS - 1 downto 0);
		nReadyArray   : in  std_logic_vector(0 downto 0);
		validArray    : out std_logic_vector(0 downto 0);
		readyArray    : out std_logic_vector(INPUTS - 1 downto 0)
	);

begin
	assert INPUTS  = 1 severity failure;
	assert OUTPUTS = 1 severity failure;
	assert LATENCY  > 0 severity failure;
	assert DATA_SIZE_IN  > 0 severity failure;
	assert DATA_SIZE_OUT > 0 severity failure;
end delayer;

architecture arch of delayer is
	constant counter_width : integer := integer(ceil(log2(real(LATENCY))));
	signal full_reg    : std_logic := '0';
	signal data_reg    : std_logic_vector(DATA_SIZE_IN-1 downto 0) := (others => '0');
	signal counter_reg  : unsigned (counter_width - 1 downto 0) := (others => '0');
	signal output_transfer : std_logic := '0';
	signal input_transfer  : std_logic := '0';

	signal valid_internal : std_logic := '0';
	signal ready_internal : std_logic := '0';
	constant COUNTER_ZERO : unsigned (counter_width - 1 downto 0) := (others => '0');

	signal one: std_logic_vector (0 downto 0) := "1";
	signal zero: std_logic_vector (0 downto 0) := "0";

	signal b_counter_latency : std_logic := '0';
begin

	output_transfer <= (valid_internal and nReadyArray(0));
	input_transfer <= (pValidArray(0) and ready_internal);
	ready_internal <= output_transfer or (not full_reg);

	b_counter_latency <= '1' when (counter_reg = to_unsigned(LATENCY - 1, counter_width)) else '0';

	validArray(0) <= valid_internal;
	readyArray(0) <= ready_internal;

	valid_internal <= (b_counter_latency and full_reg);
	dataOutArray(0) <= data_reg;

	p_update_counter : process (clk)
	begin
		if (rising_edge(clk)) then
			-- counter_reg counts up from 0 and stops at LATENCY - 1
			-- (one cycle is already spent making full_reg = 1)
			if (rst) or (output_transfer) then
				counter_reg <= (others => '0');
			elsif (full_reg) and (not b_counter_latency) then
				counter_reg <= counter_reg + 1;
			end if;
		end if;
	end process;

	p_update_full_reg : process (clk)
	begin
		if (rising_edge(clk)) then
			if (rst) then
				full_reg <= '0';
			elsif (input_transfer) then
				full_reg <= '1';
			elsif (output_transfer) and (not input_transfer) then
				full_reg <= '0';
			end if;
		end if;
	end process;

	p_update_data_reg : process (clk)
	begin
		if (rising_edge(clk)) then
			if (rst) then
				data_reg <= (others => '0');
			elsif (input_transfer) then
				data_reg <= dataInArray(0);
			end if;
		end if;
	end process;


end arch;
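
For readers who want to check the hold time without an HDL simulator, the following is a rough Python transcription of the delayer above. The class and helper names are hypothetical, it models only the single-input, single-output case that the entity asserts, and it ignores the width of counter_reg.

```python
class Delayer:
    """Cycle-level model of the VHDL delayer (one slot, count-up timer)."""

    def __init__(self, latency):
        assert latency > 0
        self.latency = latency
        self.full = False      # full_reg
        self.counter = 0       # counter_reg
        self.data = None       # data_reg

    def step(self, p_valid, data_in, n_ready):
        """Advance one clock cycle; return (valid, ready, data_out)."""
        at_latency = self.counter == self.latency - 1   # b_counter_latency
        valid = at_latency and self.full
        output_transfer = valid and n_ready
        ready = output_transfer or not self.full
        input_transfer = p_valid and ready
        data_out = self.data
        # Register updates, all computed from the pre-edge values.
        if output_transfer:
            next_counter = 0
        elif self.full and not at_latency:
            next_counter = self.counter + 1
        else:
            next_counter = self.counter
        if input_transfer:
            self.full, self.data = True, data_in
        elif output_transfer:
            self.full = False
        self.counter = next_counter
        return valid, ready, data_out


def cycles_until_valid(latency):
    """Cycles between accepting a token and first seeing it valid."""
    d = Delayer(latency)
    valid, ready, _ = d.step(True, "tok", True)   # cycle 0: token accepted
    assert ready and not valid
    cycle = 0
    while True:
        cycle += 1
        valid, _, _ = d.step(False, None, True)
        if valid:
            return cycle


if __name__ == "__main__":
    print([cycles_until_valid(n) for n in (1, 2, 4)])
```

In this model a token accepted in cycle 0 becomes valid exactly LATENCY cycles later, because the cycle spent setting full_reg counts toward the delay.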

@ziadomalik
Collaborator Author

ziadomalik commented Mar 30, 2026

You were indeed correct, I still had that extra one-cycle bug, fixed it!

@Jiahui17
Member

Jiahui17 commented Apr 1, 2026

rebase + squash the commits?

chore: resolve merge conflicts

chore: memops are joins if no LSQ connection

chore: remove stall debug asserts in export-rtl

chore: fix HandshakePlaceBuffers

chore: add clarifying comments

chore: format

chore: format

chore: add algorithm back + export-rtl stalls (to remove later)

chore(buffer-definition): bad approach, todo

feat(buffers): fully functioning counter buffer

chore: use most uptodate `build.sh`

chore: move to constraints to constraint db, use better datatypes for determinism, cleanup structure

fix: remove hacky `namespace boost`

fix: data/aig

chore: format + data/aig

chore: manually apply the formatting to satisfy clang-format

Delete include/dynamatic/Transforms/ResourceSharing/Crush.h

this doesn't exist on main

chore: rollback export-rtl + data/aig

chore: fix extra cycle bug in hdl
@ziadomalik ziadomalik force-pushed the exp/ziad/counter-buffer branch from ade4088 to 5678a70 Compare April 2, 2026 22:10
Comment thread include/dynamatic/Dialect/Handshake/HandshakeOps.td
Comment thread include/dynamatic/Transforms/BufferPlacement/BufferingSupport.h Outdated
Comment thread lib/Transforms/BufferPlacement/FPGA24Buffers.cpp
Comment thread include/dynamatic/Dialect/Handshake/HandshakeOps.td Outdated
@ziadomalik ziadomalik force-pushed the exp/ziad/counter-buffer branch from 7733fde to cf7f9c4 Compare April 21, 2026 14:08
Comment thread data/aig