Description
For the Attention module we can concatenate the weights and do one GEMM instead of three for the input projections to gain a speedup, since all three GEMMs are applied to the same input.
ao/torchao/_models/llama/model.py, lines 220 to 225 in 22d6f97:

```python
def load_hook(self, state_dict, prefix, *args):
    if prefix + "wq.weight" in state_dict:
        wq = state_dict.pop(prefix + "wq.weight")
        wk = state_dict.pop(prefix + "wk.weight")
        wv = state_dict.pop(prefix + "wv.weight")
        state_dict[prefix + "wqkv.weight"] = torch.cat([wq, wk, wv])
```
and
ao/torchao/_models/llama/model.py, lines 230 to 231 in 22d6f97:

```python
kv_size = self.n_local_heads * self.head_dim
q, k, v = self.wqkv(x).split([self.dim, kv_size, kv_size], dim=-1)
```
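For intuition, here is a quick standalone check (not repo code; the shapes are made up for illustration) that one GEMM over the concatenated weight matches the three separate projections:

```python
import torch

dim, kv_size = 64, 32
x = torch.randn(8, dim)

wq = torch.randn(dim, dim)      # query projection weight (out_features, in_features)
wk = torch.randn(kv_size, dim)  # key projection weight
wv = torch.randn(kv_size, dim)  # value projection weight

# Three separate GEMMs over the same input ...
q_ref, k_ref, v_ref = x @ wq.T, x @ wk.T, x @ wv.T

# ... equal one GEMM over the concatenated weight, followed by a split.
wqkv = torch.cat([wq, wk, wv])
q, k, v = (x @ wqkv.T).split([dim, kv_size, kv_size], dim=-1)

assert torch.allclose(q, q_ref) and torch.allclose(k, k_ref) and torch.allclose(v, v_ref)
```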
I suspect we can do the exact same thing for FeedForward:
ao/torchao/_models/llama/model.py, lines 262 to 263 in 22d6f97:

```python
def forward(self, x: Tensor) -> Tensor:
    return self.w2(F.silu(self.w1(x)) * self.w3(x))
```
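Here w1 and w3 both consume x, so they can be fused the same way; w2 cannot, since it consumes the elementwise product. A minimal sketch of what that could look like, mirroring the load_hook pattern above (the w13 name and this hook are my own invention, not existing repo code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor

class FeedForward(nn.Module):
    def __init__(self, dim: int, intermediate_size: int) -> None:
        super().__init__()
        # w1 and w3 fused into a single projection.
        self.w13 = nn.Linear(dim, 2 * intermediate_size, bias=False)
        self.w2 = nn.Linear(intermediate_size, dim, bias=False)
        self.intermediate_size = intermediate_size
        self._register_load_state_dict_pre_hook(self.load_hook)

    def load_hook(self, state_dict, prefix, *args):
        # Fuse checkpoints that still store w1/w3 separately.
        if prefix + "w1.weight" in state_dict:
            w1 = state_dict.pop(prefix + "w1.weight")
            w3 = state_dict.pop(prefix + "w3.weight")
            state_dict[prefix + "w13.weight"] = torch.cat([w1, w3])

    def forward(self, x: Tensor) -> Tensor:
        # One GEMM, then split the result back into the w1 and w3 halves.
        x1, x3 = self.w13(x).split(self.intermediate_size, dim=-1)
        return self.w2(F.silu(x1) * x3)
```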
Task:
Implement the above trick and rerun the benchmarks to show gains. If you don't have access to an A100, another (ideally similar) GPU is fine as a proxy. Also, if you can, confirm via a trace that the two GEMMs have indeed been turned into one.
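One way to do that trace check (a sketch, assuming a CUDA machine and the hypothetical fused FeedForward above; exact kernel names vary by GPU and dtype) is a short torch.profiler run, counting the linear/GEMM entries per forward pass:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = FeedForward(dim=4096, intermediate_size=11008).half().cuda()
x = torch.randn(1, 128, 4096, dtype=torch.half, device="cuda")

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(x)

# Before the fusion there should be three GEMM launches per forward
# (w1, w3, w2); after it, two (w13, w2).
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```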