
Commit 0e04d98

update links

1 parent d0bd118 · commit 0e04d98

File tree: 1 file changed (+7, −7 lines)

ch04/04_gqa/README.md

Lines changed: 7 additions & 7 deletions
```diff
@@ -12,21 +12,21 @@ Grouped-Query Attention (GQA) has become the new standard replacement for a more
 
 Here's a brief GQA summary. Unlike MHA, where each head has its own set of keys and values, GQA reduces memory usage by grouping multiple query heads to share the same key and value projections.
 
-For example, as illustrated in the figure below, if there are 3 key-value groups and 6 attention heads, heads 1 and 2 share one set of keys and values, heads 3 and 4 share a second, and heads 5 and 6 share a third.
+For example, as illustrated in the figure below, if there are 3 key-value groups and 6 attention heads, heads 1 and 2 share one set of keys and values, heads 3 and 4 share a second, and heads 5 and 6 share a third.
 
 ![GQA](https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gqa-memory/1.webp?1)
 
-This sharing of keys and values reduces the total number of key and value computations, which leads to lower memory usage and improved efficiency.
+This sharing of keys and values reduces the total number of key and value computations, which leads to lower memory usage and improved efficiency.
 
 So, to summarize, the core idea behind GQA is to reduce the number of key and value heads by sharing them across multiple query heads. This (1) lowers the model's parameter count and (2) reduces the memory bandwidth usage for key and value tensors during inference since fewer keys and values need to be stored and retrieved from the KV cache.
 
-While GQA is mainly a computational-efficiency workaround for MHA, ablation studies (such as those in the [original GQA paper](https://arxiv.org/abs/2305.13245) and the [Llama 2 paper](https://arxiv.org/abs/2307.09288)) show it performs comparably to standard MHA in terms of LLM modeling performance.
+While GQA is mainly a computational-efficiency workaround for MHA, ablation studies (such as those in the [original GQA paper](https://arxiv.org/abs/2305.13245) and the [Llama 2 paper](https://arxiv.org/abs/2307.09288)) show it performs comparably to standard MHA in terms of LLM modeling performance.
 
-However, this assumes that the number of key-value groups is chosen carefully. If we instead use a single key-value group shared by all query heads (this special case is known as multi-query attention), it can negatively affect the modeling performance.
+However, this assumes that the number of key-value groups is chosen carefully. If we instead use a single key-value group shared by all query heads (this special case is known as multi-query attention), it can negatively affect the modeling performance.
```
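The grouping described in the diff above (6 query heads, 3 shared key-value groups) can be sketched in a few lines of PyTorch. This is a minimal illustration with made-up dimensions, not the repository's implementation: each K/V group is expanded with `repeat_interleave` so that consecutive pairs of query heads attend to the same keys and values.

```python
import torch

torch.manual_seed(0)

batch, seq_len, d_model = 2, 8, 48
num_heads, num_kv_groups = 6, 3          # 6 query heads share 3 KV groups
head_dim = d_model // num_heads          # 8
group_size = num_heads // num_kv_groups  # 2 query heads per KV group

x = torch.randn(batch, seq_len, d_model)

# The query projection is full-sized as in MHA, but the key and value
# projections are smaller: num_kv_groups * head_dim instead of d_model
W_q = torch.nn.Linear(d_model, num_heads * head_dim, bias=False)
W_k = torch.nn.Linear(d_model, num_kv_groups * head_dim, bias=False)
W_v = torch.nn.Linear(d_model, num_kv_groups * head_dim, bias=False)

q = W_q(x).view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
k = W_k(x).view(batch, seq_len, num_kv_groups, head_dim).transpose(1, 2)
v = W_v(x).view(batch, seq_len, num_kv_groups, head_dim).transpose(1, 2)

# Expand each KV group so consecutive query heads share the same K/V:
# group 0 serves heads 1-2, group 1 serves heads 3-4, group 2 serves heads 5-6
k = k.repeat_interleave(group_size, dim=1)  # (batch, num_heads, seq, head_dim)
v = v.repeat_interleave(group_size, dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
out = (attn @ v).transpose(1, 2).reshape(batch, seq_len, d_model)
print(out.shape)  # torch.Size([2, 8, 48])
```

Note that the expansion is only for clarity; an optimized implementation (or a KV cache) would store just the `num_kv_groups` keys and values per layer.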

```diff
@@ -70,7 +70,7 @@ The savings when using GQA over MHA are further shown in the plot below for diff
 
-And the following plot shows how the KV cache size grows with an increasing context length:
+And the following plot shows how the KV cache size grows with an increasing context length:
 
```
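The linear growth of the KV cache with context length follows from a simple back-of-envelope estimate: one key and one value vector per layer, per KV head, per cached token. The sketch below uses a hypothetical 12-layer configuration for illustration; it is not the repository's `plot_memory_estimates.py`.

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Factor 2 accounts for storing both keys and values;
    # bytes_per_elem=2 assumes an fp16/bf16 cache
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical model: 12 layers, head_dim 64, 32768-token context
mha = kv_cache_bytes(32_768, n_layers=12, n_kv_heads=24, head_dim=64)  # MHA: 24 KV heads
gqa = kv_cache_bytes(32_768, n_layers=12, n_kv_heads=8, head_dim=64)   # GQA: 8 KV groups
print(f"MHA: {mha / 2**20:.0f} MiB, GQA: {gqa / 2**20:.0f} MiB")
```

Because the cache size is linear in both `context_len` and `n_kv_heads`, reducing 24 KV heads to 8 groups shrinks the cache by exactly 3x at every context length.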

```diff
@@ -87,7 +87,7 @@ You can reproduce these plots via `uv run plot_memory_estimates.py`.
 
 The [gpt_with_kv_mha.py](gpt_with_kv_mha.py) and [gpt_with_kv_gqa.py](gpt_with_kv_gqa.py) scripts in this folder provide hands-on examples for comparing the MHA and GQA memory usage in the context of a GPT model implementation.
 
-Note that GQA is also used in the [Llama 3](../ch05/07_gpt_to_llama), [Gemma 3](../ch05/12_gemma3), and [Qwen3](../ch05/11_qwen3) bonus materials. However, for simplicity, the code scripts in this folder modify the GPT architecture, which traditionally didn't use GQA.
+Note that GQA is also used in the [Llama 3](../../ch05/07_gpt_to_llama), [Gemma 3](../../ch05/12_gemma3), and [Qwen3](../../ch05/11_qwen3) bonus materials. However, for simplicity, the code scripts in this folder modify the GPT architecture, which traditionally didn't use GQA.
 
 Note that the model is not trained and thus generates nonsensical text. However, you can use it as a drop-in replacement for the standard GPT model in chapters 5-7 and train it.
```

```diff
@@ -97,7 +97,7 @@ Also, this implementation uses the KV cache explained in [another bonus section]
 uv run gpt_with_kv_mha.py \
 --max_new_tokens 32768 \
 --n_heads 24 \
---n_layers 12
+--n_layers 12
 
 ...
```
