update links

rasbt · rasbt · commit 0e04d98c6c6b · 2025-10-11T08:21:45.000-05:00
diff --git a/ch04/04_gqa/README.md b/ch04/04_gqa/README.md
@@ -12,21 +12,21 @@ Grouped-Query Attention (GQA) has become the new standard replacement for a more
 
 Here's a brief GQA summary. Unlike MHA, where each head also has its own set of keys and values, to reduce memory usage, GQA groups multiple heads to share the same key and value projections.
 
-For example, as further illustrated in the figure below, if there are 3 key-value groups and 6 attention heads, then heads 1 and 2 share one set of keys and values, while heads 3 and 4, as well as heads 5 and 6, share another, respectively. 
+For example, as further illustrated in the figure below, if there are 3 key-value groups and 6 attention heads, then heads 1 and 2 share one set of keys and values, while heads 3 and 4, as well as heads 5 and 6, share another, respectively.
 
 &nbsp;
 
 ![GQA](https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gqa-memory/1.webp?1)
 
 &nbsp;
 
-This sharing of keys and values reduces the total number of key and value computations, which leads to lower memory usage and improved efficiency. 
+This sharing of keys and values reduces the total number of key and value computations, which leads to lower memory usage and improved efficiency.
 
 So, to summarize, the core idea behind GQA is to reduce the number of key and value heads by sharing them across multiple query heads. This (1) lowers the model's parameter count and (2) reduces the memory bandwidth usage for key and value tensors during inference since fewer keys and values need to be stored and retrieved from the KV cache.
 
-While GQA is mainly a computational-efficiency workaround for MHA, ablation studies (such as those in the [original GQA paper](https://arxiv.org/abs/2305.13245) and the [Llama 2 paper](https://arxiv.org/abs/2307.09288)) show it performs comparably to standard MHA in terms of LLM modeling performance. 
+While GQA is mainly a computational-efficiency workaround for MHA, ablation studies (such as those in the [original GQA paper](https://arxiv.org/abs/2305.13245) and the [Llama 2 paper](https://arxiv.org/abs/2307.09288)) show it performs comparably to standard MHA in terms of LLM modeling performance.
 
-However, this assumes that the number of key-value groups is chosen carefully. If we reduce the number of key-value groups to just one, so that all query heads share a single set of keys and values (this special case is known as multi-query attention), it will negatively affect the modeling performance. 
+However, this assumes that the number of key-value groups is chosen carefully. If we reduce the number of key-value groups to just one, so that all query heads share a single set of keys and values (this special case is known as multi-query attention), it will negatively affect the modeling performance.
 
 
 
@@ -70,7 +70,7 @@ The savings when using GQA over MHA are further shown in the plot below for diff
 
 &nbsp;
 
-And the following plot shows how the KV cache size grows with an increasing context length: 
+And the following plot shows how the KV cache size grows with an increasing context length:
 
 &nbsp;
 
@@ -87,7 +87,7 @@ You can reproduce these plots via `uv run plot_memory_estimates.py`.
 
 The [gpt_with_kv_mha.py](gpt_with_kv_mha.py) and [gpt_with_kv_gqa.py](gpt_with_kv_gqa.py) scripts in this folder provide hands-on examples for comparing the MHA and GQA memory usage in the context of a GPT model implementation.
 
-Note that GQA is also used in the [Llama 3](../ch05/07_gpt_to_llama), [Gemma 3](../ch05/12_gemma3), and [Qwen3](../ch05/11_qwen3) bonus materials. However, for simplicity, the code scripts in this folder modify the GPT architecture, which traditionally didn't use GQA. 
+Note that GQA is also used in the [Llama 3](../../ch05/07_gpt_to_llama), [Gemma 3](../../ch05/12_gemma3), and [Qwen3](../../ch05/11_qwen3) bonus materials. However, for simplicity, the code scripts in this folder modify the GPT architecture, which traditionally didn't use GQA.
 
 Note that the model is not trained and thus generates nonsensical text. However, you can use it as a drop-in replacement for the standard GPT model in chapters 5-7 and train it.
 
@@ -97,7 +97,7 @@ Also, this implementation uses the KV cache explained in [another bonus section]
 uv run gpt_with_kv_mha.py \
 --max_new_tokens 32768 \
 --n_heads 24 \
---n_layers 12 
+--n_layers 12
 
 ...
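
The grouping and KV cache savings described in the README text above can be illustrated with a small back-of-the-envelope sketch. This is not the code from `gpt_with_kv_gqa.py` or `plot_memory_estimates.py`; the function names, the head dimension of 64, the 2-byte element size, and the choice of 4 key-value groups are assumptions made only for illustration. The 24 heads, 12 layers, and 32768 tokens mirror the command shown in the diff.

```python
def kv_group_for_head(head_idx, n_heads, n_kv_groups):
    """Return the 0-based key-value group used by a 0-based query head.

    Assumes consecutive heads share a group, as in the README example
    (heads 1 and 2, heads 3 and 4, heads 5 and 6).
    """
    if n_heads % n_kv_groups != 0:
        raise ValueError("n_heads must be divisible by n_kv_groups")
    group_size = n_heads // n_kv_groups
    return head_idx // group_size


def kv_cache_bytes(context_length, n_layers, n_kv_groups, head_dim,
                   batch_size=1, bytes_per_elem=2):
    """Rough KV cache size estimate in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes a 16-bit dtype (hypothetical default).
    """
    return (2 * batch_size * context_length * n_layers
            * n_kv_groups * head_dim * bytes_per_elem)


# README example: 6 attention heads, 3 key-value groups
mapping = [kv_group_for_head(h, n_heads=6, n_kv_groups=3) for h in range(6)]
print(mapping)  # [0, 0, 1, 1, 2, 2]

# MHA is the case where every head has its own keys and values
# (n_kv_groups == n_heads). head_dim=64 and n_kv_groups=4 are assumed values.
mha = kv_cache_bytes(context_length=32768, n_layers=12, n_kv_groups=24, head_dim=64)
gqa = kv_cache_bytes(context_length=32768, n_layers=12, n_kv_groups=4, head_dim=64)
print(f"MHA: {mha / 1024**3:.2f} GiB, GQA: {gqa / 1024**3:.2f} GiB, "
      f"ratio: {mha / gqa:.1f}x")
```

Under these assumptions, the cache size scales linearly with the number of key-value groups, so going from 24 groups (MHA) to 4 groups shrinks the estimated cache by a factor of 6. Actual memory usage in the scripts may differ depending on dtype and implementation details.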