
Commit 7084123

Note about output dimensions (#862)
1 parent 4d9f9dc commit 7084123

File tree

1 file changed: +26 -2 lines changed


ch03/01_main-chapter-code/ch03.ipynb

Lines changed: 26 additions & 2 deletions
@@ -1900,8 +1900,32 @@
    "metadata": {},
    "source": [
     "- Note that the above is essentially a rewritten version of `MultiHeadAttentionWrapper` that is more efficient\n",
-    "- The resulting output looks a bit different since the random weight initializations differ, but both are fully functional implementations that can be used in the GPT class we will implement in the upcoming chapters\n",
-    "- Note that in addition, we added a linear projection layer (`self.out_proj `) to the `MultiHeadAttention` class above. This is simply a linear transformation that doesn't change the dimensions. It's a standard convention to use such a projection layer in LLM implementation, but it's not strictly necessary (recent research has shown that it can be removed without affecting the modeling performance; see the further reading section at the end of this chapter)\n"
+    "- The resulting output looks a bit different since the random weight initializations differ, but both are fully functional implementations that can be used in the GPT class we will implement in the upcoming chapters"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c8bd41e1-32d4-4067-a6d0-fe756a6511a9",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "\n",
+    "**A note about the output dimensions**\n",
+    "\n",
+    "- In the `MultiHeadAttention` above, I used `d_out=2` to use the same setting as in the `MultiHeadAttentionWrapper` class earlier\n",
+    "- The `MultiHeadAttentionWrapper`, due to the concatenation, returns the output head dimension `d_out * num_heads` (i.e., `2*2 = 4`)\n",
+    "- However, the `MultiHeadAttention` class (to make it more user-friendly) allows us to control the output head dimension directly via `d_out`; this means, if we set `d_out = 2`, the output head dimension will be 2, regardless of the number of heads\n",
+    "- In hindsight, as readers [pointed out](https://github.com/rasbt/LLMs-from-scratch/pull/859), it may be more intuitive to use `MultiHeadAttention` with `d_out = 4` so that it produces the same output dimensions as `MultiHeadAttentionWrapper` with `d_out = 2`.\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9310bfa5-9aa9-40b4-8081-a5d8db5faf74",
+   "metadata": {},
+   "source": [
+    "- Note that in addition, we added a linear projection layer (`self.out_proj`) to the `MultiHeadAttention` class above. This is simply a linear transformation that doesn't change the dimensions. It's a standard convention to use such a projection layer in LLM implementations, but it's not strictly necessary (recent research has shown that it can be removed without affecting the modeling performance; see the further reading section at the end of this chapter)"
    ]
   },
   {
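
To make the dimension bookkeeping in the added note concrete, here is a minimal, shape-only sketch using plain tensors (not the book's attention classes; all variable names are illustrative): the wrapper-style approach concatenates `num_heads` outputs of width `d_out`, while the `MultiHeadAttention`-style approach splits a single `d_out` across the heads.

```python
import torch

batch_size, num_tokens, d_out, num_heads = 1, 6, 2, 2

# Wrapper-style: each head returns d_out columns and the heads are concatenated,
# so the final width is d_out * num_heads
head_outputs = [torch.randn(batch_size, num_tokens, d_out) for _ in range(num_heads)]
wrapper_out = torch.cat(head_outputs, dim=-1)
print(wrapper_out.shape)  # torch.Size([1, 6, 4])

# MultiHeadAttention-style: d_out is the final width and is split across heads
# internally, so the output width stays d_out regardless of num_heads
head_dim = d_out // num_heads
per_head = torch.randn(batch_size, num_tokens, num_heads, head_dim)
mha_out = per_head.reshape(batch_size, num_tokens, d_out)
print(mha_out.shape)  # torch.Size([1, 6, 2])
```

With `d_out=2` and `num_heads=2`, the first convention yields width 4 and the second yields width 2, which is exactly the discrepancy the note describes.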

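The follow-up cell's point that `self.out_proj` is a dimension-preserving linear transformation can likewise be illustrated with a small sketch (assumed shapes only, not the book's full class):

```python
import torch
import torch.nn as nn

d_out = 2
out_proj = nn.Linear(d_out, d_out)  # maps d_out -> d_out, so the shape is unchanged

context_vecs = torch.randn(1, 6, d_out)     # (batch, num_tokens, d_out)
projected = out_proj(context_vecs)
print(context_vecs.shape, projected.shape)  # both torch.Size([1, 6, 2])
```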