1900 | 1900 | "metadata": {}, |
1901 | 1901 | "source": [ |
1902 | 1902 | "- Note that the above is essentially a rewritten version of `MultiHeadAttentionWrapper` that is more efficient\n", |
1903 | | - "- The resulting output looks a bit different since the random weight initializations differ, but both are fully functional implementations that can be used in the GPT class we will implement in the upcoming chapters\n", |
1904 | | - "- Note that in addition, we added a linear projection layer (`self.out_proj `) to the `MultiHeadAttention` class above. This is simply a linear transformation that doesn't change the dimensions. It's a standard convention to use such a projection layer in LLM implementation, but it's not strictly necessary (recent research has shown that it can be removed without affecting the modeling performance; see the further reading section at the end of this chapter)\n" |
| 1903 | + "- The resulting output looks a bit different since the random weight initializations differ, but both are fully functional implementations that can be used in the GPT class we will implement in the upcoming chapters" |
| 1904 | + ] |
| 1905 | + }, |
| 1906 | + { |
| 1907 | + "cell_type": "markdown", |
| 1908 | + "id": "c8bd41e1-32d4-4067-a6d0-fe756a6511a9", |
| 1909 | + "metadata": {}, |
| 1910 | + "source": [ |
| 1911 | + "---\n", |
| 1912 | + "\n", |
| 1913 | + "**A note about the output dimensions**\n", |
| 1914 | + "\n", |
| 1915 | + "- In the `MultiHeadAttention` above, I used `d_out=2` to match the setting used in the `MultiHeadAttentionWrapper` class earlier\n",
| 1916 | + "- The `MultiHeadAttentionWrapper`, due to the concatenation, returns the output head dimension `d_out * num_heads` (i.e., `2*2 = 4`)\n",
| 1917 | + "- However, the `MultiHeadAttention` class (to make it more user-friendly) allows us to control the output head dimension directly via `d_out`; this means, if we set `d_out = 2`, the output head dimension will be 2, regardless of the number of heads\n", |
| 1918 | + "- In hindsight, as readers [pointed out](https://github.com/rasbt/LLMs-from-scratch/pull/859), it may be more intuitive to use `MultiHeadAttention` with `d_out = 4` so that it produces the same output dimensions as `MultiHeadAttentionWrapper` with `d_out = 2`.\n", |
| 1919 | + "\n", |
| 1920 | + "---" |
| 1921 | + ] |
| 1922 | + }, |
| 1923 | + { |
| 1924 | + "cell_type": "markdown", |
| 1925 | + "id": "9310bfa5-9aa9-40b4-8081-a5d8db5faf74", |
| 1926 | + "metadata": {}, |
| 1927 | + "source": [ |
| 1928 | + "- Note that, in addition, we added a linear projection layer (`self.out_proj`) to the `MultiHeadAttention` class above. This is simply a linear transformation that doesn't change the dimensions. It's a standard convention to use such a projection layer in LLM implementations, but it's not strictly necessary (recent research has shown that it can be removed without affecting the modeling performance; see the further reading section at the end of this chapter)"
1905 | 1929 | ] |
1906 | 1930 | }, |
1907 | 1931 | { |
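
To make the output-dimension note in the diff above concrete, below is a minimal, shape-only sketch. It is not the book's actual `MultiHeadAttentionWrapper`/`MultiHeadAttention` code: the attention computation is omitted, and the tensor sizes and layer names (`heads`, `combined`, `out_proj`) are illustrative assumptions. It only shows why the wrapper's concatenation yields an embedding dimension of `d_out * num_heads`, while the combined class keeps the final dimension at `d_out` and the optional `out_proj` layer leaves the shape unchanged.

```python
import torch
import torch.nn as nn

# Hypothetical shape-only sketch (NOT the book's attention classes):
# it only illustrates the output-dimension difference described above.
torch.manual_seed(123)

batch_size, num_tokens, d_in = 2, 6, 3
x = torch.randn(batch_size, num_tokens, d_in)

d_out, num_heads = 2, 2

# Wrapper style: each head maps to d_out=2 and the per-head outputs are
# concatenated, so the final dimension is d_out * num_heads = 4.
heads = nn.ModuleList([nn.Linear(d_in, d_out, bias=False) for _ in range(num_heads)])
wrapper_out = torch.cat([head(x) for head in heads], dim=-1)
print(wrapper_out.shape)  # torch.Size([2, 6, 4])

# Combined style: d_out is the final embedding dimension (split internally
# across heads), and out_proj maps d_out -> d_out, leaving the shape unchanged.
combined = nn.Linear(d_in, d_out, bias=False)
out_proj = nn.Linear(d_out, d_out)
mha_out = out_proj(combined(x))
print(mha_out.shape)      # torch.Size([2, 6, 2])
```

Setting `d_out = 4` with `num_heads = 2` in the combined version would reproduce the wrapper's 4-dimensional output, which is the more intuitive pairing mentioned in the note above.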
|