4 changes: 2 additions & 2 deletions ch07/04_preference-tuning-with-dpo/dpo-from-scratch.ipynb
@@ -156,8 +156,8 @@
"- In the equation above,\n",
" - \"expected value\" $\\mathbb{E}$ is statistics jargon and stands for the average or mean value of the random variable (the expression inside the brackets); optimizing $-\\mathbb{E}$ aligns the model better with user preferences\n",
" - The $\\pi_{\\theta}$ variable is the so-called policy (a term borrowed from reinforcement learning) and represents the LLM we want to optimize; $\\pi_{ref}$ is a reference LLM, which is typically the original LLM before optimization (at the beginning of the training, $\\pi_{\\theta}$ and $\\pi_{ref}$ are typically the same)\n",
" - $\\beta$ is a hyperparameter to control the divergence between the $\\pi_{\\theta}$ and the reference model; increasing $\\beta$ increases the impact of the difference between\n",
"$\\pi_{\\theta}$ and $\\pi_{ref}$ in terms of their log probabilities on the overall loss function, thereby increasing the divergence between the two models\n",
" - $\\beta$ is a hyperparameter to control the divergence between the $\\pi_{\\theta}$ and the reference model; increasing $\\beta$ reduces the impact of the difference between\n",
"$\\pi_{\\theta}$ and $\\pi_{ref}$ in terms of their log probabilities on the overall loss function, thereby decreasing the divergence between the two models\n",
" - the logistic sigmoid function, $\\sigma(\\centerdot)$ transforms the log-odds of the preferred and rejected responses (the terms inside the logistic sigmoid function) into a probability score \n",
"- To avoid bloating the code notebook with a more detailed discussion, I may write a separate standalone article with more details on these concepts in the future\n",
"- In the meantime, if you are interested in comparing RLHF and DPO, please see the section [2.2. RLHF vs Direct Preference Optimization (DPO)](https://magazine.sebastianraschka.com/i/142924793/rlhf-vs-direct-preference-optimization-dpo) in my article [Tips for LLM Pretraining and Evaluating Reward Models](https://magazine.sebastianraschka.com/p/tips-for-llm-pretraining-and-evaluating-rms)"
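For context, the cell edited above describes the DPO loss term by term ($\pi_{\theta}$, $\pi_{ref}$, $\beta$, and the logistic sigmoid). Below is a minimal, illustrative sketch of how that loss could be computed in PyTorch. It is not the notebook's actual implementation; the function name `compute_dpo_loss_sketch`, its argument names, and the default `beta=0.1` are assumptions made for this example.

```python
import torch
import torch.nn.functional as F


def compute_dpo_loss_sketch(
    policy_chosen_logprobs,       # log pi_theta(y_w | x), shape: (batch_size,)
    policy_rejected_logprobs,     # log pi_theta(y_l | x), shape: (batch_size,)
    reference_chosen_logprobs,    # log pi_ref(y_w | x),   shape: (batch_size,)
    reference_rejected_logprobs,  # log pi_ref(y_l | x),   shape: (batch_size,)
    beta=0.1,                     # hypothetical default; scales the log-ratio difference
):
    # Log-ratios of the policy vs. the reference model for the
    # preferred (chosen) and dispreferred (rejected) responses
    chosen_logratios = policy_chosen_logprobs - reference_chosen_logprobs
    rejected_logratios = policy_rejected_logprobs - reference_rejected_logprobs

    # DPO loss: -E[ log sigmoid( beta * (chosen_logratio - rejected_logratio) ) ]
    losses = -F.logsigmoid(beta * (chosen_logratios - rejected_logratios))
    return losses.mean()


# Usage sketch with dummy per-sequence log-probabilities
if __name__ == "__main__":
    torch.manual_seed(123)
    loss = compute_dpo_loss_sketch(
        policy_chosen_logprobs=torch.randn(4),
        policy_rejected_logprobs=torch.randn(4),
        reference_chosen_logprobs=torch.randn(4),
        reference_rejected_logprobs=torch.randn(4),
    )
    print(loss)
```

In this sketch, `beta` multiplies the log-ratio difference inside the sigmoid, which is how it controls how strongly the optimized policy is kept close to (or allowed to drift from) the reference model, as discussed in the edited cell.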