
03. Reducing Loss




Topic: Reducing Loss

Course: GMLC

Date: 11 February 2019

Professor: Not specified


Resources


Key Points


  • Iterative learning

    • Hot and Cold

    • Finding the best possible model as efficiently as possible (in the fewest number of steps)

    • A trial-and-error process that changes the parameters to get better results on each iteration

    • b = 0, w1 = 0 - initial values are chosen arbitrarily (zero or at random) and corrected through iterations: compute the loss, compute parameter updates (e.g. b = 1, w1 = 1), compute the loss again, and so on (a runnable sketch of this loop follows the Key Points list)

    • Convergence - the point where overall loss stops changing, or changes extremely slowly

    • Iteration continues until convergence happens

  • Gradient descent

    • w1 -  often set to 0 or chosen at random

    • Gradient has a direction & magnitude

    • Always points in the direction of the steepest increase in the loss function

    • Takes steps in the direction of the negative gradient in order to reduce loss as quickly as possible

    • To determine the next point, the algorithm adds some fraction of the gradient’s magnitude (the gradient scaled by the learning rate, in the negative direction) to the starting point

  • Learning rate

    • Scalar used to train a model via gradient descent

    • Gradient descent algorithms multiply the gradient by the learning rate (also called the step size) to determine the next point

    • Hyperparameters - the knobs tweaked between runs of training a model (the learning rate, for example)

    • Too small a learning rate - training takes too long

    • Too large a learning rate - overshoots the minimum, producing haphazard, unpredictable movement across the loss curve

    • Goldilocks

      • Ideal learning rate for every regression problem

      • If the gradient of the loss function is small, a larger learning rate can safely be tried - it compensates for the small gradient and results in a larger step size

      • 1 / f''(x), the inverse of the second derivative of f(x) at x, is the formula for the ideal learning rate in one dimension

      • The inverse of the Hessian (the matrix of second partial derivatives) is used for calculating the ideal learning rate for two or more dimensions

  • Batch - the total number of examples used to calculate the gradient in a single iteration (one gradient update)

  • Stochastic Gradient Descent (SGD)

    • Uses a single example (a batch size of 1) per iteration

    • Very noisy

  • Mini-batch SGD

    • 10 - 1000 examples per iteration

    • Reduces the amount of noise in SGD, but is still more efficient than full-batch
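
The iterative loop and the batching strategies above can be made concrete with a short sketch. The following Python/NumPy snippet is illustrative, not from the course (the function `train`, its parameters, and the toy data are all assumptions): it fits a one-feature linear model y = w1 · x + b with squared-error loss by repeatedly computing the loss, computing the gradient, and stepping in the negative gradient direction scaled by the learning rate, until the loss stops changing.

```python
import numpy as np

def train(x, y, learning_rate=0.05, batch_size=32, max_iters=10_000, tol=1e-9):
    rng = np.random.default_rng(0)
    w1, b = 0.0, 0.0                      # initial values: zero here (could also be random)
    prev_loss = float("inf")
    for _ in range(max_iters):
        # Sample one batch of examples for this iteration (one gradient update).
        idx = rng.choice(len(x), size=min(batch_size, len(x)), replace=False)
        xb, yb = x[idx], y[idx]

        # Compute predictions and the squared-error loss on the batch.
        err = w1 * xb + b - yb
        loss = np.mean(err ** 2)

        # Gradient of the loss with respect to each parameter.
        grad_w1 = 2.0 * np.mean(err * xb)
        grad_b = 2.0 * np.mean(err)

        # Step in the direction of the negative gradient, scaled by the learning rate.
        w1 -= learning_rate * grad_w1
        b -= learning_rate * grad_b

        # Convergence: stop once the overall loss (nearly) stops changing.
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return w1, b

# Toy data: y ≈ 3x + 2 plus a little noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 200)
y = 3.0 * x + 2.0 + rng.normal(0.0, 0.1, 200)
print(train(x, y))                        # roughly (3.0, 2.0)
```

Setting `batch_size=len(x)` gives full-batch gradient descent, `batch_size=1` gives SGD (each noisy update is based on a single example), and values in the 10 - 1,000 range give mini-batch SGD.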

Check your understanding


  • Know which batching method (full-batch, SGD, mini-batch SGD) is the most efficient

  • Understand how the learning rate affects training (see the toy sketch after these questions)

  • Understand how gradient descent works

  • Explain how convergence happens
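
To make the learning-rate question concrete, here is a toy sketch (an assumption for illustration, not from the course) that runs plain gradient descent on f(x) = x², whose gradient is 2x and whose minimum is at x = 0:

```python
def descend(learning_rate, steps=10, x=5.0):
    """Take a few gradient descent steps on f(x) = x**2 (gradient: 2x)."""
    for _ in range(steps):
        x -= learning_rate * 2 * x   # step along the negative gradient
    return x

print(descend(0.01))   # too small: ≈ 4.09, still far from the minimum after 10 steps
print(descend(0.5))    # Goldilocks for this curve, 1 / f''(x) = 0.5: reaches 0 in one step
print(descend(1.1))    # too large: overshoots the minimum each step and diverges (≈ 31 after 10 steps)
```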

Summary of Notes


  • In iterative learning, the model computes the loss for its initial parameters, then updates the parameters repeatedly to obtain a better (lower) loss

  • Gradient descent works by stepping along the negative gradient, adding a fraction of the gradient’s magnitude to the current point to find the next point on the loss curve

  • Hyperparameters are the settings tweaked across training runs to reach minimal loss in as few steps as possible (the learning rate, for example; its ideal one-dimensional value is derived in the sketch after this list)

  • Full batch is inefficient with large data sets; SGD and mini-batch SGD lead to good results with far less computation per update
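
As a companion to the Goldilocks bullet in the Key Points, here is a short derivation (a sketch that assumes the loss is locally well approximated by its second-order Taylor expansion) of why the ideal one-dimensional learning rate is 1 / f''(x) and why the multi-dimensional analogue involves the Hessian:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Second-order Taylor expansion of the loss around the current point x,
% evaluated after a gradient step of size \alpha:
\begin{align}
  f\bigl(x - \alpha f'(x)\bigr)
    &\approx f(x) - \alpha f'(x)^{2} + \tfrac{1}{2}\,\alpha^{2} f''(x)\, f'(x)^{2}.
\end{align}
% Minimizing the right-hand side over \alpha (set its derivative in \alpha to zero):
\begin{align}
  -f'(x)^{2} + \alpha\, f''(x)\, f'(x)^{2} = 0
  \quad\Longrightarrow\quad
  \alpha^{\ast} = \frac{1}{f''(x)}.
\end{align}
% In two or more dimensions the second derivative becomes the Hessian H of the
% loss, and the corresponding ideal step uses its inverse, H^{-1}.
\end{document}
```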
