-
03. Reducing Loss
Topic: Reducing Loss
Course: GMLC
Date: 11 February 2019
Professor: Not specified
-
https://developers.google.com/machine-learning/crash-course/reducing-loss/video-lecture
-
https://developers.google.com/machine-learning/crash-course/reducing-loss/an-iterative-approach
-
https://developers.google.com/machine-learning/crash-course/reducing-loss/gradient-descent
-
https://developers.google.com/machine-learning/crash-course/reducing-loss/learning-rate
-
https://developers.google.com/machine-learning/crash-course/fitter/graph
-
https://developers.google.com/machine-learning/crash-course/reducing-loss/playground-exercise
-
https://developers.google.com/machine-learning/crash-course/reducing-loss/check-your-understanding
-
Iterative learning
-
Hot and Cold
-
Finding the best possible model as efficiently as possible (in the fewest steps)
-
Trial-and-error process that adjusts the parameters to get better results on each iteration
-
b, w1 = 0 - initial values are set to 0 or chosen at random and corrected through iterations (compute the loss -> compute parameter updates, e.g. b = 1, w1 = 1 -> compute the loss again, and so on)
-
Convergence - the point where the overall loss stops changing or changes extremely slowly
-
Iteration continues until convergence happens
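-
A minimal sketch of this loop in Python for a one-feature linear model y' = b + w1 * x with squared loss (my own illustration; the function names and toy data are not from the course):
```python
def compute_loss(b, w1, xs, ys):
    """Mean squared error of the predictions b + w1*x against the labels ys."""
    return sum((b + w1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def compute_parameter_updates(b, w1, xs, ys, learning_rate=0.1):
    """One gradient step for b and w1 (derivatives of the mean squared error)."""
    n = len(xs)
    grad_b = sum(2 * (b + w1 * x - y) for x, y in zip(xs, ys)) / n
    grad_w1 = sum(2 * (b + w1 * x - y) * x for x, y in zip(xs, ys)) / n
    return b - learning_rate * grad_b, w1 - learning_rate * grad_w1

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # toy data where y = 2x
b, w1 = 0.0, 0.0                            # initial values (here 0, often random)
prev_loss = float("inf")
while abs(prev_loss - compute_loss(b, w1, xs, ys)) > 1e-9:  # loop until convergence
    prev_loss = compute_loss(b, w1, xs, ys)
    b, w1 = compute_parameter_updates(b, w1, xs, ys)
print(b, w1)   # should approach b = 0, w1 = 2
```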
-
-
Gradient descent
-
w1 - often set to 0 or chosen at random
-
Gradient has a direction & magnitude
-
Always points in the direction of the steepest increase in the loss function
-
Gradient descent takes steps in the direction of the negative gradient to reduce loss
-
To find the next point, the algorithm adds some fraction of the gradient’s magnitude (stepping in the negative gradient direction) to the starting point
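-
A minimal sketch of these steps on a toy one-dimensional loss (the function, learning rate, and step count are illustrative choices, not from the course):
```python
# Toy 1-D loss f(w) = (w - 3)**2 with minimum at w = 3.
def loss(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)          # derivative of the loss above

learning_rate = 0.2             # fraction of the gradient's magnitude to step by
w = 0.0                         # starting point
for step in range(5):
    w = w - learning_rate * gradient(w)    # move in the negative gradient direction
    print(f"step {step}: w = {w:.4f}, loss = {loss(w):.4f}")
```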
-
-
Learning rate
-
Scalar used to train a model via gradient descent
-
Gradient descent algorithms multiply the gradient by the learning rate (also called step size)
-
Hyperparameters - the knobs tweaked between training runs of a model (the learning rate, for example)
-
Too small a learning rate - training takes too long
-
Too large a learning rate - overshoots the minimum, and the loss bounces around haphazardly
-
Goldilocks
-
There is a Goldilocks ("just right") learning rate for every regression problem
-
If the gradient of the loss is small, you can safely try a larger learning rate, which results in a larger step size
-
1 / f''(x) (the inverse of the second derivative of f(x) at x) - formula for the ideal learning rate in one dimension
-
The inverse of the Hessian matrix (the matrix of second partial derivatives) gives the ideal learning rate for multiple dimensions
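-
A small worked example of the one-dimensional rule above, assuming a simple quadratic loss (the coefficients are made up for illustration):
```python
# For f(w) = a * (w - m)**2, the second derivative is f''(w) = 2a, so a learning
# rate of 1 / f''(w) reaches the minimum in a single gradient step.
a, m = 5.0, 3.0                      # toy coefficients, chosen for illustration
f = lambda w: a * (w - m) ** 2
grad = lambda w: 2 * a * (w - m)
second_derivative = 2 * a

w = 0.0
ideal_lr = 1.0 / second_derivative   # Goldilocks learning rate in one dimension
w = w - ideal_lr * grad(w)           # single step
print(w, f(w))                       # lands exactly on the minimum: w = 3.0, loss = 0.0
```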
-
-
-
Batch - the total number of examples used to compute the gradient in a single iteration (one gradient update)
-
Stochastic Gradient Descent (SGD)
-
A single example (a batch size of 1) per iteration
-
Very noisy
-
-
Mini-batch SGD
-
Typically 10 to 1,000 examples per iteration, chosen at random
-
Reduces the noise compared to SGD, while still being more efficient than full-batch
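-
A minimal sketch of mini-batch SGD for a one-weight model y' = w * x with squared loss (the dataset, batch size, and learning rate are illustrative assumptions, not from the course):
```python
import random

random.seed(0)
xs = [random.random() for _ in range(10_000)]
data = [(x, 2.0 * x + random.gauss(0, 0.1)) for x in xs]   # toy labels: y ≈ 2x

w, learning_rate, batch_size = 0.0, 0.5, 32
for iteration in range(300):
    batch = random.sample(data, batch_size)        # a mini-batch chosen at random
    # Gradient of the mean squared error computed over the mini-batch only.
    grad_w = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
    w -= learning_rate * grad_w                    # one gradient update per batch
print(w)   # noisy, but should end up close to 2.0
```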
-
-
Know which batch iteration method (full-batch, SGD, mini-batch SGD) is the most efficient
-
Understand how the learning rate affects training
-
Understand how gradient descent works
-
Explain how convergence happens
-
In iterative learning, the model computes the loss for the initial parameter values, then updates the parameters to get a better loss on each subsequent iteration
-
Gradient descent works by stepping in the direction of the negative gradient: it adds a fraction of the gradient’s magnitude to the current point to find the next point on the loss curve
-
Hyperparameters are the settings tweaked between training runs to reach minimal loss in as few steps as possible (such as the learning rate)
-
Full-batch gradient descent is inefficient on large data sets; SGD and mini-batch SGD reach good results with far less computation