
03. Reducing Loss




Topic: Reducing Loss

Course: GMLC

Date: 11 February 2019

Professor: Not specified


Resources


Key Points


  • Iterative learning

    • Hot and Cold

    • Finding the best possible model as efficiently as possible (in the fewest number of steps)

    • A trial-and-error process that changes the parameters to get better results on each iteration

    • b = 0, w1 = 0 - initial values are chosen arbitrarily (zero or at random) and corrected through iterations: compute the loss, compute parameter updates (e.g. b = 1, w1 = 1), compute the loss again, and so on (a runnable sketch of this loop follows the Key Points list)

    • Convergence - the point where overall loss stops changing, or changes extremely slowly

    • Iteration continues until convergence happens

  • Gradient descent

    • w1 -  often set to 0 or chosen at random

    • Gradient has a direction & magnitude

    • Always points in the direction of the steepest increase in the loss function

    • Takes steps in the direction of the negative gradient in order to reduce loss as quickly as possible

    • To determine the next point, the algorithm adds some fraction of the gradient’s magnitude (the gradient scaled by the learning rate, in the negative direction) to the starting point

  • Learning rate

    • Scalar used to train a model via gradient descent

    • Gradient descent algorithms multiply the gradient by the learning rate (also called the step size) to determine the next point

    • Hyperparameters - the knobs tweaked between runs of training a model (the learning rate, for example)

    • Too small a learning rate - training takes too long

    • Too large a learning rate - overshoots the minimum, producing haphazard, unpredictable movement across the loss curve

    • Goldilocks

      • Ideal learning rate for every regression problem

      • If the gradient of the loss function is small, a larger learning rate can safely be tried - it compensates for the small gradient and results in a larger step size

      • 1 / f''(x), the inverse of the second derivative of f(x) at x, is the formula for the ideal learning rate in one dimension

      • The inverse of the Hessian (the matrix of second partial derivatives) is used for calculating the ideal learning rate for two or more dimensions

  • Batch - the total number of examples used to calculate the gradient in a single iteration (one gradient update)

  • Stochastic Gradient Descent (SGD)

    • Uses a single example (a batch size of 1) per iteration

    • Very noisy

  • Mini-batch SGD

    • 10 - 1000 examples per iteration

    • Reduces the amount of noise in SGD, but is still more efficient than full-batch
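
The iterative loop and the batching strategies above can be made concrete with a short sketch. The following Python/NumPy snippet is illustrative, not from the course (the function `train`, its parameters, and the toy data are all assumptions): it fits a one-feature linear model y = w1 · x + b with squared-error loss by repeatedly computing the loss, computing the gradient, and stepping in the negative gradient direction scaled by the learning rate, until the loss stops changing.

```python
import numpy as np

def train(x, y, learning_rate=0.05, batch_size=32, max_iters=10_000, tol=1e-9):
    rng = np.random.default_rng(0)
    w1, b = 0.0, 0.0                      # initial values: zero here (could also be random)
    prev_loss = float("inf")
    for _ in range(max_iters):
        # Sample one batch of examples for this iteration (one gradient update).
        idx = rng.choice(len(x), size=min(batch_size, len(x)), replace=False)
        xb, yb = x[idx], y[idx]

        # Compute predictions and the squared-error loss on the batch.
        err = w1 * xb + b - yb
        loss = np.mean(err ** 2)

        # Gradient of the loss with respect to each parameter.
        grad_w1 = 2.0 * np.mean(err * xb)
        grad_b = 2.0 * np.mean(err)

        # Step in the direction of the negative gradient, scaled by the learning rate.
        w1 -= learning_rate * grad_w1
        b -= learning_rate * grad_b

        # Convergence: stop once the overall loss (nearly) stops changing.
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return w1, b

# Toy data: y ≈ 3x + 2 plus a little noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 200)
y = 3.0 * x + 2.0 + rng.normal(0.0, 0.1, 200)
print(train(x, y))                        # roughly (3.0, 2.0)
```

Setting `batch_size=len(x)` gives full-batch gradient descent, `batch_size=1` gives SGD (each noisy update is based on a single example), and values in the 10 - 1,000 range give mini-batch SGD.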

Check your understanding


  • Know which batching method (full-batch, SGD, mini-batch SGD) is the most efficient

  • Understand how the learning rate affects training (see the toy sketch after these questions)

  • Understand how gradient descent works

  • Explain how convergence happens
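
To make the learning-rate question concrete, here is a toy sketch (an assumption for illustration, not from the course) that runs plain gradient descent on f(x) = x², whose gradient is 2x and whose minimum is at x = 0:

```python
def descend(learning_rate, steps=10, x=5.0):
    """Take a few gradient descent steps on f(x) = x**2 (gradient: 2x)."""
    for _ in range(steps):
        x -= learning_rate * 2 * x   # step along the negative gradient
    return x

print(descend(0.01))   # too small: ≈ 4.09, still far from the minimum after 10 steps
print(descend(0.5))    # Goldilocks for this curve, 1 / f''(x) = 0.5: reaches 0 in one step
print(descend(1.1))    # too large: overshoots the minimum each step and diverges (≈ 31 after 10 steps)
```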

Summary of Notes


  • In iterative learning, the model computes the loss for its initial parameters, then updates the parameters repeatedly to obtain a better (lower) loss

  • Gradient descent works by stepping along the negative gradient, adding a fraction of the gradient’s magnitude to the current point to find the next point on the loss curve

  • Hyperparameters are the settings tweaked across training runs to reach minimal loss in as few steps as possible (the learning rate, for example; its ideal one-dimensional value is derived in the sketch after this list)

  • Full batch is inefficient with large data sets; SGD and mini-batch SGD lead to good results with far less computation per update
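
As a companion to the Goldilocks bullet in the Key Points, here is a short derivation (a sketch that assumes the loss is locally well approximated by its second-order Taylor expansion) of why the ideal one-dimensional learning rate is 1 / f''(x) and why the multi-dimensional analogue involves the Hessian:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Second-order Taylor expansion of the loss around the current point x,
% evaluated after a gradient step of size \alpha:
\begin{align}
  f\bigl(x - \alpha f'(x)\bigr)
    &\approx f(x) - \alpha f'(x)^{2} + \tfrac{1}{2}\,\alpha^{2} f''(x)\, f'(x)^{2}.
\end{align}
% Minimizing the right-hand side over \alpha (set its derivative in \alpha to zero):
\begin{align}
  -f'(x)^{2} + \alpha\, f''(x)\, f'(x)^{2} = 0
  \quad\Longrightarrow\quad
  \alpha^{\ast} = \frac{1}{f''(x)}.
\end{align}
% In two or more dimensions the second derivative becomes the Hessian H of the
% loss, and the corresponding ideal step uses its inverse, H^{-1}.
\end{document}
```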
