AI Log #3: What is Gradient Descent?

Victoria Johnston - Nov 29 '23 - - Dev Community

I am an experienced software engineer diving into AI and machine learning. Are you also learning/interested in learning?
Learn with me! I’m sharing my learning logs along the way.

Disclosure: I have already learnt basic machine learning, but I am starting this learning log from the beginning because I need a refresher 😅.

Cost Function Recap

In my previous learning log, I covered model parameters and cost functions. Here is a quick summary:

  • Parameters of a Model: Variables that can be altered during training to enhance the model. For example, in the model y = wx + b, w and b are parameters, with one input feature, x.
  • Cost Function, J: Indicates the accuracy of a model’s predictions against example data. A smaller J implies closer predicted values (ŷ) to actual values (y), signifying better parameter choices and an improved model.
  • J Calculation: J is computed across the example dataset and varies with the choice of parameter values. Thus, in a model with parameters w and b, the cost function is represented as J(w, b).

Gradient Descent

When training a model, our goal is to discover a function that best fits our example data. We achieve this by adjusting parameter values to obtain the lowest possible cost, J.

Let’s first consider the most obvious way to find optimal parameter values.

We could test every possible parameter value. At first glance, this solution may seem to work; however, there are two key pitfalls. Firstly, it is impossible to test all possible parameter values thoroughly as parameters are real numbers; they have infinite range and decimal depth! Secondly, it would be incredibly inefficient and time-consuming.

We need a more systematic and innovative solution.

Enter gradient descent. This algorithm involves iteratively testing different parameters and calculating the cost at each step. A ‘step’ is synonymous with iteration — each represents a move closer to the optimal parameter values. Gradient descent is similar to our naive solution in that it requires trying out different parameter values and many steps. The critical difference lies in choosing the next set of parameter values. In gradient descent, we choose the next values based on gradient calculations, leading to a much more directed and efficient process.

Here is how it works on a high level:

  • Starting Point: We begin with specific parameter values, usually chosen randomly or through an informed guess.
  • Direction of Descent: We determine the direction to move by calculating the gradient. Updating Parameters: We adjust the parameter values to reduce J using the gradient.
  • Repetition: This process continues until we meet certain end conditions. Each repetition is a step.

Here is what gradient descent may look like on a cost graph for a model with one parameter. In this diagram, we use gradient descent to take steps that get us closer to the minimum.

Image description

Gradient descent can handle any number of parameters (it just becomes increasingly more complex to visualise!). In general, as long as we have a cost function, we can use gradient descent to find values for parameters that minimise the cost, J.

Mechanics of Gradient Descent

To truly grasp gradient descent, let’s delve into the details.

Remember back to high school maths… What does the gradient help us calculate? The slope! The gradient in this context helps us determine the slope of the cost function. Knowing this slope enables us to identify the direction of the steepest descent.

This concept made more sense when I tried to visualise it. Say I plot a graph with the cost J at different parameter values. Here is an example of how a cost graph with two parameters might look like:

Image description

Imagine standing on this slope and looking for the quickest way down. Which direction should I move? Down the slope! Ideally, I would move in the steepest direction from my current position. This concept is the essence of gradient descent!

Image description

A critical aspect of this process is the size of the step we take down the slope. This is influenced by a crucial factor known as the learning rate, α. This rate is a positive number dictating the magnitude of each parameter change.

Gradient Descent Math

Now it’s time for some math.

Let’s introduce the mathematical formulae central to gradient descent. For each parameter, we calculate a new value using its gradient. The learning rate, α, plays a vital role here, determining the size of the step we take toward the steepest descent.

The formula for updating each parameter is unique, ensuring specific and effective adjustments. For instance, the updates for parameters w1, w2, etc., in a multi-parameter model are as follows:

Image description

Image description

Why do we use the partial derivative?

In gradient descent, using partial derivatives instead of full derivatives is a deliberate choice. This is because, in multi-parameter models, each parameter uniquely influences the model’s output. The partial derivative hones in on the impact of a single parameter change, keeping all others constant. For example, adjusting w1 affects the cost J, independent of changes in w2 or b.

Consequently, the gradient descent update for each parameter is tailored with its own partial derivative, allowing for precise, individualised adjustments:

Image description

Learning Rate

As mentioned, the learning rate, α, is a pivotal element that dictates the step size in our journey towards the cost function’s minimum. Thus, picking a reasonable learning rate is critical. Here is how choosing a bad learning rate can negatively affect gradient descent and, in some cases, stop it from working at all:

  • Overshooting with Large Steps: A high α makes our steps too large. We risk stepping past the minimum and onto another slope. We then may continue to bounce from slope to slope without settling down on the same slope. Here is what overshooting may look like:

Image description

Usually, we can tell if this is happening if the gradient fluctuates back and forth (e.g. between positive and negative) in each iteration. Sometimes, we can get lucky and move closer to the optimal minimum.

  • Divergence: In extreme cases, these oversized steps don’t just result in bouncing but escalate, moving us further from the minimum and causing the algorithm to diverge entirely:

Image description

  • Inefficient with Tiny Steps: A tiny learning rate can significantly slow the journey, resulting in minuscule progress towards the minimum. Gradient descent will still work, but it will be highly inefficient.

Image description

Tips for Fine-Tuning the Learning Rate

To optimise the learning rate:

  • Experiment with increments (e.g., 0.001, 0.01, 0.1), seeking a rate that consistently reduces the cost.
  • Start with a very small alpha to ensure consistent cost reduction, then gradually increase.
  • Aim for the largest learning rate that still guarantees consistent and rapid cost reduction.

When to Stop

Determining the right moment to stop gradient descent is a strategic decision. We can choose different end conditions; here are a few examples:

  • Convergence: We stop when parameter changes are negligible, suggesting proximity to the minimum.
  • Fixed Number of Steps: We set a predetermining number of iterations for practical reasons, especially when computational resources are a constraint.
  • Threshold-Based: We halt when the decrease in cost J falls below a set threshold, indicating diminishing returns on further iterations.

Summary

  • Gradient descent optimises model parameters by iteratively adjusting them based on gradient calculations, aiming to minimise the cost function, J.
  • The algorithm involves determining the steepest descent direction using partial derivatives and adjusting each parameter with a calculated step size influenced by the learning rate (α). The process repeats until it meets specific end conditions, like convergence or a fixed number of iterations.
  • The choice of learning rate is crucial as it influences the efficiency of reaching the minimum cost.

Gradient descent is a highly versatile algorithm applied in a wide array of machine learning methods, including complex models like deep learning.

Disclosure

I am taking Andrew Ng’s Machine Learning Specialization, and these learning logs contain some of what I learned from it. It’s a great course. I highly recommend it!

. . . . . .