In the *gradient descent method* of optimization, a *hypothesis* function, $h_\boldsymbol{\theta}(x)$, is fitted to a data set, $(x^{(i)}, y^{(i)})$ ($i=1,2,\cdots,m$) by minimizing an associated *cost function*, $J(\boldsymbol{\theta})$ in terms of the parameters $\boldsymbol\theta = \theta_0, \theta_1, \cdots$. The cost function describes how closely the hypothesis fits the data for a given choice of $\boldsymbol \theta$.

For example, one might wish to fit a given data set to a straight line, $$ h_\boldsymbol{\theta}(x) = \theta_0 + \theta_1 x. $$ An appropriate cost function might be the sum of the squared difference between the data and the hypothesis: $$ J(\boldsymbol{\theta}) = \frac{1}{2m} \sum_i^{m} \left[h_\theta(x^{(i)}) - y^{(i)}\right]^2. $$ The gradient descent method starts with a set of initial parameter values of $\boldsymbol\theta$ (say, $\theta_0 = 0, \theta_1 = 0$), and then follows an iterative procedure, changing the values of $\theta_j$ so that $J(\boldsymbol{\theta})$ decreases: $$ \theta_j \rightarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\boldsymbol{\theta}). $$

To simplify things, consider fitting a data set to a straight line through the origin: $h_\theta(x) = \theta_1 x$. In this one-dimensional problem, we can plot a simple graph for $J(\theta_1)$ and follow the iterative procedure which trys to converge on its minimum.

Fitting a general straight line to a data set requires two parameters, and so $J(\theta_0, \theta_1)$ can be visualized as a contour plot. The same iterative procedure over these two parameters can also be followed as points on this plot.

Here's the code for the one-parameter plot:

import numpy as np import matplotlib.pyplot as plt # The data to fit m = 20 theta1_true = 0.5 x = np.linspace(-1,1,m) y = theta1_true * x # The plot: LHS is the data, RHS will be the cost function. fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,6.15)) ax[0].scatter(x, y, marker='x', s=40, color='k') def cost_func(theta1): """The cost function, J(theta1) describing the goodness of fit.""" theta1 = np.atleast_2d(np.asarray(theta1)) return np.average((y-hypothesis(x, theta1))**2, axis=1)/2 def hypothesis(x, theta1): """Our "hypothesis function", a straight line through the origin.""" return theta1*x # First construct a grid of theta1 parameter pairs and their corresponding # cost function values. theta1_grid = np.linspace(-0.2,1,50) J_grid = cost_func(theta1_grid[:,np.newaxis]) # The cost function as a function of its single parameter, theta1. ax[1].plot(theta1_grid, J_grid, 'k') # Take N steps with learning rate alpha down the steepest gradient, # starting at theta1 = 0. N = 5 alpha = 1 theta1 = [0] J = [cost_func(theta1[0])[0]] for j in range(N-1): last_theta1 = theta1[-1] this_theta1 = last_theta1 - alpha / m * np.sum( (hypothesis(x, last_theta1) - y) * x) theta1.append(this_theta1) J.append(cost_func(this_theta1)) # Annotate the cost function plot with coloured points indicating the # parameters chosen and red arrows indicating the steps down the gradient. # Also plot the fit function on the LHS data plot in a matching colour. colors = ['b', 'g', 'm', 'c', 'orange'] ax[0].plot(x, hypothesis(x, theta1[0]), color=colors[0], lw=2, label=r'$\theta_1 = {:.3f}$'.format(theta1[0])) for j in range(1,N): ax[1].annotate('', xy=(theta1[j], J[j]), xytext=(theta1[j-1], J[j-1]), arrowprops={'arrowstyle': '->', 'color': 'r', 'lw': 1}, va='center', ha='center') ax[0].plot(x, hypothesis(x, theta1[j]), color=colors[j], lw=2, label=r'$\theta_1 = {:.3f}$'.format(theta1[j])) # Labels, titles and a legend. ax[1].scatter(theta1, J, c=colors, s=40, lw=0) ax[1].set_xlim(-0.2,1) ax[1].set_xlabel(r'$\theta_1$') ax[1].set_ylabel(r'$J(\theta_1)$') ax[1].set_title('Cost function') ax[0].set_xlabel(r'$x$') ax[0].set_ylabel(r'$y$') ax[0].set_title('Data and fit') ax[0].legend(loc='upper left', fontsize='small') plt.tight_layout() plt.show()

The following program produces the two-dimensional plot:

import numpy as np import matplotlib.pyplot as plt # The data to fit m = 20 theta0_true = 2 theta1_true = 0.5 x = np.linspace(-1,1,m) y = theta0_true + theta1_true * x # The plot: LHS is the data, RHS will be the cost function. fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,6.15)) ax[0].scatter(x, y, marker='x', s=40, color='k') def cost_func(theta0, theta1): """The cost function, J(theta0, theta1) describing the goodness of fit.""" theta0 = np.atleast_3d(np.asarray(theta0)) theta1 = np.atleast_3d(np.asarray(theta1)) return np.average((y-hypothesis(x, theta0, theta1))**2, axis=2)/2 def hypothesis(x, theta0, theta1): """Our "hypothesis function", a straight line.""" return theta0 + theta1*x # First construct a grid of (theta0, theta1) parameter pairs and their # corresponding cost function values. theta0_grid = np.linspace(-1,4,101) theta1_grid = np.linspace(-5,5,101) J_grid = cost_func(theta0_grid[np.newaxis,:,np.newaxis], theta1_grid[:,np.newaxis,np.newaxis]) # A labeled contour plot for the RHS cost function X, Y = np.meshgrid(theta0_grid, theta1_grid) contours = ax[1].contour(X, Y, J_grid, 30) ax[1].clabel(contours) # The target parameter values indicated on the cost function contour plot ax[1].scatter([theta0_true]*2,[theta1_true]*2,s=[50,10], color=['k','w']) # Take N steps with learning rate alpha down the steepest gradient, # starting at (theta0, theta1) = (0, 0). N = 5 alpha = 0.7 theta = [np.array((0,0))] J = [cost_func(*theta[0])[0]] for j in range(N-1): last_theta = theta[-1] this_theta = np.empty((2,)) this_theta[0] = last_theta[0] - alpha / m * np.sum( (hypothesis(x, *last_theta) - y)) this_theta[1] = last_theta[1] - alpha / m * np.sum( (hypothesis(x, *last_theta) - y) * x) theta.append(this_theta) J.append(cost_func(*this_theta)) # Annotate the cost function plot with coloured points indicating the # parameters chosen and red arrows indicating the steps down the gradient. # Also plot the fit function on the LHS data plot in a matching colour. colors = ['b', 'g', 'm', 'c', 'orange'] ax[0].plot(x, hypothesis(x, *theta[0]), color=colors[0], lw=2, label=r'$\theta_0 = {:.3f}, \theta_1 = {:.3f}$'.format(*theta[0])) for j in range(1,N): ax[1].annotate('', xy=theta[j], xytext=theta[j-1], arrowprops={'arrowstyle': '->', 'color': 'r', 'lw': 1}, va='center', ha='center') ax[0].plot(x, hypothesis(x, *theta[j]), color=colors[j], lw=2, label=r'$\theta_0 = {:.3f}, \theta_1 = {:.3f}$'.format(*theta[j])) ax[1].scatter(*zip(*theta), c=colors, s=40, lw=0) # Labels, titles and a legend. ax[1].set_xlabel(r'$\theta_0$') ax[1].set_ylabel(r'$\theta_1$') ax[1].set_title('Cost function') ax[0].set_xlabel(r'$x$') ax[0].set_ylabel(r'$y$') ax[0].set_title('Data and fit') axbox = ax[0].get_position() # Position the legend by hand so that it doesn't cover up any of the lines. ax[0].legend(loc=(axbox.x0+0.5*axbox.width, axbox.y0+0.1*axbox.height), fontsize='small') plt.show()

## Comments

## Student 1 year, 2 months ago

Very good, this is what I've been looking for!

Link | Reply## Trustin 6 months, 2 weeks ago

Thanks pro. Very easy to understand and useful, but I think it's better if you can use FuncAnimation to show above example.

Link | Reply## christian 6 months, 1 week ago

Thanks. Might do this, but will try to find an example that converges at a rate that looks good in an animation to show what's going on.

Link | Reply## Jeremy 5 months, 3 weeks ago

Hi, I'm playing around with the x = np.linspace(min_x,max_x,m) values and I realize when I use huge numbers, I don't get a nice round contours. They look quite straight..

Link | ReplyWas wondering if you know why this is happening?

## christian 5 months, 2 weeks ago

If you increase the value of range of x but keep theta1_grid (corresponding to the gradient) the same, then the contours become very tall and narrow, so across the plotted range you're probably just seeing their edges and not the rounded ends. Try setting e.g. max_x = -min_x = 3 to see this. Note that the values chosen for the x-range will determine how well the fit performs (not well for the chosen starting point when the x-range is too large).

Link | Reply## primef 3 months, 2 weeks ago

Hi,

Link | ReplyI was using your code to create a similar plot to yours. However after analyzing your code and the plots you get, I noticed that the results are wrong.

For Theta = [0,0], the cost J should be 2.04605263. When I print the J out of your code, I get the right value for J when thet is [0,0]. However in the contour plot this is not the case, as for theta [0,0] the corresponding J is 2.400. I assume the error is in the J_grid or in one of the theta grids, but unfortunately even after some hours of error search, I couldn't fix it.

Could you please check that, maybe I am seeing it even wrong.

Thank you in advance

## christian 3 months, 2 weeks ago

Thanks, primef – this was a bug in the way I constructed my J_grid array, as you supposed. I've fixed it now and the image has been corrected.

Link | Reply## primef 3 months, 2 weeks ago

Hi Christian,

Link | Replyhappy to hear that you have fixed the bug so fast.

Thank you!

## New Comment