Regularization in Machine Learning

This post continues the Coursera course mentioned in Gradient Descent and Partial Derivatives, which introduces a new idea: regularization. In short, regularization shrinks theta (or the weights in a neural network) to make the function “simpler” and less tailored to the training data.


As seen in the video, regularization adds the sum of the squared thetas to the cost function. This adds a term proportional to theta to the gradient of the cost function, so gradient descent pushes each theta toward zero harder the larger its magnitude. This keeps the thetas small, which makes the function more “generic” and helps prevent overfitting.
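To make the point about larger thetas being pushed harder concrete, here is a minimal Python sketch. It assumes the penalty is the plain sum of squared thetas with no scaling constant, and the names `l2_penalty` and `l2_penalty_gradient` are made up for this illustration.

```python
def l2_penalty(thetas):
    """Sum of squared parameters: the extra term added to the cost."""
    return sum(t ** 2 for t in thetas)


def l2_penalty_gradient(thetas):
    """Derivative of the penalty with respect to each theta: 2 * theta."""
    return [2 * t for t in thetas]


small = [0.1, -0.2]
large = [10.0, -20.0]

# The gradient of the penalty is proportional to theta itself, so large
# parameters are pushed toward zero much harder than small ones.
print(l2_penalty(small), l2_penalty_gradient(small))
print(l2_penalty(large), l2_penalty_gradient(large))
```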



An example of overfitting and how regularization applies to it. (Picture from Wikipedia.)

In the graph above, the blue line is a higher-degree polynomial that has overfitted the data. Although it fits the training set perfectly, it will not do well on other data. The green line, however, fits the data relatively well in this case. (Since the source does not state what the graph represents, we will assume that the green line is a reasonable prediction.) It is smoother and is more likely to predict the correct Y for an X value that is not in the training set.

Applying Regularization:

In short, regularization adds a penalty to the cost function, which adds to the gradient and causes theta to get smaller. Or, think about it this way: it shrinks theta so that the unnecessary higher-degree polynomial terms contribute less. In neural networks, weight decay is essentially the same thing as L2 regularization, at least for plain gradient descent. It shrinks the weights to make the model more generic, thus making it less likely to overfit the training data.
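For plain gradient descent, the link between weight decay and L2 regularization can be checked numerically. The sketch below is only an illustration: `grad` stands in for the partial derivative of the unregularized cost, and `lr`, `lam`, and `m` are hypothetical names for the learning rate, the regularization strength, and the number of training examples.

```python
import math


def l2_step(theta, grad, lr, lam, m):
    """Gradient step on cost + (lam / (2m)) * theta^2."""
    return theta - lr * (grad + (lam / m) * theta)


def weight_decay_step(theta, grad, lr, lam, m):
    """Shrink theta first, then take the ordinary gradient step."""
    return theta * (1 - lr * lam / m) - lr * grad


theta, grad = 2.0, 0.5
step_a = l2_step(theta, grad, lr=0.1, lam=1.0, m=10)
step_b = weight_decay_step(theta, grad, lr=0.1, lam=1.0, m=10)

# Both forms produce the same new theta, up to floating-point rounding.
print(step_a, step_b, math.isclose(step_a, step_b))
```

This equivalence holds for plain gradient descent; with other optimizers the two can differ.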

L2 regularization can be applied as:


\(J(\theta) = \text{costFunction} + \frac{\lambda}{2m} \sum_{j=1}^{k} \theta_j^2\)
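As a rough sketch of this formula in Python, assume the unregularized cost is the mean squared error of a linear model (the post leaves costFunction unspecified, so that choice is an assumption). The function and variable names are hypothetical, and every theta is penalized because the sum in the formula runs over all k of them.

```python
def predict(thetas, x):
    """Linear model without a bias term: sum of theta_j * x_j."""
    return sum(t * xj for t, xj in zip(thetas, x))


def regularized_cost(thetas, X, y, lam):
    """Mean squared error (assumed cost) plus (lam / (2m)) * sum(theta^2)."""
    m = len(X)
    data_cost = sum((predict(thetas, xi) - yi) ** 2
                    for xi, yi in zip(X, y)) / (2 * m)
    penalty = (lam / (2 * m)) * sum(t ** 2 for t in thetas)
    return data_cost + penalty


X = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
y = [5.0, 4.0, 9.0]
thetas = [1.0, 2.0]  # fits this toy data exactly, so the data cost is 0

print(regularized_cost(thetas, X, y, lam=0.0))  # 0.0
print(regularized_cost(thetas, X, y, lam=3.0))  # 2.5, all of it penalty
```

Even a perfect fit pays a price once lambda is nonzero, which is what nudges gradient descent toward smaller thetas.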


This gives the gradient descent update:


\(\theta_j := \theta_j \left(1 - a \frac{\lambda}{m}\right) - a \cdot \text{partialDerivative}\)


where m is the number of training examples, k is the number of features (the number of thetas), \(\lambda\) is the regularization strength, and a is the learning rate.
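Putting the update rule into a loop gives something like the following sketch. It assumes a linear model with a mean squared error cost, multiplies the partial derivative by the learning rate a as in an ordinary gradient descent step, and uses made-up names (`gradient_descent`, `lam`, `epochs`). It is one possible reading of the update, not the course's reference code.

```python
def gradient_descent(X, y, lam, a=0.1, epochs=500):
    """Batch gradient descent for a linear model with L2 regularization."""
    m, k = len(X), len(X[0])
    thetas = [0.0] * k
    for _ in range(epochs):
        # Prediction error on every training example.
        errors = [sum(t * xj for t, xj in zip(thetas, xi)) - yi
                  for xi, yi in zip(X, y)]
        # Partial derivative of the (assumed) mean squared error cost.
        grads = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
                 for j in range(k)]
        # theta_j := theta_j * (1 - a * lam / m) - a * partial derivative
        thetas = [t * (1 - a * lam / m) - a * g
                  for t, g in zip(thetas, grads)]
    return thetas


X = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
y = [5.0, 4.0, 9.0]

plain = gradient_descent(X, y, lam=0.0)
shrunk = gradient_descent(X, y, lam=5.0)
print(plain)   # close to [1.0, 2.0], which fits the toy data exactly
print(shrunk)  # smaller squared norm than the unregularized thetas
```

Note that regularization shrinks the thetas as a whole (their sum of squares); an individual theta can still move in either direction.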

(Partial Derivatives can be found here)

From these two equations, you can see that each update first multiplies theta by a factor slightly less than 1, pulling it closer to 0 with each epoch. This makes the function more generic and less likely to overfit the training data.
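The shrinking effect is easiest to see in isolation. The sketch below holds the partial derivative at zero (an artificial assumption, made only to expose the decay factor) and uses hypothetical values for a, lambda, and m.

```python
a, lam, m = 0.1, 1.0, 10     # hypothetical learning rate, lambda, examples
decay = 1 - a * lam / m      # factor applied to theta on every update

theta = 5.0
history = [theta]
for epoch in range(100):
    # With a zero partial derivative, the update is pure shrinkage.
    theta = theta * decay
    history.append(theta)

print(decay)
print(history[0], history[10], history[100])
```

In a real run the partial derivative is not zero, so theta settles where the pull toward the data balances the pull toward zero rather than collapsing all the way to 0.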
