Stay away from overfitting: L2-norm Regularization, Weight Decay and L1-norm Regularization techniques

Inara Koppert-Anisimova
Published in unpack
4 min read · Jan 18, 2021


Real-world data is complex, and solving complex problems calls for complex solutions.

Using fewer parameters is one way of preventing our model from getting overly complex, but it is a very limiting strategy: more parameters mean more interactions between the various parts of our neural network.

However, we don’t want these interactions to get out of hand. So what if we penalize complexity instead? We can keep a large number of parameters while preventing the model from becoming too complex. This is the idea behind weight regularization.

Img 2. Test accuracy attained by several deep Neural Networks trained on the CIFAR-10, CIFAR-100, and ImageNet datasets. Parameters trained with SGD were unconstrained.

What is weight decay?

One way to penalize complexity would be to add the sum of all our parameters (weights) to our loss function. That won’t quite work, because some parameters are positive and some are negative, so they would cancel each other out. What if we instead add the squares of all the parameters to our loss function? We can do that, but it might make the loss so huge that the best model would simply set all the parameters to 0.

To prevent that from happening, we multiply the sum of squares by a small number. This number is called the weight decay, or wd.

Our loss function now looks as follows:

Loss = MSE(y_hat, y) + wd * sum(w^2)
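The penalized loss above can be sketched in a few lines of numpy (the function name and the toy values are illustrative, not from the article):

```python
import numpy as np

def loss_with_weight_decay(y_hat, y, w, wd):
    """Mean squared error plus the weight-decay penalty wd * sum(w^2)."""
    mse = np.mean((y_hat - y) ** 2)
    penalty = wd * np.sum(w ** 2)
    return mse + penalty

# Toy example: identical predictions, but larger weights give a larger loss.
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.9, 3.2])
small_w = np.array([0.1, 0.2])
big_w = np.array([1.0, 2.0])
```

With wd = 0, the function reduces to the plain MSE; as wd grows, large weights are punished more heavily.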

When we update weights using gradient descent we do the following:

w(t) = w(t-1) - lr * dLoss / dw

Now since our loss function has 2 terms in it, the derivative of the 2nd term w.r.t w would be:

d(wd * w^2) / dw = 2 * wd * w (similar to d(x^2)/dx = 2x)

That is, from now on we subtract not only the learning rate times the gradient of the data loss from the weights, but also lr * 2 * wd * w. Because on every step we subtract a constant fraction of the weight from the weight itself, this is called weight decay.
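A minimal sketch of this update rule (the function name is made up for illustration):

```python
import numpy as np

def sgd_step_with_decay(w, grad, lr, wd):
    """One gradient-descent update. Besides lr * grad (the data-loss
    gradient), we subtract lr * 2 * wd * w, the gradient of wd * w^2."""
    return w - lr * (grad + 2 * wd * w)

# Even with a zero data gradient, the weights decay toward zero:
w = np.array([1.0, -2.0])
w_next = sgd_step_with_decay(w, grad=np.zeros_like(w), lr=0.1, wd=0.1)
# each weight is multiplied by (1 - 2 * lr * wd) = 0.98
```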

Deciding the value of wd

Generally a wd = 0.1 works pretty well. However, the default value of weight decay in fastai is actually 0.01.

As an example, take a multi-class (not multi-label) classification problem where we try to predict the class of plant seedlings, using three values for weight decay: the default 0.01, the best value 0.1, and a large value of 10. In the first case our model takes more epochs to fit; in the second it works best; and in the last it never quite fits well, even after 10 epochs.
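The effect of an oversized wd can be reproduced on a toy problem. This is only a hedged numpy sketch (not the seedlings experiment, and `fit`, `true_w`, etc. are invented names): fitting a small linear model by gradient descent with a huge decay value shrinks the weights far below their true magnitudes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

def fit(wd, lr=0.05, epochs=500):
    """Gradient descent on MSE + wd * sum(w^2)."""
    w = np.zeros(3)
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the MSE term
        w -= lr * (grad + 2 * wd * w)           # plus the decay term
    return w

w_small = fit(wd=0.01)   # lands close to the true weights
w_large = fit(wd=10.0)   # heavily shrunk toward zero: the model underfits
```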

L2 and L1 regularization

Img 3. L1 vs L2 Regularization

L2 regularization is often referred to as weight decay, since it makes the weights smaller. It is also known as Ridge regression: the sum of the squared parameters (weights) of a model, multiplied by some coefficient, is added to the loss function as a penalty term to be minimized.

So what is this subtle difference between them?

While weight decay is added directly to the update rule, L2 regularization is added to the loss. You can run into this when a model that trains perfectly in one book starts overfitting in another: the first book may have implemented weight decay, while your framework (Keras, for example) implements L2 regularization.

For instance, if you had your weight decay set to 0.0005, as in the AlexNet paper, and you move to a deep learning framework that implements L2 regularization instead, you should set the lambda hyperparameter (the coefficient on the L2 penalty term) to 0.0005/2.0 to get the same behavior. The factor of 2.0 comes from differentiating the squared penalty.

One important thing to know about L2 regularization: when it is used together with batch normalization in a convolutional neural net with a typical architecture, the L2 penalty no longer has its original regularizing effect. Instead, it becomes essentially equivalent to an adaptive adjustment of the learning rate!

And lastly, what is L1 regularization?

L1 regularization, also called Lasso regression (Least Absolute Shrinkage and Selection Operator), adds the absolute value of each coefficient as a penalty term to the loss function.

The key difference between the L1 and L2 regularization techniques is that Lasso shrinks the coefficients of the less important features all the way to zero, removing some features altogether. This works well for feature selection when we have a huge number of features. With a small feature set, it is advisable to apply L2 (Ridge) regression instead and keep all the features.
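The sparsity contrast can be demonstrated with a small numpy sketch (the names and data are invented for illustration). One standard way to optimize an L1 penalty is the ISTA soft-thresholding step, which snaps small weights exactly to zero; Ridge, solved in closed form on the same data, keeps every coefficient nonzero.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal step for the L1 penalty: shrink by t, snap small weights to 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0])   # only 2 informative features
y = X @ true_w + 0.1 * rng.normal(size=200)

lam, lr = 0.5, 0.01
w = np.zeros(5)
for _ in range(1000):
    grad = 2 * X.T @ (X @ w - y) / len(y)          # gradient of the MSE term
    w = soft_threshold(w - lr * grad, lr * lam)    # ISTA / Lasso update
# Lasso drives the three irrelevant coefficients exactly to zero.

# Ridge (L2) on the same data, via its closed form: every coefficient
# is shrunk but stays nonzero.
w_ridge = np.linalg.solve(X.T @ X / len(y) + lam * np.eye(5),
                          X.T @ y / len(y))
```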

Sources:

  1. https://towardsdatascience.com/this-thing-called-weight-decay-a7cd4bcfccab
  2. https://bbabenko.github.io/weight-decay/
  3. https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c
  4. https://sudonull.com/post/69729-Dropout-a-method-for-solving-the-problem-of-retraining-in-neural-networks (Img 1 credits)
  5. https://www.kdnuggets.com/2019/12/5-techniques-prevent-overfitting-neural-networks.html (Img 3 credits)
  6. http://www.pokutta.com/blog/research/2020/11/11/NNFW.html (Img 2 credits)
