
Machine Learning - Regression Regularization

library(tidyverse)               # data import and handling (read_csv, glimpse, ...)
library(lasso2)                  # provides l1ce() for L1-constrained (lasso) regression
mtcars <- read_csv("mtcars.csv") # local copy of the dataset linked below

Table of Contents

  • 1 Introduction
  • 2 The regularization of a regression
  • 3 Conclusion

1 Introduction

Regularization means tuning the complexity of a statistical model so that its predictive ability improves. Without regularization, a model can become too complex and over-fit the data, or stay too simple and under-fit it.
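For the lasso used in this post, this trade-off takes a concrete form (standard textbook notation, not taken from the source cited below): the least-squares criterion is minimized subject to a budget on the absolute size of the coefficients,

$$\hat{\beta} = \underset{\beta}{\arg\min} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t$$

The smaller the budget t, the more coefficients are forced exactly to zero, which is what makes the lasso useful for variable selection.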

For this post, the dataset mtcars from the statistics platform Kaggle was used. A copy of the dataset is available at https://drive.google.com/open?id=1u7cDZoOUg9ah8ZG3aiUWmUgts8RTL4iR.

2 The regularization of a regression

First, we get an overview of the dataset at hand.

glimpse(mtcars)
## Observations: 32
## Variables: 12
## $ X1   <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Dri...
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl  <int> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp   <int> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs   <int> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am   <int> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear <int> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb <int> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...

The main question we ask ourselves is the following: which factors significantly influence the fuel consumption (mpg = miles per gallon) of the cars?

With a simple linear model, we get the following values.

mtcars.tidy <- mtcars[,-1]                # drop the first column (the car names)
lm.all <- lm(mpg ~ ., data = mtcars.tidy) # regress mpg on all remaining variables
summary(lm.all)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars.tidy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

Although the model as a whole fits well (R-squared = 0.869), not a single predictor is significant at the 5% level; wt comes closest. This is a typical sign of correlated predictors. Now, with the help of regularization, we try to find out which influences really are important for a correct prediction.
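The l1ce() function from the lasso2 package fits exactly this constrained least-squares problem. Assuming lasso2's documented defaults of bound = 0.5 and absolute.t = FALSE, the call below constrains the L1 norm of the coefficients to half that of the ordinary least-squares solution; spelled out, it is equivalent to:

lm.lasso.explicit <- l1ce(mpg ~ ., data = mtcars.tidy,
                          bound = 0.5,        # L1 budget for the coefficients ...
                          absolute.t = FALSE) # ... relative to the OLS solution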

lm.lasso <- l1ce(mpg ~ ., data = mtcars.tidy)
summary(lm.lasso)$coefficients
##                   Value  Std. Error     Z score   Pr(>|Z|)
## (Intercept) 36.01809203 18.92587647  1.90311355 0.05702573
## cyl         -0.86225790  1.12177221 -0.76865686 0.44209704
## disp         0.00000000  0.01912781  0.00000000 1.00000000
## hp          -0.01399880  0.02384398 -0.58709992 0.55713660
## drat         0.05501092  1.78394922  0.03083659 0.97539986
## wt          -2.68868427  2.05683876 -1.30719254 0.19114733
## qsec         0.00000000  0.75361628  0.00000000 1.00000000
## vs           0.00000000  2.31605743  0.00000000 1.00000000
## am           0.44530641  2.14959278  0.20715850 0.83588608
## gear         0.00000000  1.62955841  0.00000000 1.00000000
## carb        -0.09506985  0.91237207 -0.10420075 0.91701004
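Rather than reading the zeros off the table, they can also be extracted programmatically (assuming the usual coef() method works on l1ce objects):

# predictors whose coefficients were shrunk exactly to zero
zeroed <- coef(lm.lasso)[coef(lm.lasso) == 0]
names(zeroed) # from the table above: disp, qsec, vs, gear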

It can be seen that the l1ce() function has shrunk the coefficients of the variables disp, qsec, vs and gear to exactly 0. These variables are therefore dropped from the next model.

lm.lasso2 <- l1ce(mpg ~ cyl + hp + drat + wt + am + carb, data = mtcars.tidy)
summary(lm.lasso2)$coefficients
##                    Value Std. Error     Z score    Pr(>|Z|)
## (Intercept) 31.446025119 11.4534952  2.74553965 0.006041147
## cyl         -0.789829796  0.9371848 -0.84276842 0.399357971
## hp          -0.001280132  0.0236353 -0.05416188 0.956806194
## drat         0.000000000  2.2516551  0.00000000 1.000000000
## wt          -1.952148210  1.4344649 -1.36088945 0.173548628
## am           0.000000000  2.3316618  0.00000000 1.000000000
## carb         0.000000000  0.7107026  0.00000000 1.000000000

Now the coefficients of the variables drat, am and carb have been shrunk to 0. The model is reduced accordingly.
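As an aside, this fit, inspect and refit cycle can also be automated. A minimal sketch with a hypothetical helper prune_lasso() (the name and the loop are mine, not from the source) that refits until no coefficient is shrunk to zero:

# refit repeatedly, dropping zeroed predictors, until all coefficients are non-zero
prune_lasso <- function(data, response = "mpg") {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- l1ce(reformulate(predictors, response), data = data)
    zeroed <- setdiff(names(coef(fit))[coef(fit) == 0], "(Intercept)")
    if (length(zeroed) == 0) return(fit) # done: nothing left to drop
    predictors <- setdiff(predictors, zeroed)
  }
}

Continuing by hand: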

lm.lasso3 <- l1ce(mpg ~ cyl + hp + wt, data = mtcars.tidy)
summary(lm.lasso3)$coefficients
##                  Value Std. Error    Z score  Pr(>|Z|)
## (Intercept) 30.2106931 1.97117597 15.3262284 0.0000000
## cyl         -0.7220771 0.82941877 -0.8705821 0.3839824
## hp           0.0000000 0.01748364  0.0000000 1.0000000
## wt          -1.7568469 1.07478525 -1.6346028 0.1021324

OK, one more adjustment is necessary: hp has now been shrunk to 0 as well.

lm.lasso4 <- l1ce(mpg ~ cyl + wt, data = mtcars.tidy)
summary(lm.lasso4)$coefficients
##                  Value Std. Error   Z score  Pr(>|Z|)
## (Intercept) 29.8694933  1.4029760 21.290096 0.0000000
## cyl         -0.6937847  0.5873288 -1.181254 0.2375017
## wt          -1.7052064  1.0720172 -1.590652 0.1116879

Perfect. The final model is: mpg = 29.87 - 0.69 * cyl - 1.71 * wt
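As a quick sanity check, the final model can be used for a prediction (assuming lasso2 supplies a predict() method for l1ce objects; the example car is made up):

new_car <- data.frame(cyl = 6, wt = 3.0) # 6 cylinders, 3000 lbs (wt is in 1000-lbs units)
predict(lm.lasso4, newdata = new_car)
# by hand: 29.87 - 0.69 * 6 - 1.71 * 3.0 = 20.6 mpg (approximately)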

3 Conclusion

By selecting a suitable model complexity, one obtains a model that predicts new data as well as possible. In addition to the lasso method shown here, there are other well-known regularization methods such as ridge regression and elastic net regression.
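For comparison, the same analysis can be run with the widely used glmnet package, whose alpha parameter switches between ridge regression (alpha = 0), the lasso (alpha = 1) and elastic net regression (values in between). A minimal sketch, with the penalty strength lambda chosen by cross-validation:

library(glmnet)

x <- as.matrix(mtcars.tidy[, -1]) # all predictors (mpg is column 1)
y <- mtcars.tidy$mpg

cv.fit <- cv.glmnet(x, y, alpha = 1) # alpha = 1 -> lasso
coef(cv.fit, s = "lambda.1se")       # sparse coefficients at a conservative lambda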

Source

Burger, S. V. (2018). Introduction to Machine Learning with R: Rigorous Mathematical Analysis. O’Reilly Media, Inc.