Machine Learning - Training and Testing Sets: Regression Modeling

library(tidyverse)

Table of Contents

  • 1 Introduction
  • 2 Creation of two dependent variables
  • 3 Train and test the simple regression model
  • 4 Train and test the polynomial regression model
  • 5 Train and test the exponential regression model
  • 6 Conclusion

1 Introduction

This post covers a basic machine learning workflow: splitting a data set into a training part and a test part, and using the two to evaluate regression models.

2 Creation of two dependent variables

In the first step, two related variables are generated: x is drawn from a normal distribution, and y depends exponentially on x plus some noise.

set.seed(123)
x <- rnorm(100, 2, 1)
y <- exp(x) + rnorm(7, 0, 1)
## Warning in exp(x) + rnorm(7, 0, 1): longer object length is not a
##       multiple of shorter object length
linear <- lm(y ~ x)
plot(x, y)
# Add the fitted regression line (dashed) to the scatter plot
abline(a = coef(linear)[1], b = coef(linear)[2], lty = 2)

summary(linear)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.457 -4.115 -2.108  1.310 28.695 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13.4079     1.6402  -8.175 1.07e-12 ***
## x            12.0637     0.7196  16.764  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.536 on 98 degrees of freedom
## Multiple R-squared:  0.7414, Adjusted R-squared:  0.7388 
## F-statistic:   281 on 1 and 98 DF,  p-value: < 2.2e-16
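
The warning above appears because only 7 noise values are added to 100 values of exp(x), so R recycles the shorter vector. The following is a minimal sketch of what happens, not part of the original analysis; the names noise and y.explicit are introduced here purely for illustration.

```r
# Mirror the data generation and make the recycling explicit.
set.seed(123)
x <- rnorm(100, 2, 1)
noise <- rnorm(7, 0, 1)  # only 7 noise values are drawn

# R recycles the shorter vector; 100 is not a multiple of 7, hence the warning.
# suppressWarnings() hides it here only to keep the illustration quiet.
y <- suppressWarnings(exp(x) + noise)

# rep_len() spells out the same recycling without a warning.
y.explicit <- exp(x) + rep_len(noise, length(x))
all.equal(y, y.explicit)

# Number of distinct noise values that actually enter y
length(unique(rep_len(noise, length(x))))
```

Drawing the noise with rnorm(length(x), 0, 1) instead would give every observation its own noise term, but it would also change all numbers reported in this post, which were produced with the recycled noise.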

3 Train and test the simple regression model

Subsequently, the newly created data set is divided into a training part (80%) and a test part (20%).

data <- data.frame(x, y)
data.samples <- sample(1:nrow(data), nrow(data) * 0.8, replace = FALSE)
training.data <- data[data.samples, ]
test.data <- data[-data.samples, ]
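
As a quick sanity check on a split like the one above, it can help to confirm the sizes of the two parts and that no row ends up in both. This is a small self-contained sketch on a toy data frame; toy, n.train and train.idx are illustrative names, not objects from the post.

```r
set.seed(123)
toy <- data.frame(x = rnorm(100, 2, 1))
toy$y <- exp(toy$x)

# 80% of the row indices, drawn without replacement
n.train <- floor(0.8 * nrow(toy))
train.idx <- sample(seq_len(nrow(toy)), size = n.train, replace = FALSE)

toy.train <- toy[train.idx, ]
toy.test <- toy[-train.idx, ]  # negative indexing keeps the remaining rows

# Sizes of the two parts
c(train = nrow(toy.train), test = nrow(toy.test))

# Number of rows shared by both parts (should be zero)
length(intersect(rownames(toy.train), rownames(toy.test)))
```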

Now the regression model can be trained on the training data and used to predict the test data.

train.linear <- lm(y ~ x, training.data)
train.output <- predict(train.linear, test.data)

The quality of the prediction can be determined using the root mean square error (RMSE).

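\[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\Big(\frac{d_i - f_i}{\sigma_i}\Big)^2}}\]

Here d_i and f_i can be read as the actual and predicted values of observation i (the order does not matter because the difference is squared), and σ_i is a per-observation scaling term. The code in this post uses the unscaled form, that is σ_i = 1 for every observation.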

# SE holds each squared error divided by n, so sum(SE) is the mean squared error
RMSE.df <- data.frame(predicted = train.output, actual = test.data$y, 
                      SE = ((train.output - test.data$y)^2/length(train.output)))

head(RMSE.df)
##    predicted    actual           SE
## 6  29.249080 41.016228 6.923288e+00
## 9   2.895065  3.974740 5.828484e-02
## 11 23.861977 24.782946 4.240916e-02
## 15  4.332535  3.527879 3.237358e-02
## 20  5.243763  4.560276 2.335772e-02
## 25  3.573288  3.607379 5.810787e-05
sqrt(sum(RMSE.df$SE))
## [1] 8.065677

We get an RMSE of 8.07. On its own this number is hard to judge, so it is compared with the RMSE of other models.
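
The same calculation can be wrapped in a small helper so that it does not have to be repeated for every model. The sketch below uses a hypothetical function name, rmse, and made-up toy values; it computes the unscaled RMSE and shows that it agrees with the column-based approach used in this post.

```r
# Hypothetical helper: root mean square error of predictions
rmse <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  sqrt(mean((predicted - actual)^2))
}

# Toy values, chosen only for illustration
predicted <- c(2, 4, 6)
actual <- c(1, 4, 8)
rmse(predicted, actual)

# Same value via the approach used in the post:
# squared errors divided by n, then summed and square-rooted
se <- (predicted - actual)^2 / length(predicted)
sqrt(sum(se))
```

With such a helper, the value for the linear model could be obtained as rmse(train.output, test.data$y).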

4 Train and test the polynomial regression model

train.polyn <- lm(y ~ poly(x, 4), training.data)
polyn.output <- predict(train.polyn, test.data)

RMSE.polyn.df <- data.frame(predicted = polyn.output, actual = test.data$y, 
                            SE = ((polyn.output - test.data$y)^2/length(polyn.output)))

head(RMSE.polyn.df)
##    predicted    actual           SE
## 6  41.203433 41.016228 1.752296e-03
## 9   3.333099  3.974740 2.058515e-02
## 11 24.954389 24.782946 1.469629e-03
## 15  3.873118  3.527879 5.959505e-03
## 20  4.245259  4.560276 4.961783e-03
## 25  3.581171  3.607379 3.434285e-05
sqrt(sum(RMSE.polyn.df$SE))
## [1] 0.4690057

With an RMSE of 0.47, compared with 8.07 for the simple linear model, the quality of the prediction has improved considerably.
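
A note on poly(x, 4): by default it builds orthogonal polynomials rather than the raw powers x, x^2, x^3 and x^4. The coefficients differ between the two parameterisations, but the fitted curve and therefore the predictions should be the same. The following sketch illustrates this on freshly simulated data (so the numbers are not those of the post); d, fit.orth, fit.raw and new.x are illustrative names.

```r
set.seed(123)
x <- rnorm(100, 2, 1)
y <- exp(x) + rnorm(100, 0, 1)
d <- data.frame(x, y)

# Degree-4 fit with orthogonal polynomials (the default) and with raw powers
fit.orth <- lm(y ~ poly(x, 4), data = d)
fit.raw <- lm(y ~ poly(x, 4, raw = TRUE), data = d)

# Predictions at a few new x values
new.x <- data.frame(x = c(0.5, 2, 3.5))
pred.orth <- predict(fit.orth, new.x)
pred.raw <- predict(fit.raw, new.x)

# The two parameterisations agree up to numerical tolerance
all.equal(pred.orth, pred.raw)
```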

5 Train and test the exponential regression model

train.exponential <- lm(y ~ exp(x) + x, training.data)
exponential.output <- predict(train.exponential, test.data)

RMSE.exponential.df <- data.frame(predicted = exponential.output, actual = test.data$y, 
                                  SE = ((exponential.output - test.data$y)^2/length(exponential.output)))

head(RMSE.exponential.df)
##    predicted    actual           SE
## 6  40.807386 41.016228 2.180737e-03
## 9   3.291509  3.974740 2.334023e-02
## 11 24.788044 24.782946 1.299666e-06
## 15  3.811601  3.527879 4.024919e-03
## 20  4.178644  4.560276 7.282133e-03
## 25  3.528361  3.607379 3.121932e-04
sqrt(sum(RMSE.exponential.df$SE))
## [1] 0.3703497

The exponential regression model predicts even better on this test set, with an RMSE of 0.37. This is plausible, since y was generated from exp(x) plus noise.
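
The three models can also be compared in a single pass. The sketch below is one possible way to do this, assuming freshly simulated data with one noise value per observation, so the resulting numbers will differ from the ones reported above; formulas and test.rmse are illustrative names.

```r
set.seed(123)
x <- rnorm(100, 2, 1)
y <- exp(x) + rnorm(100, 0, 1)
d <- data.frame(x, y)

# 80/20 split into training and test data
idx <- sample(seq_len(nrow(d)), size = floor(0.8 * nrow(d)))
train <- d[idx, ]
test <- d[-idx, ]

# The three model formulas used in this post
formulas <- list(
  linear = y ~ x,
  polynomial = y ~ poly(x, 4),
  exponential = y ~ exp(x) + x
)

# Fit each model on the training data and compute the RMSE on the test data
test.rmse <- sapply(formulas, function(f) {
  fit <- lm(f, data = train)
  sqrt(mean((predict(fit, test) - test$y)^2))
})
round(test.rmse, 2)
```

Keeping the formulas in a named list makes it easy to add further candidate models without repeating the fitting and scoring code.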

6 Conclusion

This post gave a brief demonstration of how regression models can be trained on one part of the data, tested on the other, and compared by their RMSE to find the model with the best predictive power.

Source

Burger, S. V. (2018). Introduction to Machine Learning with R: Rigorous Mathematical Analysis. O'Reilly Media, Inc.