library(tidyverse)
library(ROCR)
library(MASS)
library(caret)
library(nnet)
Table of Contents
- 1 Introduction
- 2 Load and prepare the dataset
- 3 Prepare a training and test set
- 4 Some descriptive key figures
- 5 Conclusion
1 Introduction
In my last publication, “Machine Learning - Predictions with Generalized Linear Models”, I discussed how classifications can be made using generalized linear models. In the following, I will show how a classification of an ordinal-scaled dependent variable can be made.
For this post, the dataset Wine Quality from the UCI Machine Learning Repository was used.
2 Load and prepare the dataset
Under the link above, two CSV files are stored: one for red wines and one for white wines. For the following examination, both datasets are loaded and combined into one.
red <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"), sep = ";")
white <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"), sep = ";")
wine <- rbind(red, white)
glimpse(wine)
## Observations: 6,497
## Variables: 12
## $ fixed.acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, ...
## $ volatile.acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660,...
## $ citric.acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06,...
## $ residual.sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2...
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075,...
## $ free.sulfur.dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15...
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, ...
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0...
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30,...
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46,...
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, ...
## $ quality <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5,...
For simplicity, I will contract the original output variable (quality) to a three-point scale from 0 to 2: values below 5 become 0, values above 6 become 2, and everything in between becomes 1.
wine$quality <- factor(ifelse(wine$quality < 5, 0, ifelse(wine$quality > 6, 2, 1)))
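To make sure the recoding worked as intended, a quick look at the new class counts helps (a small check, not part of the original post):
# Sanity check: the recoded factor should contain only the levels 0, 1 and 2
table(wine$quality)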
3 Prepare a training and test set
set.seed(7644)
wine_sampling_vector <- createDataPartition(wine$quality, p = 0.80, list = FALSE)
wine_train <- wine[wine_sampling_vector,]
wine_test <- wine[-wine_sampling_vector,]
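Since createDataPartition samples within each level of the outcome, the class distribution should be roughly preserved in both subsets. A quick verification (an added sketch, not in the original analysis):
# Class proportions should be similar in the training and test set
round(prop.table(table(wine_train$quality)), 3)
round(prop.table(table(wine_test$quality)), 3)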
wine_model <- polr(quality ~ ., data = wine_train, Hess = T)
summary(wine_model)
## Call:
## polr(formula = quality ~ ., data = wine_train, Hess = T)
##
## Coefficients:
## Value Std. Error t value
## fixed.acidity 2.348e-01 0.037587 6.2462
## volatile.acidity -3.212e+00 0.298033 -10.7778
## citric.acid -1.694e-01 0.326327 -0.5191
## residual.sugar 1.330e-01 0.009488 14.0204
## chlorides -1.843e+00 1.363609 -1.3518
## free.sulfur.dioxide 1.514e-02 0.002996 5.0523
## total.sulfur.dioxide -5.786e-03 0.001078 -5.3674
## density -1.804e+02 0.531508 -339.3469
## pH 1.355e+00 0.258374 5.2457
## sulphates 2.111e+00 0.264873 7.9694
## alcohol 6.359e-01 0.036506 17.4192
##
## Intercepts:
## Value Std. Error t value
## 0|1 -170.1699 0.5449 -312.2988
## 1|2 -164.5075 0.5674 -289.9221
##
## Residual Deviance: 5668.121
## AIC: 5694.121
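Note that summary() for polr reports t values but no p values. If p values are needed, they can be approximated from the t values with a normal approximation (a common sketch, not part of the original output):
# Approximate two-sided p values via the normal approximation
ctable <- coef(summary(wine_model))
p_values <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
cbind(ctable, "p value" = p_values)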
Our model summary shows us that we have three output classes and two intercepts. Let’s count the number of samples per output class and then express these counts as relative frequencies.
prop.table(table(wine$quality))
##
## 0 1 2
## 0.03786363 0.76558412 0.19655225
Class 1, which corresponds to average wines, is by far the most frequent. In fact, a simple baseline model that always predicts this category would be correct 76.56% of the time.
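This baseline accuracy is simply the largest relative class frequency and can be reproduced in one line (added for illustration):
# Accuracy of a naive model that always predicts the most frequent class
max(prop.table(table(wine$quality)))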
4 Some descriptive key figures
Now, let’s look at the fit on the training data and the corresponding confusion matrix.
wine_predictions <- predict(wine_model, wine_train)
mean(wine_predictions == wine_train$quality)
## [1] 0.7814964
table(predicted = wine_predictions, actual = wine_train$quality)
## actual
## predicted 0 1 2
## 0 0 0 0
## 1 193 3834 793
## 2 4 146 229
Our model performs only marginally better on the training data than our baseline model. We can see why this is the case: it predicts the average class (1) for the vast majority of samples and never predicts class 0.
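If a more detailed per-class evaluation is of interest, caret’s confusionMatrix() function reports sensitivity and specificity for each class (a sketch using the already loaded caret package; these statistics are not part of the original post):
# Per-class statistics (sensitivity, specificity, etc.) for the training fit
confusionMatrix(wine_predictions, wine_train$quality)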
Now we want to try this again with the test set.
wine_test_predictions <- predict(wine_model, wine_test)
mean(wine_test_predictions == wine_test$quality)
## [1] 0.779661
table(predicted = wine_test_predictions, actual = wine_test$quality)
## actual
## predicted 0 1 2
## 0 0 2 0
## 1 49 959 202
## 2 0 33 53
As you can see, we get a pretty much identical situation on the test set. It seems that our model is not a particularly good choice for this dataset. We should probably check whether the proportional odds assumption is valid. One simple test that is easy to do, however, is to train a second model using multinomial logistic regression. Then we can compare the AIC values of our two models.
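Before doing so, the proportional odds assumption itself could be tested directly, for example with the Brant test. A minimal sketch, assuming the brant package is installed (not part of the original analysis):
# Brant test: small p values indicate a violation of the proportional odds assumption
library(brant)
brant(wine_model)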
wine_model2 <- multinom(quality ~ ., data = wine_train, maxit = 1000)
## # weights: 39 (24 variable)
## initial value 5711.685289
## iter 10 value 3616.269493
## iter 20 value 2797.070604
## iter 30 value 2782.640097
## iter 40 value 2778.448036
## iter 50 value 2767.613495
## iter 50 value 2767.613493
## final value 2767.613493
## converged
wine_predictions2 <- predict(wine_model2, wine_test)
mean(wine_predictions2 == wine_test$quality)
## [1] 0.7858243
table(predicted = wine_predictions2, actual = wine_test$quality)
## actual
## predicted 0 1 2
## 0 1 1 0
## 1 48 949 185
## 2 0 44 70
Now we have to check their AIC values:
AIC(wine_model)
## [1] 5694.121
AIC(wine_model2)
## [1] 5583.227
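Since the proportional odds model is a restricted version of the multinomial model, the AIC comparison can be complemented by a likelihood ratio test (a sketch; the degrees of freedom equal the difference in the number of estimated parameters):
# Likelihood ratio test: restricted (polr) vs. unrestricted (multinom) model
lr_stat <- deviance(wine_model) - deviance(wine_model2)
df_diff <- length(coef(wine_model2)) - (length(coef(wine_model)) + length(wine_model$zeta))
pchisq(lr_stat, df = df_diff, lower.tail = FALSE)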
5 Conclusion
The AIC is lower for the multinomial logistic regression model, which suggests that we might be better off working with that model.
Source
Miller, J. D., & Forte, R. M. (2017). Mastering Predictive Analytics with R. Packt Publishing Ltd.