Wine Quality Prediction

On June 14th, 2018, I participated in a Data Hackathon as part of the REU program at George Mason University.
Research
Author

Howard Baek

Published

June 18, 2018

Introduction

On June 14th, 2018, I participated in a Data Hackathon as part of the REU program at George Mason University. It was my first ever hackathon and I was excited to finally participate in one. It lasted approximately 4 hours, from 10am to 2pm. Our team, consisting of three undergraduate students, worked with the famous Wine Quality dataset, which is hosted by University of California Irvine’s Center for Machine Learning and Intelligent Systems. The goal of the hackathon was to accurately predict the Quality variable (“Good”= 1 or “Bad” = 0)

Dataset

I first import the dataset and observe it.

# Load tidyverse and caret package
library(tidyverse)
library(caret)

# Import training / test data
wine_train <- read_csv("wine_train.csv")
wine_test <- read_csv("wine_test.csv")
glimpse(wine_train)
Rows: 799
Columns: 12
$ `fixed acidity`        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7…
$ `volatile acidity`     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600…
$ `citric acid`          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00,…
$ `residual sugar`       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.…
$ chlorides              <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069…
$ `free sulfur dioxide`  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, …
$ `total sulfur dioxide` <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 10…
$ density                <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978,…
$ pH                     <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39,…
$ sulphates              <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47,…
$ alcohol                <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 1…
$ Quality                <chr> "B", "B", "B", "G", "B", "B", "B", "G", "G", "B…
glimpse(wine_test)
Rows: 800
Columns: 11
$ `fixed acidity`        <dbl> 9.4, 7.2, 8.6, 5.1, 7.7, 8.4, 8.2, 8.4, 8.2, 7.…
$ `volatile acidity`     <dbl> 0.500, 0.610, 0.550, 0.585, 0.560, 0.520, 0.280…
$ `citric acid`          <dbl> 0.34, 0.08, 0.09, 0.00, 0.08, 0.22, 0.40, 0.39,…
$ `residual sugar`       <dbl> 3.60, 4.00, 3.30, 1.70, 2.50, 2.70, 2.40, 2.00,…
$ chlorides              <dbl> 0.082, 0.082, 0.068, 0.044, 0.114, 0.084, 0.052…
$ `free sulfur dioxide`  <dbl> 5, 26, 8, 14, 14, 4, 4, 4, 4, 4, 4, 4, 7, 20, 4…
$ `total sulfur dioxide` <dbl> 14, 108, 17, 86, 46, 18, 10, 10, 10, 12, 15, 14…
$ density                <dbl> 0.99870, 0.99641, 0.99735, 0.99264, 0.99710, 0.…
$ pH                     <dbl> 3.29, 3.25, 3.23, 3.56, 3.24, 3.26, 3.33, 3.27,…
$ sulphates              <dbl> 0.52, 0.51, 0.44, 0.94, 0.66, 0.57, 0.70, 0.71,…
$ alcohol                <dbl> 10.7, 9.4, 10.0, 12.9, 9.6, 9.9, 12.8, 12.5, 12…

Training data has 799 observations and 12 variables, including the target variable, Quality, while the testing data has 800 observations and exactly the same attributes except Quality.

Data Manipulation

# Change columns names- Take out single quotations and underscores from names 
names(wine_train) <- gsub("'", '', names(wine_train))
names(wine_train) <- gsub(" ", "_", names(wine_train))
names(wine_train)
 [1] "fixed_acidity"        "volatile_acidity"     "citric_acid"         
 [4] "residual_sugar"       "chlorides"            "free_sulfur_dioxide" 
 [7] "total_sulfur_dioxide" "density"              "pH"                  
[10] "sulphates"            "alcohol"              "Quality"             
names(wine_test) <- gsub("'", '', names(wine_test))
names(wine_test) <- gsub(" ", "_", names(wine_test))
names(wine_test)
 [1] "fixed_acidity"        "volatile_acidity"     "citric_acid"         
 [4] "residual_sugar"       "chlorides"            "free_sulfur_dioxide" 
 [7] "total_sulfur_dioxide" "density"              "pH"                  
[10] "sulphates"            "alcohol"             
# Change values in Quality column: "B" = 0 & "G" = 1
wine_train <- wine_train %>% 
  mutate(Quality = ifelse(Quality == "B", 0, 1))

# Observe number of 0s and 1s
table(wine_train$Quality)

  0   1 
425 374 


Feature Selection

I first wanted to select the relevant and useful features by means of feature selection in the caret package, a popular R package for statistical machine learning. This tutorial got me started: https://machinelearningmastery.com/feature-selection-with-the-caret-r-package/

# Feature Selection #1
set.seed(7) # Bring me luck
train_cor <- cor(wine_train[, -length(names(wine_train))])

# summarize the correlation matrix
print(train_cor)
                     fixed_acidity volatile_acidity citric_acid residual_sugar
fixed_acidity         1.0000000000      -0.30241919  0.69369727     0.17320660
volatile_acidity     -0.3024191923       1.00000000 -0.54708405    -0.03046016
citric_acid           0.6936972747      -0.54708405  1.00000000     0.13322604
residual_sugar        0.1732066025      -0.03046016  0.13322604     1.00000000
chlorides            -0.0003612805      -0.01426396  0.19704211    -0.02939710
free_sulfur_dioxide  -0.1562296716       0.03287532 -0.04284979     0.17117834
total_sulfur_dioxide -0.2105196234       0.08831363 -0.01356392     0.14129767
density               0.7293551024      -0.13418708  0.44405907     0.39166663
pH                   -0.6865541686       0.26154512 -0.56343133    -0.05858156
sulphates             0.1618399522      -0.26807580  0.28944979     0.02657419
alcohol               0.1261203673      -0.08160458  0.18895405     0.19244357
                         chlorides free_sulfur_dioxide total_sulfur_dioxide
fixed_acidity        -0.0003612805        -0.156229672         -0.210519623
volatile_acidity     -0.0142639627         0.032875319          0.088313629
citric_acid           0.1970421057        -0.042849793         -0.013563916
residual_sugar       -0.0293970980         0.171178339          0.141297672
chlorides             1.0000000000         0.001938843          0.022849048
free_sulfur_dioxide   0.0019388434         1.000000000          0.730655240
total_sulfur_dioxide  0.0228490477         0.730655240          1.000000000
density               0.0754952020        -0.033797520         -0.085640019
pH                   -0.2466119148         0.086088169          0.009654689
sulphates             0.4123789331         0.050006477          0.054326009
alcohol              -0.1500365030        -0.019738532         -0.112954461
                         density           pH   sulphates     alcohol
fixed_acidity         0.72935510 -0.686554169  0.16183995  0.12612037
volatile_acidity     -0.13418708  0.261545116 -0.26807580 -0.08160458
citric_acid           0.44405907 -0.563431327  0.28944979  0.18895405
residual_sugar        0.39166663 -0.058581563  0.02657419  0.19244357
chlorides             0.07549520 -0.246611915  0.41237893 -0.15003650
free_sulfur_dioxide  -0.03379752  0.086088169  0.05000648 -0.01973853
total_sulfur_dioxide -0.08564002  0.009654689  0.05432601 -0.11295446
density               1.00000000 -0.379103361  0.13497642 -0.18361374
pH                   -0.37910336  1.000000000 -0.28483760  0.12470433
sulphates             0.13497642 -0.284837597  1.00000000  0.09629489
alcohol              -0.18361374  0.124704335  0.09629489  1.00000000
# find attributes that are highly corrected (ideally >0.75)
high_cor <- findCorrelation(train_cor, cutoff=0.5)

# print indexes of highly correlated attributes
print(high_cor)
[1] 1 3 7
  • Index 1 = fixed_acidity
  • Index 2 = citric_acid
  • Index 3 = total_sulfur_dioxide

Model Fitting

Since fixed_acidity, citric_acid and total_sulfur_dioxide are highly correlated (redundant), I only used one of these features (total_sulfur_dioxide) and disposed of the two redundant ones (fixed_acidity, citric_acid). At this point, I formulated a hypothesis: a model without redundant features performs better than a model with redundant features. Let’s find out if this is true.

Since the target variable is binary, I fit a logistic regression.

# Logistic Regression
wine_train$Quality <- factor(wine_train$Quality, levels = c(0, 1))
train_log <- createDataPartition(wine_train$Quality, p=0.6, list=FALSE)
training <- wine_train[train_log, ]
testing <- wine_train[ -train_log, ]

# Hypothesis: Try logistic regression with all the predictor variables
mod_log <- train(Quality ~ .,  data=training, method="glm", family="binomial")

exp(coef(mod_log$finalModel))
         (Intercept)        fixed_acidity     volatile_acidity 
        5.228352e+52         1.362305e+00         4.014862e-02 
         citric_acid       residual_sugar            chlorides 
        1.273078e-01         1.032885e+00         2.292574e-01 
 free_sulfur_dioxide total_sulfur_dioxide              density 
        1.047235e+00         9.684034e-01         1.807645e-58 
                  pH            sulphates              alcohol 
        1.667995e+00         1.355519e+01         2.251100e+00 
pred <- predict(mod_log, newdata=testing)
accuracy <- table(pred, testing$Quality)
sum(diag(accuracy))/sum(accuracy)
[1] 0.7147335
# Hypothesis: Try logistic regression without the redundant variables
# Try logistic regression without highly correlated variables
mod_log_2 <- train(Quality ~ volatile_acidity + residual_sugar + chlorides + free_sulfur_dioxide + total_sulfur_dioxide + density + pH + sulphates + alcohol,  data=training, method="glm", family="binomial")

pred_2 <- predict(mod_log_2, newdata = testing)
accuracy_2 <- table(pred_2, testing$Quality)
sum(diag(accuracy_2))/sum(accuracy_2)
[1] 0.7053292

The first hypothesis yields an accuracy rate of 71.5%! while the first hypothesis yields 70.8%! Apparently, including all the variables yields higher accuracy.

Cross Validation

At this point, I looked on the tutorial page for the caret package to learn how to cross validate. I learned about trainControl, a function “used to specify the type of resampling”. The parameter, method, specifies repeatedcv, which stands for repeated cross validation.

# Cross validation on the second model where I took out the redundant variables
ctrl <- trainControl(method = "repeatedcv", repeats = 10)

# Train logistic regression model
mod_log_2_ctrl <- train(Quality ~ volatile_acidity + residual_sugar + chlorides + free_sulfur_dioxide +
                     total_sulfur_dioxide + density + pH + sulphates + alcohol, data=training, 
                     trControl = ctrl, method="glm", family="binomial")
pred_2_ctrl <- predict(mod_log_2, newdata = testing)
accuracy_2_ctrl <- table(pred_2, testing$Quality)
sum(diag(accuracy_2_ctrl))/sum(accuracy_2_ctrl)
[1] 0.7053292

I got the same accuracy, which means that I didn’t use cross validation properly…I’ll have to learn more.

Advanced Models

In an effort to achieve a higher accuracy score, I looked for more accurate and powerful models, such as XgBoost, Random Forests, etc.

# XgBoost -----------------------------------------------------------------
mod_xgboost <- train(Quality ~ ., data=training, 
      trControl = ctrl, method="xgbTree", family="binomial")
pred_xgboost <- predict(mod_xgboost, newdata = testing)
acc_xgboost <- table(pred_xgboost, testing$Quality)
sum(diag(acc_xgboost))/sum(acc_xgboost)
# 70.8%

# Random Forest-----------------------------------------------------------------
mod_rf <- train(Quality ~ volatile_acidity + residual_sugar + chlorides + free_sulfur_dioxide +
                  total_sulfur_dioxide + density + pH + sulphates + alcohol, data=training, 
                trControl = ctrl, method="rf", family="binomial")
pred_rf <- predict(mod_rf, newdata = testing)
acc_rf <- table(pred_rf, testing$Quality)
sum(diag(acc_rf)) / sum(acc_rf)
# 80.6% is an improvement!

# Logit Boost-----------------------------------------------------------------
mod_logit <- train(Quality ~ volatile_acidity + residual_sugar + chlorides + free_sulfur_dioxide +
                  total_sulfur_dioxide + density + pH + sulphates + alcohol, data=training, 
                trControl = ctrl, method="LogitBoost")
pred_logit <- predict(mod_logit, newdata = testing)
acc_logit <- table(pred_logit, testing$Quality)
sum(diag(acc_logit)) / sum(acc_logit)
# 72.1%

# svmRadial-----------------------------------------------------------------
mod_svm <- train(Quality ~ volatile_acidity + residual_sugar + chlorides + free_sulfur_dioxide +
                     total_sulfur_dioxide + density + pH + sulphates + alcohol, data=training, 
                   trControl = ctrl, method="svmRadial")
pred_svm <- predict(mod_svm, newdata = testing)
acc_svm <- table(pred_svm, testing$Quality)
sum(diag(acc_svm)) / sum(acc_svm)
# 75.2%

# LMT-----------------------------------------------------------------
mod_svm_linear <- train(Quality ~ volatile_acidity + residual_sugar + chlorides + free_sulfur_dioxide +
                   total_sulfur_dioxide + density + pH + sulphates + alcohol, data=training, 
                 trControl = ctrl, method="svmLinearWeights2")
pred_svm_linear <- predict(mod_svm_linear, newdata = testing)
acc_svm_linear <- table(pred_svm_linear, testing$Quality)
sum(diag(acc_svm_linear)) / sum(acc_svm_linear)

Conclusion

I learned about a totally new field in machine learning. Importantly, is very interesting and motivating since coming up with machine learning models feels like creating and refining a crystal ball that shows the future. In the future, I plan on reading through this comprehensive tutorial of the caret package, take machine learning courses on DataCamp, and hope to learn from the mistakes I made during this hackathon.