Introduction

What is data science? This emerging field is concerned with exploring models of data via the scientific method for the purpose of generating accurate predictions. Often, the data scientist does not know beforehand which algorithm will best fit the data, and must adopt a rigorous training and testing methodology to determine which algorithms are most appropriate. Furthermore, the data scientist must use a cross-validation scheme to properly tune each algorithm, since there is usually no analytical way to determine the tuning parameters in advance.

The website kaggle.com encourages users to model data using these techniques and to submit their predictions, which are then scored and ranked.

This post focuses on developing several models and then ‘ensembling’ them to predict who survived the Titanic disaster of 1912.

Tuning

Creating Datasets

Data describing the passengers of the Titanic was downloaded from Kaggle and cleaned. Cleaning and feature creation were accomplished by sourcing ‘read_in_data.R’, a script written by following a guide by Trevor Stephens. The “training” and “testing” sets were created with createDataPartition() from the caret package.
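The cleaning script itself is not reproduced here. As a rough sketch of the kind of work it does (the file names, engineered features, and imputation below are assumptions based on the Trevor Stephens guide, not the actual contents of ‘read_in_data.R’):

## Sketch of the cleaning/feature-creation step (hypothetical, for illustration only)
train.data <- read.csv("train.csv", stringsAsFactors = FALSE)
test.data  <- read.csv("test.csv",  stringsAsFactors = FALSE)

## Example engineered features: passenger title and family size
train.data$Title      <- gsub("(.*, )|(\\..*)", "", train.data$Name)
train.data$FamilySize <- train.data$SibSp + train.data$Parch + 1

## Simple imputation for missing ages
train.data$Age[is.na(train.data$Age)] <- median(train.data$Age, na.rm = TRUE)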

library(caret)
library(caretEnsemble)

## Load cleaned data
source('read_in_data.R')
## Recode the response as a NO/YES factor (required when classProbs = TRUE)
train.data$Survived <- as.factor(train.data$Survived)
levels(train.data$Survived)[levels(train.data$Survived)=="0"]<-"NO"
levels(train.data$Survived)[levels(train.data$Survived)=="1"]<-"YES"

## Create Partition
set.seed(100)
trainIndex <- createDataPartition(train.data$Survived, 
                    p = 0.85, 
                    list = FALSE)

## Training and testing sets (response plus the predictor columns used for modelling)
trn <- train.data[trainIndex, c(2,4,5,6,7,9,11,12,13,16)]
tst <- train.data[-trainIndex, c(2,4,5,6,7,9,11,12,13,16)]

Fitting Control

caret and caretEnsemble allow a large degree of control in specifying the training methodology. Here, we specify:

  • the cross-validation (“cv”) method for hyper-parameter tuning
  • the number of folds (3)
  • that only the predictions for the final (optimal) tuning parameters will be saved
  • that class probabilities will be computed and summarized with twoClassSummary
  • and the resampling index, which sets the number of times resampling will be performed (15).
## Control for the train function
fit_control <- trainControl(method="cv",
                           number=3,
                           savePredictions = "final", 
                           classProbs=TRUE,
                           index=createResample(trn$Survived, times = 15),
                           summaryFunction = twoClassSummary,
                           verboseIter = FALSE)

Model Fitting

And now caretList() will be used to sequentially run several different models using the same training methodology specified above. Here, the response is “Survived”, a factor with levels “YES” (indicating survival) and “NO” (indicating death). The metric by which the models will be evaluated (and by which the hyper-parameters will be chosen) is the area under the Receiver Operating Characteristic (ROC) curve. This is an appropriate metric because the response is dichotomous; if “Survived” had more than two factor levels, we would have to use a different metric.

library(pROC)
library(randomForest)
library(nnet)
library(kernlab)

set.seed(100)
ptm <- proc.time()
model_list <- caretList(Survived ~ ., 
                        data = trn, 
                        trControl = fit_control,
                        metric = "ROC",
                        tuneList = list(knnrSpec = caretModelSpec(method = "knn", tuneLength = 3),
                                        nnetSpec = caretModelSpec(method = "nnet", tuneLength = 3, trace=FALSE),
                                        svmRSpec = caretModelSpec(method = "svmRadial", tuneLength = 3), 
                                        rdmfSpec = caretModelSpec(method = "rf", tuneLength = 3)))
proc.time() - ptm
##    user  system elapsed 
##   53.34    1.02   54.56

Evaluating 3 candidate values for each hyper-parameter of the 4 models takes approximately 55 seconds.

Model Evaluation

Remember, within each model, cross-validation and ROC curves were used to find the “best” hyper-parameters. The final hyper-parameters for the four models are listed below (they can also be extracted directly from the fitted models, as sketched after the list):

  • K-Nearest Neighbor: Number of Neighbors = 9
  • Neural Network: Size = 1, Decay = 0.1
  • SVM Radial Basis: Sigma = 0.102, Cost = 0.5
  • Random Forest: mtry = 8
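Since each element of the caretList is an ordinary caret train object, these values can be pulled out programmatically, for example:

## Chosen hyper-parameters and resampled performance for each model
lapply(model_list, function(m) m$bestTune)
model_list$nnetSpec$results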

The ROC curves are shown below:
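The plotting code was not shown in the original post; a minimal sketch that draws one curve per model from the hold-out predictions saved by savePredictions = "final" (assuming caret’s default column naming for the class probabilities) is:

## ROC curve for each model, built from the saved resample predictions
par(mfrow = c(2, 2))
for (nm in names(model_list)) {
        p <- model_list[[nm]]$pred
        plot(roc(p$obs, p$YES), main = nm)
}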

These curves suggest that the k-nearest-neighbors model performs worse than the other models.

Model Predictions

Before we get into the ensembling process, we should check the accuracy of each model on our test set. The “Survived” response is a factor with levels “NO” and “YES”, which as.numeric() converts to 1 and 2 respectively, but the output of predict() on the caretList is a class-membership probability, so it is necessary to perform a conversion. The threshold used here is 0.5.

## Class-membership probabilities from each model for the hold-out test set
preds <- as.data.frame(predict(model_list, newdata = tst[,-10]))

## Convert each model's probabilities to the 1/2 coding produced by
## as.numeric(tst$Survived), using a 0.5 threshold, and record the accuracy
## and its confidence interval
con_mat <- data.frame()
for (i in seq(1,4)) {
        a <- rep(1, nrow(tst))
        a[preds[,i] > 0.5] <- 2
        tmp <- confusionMatrix(a, as.numeric(tst$Survived))
        new_row <- data.frame(Accuracy = tmp$overall[[1]],
                              Lower = tmp$overall[[3]],
                              Upper = tmp$overall[[4]]) 
        con_mat <- rbind(con_mat, new_row)
}
con_mat$Model <- c("knn","nnet","svmRadial","rf")
con_mat
##    Accuracy     Lower     Upper     Model
## 1 0.6842105 0.5979815 0.7620240       knn
## 2 0.8045113 0.7268327 0.8681472      nnet
## 3 0.8270677 0.7519028 0.8871035 svmRadial
## 4 0.8270677 0.7519028 0.8871035        rf

All of the accuracies are around 80%, except for k-nearest neighbors, which is noticeably lower (roughly 68%). The accuracy of the random forest is nearly identical to that of the SVM.
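Accuracy depends on the 0.5 threshold, so a threshold-free comparison on the test set may also be informative. A quick sketch (not part of the original analysis) using pROC:

## Test-set area under the ROC curve for each model
sapply(names(preds), function(nm) as.numeric(auc(roc(tst$Survived, preds[[nm]]))))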

Model Correlation

We should check the correlation of the models to determine if they really are suitable for ensembling. This is accomplished with modelCor().

modelCor(resamples(model_list))
##           knnrSpec  nnetSpec  svmRSpec  rdmfSpec
## knnrSpec 1.0000000 0.2751767 0.3155195 0.6490129
## nnetSpec 0.2751767 1.0000000 0.6698466 0.6203163
## svmRSpec 0.3155195 0.6698466 1.0000000 0.6816144
## rdmfSpec 0.6490129 0.6203163 0.6816144 1.0000000

There are definitely some correlations among the models. The k-nearest-neighbors model is the least correlated with the others; the random forest is the most correlated. There are only 15 resampling points per model, however, so the low correlations could simply be because the resampling index passed to trainControl() did not generate enough data.
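One way to check this would be to rebuild the resampling index with more repetitions (the 50 below is an arbitrary choice) and recompute the correlations:

## Re-check model correlations with a larger resampling index (50 is arbitrary)
set.seed(100)
fit_control_50 <- trainControl(method = "cv",
                               number = 3,
                               savePredictions = "final",
                               classProbs = TRUE,
                               index = createResample(trn$Survived, times = 50),
                               summaryFunction = twoClassSummary,
                               verboseIter = FALSE)
model_list_50 <- caretList(Survived ~ .,
                           data = trn,
                           trControl = fit_control_50,
                           metric = "ROC",
                           tuneList = list(knnrSpec = caretModelSpec(method = "knn", tuneLength = 3),
                                           nnetSpec = caretModelSpec(method = "nnet", tuneLength = 3, trace = FALSE),
                                           svmRSpec = caretModelSpec(method = "svmRadial", tuneLength = 3),
                                           rdmfSpec = caretModelSpec(method = "rf", tuneLength = 3)))
modelCor(resamples(model_list_50))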

Ensembling

Now that the models have been tuned, it is time to re-fit them on the entire dataset using the hyper-parameter values found previously. Because the random forest was highly correlated with the other models, it is excluded from the final model_list.

## Control for the train function
## (the resampling index is built from the full training data, matching the
##  data argument passed to caretList below)
control_full_data <- trainControl(savePredictions = "final", 
                           classProbs=TRUE,
                           index=createResample(train.data$Survived, times = 15),
                           summaryFunction = twoClassSummary,
                           verboseIter = FALSE)
set.seed(100)
ptm <- proc.time()
model_full_data <- caretList(Survived ~ ., 
                        data = train.data[,c(2,4,5,6,7,9,11,12,13,16)], 
                        trControl = control_full_data,
                        metric = "ROC",
                        tuneList = list(knnrSpec = caretModelSpec(method = "knn", tuneGrid=data.frame(k=9)),
                                        nnetSpec = caretModelSpec(method = "nnet",trace = FALSE,
                                                                  tuneGrid=data.frame(size=1,decay=0.1)),
                                        svmRSpec = caretModelSpec(method = "svmRadial", 
                                                                  tuneGrid=data.frame(sigma=0.1016,C=0.5)) 
                                        # rdmfSpec = caretModelSpec(method = "rf", 
                                        #                           tuneGrid=data.frame(mtry=8))
                                        ))
proc.time() - ptm
##    user  system elapsed 
##    6.61    0.17    6.80

Ensembling with caretStack()

And now it is time for ensembling! caretEnsemble makes this very easy.

set.seed(100)
model_ensemble <- caretStack(model_full_data, 
                             method = "glm",
                             metric = "ROC",
                             trControl = trainControl(method = "boot",
                                                      number = 5,
                                                      savePredictions="final",
                                                      summaryFunction=twoClassSummary, 
                                                         classProbs = TRUE))
summary(model_ensemble)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3415  -0.5502  -0.4420   0.4918   2.2838  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.85406    0.06829 -41.791  < 2e-16 ***
## knnrSpec     0.79158    0.12768   6.200 5.66e-10 ***
## nnetSpec     3.69496    0.20978  17.613  < 2e-16 ***
## svmRSpec     1.31351    0.20748   6.331 2.44e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8297.9  on 6227  degrees of freedom
## Residual deviance: 5169.9  on 6224  degrees of freedom
## AIC: 5177.9
## 
## Number of Fisher Scoring iterations: 4

Generate Submission

Now that we have the ensemble created, it is time to create a submission. The component-model predictions are combined by hand using the GLM coefficients from the summary above.

## Component-model probabilities for the Kaggle test set
ensemble_prediction <- predict(model_ensemble$models, newdata = test.data[,c(2,4,5,6,7,9,10,11,12,13,15)])

## Apply the GLM coefficients from summary(model_ensemble) by hand;
## this yields the linear predictor (log-odds), not a probability
final_probs = (-2.85406) + 
        (0.79158 * ensemble_prediction[,1]) + 
        (3.69496 * ensemble_prediction[,2]) +
        (1.31351 * ensemble_prediction[,3])

## Rescale the log-odds to the [0, 1] interval
min_f <- min(final_probs)
max_f <- max(final_probs)
normProbs = (final_probs - min_f)/(max_f - min_f)

## Convert to a 1/0 Survived vector using a 0.5 threshold
submission = data.frame(PassengerId=test.data$PassengerId,
                        Survived=rep(0, nrow(test.data)))
submission[normProbs > 0.5,"Survived"] <- 1

## Create submission!
write.table(
        submission,
        file=paste0(getwd(),"/submission.csv"),
        sep=",",
        row.names=FALSE,
        col.names=TRUE
        )
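For reference, caretEnsemble can also produce the stacked prediction directly, which avoids transcribing the GLM coefficients by hand. The exact return value (and which class the probability refers to) varies between package versions, so this is a sketch to verify against the manual approach rather than a drop-in replacement:

## Predict directly from the stacked model
stack_probs <- predict(model_ensemble,
                       newdata = test.data[, c(2,4,5,6,7,9,10,11,12,13,15)],
                       type = "prob")
head(stack_probs)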

This ensemble (using k-nearest neighbors and not random forest) scores 0.79426, which is, unfortunately, not as high as my current score.
