I enjoy hiking and pretending that I’m one with nature. I realized only recently that I have no idea what kinds of trees I pass as I amble about the trail. But I no longer have to wonder, because there’s a wonderful dataset on Kaggle where I can classify these large plants based on their leaves. The competition is found here.

It seems that this problem lends itself to XGBoost most readily. Here is what I did. First, I loaded the relevant libraries and imported the data:

# Load libraries
library(caret)
library(xgboost)

## Read .csv file
train.data <- read.csv(
        file = paste0(getwd(),"/train.csv"),
        sep = ",",
        strip.white = TRUE,
        header = TRUE
)

Then, I had to re-formulate the data to work readily with XGBoost. Because XGBoost only accepts numerical inputs, I converted the factor levels of the species labels into integers. XGBoost also begins counting classes at 0, so I subtracted 1 from the entire vector of responses. Then, I converted it back to a factor so that caret wouldn’t do anything too weird when I used the “species” response to index my training and testing data sets.

## Keep the original species names; the submission header needs them later
names.of.species <- levels(as.factor(train.data$species))

## Transform the response
train.data$species <- as.factor(as.integer(as.factor(train.data$species)) - 1)
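As a sanity check on that transform, here is a minimal sketch on a toy vector (the two species names are just illustrative). Factor levels sort alphabetically, so the first name maps to 0, the next to 1, and so on:

```r
## Toy illustration of the 0-based label transform used above.
## Factor levels sort alphabetically, so "Acer_Opalus" -> 0, "Quercus_Rubra" -> 1.
species <- c("Acer_Opalus", "Quercus_Rubra", "Acer_Opalus")
codes <- as.factor(as.integer(as.factor(species)) - 1)
as.character(codes)   # "0" "1" "0"
```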

And here is my attempt at training with caret. There are additional arguments one can pass to trainControl() to make the tuning session more effective.

# Create training and testing data
leafIndex <- createDataPartition(
        y = train.data$species,
        p = 0.8,
        list = FALSE
)

training <- train.data[leafIndex, -1]
testing <- train.data[-leafIndex, -1]

## Set control function for the tuning session
tuneControl <- trainControl(
        method = "cv",
        number = 3,
        verboseIter = TRUE
)

## Train!
ptm <- proc.time()
xgboost.caret <- train(
        x = training[,-1],
        y = training$species,   # keep the response a factor so caret runs classification
        method = "xgbTree",
        verbose = 1,
        trControl = tuneControl
)
proc.time() - ptm

This indicates that the most effective parameters (from caret’s default grid) are:

  • nrounds = 150
  • max_depth = 1
  • eta = 0.3
  • gamma = 0
  • colsample_bytree = 0.9
  • min_child_weight = 1

after running for 1002.25 seconds.
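If you ever want to rerun the session pinned to exactly these values rather than searching caret’s default grid, caret accepts a fixed grid through the tuneGrid argument. A minimal sketch, reusing the training and tuneControl objects from above (the subsample column is my assumption; newer versions of caret require it as a seventh tuning parameter):

```r
## Pin caret to the parameters found above instead of searching its default grid.
## Column names are caret's tuning parameters for method = "xgbTree".
fixedGrid <- expand.grid(
        nrounds = 150,
        max_depth = 1,
        eta = 0.3,
        gamma = 0,
        colsample_bytree = 0.9,
        min_child_weight = 1,
        subsample = 1          # required by newer caret versions; 1 = no subsampling
)

xgboost.fixed <- train(
        x = training[,-1],
        y = training$species,
        method = "xgbTree",
        trControl = tuneControl,
        tuneGrid = fixedGrid
)
```

With a one-row grid, cross-validation still runs but only to estimate accuracy; no search happens.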

The accuracy can be assessed with:

## Generate Predictions and assess accuracy
pred <- predict(xgboost.caret, newdata = testing[,-1])
confusionMatrix(pred, testing$species)

which comes out to 90.91% with a 95% confidence interval of (86.01%, 94.52%). Of course, this doesn’t guarantee a good score; we need to see what Kaggle says!
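Worth noting: Kaggle scores this competition with multiclass log loss on the predicted probabilities, not accuracy, so the two numbers are not directly comparable. A self-contained sketch of that metric (the function name is my own):

```r
## Multiclass log loss: average negative log-probability assigned to the
## true class, with probabilities clipped away from 0 and 1.
## probs:  n x k matrix of class probabilities (rows sum to 1)
## actual: 0-based integer class label for each row
multiclass.logloss <- function(probs, actual, eps = 1e-15) {
        probs <- pmin(pmax(probs, eps), 1 - eps)
        picked <- probs[cbind(seq_len(nrow(probs)), actual + 1)]
        -mean(log(picked))
}

multiclass.logloss(rbind(c(0.9, 0.1), c(0.2, 0.8)), c(0, 1))   # ~0.164
```

A perfect submission scores 0; confident wrong answers are punished heavily, which is why the softprob probabilities matter below.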

So we run xgboost one more time on all the data with the hyperparameters that we found above.

## Now redo the fitting with xgboost by itself, with the fitted parameters and the
## softprob objective
train.data$id <- NULL
train.data$species <- (as.integer(as.factor(train.data$species)) - 1)
train.matrix <- xgb.DMatrix(
        data = as.matrix(train.data[,-1]),
        label = train.data$species
)

xgb.final.fit <- xgb.train(
        data = train.matrix,
        params = list(
                max_depth = 1,
                eta = 0.3,
                gamma = 0,
                colsample_bytree = 0.9,
                min_child_weight = 1
        ),
        nrounds = 150,
        num_class = 99,
        objective = "multi:softprob"
)

Generate our final predictions from the testing data:

## Generate Prediction
test.data <- read.csv(
        file = paste0(getwd(),"/test.csv"),
        sep = ",",
        strip.white = TRUE,
        header = TRUE
)

xgb.prediction <- predict(xgb.final.fit,as.matrix(test.data[,-1]))

Format and submit!

## Format: predict() returns the class probabilities as one long vector,
## so reshape it into a matrix with one row per test leaf
result <- matrix(xgb.prediction, ncol = 99, byrow = TRUE)
result.to.save <- data.frame(result)
result.to.save$id <- test.data$id
result.to.save <- result.to.save[,c(100,1:99)]   # move id to the first column
names(result.to.save) <- c("id", names.of.species)

## Write to File (gives a score of 0.55, or #108 as of Friday night)
write.table(
        x = result.to.save,
        file = paste0(getwd(),"/submit_16Sept2016.csv"),
        sep = ",",
        col.names = TRUE,
        row.names = FALSE
)

This gives a score of 0.55, which in the middle of September put me at #108. That is no longer the case, but it was fun learning to use XGBoost with caret!