I enjoy hiking and pretending that I’m one with nature. I have only realized rather recently that I have no idea what kinds of trees I pass as I amble about the trail. But I no longer have to wonder, because there’s a wonderful dataset on Kaggle where I can classify these large plants based on their leaves. The competition is found here.
It seems that this problem lends itself to XGBoost most readily. Here is what I did. First, I loaded the relevant libraries and imported the data:
```r
# Load libraries
library(caret)
library(xgboost)

## Read .csv file
train.data <- read.csv(
  file = paste0(getwd(), "/train.csv"),
  sep = ",",
  strip.white = TRUE,
  header = TRUE
)
```
Then I had to reshape the data to work readily with XGBoost. Because XGBoost only accepts numeric inputs, I converted the species factor levels into integers. I was also under the impression that XGBoost begins counting classes at 0, so I subtracted 1 from the entire vector of responses. Then I converted it back to a factor so that CARET wouldn't do anything too weird when I used the "species" response to index my training and testing data sets.
```r
## Transform the response: factor -> zero-based integer codes -> factor
train.data$species <- as.factor(as.integer(as.factor(train.data$species)) - 1)
```
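To see what that one-liner does, here it is on a toy vector (the two species names are just examples; any character vector behaves the same way):

```r
## Toy illustration of the recoding
species <- c("Acer_Opalus", "Quercus_Ilex", "Acer_Opalus")

## as.factor() sorts the names, as.integer() gives 1-based codes,
## subtracting 1 makes them 0-based, and the outer as.factor() keeps
## CARET treating the response as classification rather than regression
recoded <- as.factor(as.integer(as.factor(species)) - 1)

levels(recoded)  # "0" "1" -- zero-based class codes, as XGBoost expects
```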
And here is my attempt at training with CARET. There are additional options one can pass to CARET to make your "tuning" session more effective.
```r
# Create training and testing data
leafIndex <- createDataPartition(
  y = train.data$species,
  p = 0.8,
  list = FALSE
)
training <- train.data[leafIndex, -1]
testing  <- train.data[-leafIndex, -1]

## Set control function for the tuning session
tuneControl <- trainControl(
  method = "cv",
  number = 3,
  verboseIter = TRUE
)

## Train!
set.seed(1)
ptm <- proc.time()
xgboost.caret <- train(
  x = training[, -1],
  y = as.integer(training$species) - 1,
  method = "xgbTree",
  num_class = 99,
  objective = "multi:softmax",
  verbose = 1,
  trControl = tuneControl
)
proc.time() - ptm
```
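Those "additional options" include handing `train()` your own grid of candidate parameters instead of the default one. A minimal sketch, assuming the tuning-parameter names CARET's `"xgbTree"` method expects (newer versions of CARET also require a `subsample` column); the particular values here are illustrative, not recommendations:

```r
## Hypothetical custom grid -- column names must match xgbTree's tuning parameters
xgbGrid <- expand.grid(
  nrounds          = c(100, 150, 200),
  max_depth        = c(1, 3, 5),
  eta              = c(0.1, 0.3),
  gamma            = 0,
  colsample_bytree = c(0.8, 0.9),
  min_child_weight = 1,
  subsample        = 1
)
nrow(xgbGrid)  # 36 candidate combinations, each cross-validated by train()
```

You would then pass `tuneGrid = xgbGrid` to the `train()` call above; with 3-fold CV that means 36 × 3 model fits, so grid size is the main knob on run time.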
This indicates that the most effective parameters (from the defaults) are:
- nrounds = 150
- max_depth = 1
- eta = 0.3
- gamma = 0
- colsample_bytree = 0.9
- min_child_weight = 1

after running for 1002.25 seconds.
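These values don't have to be read off the console log: CARET stores the winning combination in the fitted object's `bestTune` data frame. A stand-in sketch (the data frame below is hand-built with the values above so the snippet is self-contained; a real `train()` call populates it for you):

```r
## Stand-in for the fitted object -- train() fills in bestTune automatically
xgboost.caret <- list(
  bestTune = data.frame(
    nrounds = 150, max_depth = 1, eta = 0.3, gamma = 0,
    colsample_bytree = 0.9, min_child_weight = 1
  )
)

## Pull individual tuned values by name instead of copying them by hand
xgboost.caret$bestTune$eta        # the tuned learning rate
xgboost.caret$bestTune$max_depth  # the tuned tree depth
```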
The accuracy can be assessed with:
```r
## Generate predictions and assess accuracy
pred <- predict(xgboost.caret, newdata = testing[, -1])
confusionMatrix(pred, testing$species)
```
which comes out to 90.91%, with a 95% confidence interval of (86.01%, 94.52%). Of course, this doesn't mean we are guaranteed to score well; we need to see what Kaggle says!
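A quick sanity check on that interval: with 99 species at 10 rows each, an 80% split leaves 198 held-out rows, so 90.91% accuracy corresponds to 180 correct. The 180-of-198 figures are my back-calculation, not taken from the original run, but they reproduce both numbers via base R's `binom.test` (the Clopper-Pearson interval, which matches what `confusionMatrix` reports):

```r
## Back-calculated: 198 held-out rows (99 species x 2), 180 classified correctly
ci <- binom.test(x = 180, n = 198)$conf.int

round(180 / 198, 4)  # 0.9091
round(ci, 4)         # roughly (0.8601, 0.9452)
```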
So we run xgboost one more time on all the data with the hyperparameters that we found above, this time with the multi:softprob objective, since the submission format wants per-class probabilities rather than hard class labels.
```r
## Now redo the fitting with xgboost by itself, with fitted parameters, with the
## softprob objective
train.data$id <- NULL
train.data$species <- as.integer(as.factor(train.data$species)) - 1
train.matrix <- xgb.DMatrix(
  data  = as.matrix(train.data[, -1]),
  label = train.data$species
)
xgb.final.fit <- xgb.train(
  data = train.matrix,       # the DMatrix already carries the labels
  max_depth = 1,
  eta = 0.3,
  gamma = 0,
  min_child_weight = 1,
  colsample_bytree = 0.9,    # matches the tuned colsample_bytree above
  nrounds = 150,
  num_class = 99,
  objective = "multi:softprob"
)
```
Generate our final predictions from the testing data:
```r
## Generate predictions
test.data <- read.csv(
  file = paste0(getwd(), "/test.csv"),
  sep = ",",
  strip.white = TRUE,
  header = TRUE
)
xgb.prediction <- predict(xgb.final.fit, as.matrix(test.data[, -1]))
```
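One quirk worth knowing before formatting: with multi:softprob, `predict()` returns a single flat vector of length `nrow(test.data) * 99`, with each row's 99 class probabilities laid out consecutively. That is why the `byrow = TRUE` reshape in the next step works. A toy illustration with made-up numbers (2 rows, 3 classes):

```r
## Hypothetical flat vector the way predict() lays it out for multi:softprob:
## all of row 1's class probabilities, then all of row 2's, and so on
flat <- c(0.7, 0.2, 0.1,   # row 1
          0.1, 0.3, 0.6)   # row 2

probs <- matrix(flat, ncol = 3, byrow = TRUE)

probs[1, ]      # class probabilities for the first test row
rowSums(probs)  # each row sums to 1
```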
Format and submit!
```r
## Format
result <- matrix(xgb.prediction, ncol = 99, byrow = TRUE)
result.to.save <- data.frame(result)
result.to.save$id <- test.data$id
result.to.save <- result.to.save[, c(100, 1:99)]
## names.of.species holds the 99 species names in factor-level order
## (e.g. saved via levels() before the response was recoded to integers)
names(result.to.save) <- c("id", names.of.species)

## Write to file (gives score of 0.55 or #108 as of Friday night)
write.table(
  x = result.to.save,
  file = paste0(getwd(), "/submit_16Sept2016.csv"),
  sep = ",",
  col.names = TRUE,
  row.names = FALSE
)
```
This gives a score of 0.55, which in the middle of September put me at #108. That is no longer the case. But it was fun learning to use XGBoost with CARET!