Introduction

Previously, I had participated in predicting whether individual passengers would survive based on certain information about them. After a fair bit of work (and following several tutorials), I achieved a score of 0.80, which meant I predicted 80% of the private dataset correctly. I was excited to participate in Sberbank's price prediction competition because:

  • it was small enough to allow me to work with R
  • the test set was ahead of the training set in time (like a real forecast)

Unfortunately, I lost sight of the “forecasting” part. It might have been appropriate to treat the data as a time series and use the bsts or xts packages for forecasting, but I defaulted to xgboost and never really tried anything else.
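For what it's worth, the time-series route I skipped might have looked something like this. This is just a toy sketch using base R's arima() on a made-up series as a stand-in for the fancier bsts approach, not anything I actually ran on the competition data:

```r
## Toy sketch of the forecasting approach I never tried.
## Base R arima() stands in for bsts/xts here.
set.seed(42)
y <- cumsum(rnorm(100)) + 50        # a random-walk "price" series
fit <- arima(y, order = c(1, 1, 0)) # AR(1) on first differences
fc <- predict(fit, n.ahead = 12)    # forecast the next 12 periods
length(fc$pred)                     # 12 point forecasts
```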

Load Data

Here I load and clean some data. The first time I loaded the file, a whole bunch of information was being concatenated together; being specific about the column types solved the problem. I am also loading fixes for bad data.

library(readxl)

## Functions
percent_missing <- function(x){sum(is.na(x))/length(x)*100}
mod_int <- function(x) {as.integer(x) - 1}

## Load Training Data
train_types = c("numeric","date",
                rep("numeric", 9),
                rep("text", 2),
                rep("numeric", 16),
                rep("text", 1),
                rep("numeric", 3),
                rep("text", 8),
                rep("numeric", 65),
                rep("text", 1),
                rep("numeric", 7),
                rep("text", 1),
                rep("numeric", 3),
                rep("text", 1),
                rep("numeric", 33),
                rep("text", 1),
                rep("numeric", 139))

train_data <- read_excel(path = paste0(getwd(),"/train.xlsx"),
                         sheet = 1,
                         col_types = train_types,
                         na = "",trim_ws = TRUE)

## Load Testing Data
test_types = c("numeric","date",
               rep("numeric", 9),
               rep("text", 2),
               rep("numeric", 16),
               rep("text", 1),
               rep("numeric", 3),
               rep("text", 8),
               rep("numeric", 73),
               rep("text", 1),
               rep("numeric", 37),
               rep("text", 1),
               rep("numeric", 138))

test_data <- read_excel(path = paste0(getwd(),"/test.xlsx"),
                        sheet = 1,
                        col_types = test_types,
                        na = "",trim_ws = TRUE)

## Read in the fixes
fix_data <- read_excel(path = paste0(getwd(),"/BAD_ADDRESS_FIX.xlsx"),
                       sheet = 1,
                       na = "",
                       trim_ws = TRUE)

## read in train lat lon
train_lat_lon <- read.table(file = paste0(getwd(),"/train_lat_lon.csv"),
                            header = TRUE,
                            sep = ",",
                            strip.white = TRUE)

## read in test lat lon
test_lat_lon <- read.table(file = paste0(getwd(),"/test_lat_lon.csv"),
                           header = TRUE,
                           sep = ",",
                           strip.white = TRUE)
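As a quick sanity check, the two helper functions defined above behave like this on toy input (redefined here so the snippet stands alone):

```r
## Helper functions from above, demonstrated on toy data
percent_missing <- function(x) { sum(is.na(x)) / length(x) * 100 }
mod_int <- function(x) { as.integer(x) - 1 }

percent_missing(c(1, NA, 3, NA))      # 50: half the values are NA
mod_int(factor(c("no", "yes", "no"))) # 0 1 0: zero-based level codes
```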

Accounting

Here I’m just trying to keep track of a couple of different things, so these variables are just, well, accounting.

fix_data_cols <- names(fix_data)
fix_data_id <- fix_data$id
## save response
response <- train_data$price_doc
train_data$price_doc <- NULL
train_data$kaggle <- "TRAIN"
test_data$kaggle <- "TEST"

## row-bind the train and test sets
the_data <- rbind(train_data, test_data)
## Save ids
all_ids <- the_data$id

Substitute for Bad Data

There has to be a better way to do this. I only found this fixed data on the day the competition was ending, so I had to write something fast to update my data. This just goes through the dataframe and performs substitutions where needed. Part of what is happening here is that I'm turning a 174-level factor column into numeric x and y columns. I was hoping this would be more helpful.
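Incidentally, the row-by-row loop below can be replaced with a vectorized match(). Here is a toy sketch of the idea; the frame and column names are illustrative, not the competition data:

```r
## Vectorized substitution: overwrite rows whose id appears in the fixes
rows  <- data.frame(id = 1:4, x = c(10, 20, 30, 40))
fixes <- data.frame(id = c(2, 4), x = c(99, 98))

idx <- match(rows$id, fixes$id)  # position of each id in fixes (NA if absent)
hit <- !is.na(idx)               # which rows actually have a fix
rows[hit, ] <- fixes[idx[hit], ]
rows$x                           # 10 99 30 98
```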

## Select all the columns which are safe
df1 <- the_data[,which(!(names(the_data) %in% fix_data_cols))]

## Select all the columns which need to be changed
df2 <- the_data[,which(names(the_data) %in% fix_data_cols)]

for (i in df2$id) {
        if (i %in% fix_data_id) {
                df2[which(df2$id == i), ] <- fix_data[which(fix_data$id == i), ]
        }
}

final_data <- cbind(df1, df2)
almost_train <- subset(final_data, kaggle == "TRAIN")
almost_test <- subset(final_data, kaggle == "TEST")
final_train <- merge(almost_train, train_lat_lon, by = "id")
final_test <- merge(almost_test, test_lat_lon, by = "id")
final_train$price_doc <- response
train_id <- final_train$id
test_id <- final_test$id

## Remove non predictive variables
final_train$price_doc <- NULL
final_train$id <- NULL
final_test$id <- NULL
final_train$kaggle <- NULL
final_test$kaggle <- NULL

Continue Cleaning

Here I turn the numeric “tolerance” level into a factor column, then evaluate the “missingness” of all the data. I kept columns with no more than 14% of their rows missing, which may have been too lenient; conventional wisdom suggests that anything above 5% is bad. Also, somewhere in my hurried, convoluted process some of the numeric data became factors, so I had to fix that.
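The tolerance recoding can also be written as a vectorized named-lookup instead of the explicit loop I used below. A base-R sketch on toy values (note it maps anything unmatched, including NA, to "BAD", mirroring the else branch):

```r
## Vectorized version of the tolerance recoding (toy values)
tolerance_m <- c(5, 10, 25, 100)
lookup <- c("5" = "BEST", "10" = "GOOD", "25" = "OKAY")
tolerance_cat <- unname(lookup[as.character(tolerance_m)])
tolerance_cat[is.na(tolerance_cat)] <- "BAD"   # everything unmatched is BAD
tolerance_cat                                  # "BEST" "GOOD" "OKAY" "BAD"
```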

library(dplyr)  ## for the %>% pipe
library(tidyr)  ## for separate()

## row-bind the datasets again
the_data <- rbind(final_train, final_test)

## remove key column
the_data$key <- NULL
the_data$tolerance_cat <- "filler"

for ( i in seq(1, nrow(the_data))) {
        if (the_data[i, "tolerance_m"] == 5) {
            the_data[i, "tolerance_cat"] <- "BEST"
        } else if ( the_data[i, "tolerance_m"] == 10) {
            the_data[i, "tolerance_cat"] <- "GOOD"
        } else if ( the_data[i, "tolerance_m"] == 25) {
            the_data[i, "tolerance_cat"] <- "OKAY"
        } else {
            the_data[i, "tolerance_cat"] <- "BAD"
        }
}

## delete tolerance_m
the_data$tolerance_m <- NULL

## delete sub_area
sub_area <- the_data$sub_area
the_data$sub_area <- NULL

## Create date based columns
timestamp <- the_data$timestamp

## Create month and year categories
the_data <- the_data %>%
separate(timestamp, c("year_cat","month_cat","day_cat"), sep = "-")

## tag year and month as categorical strings
the_data$year <- paste0(the_data$year_cat, "_cat")
the_data$month <- paste0(the_data$month_cat, "_cat")

## Remove certain columns
the_data$timestamp <- NULL

## Determine percent missing for each column
pm = apply(the_data,2,percent_missing)

## Keep only intact columns
sub_data <- the_data[,pm <= 14]

## Some weird problem where numeric stuff got turned into factors.
tf <- grepl("cafe",names(sub_data))
list_of_names <- names(sub_data)[tf]
list_of_names <- c(list_of_names, "prom_part_5000")

for (i in list_of_names) {
        ## as.numeric() on a factor returns level codes, so go
        ## through character first to recover the actual values
        sub_data[[i]] <- as.numeric(as.character(sub_data[[i]]))
}
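That “weird problem” is a classic R trap: converting a factor directly to numeric returns the level codes, not the underlying values. A minimal demonstration:

```r
## Why the numeric-became-factor mixup bites: level codes vs. values
f <- factor(c("10", "5", "30"))
as.numeric(f)                # 1 3 2  (alphabetical level codes)
as.numeric(as.character(f))  # 10 5 30 (the actual values)
```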

Perform PCA

Pre-processing with PCA, yay! caret's preProcess() function performs a series of steps on the numeric data:

  1. It checks for correlations between columns
  2. It centers the columns
  3. It scales the columns
  4. It imputes using 5 Nearest Neighbors
  5. It performs Principal Component Analysis
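The center/scale/PCA steps in that list can be sketched with base R's prcomp(); caret's preProcess() wraps the same ideas, plus the correlation filter and kNN imputation:

```r
## Center, scale, and rotate toy data into principal components
set.seed(1)
m <- matrix(rnorm(100), ncol = 5)            # 20 rows, 5 numeric columns
p <- prcomp(m, center = TRUE, scale. = TRUE)
dim(p$x)                                     # 20 5: scores in PC space
cumsum(p$sdev^2) / sum(p$sdev^2)             # cumulative variance explained
```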

I then take the character columns and one-hot encode them into dummy variables. The lubridate package is used to create a couple of features from the timestamp, and the data is split into training and testing sets once again.
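The one-hot step can be illustrated with base R's model.matrix(); dummyVars() in caret does essentially this across every factor column at once. A toy example:

```r
## One-hot encoding a single character column, base R style
chars <- data.frame(color = c("red", "blue", "red"))
dmy <- model.matrix(~ color - 1, data = chars)  # -1 drops the intercept
colnames(dmy)  # "colorblue" "colorred"
dmy            # one 0/1 indicator column per level
```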

library(caret)      ## preProcess(), dummyVars()
library(lubridate)  ## ymd(), ddays(), dweeks()

## Perform PCA on the original numeric data (first)
pca_data <- preProcess(x = sub_data,
                       method = c("corr", "center", "scale", "knnImpute","pca"),
                       k = 5,
                       knnSummary = median,
                       verbose = TRUE)
pca_sub <- predict(pca_data, sub_data)

## Separate numeric data from character data
data_char <- unlist(lapply(pca_sub, is.character))
data_num <- unlist(lapply(pca_sub, is.numeric))

## One hot encode to dummy variables
character_data <- as.data.frame(lapply(pca_sub[,data_char],as.factor))
dmy <- dummyVars(formula = "~ .", data = character_data)
dmy_data <- as.data.frame(predict(dmy, character_data))
final_data <- cbind(pca_sub[,data_num], dmy_data)

## Create time data
final_data$DAY <- (ymd(timestamp) - min(ymd(timestamp)))/ddays(1)
final_data$WEEK <- (ymd(timestamp) - min(ymd(timestamp)))/dweeks(1)

## Make data sets
train_data <- final_data[1:length(train_id),]
test_data <- final_data[(1+length(train_id)):(length(train_id) + length(test_id)),]

## Attach response to training set
train_data$price_doc <- response

Variable Importance

Now that I have my “better” data, I run it through xgboost and look at the columns/features/variables that it considers to be the most important. I am using caret for this, and because this is an embarrassingly parallel problem, I parallelize it!

I use ten-fold cross-validation (maybe I should have used more folds?) and a custom tuning grid for xgboost. Some of the parameters I pulled from a notebook (I forget which one). I then select the top 65 features according to xgboost and make training/testing sets with only those features.

One further note: I am taking the natural log of the response. My understanding is that my RMSE metric would more closely approximate the RMLSE that the competition uses by doing this. Not entirely sure if it helped!
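The log trick can be checked numerically: since log1p(y) is just log(y + 1), the RMSE computed on transformed values is exactly the RMSLE on the original scale (toy numbers below, not competition data):

```r
## RMSE on log(y + 1)-transformed values equals RMSLE on the raw values
y_true <- c(100, 200, 300)
y_pred <- c(110, 190, 320)
rmsle        <- sqrt(mean((log(y_pred + 1) - log(y_true + 1))^2))
rmse_on_logs <- sqrt(mean((log1p(y_pred) - log1p(y_true))^2))
all.equal(rmsle, rmse_on_logs)  # TRUE
```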

library(caret)
library(xgboost)
library(parallel)
library(doParallel)

## Transform with log
original_response <- train_data$price_doc
train_data$price_doc <- log(train_data$price_doc + 1)

## Set tune control
fit_control <- trainControl(method="cv",
                            number=10,
                            savePredictions = "final",
                            allowParallel = TRUE,
                            verboseIter = TRUE)

xgb_grid <- expand.grid(nrounds = c(25, 50, 75),
                        lambda = 0.1,
                        alpha = c(0.00005, 0.000075, 0.0001, 0.0005),
                        eta = c(0.15, 0.075, 0.05))

## Create cluster
cluster <- makeCluster(detectCores() - 2)
registerDoParallel(cluster)

## Train caret list
set.seed(100)
ptm <- proc.time()
for_imp <- train(x = train_data[,-which(names(train_data)=="price_doc")],
                y = train_data$price_doc,
                trControl = fit_control,
                metric = "RMSE",
                method = "xgbLinear",
                tuneGrid = xgb_grid,
                max_depth = 5,
                subsample = 0.7,
                colsample_bytree = 0.7)

## Stop Cluster
stopCluster(cluster)
registerDoSEQ()
proc.time() - ptm

## Keep the top 65 features by importance
V <- varImp(for_imp)
imp <- V$importance
row_names <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:65]
trn <- train_data[,row_names]
trn$price_doc <- log(original_response + 1)
tst <- test_data[,row_names]

Create an Ensemble

I have a confession: I only learned how to use the caretEnsemble package within the last few days of the competition. For the longest time, I would create my model list using caretList() and then get an error when calling caretEnsemble() on the list. I finally figured out the answer: use savePredictions = TRUE in trainControl(), and make each list name identical to the method given in caretModelSpec(). The tutorial I had followed used a custom name, but a custom name confuses the inner workings of caretEnsemble(). The following code works.

## Set tune control
fit_control <- trainControl(method="cv",
                           number=10,
                           savePredictions = TRUE,
                           index=createMultiFolds(trn$price_doc, k=2),
                           allowParallel = TRUE,
                           verboseIter = TRUE)

xgb_grid <- expand.grid(nrounds = c(25, 50, 75),
                        lambda = 0.1,
                        alpha = c(0.00005, 0.000075, 0.0001, 0.0005),
                        eta = c(0.15, 0.075, 0.05))

## Execute in Parallel
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)

## Train caret list
set.seed(100)
ptm <- proc.time()
model_list <- caretList(x = trn[,-which(names(trn)=="price_doc")],
                        y = trn$price_doc,
                        trControl = fit_control,
                        metric = "RMSE",
                        tuneList = list(knn = caretModelSpec(method = "knn",
                                        tuneLength = 3),
                                        xgbLinear = caretModelSpec(method = "xgbLinear",
                        tuneGrid = xgb_grid,
                        max_depth = 5,
                        subsample = 0.7,
                        colsample_bytree = 0.7)))

## Stop Cluster
stopCluster(cluster)
registerDoSEQ()
proc.time() - ptm

## Generate the linear ensemble
model_ensemble <- caretEnsemble(model_list,
                                metric = "RMSE",
                                trControl = trainControl(number = 5,
                                                         savePredictions = TRUE))

Generate Predictions

Finally, I generate predictions and create a submission file. It's obviously very important to transform the predictions back to the original scale!

## Generate predictions for each model
P <- predict(model_ensemble, newdata = tst)

## Grab submission id from submission example
sub_id <- read.table(file = paste0(getwd(),"/sample_submission.csv"),
                    header = TRUE,
                    sep = ",",
                    strip.white = TRUE)

## Write table
test_id <- sub_id$id
pred <- exp(P) - 1
for_submission <- as.data.frame(cbind(test_id, pred))
names(for_submission) <- c("id","price_doc")
write.table(for_submission,
            file = paste0(getwd(),"/submission.csv"),
            sep = ",",
            row.names = FALSE)