Introduction & Background

Introduction: New York City is known for its abundance of traffic, its high volume of pedestrians and cyclists, and everything else that can clog up a road. Because of this, most people in New York avoid driving themselves and rely on taxis instead. Even without having to drive, the overwhelming street congestion leaves many New York visitors and residents wondering whether they can make it to their destination on time. To help with this problem, our goal was to build a model that predicts the duration of a taxi trip based on its route.

Dataset: The observations in this dataset, recorded in 2016, were collected by the NYC Taxi and Limousine Commission. The dataset contains features that can play a role in trip duration, such as day of the week, time of day, number of passengers, the distance between the pickup and dropoff locations, and more. In addition to predicting trip duration, we will identify the features that influence trip duration more heavily than others.

We load the data into our environment using the read.csv function from base R.

# Create dataframe for training and testing 

# The training set consists of 1458644 observations and 11 features
train_df <- read.csv("train.csv", stringsAsFactors=F, header=T, sep=',')

# The test data consists of 625134 observations and 9 variables
# It is missing trip_duration and dropoff_time
test_df <- read.csv("test.csv", stringsAsFactors=F, header=T, sep=',')

print(names(train_df))
##  [1] "id"                 "vendor_id"          "pickup_datetime"   
##  [4] "dropoff_datetime"   "passenger_count"    "pickup_longitude"  
##  [7] "pickup_latitude"    "dropoff_longitude"  "dropoff_latitude"  
## [10] "store_and_fwd_flag" "trip_duration"

Data Cleaning

Initial Train & Test Data Statistics

To gain a better understanding of the data we are dealing with, we start by looking at the data type of each feature.

# Use str() to output a list of each feature, its data type, and some instances
str(train_df)
## 'data.frame':    1458644 obs. of  11 variables:
##  $ id                : chr  "id2875421" "id2377394" "id3858529" "id3504673" ...
##  $ vendor_id         : int  2 1 2 2 2 2 1 2 1 2 ...
##  $ pickup_datetime   : chr  "2016-03-14 17:24:55" "2016-06-12 00:43:35" "2016-01-19 11:35:24" "2016-04-06 19:32:31" ...
##  $ dropoff_datetime  : chr  "2016-03-14 17:32:30" "2016-06-12 00:54:38" "2016-01-19 12:10:48" "2016-04-06 19:39:40" ...
##  $ passenger_count   : int  1 1 1 1 1 6 4 1 1 1 ...
##  $ pickup_longitude  : num  -74 -74 -74 -74 -74 ...
##  $ pickup_latitude   : num  40.8 40.7 40.8 40.7 40.8 ...
##  $ dropoff_longitude : num  -74 -74 -74 -74 -74 ...
##  $ dropoff_latitude  : num  40.8 40.7 40.7 40.7 40.8 ...
##  $ store_and_fwd_flag: chr  "N" "N" "N" "N" ...
##  $ trip_duration     : int  455 663 2124 429 435 443 341 1551 255 1225 ...

As you can see, our data consists mainly of numeric and categorical variables. However, if we pay special attention to some of the categorical variables, we notice that they should actually be date-time objects. We will address this point later, but for now it might be useful to get summary statistics of the given data.

# Use the summary function to view summary statistics
summary(train_df$trip_duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     397     662     959    1075 3526282

It is important to stop and make sense of the numbers we are seeing here. trip_duration, which is our Y variable, has a minimum value of 1. The Kaggle page hosting this competition tells us that this value is measured in seconds, and it is a fair assumption that no real trip lasts only 1 second. Furthermore, the maximum trip is 3,526,282 seconds, which is approximately 980 hours (about 41 days). Therefore, we have some outliers within our data set. This is the most important result to note from the summary statistics.
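As a quick sanity check on that maximum, here is a minimal sketch of the unit conversion, using the value from the summary output above:

# Sanity check on the extreme trip_duration values (reported in seconds)
max_trip_sec <- 3526282
max_trip_sec / 3600        # ~980 hours
max_trip_sec / (3600 * 24) # ~41 days -- clearly not a real taxi trip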

Clean trip duration

To exclude most of these outliers, given the large range, we decided to omit observations whose trip_duration does not lie within two standard deviations of the mean. Two standard deviations is a common statistical cutoff, which is why we chose it.

# First we calculate the mean and standard deviation of trip_duration
avgTrip <- mean(train_df$trip_duration)
sdTrip <- sd(train_df$trip_duration)

# import dplyr to start applying data set manipulation
library(dplyr)

# omit all observations that lie outside of 2 standard deviations of 
# trip_duration
train_df %>% filter(trip_duration >= (avgTrip - 2*sdTrip)) %>%
             filter(trip_duration <= (avgTrip + 2*sdTrip)) -> train_df

# Re-look at summary statistics
summary(train_df)
##       id              vendor_id     pickup_datetime    dropoff_datetime  
##  Length:1456540     Min.   :1.000   Length:1456540     Length:1456540    
##  Class :character   1st Qu.:1.000   Class :character   Class :character  
##  Mode  :character   Median :2.000   Mode  :character   Mode  :character  
##                     Mean   :1.534                                        
##                     3rd Qu.:2.000                                        
##                     Max.   :2.000                                        
##  passenger_count pickup_longitude  pickup_latitude dropoff_longitude
##  Min.   :0.000   Min.   :-121.93   Min.   :34.36   Min.   :-121.93  
##  1st Qu.:1.000   1st Qu.: -73.99   1st Qu.:40.74   1st Qu.: -73.99  
##  Median :1.000   Median : -73.98   Median :40.75   Median : -73.98  
##  Mean   :1.664   Mean   : -73.97   Mean   :40.75   Mean   : -73.97  
##  3rd Qu.:2.000   3rd Qu.: -73.97   3rd Qu.:40.77   3rd Qu.: -73.96  
##  Max.   :9.000   Max.   : -61.34   Max.   :51.88   Max.   : -61.34  
##  dropoff_latitude store_and_fwd_flag trip_duration    
##  Min.   :32.18    Length:1456540     Min.   :    1.0  
##  1st Qu.:40.74    Class :character   1st Qu.:  397.0  
##  Median :40.75    Mode  :character   Median :  662.0  
##  Mean   :40.75                       Mean   :  836.9  
##  3rd Qu.:40.77                       3rd Qu.: 1073.0  
##  Max.   :43.92                       Max.   :11411.0

Latitude and Longitude

Based on a Google search and other Kaggle submissions, most people define New York City as lying within the following coordinates:

  • Latitude: 40.63, 40.85
  • Longitude: -74.03, -73.75

Comparing these bounds to the summary statistics above, we still have observations outside of this geographic area. Since this project is focused on NYC, we will restrict our training data to lie within this perimeter.

## Clean up coordinates -- only include obs in city limits
train_df %>% filter(pickup_longitude <= -73.75) %>%
             filter(pickup_longitude >= -74.03) %>%
             filter(pickup_latitude <= 40.85) %>%
             filter(pickup_latitude >= 40.63) %>%
             filter(dropoff_longitude <= -73.75) %>% 
             filter(dropoff_longitude >= -74.03) %>%
             filter(dropoff_latitude <= 40.85) %>%
             filter(dropoff_latitude >= 40.63) -> train_df

Change string dates into date-time format

As mentioned earlier, the last cleaning step is to convert the pickup and dropoff dates and times into the correct data type. This is easily accomplished using a combination of lubridate and dplyr.

# import lubridate package
library(lubridate)

# Change format of training date variables into date and time
train_df %>% mutate(pickup_datetime, 
                    pickup_datetime = ymd_hms(pickup_datetime)) %>%
             mutate(dropoff_datetime, 
                    dropoff_datetime= ymd_hms(dropoff_datetime)) -> train_df

# We can then extract just the date from the columns above
train_df %>% mutate(pickup_date = date(pickup_datetime)) %>%
  mutate(dropoff_date= date(dropoff_datetime)) -> train_df

# Repeat the above procedure for the test set
test_df %>% mutate(pickup_datetime, 
                   pickup_datetime = ymd_hms(pickup_datetime)) -> test_df
test_df %>% mutate(pickup_date = date(pickup_datetime)) -> test_df

Data Visualization, Preprocessing and Feature Expansion

Skewness of the Y variable

In this next section, we are going to use some visualization tools to gain a better understanding of the data. First, we want to look at the remaining spread of the Y variable (trip_duration).

# Plot histogram of trip duration with 100 bins
hist(train_df$trip_duration, 
     breaks = 100, main = "Histogram of Given Trip Durations")

As you can see, our data is right-skewed (positive skewness). To account for this, we take the log transformation of the variable. This also gives some intuition as to why the error measure for this competition is RMSLE (root mean squared logarithmic error).
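For reference, here is a minimal sketch (using a hypothetical helper, rmsle) showing that computing RMSE on log(y + 1) is the same quantity as RMSLE on the original durations:

# rmsle() is a hypothetical helper shown only for illustration
rmsle <- function(pred, actual) {
  sqrt(mean((log(pred + 1) - log(actual + 1))^2))
}

# made-up durations (in seconds) to demonstrate the equivalence
actual <- c(300, 600, 1200)
pred   <- c(280, 650, 1000)
rmsle(pred, actual)
sqrt(mean((log1p(pred) - log1p(actual))^2))  # identical value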

# log transform Y variable
train_df %>% mutate(trip_duration, 
                    logTripDuration = log(trip_duration +1)) -> train_df

# display a histogram to view changes
hist(train_df$logTripDuration, 
     breaks = 100, main = "Histogram of Given Log Trip Durations")

The graph above now has a shape comparable to a normal distribution.
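To put a number on that, here is a minimal sketch that computes the sample skewness before and after the transform (a hand-rolled helper rather than an extra package):

# moment-based sample skewness; positive values indicate right skew
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(train_df$trip_duration)    # expected to be strongly positive
skewness(train_df$logTripDuration)  # expected to be much closer to 0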

Since we are dealing with trip data, it is possible that there is seasonality involved. To examine this hypothesis, we can look at the number of trips over time.

## Plot Time Series of Data 
# count the number of trips by date for train and test sets
train_df %>% count(date = pickup_date) -> train_date_counts
test_df %>% count(date = pickup_date) -> test_date_counts

# plot the time series graph for train
plot(train_date_counts, ylim=c(0,max(train_date_counts$n) ),
     main = 'Timeseries of Number of Trips',
     ylab = 'Number of Trips')
lines(train_date_counts,type = 'l', lty = 2, col='blue') 

# fix the prior axis so train and test time-series can be on the same graph
par(new=TRUE)
plot(test_date_counts, axes = F, ylim=c(0,max(train_date_counts$n)))
lines(test_date_counts,type = 'l', lty = 1, col='darkgreen') 

We do notice a distinct drop from late January to early February. However, since the same drop appears in the test set, it would probably be unwise to remove those observations.

Feature expansion on distances

We added 3 features to our model:

  • Haversine Distance
    • Shortest distance between two points on a sphere based on coordinate locations
  • Manhattan Distance
    • Sum of vertical and horizontal distances.
  • Travel Direction (degrees)

We calculate all three features using a combination of dplyr and a package called geosphere, which contains a haversine function that integrates easily with dplyr's syntax.
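For intuition, here is a minimal sketch of the haversine formula itself (haversine_km is a hypothetical helper written only for illustration; the features below rely on geosphere::distHaversine, whose default Earth radius matches the one used here):

# haversine great-circle distance in kilometers (illustrative helper only)
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6378.137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(pmin(1, a)))
}

# roughly Times Square to JFK -- should come out on the order of 20 km
haversine_km(-73.9855, 40.7580, -73.7781, 40.6413)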

# define coordinate boundaries mentioned above in variables for function 
# parameters
city_long_border <- c(-74.03, -73.75)
city_lat_border <- c(40.63, 40.85)

library(ggplot2)
# visualize coordinates using a scatter plot 
ggplot(train_df[1:100000,], aes(x = pickup_latitude, y = pickup_longitude)) +
  geom_point(color = 'blue', size = 0.1) +
  ggtitle("Outline of NYC by Trip Coordinates")

# import geosphere library 
library(geosphere)

# calculate haversine distance for all training observations
train_df %>% mutate(dist_haversine = 
                      distHaversine(cbind(pickup_longitude,
                                          pickup_latitude),
                                    cbind(dropoff_longitude,
                                          dropoff_latitude))/1000) -> train_df

# calculate haversine distance for all test observations
test_df %>% mutate(dist_haversine = 
                     distHaversine(cbind(pickup_longitude
                                         ,pickup_latitude),
                                   cbind(dropoff_longitude,
                                         dropoff_latitude))/1000) -> test_df

# calculate manhattan distance for all training observations
train_df %>% mutate(distance_dummy_manhattan =
                      distHaversine(cbind(pickup_longitude, pickup_latitude),
                                    cbind(dropoff_longitude, pickup_latitude))/1000 +
                      distHaversine(cbind(pickup_longitude, pickup_latitude),
                                    cbind(pickup_longitude, dropoff_latitude))/1000) -> train_df

# calculate manhattan distance for all test observations
test_df %>% mutate(distance_dummy_manhattan =
                     distHaversine(cbind(pickup_longitude, pickup_latitude),
                                   cbind(dropoff_longitude, pickup_latitude))/1000 +
                     distHaversine(cbind(pickup_longitude, pickup_latitude),
                                   cbind(pickup_longitude, dropoff_latitude))/1000) -> test_df

# calculate trip direction for all training observations 
train_df %>% mutate(direction = bearing(cbind(pickup_longitude,
                                              pickup_latitude),
                                        cbind(dropoff_longitude,
                                              dropoff_latitude))) -> train_df

# calculate trip direction for all test observations
test_df %>% mutate(direction = bearing(cbind(pickup_longitude,
                                             pickup_latitude),
                                       cbind(dropoff_longitude,
                                             dropoff_latitude))) -> test_df

Locations and Neighborhoods

We also wanted to expand the data by adding neighborhoods. To do this, we used K-means clustering on the coordinates to create 100 groups.

Importantly, we had to limit the number of samples used to create the clusters because of memory constraints on our laptops. K-means is appropriate here because it is a least-squares optimization problem: it looks for the least-squares partition of the data, producing tight clusters with small within-cluster distances, which we can treat as neighborhoods.
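As a quick illustration of that least-squares objective, here is a minimal sketch on toy coordinates (all names are hypothetical and independent of the data above):

# K-means minimizes the total within-cluster sum of squared distances
# to the cluster centroids; recompute that objective by hand and compare.
set.seed(1)
toy_coords <- cbind(runif(1000, 40.63, 40.85), runif(1000, -74.03, -73.75))
km_toy <- kmeans(toy_coords, centers = 5)
manual_wss <- sum((toy_coords - km_toy$centers[km_toy$cluster, ])^2)
all.equal(manual_wss, km_toy$tot.withinss)  # TRUE: same least-squares criterion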

We can then use the clue package to predict the cluster labels for the remaining observations.

## create Neighborhoods

# merge pickup and dropoff coordinates into a single dataframe
coords <- rbind(cbind(train_df$pickup_latitude,train_df$pickup_longitude),
                cbind(train_df$dropoff_latitude, train_df$dropoff_longitude))

# randomly select 10,000 observations to run kmeans on
randomSamples <- sample(1:nrow(coords),10000)
randomCoords <- coords[randomSamples,]

# apply kmeans clustering to the selected coordinate locations above
km.out <- kmeans(randomCoords, centers = 100)

# import the clue library to predict cluster locations of observations not used
# in kmeans
library(clue)
cl_predict(km.out,newdata = coords)

# predict pickup neighborhoods for training set
train_df$pickup_cluster <- cl_predict(km.out,
                                     cbind(train_df$pickup_latitude,
                                           train_df$pickup_longitude))

# predict pickup neighborhoods for test set
test_df$pickup_cluster <- cl_predict(km.out,cbind(test_df$pickup_latitude,
                                                  test_df$pickup_longitude))

# predict dropoff neighborhoods for training set
train_df$dropoff_cluster <- cl_predict(km.out,cbind(train_df$dropoff_latitude, train_df$dropoff_longitude))

# predict dropoff neighborhoods for test set
test_df$dropoff_cluster <- cl_predict(km.out,cbind(test_df$dropoff_latitude, test_df$dropoff_longitude))

We can now re-visualize the above scatter plot with the newly defined clusters.

# visualize coordinates using scatter plot 
ggplot(train_df[1:500000,], aes(x = pickup_latitude, y = pickup_longitude)) +
  geom_point(aes(colour = factor(pickup_cluster)), size = 0.1,show.legend = F) +
  ggtitle("Neighborhoods of NYC")

Type Conversion

Earlier, we created date-time columns. We now want to split that information into month, day, hour, and day of the week.

# extract months
train_df$Month <- as.factor(month(train_df$pickup_date))
test_df$Month <- as.factor(month(test_df$pickup_date))

# extract numeric day within month
train_df$Day <- as.factor(day(train_df$pickup_date))
test_df$Day <- as.factor(day(test_df$pickup_date))

# extract hour within day
train_df$Hour <- as.factor(hour(train_df$pickup_datetime))
test_df$Hour <- as.factor(hour(test_df$pickup_datetime))

# extract day of the week
train_df$DayOfWeek <- as.factor(weekdays(train_df$pickup_date))
test_df$DayOfWeek <- as.factor(weekdays(test_df$pickup_date))

Additionally, a lot of our numeric data is actually better described as categorical information. For example, the neighborhood (cluster) columns are analogous to zip codes. So, we will quickly convert all of these into factors.

# change to factors
train_df$vendor_id <- as.factor(train_df$vendor_id)
test_df$vendor_id <- as.factor(test_df$vendor_id)

train_df$passenger_count <- as.factor(train_df$passenger_count)
test_df$passenger_count <- as.factor(test_df$passenger_count)

train_df$store_and_fwd_flag <- as.factor(train_df$store_and_fwd_flag)
test_df$store_and_fwd_flag <- as.factor(test_df$store_and_fwd_flag)

train_df$pickup_cluster <- as.factor(train_df$pickup_cluster)
test_df$pickup_cluster <- as.factor(test_df$pickup_cluster)
train_df$dropoff_cluster <- as.factor(train_df$dropoff_cluster)
test_df$dropoff_cluster <- as.factor(test_df$dropoff_cluster)

Append Additional Data

Lastly, since the given data did not provide many columns on its own, we decided to locate additional data. On Kaggle, we found an OSRM data set for New York whose IDs match the ID pairs in our given data set. This data set defines the fastest route for every trip in our data. However, even though it gives the total time and distance traveled, these values describe the fastest mapped routes and do not take into account things like traffic or alternative routes. Therefore, they are not a direct measure of the prediction variable.

# https://www.kaggle.com/oscarleo/new-york-city-taxi-with-osrm?select=fastest_routes_test.csv

# import and concatenate new data with existing training set
fr1 <- read.csv('fastest_routes_train_part_1.csv')[,c('id', 'total_distance', 'total_travel_time',  'number_of_steps')]
fr2 <- read.csv('fastest_routes_train_part_2.csv')[,c('id', 'total_distance', 'total_travel_time', 'number_of_steps')]

# import and concatenate new data with existing test set
test_street_info = read.csv('fastest_routes_test.csv')[,
                               c('id', 'total_distance', 'total_travel_time', 'number_of_steps')]

# merge two training sets
train_street_info <- rbind(fr1,fr2)

# left join according to id column in both data sets
train_df <- left_join(train_df,train_street_info,by="id")
test_df <- left_join(test_df,test_street_info,by="id")

We have now reached the end of our preprocessing stage. The final step before moving on to our models is to select the columns of interest. Otherwise, we would introduce a lot of collinearity, since many of the more explanatory features were created from the existing ones.
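One quick way to inspect that collinearity, shown here only as a minimal sketch, is to look at the correlations among the distance-based features (engineered and OSRM):

# correlation matrix of the distance features; strong pairwise correlation
# is expected since they all derive from the same trip geometry
cor(train_df[, c("dist_haversine", "distance_dummy_manhattan", "total_distance")],
    use = "complete.obs")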

train_df %>% select(c('dist_haversine','distance_dummy_manhattan',
                      'direction','pickup_cluster','dropoff_cluster',
                      'total_distance','total_travel_time','number_of_steps',
                      'vendor_id','passenger_count','store_and_fwd_flag',
                      'Month','Day','Hour','DayOfWeek','trip_duration')) -> Train_Master

test_df %>% select(c('dist_haversine','distance_dummy_manhattan',
                      'direction','pickup_cluster','dropoff_cluster',
                      'total_distance','total_travel_time','number_of_steps',
                      'vendor_id','passenger_count','store_and_fwd_flag',
                      'Month','Day','Hour','DayOfWeek')) -> Test_Master

dim(Train_Master)
## [1] 1437128      16
dim(Test_Master)
## [1] 625134     15
# Last check for NA's possibly introduced via feature expansion
sum(is.na(Test_Master))
## [1] 0
Train_Master <- na.omit(Train_Master)

Train and Validation

To train our model, we will be using the caret library to partition our training data into train and validation sets.

# import caret library
library(caret)

# create a 80-20 split of the training data 
train_val_split <- createDataPartition(Train_Master$trip_duration, p = .8, list = F)
#XMatrix <- sparse.model.matrix(trip_duration ~ ., data = Train_Master)

# construct feature and prediction training sets
Train.X <- data.matrix(subset(Train_Master[train_val_split,],select = -trip_duration))
Train.Y <- log(Train_Master$trip_duration)[train_val_split]

# construct feature and prediction validation sets
Val.X <-  data.matrix(subset(Train_Master[-train_val_split,],select = -trip_duration))
Val.Y <- log(Train_Master$trip_duration)[-train_val_split]

# convert test set into a matrix
Test.X <- data.matrix(Test_Master)

Model Critiques

Lasso

library(glmnet)

# Use cross-validation
cv_lasso = cv.glmnet(Train.X, Train.Y, alpha=1) # Lasso regression
plot(cv_lasso)

We can extract the best lambda from cross validation.

(best_lambda <- cv_lasso$lambda.min)
## [1] 0.0001276496

Make predictions on the validation set.

mod.lasso <- glmnet(Train.X, Train.Y, alpha = 1,lambda = best_lambda)

# Check prediction error on the validation set
lasso_pred <- predict(mod.lasso, s = best_lambda, newx = Val.X)
# The root mean squared error (RMSE)
(error_Lasso <- sqrt(mean((lasso_pred - Val.Y)^2)))
## [1] 0.5215697

Random Forest

Here is our implementation of Random Forest, based on other users' submissions.

This model took 38 minutes to train, so it is quite slow.

library(randomForest)
RFmodel <- randomForest(x = Train.X,
                    y = Train.Y,
                    importance=TRUE, 
                    ntree=500,
                    mtry = sqrt(15),
                    maxnodes = 90)

## Call:
##  randomForest(x = Train.X, y = Train.Y, ntree = 500, mtry = sqrt(15),
##               maxnodes = 90, importance = TRUE)
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 0.1782996
##                     % Var explained: 69.99

varImpPlot(RFmodel)

Evaluate on the validation set.

RFyhat <- predict(RFmodel, newdata = Val.X)
ErrorRF <- sqrt(mean((RFyhat - Val.Y)^2))
ErrorRF

## [1] 0.4207632

eXtreme Gradient Boosting

Initial Implementation

XGBoost was the best model to use on this data set because it is highly flexible and versatile, and it handles large-scale problems efficiently. This is directly tied to its ability to push the limits of computing power for boosted tree algorithms, as it was built and developed with model performance and computational speed as primary goals.

The first step is to create XGBoost matrices for our train and validation sets.

library(xgboost)
dtrain <- xgb.DMatrix(Train.X, label = Train.Y)
dval <- xgb.DMatrix(Val.X, label = Val.Y)

Next, we define the parameters we want the model to use.

# Train an xgboost model
watchlist <- list(train = dtrain, val = dval)
param <- list(eta = .5, # learning rate
             nthread = -1, # number of parallel threads 
             object="reg:squaredlogerror", # regression (for classification, specify "binary:logistic")
             max.depth=10, # maximum tree depth
             verbose = 1, # print training error
             subsample = 0.9,
             booster ='gbtree',
             lambda = 1,
             colsample_bytree = 0.9,
             maximize = F,
             min_child_weight = 1)

Lastly, we train the model.

xgboostTrip <- xgb.train(params = param,
                         data = dtrain,
                         nrounds = 1000, 
                         watchlist = watchlist, 
                         early_stopping_rounds = 50,
                         eval_metric = 'rmse')
## [10:37:47] WARNING: amalgamation/../src/learner.cc:541: 
## Parameters: { maximize, object, verbose } might not be used.
## 
##   This may not be accurate due to some parameters are only used in language bindings but
##   passed down to XGBoost core.  Or some parameters are not used but slip through this
##   verification. Please open an issue if you find above cases.
## 
## 
## [1]  train-rmse:3.020098 val-rmse:3.020520 
## Multiple eval metrics are present. Will use val_rmse for early stopping.
## Will train until val_rmse hasn't improved in 50 rounds.
## 
## [2]  train-rmse:1.547621 val-rmse:1.548754 
## [3]  train-rmse:0.842112 val-rmse:0.844834 
## [4]  train-rmse:0.529528 val-rmse:0.535212 
## [5]  train-rmse:0.411930 val-rmse:0.420949 
## [6]  train-rmse:0.373062 val-rmse:0.384511 
## [7]  train-rmse:0.359155 val-rmse:0.372638 
## [8]  train-rmse:0.352862 val-rmse:0.368764 
## [9]  train-rmse:0.347499 val-rmse:0.366620 
## [10] train-rmse:0.344809 val-rmse:0.365468 
## [11] train-rmse:0.340856 val-rmse:0.362705 
## [12] train-rmse:0.338529 val-rmse:0.361793 
## [13] train-rmse:0.336335 val-rmse:0.361013 
## [14] train-rmse:0.334597 val-rmse:0.360258 
## [15] train-rmse:0.332149 val-rmse:0.359633 
## [16] train-rmse:0.330026 val-rmse:0.359281 
## [17] train-rmse:0.328796 val-rmse:0.359040 
## [18] train-rmse:0.327505 val-rmse:0.358976 
## [19] train-rmse:0.326493 val-rmse:0.358649 
## [20] train-rmse:0.325242 val-rmse:0.357988 
## [21] train-rmse:0.322451 val-rmse:0.357037 
## [22] train-rmse:0.320204 val-rmse:0.355969 
## [23] train-rmse:0.319207 val-rmse:0.355875 
## [24] train-rmse:0.317522 val-rmse:0.355795 
## [25] train-rmse:0.316317 val-rmse:0.355783 
## [26] train-rmse:0.314879 val-rmse:0.355414 
## [27] train-rmse:0.313790 val-rmse:0.355058 
## [28] train-rmse:0.312664 val-rmse:0.354831 
## [29] train-rmse:0.311896 val-rmse:0.354782 
## [30] train-rmse:0.310544 val-rmse:0.354849 
## [31] train-rmse:0.310347 val-rmse:0.354782 
## [32] train-rmse:0.309528 val-rmse:0.355018 
## [33] train-rmse:0.307567 val-rmse:0.354789 
## [34] train-rmse:0.306929 val-rmse:0.354787 
## [35] train-rmse:0.305652 val-rmse:0.354435 
## [36] train-rmse:0.304898 val-rmse:0.354445 
## [37] train-rmse:0.304223 val-rmse:0.354647 
## [38] train-rmse:0.302790 val-rmse:0.354101 
## [39] train-rmse:0.301810 val-rmse:0.353836 
## [40] train-rmse:0.301242 val-rmse:0.354084 
## [41] train-rmse:0.300163 val-rmse:0.354075 
## [42] train-rmse:0.299541 val-rmse:0.353947 
## [43] train-rmse:0.299183 val-rmse:0.354051 
## [44] train-rmse:0.298795 val-rmse:0.354159 
## [45] train-rmse:0.298374 val-rmse:0.354306 
## [46] train-rmse:0.297896 val-rmse:0.354381 
## [47] train-rmse:0.297338 val-rmse:0.354418 
## [48] train-rmse:0.297001 val-rmse:0.354364 
## [49] train-rmse:0.296635 val-rmse:0.354487 
## [50] train-rmse:0.296038 val-rmse:0.354755 
## [51] train-rmse:0.295528 val-rmse:0.354814 
## [52] train-rmse:0.294308 val-rmse:0.354363 
## [53] train-rmse:0.293848 val-rmse:0.354452 
## [54] train-rmse:0.293056 val-rmse:0.354493 
## [55] train-rmse:0.291899 val-rmse:0.354502 
## [56] train-rmse:0.291525 val-rmse:0.354616 
## [57] train-rmse:0.291026 val-rmse:0.354804 
## [58] train-rmse:0.290256 val-rmse:0.354584 
## [59] train-rmse:0.289817 val-rmse:0.354659 
## [60] train-rmse:0.289575 val-rmse:0.354744 
## [61] train-rmse:0.288936 val-rmse:0.354823 
## [62] train-rmse:0.288058 val-rmse:0.354658 
## [63] train-rmse:0.287706 val-rmse:0.354630 
## [64] train-rmse:0.286945 val-rmse:0.354778 
## [65] train-rmse:0.286535 val-rmse:0.354715 
## [66] train-rmse:0.286369 val-rmse:0.354798 
## [67] train-rmse:0.286016 val-rmse:0.354755 
## [68] train-rmse:0.285665 val-rmse:0.354804 
## [69] train-rmse:0.285259 val-rmse:0.354757 
## [70] train-rmse:0.284816 val-rmse:0.354855 
## [71] train-rmse:0.283921 val-rmse:0.354763 
## [72] train-rmse:0.283280 val-rmse:0.354987 
## [73] train-rmse:0.282549 val-rmse:0.354919 
## [74] train-rmse:0.281748 val-rmse:0.354822 
## [75] train-rmse:0.281554 val-rmse:0.354814 
## [76] train-rmse:0.281124 val-rmse:0.354820 
## [77] train-rmse:0.281049 val-rmse:0.354827 
## [78] train-rmse:0.280778 val-rmse:0.354860 
## [79] train-rmse:0.280432 val-rmse:0.354977 
## [80] train-rmse:0.279993 val-rmse:0.354980 
## [81] train-rmse:0.279623 val-rmse:0.355019 
## [82] train-rmse:0.279100 val-rmse:0.354963 
## [83] train-rmse:0.278962 val-rmse:0.354973 
## [84] train-rmse:0.278747 val-rmse:0.355033 
## [85] train-rmse:0.278522 val-rmse:0.355072 
## [86] train-rmse:0.277771 val-rmse:0.355077 
## [87] train-rmse:0.277544 val-rmse:0.355122 
## [88] train-rmse:0.277305 val-rmse:0.355142 
## [89] train-rmse:0.277074 val-rmse:0.355212 
## Stopping. Best iteration:
## [39] train-rmse:0.301810 val-rmse:0.353836

Tuning the model

To tune the model, we will use cross-validation with a random search over the parameter space.

# Random parameters
best_param <- list()    # empty list to hold the best parameter combination
best_seednumber <- 1234 # set the seed
best_rmse <- Inf        # initialize the best rmse (minimization objective)
best_rmse_index <- 0    # iteration index of the best result

set.seed(1234)

# Run 1000 random iterations 
for (iter in 1:1000) {
  param <- list(objective = "reg:linear",  # For regression
                eval_metric = "rmse",      # rmse is used for regression
                max_depth = sample(6:10, 1),
                eta = runif(1, .01, .1),   # Learning rate, default: 0.3
                subsample = runif(1, .6, .9),
                colsample_bytree = runif(1, .5, .8), 
                min_child_weight = sample(5:10, 1), # These two are important
                max_delta_step = sample(5:10, 1)    # Can help to focus error
                                                    # into a small range.
  )
  cv.nround <-  1000
  cv.nfold <-  5 # 5-fold cross-validation
  seed.number  <-  sample.int(10000, 1) # set seed for the cv
  set.seed(seed.number)
  
  # mdcv is a common variable name for XGBoost cross-validation
  mdcv <- xgb.cv(data = dtrain, params = param,  
                 nfold = cv.nfold, nrounds = cv.nround,
                 verbose = F, early_stopping_rounds = 15, maximize = FALSE)
  
  min_rmse_index  <-  mdcv$best_iteration
  min_rmse <-  mdcv$evaluation_log[min_rmse_index]$test_rmse_mean
  
  if (min_rmse < best_rmse) {
    best_rmse <- min_rmse
    best_rmse_index <- min_rmse_index
    best_seednumber <- seed.number
    best_param <- param
  }
}

Our best parameters were:

best_param
## $objective
## [1] "reg:linear"
## 
## $eval_metric
## [1] "rmse"
## 
## $max_depth
## [1] 9
## 
## $eta
## [1] 0.06600695
## 
## $subsample
## [1] 0.7827824
## 
## $colsample_bytree
## [1] 0.6870138
## 
## $min_child_weight
## [1] 9
## 
## $max_delta_step
## [1] 8

Rerun with New Parameters

# Train an xgboost model
watchlist <- list(train = dtrain, val = dval)
param <- list(eta = 0.06600695, # learning rate
             nthread = -1, # number of parallel threads 
             object="reg:linear",
             max.depth=9, # maximum tree depth
             verbose = 0,
             subsample = 0.7827824,
             booster ='gbtree',
             lambda = 1,
             colsample_bytree = 0.6870138,
             maximize = F,
             min_child_weight = 9,
             max_delta_step = 8) 

Lastly, we retrain the model with the tuned parameters.

xgboostTrip <- xgb.train(params = param,
                         data = dtrain,
                         nrounds = 1000, 
                         watchlist = watchlist, 
                         early_stopping_rounds = 50,
                         eval_metric = 'rmse')
XBGpred <- predict(xgboostTrip, dval) 
(XBG_error <- sqrt(mean((XBGpred - Val.Y)^2)))
## [1] 0.3379605
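Since the model was fit to log(trip_duration), predictions for the Kaggle test set would need to be mapped back to seconds before submission. A minimal sketch of that step (the submission file name is our own assumption):

# predictions come back on the log scale, so exponentiate to get seconds
dtest <- xgb.DMatrix(Test.X)
test_pred_log <- predict(xgboostTrip, dtest)
submission <- data.frame(id = test_df$id,
                         trip_duration = exp(test_pred_log))
write.csv(submission, "submission.csv", row.names = FALSE)  # hypothetical file name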

References