Introduction: New York City is known for its abundance of traffic: streets crowded with cars, pedestrians, bicycles, and anything else that can clog a road. Because of this congestion, many New Yorkers avoid driving themselves and rely on taxis instead. Even without having to drive, the overwhelming street congestion leaves many New York visitors and residents wondering whether they will make it to their destination on time. To help with this problem, our goal was to build a model that predicts the duration of a taxi trip from information about the route.
Dataset: The observations in this dataset, recorded in 2016, were collected by the NYC Taxi and Limousine Commission. The dataset contains features that can plausibly affect trip duration, such as the day of the week, time of day, number of passengers, and the distance between the pickup and dropoff locations. In addition to predicting trip duration, we will identify which features influence trip duration more heavily than others.
We load the data into our environment using the read.csv function from base R.
# Create dataframe for training and testing
# The training set consists of 1458644 observations and 11 features
train_df <- read.csv("train.csv", stringsAsFactors=F, header=T, sep=',')
# The test data consists of 625134 observations and 9 variables
# It is missing trip_duration and dropoff_datetime
test_df <- read.csv("test.csv", stringsAsFactors=F, header=T, sep=',')
print(names(train_df))
## [1] "id" "vendor_id" "pickup_datetime"
## [4] "dropoff_datetime" "passenger_count" "pickup_longitude"
## [7] "pickup_latitude" "dropoff_longitude" "dropoff_latitude"
## [10] "store_and_fwd_flag" "trip_duration"
To gain a better understanding of the data we are dealing with, we start by looking at the data type of each feature.
# Use str() to output a list of each feature, its data type, and some instances
str(train_df)
## 'data.frame': 1458644 obs. of 11 variables:
## $ id : chr "id2875421" "id2377394" "id3858529" "id3504673" ...
## $ vendor_id : int 2 1 2 2 2 2 1 2 1 2 ...
## $ pickup_datetime : chr "2016-03-14 17:24:55" "2016-06-12 00:43:35" "2016-01-19 11:35:24" "2016-04-06 19:32:31" ...
## $ dropoff_datetime : chr "2016-03-14 17:32:30" "2016-06-12 00:54:38" "2016-01-19 12:10:48" "2016-04-06 19:39:40" ...
## $ passenger_count : int 1 1 1 1 1 6 4 1 1 1 ...
## $ pickup_longitude : num -74 -74 -74 -74 -74 ...
## $ pickup_latitude : num 40.8 40.7 40.8 40.7 40.8 ...
## $ dropoff_longitude : num -74 -74 -74 -74 -74 ...
## $ dropoff_latitude : num 40.8 40.7 40.7 40.7 40.8 ...
## $ store_and_fwd_flag: chr "N" "N" "N" "N" ...
## $ trip_duration : int 455 663 2124 429 435 443 341 1551 255 1225 ...
As you can see, our data consists mainly of numeric and character variables. However, if we look closely at some of the character variables, we notice that pickup_datetime and dropoff_datetime should actually be date-time objects. We will address this point later; for now, it is useful to look at summary statistics of the given data.
# Use the summary function to view summary statistics
summary(train_df$trip_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 397 662 959 1075 3526282
It is important to stop and make sense of the numbers we are seeing here. trip_duration, our Y variable, has a minimum value of 1. The Kaggle page hosting this competition tells us that this value is measured in seconds, and it is a fair assumption that no real trip lasts only 1 second. Furthermore, the maximum trip is 3,526,282 seconds, which is approximately 980 hours (about 41 days). Therefore, we clearly have outliers in the given data set. This is the most important takeaway from the summary statistics.
To exclude most of these outliers, given the large range, we decided to omit observations whose trip_duration does not lie within two standard deviations of the mean. Keeping data within two standard deviations is a common statistical rule of thumb, which is why we chose this cutoff.
# First we calculate the mean and standard deviation of trip_duration
avgTrip <- mean(train_df$trip_duration)
sdTrip <- sd(train_df$trip_duration)
# import dplyr to start manipulating the data set
library(dplyr)
# omit all observations that lie outside of 2 standard deviations of
# trip_duration
train_df %>% filter(trip_duration >= (avgTrip - 2*sdTrip)) %>%
filter(trip_duration <= (avgTrip + 2*sdTrip)) -> train_df
# Re-look at summary statistics
summary(train_df)
## id vendor_id pickup_datetime dropoff_datetime
## Length:1456540 Min. :1.000 Length:1456540 Length:1456540
## Class :character 1st Qu.:1.000 Class :character Class :character
## Mode :character Median :2.000 Mode :character Mode :character
## Mean :1.534
## 3rd Qu.:2.000
## Max. :2.000
## passenger_count pickup_longitude pickup_latitude dropoff_longitude
## Min. :0.000 Min. :-121.93 Min. :34.36 Min. :-121.93
## 1st Qu.:1.000 1st Qu.: -73.99 1st Qu.:40.74 1st Qu.: -73.99
## Median :1.000 Median : -73.98 Median :40.75 Median : -73.98
## Mean :1.664 Mean : -73.97 Mean :40.75 Mean : -73.97
## 3rd Qu.:2.000 3rd Qu.: -73.97 3rd Qu.:40.77 3rd Qu.: -73.96
## Max. :9.000 Max. : -61.34 Max. :51.88 Max. : -61.34
## dropoff_latitude store_and_fwd_flag trip_duration
## Min. :32.18 Length:1456540 Min. : 1.0
## 1st Qu.:40.74 Class :character 1st Qu.: 397.0
## Median :40.75 Mode :character Median : 662.0
## Mean :40.75 Mean : 836.9
## 3rd Qu.:40.77 3rd Qu.: 1073.0
## Max. :43.92 Max. :11411.0
From some googling and from other Kaggle submissions, New York City is commonly defined to lie within roughly the following coordinates: longitude between -74.03 and -73.75, and latitude between 40.63 and 40.85.
Comparing these bounds against our summary statistics above, we still have observations outside of this geographic area. Since this project is focused on NYC, we will restrict our training data to lie within this perimeter.
## Clean up coordinates -- only include obs in city limits
train_df %>% filter(pickup_longitude <= -73.75) %>%
filter(pickup_longitude >= -74.03) %>%
filter(pickup_latitude <= 40.85) %>%
filter(pickup_latitude >= 40.63) %>%
filter(dropoff_longitude <= -73.75) %>%
filter(dropoff_longitude >= -74.03) %>%
filter(dropoff_latitude <= 40.85) %>%
filter(dropoff_latitude >= 40.63) -> train_df
As mentioned earlier, the last cleaning step is to convert pickup and dropoff dates and time into the correct data type. This is easily accomplished using a combination of lubridate and dplyr.
# import lubridate package
library(lubridate)
# Change format of training date variables into date and time
train_df %>% mutate(pickup_datetime,
pickup_datetime = ymd_hms(pickup_datetime)) %>%
mutate(dropoff_datetime,
dropoff_datetime= ymd_hms(dropoff_datetime)) -> train_df
# We can then extract just the date from the date-time columns above
train_df %>% mutate(pickup_date = date(pickup_datetime)) %>%
mutate(dropoff_date= date(dropoff_datetime)) -> train_df
# Repeat the above procedure for the test set
test_df %>% mutate(pickup_datetime,
pickup_datetime = ymd_hms(pickup_datetime)) -> test_df
test_df %>% mutate(pickup_date = date(pickup_datetime)) -> test_df
In this next section, we use some visualization tools to gain a better understanding of the data. First, we look at the remaining spread of the Y variable (trip_duration).
# Plot histogram of trip duration with 100 bins
hist(train_df$trip_duration,
breaks = 100, main = "Histogram of Given Trip Durations")
As you can see, our data is right-skewed (positive skewness). To account for this, we take a log transformation of the variable. This also gives some intuition as to why the competition's error measure is RMSLE (root mean squared logarithmic error).
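For reference, here is a minimal sketch of the RMSLE metric in R; the helper name rmsle is ours and is not part of the competition code.
# Root mean squared logarithmic error between actual and predicted durations.
# The +1 inside the logs avoids log(0) and matches the competition's definition.
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted + 1) - log(actual + 1))^2))
}
# Example: predictions that are off by a constant 10% factor
rmsle(actual = c(400, 800, 1600), predicted = c(440, 880, 1760))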
# log transform Y variable
train_df %>% mutate(trip_duration,
logTripDuration = log(trip_duration +1)) -> train_df
# display a histogram to view changes
hist(train_df$logTripDuration,
breaks = 100, main = "Histogram of Given Log Trip Durations")
The above graph now has a shape comparable to a normal distribution.
Since we are dealing with trip data, it is possible that there is seasonality involved. To examine this hypothesis, we can look at the number of trips over time.
## Plot Time Series of Data
# count the number of trips by date for train and test sets
train_df %>% count(date = pickup_date) -> train_date_counts
test_df %>% count(date = pickup_date) -> test_date_counts
# plot the time series graph for train
plot(train_date_counts, ylim=c(0,max(train_date_counts$n) ),
main = 'Timeseries of Number of Trips',
ylab = 'Number of Trips')
lines(train_date_counts,type = 'l', lty = 2, col='blue')
# overlay the next plot so the train and test time-series share one graph
par(new=TRUE)
plot(test_date_counts, axes = F, ylim=c(0,max(train_date_counts$n)))
lines(test_date_counts,type = 'l', lty = 1, col='darkgreen')
We do notice a distinct drop from late January to early February, which likely corresponds to the late-January 2016 blizzard. However, the same dip appears in the test set's date range, so it would probably be unwise to remove those observations.
We added three features to our model: the haversine (great-circle) distance between pickup and dropoff, an approximate Manhattan distance (the sum of the north-south and east-west haversine legs), and the trip direction (bearing).
We calculate all three using a combination of dplyr and the geosphere package, whose distHaversine and bearing functions integrate easily with dplyr's syntax.
# define coordinate boundaries mentioned above in variables for function
# parameters
city_long_border <- c(-74.03, -73.75)
city_lat_border <- c(40.63, 40.85)
library(ggplot2)
# visualize coordinates using a scatter plot
ggplot(train_df[1:100000,], aes(x = pickup_latitude, y = pickup_longitude)) +
geom_point(color = 'blue', size = 0.1) +
ggtitle("Outline of NYC by Trip Coordinates")
# import geosphere library
library(geosphere)
# calculate haversine distance for all training observations
train_df %>% mutate(dist_haversine =
distHaversine(cbind(pickup_longitude,
pickup_latitude),
cbind(dropoff_longitude,
dropoff_latitude))/1000) -> train_df
# calculate haversine distance for all test observations
test_df %>% mutate(dist_haversine =
distHaversine(cbind(pickup_longitude
,pickup_latitude),
cbind(dropoff_longitude,
dropoff_latitude))/1000) -> test_df
# calculate manhattan distance for all training observations
train_df %>% mutate(distance_dummy_manhattan = distHaversine(cbind(pickup_longitude,pickup_latitude),
cbind(dropoff_longitude,pickup_latitude))/1000 + distHaversine(cbind(pickup_longitude,pickup_latitude), cbind(pickup_longitude,dropoff_latitude))/1000) -> train_df
# calculate manhattan distance for all test observations
test_df %>% mutate(distance_dummy_manhattan = distHaversine(cbind(pickup_longitude,pickup_latitude), cbind(dropoff_longitude,pickup_latitude))/1000 +
distHaversine(cbind(pickup_longitude,pickup_latitude), cbind(pickup_longitude,dropoff_latitude))/1000) -> test_df
# calculate trip direction for all training observations
train_df %>% mutate(direction = bearing(cbind(pickup_longitude,
pickup_latitude),
cbind(dropoff_longitude,
dropoff_latitude))) -> train_df
# calculate trip direction for all test observations
test_df %>% mutate(direction = bearing(cbind(pickup_longitude,
pickup_latitude),
cbind(dropoff_longitude,
dropoff_latitude))) -> test_df
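As a quick sanity check of the units (a sketch using approximate coordinates we picked ourselves, not values from the data), distHaversine returns distances in meters, which is why the distance calls above divide by 1000:
# Sanity check (hypothetical coordinates): straight-line distance from
# Times Square to JFK airport, in kilometers
times_sq <- c(-73.9855, 40.7580) # (longitude, latitude)
jfk <- c(-73.7781, 40.6413)
distHaversine(times_sq, jfk) / 1000 # roughly 21-22 km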
We also wanted to enrich the data by adding neighborhoods. To do this, we used K-means clustering on the pickup and dropoff coordinates to create 100 groups.
Importantly, we had to limit the number of samples used to fit the clusters because of memory constraints on our laptops. K-means is appropriate here because it is a least-squares optimization problem: it searches for the partition that minimizes the within-cluster sum of squared distances, producing small, close-knit clusters (small intra-cluster distances) that we can treat as neighborhoods.
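As a small illustration of that least-squares objective (a toy sketch, separate from the taxi data), the tot.withinss value reported by kmeans can be recomputed by hand as the sum of squared distances from each point to its assigned center.
# Toy example: kmeans minimizes the total within-cluster sum of squares
set.seed(1)
toy <- matrix(rnorm(200), ncol = 2)
km_toy <- kmeans(toy, centers = 4, nstart = 20)
# Recompute the objective by hand and compare with the reported value
manual_wss <- sum((toy - km_toy$centers[km_toy$cluster, ])^2)
all.equal(manual_wss, km_toy$tot.withinss) # TRUE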
We can then use the clue package to predict cluster labels for the remaining observations.
## create Neighborhoods
# merge pickup and dropoff coordinates into a single dataframe
coords <- rbind(cbind(train_df$pickup_latitude,train_df$pickup_longitude),
cbind(train_df$dropoff_latitude, train_df$dropoff_longitude))
# randomly select 10,000 observations to run kmeans on
randomSamples <- sample(1:nrow(coords),10000)
randomCoords <- coords[randomSamples,]
# apply kmeans clustering to the selected coordinate locations above
km.out <- kmeans(randomCoords, centers = 100)
# import the clue library to predict cluster locations of observations not used
# in kmeans
library(clue)
# predict pickup neighborhoods for training set
train_df$pickup_cluster <- cl_predict(km.out,
cbind(train_df$pickup_latitude,
train_df$pickup_longitude))
# predict pickup neighborhoods for test set
test_df$pickup_cluster <- cl_predict(km.out,cbind(test_df$pickup_latitude,
test_df$pickup_longitude))
# predict dropoff neighborhoods for training set
train_df$dropoff_cluster <- cl_predict(km.out,cbind(train_df$dropoff_latitude, train_df$dropoff_longitude))
# predict dropoff neighborhoods for test set
test_df$dropoff_cluster <- cl_predict(km.out,cbind(test_df$dropoff_latitude, test_df$dropoff_longitude))
We can now re-visualize the above scatter plot with the newly defined clusters.
# visualize coordinates using scatter plot
ggplot(train_df[1:500000,], aes(x = pickup_latitude, y = pickup_longitude)) +
geom_point(aes(colour = factor(pickup_cluster)), size = 0.1,show.legend = F) +
ggtitle("Neighborhoods of NYC")
Earlier, we created date-time columns. We now want to split that information into the month, day of the month, hour, and day of the week.
# extract months
train_df$Month <- as.factor(month(train_df$pickup_date))
test_df$Month <- as.factor(month(test_df$pickup_date))
# extract numeric day within month
train_df$Day <- as.factor(day(train_df$pickup_date))
test_df$Day <- as.factor(day(test_df$pickup_date))
# extract hour within day
train_df$Hour <- as.factor(hour(train_df$pickup_datetime))
test_df$Hour <- as.factor(hour(test_df$pickup_datetime))
# extract day of the week
train_df$DayOfWeek <- as.factor(weekdays(train_df$pickup_date))
test_df$DayOfWeek <- as.factor(weekdays(test_df$pickup_date))
Additionally, a lot of our numeric data is better described as categorical information. For example, the neighborhood (cluster) columns are analogous to zip codes, so we quickly convert these variables into factors.
# change to factors
train_df$vendor_id <- as.factor(train_df$vendor_id)
test_df$vendor_id <- as.factor(test_df$vendor_id)
train_df$passenger_count <- as.factor(train_df$passenger_count)
test_df$passenger_count <- as.factor(test_df$passenger_count)
train_df$store_and_fwd_flag <- as.factor(train_df$store_and_fwd_flag)
test_df$store_and_fwd_flag <- as.factor(test_df$store_and_fwd_flag)
train_df$pickup_cluster <- as.factor(train_df$pickup_cluster)
test_df$pickup_cluster <- as.factor(test_df$pickup_cluster)
train_df$dropoff_cluster <- as.factor(train_df$dropoff_cluster)
test_df$dropoff_cluster <- as.factor(test_df$dropoff_cluster)
Lastly, since the given data did not provide many informative columns on its own, we decided to locate more data. On Kaggle, we found an OSRM data set for New York whose trip IDs match those in our given data set. It defines the fastest route for every trip, including the total travel time, total distance, and number of steps. However, these are idealized fastest-route estimates that do not account for traffic or alternative routes, so they are not a direct measure of the prediction variable.
# https://www.kaggle.com/oscarleo/new-york-city-taxi-with-osrm?select=fastest_routes_test.csv
# import the OSRM route data for the training set (split across two files)
fr1 <- read.csv('fastest_routes_train_part_1.csv')[,c('id', 'total_distance', 'total_travel_time', 'number_of_steps')]
fr2 <- read.csv('fastest_routes_train_part_2.csv')[,c('id', 'total_distance', 'total_travel_time', 'number_of_steps')]
# import the OSRM route data for the test set
test_street_info = read.csv('fastest_routes_test.csv')[,
c('id', 'total_distance', 'total_travel_time', 'number_of_steps')]
# merge two training sets
train_street_info <- rbind(fr1,fr2)
# left join according to id column in both data sets
train_df <- left_join(train_df,train_street_info,by="id")
test_df <- left_join(test_df,test_street_info,by="id")
We have now reached the end of our preprocessing stage. The final step before moving on to our models is to select the columns of interest; otherwise we would have a lot of collinearity, since many of the new features were derived directly from existing columns.
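As one way to see this overlap (a sketch run on the joined training data; one would expect the engineered distance features to be strongly correlated with one another, though we do not reproduce the exact values here):
# Check pairwise correlations among the distance-related features;
# use complete.obs because the OSRM join can introduce NAs
cor(train_df[, c("dist_haversine", "distance_dummy_manhattan", "total_distance")],
    use = "complete.obs")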
train_df %>% select(c('dist_haversine','distance_dummy_manhattan',
'direction','pickup_cluster','dropoff_cluster',
'total_distance','total_travel_time','number_of_steps',
'vendor_id','passenger_count','store_and_fwd_flag',
'Month','Day','Hour','DayOfWeek','trip_duration')) -> Train_Master
test_df %>% select(c('dist_haversine','distance_dummy_manhattan',
'direction','pickup_cluster','dropoff_cluster',
'total_distance','total_travel_time','number_of_steps',
'vendor_id','passenger_count','store_and_fwd_flag',
'Month','Day','Hour','DayOfWeek')) -> Test_Master
dim(Train_Master)
## [1] 1437128 16
dim(Test_Master)
## [1] 625134 15
# Last check for NA's possibly introduced via feature expansion
sum(is.na(Test_Master))
## [1] 0
Train_Master <- na.omit(Train_Master)
To train our model, we will be using the caret library to partition our training data into train and validation sets.
# import caret library
library(caret)
# create a 80-20 split of the training data
train_val_split <- createDataPartition(Train_Master$trip_duration, p = .8, list = F)
#XMatrix <- sparse.model.matrix(trip_duration ~ ., data = Train_Master)
# construct feature and prediction training sets
Train.X <- data.matrix(subset(Train_Master[train_val_split,],select = -trip_duration))
Train.Y <- log(Train_Master$trip_duration)[train_val_split]
# construct feature and prediction validation sets
Val.X <- data.matrix(subset(Train_Master[-train_val_split,],select = -trip_duration))
Val.Y <- log(Train_Master$trip_duration)[-train_val_split]
# convert test set into a matrix
Test.X <- data.matrix(Test_Master)
library(glmnet)
# Use cross-validation
cv_lasso = cv.glmnet(Train.X, Train.Y, alpha=1) # Lasso regression
plot(cv_lasso)
We can extract the best lambda from cross validation.
(best_lambda <- cv_lasso$lambda.min)
## [1] 0.0001276496
We now fit the lasso with the best lambda and make predictions on the validation set.
mod.lasso <- glmnet(Train.X, Train.Y, alpha = 1,lambda = best_lambda)
# Check prediction error in the testing dataset
lasso_pred <- predict(mod.lasso, s = best_lambda, newx = Val.X)
# The root mean squared error (RMSE) on the log scale
(error_Lasso <- sqrt(mean((lasso_pred - Val.Y)^2)))
## [1] 0.5215697
Here is our implementation of a random forest, based on other users' Kaggle submissions.
This model took roughly 38 minutes to train, so it is very slow.
library(randomForest)
RFmodel <- randomForest(x = Train.X,
y = Train.Y,
importance=TRUE,
ntree=500,
mtry = sqrt(15),
maxnodes = 90)
Call:
 randomForest(x = Train.X, y = Train.Y, ntree = 500, mtry = sqrt(15), maxnodes = 90, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 4

          Mean of squared residuals: 0.1782996
                    % Var explained: 69.99
varImpPlot(RFmodel)
We evaluate the random forest on the validation set.
RFyhat <- predict(RFmodel, newdata = Val.X)
ErrorRF <- sqrt(mean((RFyhat - Val.Y)^2))
ErrorRF
[1] 0.4207632
XGBoost was the best model for this data set because it is highly flexible and versatile and handles large-scale problems efficiently. It was designed specifically for model performance and computational speed, pushing the limits of what boosted tree algorithms can do with the available computing power.
The first step is to create XGBoost matrices for our train and validation sets.
library(xgboost)
dtrain <- xgb.DMatrix(Train.X, label = Train.Y)
dval <- xgb.DMatrix(Val.X, label = Val.Y)
Next, we define the parameters we want the model to use.
# Train an xgboost model
watchlist <- list(train = dtrain, val = dval)
param <- list(eta = .5, # learning rate
nthread = -1, # number of parallel threads
object="reg:squaredlogerror", # regression (for classification, specify "binary:logistic")
max.depth=10, # maximum tree depth
verbose = 1, # print training error
subsample = 0.9,
booster ='gbtree',
lambda = 1,
colsample_bytree = 0.9,
maximize = F,
min_child_weight = 1)
Lastly, we train the model.
xgboostTrip <- xgb.train(params = param,
data = dtrain,
nrounds = 1000,
watchlist = watchlist,
early_stopping_rounds = 50,
eval_metric = 'rmse')
## [10:37:47] WARNING: amalgamation/../src/learner.cc:541:
## Parameters: { maximize, object, verbose } might not be used.
##
## This may not be accurate due to some parameters are only used in language bindings but
## passed down to XGBoost core. Or some parameters are not used but slip through this
## verification. Please open an issue if you find above cases.
##
##
## [1] train-rmse:3.020098 val-rmse:3.020520
## Multiple eval metrics are present. Will use val_rmse for early stopping.
## Will train until val_rmse hasn't improved in 50 rounds.
##
## [2] train-rmse:1.547621 val-rmse:1.548754
## [3] train-rmse:0.842112 val-rmse:0.844834
## [4] train-rmse:0.529528 val-rmse:0.535212
## [5] train-rmse:0.411930 val-rmse:0.420949
## [6] train-rmse:0.373062 val-rmse:0.384511
## [7] train-rmse:0.359155 val-rmse:0.372638
## [8] train-rmse:0.352862 val-rmse:0.368764
## [9] train-rmse:0.347499 val-rmse:0.366620
## [10] train-rmse:0.344809 val-rmse:0.365468
## [11] train-rmse:0.340856 val-rmse:0.362705
## [12] train-rmse:0.338529 val-rmse:0.361793
## [13] train-rmse:0.336335 val-rmse:0.361013
## [14] train-rmse:0.334597 val-rmse:0.360258
## [15] train-rmse:0.332149 val-rmse:0.359633
## [16] train-rmse:0.330026 val-rmse:0.359281
## [17] train-rmse:0.328796 val-rmse:0.359040
## [18] train-rmse:0.327505 val-rmse:0.358976
## [19] train-rmse:0.326493 val-rmse:0.358649
## [20] train-rmse:0.325242 val-rmse:0.357988
## [21] train-rmse:0.322451 val-rmse:0.357037
## [22] train-rmse:0.320204 val-rmse:0.355969
## [23] train-rmse:0.319207 val-rmse:0.355875
## [24] train-rmse:0.317522 val-rmse:0.355795
## [25] train-rmse:0.316317 val-rmse:0.355783
## [26] train-rmse:0.314879 val-rmse:0.355414
## [27] train-rmse:0.313790 val-rmse:0.355058
## [28] train-rmse:0.312664 val-rmse:0.354831
## [29] train-rmse:0.311896 val-rmse:0.354782
## [30] train-rmse:0.310544 val-rmse:0.354849
## [31] train-rmse:0.310347 val-rmse:0.354782
## [32] train-rmse:0.309528 val-rmse:0.355018
## [33] train-rmse:0.307567 val-rmse:0.354789
## [34] train-rmse:0.306929 val-rmse:0.354787
## [35] train-rmse:0.305652 val-rmse:0.354435
## [36] train-rmse:0.304898 val-rmse:0.354445
## [37] train-rmse:0.304223 val-rmse:0.354647
## [38] train-rmse:0.302790 val-rmse:0.354101
## [39] train-rmse:0.301810 val-rmse:0.353836
## [40] train-rmse:0.301242 val-rmse:0.354084
## [41] train-rmse:0.300163 val-rmse:0.354075
## [42] train-rmse:0.299541 val-rmse:0.353947
## [43] train-rmse:0.299183 val-rmse:0.354051
## [44] train-rmse:0.298795 val-rmse:0.354159
## [45] train-rmse:0.298374 val-rmse:0.354306
## [46] train-rmse:0.297896 val-rmse:0.354381
## [47] train-rmse:0.297338 val-rmse:0.354418
## [48] train-rmse:0.297001 val-rmse:0.354364
## [49] train-rmse:0.296635 val-rmse:0.354487
## [50] train-rmse:0.296038 val-rmse:0.354755
## [51] train-rmse:0.295528 val-rmse:0.354814
## [52] train-rmse:0.294308 val-rmse:0.354363
## [53] train-rmse:0.293848 val-rmse:0.354452
## [54] train-rmse:0.293056 val-rmse:0.354493
## [55] train-rmse:0.291899 val-rmse:0.354502
## [56] train-rmse:0.291525 val-rmse:0.354616
## [57] train-rmse:0.291026 val-rmse:0.354804
## [58] train-rmse:0.290256 val-rmse:0.354584
## [59] train-rmse:0.289817 val-rmse:0.354659
## [60] train-rmse:0.289575 val-rmse:0.354744
## [61] train-rmse:0.288936 val-rmse:0.354823
## [62] train-rmse:0.288058 val-rmse:0.354658
## [63] train-rmse:0.287706 val-rmse:0.354630
## [64] train-rmse:0.286945 val-rmse:0.354778
## [65] train-rmse:0.286535 val-rmse:0.354715
## [66] train-rmse:0.286369 val-rmse:0.354798
## [67] train-rmse:0.286016 val-rmse:0.354755
## [68] train-rmse:0.285665 val-rmse:0.354804
## [69] train-rmse:0.285259 val-rmse:0.354757
## [70] train-rmse:0.284816 val-rmse:0.354855
## [71] train-rmse:0.283921 val-rmse:0.354763
## [72] train-rmse:0.283280 val-rmse:0.354987
## [73] train-rmse:0.282549 val-rmse:0.354919
## [74] train-rmse:0.281748 val-rmse:0.354822
## [75] train-rmse:0.281554 val-rmse:0.354814
## [76] train-rmse:0.281124 val-rmse:0.354820
## [77] train-rmse:0.281049 val-rmse:0.354827
## [78] train-rmse:0.280778 val-rmse:0.354860
## [79] train-rmse:0.280432 val-rmse:0.354977
## [80] train-rmse:0.279993 val-rmse:0.354980
## [81] train-rmse:0.279623 val-rmse:0.355019
## [82] train-rmse:0.279100 val-rmse:0.354963
## [83] train-rmse:0.278962 val-rmse:0.354973
## [84] train-rmse:0.278747 val-rmse:0.355033
## [85] train-rmse:0.278522 val-rmse:0.355072
## [86] train-rmse:0.277771 val-rmse:0.355077
## [87] train-rmse:0.277544 val-rmse:0.355122
## [88] train-rmse:0.277305 val-rmse:0.355142
## [89] train-rmse:0.277074 val-rmse:0.355212
## Stopping. Best iteration:
## [39] train-rmse:0.301810 val-rmse:0.353836
To tune the model, we use cross validation together with a random search over the parameter space.
# Random parameters
best_param <- list() # empty list to store the best parameter combination
best_seednumber <- 1234 # set the seed
best_rmse <- Inf # initialize the best rmse (minimization objective)
best_rmse_index <- 0 # index of the best cross-validation iteration
set.seed(1234)
# Run 1000 random iterations
for (iter in 1:1000) {
param <- list(objective = "reg:linear", # For regression
eval_metric = "rmse", # rmse is used for regression
max_depth = sample(6:10, 1),
eta = runif(1, .01, .1), # Learning rate, default: 0.3
subsample = runif(1, .6, .9),
colsample_bytree = runif(1, .5, .8),
min_child_weight = sample(5:10, 1), # These two are important
max_delta_step = sample(5:10, 1) # Can help to focus error
# into a small range.
)
cv.nround <- 1000
cv.nfold <- 5 # 5-fold cross-validation
seed.number <- sample.int(10000, 1) # set seed for the cv
set.seed(seed.number)
# mdcv is common variable name for XGBoost cross validation
mdcv <- xgb.cv(data = dtrain, params = param,
nfold = cv.nfold, nrounds = cv.nround,
verbose = F, early_stopping_rounds = 15, maximize = FALSE)
min_rmse_index <- mdcv$best_iteration
min_rmse <- mdcv$evaluation_log[min_rmse_index]$test_rmse_mean
if (min_rmse < best_rmse) {
best_rmse <- min_rmse
best_rmse_index <- min_rmse_index
best_seednumber <- seed.number
best_param <- param
}
}
Our best parameters were:
# best_param
$objective        [1] "reg:linear"
$eval_metric      [1] "rmse"
$max_depth        [1] 9
$eta              [1] 0.06600695
$subsample        [1] 0.7827824
$colsample_bytree [1] 0.6870138
$min_child_weight [1] 9
$max_delta_step   [1] 8
# Train an xgboost model with the tuned parameters
watchlist <- list(train = dtrain, val = dval)
param <- list(eta = 0.06600695, # learning rate
nthread = -1, # number of parallel threads
object="reg:linear",
max.depth=9, # maximum tree depth
verbose = 0,
subsample = 0.7827824,
booster ='gbtree',
lambda = 1,
colsample_bytree = 0.6870138,
maximize = F,
min_child_weight = 9,
max_delta_step = 8)
Lastly, we retrain the model with the tuned parameters.
xgboostTrip <- xgb.train(params = param,
data = dtrain,
nrounds = 1000,
watchlist = watchlist,
early_stopping_rounds = 50,
eval_metric = 'rmse')
XBGpred <- predict(xgboostTrip, dval)
(XBG_error <- sqrt(mean((XBGpred - Val.Y)^2)))
## [1] 0.3379605
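On the validation set, the tuned XGBoost model achieves an RMSE of about 0.338 on the log scale, improving on both the lasso (0.522) and the random forest (0.421).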