Introduction to Kaggle - House Prices, Advanced Regression Techniques

 

Introduction

This tutorial will go through the steps to make an initial submission to Kaggle competition House Prices - Advanced Regression Techniques available at https://www.kaggle.com/c/house-prices-advanced-regression-techniques, using R for the analysis.

This competition is currently in the Getting Started Prediction category at Kaggle, there are no prizes to win, but it can be a great exercise for getting comfortable working with tabular data. I find that although (or because of?) it is an older competition, Kaggle competitions can be pretty competitive, and while the main focus of this tutorial is learning, I hope that this tutorial also could provide a good starting point for anyone new to this dataset intending to reach for higher scores. In summary this tutorial will cover

  • Basic data manipulation with dplyr.
  • Feature selection using recursive feature elimination.
  • Handling missing values using Random Forests with the package missForest.
  • Obtaining predictions using Random Forests and XGBoost.

I will be as explicit as possible regarding parameter values, models and corresponding score on Kaggle, providing a baseline for the reader to improve upon.

Note that in order to retrieve the data for this competition you need to register for a free Kaggle account and join this competition at the Kaggle website.

Getting started with Kaggle

If you are new to Kaggle, you can create a free account at kaggle.com, and register for the competition you want to participate in, for this tutorial you want to register for: House Prices - Advanced Regression Techniques.

The next step is to create a notebook on the Kaggle website with your solution (at the time of writing they support either R or Python notebooks), run it at their servers, and submit your solution to get the score on the test set for the leaderboard. You could also use the command line interface (CLI) to submit your solution, running your code at your own computer, or at a cloud computing service of choice.

Apart from the competition covered here, I found that the Titanic dataset is a good starting point, with many great tutorials available online. I recommend watching the official Kaggle introduction for the Titanic dataset on youtube, as a gateway to making your first submission: [How to Get Started with Kaggle’s Titanic Competition Kaggle](https://youtu.be/8yZMXCaFshs).

Personally, I prefer using the Kaggle CLI rather that the website, the CLI provides a quick and convenient way of interacting with Kaggle, making submissions, signing up for competitions, etc (you can send up to a total of 10 submissions/day using the website and CLI). Check out the official Kaggle documentation for their CLI here: How to Use Kaggle

The Ames Housing data

The dataset for this challenge consists of sale prices of residential property in Ames, Iowa, US, between 2006 and 2010. In total there are 2919 observations (rows) and 81 variables (columns) both numerical, categorical and discrete. The independent variable which we want to predict is the sale price in US dollars.

The result will be evaluated by Kaggle with the root mean squared error between the logarithmised price of the prediction you submit and the true sale price. The submission file should have two columns: property ID and SalePrice, for more details check out https://www.kaggle.com/c/house-prices-advanced-regression-techniques.

dplyr and DescTools

We will begin with loading both the test/train data and join them to make sure we do all initial data wrangling on all the data. We will make use of the dplyr package and Desctools, so we start by importing these, you might need to install them with e.g. Install.packages(“package_name”).

require("dplyr")
require("DescTools")
require("tidyr") # included in tidyverse and required for pivot_longer() used later

Data import and overview

Importing the data

Now we load the test/train data, which I have downloaded from the Kaggle site and placed in the same working directory as this script. Then, we merge the test and training data to one larger dataset to make sure we do all preprocessing/data wrangling on the complete dataset.

We also save the id’s of all observations in the test data to keep track of what is train and what is test.

train_orig = read.csv("train.csv", stringsAsFactors = T) # read training data
test_orig = read.csv("test.csv", stringsAsFactors = T) # read test data
test_id = test_orig$Id # get ID of rows corresponding to the test data
all_data = full_join(train_orig, test_orig) # join test/train data (this works since test/train have exactly the same type of columns)

Now, define a convenience function to get back the test and train sets from the full merged data set, and check the resulting number of rows and columns of our data.

get_tt = function(df){
  train = df %>% filter(!(Id %in% test_id)) %>% mutate(Id=NULL)
  test = df %>% filter(Id %in% test_id) %>% mutate(Id=NULL)
  return(list("train" = train, "test" = test))
}
res = get_tt(all_data)
cat("#Columns in data:", ncol(all_data), "\n")
## #Columns in data: 81
cat("#Rows in training data:", nrow(all_data), "\n")
## #Rows in training data: 2919
cat("#Rows in training data:", nrow(res$train), "\n")
## #Rows in training data: 1460
cat("#Rows in test data:",nrow(res$test), "\n")
## #Rows in test data: 1459

Overview of the data

We start by plotting the distribution of the dependent variable SalePrice, where we note that the distribution seems skewed towards the left.

hist(main="SalePrice", all_data$SalePrice, xlab="SalePrice ($)")

Interestingly, if we instead plot the log of the SalePrice, the distribution looks more normal, providing further motivation for using the logarithmised sale price rather than the value in dollars.

hist(main="SalePrice (log)", log(all_data$SalePrice), xlab="SalePrice (log)")

Now, with DescTools, we can get an overview of the whole dataset. The function Desc gives a summary of each variable, depending on the data type. As it gives us a very long list of output, we will instead use Abstract which only lists all variables with their name, type, missing values, along with the different levels for factors (Abstract is otherwise also printed by Desc)

#Desc(all_data, plotit=FALSE) # maybe too much information
Abstract(all_data, list.len=100)
## ----------------------------------------------------------------------------------------------------------------------------------------- 
## all_data
## 
## data frame:  2919 obs. of  81 variables
##      0 complete cases (0.0%)
## 
##   Nr  ColName        Class    NAs           Levels                                                         
##   1   Id             integer     .                                                                         
##   2   MSSubClass     integer     .                                                                         
##   3   MSZoning       factor      4 (0.1%)   (5): 1-C (all), 2-FV, 3-RH, 4-RL, 5-RM                         
##   4   LotFrontage    integer   486 (16.6%)                                                                 
##   5   LotArea        integer     .                                                                         
##   6   Street         factor      .          (2): 1-Grvl, 2-Pave                                            
##   7   Alley          factor   2721 (93.2%)  (2): 1-Grvl, 2-Pave                                            
##   8   LotShape       factor      .          (4): 1-IR1, 2-IR2, 3-IR3, 4-Reg                                
##   9   LandContour    factor      .          (4): 1-Bnk, 2-HLS, 3-Low, 4-Lvl                                
##   10  Utilities      factor      2 (0.1%)   (2): 1-AllPub, 2-NoSeWa                                        
##   11  LotConfig      factor      .          (5): 1-Corner, 2-CulDSac, 3-FR2, 4-FR3, 5-Inside               
##   12  LandSlope      factor      .          (3): 1-Gtl, 2-Mod, 3-Sev                                       
##   13  Neighborhood   factor      .          (25): 1-Blmngtn, 2-Blueste, 3-BrDale, 4-BrkSide, 5-ClearCr, ...
##   14  Condition1     factor      .          (9): 1-Artery, 2-Feedr, 3-Norm, 4-PosA, 5-PosN, ...            
##   15  Condition2     factor      .          (8): 1-Artery, 2-Feedr, 3-Norm, 4-PosA, 5-PosN, ...            
##   16  BldgType       factor      .          (5): 1-1Fam, 2-2fmCon, 3-Duplex, 4-Twnhs, 5-TwnhsE             
##   17  HouseStyle     factor      .          (8): 1-1.5Fin, 2-1.5Unf, 3-1Story, 4-2.5Fin, 5-2.5Unf, ...     
##   18  OverallQual    integer     .                                                                         
##   19  OverallCond    integer     .                                                                         
##   20  YearBuilt      integer     .                                                                         
##   21  YearRemodAdd   integer     .                                                                         
##   22  RoofStyle      factor      .          (6): 1-Flat, 2-Gable, 3-Gambrel, 4-Hip, 5-Mansard, ...         
##   23  RoofMatl       factor      .          (8): 1-ClyTile, 2-CompShg, 3-Membran, 4-Metal, 5-Roll, ...     
##   24  Exterior1st    factor      1 (0.0%)   (15): 1-AsbShng, 2-AsphShn, 3-BrkComm, 4-BrkFace, 5-CBlock, ...
##   25  Exterior2nd    factor      1 (0.0%)   (16): 1-AsbShng, 2-AsphShn, 3-Brk Cmn, 4-BrkFace, 5-CBlock, ...
##   26  MasVnrType     factor     24 (0.8%)   (4): 1-BrkCmn, 2-BrkFace, 3-None, 4-Stone                      
##   27  MasVnrArea     integer    23 (0.8%)                                                                  
##   28  ExterQual      factor      .          (4): 1-Ex, 2-Fa, 3-Gd, 4-TA                                    
##   29  ExterCond      factor      .          (5): 1-Ex, 2-Fa, 3-Gd, 4-Po, 5-TA                              
##   30  Foundation     factor      .          (6): 1-BrkTil, 2-CBlock, 3-PConc, 4-Slab, 5-Stone, ...         
##   31  BsmtQual       factor     81 (2.8%)   (4): 1-Ex, 2-Fa, 3-Gd, 4-TA                                    
##   32  BsmtCond       factor     82 (2.8%)   (4): 1-Fa, 2-Gd, 3-Po, 4-TA                                    
##   33  BsmtExposure   factor     82 (2.8%)   (4): 1-Av, 2-Gd, 3-Mn, 4-No                                    
##   34  BsmtFinType1   factor     79 (2.7%)   (6): 1-ALQ, 2-BLQ, 3-GLQ, 4-LwQ, 5-Rec, ...                    
##   35  BsmtFinSF1     integer     1 (0.0%)                                                                  
##   36  BsmtFinType2   factor     80 (2.7%)   (6): 1-ALQ, 2-BLQ, 3-GLQ, 4-LwQ, 5-Rec, ...                    
##   37  BsmtFinSF2     integer     1 (0.0%)                                                                  
##   38  BsmtUnfSF      integer     1 (0.0%)                                                                  
##   39  TotalBsmtSF    integer     1 (0.0%)                                                                  
##   40  Heating        factor      .          (6): 1-Floor, 2-GasA, 3-GasW, 4-Grav, 5-OthW, ...              
##   41  HeatingQC      factor      .          (5): 1-Ex, 2-Fa, 3-Gd, 4-Po, 5-TA                              
##   42  CentralAir     factor      .          (2): 1-N, 2-Y                                                  
##   43  Electrical     factor      1 (0.0%)   (5): 1-FuseA, 2-FuseF, 3-FuseP, 4-Mix, 5-SBrkr                 
##   44  X1stFlrSF      integer     .                                                                         
##   45  X2ndFlrSF      integer     .                                                                         
##   46  LowQualFinSF   integer     .                                                                         
##   47  GrLivArea      integer     .                                                                         
##   48  BsmtFullBath   integer     2 (0.1%)                                                                  
##   49  BsmtHalfBath   integer     2 (0.1%)                                                                  
##   50  FullBath       integer     .                                                                         
##   51  HalfBath       integer     .                                                                         
##   52  BedroomAbvGr   integer     .                                                                         
##   53  KitchenAbvGr   integer     .                                                                         
##   54  KitchenQual    factor      1 (0.0%)   (4): 1-Ex, 2-Fa, 3-Gd, 4-TA                                    
##   55  TotRmsAbvGrd   integer     .                                                                         
##   56  Functional     factor      2 (0.1%)   (7): 1-Maj1, 2-Maj2, 3-Min1, 4-Min2, 5-Mod, ...                
##   57  Fireplaces     integer     .                                                                         
##   58  FireplaceQu    factor   1420 (48.6%)  (5): 1-Ex, 2-Fa, 3-Gd, 4-Po, 5-TA                              
##   59  GarageType     factor    157 (5.4%)   (6): 1-2Types, 2-Attchd, 3-Basment, 4-BuiltIn, 5-CarPort, ...  
##   60  GarageYrBlt    integer   159 (5.4%)                                                                  
##   61  GarageFinish   factor    159 (5.4%)   (3): 1-Fin, 2-RFn, 3-Unf                                       
##   62  GarageCars     integer     1 (0.0%)                                                                  
##   63  GarageArea     integer     1 (0.0%)                                                                  
##   64  GarageQual     factor    159 (5.4%)   (5): 1-Ex, 2-Fa, 3-Gd, 4-Po, 5-TA                              
##   65  GarageCond     factor    159 (5.4%)   (5): 1-Ex, 2-Fa, 3-Gd, 4-Po, 5-TA                              
##   66  PavedDrive     factor      .          (3): 1-N, 2-P, 3-Y                                             
##   67  WoodDeckSF     integer     .                                                                         
##   68  OpenPorchSF    integer     .                                                                         
##   69  EnclosedPorch  integer     .                                                                         
##   70  X3SsnPorch     integer     .                                                                         
##   71  ScreenPorch    integer     .                                                                         
##   72  PoolArea       integer     .                                                                         
##   73  PoolQC         factor   2909 (99.7%)  (3): 1-Ex, 2-Fa, 3-Gd                                          
##   74  Fence          factor   2348 (80.4%)  (4): 1-GdPrv, 2-GdWo, 3-MnPrv, 4-MnWw                          
##   75  MiscFeature    factor   2814 (96.4%)  (4): 1-Gar2, 2-Othr, 3-Shed, 4-TenC                            
##   76  MiscVal        integer     .                                                                         
##   77  MoSold         integer     .                                                                         
##   78  YrSold         integer     .                                                                         
##   79  SaleType       factor      1 (0.0%)   (9): 1-COD, 2-Con, 3-ConLD, 4-ConLI, 5-ConLw, ...              
##   80  SaleCondition  factor      .          (6): 1-Abnorml, 2-AdjLand, 3-Alloca, 4-Family, 5-Normal, ...   
##   81  SalePrice      integer  1459 (50.0%)

There is a mixture of numerical and categorical variables, with some categorical variables (factors) having overlapping levels (e.g. GarageQual, GarageCond, FireplaceQu, …). Information of the meaning of all different features and levels can be found on Kaggle.

Importantly, we note several features have missing values, which we want to correct later, but before this we first check for errors related to the data type or input form when loading the data into R.

Data wrangling

Check numerical variables for errors

When loading our data, all non-numerical should have been loaded as Factors in R, while numerical columns are Numeric or integer.

To check if these datatypes makes sense for the respective features, we first look at all numerical variables to see if we can spot any obvious errors, utilizing the capabilities of the dplyr package, calculating mean, min, max and number of unique values for each column.

num_summary = all_data %>%
  summarise(across(where(~is.numeric(.)), # do this for numeric columns
                   list(mean = ~mean(.x, na.rm = TRUE), 
                        min = ~min(.x, na.rm = TRUE),
                        max = ~max(.x, na.rm = TRUE),
                        nUnique = ~length(unique(.x, na.rm = TRUE))))) %>%
  pivot_longer(cols = everything(),
               names_sep = "_",
               names_to  = c("variable", ".value"))
print(num_summary, n=100)
## # A tibble: 38 x 5
##    variable             mean   min    max nUnique
##    <chr>               <dbl> <int>  <int>   <int>
##  1 Id              1460          1   2919    2919
##  2 MSSubClass        57.1       20    190      16
##  3 LotFrontage       69.3       21    313     129
##  4 LotArea        10168.      1300 215245    1951
##  5 OverallQual        6.09       1     10      10
##  6 OverallCond        5.56       1      9       9
##  7 YearBuilt       1971.      1872   2010     118
##  8 YearRemodAdd    1984.      1950   2010      61
##  9 MasVnrArea       102.         0   1600     445
## 10 BsmtFinSF1       441.         0   5644     992
## 11 BsmtFinSF2        49.6        0   1526     273
## 12 BsmtUnfSF        561.         0   2336    1136
## 13 TotalBsmtSF     1052.         0   6110    1059
## 14 X1stFlrSF       1160.       334   5095    1083
## 15 X2ndFlrSF        336.         0   2065     635
## 16 LowQualFinSF       4.69       0   1064      36
## 17 GrLivArea       1501.       334   5642    1292
## 18 BsmtFullBath       0.430      0      3       5
## 19 BsmtHalfBath       0.0614     0      2       4
## 20 FullBath           1.57       0      4       5
## 21 HalfBath           0.380      0      2       3
## 22 BedroomAbvGr       2.86       0      8       8
## 23 KitchenAbvGr       1.04       0      3       4
## 24 TotRmsAbvGrd       6.45       2     15      14
## 25 Fireplaces         0.597      0      4       5
## 26 GarageYrBlt     1978.      1895   2207     104
## 27 GarageCars         1.77       0      5       7
## 28 GarageArea       473.         0   1488     604
## 29 WoodDeckSF        93.7        0   1424     379
## 30 OpenPorchSF       47.5        0    742     252
## 31 EnclosedPorch     23.1        0   1012     183
## 32 X3SsnPorch         2.60       0    508      31
## 33 ScreenPorch       16.1        0    576     121
## 34 PoolArea           2.25       0    800      14
## 35 MiscVal           50.8        0  17000      38
## 36 MoSold             6.21       1     12      12
## 37 YrSold          2008.      2006   2010       5
## 38 SalePrice     180921.     34900 755000     664

Comparing the list of numerical variables with the data description on Kaggle, we note that MSSubClass is actually a code for the type of residence (see data description on the Kaggle website), and should be a categorical variable (factor), not a numerical value, therefore we simply convert it to a factor as such, with the levels corresponding to the different residential codes for the type of property:

all_data$MSSubClass = as.factor(all_data$MSSubClass)

Further, at least one house has a garage from the future (GarageYrBlt=2207), lets set this value to the same year (YearBuilt) the house was built. In fact, lets do this for all garages built after 2010 (there is only one row like this though), using the mutate and ifelse functions from dplyr.

all_data = all_data %>%
  mutate(GarageYrBlt = ifelse(GarageYrBlt>2010, YearBuilt, GarageYrBlt))

Otherwise, except for the missing values which we save for later, I cannot spot any obvious error in these datasets from this very simple analysis, and we will move on.

Check categorical variables for errors

After looking at the numerical values, we move on to the categorical features in the dataset. We note that categorical values can be any of two following types:

  • Ordinal: Categorical variable with intrinsic order, can be placed on a scale. Example: height (short, average, tall).
  • Nominal: Categorical variable with no intrinsic order. Example: list of colors (blue, green, red).

For simplicity, we assume all categorical variables to be of the nominal type, despite this obviously not being true. Several of the categorical variables indicate some form of condition/quality ranging from worst to best, and it might be beneficial to leverage this information by converting these to e.g. (ordered) integers, putting these on a scale of what we think is from low to high value/SalePrice. Deeming this to be out of scope for this tutorial, such an exercise is left to the reader, as it seemed possible to get decent results assuming only nominal categories.

Otherwise, a quick comparison of the categorical variables and the data description reveals nothing wrong with any of the categorical variables as far as I can see. It could be that some values are mistyped but this is hard for me to judge, and I will not be modifying any of these values. However, several of the categorical values are still missing, and this is what we will focus on next. # Missing values Shifting our focus to the missing values in the data, we begin defining a function for calculating the percentage of missing values per column, to keep track of our progress:

perc_missing = function(x) colSums(is.na(x)/nrow(x))*100
pm = perc_missing(all_data)

# Missing values (%) per column, descending
print(pm[order(-pm)])
##        PoolQC   MiscFeature         Alley         Fence     SalePrice   FireplaceQu   LotFrontage   GarageYrBlt  GarageFinish 
##   99.65741692   96.40287770   93.21685509   80.43850634   49.98287085   48.64679685   16.64953751    5.44707091    5.44707091 
##    GarageQual    GarageCond    GarageType      BsmtCond  BsmtExposure      BsmtQual  BsmtFinType2  BsmtFinType1    MasVnrType 
##    5.44707091    5.44707091    5.37855430    2.80918123    2.80918123    2.77492292    2.74066461    2.70640630    0.82219938 
##    MasVnrArea      MSZoning     Utilities  BsmtFullBath  BsmtHalfBath    Functional   Exterior1st   Exterior2nd    BsmtFinSF1 
##    0.78794108    0.13703323    0.06851662    0.06851662    0.06851662    0.06851662    0.03425831    0.03425831    0.03425831 
##    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF    Electrical   KitchenQual    GarageCars    GarageArea      SaleType            Id 
##    0.03425831    0.03425831    0.03425831    0.03425831    0.03425831    0.03425831    0.03425831    0.03425831    0.00000000 
##    MSSubClass       LotArea        Street      LotShape   LandContour     LotConfig     LandSlope  Neighborhood    Condition1 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##    Condition2      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt  YearRemodAdd     RoofStyle      RoofMatl 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##     ExterQual     ExterCond    Foundation       Heating     HeatingQC    CentralAir     X1stFlrSF     X2ndFlrSF  LowQualFinSF 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##     GrLivArea      FullBath      HalfBath  BedroomAbvGr  KitchenAbvGr  TotRmsAbvGrd    Fireplaces    PavedDrive    WoodDeckSF 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##   OpenPorchSF EnclosedPorch    X3SsnPorch   ScreenPorch      PoolArea       MiscVal        MoSold        YrSold SaleCondition 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000

Except for SalePrice, which we want to predict, we note that only PoolQC, MiscFeature, Alley, Fence, FireplaceQu, LotFrontage have more than 10% missing values, (see official Kaggle description for the meaning of these variables), though there are many columns with some missing values:

cat("Number of columns in data set:", ncol(all_data), "\n")
## Number of columns in data set: 81
cat("Number of columns with NA's:", sum(0 != pm), "\n")
## Number of columns with NA's: 35

Several of the features with missing values will be remedied in the next section, noting that a missing value is actually a special value indicating a feature being absent. After this modification, the remaining missing values will then be imputed with a Random Forest algorithm.

Replacing NA’s with Absent

We note 34 features have missing values, ignoring dependent variable SalePrice.

According to the data description available at the Kaggle website, NA is a special value denoting absent/none for the following categories: FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PoolQC, Fence, MiscFeature, Alley, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2. In theory it is possible that some of these NA’s actually are true missing values, data which was not collected, but we will follow the description and replace all NA values with the category ‘absent’, meaning that an NA indicate lack of e.g. a basement, garage, for the columns mentioned above.

# define name of columns for which we want to replace NA's with the new category 'Absent'
na_means_absence = c("FireplaceQu", "GarageType", "GarageFinish", "GarageQual", "GarageCond", "PoolQC", "Fence",
"MiscFeature", "Alley", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2")

# helper function for replacing NA's with Absent
fact_replace_na = function(x){
  x =  as.character(x)
  x =  if_else(is.na(x), "Absent", x)
  x = as.factor(x)
}

# apply 'fact_replace_na' to columns listed in 'na_means_absent'
all_data = all_data %>%
  mutate_at(na_means_absence, fact_replace_na)

Now, we have 21 columns with missing values instead of 31:

pm = perc_missing(all_data)
cat("Number of columns with NA's (after replacing NA's with 'absent'):", sum(0 != pm), "\n")
## Number of columns with NA's (after replacing NA's with 'absent'): 21

Check dataset consistency

Let’s do a quick check on the subset of columns defined above to see if the dataset is consistent, by noting that if one of the garage variables indicates an absence, the rest of the garage-related variables should also have the value absent.

garage_c = c("GarageType", "GarageFinish", "GarageQual", "GarageCond")
# extract indices of inconsistent rows (if any)
inconsistent_rows = which(apply(all_data[,garage_c] == "Absent", 1, any) & apply(all_data[,garage_c] != "Absent", 1, any))
all_data[inconsistent_rows,] %>% select(contains("garage"))
##      GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond
## 2127     Detchd          NA       Absent          1        360     Absent     Absent
## 2577     Detchd          NA       Absent         NA         NA     Absent     Absent

Two observations have a detached (Detchd) GarageType, while other columns indicates absence. It seems possible that a ‘detached’ garage can be seen as also being absent, or semi-absent, depending on how you view it, and here we will keep these data points as they are for the model.

Let’s also look at the basement columns instead and see if they seem consistent.

bsmt_c=c("BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2")

# extract indices of inconsistent rows (if any)
inconsistent_rows = which(apply(all_data[,bsmt_c] == "Absent", 1, any) & apply(all_data[,bsmt_c] != "Absent", 1, any))

all_data[inconsistent_rows,] %>% select(contains("bsmt"))
##      BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
## 333        Gd       TA           No          GLQ       1124       Absent        479      1603        3206            1            0
## 949        Gd       TA       Absent          Unf          0          Unf          0       936         936            0            0
## 1488       Gd       TA       Absent          Unf          0          Unf          0      1595        1595            0            0
## 2041       Gd   Absent           Mn          GLQ       1044          Rec        382         0        1426            1            0
## 2186       TA   Absent           No          BLQ       1033          Unf          0        94        1127            0            1
## 2218   Absent       Fa           No          Unf          0          Unf          0       173         173            0            0
## 2219   Absent       TA           No          Unf          0          Unf          0       356         356            0            0
## 2349       Gd       TA       Absent          Unf          0          Unf          0       725         725            0            0
## 2525       TA   Absent           Av          ALQ        755          Unf          0       240         995            0            0

In total 9 rows show the basement as being absent while some of the other columns indicates that the basement exists, by giving scores for e.g., height (BsmtQual) or condition (BsmtCond). Sometimes a house can have more than one basement with one of them unfinished (=absent?), and I found it difficult to see just from the data what is correct here (which column we should be trust). Since there are only 9 rows it probably won’t to much harm to keep them, and as these values could contain some important signal relating to quality/building progress, therefore I will leave these unchanged.

Considering the size of this dataset (>2900 observations), the sample features we looked at seemed fairly consistent, where Absent-type garages were almost always followed by absent/unfinished/detached status in related columns. For our purposes we will not dig any deeper into this, but we note that some gain might be had by making sure that the entire dataset is consistent, by e.g. changing all related values to absent if at least one value indicates the absence of a feature, or by some other strategy.

Despite this, we will treat the status of our dataset at this point to be “good enough” and we will now move on to impute the rest of the missing values using random forests.

Imputation with missForest

First, we see that although many (=21) features still have missing values, only LotFrontage and GarageYrBlt have more than 1% missing, after replacing NA’s with Absent where applicable according to description (ignoring the target variable SalePrice).

pm = perc_missing(all_data)
print(pm[pm>1]) # more than 1% missing
## LotFrontage GarageYrBlt   SalePrice 
##   16.649538    5.447071   49.982871
print(pm[pm>0]) # features with missing values
##     MSZoning  LotFrontage    Utilities  Exterior1st  Exterior2nd   MasVnrType   MasVnrArea   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF 
##   0.13703323  16.64953751   0.06851662   0.03425831   0.03425831   0.82219938   0.78794108   0.03425831   0.03425831   0.03425831 
##  TotalBsmtSF   Electrical BsmtFullBath BsmtHalfBath  KitchenQual   Functional  GarageYrBlt   GarageCars   GarageArea     SaleType 
##   0.03425831   0.03425831   0.06851662   0.06851662   0.03425831   0.06851662   5.44707091   0.03425831   0.03425831   0.03425831 
##    SalePrice 
##  49.98287085

We will now impute the remaining missing values with the missForest library.

missForest uses random forests to predict values for the NA’s in the dataset. First we load the package, install it if needed (it is available at CRAN).

#install.packages("missForest")
require(missForest)
## Loading required package: missForest

## Loading required package: randomForest

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## Loading required package: foreach

## Loading required package: itertools

## Loading required package: iterators

Important parameters that we will change here the Random Forest algorithm are ntree, the number of trees in each forest, maxiter, the maximum number of iterations, but see also the documentation (the one you get by typing ?missForest in R) for more information.

We will reduce the number of trees to 10 (from 100) and keep maxiter at 10, and having all other parameters at their default values, as this seems to work well for the current problem.

# Impute missing values, excluding the target column (SalePrice)
imp_result = missForest(select(all_data, -SalePrice), maxiter=10, ntree=10) # !!!
##   missForest iteration 1 in progress...

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, : The response has five or fewer unique values. Are you
## sure you want to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, : The response has five or fewer unique values. Are you
## sure you want to do regression?

## done!
##   missForest iteration 2 in progress...

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, : The response has five or fewer unique values. Are you
## sure you want to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, : The response has five or fewer unique values. Are you
## sure you want to do regression?

## done!
##   missForest iteration 3 in progress...

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, : The response has five or fewer unique values. Are you
## sure you want to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, : The response has five or fewer unique values. Are you
## sure you want to do regression?

## done!
##   missForest iteration 4 in progress...

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, : The response has five or fewer unique values. Are you
## sure you want to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, : The response has five or fewer unique values. Are you
## sure you want to do regression?

## done!
##   missForest iteration 5 in progress...

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, : The response has five or fewer unique values. Are you
## sure you want to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, : The response has five or fewer unique values. Are you
## sure you want to do regression?

## done!
##   missForest iteration 6 in progress...

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, : The response has five or fewer unique values. Are you
## sure you want to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, : The response has five or fewer unique values. Are you
## sure you want to do regression?

## done!
##   missForest iteration 7 in progress...

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, : The response has five or fewer unique values. Are you
## sure you want to do regression?

## Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry = mtry, : The response has five or fewer unique values. Are you
## sure you want to do regression?

## done!
# ' missForest additionally gives us the out-of-bag (OOB) errors calculated by comparing imputed values to
# ' the values in the training data, calculated as the normalized root mean squared error (NRMSE)
# ' for continuous variables and proportion of falsely classified (PFC) for the
# ' categorical variables, providing an indication of the overall imputation error.
imp_result$OOBerror
##      NRMSE        PFC 
## 0.03336438 0.03371619

The error rate introduced by imputation of the missing data is estimated at about 4%, which does not seem to bad. We save this result and verify that no values are missing.

data_imp = imp_result$ximp

cat("Columns with missing values:", sum(0 != perc_missing(data_imp)), "\n")
## Columns with missing values: 0
# add back dependent variable SalePrice to the data
data_imp$SalePrice = all_data$SalePrice

After some rudimentary data cleaning and imputation of missing values we now we have a complete dataset (no missing values), and are ready to move on to the modelling.

Model evaluation

We will model the data with both Random Forests and XGBoost. Random Forests works by combining (bagging) the predictions of several classifiers, each trained on different part of the dataset.

XGBoost is similar in that it also a collection of several classifiers, but unlike Random Forest, classifiers/boosters are added sequentially, with each booster training on the errors/residuals of the previous booster, rather than the errors of the full data set.

Importantly, each tree in a Random Forest are independent, adding more trees, growing larger forests is unlikely to result in overfitting. For XGBoost, adding more boosters could eventually result in overfitting. To check for overfitting, we will split our training dataset to training and validation as outlined below.

Defining the metric

The score on the Kaggle leaderboard is evaluated by using a root mean square error (RMSE) on the logarithmic difference between the true sale price and our prediction.

Here, we define a function for evaluating the accuracy given a model/fit to the training data, and the actual data/sale price. By default we logarithmise the SalePrice to correspond with the error used to evaluate performance on Kaggle.

If no values for the targets are given, we assume that the targets can be obtained from the data object as data$SalePrice. Further, if the keyword all_errors=TRUE is given, the individual (squared) errors will be return along with total RMSE of the errors. Note this function is not at all general, and is written for convenience for this tutorial,

calc_RMSE = function(fit, data, target=NULL, log=TRUE, all_errors=FALSE){
  pred = predict(fit, data, se.fit=FALSE)
  if(is.null(target)) target = data$SalePrice
  if(log){
    pred = log(pred)
    target = log(target)}
  errors = (pred-target)^2
  error = sqrt(mean(errors))
  if(all_errors) return(list("error" = error, "errors" = errors))
  else return(error)
}

Training/validation

Then, define a function for splitting the training data further to training/validation, using 80% for the training and 20% for validation.

split_data = function(df, f=0.8){
  train.i = sample(nrow(df), f*nrow(df), replace = FALSE)
  return(list("train" = df[train.i,], "val" = df[-train.i,]))
}
tt = get_tt(data_imp) # first split all data to training/test
s = split_data(tt$train) # the split training to training/validation

Random forests

Creating the model

Now we define and run a random forest model, using the randomForest package. We set the number of trees in each forest ntree=200, the minimum size of end nodes nodesize=10. Further we set importance=TRUE to also save the importance scores for each of the variables (see section on Feature selection below)

require(randomForest)
rf.1 = randomForest(SalePrice ~ ., data=s$train, importance=TRUE, nodesize=10, ntree=200)

Next we evaluate our model by calculating the RMSE (Log) on both train and validation.

calc_RMSE(rf.1, s$train, log=TRUE)
## [1] 0.07046209
calc_RMSE(rf.1, s$val, log=TRUE)
## [1] 0.1616448

When I run it, I get a higher error for the validation set, however, the first two significant digits of the error for both the training/validation does not change when increasing the number of trees ntree by a factor 10 (simulations not shown here), suggesting that having more trees than 200 will not result in any significant improvement, but also that adding more trees does not reduce the performance by overfitting.

rf.11 = randomForest(SalePrice ~ ., data=s$train, importance=TRUE, nodesize=10, ntree=2000)
calc_RMSE(rf.11, s$train, log=TRUE)
## [1] 0.06942862
calc_RMSE(rf.11, s$val, log=TRUE)
## [1] 0.1603187

With this, we now have a baseline result from the Random Forest model above (using ntree=200). To further analyse the Random Forest model, lets plot the (squared) errors as a function of (log) SalePrice to see if there is any region in SalePrice where the model performs best/worst.

plot(log(s$train$SalePrice), calc_RMSE(rf.1, s$train, log=TRUE, all_errors=TRUE)$errors,
     xlab="SalePrice (log)", ylab="Squared log errors")

From the plot we see a (very slightly) bowl-shaped error distribution biased to the left, with higher errors for low sale prices, but also some large outlier errors throughout the range of all sale prices. One might speculate that model does not prioritize the lower values of SalePrice, since these correspond to small absolute errors with the root mean square error metric used by randomForest, while they in the log-space used by Kaggle correspond to larger relative errors.

However, my experience was that log-transforming the SalePrice did not reduce the error for the Random Forest model (code not shown here, feel free to re-run model with log(SalePrice) instead and see if any improvement is possible).

Recursive feature selection

Now, we will try to find variables which can be excluded from the analysis using recursive feature selection. This will be done in an attempt to simplify the dataset and avoid overfitting.

To this end we use the same Random Forest model as above, but extracting the importance scores for each variable to see which variables are important for the accuracy of our prediction of SalePrice on the validation set. The method randomForest in R calculates two importance scores, the first one relating to the mean decrease in the mean squared error (MSE), and the second one relating to the decrease in average node purity/Gini index. As I am not sure on how to interpret the importance score for Gini index, I will instead focus on the decrease in MSE.

A positive value for the MSE importance score %IncMSE reported by randomForest means that, when the corresponding variable is replaced with random permutation, the average MSE increases by the indicated amount. We plot the importance score (from the Random Forest model trained above) with varImpPlot(), and retrieve them with importance(),

varImpPlot(main="Variable Importance", rf.1)

imp = importance(rf.1)
print(imp)
##                   %IncMSE IncNodePurity
## MSSubClass    12.84605639  8.281970e+10
## MSZoning       3.74508008  6.853436e+09
## LotFrontage    4.39107307  3.709595e+10
## LotArea        6.58381049  7.434984e+10
## Street         1.00250941  3.941767e+07
## Alley          1.91837963  1.941784e+09
## LotShape       1.64028418  7.440053e+09
## LandContour   -1.58974431  7.141523e+09
## Utilities      0.00000000  8.851323e+06
## LotConfig      0.96648222  1.004356e+10
## LandSlope      0.43879877  2.259115e+09
## Neighborhood  16.76391123  8.098671e+11
## Condition1     1.17208872  4.148986e+09
## Condition2     0.28846169  1.699123e+08
## BldgType       3.34385130  3.175804e+09
## HouseStyle     4.06058325  6.876994e+09
## OverallQual   14.24597138  1.611958e+12
## OverallCond    4.00035721  2.001098e+10
## YearBuilt      5.28863450  5.641135e+10
## YearRemodAdd   5.26719822  3.938568e+10
## RoofStyle      0.51544384  5.791698e+09
## RoofMatl       1.49648998  3.706796e+09
## Exterior1st    4.51961267  3.649105e+10
## Exterior2nd    5.37819487  4.433319e+10
## MasVnrType     2.81127648  6.159362e+09
## MasVnrArea     2.32293162  3.515422e+10
## ExterQual      8.39209792  5.731137e+11
## ExterCond      1.51761350  3.469629e+09
## Foundation     5.09076149  4.319645e+09
## BsmtQual       6.54930513  1.796095e+11
## BsmtCond       3.57606675  6.405939e+09
## BsmtExposure   4.16428480  1.810295e+10
## BsmtFinType1   5.62280387  2.316542e+10
## BsmtFinSF1     9.87434266  1.490848e+11
## BsmtFinType2   1.21489582  4.703409e+09
## BsmtFinSF2     1.02294339  3.614480e+09
## BsmtUnfSF      6.30861686  2.554345e+10
## TotalBsmtSF    9.64263602  2.681220e+11
## Heating       -1.12323552  3.641040e+08
## HeatingQC      1.26481158  6.323270e+09
## CentralAir     6.30340765  1.425191e+10
## Electrical     0.07832025  1.122232e+09
## X1stFlrSF     11.18438187  2.456017e+11
## X2ndFlrSF      8.20065538  1.003543e+11
## LowQualFinSF  -0.09338836  2.047265e+09
## GrLivArea     21.89810237  7.764230e+11
## BsmtFullBath   4.06199330  7.935079e+09
## BsmtHalfBath  -0.69358267  7.212215e+08
## FullBath       4.39229459  4.925970e+10
## HalfBath       2.16870203  8.200302e+09
## BedroomAbvGr   3.82840961  1.040001e+10
## KitchenAbvGr   1.82768672  1.993486e+09
## KitchenQual    4.95723833  1.963257e+11
## TotRmsAbvGrd   3.52734605  6.866472e+10
## Functional     2.51809690  5.000037e+09
## Fireplaces     3.84601388  2.952625e+10
## FireplaceQu    8.41291068  6.493834e+10
## GarageType     6.78367760  3.513616e+10
## GarageYrBlt    6.94059095  2.097304e+10
## GarageFinish   4.94465944  4.538122e+10
## GarageCars     6.04108289  4.898588e+11
## GarageArea    10.04695501  2.343051e+11
## GarageQual     3.20603381  6.672819e+09
## GarageCond     2.99877944  5.173845e+09
## PavedDrive     0.72832423  2.787904e+09
## WoodDeckSF     3.76277047  1.768825e+10
## OpenPorchSF    4.16496291  2.400444e+10
## EnclosedPorch  0.21423929  4.136588e+09
## X3SsnPorch     1.40214201  1.238475e+09
## ScreenPorch   -0.29992571  5.181233e+09
## PoolArea       0.00000000  9.404562e+08
## PoolQC        -1.00250941  1.351085e+09
## Fence          0.80252563  2.307690e+09
## MiscFeature   -1.25845352  4.159301e+08
## MiscVal        1.14971518  5.949758e+08
## MoSold        -1.72093675  1.094121e+10
## YrSold        -0.76470137  6.650601e+09
## SaleType       2.59370364  8.831711e+09
## SaleCondition  3.25312488  6.618408e+09

It seems that there are many variables making small contributions to the total prediction, and we keep as many as possible as we there are many features and we haven’t fully explored the interdependencies.

However, some variables actually correspond to a reduction of the error if they are ignored/changed, as they have a negative value for the reported %IncMSE. We will here remove all variable with negative importance, using recursive feature elimination. I was not able to find such a function in R, so instead we will do the feature elimination ourselves.

To this end, we will create a loop where we for each iteration train a Random Forest model and find the variable with the lowest importance. If this importance is less than zero, we remove the feature from the dataset, and start a new iteration, until all variables have a positive importance.

m_data = data_imp # make a copy of the imputed data
min_imp = -1 # to run loop at least one time

while(min_imp<0){
  tt = get_tt(m_data) # split all data to training/test (again)
  s = split_data(tt$train) # split training to training/validation (again)
  rf.temp = randomForest(SalePrice ~ ., data=s$train, importance=TRUE, nodesize=10, ntree=200)
  imp = importance(rf.temp)
  imp_worst = which.min(imp[,1])
  min_imp = imp[imp_worst,1]
  print("Variable with least importance:")
  print(imp_worst)
  cat("Min. importance:", min_imp, "\n" )
  if(min_imp<0){
  m_data = m_data %>% select(-imp_worst)
  m_data$SalePrice = all_data$SalePrice # new dataset with SalePrice added back
  m_data$Id = all_data$Id # new dataset with Id added back
  cat("Variables in original dataset:", ncol(all_data),
      ". Variables in new dataset:", ncol(m_data), "\n")
  }
  else{
    cat("No variable removed.\n")
  }
}
## [1] "Variable with least importance:"
## Electrical 
##         42 
## Min. importance: -1.937437 
## Variables in original dataset: 81 . Variables in new dataset: 80 
## [1] "Variable with least importance:"
## PoolArea 
##       70 
## Min. importance: -1.633364 
## Variables in original dataset: 81 . Variables in new dataset: 79 
## [1] "Variable with least importance:"
## LotConfig 
##        10 
## Min. importance: -1.740129 
## Variables in original dataset: 81 . Variables in new dataset: 78 
## [1] "Variable with least importance:"
## LowQualFinSF 
##           43 
## Min. importance: -1.342515 
## Variables in original dataset: 81 . Variables in new dataset: 77 
## [1] "Variable with least importance:"
## RoofMatl 
##       21 
## Min. importance: -1.844842 
## Variables in original dataset: 81 . Variables in new dataset: 76 
## [1] "Variable with least importance:"
## MiscFeature 
##          69 
## Min. importance: -2.121014 
## Variables in original dataset: 81 . Variables in new dataset: 75 
## [1] "Variable with least importance:"
## YrSold 
##     71 
## Min. importance: -1.533271 
## Variables in original dataset: 81 . Variables in new dataset: 74 
## [1] "Variable with least importance:"
## RoofMatl 
##       20 
## Min. importance: -2.072388 
## Variables in original dataset: 81 . Variables in new dataset: 73 
## [1] "Variable with least importance:"
## YrSold 
##     69 
## Min. importance: -2.288397 
## Variables in original dataset: 81 . Variables in new dataset: 72 
## [1] "Variable with least importance:"
## Street 
##      5 
## Min. importance: -1.740176 
## Variables in original dataset: 81 . Variables in new dataset: 71 
## [1] "Variable with least importance:"
## LowQualFinSF 
##           39 
## Min. importance: -1.422772 
## Variables in original dataset: 81 . Variables in new dataset: 70 
## [1] "Variable with least importance:"
## MiscFeature 
##          65 
## Min. importance: -1.557098 
## Variables in original dataset: 81 . Variables in new dataset: 69 
## [1] "Variable with least importance:"
## Street 
##      4 
## Min. importance: -1.421266 
## Variables in original dataset: 81 . Variables in new dataset: 68 
## [1] "Variable with least importance:"
## RoofMatl 
##       17 
## Min. importance: -1.746962 
## Variables in original dataset: 81 . Variables in new dataset: 67 
## [1] "Variable with least importance:"
## MiscFeature 
##          62 
## Min. importance: -1.671692 
## Variables in original dataset: 81 . Variables in new dataset: 66 
## [1] "Variable with least importance:"
## LowQualFinSF 
##           36 
## Min. importance: -1.253548 
## Variables in original dataset: 81 . Variables in new dataset: 65 
## [1] "Variable with least importance:"
## LowQualFinSF 
##           35 
## Min. importance: -1.832373 
## Variables in original dataset: 81 . Variables in new dataset: 64 
## [1] "Variable with least importance:"
## MiscFeature 
##          59 
## Min. importance: -1.730302 
## Variables in original dataset: 81 . Variables in new dataset: 63 
## [1] "Variable with least importance:"
## Condition2 
##         11 
## Min. importance: -1.415822 
## Variables in original dataset: 81 . Variables in new dataset: 62 
## [1] "Variable with least importance:"
## Condition2 
##         10 
## Min. importance: -1.46205 
## Variables in original dataset: 81 . Variables in new dataset: 61 
## [1] "Variable with least importance:"
## Condition2 
##          9 
## Min. importance: -1.785374 
## Variables in original dataset: 81 . Variables in new dataset: 60 
## [1] "Variable with least importance:"
## LotConfig 
##         7 
## Min. importance: -1.75339 
## Variables in original dataset: 81 . Variables in new dataset: 59 
## [1] "Variable with least importance:"
## LowQualFinSF 
##           30 
## Min. importance: -2.135262 
## Variables in original dataset: 81 . Variables in new dataset: 58 
## [1] "Variable with least importance:"
## Condition2 
##          7 
## Min. importance: -1.721953 
## Variables in original dataset: 81 . Variables in new dataset: 57 
## [1] "Variable with least importance:"
## Street 
##      3 
## Min. importance: -1.002509 
## Variables in original dataset: 81 . Variables in new dataset: 56 
## [1] "Variable with least importance:"
## LowQualFinSF 
##           27 
## Min. importance: -1.038828 
## Variables in original dataset: 81 . Variables in new dataset: 55 
## [1] "Variable with least importance:"
## RoofMatl 
##       10 
## Min. importance: -2.216398 
## Variables in original dataset: 81 . Variables in new dataset: 54 
## [1] "Variable with least importance:"
## Street 
##      2 
## Min. importance: -1.681009 
## Variables in original dataset: 81 . Variables in new dataset: 53 
## [1] "Variable with least importance:"
## RoofMatl 
##        8 
## Min. importance: -1.492722 
## Variables in original dataset: 81 . Variables in new dataset: 52 
## [1] "Variable with least importance:"
## Condition2 
##          4 
## Min. importance: -2.055811 
## Variables in original dataset: 81 . Variables in new dataset: 51 
## [1] "Variable with least importance:"
## Condition2 
##          3 
## Min. importance: -2.264846 
## Variables in original dataset: 81 . Variables in new dataset: 50 
## [1] "Variable with least importance:"
## Condition2 
##          2 
## Min. importance: -2.660484 
## Variables in original dataset: 81 . Variables in new dataset: 49 
## [1] "Variable with least importance:"
## Condition2 
##          1 
## Min. importance: -1.478165 
## Variables in original dataset: 81 . Variables in new dataset: 49 
## [1] "Variable with least importance:"
## RoofMatl 
##        4 
## Min. importance: -1.189817 
## Variables in original dataset: 81 . Variables in new dataset: 48 
## [1] "Variable with least importance:"
## Condition2 
##          1 
## Min. importance: -1.802141 
## Variables in original dataset: 81 . Variables in new dataset: 47 
## [1] "Variable with least importance:"
## LowQualFinSF 
##           18 
## Min. importance: -0.1308753 
## Variables in original dataset: 81 . Variables in new dataset: 46 
## [1] "Variable with least importance:"
## YrSold 
##     42 
## Min. importance: -1.192609 
## Variables in original dataset: 81 . Variables in new dataset: 45 
## [1] "Variable with least importance:"
## MiscFeature 
##          41 
## Min. importance: -0.8270808 
## Variables in original dataset: 81 . Variables in new dataset: 44 
## [1] "Variable with least importance:"
## BsmtFinType2 
##           15 
## Min. importance: 1.160883 
## No variable removed.

Now let’s calculate the RMSE (log) of our model with the new set of features to compare it to the model with the original set.

cat("Variables in original dataset:", ncol(all_data),
    ". Variables in new dataset:", ncol(m_data), "\n")
## Variables in original dataset: 81 . Variables in new dataset: 44
calc_RMSE(rf.temp, s$train, log=TRUE)
## [1] 0.08389441
calc_RMSE(rf.temp, s$val, log=TRUE)
## [1] 0.1741376

Interestingly, our model with removed variables performed worse on the validation compared to the original data set, despite only removing features with negative importance scores. Maybe this suggests large interdependence effects across the variables, with certain combinations of variables being important, relations cannot be seen when looking at the variables one by one.

Since we got a worse result after feature selection, we are obliged to keep the original variable set for the final model.

Submitting predictions to Kaggle

For the final Kaggle submission, we will use the full training data set (training+validation) for fitting the model and increase the number of trees to ntree=2000.

tt = get_tt(data_imp) # split all data to training/test (again)

rf.2 = randomForest(SalePrice ~ ., data=tt$train, nodesize=10, ntree=2000)
print(rf.2)
## 
## Call:
##  randomForest(formula = SalePrice ~ ., data = tt$train, nodesize = 10,      ntree = 2000) 
##                Type of random forest: regression
##                      Number of trees: 2000
## No. of variables tried at each split: 26
## 
##           Mean of squared residuals: 771053379
##                     % Var explained: 87.77
calc_RMSE(rf.2, tt$train, log=TRUE)
## [1] 0.07160225

Looks similar to the result we got before on the training data (where we keept some observations for validation). We now use this model to get the predictions for SalePrince on the test set,

pred.test = predict(rf.2, tt$test)

Before we submit our results, we plot the distribution of the predicted (logarithmised) SalePrice in our test set (red) on top of the distribution of the (logarithmised) SalePrice in the original training data (blue), to make sure everything looks okay.

hist(log(tt$train$SalePrice), main="Training (blue), test (red) predictions", xlab = "SalePrice (log)", freq=FALSE, col=scales::alpha('blue',0.5), ylim=c(0,1.5))
hist(log(pred.test),  freq=FALSE, add=T, col=scales::alpha('red',.5))

We note that the distributions look fairly similar, although there seem to be less outliers, high/low sale prices for the test predictions (red) compared to the training data (blue), but not sure if this is important.

Let’s save our test predictions and the corresponding ID’s (which we stored in the variable test_id at the beginning) as submission.csv, this file can then be submitted to Kaggle, I got the very modest score 0.150 using this model.

subm = cbind(test_id, pred.test)
colnames(subm) = c("Id", "SalePrice")
write.csv(subm, "submission.csv", row.names=FALSE)

EDIT: When submitting the result to kaggle, I use the kaggle CLI. Whenever you make a submission you also have to provide the name of the competition you want to provide a solution for, the file you want to send, and a message, which I like to set to something as descriptive as possible to keep track of what I’ve tried. Follow the information obtained by running kaggle --help for more information. For me, I would likely submit the above result with

# kaggle competitions submit -c "house-prices-advanced-regression-techniques" -f "submission.csv" -m "randomForest ntree=2000 nodesize=10"

XGBoost

We will also train a model based on XGBoost. We begin by importing the xgboost package (install if need be, it is available at CRAN) which provide an R interface to the XGBoost algorithm.

require(xgboost)
## Loading required package: xgboost

## 
## Attaching package: 'xgboost'

## The following object is masked from 'package:dplyr':
## 
##     slice

XGBoost, short for extreme gradient boosting, uses boosting which sequentially adds models that trains on/corrects the errors made by previous models. As we will see it can be a powerful method able to bring about great results at relatively low computational cost, although with a risk of overfitting.

Data preparation

We start by retrieving new training/validation/test data sets. Here, we need to consider that xgboost, unlike randomForest, only accepts numerical variables, as such we need to somehow recode our categorical values into integers. This can be done e.g., using one-hot encoding, where the presence of each category is indicated by a 1 and its absence is indicated by a 0.

Fortunately for us, this is very easy to do with the package fastDummies.

tt = get_tt(data_imp) # get training/test from imputed data

# one-hot encoding of categorical variables
require("fastDummies")
## Loading required package: fastDummies
tt$train = dummy_cols(tt$train, remove_selected_columns =TRUE, remove_first_dummy=TRUE)
tt$test = dummy_cols(tt$test, remove_selected_columns   =TRUE, remove_first_dummy=TRUE)

When creating the dummy columns we set remove_selected_columns=TRUE to make sure we also discard the old columns with factors. I also set the remove_first_dummy=TRUE, meaning that the first category is removed for each variable, as this somewhat reduces the amount of columns in our final matrix. Let’s continue and get the train/validation/test sets.

s = split_data(tt$train) # split data to train/test
train = s$train
val = s$val
test = tt$test

The package xgboost wants data in the DMatrix format. A DMatrix can be created from a matrix X of features (independent variables) and vector y of the corresponding targets (dependent variable).

To this end we will extract X (features) and y (target) for the training/validation/test sets and create the DMatrices we need for xgboost.

For xgboost, I found better results logarithmising the SalePrice before training the model, so this is what we’ll be doing here.

train_y = log(train$SalePrice)
train_X = data.matrix(subset(train, select = - SalePrice))

val_y = log(val$SalePrice)
val_X = data.matrix(subset(val, select = - SalePrice))

test_y = log(test$SalePrice)
test_X = data.matrix(subset(test, select = - SalePrice))

dtrain = xgb.DMatrix(data = train_X, label = train_y)
dtest = xgb.DMatrix(data = test_X, label = test_y)

Defining the model

Now we set the parameters for running xgboost. The ones we will set here is max_depth, maximum depth of each tree (default=6), lambda and alpha which are the regularization strengths for L1 and L2 regularization (default=0 for both), min_child_weight which for linear regression corresponds to the minimum number of instances in each node (default=1), eta meaning the learning rate (default=0.3). For the objective function, I found the squarederror to work well (objective=reg:squarederror).

We begin by setting the parameters at the default values and fit the model, to get a baseline result.

xgb.paras=list(max_depth=6,lambda=0.0,alpha=0.0,min_child_weight=0, booster ="gbtree",
               eta=0.3, objective="reg:squarederror")
xgb.fit0 = xgboost(data = dtrain, nrounds=300,params=xgb.paras, nthread = 6, verbose=0)

Then calculate the error on both training and validation

calc_RMSE(xgb.fit0, train_X, target=train_y, log=FALSE)
## [1] 0.001065989
calc_RMSE(xgb.fit0, val_X, target=val_y, log=FALSE)
## [1] 0.133795

Note that although we were able to get a very low error on the training data, the error on the validation was much higher. After some behind-the-scenes experimenting/scanning of parameter values, I found the parameters below to work better, providing a lower error on the validation set, feel free to see if you can find any improvements.

xgb.paras=list(max_depth=3,lambda=0.01,alpha=0.01,min_child_weight=0, booster ="gbtree",
               eta=0.01, objective="reg:squarederror")
xgb.fit = xgboost(data = dtrain, nrounds=2000,params=xgb.paras, nthread = 6, verbose=0)

Again, when calculating the errors for training and validation, the above parameters yields a somewhat lower validation error. As the training error also is higher, it might be that we avoided some overfitting, indicating a better model with less variance.

calc_RMSE(xgb.fit, train_X, target=train_y, log=FALSE)
## [1] 0.06326491
calc_RMSE(xgb.fit, val_X, target=val_y, log=FALSE)
## [1] 0.1187995

Now, we analyse the model by plotting the log-errors as a function of the log-SalePrice, as was done for the Random Forests. In the validation set, the errors seem evenly distributed, with no clear bias towards being better at predicting high/low sale prices, but with some outliers where the prediction was off by larger amounts.

plot(train_y, calc_RMSE(xgb.fit, train_X, target=train_y, log=TRUE, all_errors=TRUE)$errors,
     xlab="SalePrice (log)", ylab="root mean square log errors")

plot(val_y, calc_RMSE(xgb.fit, val_X, target=val_y, log=TRUE, all_errors=TRUE)$errors,
     xlab="SalePrice (log)", ylab="root mean square log errors")

Create final model and submit result

For the final model used in the Kaggle submission, we train on all available training data (training + validation) with the same parameters as the best model found above.

trainall_y = log(tt$train$SalePrice)
trainall_X = data.matrix(subset(tt$train, select = - SalePrice))
dtrainall = xgb.DMatrix(data = trainall_X, label = trainall_y)

xgb.fitall = xgboost(data = dtrainall, nrounds=2000,params=xgb.paras, nthread = 6, verbose=0)

To get predictions on test set, we exponentiate to get back the price in dollars.

pred2.test = exp(predict(xgb.fitall, dtest))

To check the sanity of our result on the test set, we plot the distribution of value of salePrice in the training data (blue) and the distribution of predicted SalePrice for the test set using xgboost (green)

hist(log(tt$train$SalePrice), main="Training (blue), test (green) predictions",
     xlab = "SalePrice (log)", freq=FALSE, col=scales::alpha('blue',0.5), ylim=c(0,1.5))
hist(log(pred2.test), freq=FALSE, add=T, col=scales::alpha('green',.5))

This is just to quickly check the results before submitting, in order to see that we haven’t accidentally exponentiated the target an extra time or something else. As they seem similar enough, we then save the result and submit to Kaggle.

subm = cbind(test_id, pred2.test)
colnames(subm) = c("Id", "SalePrice")
write.csv(subm, "submission.csv", row.names=FALSE)

The best score I was able to get using this approach and these parameter values was 0.131 when submitting to the Kaggle leaderboard. Interestingly, a substantial improvement to the result from the Random Forest model, showing how in this case XGBoost is able to improve the prediction even further compared to RandomForests, when building models sequentially training on the errors of the previous models.

Conclusion

At the end of this tutorial, we have made a submission to Kaggle that deals with a varied set of features related to housing prices, in which we had to deal with missing values and where we predicted the sale price using Random Forests and XGBoost.

To handle the missing values, we used the package missForest available in R which imputed the missing values with a Random Forest algorithm. Though not touched upon in this tutorial, my own casual observation is that missForest does improve accuracy of our final score compared to that of a simpler imputation (e.g. replacing all missing values with their means/mode), but at the cost of time-consuming computations, and my feeling is that in a production environment where new data is continuously added, one should use the simpler, less computationally-intensive way of imputation using the means/modes.

Moving on, we found that the best score on the test set was obtained with a XGBoost model. We attempted to reduce the number of features in our data set with recursive feature selection (RFE), where we only kept features that improved the out-of-bag (OOB) error of a recursively trained Random Forest model. However, as the end result of the RFE was a model with less accuracy on the validation set, we kept all variables for the final model. Though the RFE approach was unsuccessful and reduced the accuracy of the final model, the results presented here could provide some insight into the dataset, e.g. suggesting important variables.

Appendix: Catboost

With catboost, which is a gradient boosting method specializing on categorical/mixed data, developed by https://en.wikipedia.org/wiki/Yandex, even better results were possible, out-of-the-box.

As I was impressed with the performance, I will give a quick introduction on how to install and run a catboost model here.

Installation

The documentation https://catboost.ai/docs/installation/r-installation-binary-installation.html tells us to install catboost with

# install.packages('devtools')
# devtools::install_url('BINARY_URL', INSTALL_opts = c("--no-multiarch"))

where BINARY_URL is a link to the version you want to install. For example, the latest version at the time of writing (Windows) is https://github.com/catboost/catboost/releases/download/v0.24.4/catboost-R-Windows-0.24.4.tgz see e.g. https://github.com/catboost/catboost/releases.

Unfortunately version 0.24.4 did not seem to work for me, as I seem to get a similar error as in https://github.com/catboost/catboost/issues/1525, Therefore, as a bonus, here are the R-commands required to unload the catboost package from the current R-environment and remove the package, so you can reinstall a new version, after restarting the R session,

# detach("package:catboost", unload=TRUE) # detach catboost from current R session
# remove.packages("catboost") # uninstall catboost

As version 0.24.3 seems to work fine, we instead install this version, and load it

# devtools::install_url('https://github.com/catboost/catboost/releases/download/v0.24.3/catboost-R-Windows-0.24.3.tgz')
require("catboost")
## Loading required package: catboost

Preparing the data

As before we split the data to training/test, and then we split training to training/validation.

tt = get_tt(data_imp) # get training/test data
s = split_data(tt$train) # split training to training/validation

Now we define the different datasets, training data, validation data, all training data (validation+training) and test data, in the format catboost wants,

train_pool = catboost.load_pool(data = select(s$train, -c("SalePrice")), label = log(s$train$SalePrice))
val_pool = catboost.load_pool(data = select(s$val, -c("SalePrice")), label = log(s$val$SalePrice))
alltrain_pool = catboost.load_pool(data = select(tt$train, -c("SalePrice")), label = log(tt$train$SalePrice))
test_pool = catboost.load_pool(data = select(tt$test, -c("SalePrice")), label = log(tt$test$SalePrice))

Training the model

After some more hyperparameter tuning, I found a good parameter set to be depth = 4, learning_rate = 0.01, iterations = 5000, l2_leaf_reg = 1, rsm = 0.1, border_count = 254 and early_stopping_rounds=200. I will not go into the meanings of this parameters, but please see the catboost documentation for more details, and possibly find your own, even better, values. Now we will train the model. Note that the catboost.train function takes both a training dataset and a validation dataset, allowing us to monitor the performance on the validation set during the training.

cb.paras = list(loss_function = 'RMSE', iterations = 5000,
                metric_period=100, depth=4, rsm=0.1,
                border_count=254, learning_rate=0.01,
                l2_leaf_reg=1, early_stopping_rounds=200)

cb.model = catboost.train(train_pool,  val_pool, params = cb.paras)
## Warning: Overfitting detector is active, thus evaluation metric is calculated on every iteration. 'metric_period' is ignored for evaluation metric.
## 0:   learn: 0.3955673    test: 0.4034571 best: 0.4034571 (0) total: 53.2ms   remaining: 4m 26s
## 100: learn: 0.2467607    test: 0.2512793 best: 0.2512793 (100)   total: 940ms    remaining: 45.6s
## 200: learn: 0.1836174    test: 0.1873397 best: 0.1873397 (200)   total: 1.84s    remaining: 43.9s
## 300: learn: 0.1555228    test: 0.1617281 best: 0.1617281 (300)   total: 2.75s    remaining: 42.9s
## 400: learn: 0.1409895    test: 0.1493931 best: 0.1493931 (400)   total: 3.65s    remaining: 41.9s
## 500: learn: 0.1315766    test: 0.1419048 best: 0.1419048 (500)   total: 4.55s    remaining: 40.9s
## 600: learn: 0.1246155    test: 0.1372702 best: 0.1372702 (600)   total: 5.45s    remaining: 39.9s
## 700: learn: 0.1187299    test: 0.1332261 best: 0.1332261 (700)   total: 6.33s    remaining: 38.9s
## 800: learn: 0.1135539    test: 0.1299086 best: 0.1299086 (800)   total: 7.23s    remaining: 37.9s
## 900: learn: 0.1092132    test: 0.1273869 best: 0.1273869 (900)   total: 8.12s    remaining: 37s
## 1000:    learn: 0.1052751    test: 0.1254555 best: 0.1254555 (998)   total: 9.03s    remaining: 36.1s
## 1100:    learn: 0.1021420    test: 0.1238649 best: 0.1238565 (1099)  total: 9.94s    remaining: 35.2s
## 1200:    learn: 0.0996795    test: 0.1227769 best: 0.1227736 (1199)  total: 10.9s    remaining: 34.4s
## 1300:    learn: 0.0973729    test: 0.1218977 best: 0.1218921 (1298)  total: 11.8s    remaining: 33.4s
## 1400:    learn: 0.0953014    test: 0.1210086 best: 0.1210085 (1399)  total: 12.7s    remaining: 32.5s
## 1500:    learn: 0.0932589    test: 0.1203908 best: 0.1203908 (1500)  total: 13.6s    remaining: 31.6s
## 1600:    learn: 0.0916012    test: 0.1197485 best: 0.1197455 (1596)  total: 14.5s    remaining: 30.7s
## 1700:    learn: 0.0900183    test: 0.1193086 best: 0.1193086 (1700)  total: 15.4s    remaining: 29.8s
## 1800:    learn: 0.0884368    test: 0.1187568 best: 0.1187568 (1800)  total: 16.3s    remaining: 28.9s
## 1900:    learn: 0.0869875    test: 0.1183434 best: 0.1183434 (1900)  total: 17.2s    remaining: 28s
## 2000:    learn: 0.0857432    test: 0.1180121 best: 0.1180121 (2000)  total: 18.1s    remaining: 27.1s
## 2100:    learn: 0.0845322    test: 0.1177919 best: 0.1177919 (2100)  total: 19s  remaining: 26.2s
## 2200:    learn: 0.0834684    test: 0.1175702 best: 0.1175670 (2197)  total: 19.9s    remaining: 25.3s
## 2300:    learn: 0.0824500    test: 0.1174580 best: 0.1174477 (2298)  total: 20.8s    remaining: 24.4s
## 2400:    learn: 0.0814324    test: 0.1173079 best: 0.1173079 (2400)  total: 21.7s    remaining: 23.5s
## 2500:    learn: 0.0804397    test: 0.1171058 best: 0.1171034 (2491)  total: 22.6s    remaining: 22.6s
## 2600:    learn: 0.0794627    test: 0.1169852 best: 0.1169852 (2600)  total: 23.5s    remaining: 21.7s
## 2700:    learn: 0.0784878    test: 0.1169022 best: 0.1168870 (2650)  total: 24.4s    remaining: 20.8s
## 2800:    learn: 0.0776579    test: 0.1167957 best: 0.1167957 (2800)  total: 25.3s    remaining: 19.9s
## 2900:    learn: 0.0767674    test: 0.1167102 best: 0.1167095 (2899)  total: 26.2s    remaining: 19s
## 3000:    learn: 0.0758981    test: 0.1166598 best: 0.1166439 (2986)  total: 27.1s    remaining: 18.1s
## 3100:    learn: 0.0750977    test: 0.1166367 best: 0.1166183 (3075)  total: 28s  remaining: 17.2s
## 3200:    learn: 0.0743912    test: 0.1165727 best: 0.1165596 (3190)  total: 28.9s    remaining: 16.3s
## 3300:    learn: 0.0736714    test: 0.1165328 best: 0.1165268 (3245)  total: 29.8s    remaining: 15.4s
## 3400:    learn: 0.0729281    test: 0.1164713 best: 0.1164681 (3399)  total: 30.7s    remaining: 14.4s
## 3500:    learn: 0.0722066    test: 0.1164088 best: 0.1163956 (3473)  total: 31.6s    remaining: 13.5s
## 3600:    learn: 0.0714780    test: 0.1164251 best: 0.1163956 (3473)  total: 32.5s    remaining: 12.6s
## 3700:    learn: 0.0706155    test: 0.1163951 best: 0.1163863 (3617)  total: 33.4s    remaining: 11.7s
## 3800:    learn: 0.0698717    test: 0.1162976 best: 0.1162955 (3798)  total: 34.3s    remaining: 10.8s
## 3900:    learn: 0.0691744    test: 0.1162971 best: 0.1162836 (3888)  total: 35.2s    remaining: 9.93s
## 4000:    learn: 0.0685293    test: 0.1162951 best: 0.1162812 (3933)  total: 36.2s    remaining: 9.03s
## 4100:    learn: 0.0678657    test: 0.1163229 best: 0.1162812 (3933)  total: 37.1s    remaining: 8.13s
## Stopped by overfitting detector  (200 iterations wait)
## 
## bestTest = 0.1162811684
## bestIteration = 3933
## 
## Shrink model to first 3934 iterations.

To verify our result, we make a prediction and calculate the RMSE log error on the validation set

prediction = catboost.predict(cb.model, val_pool)
logrmse_val =  (mean( ((prediction-log(s$val$SalePrice))**2)))**0.5
cat("Log error (validation):", logrmse_val)
## Log error (validation): 0.1162812

Seems good! Now we train the model on all training data (training + validation). We set the second parameter in catboost.train to NULL as we don’t supply any validation data.

cb.modelf = catboost.train(alltrain_pool,  NULL, params = cb.paras)
## 0:   learn: 0.3972835    total: 4.03ms   remaining: 20.1s
## 100: learn: 0.2469311    total: 916ms    remaining: 44.4s
## 200: learn: 0.1844478    total: 1.83s    remaining: 43.7s
## 300: learn: 0.1559904    total: 2.75s    remaining: 42.9s
## 400: learn: 0.1407558    total: 3.64s    remaining: 41.8s
## 500: learn: 0.1309642    total: 4.54s    remaining: 40.8s
## 600: learn: 0.1239503    total: 5.45s    remaining: 39.9s
## 700: learn: 0.1187411    total: 6.36s    remaining: 39s
## 800: learn: 0.1141205    total: 7.25s    remaining: 38s
## 900: learn: 0.1103811    total: 8.16s    remaining: 37.1s
## 1000:    learn: 0.1071739    total: 9.07s    remaining: 36.2s
## 1100:    learn: 0.1043081    total: 9.97s    remaining: 35.3s
## 1200:    learn: 0.1021137    total: 10.9s    remaining: 34.4s
## 1300:    learn: 0.0999649    total: 11.8s    remaining: 33.5s
## 1400:    learn: 0.0980403    total: 12.7s    remaining: 32.6s
## 1500:    learn: 0.0963822    total: 13.6s    remaining: 31.7s
## 1600:    learn: 0.0948042    total: 14.5s    remaining: 30.8s
## 1700:    learn: 0.0931807    total: 15.4s    remaining: 29.9s
## 1800:    learn: 0.0916737    total: 16.3s    remaining: 29s
## 1900:    learn: 0.0904573    total: 17.3s    remaining: 28.1s
## 2000:    learn: 0.0891984    total: 18.1s    remaining: 27.2s
## 2100:    learn: 0.0879630    total: 19.1s    remaining: 26.3s
## 2200:    learn: 0.0868724    total: 19.9s    remaining: 25.4s
## 2300:    learn: 0.0858766    total: 20.8s    remaining: 24.5s
## 2400:    learn: 0.0849283    total: 21.8s    remaining: 23.5s
## 2500:    learn: 0.0839723    total: 22.7s    remaining: 22.6s
## 2600:    learn: 0.0830420    total: 23.6s    remaining: 21.7s
## 2700:    learn: 0.0821909    total: 24.5s    remaining: 20.8s
## 2800:    learn: 0.0813601    total: 25.4s    remaining: 19.9s
## 2900:    learn: 0.0805193    total: 26.3s    remaining: 19s
## 3000:    learn: 0.0796804    total: 27.2s    remaining: 18.1s
## 3100:    learn: 0.0789579    total: 28.1s    remaining: 17.2s
## 3200:    learn: 0.0782867    total: 29s  remaining: 16.3s
## 3300:    learn: 0.0776087    total: 29.9s    remaining: 15.4s
## 3400:    learn: 0.0768693    total: 30.8s    remaining: 14.5s
## 3500:    learn: 0.0762374    total: 31.7s    remaining: 13.6s
## 3600:    learn: 0.0755818    total: 32.6s    remaining: 12.7s
## 3700:    learn: 0.0749239    total: 33.5s    remaining: 11.8s
## 3800:    learn: 0.0742657    total: 34.5s    remaining: 10.9s
## 3900:    learn: 0.0736331    total: 35.4s    remaining: 9.96s
## 4000:    learn: 0.0730605    total: 36.3s    remaining: 9.06s
## 4100:    learn: 0.0724690    total: 37.2s    remaining: 8.16s
## 4200:    learn: 0.0717551    total: 38.1s    remaining: 7.25s
## 4300:    learn: 0.0711890    total: 39s  remaining: 6.34s
## 4400:    learn: 0.0706128    total: 39.9s    remaining: 5.44s
## 4500:    learn: 0.0700200    total: 40.9s    remaining: 4.53s
## 4600:    learn: 0.0694611    total: 41.8s    remaining: 3.62s
## 4700:    learn: 0.0689162    total: 42.7s    remaining: 2.71s
## 4800:    learn: 0.0683931    total: 43.6s    remaining: 1.81s
## 4900:    learn: 0.0678758    total: 44.5s    remaining: 899ms
## 4999:    learn: 0.0673540    total: 45.4s    remaining: 0us

Obtaining predictions on test set

Predict the SalePrice on the test data with the above model,

pred3.test = exp(catboost.predict(cb.model, test_pool))

As before plot the true SalePrice for the training data (blue), versus the test prediction (green) for SalePrice to make sure we have not made any errors,

hist(log(tt$train$SalePrice), main="Training (blue), test (green) predictions",
     xlab = "SalePrice (log)", freq=FALSE, col=scales::alpha('blue',0.5), ylim=c(0,1.5))
hist(log(pred3.test), freq=FALSE, add=T, col=scales::alpha('green',.5))

Save file and submit to Kaggle

As it looks fine we save it to a file that we submit to Kaggle.

subm = cbind(test_id, pred3.test)
colnames(subm) = c("Id", "SalePrice")
write.csv(subm, "submission.csv", row.names=FALSE)

# kaggle competitions submit -c "house-prices-advanced-regression-techniques" -f "submission.csv" -m "catboost! paraset al1"

With the catboost model above, I got an error of 0.122 on the Kaggle test set, which is pretty huge improvement from the xgboost result of 0.131, considering both models were implemented as-is except for a similar hyperparameter tuning process for both, based on cross-validation/parameter scanning (caret package).