Kaggle Auto Insurance

6 minute read

Porto Seguro Safe Driver Prediction

Introduction

In this study, I work with a dataset from the Brazilian insurance company Porto Seguro. The data describe many different auto insurance policies, each with many characteristics of the driver; the training data also indicate whether or not each driver has filed a claim.
I will analyze the data to understand it and learn which variables are correlated with a claim being filed. I will also build several models and evaluate how well they predict this outcome.

Exploring the data

Let's take a look at the structure of our data and understand what each variable means. Here, I show only a subset of the data to give a sense of what it looks like.

##     id target ps_ind_01 ps_ind_02_cat ps_ind_03 ps_ind_04_cat
##  1:  7      0         2             2         5             1
##  2:  9      0         1             1         7             0
##  3: 13      0         5             4         9             1
##  4: 16      0         0             1         2             0
##  5: 17      0         0             2         0             1
##  6: 19      0         5             1         4             0
##  7: 20      0         2             1         3             1
##  8: 22      0         5             1         4             0
##  9: 26      0         5             1         3             1
## 10: 28      1         1             1         2             0

The “target” column is what we are trying to predict: it is 1 if the driver filed a claim and 0 otherwise. The explanatory variables are grouped by the type of characteristic they describe, but what they actually represent has been hidden for confidentiality reasons. The variable groups are “ind”, “reg”, “car”, and “calc”. Binary variables carry the suffix “bin”, categorical variables the suffix “cat”, and continuous or ordinal variables carry no suffix.

Graphical Objects

Here I create graphics for every variable to see how each is related to whether or not a claim was filed.

Bar plots for the binary variables

Each variable is placed on the x-axis, and the percentage of claims filed is placed on the y-axis. This lets us see how each variable is correlated with a claim being filed.
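The quantity behind each bar is just the claim rate within each level of the variable. The original analysis is in R, but the same computation can be sketched in Python with pandas; the tiny dataset below is made up purely for illustration (only the column name `ps_ind_06_bin` is taken from the real data).

```python
import pandas as pd

# Toy stand-in for the Porto Seguro training data; values are invented.
df = pd.DataFrame({
    "ps_ind_06_bin": [0, 0, 1, 1, 1, 0, 1, 0],
    "target":        [0, 1, 0, 0, 1, 0, 1, 0],
})

# Percentage of claims filed within each level of the binary variable --
# these are the heights of the bars in the plot.
claim_rate = df.groupby("ps_ind_06_bin")["target"].mean() * 100
print(claim_rate)
```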

Bar plots for the categorical variables

Some of these variables reveal a pattern, while others appear completely uncorrelated with the target. The variables explored here are purely categorical, so their levels are not in any particular order.

Ordinal Integer Variables

These variables are ordinal, so we can see how having more or less of a given characteristic affects the probability of a claim being filed.

Continuous Variables

To represent these variables graphically, I found it best to build a model with just one variable at a time as the predictor, and then plot the predicted probabilities. I use a recursive partitioning algorithm for these plots.
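The one-variable-at-a-time idea can be sketched as follows. The original work uses R's recursive partitioning (rpart); here a scikit-learn decision tree stands in for it, and the data are synthetic (the continuous column and its relationship to the target are invented for illustration).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for one continuous predictor: claims become
# slightly more likely as x grows (relationship is invented).
x = rng.uniform(0, 2, size=2000)
y = (rng.uniform(size=2000) < 0.02 + 0.05 * x).astype(int)

# Fit a tree on the single predictor, mirroring the one-variable approach
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(x.reshape(-1, 1), y)

# Predicted claim probability over a grid of x values; these (x, prob)
# pairs are what would be drawn as the probability curve.
grid = np.linspace(0, 2, 50).reshape(-1, 1)
probs = tree.predict_proba(grid)[:, 1]
```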

Model Building

Until now, we’ve looked at all of our data and have seen which variables have a higher correlation with a claim being filed. Now with these data, let’s build several statistical models to discover more about our data. After discovering a good model, we can use this on testing data to predict the probability that a policy holder files a claim.

Recursive Partitioning Model

The recursive partitioning model uses a classification decision tree. Our main objective is to estimate the probability that a given policy holder will file a claim. The model works in an interesting way: it takes each variable and learns how a change in that variable affects the probability. Each observation is fed through the decision tree until it reaches a conclusion. In the diagram below, we see a simple representation of how our recursive partitioning model works: it takes a variable and creates a split at a certain value, and based on that split it either moves on to another variable or assigns a class (0 or 1). This diagram is a very simplified version of the full model.
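The split-then-branch structure described above can be made concrete with a small sketch. The original model is built in R, but a scikit-learn tree shows the same idea; the data and the generic feature names `var_1`…`var_4` are placeholders (the real model uses the ps_* columns).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with generic feature names (the real model uses ps_* columns)
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# A deliberately shallow tree, like the simplified diagram in the text
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Each printed line is a split on one variable at a learned threshold;
# the leaves assign a class (0 or 1).
rules = export_text(tree, feature_names=["var_1", "var_2", "var_3", "var_4"])
print(rules)
```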

I randomly split the data into training and testing groups so I could check how well the model performs on data it has not seen. When evaluating a classification model, it is important to review a few key performance ratios: accuracy, precision, and recall. Accuracy is the number of true positives plus the number of true negatives, divided by the total number of predictions; it is essentially the percentage of the time the model was correct, and we want it as close to 1 as possible. Precision is the number of true positives divided by the sum of true positives and false positives; it tells us whether the model overpredicted the positive class. Recall is the number of true positives divided by the sum of true positives and false negatives; it tells us how many of the actual positive cases the model managed to catch. Like accuracy, we want precision and recall to be as close to 1 as possible.
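The three ratios above can be written out directly from the confusion-matrix counts. The counts in the worked example below are made up to illustrate the formulas, not taken from the model's actual results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical example: 5 true positives, 90 true negatives,
# 3 false positives, 2 false negatives
acc, prec, rec = classification_metrics(tp=5, tn=90, fp=3, fn=2)
```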

##     accuracy precision       recall
## 1: 0.9641753 0.3333333 0.0001563477

The recursive partitioning model I created performed poorly on these data: accuracy is high mainly because claims are rare, while precision and especially recall are very low. This may be due in part to overfitting, as I used every variable in the dataset.

Linear Model

A linear model is a much simpler type of model. The graphic below shows what happens when some key variables are altered; we can only show so many variables on one graph, so the picture is an oversimplification. We assess the performance of this model by calculating the root mean squared error between the predicted probabilities and the actual 0/1 outcomes. We want this value to be as low as possible; it is displayed below and is not as low as we would have hoped.

## [1] 0.1854534
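Root mean squared error is straightforward to compute by hand. The sketch below shows the calculation in Python; the four toy policies and their predicted probabilities are invented for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed 0/1 targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy example: four policies, predicted claim probabilities vs. actual outcomes
error = rmse([0, 0, 1, 0], [0.1, 0.2, 0.7, 0.0])
```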

Random Forest

The random forest model builds on the recursive partitioning model. The algorithm draws a bootstrap sample of the training data (sampling with replacement), fits a decision tree to it, then draws another sample and fits another tree, repeating this process until we have many different decision tree models. The final prediction aggregates the predictions of all of these trees. The collection of trees is called an ensemble, and because each tree is built on a random sample of the data, the method is called a “Random Forest”. This whole process of resampling and averaging is called bootstrap aggregation, or bagging for short. I fine-tuned a couple of parameters to bring the model toward its best performance. We use the same ratios as with the recursive partitioning model to judge how well it performed: accuracy, precision, and recall.
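The bagging workflow above can be sketched with scikit-learn's random forest in place of the R implementation used in the original analysis. The data are synthetic, and `n_estimators` and `max_depth` stand in for the kind of parameters tuned in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in: 5 numeric features, rare positive class (like target)
X = rng.normal(size=(3000, 5))
y = (rng.uniform(size=3000) < 0.05 + 0.10 * (X[:, 0] > 1)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each tree is fit on a bootstrap sample of the training rows (bagging)
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X_tr, y_tr)

# Predicted claim probabilities average the votes of the individual trees
probs = forest.predict_proba(X_te)[:, 1]
```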

##     accuracy  precision    recall
## 1: 0.9071425 0.04681463 0.0802279

The random forest model did not give satisfying results either.

Showing impactful variables

None of the models gave satisfying predictive results. Instead, I will use the linear model to show the impact that each variable has on the target variable.

Conclusion

Variables ps_car_12, ps_car_13, and ps_ind_17_bin had the strongest positive impact on a policy holder filing a claim. Variables ps_car_07_cat, ps_ind_10_bin, and ps_ind_11_bin had the strongest negative impact on a policy holder filing a claim.

With more time, other model types could be tried, or parameters adjusted further, to discover a better model.