02 Mar 2018

Allstate Insurance Claim Value


The following is my solution to a Kaggle contest that ran in early 2017 on predicting the value of an insurance claim payout as a function of many given features. For privacy purposes (and also, I assume, to protect internal information), all of the variables are anonymized. There is a mix of continuous and categorical variables, with the latter making up the vast majority of the features in the dataset. To treat the categorical variables, we must ‘one-hot-encode’ them: for each combination of categorical variable and possible value for that variable, we create a new column holding 1 if the entry has this characteristic and 0 if it does not.
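To make the one-hot encoding step concrete, here is a minimal sketch using pandas. The toy data, the column names (`cont1`, `cat1`, `cat2`) and the category values are made up for illustration and are not taken from the contest files; `pd.get_dummies` is one common way to do this, not necessarily the approach used in the full solution.

```python
import pandas as pd

# Toy data: one continuous and two categorical columns (hypothetical names and values).
claims = pd.DataFrame({
    "cont1": [0.72, 0.33, 0.51],
    "cat1": ["A", "B", "A"],
    "cat2": ["X", "X", "Y"],
})

# Each (variable, value) pair becomes its own 0/1 column;
# the continuous column is passed through unchanged.
encoded = pd.get_dummies(claims, columns=["cat1", "cat2"], dtype=int)

print(encoded)
```

Each row ends up with exactly one 1 per original categorical variable, so two categorical columns with two values each become four indicator columns. With many categorical variables, the encoded table can get very wide.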

21 Feb 2018

Data Cleaning


One of the more common tasks of a data scientist is data cleaning. Unfortunately, not all datasets come to us ready for analysis out of the box. Some require a little work to be ready, and some require a little more. I recently came across a dataset that was REALLY in need of some serious cleaning…

13 Feb 2018

Titanic Analysis


When first foraying into the world of Machine Learning (ML), there are a few very popular datasets that most students use as a springboard to learn about the various ML algorithms available. The Titanic dataset is certainly one of these. The RMS Titanic was a famous ship that was designed to be ‘unsinkable’; however, as we know, it unfortunately was anything but. After colliding with an iceberg, the Titanic sank. Of the roughly 2,224 people on board, 68% ultimately lost their lives.