
P4D2: Cleaning data for scikit-learn and machine learning

What is scikit-learn?

Let’s review a few items from last class.

scikit-learn is an open source project, meaning that it is free to use and distribute, and anyone can easily obtain the source code to see what is going on behind the scenes. The scikit-learn project is constantly being developed and improved, and it has a very active user community. It contains a number of state-of-the-art machine learning algorithms, as well as comprehensive documentation about each algorithm. scikit-learn is a very popular tool, and the most prominent Python library for machine learning. It is widely used in industry and academia, and a wealth of tutorials and code snippets are available online.

— Introduction to Machine Learning with Python, Andreas C. Müller and Sarah Guido

import sys
!{sys.executable} -m pip install numpy scipy matplotlib ipython scikit-learn pandas pillow

Then we can start our script with the following imports:

import pandas as pd  # needed for pd.read_csv() below

from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

I want to start making predictions.

Let’s…

  1. Create our repo from the template and clone it.
  2. Download the .csv data. It is small enough that we can version it with our code.
  3. Start our eda.py script and import our data.

     dat = pd.read_csv('SalesBook_2013.csv')
     # select variables we will use in class.
     # drop homes that are not single family or duplexes
     dat_ml = (dat
         .filter(['NBHD', 'PARCEL', 'LIVEAREA', 'FINBSMNT',  
             'BASEMENT', 'YRBUILT', 'CONDITION', 'QUALITY',
             'TOTUNITS', 'STORIES', 'GARTYPE', 'NOCARS',
             'NUMBDRM', 'NUMBATHS', 'ARCSTYLE', 'SPRICE',
             'DEDUCT', 'NETPRICE', 'TASP', 'SMONTH',
             'SYEAR', 'QUALIFIED', 'STATUS'])
         .rename(columns=str.lower)
         .query('totunits <= 2'))
    
  4. Write a short paragraph describing your data. What does a row represent? What measures do we have for each row?
  5. Look at our variables/features and imagine which ones might be good predictors of the age of a house.
  6. Find our target variable. What do we need to do with our target?
  7. Create a few plots of potential predictors colored by built-before-1980 status.
  8. Fix our character or categorical variables.
    • Which are nominal and which are ordinal?
    • What is the default behavior of pd.get_dummies() for the columns that are created? Should we change that behavior?
    • What do we do with the ordinal variables (condition)?
  9. Fix our columns with missing values.
  10. Split our data with train_test_split().
  11. Decide which ML method we would like to use.
  12. Fit our model.
  13. Evaluate our model.

Creating our target variable
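
We want to predict whether a house was built before 1980, so we can derive a binary target from yrbuilt. A minimal sketch, using a made-up toy DataFrame in place of our real data and before1980 as an assumed column name:

```python
import pandas as pd

# toy stand-in for dat_ml; values are made up for illustration
dat_ml = pd.DataFrame({"yrbuilt": [1955, 1982, 1979, 2001]})

# 1 if the house was built before 1980, 0 otherwise
dat_ml["before1980"] = (dat_ml.yrbuilt < 1980).astype(int)

# drop yrbuilt so the model cannot simply read the answer off the year
dat_ml = dat_ml.drop(columns=["yrbuilt"])
```

Dropping yrbuilt (and any other column that leaks the build year) matters: otherwise the model would trivially memorize the target.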

Handling ordinal categories

What ordinal variables do we have?

dat_ml.condition.value_counts()

replace_dictionary = {
    "Excel": 3,
    # map the remaining condition categories between these extremes as well
    "Poor": -2,
}
# .replace() returns a copy, so assign the result back
dat_ml['condition'] = dat_ml.condition.replace(replace_dictionary)

Choosing numeric codes for ordinal values deserves some thought. Evaluating methods for handling missing ordinal data in structural equation modeling discusses the topic.

Handling nominal categories

What nominal variables do we have? Which have too many categories?
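
Nominal variables have no natural ordering, so we one-hot encode them with pd.get_dummies(). A minimal sketch, using a made-up arcstyle column in place of our real data: by default the new columns are boolean, and passing dtype=int gives 0/1 columns instead.

```python
import pandas as pd

# toy stand-in for one of our nominal columns; values are made up
df = pd.DataFrame({"arcstyle": ["ONE-STORY", "TWO-STORY", "ONE-STORY"]})

# one column per category, named <column>_<category>; dtype=int gives 0/1
dummies = pd.get_dummies(df, columns=["arcstyle"], dtype=int)
```

Columns with too many categories (like neighborhood) may need to be grouped or dropped before encoding, or the dummy columns will explode in number.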

Fixing our missing values
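
Before modeling we need to find and handle missing values. A minimal sketch on a made-up DataFrame (the column names mirror two of ours, but the fill rules shown are assumptions, not the required choices):

```python
import numpy as np
import pandas as pd

# toy stand-in with missing values; data are made up
df = pd.DataFrame({"nocars": [1.0, np.nan, 2.0],
                   "numbaths": [2.0, 1.0, np.nan]})

# count missing values per column
missing_counts = df.isna().sum()

# example fills: a constant for one column, the median for another
df_filled = df.fillna({"nocars": 0,
                       "numbaths": df.numbaths.median()})
```

Whether to fill with a constant, a summary statistic, or to drop the rows entirely depends on what the missingness means for each variable.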

Building our training and testing data

X_pred = dat_ml.drop(<column list>, axis=1)
y_pred = dat_ml.filter([<target column>])

X_train, X_test, y_train, y_test = train_test_split(
    X_pred, y_pred, test_size=0.34, random_state=76)

Fit, Predict, Evaluate
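
The final steps follow scikit-learn's standard fit/predict/score pattern. A minimal sketch with made-up toy arrays standing in for X_train, X_test, y_train, and y_test, using the GaussianNB classifier we imported above:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

# toy stand-ins for the train/test split; values are made up
X_train = [[0.0], [1.0], [10.0], [11.0]]
y_train = [0, 0, 1, 1]
X_test = [[0.5], [10.5]]
y_test = [0, 1]

model = GaussianNB()
model.fit(X_train, y_train)           # fit
predictions = model.predict(X_test)   # predict
accuracy = metrics.accuracy_score(y_test, predictions)  # evaluate
```

The same three calls work unchanged if we swap GaussianNB for tree.DecisionTreeClassifier(), and metrics offers other scores (confusion_matrix, classification_report) beyond accuracy.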