# P4D2: Cleaning data for scikit-learn and machine learning
## What is scikit-learn?
Let’s review a few items from last class.
> scikit-learn is an open source project, meaning that it is free to use and distribute, and anyone can easily obtain the source code to see what is going on behind the scenes. The scikit-learn project is constantly being developed and improved, and it has a very active user community. It contains a number of state-of-the-art machine learning algorithms, as well as comprehensive documentation about each algorithm. scikit-learn is a very popular tool, and the most prominent Python library for machine learning. It is widely used in industry and academia, and a wealth of tutorials and code snippets are available online.
>
> *Introduction to Machine Learning with Python* by Andreas C. Müller and Sarah Guido
```python
import sys
!{sys.executable} -m pip install numpy scipy matplotlib ipython scikit-learn pandas pillow
```
Then we can start our script as follows:

```python
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
```
I want to start making predictions. Let's…

- create our repo from the template and clone it.
- download the `.csv` data. It is small enough that we can version it with our code.
- start our `eda.py` script and import our data.

  ```python
  import pandas as pd

  dat = pd.read_csv('SalesBook_2013.csv')

  # select variables we will use in class and
  # drop homes that are not single family or duplexes
  dat_ml = (dat
      .filter(['NBHD', 'PARCEL', 'LIVEAREA', 'FINBSMNT', 'BASEMENT',
               'YRBUILT', 'CONDITION', 'QUALITY', 'TOTUNITS', 'STORIES',
               'GARTYPE', 'NOCARS', 'NUMBDRM', 'NUMBATHS', 'ARCSTYLE',
               'SPRICE', 'DEDUCT', 'NETPRICE', 'TASP', 'SMONTH', 'SYEAR',
               'QUALIFIED', 'STATUS'])
      .rename(columns=str.lower)
      .query('totunits <= 2'))
  ```
- Write a short paragraph describing your data. What does a row represent? What measures do we have for each row?
- look at our variables/features and imagine which ones might be good predictors of the age of a house.
- find our target variable. What do we need to do with our target?
- create a few plots of potential predictors colored by built before 1980 status.
- fix our character or categorical variables?
- Which are nominal and which are ordinal?
- What is the default behavior of `pd.get_dummies()` for the columns that are created? Should we change that behavior?
- What do we do with the ordinal variables (`condition`)?
- fix our columns with missing values?
- split our data with `train_test_split()`.
- decide on an ML method we would like to use.
- fit our model.
- evaluate our model.
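The plotting step above, coloring potential predictors by before-1980 status, could be sketched as follows. The six-row frame and the `before1980` column name are made up for illustration; in class we would use `dat_ml` from the import step:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# small made-up sample standing in for dat_ml
dat_ml = pd.DataFrame({
    "livearea": [900, 1500, 2200, 1100, 1800, 2500],
    "stories": [1, 1, 2, 1, 2, 2],
    "yrbuilt": [1955, 1972, 1995, 1960, 1988, 2005],
})
dat_ml["before1980"] = (dat_ml.yrbuilt < 1980).astype(int)

# scatter two candidate predictors, one color per target class
fig, ax = plt.subplots()
for label, grp in dat_ml.groupby("before1980"):
    ax.scatter(grp.livearea, grp.stories, label=f"before1980={label}")
ax.set_xlabel("livearea")
ax.set_ylabel("stories")
ax.legend()
fig.savefig("livearea_vs_stories.png")
```

A plot like this helps us see whether the two classes separate on a given predictor before we ever fit a model.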
## Creating our target variable
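Our question is whether a home was built before 1980, so the target can be derived from `yrbuilt`. A minimal sketch (the four-row frame stands in for `dat_ml`, and `before1980` is a name we are choosing here):

```python
import pandas as pd

# stand-in for dat_ml loaded above
dat_ml = pd.DataFrame({"yrbuilt": [1912, 1979, 1980, 2001]})

# 1 if built before 1980, else 0; then drop yrbuilt so the
# model cannot simply read the answer off the year itself
dat_ml["before1980"] = (dat_ml.yrbuilt < 1980).astype(int)
dat_ml = dat_ml.drop(columns=["yrbuilt"])
```

Dropping `yrbuilt` (and anything else that encodes the build year directly) is the key step; leaving it in would make the prediction task trivial.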
## Handling ordinal categories
What ordinal variables do we have?
```python
dat_ml.condition.value_counts()
```
```python
# map condition categories to ordered numbers
# (only two of the categories are shown here)
replace_dictionary = {
    "Excel": 3,
    "Poor": -2,
}
# .replace() returns a new Series, so assign the result back
dat_ml.condition = dat_ml.condition.replace(replace_dictionary)
```
Some thought should go into how ordinal values are mapped to numbers. *Evaluating methods for handling missing ordinal data in structural equation modeling* discusses the topic.
## Handling nominal categories
What nominal variables do we have? Which have too many categories?
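For nominal columns, `pd.get_dummies()` creates one indicator column per category. A sketch using `arcstyle`, one of the nominal candidates from our filter list (the four sample values are made up):

```python
import pandas as pd

# tiny stand-in for dat_ml; arcstyle is a nominal column
dat_ml = pd.DataFrame({
    "arcstyle": ["ONE-STORY", "TWO-STORY", "ONE-STORY", "BI-LEVEL"],
})

# by default get_dummies keeps every category as its own column;
# drop_first=True would drop one to avoid redundant columns
dat_ml = pd.get_dummies(dat_ml, columns=["arcstyle"], drop_first=False)
```

Columns with too many categories blow up the width of the data this way, which is why we ask which nominal variables have too many before dummy-encoding them.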
## Fixing our missing values
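One common sketch for numeric columns is a median fill; whether a median, a mode, or dropping rows is right depends on the column. The sample data here is made up, using two numeric columns from our filter list:

```python
import pandas as pd
import numpy as np

# stand-in for dat_ml with some gaps
dat_ml = pd.DataFrame({
    "nocars": [1.0, np.nan, 2.0, np.nan],
    "numbaths": [2.0, 1.0, np.nan, 3.0],
})

# fill each numeric gap with that column's median
dat_ml = dat_ml.fillna(dat_ml.median())
```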
## Building our training and testing data
```python
X_pred = dat_ml.drop(<column list>, axis = 1)
y_pred = dat_ml.filter([<target column>])

X_train, X_test, y_train, y_test = train_test_split(
    X_pred, y_pred, test_size = .34, random_state = 76)
```
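With the split in hand, the last two steps from our list (fit and evaluate) can be sketched with the `GaussianNB` and `metrics` imports from the top of the script. The feature matrix and target here are synthetic stand-ins for `X_pred` and `y_pred`:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

# synthetic features and a binary target standing in for our housing data
rng = np.random.default_rng(76)
X_pred = rng.normal(size=(200, 4))
y_pred = (X_pred[:, 0] + X_pred[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_pred, y_pred, test_size=.34, random_state=76)

# fit the model on the training split, evaluate on the held-out split
model = GaussianNB().fit(X_train, y_train)
y_hat = model.predict(X_test)
print(metrics.accuracy_score(y_test, y_hat))
```

The same `fit` / `predict` / score pattern works for `tree.DecisionTreeClassifier` or any other scikit-learn classifier we decide to use.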