## McNulty // w4d1

Winter 2015
02/02/2015

### Planned schedule and activities

9:00 am: Coffee and more coffee

9:15 am: Introduction to Supervised Learning

10:15 am: Supervised Learning challenges

12:00 pm: Lunch

1:30 pm: Challenges

5:00 pm: Leaves of grass

### Lecture notes

Randy's notebook

Math beyond Logistic Regression

More math beyond supervised learning (linear AND logistic regression)

More on K Nearest Neighbors

General textbook suggestions for different parts of the entire bootcamp, including probability, statistics, regression, machine learning, etc.

### Challenges

These are votes of U.S. House of Representatives Congressmen on 16 key issues in 1984.
Read the description of the fields and download the data: `house-votes-84.data`

We will try to see if we can predict the house members' party based on their votes.
We will also use some of the general machine learning tools we learned (a bit more efficiently this time).

#### Challenge 1

Load the data into a pandas dataframe. Replace `'y'`s with `1`s, `'n'`s with `0`s.

Now, almost every representative has at least one `?`. A `?` means that no yes/no vote was recorded (the representative was absent, or something similar). If we dropped every row that has a `?`, we would throw out most of our data. Instead, we will replace each `?` with the best guess in the Bayesian sense: in the absence of any other information, we will say that the probability of the representative voting YES is the number of YES votes on that issue divided by the total number of recorded votes on it.

So, convert each `?` to this probability (when yes=1 and no=0, this is the mean of the column).
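One possible way to do the replacement is sketched below. It runs on a tiny made-up stand-in for the real file, and the column names (`party`, `vote_1`, ...) are assumptions for illustration only, not names from the dataset.

```python
import io

import pandas as pd

# Toy stand-in for house-votes-84.data: party followed by three votes.
# (These rows are made up for the sketch.)
raw = io.StringIO(
    "republican,n,y,?\n"
    "democrat,y,?,n\n"
    "democrat,y,y,n\n"
    "republican,n,y,y\n"
)
columns = ["party", "vote_1", "vote_2", "vote_3"]  # hypothetical names
votes = pd.read_csv(raw, header=None, names=columns)

vote_cols = columns[1:]
# Map y -> 1 and n -> 0; anything else (the '?') becomes NaN.
encoded = votes[vote_cols].apply(lambda col: col.map({"y": 1.0, "n": 0.0}))
# The column mean skips NaN, so it is the fraction of recorded votes that were YES.
filled = encoded.fillna(encoded.mean())
print(filled)
```

Mapping `?` to NaN first matters: if the `?` were still in the column as a string, the mean could not be computed from the recorded votes alone.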

#### Challenge 2

Split the data into a test and training set. But this time, use this function:

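`from sklearn.model_selection import train_test_split`

(In scikit-learn releases before 0.18, this function lived in `sklearn.cross_validation`, which has since been removed.)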
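A minimal usage sketch, assuming a recent scikit-learn (where the function is imported from `sklearn.model_selection`). The data here is random and only there to make the example runnable, and the 30% test size and the `random_state` value are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data: 100 "representatives", 16 votes each.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(100, 16)).astype(float)
y = np.where(X[:, 0] > 0, "democrat", "republican")

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4444
)
print(X_train.shape, X_test.shape)
```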

#### Challenge 3

Using scikit-learn's KNN algorithm, train a model that predicts the party (republican/democrat):

```python
from sklearn.neighbors import KNeighborsClassifier
```

Try it with many different k values (number of neighbors), from 1 to 20, and for each k calculate the accuracy on the test set (number of correct predictions / number of all predictions).

You can use this to calculate accuracy:

```python
from sklearn.metrics import accuracy_score
```

Which k value gives the highest accuracy?
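The loop over k could look roughly like this. It is a sketch on random stand-in data (the labels are generated from the first three columns so there is something to learn), so the numbers it prints say nothing about the real votes.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Random stand-in data, not the real votes.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 16)).astype(float)
y = np.where(X[:, :3].sum(axis=1) >= 2, "democrat", "republican")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

accuracies = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracies[k] = accuracy_score(y_test, knn.predict(X_test))

# k with the highest test accuracy (the smallest such k if there is a tie).
best_k = max(accuracies, key=accuracies.get)
print(best_k, accuracies[best_k])
```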

#### Challenge 4

Make a similar model, but with `LogisticRegression` instead, and calculate its test accuracy.

#### Challenge 5

Make a bar graph of democrats and republicans. How many of each are there?

Make a very simple predictor that predicts 'democrat' for every incoming example.
Just write a function that takes in an `X` (an array or matrix of input examples) and returns an array of the same length as `X`, where each value is 'democrat'. For example, if `X` has three rows, your function should return `['democrat','democrat','democrat']`. Make a `y_predicted` vector using this function and measure its accuracy.

Do the same with predicting 'republican' all the time and measure its accuracy.
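A constant predictor like this can be only a few lines. The sketch below uses a hypothetical helper name, `predict_constant`, and five made-up labels just to show the mechanics.

```python
import numpy as np
from sklearn.metrics import accuracy_score


def predict_constant(X, label):
    """Return an array with one copy of `label` per row of X."""
    return np.full(len(X), label)


# Five made-up examples; only the number of rows in X matters here.
X = np.zeros((5, 16))
y_true = np.array(["democrat", "democrat", "republican", "democrat", "republican"])

democrat_accuracy = accuracy_score(y_true, predict_constant(X, "democrat"))
republican_accuracy = accuracy_score(y_true, predict_constant(X, "republican"))
print(democrat_accuracy, republican_accuracy)
```

The two accuracies always add up to 1 when there are only two labels, since every example is predicted correctly by exactly one of the two constant predictors.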

#### Challenge 6

Plot the accuracies as a function of k. Since k only matters for KNN, your logistic regression accuracy, 'democrat' predictor accuracy and 'republican' predictor accuracy will stay the same over all k, so each of these three will be a horizontal line. But the KNN accuracy will change with k.
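Here is one way the plot could be put together with matplotlib. The function name `plot_accuracies` is hypothetical, and the accuracy values passed in at the bottom are placeholders invented to exercise the plotting code, not real results. Substitute the numbers you computed in the earlier challenges.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt


def plot_accuracies(ks, knn_accuracies, constant_accuracies):
    """Plot KNN accuracy against k, plus one horizontal line per k-independent model."""
    fig, ax = plt.subplots()
    ax.plot(ks, knn_accuracies, marker="o", color="C0", label="KNN")
    for i, (name, accuracy) in enumerate(constant_accuracies.items(), start=1):
        ax.axhline(accuracy, linestyle="--", color=f"C{i}", label=name)
    ax.set_xlabel("k (number of neighbors)")
    ax.set_ylabel("test accuracy")
    ax.legend()
    return fig, ax


# Placeholder values only (NOT real results).
ks = list(range(1, 21))
knn_placeholder = [0.90 + 0.002 * (k % 5) for k in ks]
fig, ax = plot_accuracies(
    ks,
    knn_placeholder,
    {"logistic regression": 0.95, "always democrat": 0.6, "always republican": 0.4},
)
fig.savefig("accuracy_vs_k.png")
```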

#### Challenge 7

Plot a learning curve for the logistic regression model. But instead of going through the painstaking steps of doing it yourself, use this function:

```python
# scikit-learn 0.18 and later; older releases used sklearn.learning_curve
from sklearn.model_selection import learning_curve
```

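This will give you m (the training set sizes) together with the training and test scores for each size. Note that it returns scores (accuracy by default for a classifier), so the error is 1 minus the score. All you need to do is plot them. You don't even need to give it separate training/test sets: it does cross-validation all by itself. Easy, isn't it? : )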

Remember, since it does cross-validation, it doesn't have a single training error or test error per m value. Instead, it has one for each fold (separate partition) of the cross-validation. A good idea is to take the mean of these errors across the folds, which gives you a single meaningful number per m. In other words, do something like this:

```python
import numpy as np

train_cv_err = np.mean(train_err, axis=1)
test_cv_err = np.mean(test_err, axis=1)
```

before plotting `m` vs `train_cv_err` and `m` vs `test_cv_err`, where `train_err` and `test_err` are the two-dimensional arrays returned by the learning curve function. `np.mean(..., axis=1)` takes the mean along axis 1, the columns axis: each row corresponds to one value of m and each column to one cross-validation fold, so you are averaging over the folds for each m.

Draw the learning curve for KNN with the best k value as well.
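Put together, the call and the averaging step might look like the sketch below. It assumes a recent scikit-learn (import from `sklearn.model_selection`) and uses random stand-in data. The choices `cv=5` and five training sizes are arbitrary, and the scores are converted to errors as 1 minus accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Random stand-in data, not the real votes.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(300, 16)).astype(float)
y = (X[:, :3].sum(axis=1) >= 2).astype(int)

# Returns the training sizes and one score per (size, fold) pair.
m, train_scores, test_scores = learning_curve(
    LogisticRegression(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)

# Average over the folds (axis 1) and turn accuracy into error.
train_cv_err = 1 - np.mean(train_scores, axis=1)
test_cv_err = 1 - np.mean(test_scores, axis=1)
print(m, train_cv_err, test_cv_err)
```

Swapping `LogisticRegression()` for `KNeighborsClassifier(n_neighbors=best_k)` should give the KNN curve in the same way, where `best_k` stands for whatever k you found earlier.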

#### Challenge 8

This is a preview of many other classification algorithms that we will go over. Scikit-learn has the same interface for all of these, so you can use them exactly the same way as you did `LogisticRegression` and `KNeighborsClassifier`. Use each of these to classify your data and print the test accuracy of each:

Gaussian Naive Bayes

```python
from sklearn.naive_bayes import GaussianNB
```

SVM (Support Vector Machine) Classifier

```python
from sklearn.svm import SVC
```

Decision Tree

`from sklearn.tree import DecisionTreeClassifier`

Random Forest

```python
from sklearn.ensemble import RandomForestClassifier
```
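Because every classifier shares the same `fit`/`predict` interface, one loop can handle all of them. This is a sketch on random stand-in data, with each classifier left at its default settings (plus a fixed `random_state` where one exists, for repeatability).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Random stand-in data, not the real votes.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 16)).astype(float)
y = np.where(X[:, :3].sum(axis=1) >= 2, "democrat", "republican")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

classifiers = {
    "Gaussian Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

test_accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    test_accuracy[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {test_accuracy[name]:.3f}")
```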

#### Challenge 9

There is actually a way to do cross validation quickly to get your accuracy results for an algorithm, without separating training and test yourself:

```python
# scikit-learn 0.18 and later; older releases used sklearn.cross_validation
from sklearn.model_selection import cross_val_score
```

Just like the `learning_curve` function, this takes a classifier object, `X` and `Y`. It returns the accuracy (or whatever score you prefer, via the `scoring` keyword argument). Of course, it returns one score per cross-validation fold, so to get the generalized accuracy, you need to take the mean of what it returns.

Use this function to calculate the cross validation score of each of the classifiers you tried before.
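A short sketch of the call, again on random stand-in data and assuming a recent scikit-learn. Five folds and `k=5` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Random stand-in data, not the real votes.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 16)).astype(float)
y = np.where(X[:, :3].sum(axis=1) >= 2, "democrat", "republican")

classifiers = {
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "logistic regression": LogisticRegression(),
}

cv_accuracy = {}
for name, clf in classifiers.items():
    # One accuracy per fold; the mean is the cross-validated accuracy.
    fold_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    cv_accuracy[name] = fold_scores.mean()
    print(f"{name}: {cv_accuracy[name]:.3f}")
```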

#### Challenge 10

Instead of 'democrat' or 'republican', can you predict the vote of a representative based on their other votes?

Reload the data from scratch. Convert y-->1, n-->0.

Choose one vote. Build a classifier (logistic regression or KNN), that uses the other votes (do not use the party as a feature) to predict if the vote will be 1 or 0.

Convert each `?` to the mode of the column (if a representative has not voted, make their vote 1 if most others voted 1, and 0 if most others voted 0).

Calculate the cross validation accuracy of your classifier for predicting how each representative will vote on the issue.
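The mode-filling step and the "predict one vote from the others" setup could be sketched as follows. The data is random, the column names are hypothetical, and `vote_1` is deliberately generated to agree with `vote_2` most of the time so the classifier has something to find.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Random stand-in votes with hypothetical column names.
rng = np.random.RandomState(0)
columns = [f"vote_{i}" for i in range(1, 7)]
votes = pd.DataFrame(rng.randint(0, 2, size=(200, 6)).astype(float), columns=columns)
# Make vote_1 agree with vote_2 about 90% of the time.
agree = rng.rand(200) < 0.9
votes["vote_1"] = np.where(agree, votes["vote_2"], 1 - votes["vote_2"])
# Blank out roughly 5% of the cells to play the role of '?'.
votes = votes.mask(rng.rand(200, 6) < 0.05)

# mode() ignores NaN; its first row holds the most common value of each column.
filled = votes.fillna(votes.mode().iloc[0])

target = "vote_1"
X = filled.drop(columns=[target])  # the other votes only, no party column
y = filled[target].astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean())
```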

#### Challenge 11

Back to your movie data! Choose one categorical feature to predict. I chose MPAA Rating, but genre, month, etc. are all decent choices. If you don't have any non-numeric features, you can make two bins out of a numeric one (like "Runtime>100 mins" and "Runtime<=100 mins").

Make a bar graph of how many movies there are in each category. For example, with Ratings, show how many G, PG, PG-13, R movies there are, etc. (basically a histogram of your labels).

Predict your outcome variable (labels) using KNN and logistic regression. Calculate their accuracies.

Make a stupid baseline predictor that always predicts the most frequent label in the data. Calculate its accuracy on a test set.

How much better do KNN and logistic regression do versus the baseline?

What are the coefficients of logistic regression? Which features affect the outcome how?
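For the baseline, you can write the function yourself as before, or lean on scikit-learn's `DummyClassifier`, which offers a `most_frequent` strategy. The sketch below takes the second route with a handful of made-up rating labels. The features are all zeros because the baseline ignores them anyway.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only.
y_train = np.array(["R", "R", "R", "PG", "PG-13", "R"])
y_test = np.array(["R", "PG", "R", "PG-13"])
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predicts the most common label seen during fit ("R" here).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(baseline_accuracy)
```

For the coefficient question, a fitted `LogisticRegression` exposes them as `coef_` (one row per class in the multiclass case) and `intercept_`.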

#### Challenge 12

Now you are a classification master. The representative votes dataset only had 0s and 1s. Let's just swiftly tackle the breast cancer surgery data we talked about in class.

Get it from here.

- What is the average and standard deviation of the age of all of the patients?
- What is the average and standard deviation of the age of those patients that survived 5 or more years after surgery?
- What is the average and standard deviation of the age of those patients who survived fewer than 5 years after surgery?
- Plot a histogram of the ages side by side with a histogram of the number of axillary nodes.
- What is the earliest year of surgery in this dataset?
- What is the most recent year of surgery?
- Use logistic regression to predict survival after 5 years. How well does your model do?
- What are the coefficients of logistic regression? Which features affect the outcome how?
- Draw the learning curve for logistic regression in this case.
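The summary-statistics questions above map fairly directly onto pandas. The sketch below uses six made-up patients, and the column names and the 0/1 coding of `survived_5y` are assumptions for illustration. Check the dataset description for the real column order and coding.

```python
import pandas as pd

# Six made-up patients; column names and coding are hypothetical.
patients = pd.DataFrame(
    {
        "age": [30, 40, 50, 60, 70, 45],
        "year": [64, 62, 65, 58, 66, 60],
        "nodes": [1, 3, 0, 10, 4, 0],
        "survived_5y": [1, 1, 1, 0, 0, 1],  # 1 = survived 5 or more years
    }
)

# Mean and (sample) standard deviation of age, overall and per outcome.
overall = patients["age"].agg(["mean", "std"])
by_outcome = patients.groupby("survived_5y")["age"].agg(["mean", "std"])

earliest_year = patients["year"].min()
latest_year = patients["year"].max()

print(overall)
print(by_outcome)
print(earliest_year, latest_year)
```

Note that pandas' `std` uses the sample standard deviation (`ddof=1`) by default, while `np.std` defaults to the population version, so the two can give slightly different numbers.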