# Machine Learning - Fall 2014

## Machine Learning

This course will be similar to http://curtis.ml.cmu.edu/w/courses/index.php/Syllabus_for_Machine_Learning_10-601_in_Fall_2013 - we will turn data into knowledge.

All programming will be in Python and R.

Texts:

- Required: "Doing Data Science" - http://shop.oreilly.com/product/0636920028529.do (used copies may be cheaper)
- Optional: "Python for Data Science" - http://shop.oreilly.com/product/0636920023784.do

Programming: We will be using both R and Python. You may wish to purchase texts or read tutorials to help you with these languages.

## Week 1

For Aug 28:

- Read chapters 2 and 3 of DDS and be ready to discuss
- Install R and R Studio on your personal machine
- Install Anaconda on your personal machine

August 28: /Introduction to R

## Week 2

For September 2:

- Create a page for this class in your wiki account and add it to the student list
- Use R to investigate one of the other NYT data files. Create a graph that shows something interesting in your data. Post the commands you used and the final graph to the wiki.

September 4:

- Data: media:ml-regression.zip
- Notes: /Logistic Regression

Homework:

Before class Tuesday, clean the Pima data more thoroughly and examine the predictive power of two other columns, using both linear and logistic regression to predict. Post your results to the wiki.
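For intuition on what the logistic fit is doing, here is a minimal Python sketch on hypothetical toy data (not the Pima columns), using plain gradient ascent rather than R's glm:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y=1|x) = sigmoid(w*x + b) by stochastic gradient ascent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            w += lr * (y - p) * x   # gradient of the log-likelihood
            b += lr * (y - p)
    return w, b

# toy predictor column and 0/1 outcome
xs = [1, 2, 3, 7, 8, 9]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
predict = lambda x: int(sigmoid(w * x + b) > 0.5)
```

Unlike a linear fit, the output is a probability in (0, 1), which is why logistic regression is the better tool for a 0/1 outcome like diabetes.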

## Week 3

- Notes: /Multi-Variable Regression

## Week 4

Dataset: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

## Week 5

See the R project for kNN.

Assignment: (Due Friday, October 3)

Use kNN to predict diabetes using the same Pima dataset used earlier. Discard rows that contain NA. Within this dataset, find the two most predictive factors (you can use your regression experience to figure out which two) and visualize the space of these factors. Submit as a zipped HTML file created by RStudio; use comments to explain what you're doing. How does this compare to a regression on the same two variables? How does using all dimensions improve this? Try a few different k values.

Do the same thing for this dataset: http://archive.ics.uci.edu/ml/datasets/Breast+Tissue - figure out two "good" factors by looking at the parameters individually. Compare using all parameters to using just two.
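At its core, kNN is just "vote among the k closest rows". A minimal Python sketch on hypothetical toy data (R's class::knn does the same with more careful tie handling):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_tuple, label); classify query by majority
    vote among the k nearest neighbors (Euclidean distance)."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# hypothetical two-column training data
train = [((0, 0), "no"), ((0, 1), "no"), ((1, 0), "no"),
         ((5, 5), "yes"), ((5, 6), "yes"), ((6, 5), "yes")]
```

Because the vote is distance-based, unscaled columns will dominate the distance, which is why the data must be scaled first.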

## Week 6

Clustering Algorithms.

See:

- http://www.statmethods.net/advstats/cluster.html
- http://en.wikipedia.org/wiki/K-means_clustering
- http://en.wikipedia.org/wiki/Hierarchical_clustering

- http://shiny.rstudio.com/gallery/kmeans-example.html
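K-means itself is short enough to sketch in Python. This uses a naive "first k points" initialisation so the result is deterministic; real implementations use random restarts (e.g. the nstart argument to R's kmeans):

```python
import math

def kmeans(points, k, iters=20):
    """Lloyd's algorithm: assign each point to its nearest center,
    then move each center to the mean of its cluster."""
    centers = points[:k]   # naive init; use random restarts in practice
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return sorted(centers)

# two well-separated toy blobs
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11)]
```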

## Week 7

Homework: investigate clustering on the iris data - can we use clustering to predict the species?

Support Vector Machines
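A linear SVM can be sketched in a few lines of Python using the Pegasos sub-gradient method (hinge loss plus L2 regularisation) on hypothetical toy data; R's e1071::svm solves the same problem with a proper QP solver and adds kernels:

```python
def train_linear_svm(data, lam=0.01, epochs=500):
    """data: list of ((x1, x2), y) with y in {-1, +1}.
    Pegasos-style sub-gradient descent on the regularised hinge loss;
    the bias is folded in as a constant third feature."""
    w = [0.0, 0.0, 0.0]
    t = 0
    for _ in range(epochs):
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)
            xb = (x[0], x[1], 1.0)
            score = sum(wi * xi for wi, xi in zip(w, xb))
            if y * score < 1:   # inside the margin: step toward the point
                w = [(1 - eta * lam) * wi + eta * y * xi
                     for wi, xi in zip(w, xb)]
            else:               # outside the margin: only shrink w
                w = [(1 - eta * lam) * wi for wi in w]
    return w

data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1),
        ((3, 3), 1), ((3, 4), 1), ((4, 3), 1)]
w = train_linear_svm(data)
predict = lambda x: 1 if w[0] * x[0] + w[1] * x[1] + w[2] > 0 else -1
```

The "maximise the margin" objective is what separates this from a plain linear classifier: points outside the margin contribute nothing to the fit.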

## Week 8

Assignment:

Create an RStudio project that includes the following and be able to show it off in class:

- A function to load, clean, and scale the Pima data
- An SVM model for the Pima data that uses all of the columns to predict; print out the confusion matrix for this
- An SVM model that uses only two columns - pick the best two you can. Include a visualization of the decision space
- A kNN model that uses the same two columns
- A comparison of the kNN and SVM approaches - train on 50% of the data and evaluate the model on the other 50%
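A confusion matrix is just a cross-tabulation of actual vs. predicted labels (in R, table(actual, predicted) builds it directly); in Python terms:

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are actual labels, columns are predicted labels."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# hypothetical 0/1 outcomes and model predictions
actual    = [0, 0, 1, 1, 1]
predicted = [0, 1, 1, 1, 0]
```

The off-diagonal cells are the two kinds of error, which matters when the classes are unbalanced and raw accuracy is misleading.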

Finish this assignment and place a zipped HTML file in the wiki. Your code should demonstrate the following things:

- A second data set taken from http://archive.ics.uci.edu/ml/ - please include a link back to the data set.
- Adjustments of the parameters to the SVM in some situation - the kernel is particularly interesting. These adjustments should be visualized in a 2-D situation.
- A comparison of the SVM to kNN and a linear model (don't work too hard to optimize the kNN / linear model).
- One of the following:
  - Experiment with factors in the input: compare converting the factors to numerics vs. breaking a factor into a set of 0/1 indicator columns.
  - Experiment with more general classification problems in which the SVM classifies data into more than 2 categories.

This is due midnight Wednesday.

References:

- rpart vignette: http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
- e1071 SVM vignette (see this to find the kernels): http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf

## Week 9

Homework: Due Friday, October 31.

Find a data set and use rpart to create a decision tree. Find the optimum level of pruning and then evaluate the tree by comparing it to an SVM run on the same data. Don't spend too much time optimizing the SVM - just use it as a benchmark to compare to. Hand in a zipped HTML file via the wiki.
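The CART algorithm behind rpart reduces to one idea: repeatedly pick the split that most lowers Gini impurity, and stop at some depth (pruning controls how far the tree grows). A minimal Python sketch under those assumptions, with hypothetical toy data:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature, threshold) minimising weighted child impurity,
    or (None, None) if no split improves on the parent."""
    best_f, best_t, best_score = None, None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best_f, best_t, best_score = f, t, score
    return best_f, best_t

def build_tree(rows, labels, depth=3):
    """Leaves are (string) majority labels; internal nodes are 4-tuples."""
    if depth == 0 or gini(labels) == 0.0:
        return Counter(labels).most_common(1)[0][0]
    f, t = best_split(rows, labels)
    if f is None:
        return Counter(labels).most_common(1)[0][0]
    go_left = [r[f] <= t for r in rows]
    left = build_tree([r for r, g in zip(rows, go_left) if g],
                      [l for l, g in zip(labels, go_left) if g], depth - 1)
    right = build_tree([r for r, g in zip(rows, go_left) if not g],
                       [l for l, g in zip(labels, go_left) if not g], depth - 1)
    return (f, t, left, right)

def classify(tree, row):
    while isinstance(tree, tuple):
        f, t, lo, hi = tree
        tree = lo if row[f] <= t else hi
    return tree

rows = [(1,), (2,), (3,), (10,), (11,), (12,)]
labels = ["healthy", "healthy", "healthy", "sick", "sick", "sick"]
tree = build_tree(rows, labels)
```

The depth limit plays the role of pruning here; rpart instead grows deep and prunes back by the complexity parameter cp.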

## Week 10

Neural nets: http://cran.r-project.org/web/packages/neuralnet/neuralnet.pdf

Example: http://www.r-bloggers.com/using-neural-networks-for-credit-scoring-a-simple-example/

Theory: http://mi.eng.cam.ac.uk/~mjfg/local/I10/i10_hand4.pdf

Visualizing: http://beckmw.wordpress.com/2013/11/14/visualizing-neural-networks-in-r-update/

Expect a neural net homework next week. The forest assignment below will be due next week.

Homework: due November 7

Continue your last assignment on decision trees to generate a forest - explore 3 different ways to generate the forest and explain / evaluate each one.
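One way to generate a forest is bagging: train each tree on a bootstrap resample and let the trees vote. A Python sketch of the idea with one-feature decision stumps standing in for full trees and hypothetical toy data (R's randomForest uses real trees plus random feature selection at each split):

```python
import random
from collections import Counter

def train_stump(data):
    """data: list of (x, label). Fit the threshold with fewest errors;
    fall back to a constant majority-label stump if no split exists."""
    majority = Counter(l for _, l in data).most_common(1)[0][0]
    best, best_err = (None, majority, None), len(data) + 1
    for t in sorted({x for x, _ in data}):
        left = [l for x, l in data if x <= t]
        right = [l for x, l in data if x > t]
        if not left or not right:
            continue
        llab = Counter(left).most_common(1)[0][0]
        rlab = Counter(right).most_common(1)[0][0]
        err = sum(l != llab for l in left) + sum(l != rlab for l in right)
        if err < best_err:
            best, best_err = (t, llab, rlab), err
    return best

def stump_predict(stump, x):
    t, llab, rlab = stump
    return llab if t is None or x <= t else rlab

def train_forest(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data])  # bootstrap sample
            for _ in range(n_trees)]

def forest_predict(forest, x):
    votes = Counter(stump_predict(s, x) for s in forest)
    return votes.most_common(1)[0][0]

data = [(1, "a"), (2, "a"), (3, "a"), (8, "b"), (9, "b"), (10, "b")]
forest = train_forest(data)
```

Each resample sees a slightly different slice of the data, so the individual stumps disagree near the boundary while the vote stays stable.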

## Week 11

Homework: due November 21

Investigate neural nets - you can use either neuralnet or nnet. Find a data set (one already used before or a new one) and attempt to train the net with it. You will need to deal with factors in an intelligent manner - don't just let them turn into integers by default. Use cross-validation to evaluate your net. If you have problems training the net, explain the strategies that you attempted. Once you have a net that works, create one using just two columns so that you can visualize the results of the net. Please find two columns that are significant predictors - the only way to do this is to build the net on two columns and evaluate it.
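Two mechanical pieces of this assignment, sketched in Python: expanding a factor into 0/1 indicator columns (instead of letting it collapse to integers), and generating k-fold cross-validation splits.

```python
def one_hot(values):
    """Expand a factor column into one 0/1 indicator column per level."""
    levels = sorted(set(values))
    return [[int(v == level) for level in levels] for v in values], levels

def kfold_splits(n, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    fold = [i % k for i in range(n)]   # simple round-robin assignment
    for f in range(k):
        test = [i for i in range(n) if fold[i] == f]
        train = [i for i in range(n) if fold[i] != f]
        yield train, test
```

In R, model.matrix does the one-hot expansion for you; the point is that "red" and "blue" should become separate inputs, not 1 and 2 on a fake numeric scale.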

## Week 12

Digit recognition with Andy

## Week 13

- The nnet package: http://cran.r-project.org/web/packages/nnet/nnet.pdf

- We start on Recommender Systems: http://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf

## Fun with Google Analytics

Please look at CIS 397: Google Analytics in R

## Week 14

Homework (due Wednesday of finals week)

Do one of the following:

- Write a movie recommender using the data here: http://grouplens.org/datasets/movielens/ - I will do some adapting of this data and present my work Thursday. (R Project: File:Movielens.zip)
- Use the Google Analytics data to analyze our web traffic.
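The core of a user-based recommender can be sketched in Python with hypothetical ratings (recommenderlab's UBCF method is a more careful version of the same idea): score movies the user hasn't seen by the ratings of similar users, weighted by cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two item -> rating dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm

def recommend(ratings, user):
    """Suggest the unseen item with the highest similarity-weighted score."""
    sims = {u: cosine(ratings[user], r)
            for u, r in ratings.items() if u != user}
    scores = {}
    for u, r in ratings.items():
        if u == user:
            continue
        for item, val in r.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sims[u] * val
    return max(scores, key=scores.get)

# hypothetical user -> {movie: rating} data
ratings = {
    "ann": {"Alien": 5, "Blade Runner": 4, "Heat": 1},
    "bob": {"Alien": 5, "Blade Runner": 5, "Casablanca": 4},
    "cat": {"Heat": 5, "Casablanca": 1, "Up": 2},
}
```

The MovieLens data is just a much larger table of the same shape, which is where the sparse-matrix machinery in recommenderlab earns its keep.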

## Week 15

Presentations. Each student needs to make a presentation of about 10 minutes. You can either use a live demo in R or the generated HTML.

Your presentation should contain the following:

- A description of your data set - this has to be one that was not used as an example in class
- Any cleaning of your data
- Visualizations of the relationship between specific fields in your data and the result
- A demonstration of 2 or more learning algorithms, with visualization
- Assessment through cross-validation

Schedule:

Tuesday:

- Matt
- Quinn
- Lucas
- Jaden
- Tsering
- Luke

Thursday:

- Charles
- Bendix
- Lindsey
- Rob
- Blake