Machine Learning - f14
This course will be similar to http://curtis.ml.cmu.edu/w/courses/index.php/Syllabus_for_Machine_Learning_10-601_in_Fall_2013 - we will turn data into knowledge.
All programming will be in Python and R.
- Required: "Doing Data Science" - http://shop.oreilly.com/product/0636920028529.do (or better deals used)
- Optional: "Python for Data Science" - http://shop.oreilly.com/product/0636920023784.do
Programming: We will be using both R and Python. You may wish to purchase texts or read tutorials to help you with these languages.
For Aug 28:
- Read chapters 2 and 3 of DDS and be ready to discuss
- Install R and R Studio on your personal machine
- Install Anaconda on your personal machine
August 28: /Introduction to R
For September 2:
- Create a page for this class in your wiki account and add it to the student list
- Use R to investigate data in one of the other NYT data files. Create a graph that shows something interesting in your data. Post to the wiki the commands you used and the final graph generated.
Before class Tuesday, take the pima data, clean it more thoroughly, and look at the predictive power of two other columns, using both linear and logistic regression to predict. Post your results in the wiki.
- Notes: /Multi-Variable Regression
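The regression assignment itself is in R, but the idea can be sketched in Python with scikit-learn; the breast-cancer dataset and its first two columns are stand-ins for the pima data and your two chosen predictors (both assumptions, purely for illustration):

```python
# Sketch (not the course's R workflow): comparing linear vs logistic
# regression for predicting a binary outcome, in Python/scikit-learn.
# The breast-cancer dataset stands in for the pima data.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X = X[:, :2]  # two illustrative columns, like the two pima predictors
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Linear regression: threshold the continuous prediction at 0.5.
lin = LinearRegression().fit(X_tr, y_tr)
lin_acc = ((lin.predict(X_te) > 0.5) == y_te).mean()

# Logistic regression predicts class labels directly.
log = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
log_acc = (log.predict(X_te) == y_te).mean()

print(f"linear (thresholded): {lin_acc:.2f}, logistic: {log_acc:.2f}")
```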
See the R project for knn
Assignment: (Due Friday, October 3)
Use knn to predict diabetes using the same pima dataset used earlier. Discard rows that contain NA. Within this dataset, find the two most predictive factors (you can use your regression experience to figure out which two) and visualize the space of these factors. Submit as a zipped html created by RStudio; use comments to explain what you're doing. How does this compare to a regression on the same two variables? How does using all dimensions improve this? Try a few different k values.
Do the same thing for this dataset: http://archive.ics.uci.edu/ml/datasets/Breast+Tissue - figure out two "good" factors by looking at the parameters individually. Compare using all parameters to using just two.
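The steps above (drop NA rows, scale, fit knn, try several k) can be sketched in Python; again the breast-cancer dataset is an assumed stand-in for the pima data, and the real assignment is in R:

```python
# Sketch of the kNN workflow: discard rows with NA, scale the columns
# (kNN distances need comparable scales), then try a few k values.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # stand-in for pima
X = X.astype(float)
mask = ~np.isnan(X).any(axis=1)  # discard rows that contain NA
X, y = X[mask], y[mask]

X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for k in (1, 3, 5, 11, 21):  # try a few different k values
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = knn.score(X_te, y_te)
print(scores)
```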
Homework: investigate clustering on the iris data - can we use clustering to predict the species?
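One way to frame the homework question: run k-means with k = 3 and measure how well the clusters line up with the species labels. A Python sketch (the homework itself is in R):

```python
# Can clustering recover the iris species? Cluster with k-means (k = 3),
# map each cluster to its majority species, and measure agreement.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

agree = 0
for c in range(3):
    members = y[km.labels_ == c]          # true species in this cluster
    agree += np.bincount(members).max()   # majority-species count
print(f"agreement with species labels: {agree / len(y):.2f}")
```

Clustering never sees the labels, so any agreement it achieves comes purely from structure in the measurements.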
Support Vector Machines
Create an RStudio project that includes the following and be able to show this off in class:
- A function to load, clean, and scale the pima data
- An svm model for the pima data that uses all of the columns to predict; print out the confusion matrix for this
- An svm model that uses two columns only - pick the best two you can. Include a visualization of the decision space
- A knn model that uses the same two columns
- A comparison of the knn and svm approaches - evaluate with 50% of the data used to train the model and 50% to evaluate it
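The checklist above can be sketched in Python/scikit-learn (the project itself is an RStudio project); the breast-cancer dataset and columns 0 and 1 are assumed stand-ins for the pima data and your two best columns:

```python
# Sketch of the project checklist: load/clean/scale, svm on all columns
# with a confusion matrix, svm and knn on the same two columns, 50/50 split.
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def load_clean_scale():
    # In the real assignment: read the pima data, fix missing values,
    # then scale. The stand-in dataset needs no cleaning.
    X, y = load_breast_cancer(return_X_y=True)
    return StandardScaler().fit_transform(X), y

X, y = load_clean_scale()
# 50% of the data to train the model, 50% to evaluate it
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

svm_all = SVC().fit(X_tr, y_tr)                  # all columns
cm = confusion_matrix(y_te, svm_all.predict(X_te))
print(cm)

two = [0, 1]  # stand-ins for the two best columns
svm2 = SVC().fit(X_tr[:, two], y_tr)
knn2 = KNeighborsClassifier(n_neighbors=5).fit(X_tr[:, two], y_tr)
print("svm(2):", round(svm2.score(X_te[:, two], y_te), 2),
      "knn(2):", round(knn2.score(X_te[:, two], y_te), 2))
```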
Finish this assignment and place a zipped html file in the wiki. Your code should demonstrate the following things:
- A second data set taken from http://archive.ics.uci.edu/ml/ - please include a link back to the data set.
- Adjustments of the parameters to svm in some situation - the kernel is particularly interesting. These adjustments should be visualized in a 2-D situation.
- Comparison of svm to knn and a linear model (don't work too hard to optimize the knn / linear model).
- Do one of the following:
- Experiment with factors in the input by converting the factors to numerics and comparing that against breaking each factor into a set of 0 / 1 indicator columns.
- Experiment with more general classification problems in which the svm classifies data into more than 2 categories.
This is due midnight Wednesday.
To learn about the kernels, see http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf
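The vignette linked above covers the kernel options in R's e1071; a quick Python sketch of how the kernel and cost choices change an SVM's fit, on a toy 2-D dataset where a linear boundary cannot work well:

```python
# How svm parameter adjustments matter: compare kernels and cost values
# on the two-moons dataset (2-D, so the decision space could be plotted).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

results = {}
for kernel in ("linear", "rbf"):
    for C in (0.1, 10):
        results[(kernel, C)] = SVC(kernel=kernel, C=C).fit(X, y).score(X, y)
        print(kernel, C, round(results[(kernel, C)], 2))
```

The rbf kernel can bend around the interleaved moons while the linear kernel cannot, which is exactly the kind of effect worth visualizing in 2-D.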
Homework: Due Friday, October 31.
Find a data set and use rpart to create a decision tree. Find the optimum level of pruning and then evaluate the tree by comparing it to a svm run on the same data. Don't spend too much time optimizing the svm - just use this as a benchmark to compare to. Hand in a zipped html file via the wiki.
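The assignment uses rpart in R; the analogous idea can be sketched in Python, with scikit-learn's cost-complexity pruning (ccp_alpha) standing in for rpart's cp table, and the breast-cancer dataset as an assumed stand-in for your chosen data:

```python
# Grow a tree, enumerate candidate pruning levels, pick the one that
# does best on held-out data. (A fuller run would cross-validate the
# pruning level, as rpart's printcp/plotcp workflow suggests.)
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate pruning levels, analogous to rpart's cp values.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
alphas = [a for a in path.ccp_alphas if a >= 0]

best = max(alphas,
           key=lambda a: DecisionTreeClassifier(random_state=0, ccp_alpha=a)
           .fit(X_tr, y_tr).score(X_te, y_te))
tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best).fit(X_tr, y_tr)
print("pruned tree accuracy:", round(tree.score(X_te, y_te), 2))
```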
Expect a neural net homework next week. A forest assignment will be due next week.
Homework: due November 7
Continue your last assignment on decision trees to generate a forest - explore 3 different ways to generate the forest and explain / evaluate each one.
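Three common ways to grow a forest from trees are bagging, random forests, and boosting; a Python sketch of comparing them (the assignment itself continues your R decision-tree work, and the dataset here is a stand-in):

```python
# Three ways to build an ensemble of trees, each cross-validated:
# bagging (bootstrap samples, full trees), random forests (bagging plus
# random feature subsets at each split), and boosting (trees fit
# sequentially to the previous trees' errors).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
models = {
    "bagging": BaggingClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
}
scores = {}
for name, m in models.items():
    scores[name] = cross_val_score(m, X, y, cv=5).mean()
    print(name, round(scores[name], 2))
```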
Homework: due November 21
Investigate neural nets - you can use either neuralnet or nnet. Find a data set (you can use one that has already been used before or a new one) and attempt to train the net with it. You will need to deal with factors in an intelligent manner - don't just let them turn into integers by default. Use cross validation to evaluate your net. If you have problems training the net, explain the strategies that you attempted. Once you have a net that works, create one using just two columns so that you can visualize the results of the net. Please find two columns that are significant predictors - the only way to do this is to build the net on two columns and evaluate it.
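The factor-handling point above can be sketched in Python (the assignment uses R's neuralnet or nnet): break each factor into 0/1 indicator columns rather than letting it become integer codes, scale the numeric inputs, and cross-validate. The data here is made up purely for illustration:

```python
# Sketch: encode a factor as indicator columns, then cross-validate a
# small neural net. Made-up data; the label depends on one numeric
# column XOR one factor level, which integer-coding would obscure.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400
numeric = rng.normal(size=(n, 2))                  # numeric columns
factor = rng.choice(["a", "b", "c"], size=(n, 1))  # a factor column
y = ((numeric[:, 0] > 0) ^ (factor[:, 0] == "a")).astype(int)

# Break the factor into 0/1 indicator columns - not integers by default.
indicators = (factor == np.array(["a", "b", "c"])).astype(float)
X = np.hstack([StandardScaler().fit_transform(numeric), indicators])

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
cv_acc = cross_val_score(net, X, y, cv=5).mean()
print("cross-validated accuracy:", round(cv_acc, 2))
```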
Digit recognition with Andy
- The nnet package: http://cran.r-project.org/web/packages/nnet/nnet.pdf
- We start on Recommender Systems: http://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf
Fun with Google Analytics
Please look at CIS 397: Google Analytics in R
Homework (due Wednesday of finals week)
Write a movie recommender using the data here: http://grouplens.org/datasets/movielens/ - I will do some adapting on this data and present my work Thursday.
R Project: File:Movielens.zip
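recommenderlab (linked above) handles this in R; the core idea behind user-based collaborative filtering can be sketched in Python on a tiny made-up ratings matrix. Rows are users, columns are movies, 0 means unrated; treating zeros as ratings inside the similarity is a simplification for the sketch:

```python
# User-based collaborative filtering sketch: predict a user's missing
# rating as a cosine-similarity-weighted average of other users'
# ratings for that movie.
import numpy as np

R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def predict(user, movie):
    rated = R[:, movie] > 0  # only users who rated this movie count
    sims = np.array([np.dot(R[user], R[v]) /
                     (np.linalg.norm(R[user]) * np.linalg.norm(R[v]))
                     for v in range(len(R))])
    sims[user] = 0  # exclude the target user's own (missing) rating
    w = sims * rated
    return np.dot(w, R[:, movie]) / w.sum()

print(round(predict(0, 2), 2))  # user 0's predicted rating for movie 2
```

User 0's tastes match user 1 (who rated movie 2 low), so the prediction lands near the low end, which is the behavior recommenderlab's UBCF method builds on at scale.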
Use the Google Analytics data to analyze our web traffic.
Presentations. Each student needs to make a presentation of about 10 minutes. You can either use a live demo in R or the generated HTML.
Your presentation should contain the following:
- A description of your data set - this has to be one that was not used as an example in class
- Any cleaning of your data
- Visualizations of the relationship between specific fields in your data and the result
- A demonstration of 2 or more learning algorithms, with visualization
- Assessment through cross-validation