CS 5785 COMBINED-XLIST Applied Machine Learning (2020FA)

Learn and apply key concepts of modeling, analysis and validation from machine learning, data mining and signal processing to analyze and extract meaning from data. Implement algorithms and perform experiments on images, text, audio and mobile sensor measurements. Gain working knowledge of supervised and unsupervised techniques including classification, regression, clustering, feature selection, and dimensionality reduction.

First Lecture Information

The first lecture is going to be on Thursday 08/27. Please use the following Zoom links to connect to the lecture:

You will also find Zoom links to all the lectures in Canvas under the "Zoom" tab.

Instruction Format

Meeting times

The class will be held twice a week, on Tuesdays and Thursdays. Instruction will be fully remote, and the lectures will be given via Zoom (see above for the URLs).

In order to accommodate students in different time zones, most Cornell Tech classes are going to have two sessions per lecture day: one earlier in the day, and one later in the day. See below for the session times for our course. The material in each session will be the same; in other words, you would normally attend only Session 1 or Session 2 on each lecture day and choose the one that's most convenient for you.

Reverse classroom

Our lectures will have the format of a "reverse classroom". This means that the core material will be pre-recorded and available to view ahead of time. You should review the material ahead of the lecture. We will use the time in class to answer student questions, go over homework exercises, conduct tutorials on class pre-requisites, etc.

The focus of the initial sessions/lectures will be on tutorials (linear algebra, probability, programming); once we cover more of the material, we will be doing more Q&A and working through problems.

Information

Instructor: Volodymyr Kuleshov

Credits: 3

Course Frequency: Fall Term

Times: Tues/Thurs, 1:00-1:50pm and 11:00-11:50pm Eastern Time.

Important: Thursday evening classes that fall on sprint dates (September 24, October 22, and November 12, from 2 p.m. onward) will instead hold make-up class meetings on the Sunday evening (Eastern Time) following the sprint.

Teaching Staff and Office Hours

Volodymyr Kuleshov (Instructor). Office Hours: Tue 10pm-11pm ET; Thu 1:50pm-2:45pm ET.

Jin Sun (Course Coordinator). http://www.cs.cornell.edu/~jinsun/

Andrew Bennett (Teaching Assistant). Office Hours: Wed and Fri 3pm-4pm ET. https://awbennett.net/

Kai Zhang (Teaching Assistant). Office Hours: Tue and Thu 10:30am-11:30am ET. https://kai-46.github.io/website/

Shachi Deshpande (Teaching Assistant). Office Hours: Sun and Wed 11pm-12am ET. https://www.cs.cornell.edu/~shachi/ 

The Zoom URLs for office hours are available in Canvas.

Student Outcomes

  1. Be able to analyze and extract meaning from data by applying key concepts of modeling, analysis, and validation from Machine Learning, Data Mining, and Signal Processing. 
  2. Implement algorithms and perform experiments on images, text, audio, and other modalities. 
  3. Demonstrate an understanding of modern machine learning algorithms like tree-based models, boosting, and deep neural networks.
  4. Gain working knowledge of supervised and unsupervised techniques and their relevant trade-offs in practical usage.

Preparation

Math. Students need to be comfortable with multivariable calculus, primarily integration and differentiation in multiple dimensions. The course also requires a basic understanding of probability at the level of an introductory undergraduate course. Teaching staff will hold review sessions to cover background material.

Programming. Students should have basic programming ability. The course uses Python and related data science libraries, including numpy, scipy, scikit-learn, and tensorflow or pytorch. Familiarity with these libraries is preferred, but we expect students to be able to learn parts of these libraries during the course. Teaching staff will hold review sessions to cover background material.

Prerequisites. CS 2800 or equivalent, Linear Algebra, Probability, and experience programming with Python, or permission of the instructor.

Textbooks and Other Materials

  • Textbooks (Optional) 
    • T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edition), Springer-Verlag, 2008. (available for free)
    • K. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
    • C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

Grading

Component | Description | Weight
Homework 1 | Combination of theory and programming questions | 15%
Homework 2 | Combination of theory and programming questions | 15%
Homework 3 | Combination of theory and programming questions | 15%
Homework 4 | Combination of theory and programming questions | 15%
Project Proposal | Brief description of the planned project, around 300 words. | 5%
Project Milestone | Mid-semester progress report on course project, 3-5 pages in length. | 10%
Final Project | Final report on the course project, 5 pages in length. | 25%
Total Points | | 100%
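As a worked example of how the weights above combine, here is a minimal Python sketch. It assumes each component is scored on a 0-100 scale and that the course grade is a plain weighted average; the syllabus states only the weights, so the combining rule, the names WEIGHTS and course_grade, and the sample scores are illustrative assumptions rather than the official grading procedure.

```python
# Weights (in percent) taken from the grading breakdown above.
WEIGHTS = {
    "Homework 1": 15,
    "Homework 2": 15,
    "Homework 3": 15,
    "Homework 4": 15,
    "Project Proposal": 5,
    "Project Milestone": 10,
    "Final Project": 25,
}


def course_grade(scores):
    """Weighted average of per-component scores, each on a 0-100 scale.

    Assumption: a plain weighted average; the syllabus gives only the weights.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example_scores = {
    "Homework 1": 90,
    "Homework 2": 80,
    "Homework 3": 100,
    "Homework 4": 90,
    "Project Proposal": 100,
    "Project Milestone": 80,
    "Final Project": 90,
}

print(course_grade(example_scores))  # 89.5
```

The four homeworks together carry 60% of the grade and the three project deliverables the remaining 40%.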

Assignments

Written Assignments: Homeworks should be written up clearly and succinctly; you may lose points if your answers are unclear or unnecessarily complicated. You are encouraged to use LaTeX to write up your homeworks, but this is not a requirement. Assignments will be submitted on Gradescope. You may work in teams of two: make sure to put both of your names on the submission and submit as a team in Gradescope.

Late Submissions: You have 6 late days which you can use at any time during the term without penalty (for both assignments and projects). The final project writeup cannot be submitted late because we need to grade it in a short amount of time. Once you run out of late days, you will incur a 20% penalty for each extra late day you use. When submitting as a team, each one of you must use a late day. Each late submission should be clearly marked as “Late” on the first page. No submission will be accepted 3 days after the deadline.
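The arithmetic of the late policy can be sketched in Python as follows. This is one reading of the rules, not the official grading code: it assumes the 20% penalty is deducted from the earned score for each late day not covered by a remaining free late day, that free late days are always used first, and that a submission more than 3 days late receives no credit. The function name late_score and its parameters are hypothetical, and the sketch does not cover the final project writeup, which cannot be submitted late.

```python
def late_score(raw_score, days_late, late_days_left,
               max_days_late=3, penalty_per_day=0.20):
    """Return (adjusted_score, late_days_left_after) for one submission.

    Assumptions (not stated explicitly in the syllabus):
    - free late days are consumed before any penalty applies;
    - each uncovered late day removes 20% of the earned score;
    - a submission more than `max_days_late` days late earns nothing.
    """
    if days_late < 0 or late_days_left < 0:
        raise ValueError("days_late and late_days_left must be non-negative")
    if days_late > max_days_late:
        return 0.0, late_days_left  # not accepted; no late days consumed
    free_days_used = min(days_late, late_days_left)
    penalized_days = days_late - free_days_used
    factor = max(0.0, 1.0 - penalty_per_day * penalized_days)
    return raw_score * factor, late_days_left - free_days_used


# Two days late with all 6 late days available: no penalty, 4 days remain.
print(late_score(90, 2, 6))

# Three days late with only 1 late day left: two days are penalized.
print(late_score(80, 3, 1))
```

For team submissions, the same calculation would apply to each team member's own late-day balance, since every member must use a late day.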

Course Project

The course project will give the students a chance to explore machine learning in greater detail. Course projects will be done in groups of up to 3 students and can fall into one or more of the following categories:

  • Application of machine learning to a practical problem or a dataset.
  • Improvements to machine learning algorithms.
  • Theoretical analysis of any aspect of machine learning models.

Pick a topic that's meaningful to you and that excites you. For example, if you do PhD research in biology, you can do a project related to a dataset that you work with. If you're in Urban Tech, you can work with a city dataset that you find interesting. You are encouraged to find something on your own, but we are also going to share topic ideas in Canvas and you should feel free to talk to the teaching team during office hours.

Proposal (Due 10/04 at 11:59pm ET)

Your proposal should give the title of the project, the project category, the names of your team members, their NetIDs, and a 300-500 word description of what you plan to do. It should contain the following information:

  • Motivation: What problem are you tackling? Is this an application or a theoretical result?
  • Method: What machine learning techniques are you planning to apply or improve upon and how?
  • Future work: What experiments are you planning to perform or what theorems do you want to prove?

The goal of the proposal is to make sure you're on the right track. As long as you follow the above guidelines, you should do well.

Please submit the proposal via Gradescope and make sure to submit as a team.

Milestone (Due 11/08 at 11:59pm ET)

The milestone submission should describe what you've accomplished so far, and briefly say what else you plan to do. The format should be the same as that of the final project, with a maximum length of 3 pages (excluding references). The goal is to make sure that you are on track to finish the final project.

  • Motivation: What problem are you tackling? Is this an application or a theoretical result?
  • Method: What machine learning techniques are you planning to apply or improve upon and how?
  • Preliminary experiments: Describe the experiments that you've run, the outcomes, and any error analysis that you've done. You should have tried at least one baseline.
  • Future work: What else do you plan to do?

The goal of the milestone is to make sure you're on the right track. As long as you follow the above guidelines, you should do well.

Please submit the milestone via Gradescope and make sure to submit as a team.

Final Writeup (Due 12/14 at 11:59pm ET -- no late days!)

The final writeup should describe all the work you did for your course project and summarize the main results. You can think of it as a technical report that presents your findings to a general machine learning audience.

The style and format of the writeup should be similar to that of a research paper. The maximum length is 5 pages, excluding references. We provide a LaTeX template adapted from the NeurIPS style files for your reference (go to Files->Project->AML Project Report NeurIPS Template.zip).

There are no strict requirements on the structure of the final writeup, but one way to structure it is to include the following sections, which are fairly standard for a research paper.

  • Abstract: Summarize the problem, novel contributions, and results in one paragraph.
  • Introduction: Provide motivation for the problem and expand upon the overview in the abstract.
  • Background: Briefly summarize the background knowledge needed to understand the work.
  • Method: Describe the methods that will be used or implemented in the paper.
  • Theoretical analysis: If you are doing a theory project, describe your theoretical results here.
  • Experimental analysis: Describe in detail your experiments.
  • Discussion and Prior Work: Discuss the key takeaways from your experiments. Put your results in the context of previous work.
  • Conclusion: You may summarize the paper or talk about open problems and open directions.

Regardless of how the writeup is structured, please make sure to cover the following points.

  • Motivation: What problem are you tackling? Why is it interesting? What type of project will this be (application, method, theory)?
  • Method: What machine learning techniques are you planning to apply or improve upon and how? Make sure to describe them in detail and provide enough context for the reader to understand the methods at least at a high level. Provide any background that is necessary for that.
  • Experiments: Describe the experiments that you've run, the outcomes, and any error analysis that you've done. Make sure that the setup is described in enough detail for someone else to reproduce your results. Also, if you have an experimental project, make sure to provide a detailed experimental analysis. Things you should consider including are: train/test performance, learning curves, model samples, error analyses, ablation analyses, etc. Most projects should also include baselines.
  • Theory: If doing a theory project, state your results formally as theorems. Make sure that all the symbols are defined. The best presentation of theoretical results also explains them in plain language and conveys the intuition behind them.
  • Context: Explain how you build upon previous work and how your results compare to what has been done previously.

Writeups will be evaluated on the clarity of their presentation, adherence to the above guidelines, the significance of the project (does it explore a toy dataset or a real problem?), and the technical quality of the work (the level of depth in the experimental or theoretical analyses, whether the approach makes sense technically, whether the algorithms implemented are reasonable and studied in enough detail, etc.).

Please submit the writeup via Gradescope and make sure to submit as a team.

Format

Project Guidelines: Projects should be written up clearly and succinctly. You are encouraged to use LaTeX, but this is not a requirement. Projects will be submitted on Gradescope. You may work in teams of up to three: make sure to put all of your names on the submission and submit as a team in Gradescope.

Late Submissions: You have 6 late days which you can use at any time during the term without penalty (for both assignments and projects). The final project writeup cannot be submitted late because we need to grade it in a short amount of time. Once you run out of late days, you will incur a 20% penalty for each extra late day you use. When submitting as a team, each one of you must use a late day. Each late submission should be clearly marked as “Late” on the first page. No submission will be accepted 3 days after the deadline.

Collaboration Policy and Honor Code

You are free to form study groups and discuss homeworks and projects. However, you must write up homeworks and code from scratch independently without referring to any notes from the joint session. You should not copy, refer to, or look at solutions to previous years’ homeworks when preparing your answers. It is an honor code violation to intentionally refer to a previous year’s solutions, either official or written up by another student. Anybody violating the honor code will be referred to the Office of Judicial Affairs.

Contents

Date | Recorded Lecture | Course Contents | Reading References
8/27/2020 | 1 | Introduction to Applied Machine Learning. Supervised, unsupervised, reinforcement learning. | The Elements of Statistical Learning (ESL) Chapter 1
9/1/2020 | 2 | Supervised Learning: Introduction. Models, features, objectives, model training, ordinary least squares. | ESL 1, 2.1-2.3, 3.1-3.2
9/3/2020 | 3 | Supervised Learning: Linear Regression. Optimization by gradient descent, normal equations, polynomial feature expansion, extensions of linear regression. | ESL 3.1-3.2, 5.1-5.2
9/8/2020 | 4 | Supervised Learning: Why Does It Work? Data distribution, hypothesis classes, Bayes optimality, over/under fitting, regularization. | ESL 2.6-2.9, ESL 18
9/10/2020 | 5 | Supervised Learning: A Probabilistic Perspective. Maximum likelihood learning, Bayesian ML, MAP learning. | ESL 2.4-2.6, 8.1-8.3
9/15/2020 | 5b | Supervised Learning: A Probabilistic Perspective. Example Algorithms. |
9/17/2020 | 6 | Supervised Learning: Classification. KNN and Logistic Regression. | ESL 4
9/22/2020 | 7 | Supervised Learning: Generative models. Gaussian Discriminant Analysis. | ESL 6.6
9/24/2020 | 8 | Supervised Learning: Naive Bayes. Bag of words representations, generative vs. discriminative methods. | ESL 6.6
9/29/2020 | 9 | Supervised Learning: Support Vector Machines. Margins, max-margin classifiers, hinge loss, subgradient descent. | ESL 12.1-12.3
10/1/2020 | 10 | Supervised Learning: Dual Formulation of SVMs. Lagrange duality, dual formulation of SVM, SMO algorithm. | ESL 12.1-12.3
10/6/2020 | 11 | Supervised Learning: Kernels. Mercer's theorem, RBF kernels. | ESL 6.1-6.7, 12.3
10/8/2020 | 12 | Supervised Learning: Decision Trees. Bagging, ensembling, CART. | ESL 8.7, 9.1-9.2
10/15/2020 | 13 | Supervised Learning: Boosting. AdaBoost, gradient boosting. | ESL 10.1-10.10
10/20/2020 | 14 | Supervised Learning: Neural Networks. Perceptrons, multi-layer neural networks. | ESL 11.1-11.5
10/22/2020 | 15 | Supervised Learning: Deep Learning. Convolutional neural networks and applications. |
10/27/2020 | | Guest Lecture: Advanced Deep Learning |
10/29/2020 | | Guest Lecture: Advanced Deep Learning |
11/3/2020 | | US Elections. No Lecture. |
11/5/2020 | 16 | Unsupervised Learning: Introduction. | ESL 14
11/10/2020 | 17 | Unsupervised Learning: Density estimation. Probabilistic Models. K-Nearest Neighbors. | ESL 14
11/12/2020 | 18 | Unsupervised Learning: Clustering. K-means, expectation-maximization. | ESL 14
11/17/2020 | 19 | Unsupervised Learning: Dimensionality Reduction. PCA, ICA. | ESL 14
11/19/2020 | 20 | Applying Machine Learning: Evaluation. Dataset splits, cross-validation, performance measures. | ESL 7.10
11/24/2020 | 21 | Applying Machine Learning: Diagnosis. Model iteration process, bias/variance tradeoff, baselines, learning curves. | ESL 2.9, 7.1-7.5
12/1/2020 | 22 | Applying Machine Learning: Diagnosis. Error analysis, data integrity, human-level performance. |
12/3/2020 | 23 | Understanding Machine Learning: Bias/variance tradeoff. Empirical risk minimization. Learning theory. |
12/8/2020 | | Review of the course. Taxonomy of ML algorithms. Research directions. |

 
