Recent Announcements
- November 21: We will discuss the logistics of the project presentations in class this Monday.
- November 9: The last homework assignment has been posted. Don't forget that progress reports are due November 16.
- October 28: There will be no office hours on Friday, November 4. I am happy to chat and answer questions during the break or after class on Wednesday.
Course Overview
This course will provide a broad overview of the theoretical foundations underlying common machine learning algorithms. We will cover the following topics, though not necessarily in this order:
- An introduction to the Probably Approximately Correct (PAC) model of learning
- Uniform convergence bounds
- VC theory
- Online learning, mistake bounds, the Perceptron algorithm, and Winnow
- Learning from expert advice, no-regret learning, and the randomized Weighted Majority algorithm
- Ensemble methods and boosting
- Support vector machines and kernels
- Connections to game theory
Who should take this course? This course is primarily intended for students with some background and interest in machine learning who would like to know more about the theoretical foundations, and students with a background and interest in theoretical computer science who would like to know more about machine learning. Although this course has no official prerequisites, students who enroll must have a basic knowledge of probability (roughly at the level of the first half of 112) and must be comfortable reading and writing formal mathematical proofs.
Who should not take this course? This course is not intended for students who are primarily interested in applications of machine learning. There is no programming required, and we won't spend much time talking about implementation. If you would prefer a more applied view of machine learning, consider taking 276A.
This course does count toward the AI major and minor field requirements for computer science Ph.D. students.
Staff Contact Info
Instructor: Jenn Wortman Vaughan
Office Hours: Friday, 2-3pm, 4532H Boelter Hall
Contact: jenn at cs
Grader: Jacob Mathew
Contact: jacobgmathew at gmail
Breakdown of Grades
Grades will be based on the following components:
- Homework Assignments (60%): There will be 4 homework assignments due in class. Assignments submitted up to 24 hours late will be penalized 25%. No assignments will be accepted for credit more than 24 hours after the deadline. All solutions must be typed and printed out. Solutions that are not typed will be penalized 25%, and unreadable answers will not be graded. Your solutions will be graded on both correctness and clarity. If you cannot solve a problem completely, you will get more partial credit if you identify the gaps in your argument than you will if you try to cover them up. For some assignments, only a subset of the problems may be graded in detail. You may or may not be told in advance which problems will be graded.
- Final Project (40%): Projects consist of a written report and in-class presentation. All projects must be completed in groups of 2-3 students; there are no exceptions to this rule. 25% of your project grade will be based on your proposal and status report, 50% on your final write-up, and 25% on your final presentation, which will be scored by your peers in class.
If you believe that a mistake was made in the grading of a homework assignment, you may submit a request for a regrade. This request must be submitted in writing, and must include a clear explanation of the reason you believe you should have received more points. All regrade requests must be received within one week.
Schedule & Lecture Notes
This schedule is tentative and likely to shift as we get further into the material. Lecture notes will be posted here as the quarter progresses.
Part 1: Classification and the PAC Model
- Monday, September 26 (lecture notes, slides, course syllabus, academic honesty policy)
Overview of course logistics. Introduction to machine learning and the goals of learning theory. The consistency model.
See also lecture 1 and lecture 2 from Rob Schapire's course at Princeton.
- Wednesday, September 28 (lecture notes)
Generalizability, the PAC learning model, PAC learning thresholds.
See also Chapter 1 of the recommended Kearns and Vazirani text (on hold at the Engineering library) or lecture 3 from Rob Schapire's course. For a general probability review, try the probability review from David Blei's course at Princeton, or Appendix C.2 and C.3 of CLRS. (See Piazza for more suggestions, or to suggest your own references.)
- Monday, October 3 (lecture notes)
Sample complexity bounds for finite classes, modified definition of the PAC model.
See also Chapter 1 of Kearns and Vazirani (which explores the idea of the "size" of a function in much more detail than we will in this class).
- Wednesday, October 5 (lecture notes)
The agnostic setting, Hoeffding's inequality and uniform convergence. (A short summary of these bounds appears after this list.)
See also lecture 8 of Rob Schapire's course, or, for more on Hoeffding's inequality, the introduction to this survey paper on concentration inequalities by Boucheron, Bousquet, and Lugosi.
- Monday, October 10 (lecture notes)
Infinite function classes, VC dimension.
See also Andrew Ng's lecture notes on learning theory or Chapter 3 of Kearns and Vazirani.
- Wednesday, October 12 (lecture notes)
VC dimension lower bound. Overview of the course projects and discussion of Problem Set 1.
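As a quick reference for the October 3 and October 5 material, here is a brief summary of the standard statements; the notation below is my own and may differ slightly from the lecture notes.

    Hoeffding's inequality: if $X_1, \dots, X_m$ are i.i.d. random variables taking values in $[0, 1]$ with mean $\mu$, then for any $\epsilon > 0$,
    \[ \Pr\Big[ \Big| \tfrac{1}{m} \textstyle\sum_{i=1}^{m} X_i - \mu \Big| \ge \epsilon \Big] \le 2 e^{-2 m \epsilon^2}. \]
    Applying this to the empirical error of each function in a finite class $\mathcal{H}$ and taking a union bound gives uniform convergence: with probability at least $1 - \delta$, every $h \in \mathcal{H}$ satisfies
    \[ \big| \widehat{\mathrm{err}}(h) - \mathrm{err}(h) \big| \le \sqrt{ \ln(2 |\mathcal{H}| / \delta) / (2m) }, \]
    so $m = O\big( \epsilon^{-2} \ln(|\mathcal{H}| / \delta) \big)$ samples suffice to estimate every function's error to within $\epsilon$.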
Part 2: Online Learning in Adversarial Settings
- Monday, October 17 (lecture notes)
Online learning, mistake bounds, the halving algorithm. More discussion of Problem Set 1.
See also Section 3 of Avrim Blum's online learning survey.
- Wednesday, October 19 (lecture notes)
The Perceptron algorithm.
See also the really nice proof of the Perceptron mistake bound from Andrew Ng's class. He also has a nice description of margins (Sections 1-3 only... we'll come back to the rest).
- Monday, October 24 (lecture notes)
Irrelevant features and the Winnow algorithm.
See also Section 3 of Avrim Blum's online learning survey (again).
- Wednesday, October 26 (lecture notes)
Introduction to learning from expert advice, regret minimization, and randomized weighted majority. (A short code sketch of this algorithm appears after this list.)
- Monday, October 31 (lecture notes)
Follow the Regularized Leader.
See also these notes from Adam Kalai's course at Georgia Tech. For a more advanced and more general description of this area than we'll cover in class, see Elad Hazan's draft book chapter on online convex optimization.
- Wednesday, November 2 (lecture notes)
The randomized weighted majority regret bound, and a proof of the minimax theorem of game theory.
See also Adam Kalai's notes. For the proof of the minimax theorem, see Chapter 4.4.2 of Algorithmic Game Theory.
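For students who want to experiment with the October 26 and November 2 material in code, here is a minimal sketch of randomized weighted majority (Hedge-style) in Python. The function name, parameters, and loss encoding are my own illustration, not code distributed with the course.

    import math

    def randomized_weighted_majority(expert_losses, eta=0.5):
        """Minimal sketch of the randomized weighted majority / Hedge update.

        expert_losses: a list of rounds, where each round is a list of
                       losses in [0, 1], one entry per expert.
        eta: learning rate; smaller values give slower, more conservative updates.
        Returns (the algorithm's total expected loss, the best expert's total loss).
        """
        n = len(expert_losses[0])
        weights = [1.0] * n                       # start with uniform weights
        alg_loss = 0.0
        expert_totals = [0.0] * n
        for losses in expert_losses:
            z = sum(weights)
            probs = [w / z for w in weights]      # follow expert i with probability w_i / sum_j w_j
            alg_loss += sum(p * l for p, l in zip(probs, losses))
            # multiplicative-weights update: penalize each expert according to its loss
            weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
            expert_totals = [t + l for t, l in zip(expert_totals, losses)]
        return alg_loss, min(expert_totals)

Comparing the algorithm's total expected loss to the best expert's total loss on a synthetic loss sequence gives a concrete feel for the regret bound proved in class.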
Part 3: Some Practical Successes of Learning Theory
- Monday, November 7 (lecture notes)
Weak vs. strong learning and the AdaBoost algorithm. (A short code sketch of AdaBoost appears after this list.)
See also Rob Schapire's overview of boosting (from 2003) or Chapter 4 of Kearns and Vazirani.
- Wednesday, November 9 (lecture notes)
Generalization error of AdaBoost, boosting the margin.
Same recommended references as last time.
- Monday, November 14 (lecture notes)
Convex optimization, maximizing the margin directly, support vector machines.
See also Andrew Ng's lecture notes. For more on Lagrange duality and convex optimization, check out Convex Optimization by Stephen Boyd and UCLA's Lieven Vandenberghe, available online for download.
- Wednesday, November 16 (lecture notes)
Properties of SVMs, SMO, soft margin, and kernel functions.
See also Andrew Ng's lecture notes. For full details on the SMO algorithm, see John Platt's book chapter.
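As a companion to the November 7 lecture, here is a minimal sketch of AdaBoost using brute-force decision stumps as the weak learner. The stump representation and function names are my own illustration, not part of the course materials.

    import math

    def train_adaboost(X, y, rounds=20):
        """Sketch of AdaBoost with decision stumps as weak learners.

        X: list of feature vectors (lists of numbers); y: labels in {-1, +1}.
        Returns a list of (alpha, stump) pairs defining the final classifier.
        """
        m = len(X)
        D = [1.0 / m] * m                               # distribution over training examples
        ensemble = []
        for _ in range(rounds):
            stump = best_stump(X, y, D)                 # weak learner: minimize weighted error
            preds = [stump_predict(stump, x) for x in X]
            err = sum(d for d, p, yi in zip(D, preds, y) if p != yi)
            err = min(max(err, 1e-10), 1.0 - 1e-10)     # guard against division by zero
            alpha = 0.5 * math.log((1.0 - err) / err)
            ensemble.append((alpha, stump))
            # reweight: mistakes get heavier, correctly classified examples get lighter
            D = [d * math.exp(-alpha * yi * p) for d, p, yi in zip(D, preds, y)]
            z = sum(D)
            D = [d / z for d in D]
        return ensemble

    def stump_predict(stump, x):
        feature, threshold, sign = stump
        return sign if x[feature] > threshold else -sign

    def best_stump(X, y, D):
        """Exhaustive search over (feature, threshold, sign) for the lowest weighted error."""
        best, best_err = None, float("inf")
        for feature in range(len(X[0])):
            for threshold in sorted(set(x[feature] for x in X)):
                for sign in (1, -1):
                    stump = (feature, threshold, sign)
                    err = sum(d for d, x, yi in zip(D, X, y)
                              if stump_predict(stump, x) != yi)
                    if err < best_err:
                        best, best_err = stump, err
        return best

    def adaboost_predict(ensemble, x):
        """Sign of the weighted vote of the stumps."""
        score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
        return 1 if score >= 0 else -1

The key steps to notice are the weighted-error computation, the choice of alpha, and the multiplicative reweighting of the examples; these are exactly the quantities that appear in the training-error bound from lecture.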
Part 4: Wrapping Things Up
- Monday, November 21 (lecture notes)
Finishing up kernels. Course wrap-up and take-away messages for theorists and practitioners.
- Wednesday, November 23
NO LECTURE!! Work on your final projects!
- Monday, November 28 (peer review form)
Project presentations.
- Wednesday, November 30 (peer review form)
Project presentations.
Final projects will be due within a few days of the last day of class.
Additional Reading
In addition to the lecture notes that will be provided, students may find the following textbooks useful for exploring some of the topics that we cover in more depth. In particular, they provide excellent coverage of the PAC learning model and the material we will cover during the first segment of the course:
- An Introduction to Computational Learning Theory by Michael Kearns and Umesh Vazirani
- Computational Learning Theory by Martin Anthony and Norman Biggs
Several copies of the Kearns and Vazirani book will be on hold at the Science and Engineering Library. Please take advantage of these.
You might also find it helpful to look through lecture notes and slides from similar courses that have been offered at other universities such as Avrim Blum's course at CMU or Rob Schapire's course at Princeton. Links to specific notes from other courses that are especially relevant to particular lectures are included in the schedule above.
Piazza Message Board
As an experiment, we will try using a Piazza message board for class discussions this quarter. UCLA students can sign up by following the instructions here. Using Piazza is recommended, but not required.
Good uses of Piazza include:
- Asking for clarifications about material that was covered in class or in the reading.
- Sharing useful references that you've found with other students (as long as they do not contain homework solutions).
- Looking for project partners or other students with whom to discuss assignments.
- Answering questions posted by other students to solidify your own understanding of the material.
- Discussing the course material and learning theory in general.
The course Academic Honesty Policy must be followed on the message boards at all times. Do not post or request homework solutions! And please be polite.
Homework Assignments
Homework assignments will be made available here:
- Homework 1: Due Monday, October 10
- Homework 2: Due Monday, October 24
- Homework 3: Due Monday, November 7
- Homework 4: Due Monday, November 21
(Corrections: On problem 1, you should use a definition of PAC learnability for a particular distribution D that is based on the modified definition of PAC learnability from lecture 3, i.e., that requires the number of samples to be polynomial in n. And there are 2^{2^n} functions in the class C.)
All homework solutions must be typed. You are strongly encouraged (though not required) to use LaTeX for your assignments. LaTeX makes it simple to typeset mathematical equations, and is extremely useful for grad students to know. Most of the academic papers you read were written with LaTeX, and probably most of the textbooks too.
If you are new to LaTeX, many good tutorials can be found online. A good reference for many common symbols is available here, or try Detexify, a cool machine learning tool that automatically finds the symbols you need. If you are a Mac user and would like to install LaTeX on your own machine, I highly recommend the bundled version of TeXShop and MacTeX available here. Windows users can try TeXnicCenter or MiKTeX. LaTeX is also installed on department-run linux machines.
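If you are starting from scratch, a bare-bones LaTeX file for a homework write-up might look something like the following; the section title, collaborator placeholder, and problem text are just placeholders, not a required template.

    \documentclass{article}
    \usepackage{amsmath,amssymb}

    \begin{document}

    \section*{Homework 1 \quad Collaborators: (list names here)}

    \paragraph{Problem 1.}
    For any fixed hypothesis $h$, Hoeffding's inequality gives
    \[
      \Pr\Big[ \big| \widehat{\mathrm{err}}(h) - \mathrm{err}(h) \big| \ge \epsilon \Big]
        \le 2 e^{-2 m \epsilon^2} .
    \]

    \end{document}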
Academic Honesty Policy
Collaboration on the homework assignments is encouraged! Discussing the problems with others can help you learn. Students are free to discuss the homework problems with anyone in the class under the following conditions:
- Each student must write down his or her solutions independently, and must understand the solutions he or she writes down. Talking over solutions is fine, but reading or copying another student's answers is not acceptable!
- Each student must write a list of all of his or her collaborators at the top of each assignment. This list should include anyone with whom the assignment was discussed.
These policies are described in more detail in the Academic Honesty Policy that must be signed by every student in the class. The Dean of Students also has a guide to Academic Integrity.
Final Projects
A full description of the final project is given in these final project guidelines. The guidelines provide details on the types of projects you can pursue, the grading criteria for each project type, and all key dates and milestones. Be sure to read the guidelines carefully and make a note of the deadlines!
Below are some suggestions for topics that could be explored in more detail for the final project, along with a couple of examples of relevant papers for each topic. All of these would make good literature synthesis topics, though you should be prepared to find additional relevant papers on your own! (Google Scholar is a good place to start.) This list might also inspire ideas for research projects. Of course, you are welcome to choose a topic that is not on the list.
Active Learning:
- Dasgupta. Two Faces of Active Learning. To appear in Theoretical Computer Science, 2011.
- Balcan, Beygelzimer, and Langford. Agnostic Active Learning. Journal of Computer and System Sciences, 2009.
- Dasgupta. Coarse Sample Complexity Bounds for Active Learning. NIPS, 2005.
- Dasgupta, Kalai, and Monteleoni. Analysis of Perceptron-Based Active Learning. COLT, 2005.
The Multi-Armed Bandit Problem:
- Abernethy, Hazan, and Rakhlin. Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization. COLT, 2008.
- Even-Dar, Mannor, and Mansour. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. JMLR, 2006.
- Auer, Cesa-Bianchi, Freund, and Schapire. The nonstochastic multiarmed bandit problem. SICOMP, 2002.
- Auer, Cesa-Bianchi, and Fischer. Finite time analysis of the multiarmed bandit problem. MLJ, 2002.
Domain Adaptation and/or Multi-Source Learning:
- Ben-David, Blitzer, Crammer, Kulesza, Pereira, and Vaughan. A Theory of Learning from Different Domains. MLJ, 2010.
- Mansour, Mohri, and Rostamizadeh. Domain adaptation: Learning bounds and algorithms. COLT, 2009.
- Crammer, Kearns, and Wortman. Learning from Multiple Sources. JMLR, 2008.
Privacy-Preserving Machine Learning:
- Chaudhuri, Monteleoni, Sarwate. Differentially Private Empirical Risk Minimization. JMLR, 2010.
- Kasiviswanathan, Lee, Nissim, Raskhodnikova, and Smith. What Can We Learn Privately? SICOMP, 2011.
- Blum, Ligett, and Roth. A Learning Theory Approach to Non-Interactive Database Privacy. STOC, 2008.
Relationships Between Markets and Learning:
- Chen and Vaughan. A New Understanding of Prediction Markets Via No-Regret Learning. ACM EC, 2010.
- Storkey. Machine Learning Markets. AISTATS, 2011.
- Kutty and Sami. A Prediction Market Approach to Learning with Sequential Advice. NIPS Workshop on Computational Social Science and the Wisdom of Crowds, 2010.
Learning from a Crowd:
- Dekel and Shamir. Good Learners for Evil Teachers. ICML, 2009.
- Dekel and Shamir. Vox Populi: Collecting High-Quality Labels from a Crowd. COLT, 2009.
Learning Bounds for Reinforcement Learning:
- Li, Littman, and Walsh. Knows What It Knows: A Framework for Self-Aware Learning. ICML, 2008.
- Kearns and Singh. Near-Optimal Reinforcement Learning in Polynomial Time. MLJ, 2002.
Online Convex Optimization:
- Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Hazan. The convex optimization approach to regret minimization. Draft book chapter, 2011.
- Kalai and Vempala. Efficient Algorithms for On-line Optimization. Journal of Computer and System Sciences, 2005.
Machine Learning for Finance:
- Hazan and Kale. On Stochastic and Worst-case Models for Investing. NIPS, 2009.
- Blum and Kalai. Universal Portfolios With and Without Transaction Costs. MLJ, 1999.
- Helmbold, Schapire, Singer, and Warmuth. On-line portfolio selection using multiplicative updates. Mathematical Finance, 1998.
Clustering:
- Balcan, Blum, and Gupta. Approximate Clustering without the Approximation. SODA, 2009.
- Balcan, Blum, and Vempala. A Discriminative Framework for Clustering via Similarity Functions. STOC, 2008.
PAC-Bayes Bounds:
- McAllester. Some PAC-Bayes Theorems. MLJ, 1999.
- McAllester. PAC-Bayesian Stochastic Model Selection. MLJ, 2003.
Ranking:
- Agarwal and Niyogi. Generalization bounds for ranking algorithms via algorithmic stability. JMLR, 2009.
- Rudin and Schapire. Margin-Based Ranking and an Equivalence Between AdaBoost and RankBoost. JMLR, 2009.
- Freund, Iyer, Schapire, and Singer. An efficient boosting algorithm for combining preferences. JMLR, 2003.
If you would like to suggest your own topic, try flipping through some recent publications from COLT (2011, 2010, 2009) to get a sense of the latest research in learning theory. You might also find a bit of theoretical work and some interesting applications that inspire you in the proceedings of NIPS (2011, 2010), ICML (2011, 2010), or UAI (2011, 2010).