Seminar Course: The unreasonable effectiveness of overparameterized machine learning models (3 hp)
September 2021
Figure: Double descent when modeling dynamic data. Mean squared error (MSE) on the training and test datasets as a function of the ratio between the number of parameters and the number of data points in the training set.
Description
Over the last decade, machine learning underwent many breakthroughs, usually led by empirical improvements that allowed very large models to attain state-of-the-art performance on tasks ranging from image classification to playing online games. These models, however, often have millions or even billions of parameters. It has been shown not only that they have enough capacity to almost perfectly fit the data they are trained on, but also that their performance usually improves as their size increases.
These observations seem at odds with one of the basic tenets of machine learning and statistics. Traditionally, when you build a model from data, if the model has too much capacity and can fit the data points too well, it might learn spurious random effects on top of the actual relations present in the data. In that case, the model might generalize poorly to unseen data points, even though it performs well on the data it has seen during its construction. On the other hand, if it has too little capacity, it will fail to capture the basic behavior it needs in order to properly describe the observed phenomenon (or perform a given task). Hence, the optimal model should lie somewhere between these two extremes. This concept is known as the bias-variance tradeoff and is a central idea in predictive modeling, statistics, and machine learning. It can even be thought of as a domain-adapted version of Occam's razor and the maxim that “everything should be made as simple as possible, but no simpler”.
In recent developments, these apparently conflicting observations are starting to be reconciled. The study of model generalization in the overparametrized regime –– i.e., when the model has enough capacity to perfectly (or almost perfectly) fit the training data –– is an active research topic that has produced interesting results. It has been observed that, as we increase the model flexibility, it is possible to reach a point where the training error is zero. At this point, the model does not generalize well and performs poorly on an unseen test dataset. However, if we continue to increase the model complexity beyond this point, the model can eventually start to generalize effectively again.
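As a rough illustration of this second descent, the minimal sketch below (not part of the course material; the toy problem, the random Fourier features, and all constants are illustrative choices) fits minimum-norm least-squares models of increasing width to a small 1-D regression problem and prints the training and test MSE. The test error typically peaks when the number of features is close to the number of training points and decreases again beyond it; the exact shape depends on the noise level and the random seed.

```python
# Minimal double-descent sketch with random Fourier features and
# minimum-norm least squares (illustrative, not from the course material).
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.1):
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(2 * np.pi * x) + noise * rng.standard_normal(n)
    return x, y

def random_features(x, omegas, phases):
    # Random Fourier features: one cosine feature per column.
    return np.cos(np.outer(x, omegas) + phases)

x_train, y_train = make_data(40)
x_test, y_test = make_data(1000)

for p in [5, 10, 20, 40, 80, 160, 320]:
    omegas = rng.normal(scale=5.0, size=p)
    phases = rng.uniform(0, 2 * np.pi, size=p)
    Phi_train = random_features(x_train, omegas, phases)
    Phi_test = random_features(x_test, omegas, phases)
    # Minimum-norm least-squares solution; once p exceeds the number of
    # training points this is the "ridgeless" interpolator.
    w = np.linalg.pinv(Phi_train) @ y_train
    mse_train = np.mean((Phi_train @ w - y_train) ** 2)
    mse_test = np.mean((Phi_test @ w - y_test) ** 2)
    print(f"p={p:4d}  train MSE={mse_train:.4f}  test MSE={mse_test:.4f}")
```

Plotting the test MSE against p/n for a few random seeds gives a curve of the same qualitative shape as the figure above.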
The purpose of this course is to walk the student through the latest theoretical and empirical developments on this topic via a series of key papers.
Outline
The seminar course will cover 5 papers which will be discussed in detail.
a. Empirical Studies (2 papers): The course will cover experimental papers that observe the double-descent phenomenon and study the generalization properties of overparametrized models. The students will be asked to reproduce some of the results. Tentative list:
- [0] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” International Conference on Learning Representations (ICLR), 2017.
- [1] M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine-learning practice and the classical bias–variance trade-off,” Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15849–15854, Aug. 2019, doi: 10.1073/pnas.1903070116.
b. Overparametrization in Linear Regression (2 papers): We start by considering perhaps the simplest setting where the second descent in risk has been observed: linear regression. The purpose of using a simplified setting is twofold: first, to make it amenable to theoretical analysis; and, second, to make it possible to isolate the role of overparametrization. We study papers that take different approaches and use different sets of tools to deal with the same question (a small illustration of minimum-norm interpolation follows the list below). Tentative list:
- [2] T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani, “Surprises in High-Dimensional Ridgeless Least Squares Interpolation,” arXiv:1903.08560, Nov. 2019, Accessed: Jul. 23, 2020. [Online]. Available: http://arxiv.org/abs/1903.08560.
- [3] P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler, “Benign overfitting in linear regression,” Proceedings of the National Academy of Sciences, vol. 117, no. 48, pp. 30063–30070, Apr. 2020, doi: 10.1073/pnas.1907378117.
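The small sketch below (my own toy example, not taken from [2] or [3]) shows the object these papers study: in the overparametrized regime (p > n) least squares has many interpolating solutions, the pseudoinverse picks the minimum-ℓ2-norm one, and this coincides with the ridge solution as the regularization strength goes to zero ("ridgeless" least squares).

```python
# Minimum-norm ("ridgeless") least-squares interpolation in the p > n regime
# (illustrative sketch; dimensions and noise level are arbitrary choices).
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 100                       # fewer data points than parameters
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Minimum-norm least-squares solution via the pseudoinverse.
beta_min_norm = np.linalg.pinv(X) @ y

# Ridge solution with a tiny lambda approaches the same point.
lam = 1e-6
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Training residual is ~0 (the model interpolates the data), and the gap to
# the ridge solution shrinks as lambda goes to zero.
print("training residual:", np.linalg.norm(X @ beta_min_norm - y))
print("ridge vs min-norm:", np.linalg.norm(beta_ridge - beta_min_norm))
```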
c. Connections with other topics (2 papers): Finally, we study papers that, while not directly associated with double descent, help establish connections between the phenomenon and other ideas in machine learning (a rough illustration of the ensemble idea in [5] follows the list below). Tentative list:
- [4] A. Jacot, F. Gabriel, and C. Hongler, “Neural Tangent Kernel: Convergence and Generalization in Neural Networks,” Advances in Neural Information Processing Systems 31, 2018, Accessed: Jul. 27, 2020. [Online]. Available: http://arxiv.org/abs/1806.07572.
- [5] D. LeJeune, H. Javadi, and R. Baraniuk, “The implicit regularization of ordinary least squares ensembles,” in Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020, vol. 108, pp. 3525–3535, [Online]. Available: http://proceedings.mlr.press/v108/lejeune20b.html.
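As a rough, hypothetical illustration in the spirit of [5] (not their exact setup or their theoretical result), the sketch below averages ordinary-least-squares fits, each using only a random subset of the features, and then looks for the ridge-regularized fit that is closest to the ensemble average.

```python
# Averaging OLS fits on random feature subsets and comparing the result
# to ridge regression (rough illustration; all sizes are arbitrary choices).
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 200, 50, 20                # k features drawn per ensemble member
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

def ols_subset_ensemble(X, y, k, n_members=500):
    p = X.shape[1]
    beta_avg = np.zeros(p)
    for _ in range(n_members):
        idx = rng.choice(p, size=k, replace=False)
        # OLS on the selected features only; zero coefficients elsewhere.
        coef, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        member = np.zeros(p)
        member[idx] = coef
        beta_avg += member
    return beta_avg / n_members

beta_ens = ols_subset_ensemble(X, y, k)

# Distance from the ensemble average to ridge solutions over a lambda grid.
lam_grid = np.logspace(-2, 3, 50)
dists = [np.linalg.norm(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y) - beta_ens)
         for lam in lam_grid]
print("closest ridge lambda:", lam_grid[int(np.argmin(dists))])
```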
Course schedule
There will be 5 sessions. Each session will consist of a quick presentation of the paper by some of the students (~15 min), a discussion (~1 hour), followed by an introduction (~15 min) to the next paper, outlining some of its key points.
| Sem. | Date | Place on Campus** | Under discussion |
| --- | --- | --- | --- |
| S1 | 2021-09-09 @ 10:15-12:00 | Room 1245 ITC | Paper [1] |
| S2 | 2021-09-23 @ 10:15-12:00 | Room 1245 ITC | Paper [2]* |
| S3 | 2021-10-07 @ 10:15-12:00 | Room 1245 ITC | Paper [3] |
| S4 | 2021-10-21 @ 10:15-12:00 | Room 2345 ITC | Paper [4] |
| S5 | 2021-11-04 @ 10:15-12:00 | Room 2345 ITC | Paper [5] |
- * We will only cover sections 1, 2, and 3 of paper [2] in S2.
- ** Participation via Zoom is also allowed.
Attendance
Since interaction and discussion are part of the course, attendance is required to get the credits. You may miss one session without justification. If for some reason you think you will need to miss more than one session, please get in touch with Antonio Ribeiro to find a solution.
Format
The initial plan is to run the course in a mixed format: sessions will be held on campus, while also allowing simultaneous remote participation via Zoom.
Contingency plan
In case of changes in the social distancing guidelines, the contingency plan is to move the course to a fully remote format and hold the sessions on Zoom.
Prerequisites
Linear algebra and basic probability theory.
Examination
- Paper presentation
- Homework
- Student participation
Registration
Registration for students who wish to take the credits is now closed. Students and researchers interested in participating in the sessions are still welcome to do so; in this case, please contact Antonio Ribeiro.
Note: All students who sent an email before August 30 have been registered for the course. Information will be sent by email in the coming days.