Statistical learning theory


Given some amount of noisy data, how complex a model can I fit before it fails to generalise to new data?
If I could answer this question a priori, I could fit a complex model governed by some messy complexity hyperparameter and choose that hyperparameter without doing boring cross-validation.

Key notions: Rademacher complexity, Gaussian complexity, Vapnik–Chervonenkis (VC) dimension.
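Rademacher complexity, at least, is easy to estimate empirically for simple classes. A minimal sketch (all names mine): for the norm-bounded linear class {x ↦ ⟨w, x⟩ : ‖w‖₂ ≤ B}, the supremum over w has a closed form for each Rademacher sign vector, so only the outer expectation over signs needs Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher_linear(X, B, n_sigma=2000):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w||_2 <= B} on the sample X (rows are data points).

    For a fixed sign vector sigma, sup_w (1/n) sum_i sigma_i <w, x_i>
    equals (B/n) * ||sum_i sigma_i x_i||_2 (Cauchy-Schwarz), so each
    draw of sigma gives the supremum exactly.
    """
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_sigma, n))
    sups = B / n * np.linalg.norm(sigma @ X, axis=1)
    return sups.mean()

X = rng.normal(size=(200, 5))
est = empirical_rademacher_linear(X, B=1.0)
# Jensen gives the textbook upper bound (B/n) * sqrt(sum_i ||x_i||^2).
bound = 1.0 / 200 * np.sqrt((X ** 2).sum())
print(est, bound)
```

The estimate should sit just below the Jensen bound, and both shrink like 1/√n, which is the usual generalisation-bound rate.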

Machine learning people always talk about this in terms of classification, which is the setting the VC dimension covers.

I don’t care about classification problems in particular;
moreover, the VC dimension seems to be analytically tractable only for limited classes of functions.
Perhaps I can save time by going straight to Rademacher/Gaussian-style complexities and learn something about regression losses?
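For intuition on why VC analysis is fiddly: the definition is combinatorial, so even for toy classes you end up enumerating labellings. A brute-force sketch (names mine) for 1-D threshold classifiers h_t(x) = 1[x ≥ t], whose VC dimension is 1: a single point is shattered, but no pair is, because the labelling (1, 0) on ordered points is unrealisable.

```python
def shatters(points, hypotheses):
    """True iff the hypothesis set realises every binary labelling of `points`."""
    labellings = {tuple(h(x) for x in points) for h in hypotheses}
    return len(labellings) == 2 ** len(points)

def threshold_hypotheses(points):
    """Thresholds h_t(x) = 1[x >= t]; on a finite sample only cuts
    between (and beyond) the points produce distinct labellings."""
    cuts = sorted(points)
    thresholds = [cuts[0] - 1.0] + [c + 0.5 for c in cuts]
    return [lambda x, t=t: int(x >= t) for t in thresholds]

points1 = [0.0]
points2 = [0.0, 1.0]
print(shatters(points1, threshold_hypotheses(points1)))  # True
print(shatters(points2, threshold_hypotheses(points2)))  # False: (1, 0) unreachable
```

This works only because the class is trivial; for anything realistic the enumeration explodes, which is the practical complaint above.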

Modern results seem to avoid a lot of this by appealing to matrix concentration inequalities.
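For a concrete instance of that style of argument, here is a sketch (my own illustration, using Tropp-style matrix Bernstein constants) comparing the expectation bound E‖Σᵢ εᵢAᵢ‖ ≤ √(2v log(d₁+d₂)) + L log(d₁+d₂)/3 against the observed operator norm of a random signed matrix sum.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 50
A = rng.normal(size=(n, d, d)) / np.sqrt(d)   # fixed summand matrices

# Ingredients of the matrix Bernstein expectation bound for Z = sum_i eps_i A_i:
L = max(np.linalg.norm(Ai, 2) for Ai in A)    # uniform operator-norm bound on summands
v = max(np.linalg.norm(sum(Ai @ Ai.T for Ai in A), 2),
        np.linalg.norm(sum(Ai.T @ Ai for Ai in A), 2))   # matrix variance proxy
bernstein = np.sqrt(2 * v * np.log(2 * d)) + L * np.log(2 * d) / 3

# Monte Carlo estimate of E ||sum_i eps_i A_i|| with Rademacher signs eps_i.
trials = [np.linalg.norm((rng.choice([-1, 1], n)[:, None, None] * A).sum(0), 2)
          for _ in range(200)]
print(np.mean(trials), bernstein)   # empirical mean norm vs. Bernstein bound
```

The empirical mean sits comfortably under the bound; the appeal of these inequalities is that the bound needs only summand-level quantities (L and v), not the distribution of the whole sum.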

Percy Liang’s notes: CS229T/STAT231: Statistical Learning Theory (Winter 2014).

See also
function approximation for a different kind of approximation error,
information criteria for one way to control model complexity post hoc, and model selection for the statisticians’ approach to this problem in general.


Bartlett, P. L., & Mendelson, S. (2002) Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research, 3(Nov), 463–482.
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989) Learnability and the Vapnik-Chervonenkis Dimension. J. ACM, 36(4), 929–965. DOI.
Bousquet, O., Boucheron, S., & Lugosi, G. (2004) Introduction to Statistical Learning Theory. In O. Bousquet, U. von Luxburg, & G. Rätsch (Eds.), Advanced Lectures on Machine Learning (pp. 169–207). Springer Berlin Heidelberg
D’Souza, A. A. (2004) Towards Tractable Parameter-free Statistical Learning. University of Southern California, Los Angeles, CA, USA
Gnecco, G., & Sanguineti, M. (2008) Approximation Error Bounds via Rademacher’s Complexity. Applied Mathematical Sciences, 2(4), 153–176.
Krishnapuram, B., Carin, L., Figueiredo, M. A. T., & Hartemink, A. J. (2005) Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds. IEEE Trans. Pattern Anal. Mach. Intell., 27(6), 957–968. DOI.
Liang, P. (n.d.) CS229T/STAT231: Statistical Learning Theory (Winter 2014).
Natarajan, B. K. (1989) On learning sets and functions. Machine Learning, 4(1), 67–97. DOI.
Schölkopf, B., & Smola, A. J. (2003) A Short Introduction to Learning with Kernels. In S. Mendelson & A. J. Smola (Eds.), Advanced Lectures on Machine Learning (pp. 41–64). Springer Berlin Heidelberg
Vapnik, V. (2010) The Nature of Statistical Learning Theory. (Softcover reprint of hardcover 2nd ed. 2000.). Springer
Vapnik, V., & Chervonenkis, A. (1971) On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Probability & Its Applications, 16(2), 264–280. DOI.
Vapnik, V., Levin, E., & Cun, Y. L. (1994) Measuring the VC-Dimension of a Learning Machine. Neural Computation, 6(5), 851–876. DOI.
von Luxburg, U., & Schölkopf, B. (2008) Statistical Learning Theory: Models, Concepts, and Results. arXiv:0810.4752 [Math, Stat].

See original: The Living Thing / Notebooks, “Statistical learning theory”.