Statistical learning theory

Given some amount of noisy data, how complex a model can I fit before it fails to generalise to new data?
If I could answer this question a priori, I could fit a complex model governed by some messy hyperparameter and choose that hyperparameter without tedious cross-validation.

Rademacher complexity, Gaussian complexity, Vapnik-Chervonenkis dimension.
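
For reference, the basic objects these quantities control (standard definitions in my own paraphrase; see e.g. BoBL04 or BaMe02) are the population and empirical risks, plus the empirical Rademacher complexity of a function class:

```latex
% Population risk and empirical risk of a predictor f under a loss \ell:
R(f) = \mathbb{E}\,\ell(f(X), Y), \qquad
\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i).

% Empirical Rademacher complexity of a class \mathcal{F} on the sample
% S = (x_1, \dots, x_n), with iid random signs \sigma_i \in \{-1, +1\}:
\hat{\mathfrak{R}}_S(\mathcal{F})
  = \mathbb{E}_{\sigma}\!\left[\, \sup_{f \in \mathcal{F}}
      \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(x_i) \right].

% One standard risk bound (constants differ between references): for a loss
% taking values in [0, 1], with probability at least 1 - \delta,
% simultaneously for every f \in \mathcal{F},
R(f) \le \hat{R}_n(f)
       + 2\,\hat{\mathfrak{R}}_S(\ell \circ \mathcal{F})
       + 3\sqrt{\frac{\log(2/\delta)}{2n}}.
```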

Machine learning people usually discuss this in terms of classification, which is the setting VC dimension covers.

I don’t care about classification problems in general;
moreover, VC dimension seems to be analytically tractable only for limited function classes.
Perhaps I can save time by going straight to Rademacher/Gaussian-style complexities and learn something about regression losses?
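
As a toy illustration of that plan (my own sketch, not code from any of the references below): for norm-bounded linear predictors the supremum inside the Rademacher average has a closed form, so Monte Carlo estimation takes a few lines of numpy, and the estimate can be checked against the textbook upper bound that scales like 1/sqrt(n):

```python
import numpy as np

def empirical_rademacher_linear(X, B=1.0, n_mc=2000, rng=None):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w||_2 <= B} on the sample X of shape (n, d).

    For this class the supremum over w has a closed form:
        sup_{||w|| <= B} (1/n) sum_i sigma_i <w, x_i>
            = (B/n) * ||sum_i sigma_i x_i||_2,
    so we only need to average that norm over random sign vectors sigma.
    """
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))   # n_mc draws of sign vectors
    sups = B / n * np.linalg.norm(sigma @ X, axis=1)  # closed-form supremum per draw
    return sups.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, B = 200, 10, 1.0
    X = rng.normal(size=(n, d))

    est = empirical_rademacher_linear(X, B=B, rng=1)
    # Textbook upper bound: (B/n) * sqrt(sum_i ||x_i||^2), i.e. O(1/sqrt(n)).
    bound = B / n * np.sqrt((X ** 2).sum())
    print(f"MC estimate: {est:.4f}   analytic bound: {bound:.4f}")
```

The closed form and the bound follow from Cauchy–Schwarz and Jensen respectively; the point is only that for simple regression-style classes the complexity term is directly computable, with no cross-validation in sight.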

Modern results seem to avoid a lot of this by appealing to matrix concentration inequalities.
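
For the record, the workhorse in those arguments is typically a matrix Bernstein inequality in the style of Tropp; one common statement (quoted from memory, so verify the constants before relying on it):

```latex
% Matrix Bernstein (Tropp-style; constants from memory, check the original).
% Let X_1, \dots, X_k be independent, mean-zero, self-adjoint d \times d random
% matrices with \lambda_{\max}(X_i) \le L almost surely, and set
% \sigma^2 = \bigl\| \sum_i \mathbb{E}[X_i^2] \bigr\|. Then for all t \ge 0,
\Pr\!\left[ \lambda_{\max}\!\Bigl( \sum_{i=1}^{k} X_i \Bigr) \ge t \right]
  \le d \cdot \exp\!\left( \frac{-t^2/2}{\sigma^2 + L t / 3} \right).
```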

Percy Liang’s notes: CS229T/STAT231: Statistical Learning Theory (Winter 2014).

See also
function approximation for a different kind of approximation error,
information criteria for one way to control model complexity post hoc, and model selection for the statisticians’ approach to this problem in general.

Refs

BaMe02
Bartlett, P. L., & Mendelson, S. (2002) Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research, 3(Nov), 463–482.
BEHW89
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989) Learnability and the Vapnik-Chervonenkis Dimension. J. ACM, 36(4), 929–965. DOI.
BoBL04
Bousquet, O., Boucheron, S., & Lugosi, G. (2004) Introduction to Statistical Learning Theory. In O. Bousquet, U. von Luxburg, & G. Rätsch (Eds.), Advanced Lectures on Machine Learning (pp. 169–207). Springer Berlin Heidelberg
Dsou04
D’Souza, A. A. (2004) Towards Tractable Parameter-free Statistical Learning. University of Southern California, Los Angeles, CA, USA
GnSa08
Gnecco, G., & Sanguineti, M. (2008) Approximation Error Bounds via Rademacher’s Complexity. Applied Mathematical Sciences, 2(4), 153–176.
KCFH05
Krishnapuram, B., Carin, L., Figueiredo, M. A. T., & Hartemink, A. J. (2005) Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds. IEEE Trans. Pattern Anal. Mach. Intell., 27(6), 957–968. DOI.
Lian00
Liang, P. (n.d.) CS229T/STAT231: Statistical Learning Theory (Winter 2014).
Nata89
Natarajan, B. K. (1989) On learning sets and functions. Machine Learning, 4(1), 67–97. DOI.
ScSm03
Schölkopf, B., & Smola, A. J. (2003) A Short Introduction to Learning with Kernels. In S. Mendelson & A. J. Smola (Eds.), Advanced Lectures on Machine Learning (pp. 41–64). Springer Berlin Heidelberg
Vapn10
Vapnik, V. (2010) The Nature of Statistical Learning Theory (softcover reprint of hardcover 2nd ed. 2000). Springer
VaCh71
Vapnik, V., & Chervonenkis, A. (1971) On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Probability & Its Applications, 16(2), 264–280. DOI.
VaLC94
Vapnik, V., Levin, E., & Le Cun, Y. (1994) Measuring the VC-Dimension of a Learning Machine. Neural Computation, 6(5), 851–876. DOI.
LuSc08
von Luxburg, U., & Schölkopf, B. (2008) Statistical Learning Theory: Models, Concepts, and Results. arXiv:0810.4752 [Math, Stat].
