Penalised regression

Printer-friendly version

On regression estimation with penalties on the model.
Practically this means choosing appropriate smoothing to do good model selection, and possibly using some clever optimisation method.
Related to compressed sensing but here we consider sampling complexity,
the effect of measurement noise, and more general penalties than just \(\ell_1\).

See also matrix factorisations,
multiple testing,
concentration inequalities,
sparse flavoured icecream.

To discuss:

LARS, LASSO, Group LASSO, de-biassed LASSO, Elastic net, etc.

In nonparametric statistics we might estimate simultaneously what look like
many, many parameters, which we constrain in some clever fashion,
which usually boils down to something we can interpret as a “smoothing”
parameters, controlling how many parameters we still have to model
from a subset of the original.

The “regularisation” nomenclature claims descent from Tikhonov, (eg TiGl65 etc) who wanted to solve ill-conditioned integral and differential equations, so it’s slightly more general.
“Smoothing” seems to be common in the
spline and
kernel estimate communities of
Wahba (Wahb90) and Silverman (Silv82) et al,
who usually actually want to smooth curves.

Penalization” has a geneology unknown to me, but is probably the least abstruse for common usage.

These are, AFAICT, more or less the same thing.
“smoothing” is more common in my communities which is fine,
but we have to remember that “smoothing” an estimator might not always infer smooth dynamics in the estimand;
it could be something else being smoothed, such as variance in the estimate of parameters of a rough function.

In every case, you wish to solve an ill-conditioned inverse problem, so you tame it by adding a penalty to solutions you feel one should be reluctant to accept.

TODO: make comprehensible

TODO: examples

TODO: discuss connection with model selection

TODO: discuss connection with compressed sensing.

The real classic approach here is spline smoothing of functional data.
More recent approaches are things like sparse regression.

Debiassed LASSO

See GBRD14 and Geer14c.


I’m not going to mention LASSO in (generalised) linear regression,
since everything does that these days (Oh alright,
Jerome Friedman’s glmnet for R is the fastest,
and has a MATLAB version.

But SPAMS (C++, MATLAB, R, python) by Mairal himself, looks interesting.
It’s an optimisation library for many various in sparse problems.

Sparse Filtering in Theano.


Abramovich, F., Benjamini, Y., Donoho, D. L., & Johnstone, I. M.(2006) Adapting to unknown sparsity by controlling the false discovery rate. The Annals of Statistics, 34(2), 584–653. DOI.
Azizyan, M., Krishnamurthy, A., & Singh, A. (2015) Extreme Compressive Sampling for Covariance Estimation. arXiv:1506.00898 [Cs, Math, Stat].
Bach, F. (n.d.) Model-Consistent Sparse Estimation through the Bootstrap.
Bach, F., Jenatton, R., Mairal, J., & Obozinski, G. (2012) Optimization with Sparsity-Inducing Penalties. Found. Trends Mach. Learn., 4(1), 1–106. DOI.
Bahmani, S., & Romberg, J. (2014) Lifting for Blind Deconvolution in Random Mask Imaging: Identifiability and Convex Relaxation. arXiv:1501.00046 [Cs, Math, Stat].
Banerjee, A., Chen, S., Fazayeli, F., & Sivakumar, V. (2014) Estimation with Norm Regularization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27 (pp. 1556–1564). Curran Associates, Inc.
Barbier, J. (2015) Statistical physics and approximate message-passing algorithms for sparse linear estimation problems in signal processing and coding theory. arXiv:1511.01650 [Cs, Math].
Baron, D., Sarvotham, S., & Baraniuk, R. G.(2010) Bayesian compressive sensing via belief propagation. Signal Processing, IEEE Transactions on, 58(1), 269–280. DOI.
Battiti, R. (1992) First-and second-order methods for learning: between steepest descent and Newton’s method. Neural Computation, 4(2), 141–166. DOI.
Bayati, M., & Montanari, A. (2012) The LASSO Risk for Gaussian Matrices. IEEE Transactions on Information Theory, 58(4), 1997–2017. DOI.
Borgs, C., Chayes, J. T., Cohn, H., & Zhao, Y. (2014) An $L^p$ theory of sparse graph convergence I: limits, sparse random graph models, and power law distributions. arXiv:1401.2906 [Math].
Brunton, S. L., Proctor, J. L., & Kutz, J. N.(2016) Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15), 3932–3937. DOI.
Bühlmann, P., & Geer, S. van de. (2011) Additive models and many smooth univariate functions. In Statistics for High-Dimensional Data (pp. 77–97). Springer Berlin Heidelberg
Bühlmann, P., & van de Geer, S. (2015) High-dimensional inference in misspecified linear models. arXiv:1503.06426 [Stat], 9(1), 1449–1473. DOI.
Candès, E. J., & Fernandez-Granda, C. (2013) Super-Resolution from Noisy Data. Journal of Fourier Analysis and Applications, 19(6), 1229–1254. DOI.
Candès, E. J., & Plan, Y. (2010) Matrix Completion With Noise. Proceedings of the IEEE, 98(6), 925–936. DOI.
Candès, E. J., Romberg, J. K., & Tao, T. (2006) Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8), 1207–1223. DOI.
Carmi, A. Y.(2013) Compressive system identification: Sequential methods and entropy bounds. Digital Signal Processing, 23(3), 751–770. DOI.
Carmi, A. Y.(2014) Compressive System Identification. In A. Y. Carmi, L. Mihaylova, & S. J. Godsill (Eds.), Compressed Sensing & Sparse Filtering (pp. 281–324). Springer Berlin Heidelberg
Cevher, V., Duarte, M. F., Hegde, C., & Baraniuk, R. (2009) Sparse Signal Recovery Using Markov Random Fields. In Advances in Neural Information Processing Systems (pp. 257–264). Curran Associates, Inc.
Chen, M., Silva, J., Paisley, J., Wang, C., Dunson, D., & Carin, L. (2010) Compressive Sensing on Manifolds Using a Nonparametric Mixture of Factor Analyzers: Algorithm and Performance Bounds. IEEE Transactions on Signal Processing, 58(12), 6140–6155. DOI.
Chen, Y.-C., & Wang, Y.-X. (n.d.) Discussion on “Confidence Intervals and Hypothesis Testing for High-Dimensional Regression”.
Daneshmand, H., Gomez-Rodriguez, M., Song, L., & Schölkopf, B. (2014) Estimating Diffusion Network Structures: Recovery Conditions, Sample Complexity & Soft-thresholding Algorithm. In ICML.
Diaconis, P., & Freedman, D. (1984) Asymptotics of Graphical Projection Pursuit. The Annals of Statistics, 12(3), 793–815.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004) Least angle regression. The Annals of Statistics, 32(2), 407–499. DOI.
Ewald, K., & Schneider, U. (2015) Confidence Sets Based on the Lasso Estimator. arXiv:1507.05315 [Math, Stat].
Flynn, C. J., Hurvich, C. M., & Simonoff, J. S.(2013) Efficiency for Regularization Parameter Selection in Penalized Likelihood Estimation of Misspecified Models. arXiv:1302.2068 [Stat].
Ghazi, B., Hassanieh, H., Indyk, P., Katabi, D., Price, E., & Shi, L. (2013) Sample-optimal average-case sparse Fourier Transform in two dimensions. In 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton) (pp. 1258–1265). DOI.
Giryes, R., Sapiro, G., & Bronstein, A. M.(2014) On the Stability of Deep Networks. arXiv:1412.5896 [Cs, Math, Stat].
Gui, J., & Li, H. (2005) Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics, 21(13), 3001–3008. DOI.
Hallac, D., Leskovec, J., & Boyd, S. (2015) Network Lasso: Clustering and Optimization in Large Graphs. arXiv:1507.00280 [Cs, Math, Stat]. DOI.
Hansen, N. R., Reynaud-Bouret, P., & Rivoirard, V. (2015) Lasso and probabilistic inequalities for multivariate point processes. Bernoulli, 21(1), 83–143. DOI.
Hastie, T. J., Tibshirani, Rob, & Wainwright, M. J.(2015) Statistical Learning with Sparsity: The Lasso and Generalizations. . Boca Raton: Chapman and Hall/CRC
Hawe, S., Kleinsteuber, M., & Diepold, K. (2013) Analysis operator learning and its application to image reconstruction. IEEE Transactions on Image Processing, 22(6), 2138–2150. DOI.
He, D., Rish, I., & Parida, L. (2014) Transductive HSIC Lasso. In M. Zaki, Z. Obradovic, P. N. Tan, A. Banerjee, C. Kamath, & S. Parthasarathy (Eds.), Proceedings of the 2014 SIAM International Conference on Data Mining (pp. 154–162). Philadelphia, PA: Society for Industrial and Applied Mathematics
Hebiri, M., & Geer, S. van de. (2011) The Smooth-Lasso and other ℓ1+ℓ2-penalized methods. Electronic Journal of Statistics, 5, 1184–1226. DOI.
Hegde, C., Indyk, P., & Schmidt, L. (2015) A nearly-linear time framework for graph-structured sparsity. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (pp. 928–937).
Hesterberg, T., Choi, N. H., Meier, L., & Fraley, C. (2008) Least angle and ℓ1 penalized regression: A review. Statistics Surveys, 2, 61–93. DOI.
Hoerl, A. E., & Kennard, R. W.(1970) Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1), 55–67. DOI.
Hormati, A., Roy, O., Lu, Y. M., & Vetterli, M. (2010) Distributed Sampling of Signals Linked by Sparse Filtering: Theory and Applications. IEEE Transactions on Signal Processing, 58(3), 1095–1109. DOI.
Hu, T., Pehlevan, C., & Chklovskii, D. B.(2015) A Hebbian/Anti-Hebbian Network for Online Sparse Dictionary Learning Derived from Symmetric Matrix Factorization. arXiv:1503.00690 [Cs, Q-Bio, Stat].
Javanmard, A., & Montanari, A. (2014) Confidence Intervals and Hypothesis Testing for High-dimensional Regression. Journal of Machine Learning Research, 15(1), 2869–2909.
Kabán, A. (2014) New Bounds on Compressive Linear Least Squares Regression. (pp. 448–456). Presented at the Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics
Langford, J., Li, L., & Zhang, T. (2009) Sparse Online Learning via Truncated Gradient. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in Neural Information Processing Systems 21 (pp. 905–912). Curran Associates, Inc.
Lee, J. D., Sun, D. L., Sun, Y., & Taylor, J. E.(2013) Exact post-selection inference, with application to the lasso. arXiv:1311.6238 [Math, Stat].
Lockhart, R., Taylor, J., Tibshirani, R. J., & Tibshirani, R. (2014) A significance test for the lasso. The Annals of Statistics, 42(2), 413–468. DOI.
Meier, L., van de Geer, S., & Bühlmann, P. (2008) The group lasso for logistic regression. Group, 70(Part 1), 53–71.
Montanari, A. (2012) Graphical models concepts in compressed sensing. Compressed Sensing: Theory and Applications, 394–438.
Müller, P., & van de Geer, S. (2015) Censored linear model in high dimensions: Penalised linear regression on high-dimensional data with left-censored response variable. TEST. DOI.
Nam, S., & Gribonval, R. (2012) Physics-driven structured cosparse modeling for source localization. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5397–5400). DOI.
Needell, D., & Tropp, J. A.(2008) CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. arXiv:0803.2392 [Cs, Math].
Ngiam, J., Chen, Z., Bhaskar, S. A., Koh, P. W., & Ng, A. Y.(2011) Sparse Filtering. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 24 (pp. 1125–1133). Curran Associates, Inc.
Nickl, R., & van de Geer, S. (2013) Confidence sets in sparse regression. The Annals of Statistics, 41(6), 2852–2876. DOI.
Peleg, T., Eldar, Y. C., & Elad, M. (2010) Exploiting Statistical Dependencies in Sparse Representations for Signal Recovery. IEEE Transactions on Signal Processing, 60(5), 2286–2303. DOI.
Peng, Z., Gurram, P., Kwon, H., & Yin, W. (2015) Optimal Sparse Kernel Learning for Hyperspectral Anomaly Detection. arXiv:1506.02585 [Cs].
Pouget-Abadie, J., & Horel, T. (2015) Inferring Graphs from Cascades: A Sparse Recovery Framework. In Proceedings of The 32nd International Conference on Machine Learning.
Rahimi, A., & Recht, B. (2009) Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning. In Advances in neural information processing systems (pp. 1313–1320). Curran Associates, Inc.
Ravishankar, S., & Bresler, Y. (2015a) Efficient Blind Compressed Sensing Using Sparsifying Transforms with Convergence Guarantees and Application to MRI. arXiv:1501.02923 [Cs, Stat].
Ravishankar, S., & Bresler, Y. (2015b) Sparsifying Transform Learning With Efficient Optimal Updates and Convergence Guarantees. IEEE Transactions on Signal Processing, 63(9), 2389–2404. DOI.
Rish, I., & Grabarnik, G. (2014) Sparse Signal Recovery with Exponential-Family Noise. In A. Y. Carmi, L. Mihaylova, & S. J. Godsill (Eds.), Compressed Sensing & Sparse Filtering (pp. 77–93). Springer Berlin Heidelberg
Rish, I., & Grabarnik, G. Y.(2015) Sparse modeling: theory, algorithms, and applications. . Boca Raton, FL: CRC Press, Taylor & Francis Group
Schelldorfer, J., Bühlmann, P., & De Geer, S. V.(2011) Estimation for High-Dimensional Linear Mixed-Effects Models Using ℓ1-Penalization. Scandinavian Journal of Statistics, 38(2), 197–214. DOI.
Silverman, B. W.(1982) On the Estimation of a Probability Density Function by the Maximum Penalized Likelihood Method. The Annals of Statistics, 10(3), 795–810. DOI.
Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2011) Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent. Journal of Statistical Software, 39(5).
Smith, V., Forte, S., Jordan, M. I., & Jaggi, M. (2015) L1-Regularized Distributed Optimization: A Communication-Efficient Primal-Dual Framework. arXiv:1512.04011 [Cs].
Stine, R. A.(2004) Discussion of “Least angle regression” by Efron et al. The Annals of Statistics, 32(2), 407–499.
Su, W., Bogdan, M., & Candès, E. J.(2015) False Discoveries Occur Early on the Lasso Path. arXiv:1511.01957 [Cs, Math, Stat].
Taddy, M. (2013) One-step estimator paths for concave regularization. arXiv:1308.5623 [Stat].
Thrampoulidis, C., Abbasi, E., & Hassibi, B. (2015) The LASSO with Non-linear Measurements is Equivalent to One With Linear Measurements. arXiv:1506.02181 [Cs, Math, Stat].
Tibshirani, R. (1996) Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288.
Tibshirani, R. J.(2014) A General Framework for Fast Stagewise Algorithms. arXiv:1408.5801 [Stat].
Tikhonov, A. N., & Glasko, V. B.(1965) Use of the regularization method in non-linear problems. USSR Computational Mathematics and Mathematical Physics, 5(3), 93–107. DOI.
Tropp, J. A., & Wright, S. J.(2010) Computational Methods for Sparse Solution of Linear Inverse Problems. Proceedings of the IEEE, 98(6), 948–958. DOI.
Uematsu, Y. (2015) Penalized Likelihood Estimation in High-Dimensional Time Series Models and its Application. arXiv:1504.06706 [Math, Stat].
van de Geer, S. (2007) The deterministic Lasso.
van de Geer, S. (2014a) Statistical Theory for High-Dimensional Models. arXiv:1409.8557 [Math, Stat].
van de Geer, S. (2014b) Weakly decomposable regularization penalties and structured sparsity. Scandinavian Journal of Statistics, 41(1), 72–86. DOI.
van de Geer, S. (2014c) Worst possible sub-directions in high-dimensional models. In arXiv:1403.7023 [math, stat] (Vol. 131).
van de Geer, S. A., Bühlmann, P., & Zhou, S. (2011) The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso). Electronic Journal of Statistics, 5, 688–749. DOI.
van de Geer, S., Bühlmann, P., Ritov, Y. ’acov, & Dezeure, R. (2014) On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3), 1166–1202. DOI.
Veitch, V., & Roy, D. M.(2015) The Class of Random Graphs Arising from Exchangeable Random Measures. arXiv:1512.03099 [Cs, Math, Stat].
Wahba, G. (1990) Spline Models for Observational Data. . SIAM
Wang, D., Wu, P., Zhao, P., & Hoi, S. C. H.(2015) A Framework of Sparse Online Learning and Its Applications. arXiv:1507.07146 [Cs].
Wu, T. T., & Lange, K. (2008) Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1), 224–244. DOI.
Yaghoobi, M., Nam, S., Gribonval, R., & Davies, M. E.(2012) Noise aware analysis operator learning for approximately cosparse signals. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5409–5412). DOI.
Yun, S., & Toh, K.-C. (2009) A coordinate gradient descent method for ℓ 1-regularized convex minimization. Computational Optimization and Applications, 48(2), 273–307. DOI.
Zhang, C.-H., & Zhang, S. S.(2014) Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1), 217–242. DOI.
Zhang, L., Yang, T., Jin, R., & Zhou, Z.-H. (2015) Sparse Learning for Large-scale and High-dimensional Data: A Randomized Convex-concave Optimization Approach. arXiv:1511.03766 [Cs].
Zhou, T., Tao, D., & Wu, X. (2011) Manifold elastic net: a unified framework for sparse dimension reduction. Data Mining and Knowledge Discovery, 22(3), 340–371.
Zou, H. (2006) The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association, 101(476), 1418–1429. DOI.
Zou, H., & Hastie, T. (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320. DOI.
Zou, H., Hastie, T., & Tibshirani, R. (2007) On the “degrees of freedom” of the lasso. The Annals of Statistics, 35(5), 2173–2192. DOI.

See original: The Living Thing / Notebooks Penalised regression