

Published online by Cambridge University Press:  31 March 2022

Stephen J. Wright
University of Wisconsin, Madison
Benjamin Recht
University of California, Berkeley
Publisher: Cambridge University Press
Print publication year: 2022



Allen-Zhu, Z. 2017. Katyusha: The first direct acceleration of stochastic gradient methods. Journal of Machine Learning Research, 18(1), 8194–8244.
Attouch, H., Chbani, Z., Peypouquet, J., and Redont, P. 2018. Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Mathematical Programming, 168(1–2), 123–175.
Beck, A., and Teboulle, M. 2003. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31, 167–175.
Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183–202.
Beck, A., and Tetruashvili, L. 2013. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4), 2037–2060.
Bertsekas, D. P. 1976. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Transactions on Automatic Control, AC-21, 174–184.
Bertsekas, D. P. 1982. Constrained Optimization and Lagrange Multiplier Methods. New York: Academic Press.
Bertsekas, D. P. 1997. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4), 913–926.
Bertsekas, D. P. 1999. Nonlinear Programming. Second edition. Belmont, MA: Athena Scientific.
Bertsekas, D. P. 2011. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Pages 85–119 of: Sra, S., Nowozin, S., and Wright, S. J. (eds), Optimization for Machine Learning. NIPS Workshop Series. Cambridge, MA: MIT Press.
Bertsekas, D. P., and Tsitsiklis, J. N. 1989. Parallel and Distributed Computation: Numerical Methods. Englewood Cliffs, NJ: Prentice Hall.
Bertsekas, D. P., Nedić, A., and Ozdaglar, A. E. 2003. Convex Analysis and Optimization. Optimization and Computation Series. Belmont, MA: Athena Scientific.
Blatt, D., Hero, A. O., and Gauchman, H. 2007. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1), 29–51.
Bolte, J., and Pauwels, E. 2021. Conservative set valued fields, automatic differentiation, stochastic gradient methods, and deep learning. Mathematical Programming, 188(1), 19–51.
Boser, B. E., Guyon, I. M., and Vapnik, V. N. 1992. A training algorithm for optimal margin classifiers. Pages 144–152 of: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. Pittsburgh, PA: ACM Press.
Boyd, S., and Vandenberghe, L. 2003. Convex Optimization. Cambridge: Cambridge University Press.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.
Bubeck, S., Lee, Y. T., and Singh, M. 2015. A geometric alternative to Nesterov’s accelerated gradient descent. Technical Report arXiv:1506.08187. Microsoft Research.
Burachik, R. S., and Jeyakumar, V. 2005. A simple closure condition for the normal cone intersection formula. Transactions of the American Mathematical Society, 133(6), 1741–1748.
Burer, S., and Monteiro, R. D. C. 2003. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorizations. Mathematical Programming, Series B, 95, 329–357.
Burke, J. V., and Engle, A. 2018. Line search methods for convex-composite optimization. Technical Report arXiv:1806.05218. Department of Mathematics, University of Washington.
Candès, E., and Recht, B. 2009. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9, 717–772.
Chouzenoux, E., Pesquet, J.-C., and Repetti, A. 2016. A block coordinate variable metric forward-backward algorithm. Journal of Global Optimization, 66, 457–485.
Conn, A. R., Gould, N. I. M., and Toint, P. L. 1992. LANCELOT: A Fortran Package for Large-Scale Nonlinear Optimization. Springer Series in Computational Mathematics, vol. 17. Heidelberg: Springer-Verlag.
Cortes, C., and Vapnik, V. N. 1995. Support-vector networks. Machine Learning, 20, 273–297.
Danskin, J. M. 1967. The Theory of Max-Min and Its Application to Weapons Allocation Problems. Springer.
Davis, D., Drusvyatskiy, D., Kakade, S., and Lee, J. D. 2020. Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics, 20(1), 119–154.
Defazio, A., Bach, F., and Lacoste-Julien, S. 2014. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Pages 1646–1654 of: Advances in Neural Information Processing Systems, November 2014, Montreal, Canada.
Dem’yanov, V. F., and Rubinov, A. M. 1967. The minimization of a smooth convex functional on a convex set. SIAM Journal on Control, 5(2), 280–294.
Dem’yanov, V. F., and Rubinov, A. M. 1970. Approximate Methods in Optimization Problems. Vol. 32. New York: Elsevier.
Drusvyatskiy, D., Fazel, M., and Roy, S. 2018. An optimal first order method based on optimal quadratic averaging. SIAM Journal on Optimization, 28(1), 251–271.
Dunn, J. C. 1980. Convergence rates for conditional gradient sequences generated by implicit step length rules. SIAM Journal on Control and Optimization, 18(5), 473–487.
Dunn, J. C. 1981. Global and asymptotic convergence rate estimates for a class of projected gradient processes. SIAM Journal on Control and Optimization, 19(3), 368–400.
Eckstein, J., and Bertsekas, D. P. 1992. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55, 293–318.
Eckstein, J., and Yao, W. 2015. Understanding the convergence of the alternating direction method of multipliers: Theoretical and computational perspectives. Pacific Journal of Optimization, 11(4), 619–644.
Fercoq, O., and Richtarik, P. 2015. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25, 1997–2023.
Fletcher, R., and Reeves, C. M. 1964. Function minimization by conjugate gradients. Computer Journal, 7, 149–154.
Frank, M., and Wolfe, P. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3, 95–110.
Gabay, D., and Mercier, B. 1976. A dual algorithm for the solution of nonlinear variational problems via finite element approximations. Computers and Mathematics with Applications, 2, 17–40.
Gelfand, I. 1941. Normierte Ringe. Recueil Mathématique [Matematicheskii Sbornik], 9, 3–24.
Glowinski, R., and Marrocco, A. 1975. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité, d’une classe de problèmes de Dirichlet non linéaires. Revue Française d’Automatique, Informatique, et Recherche Opérationnelle, 9, 41–76.
Goldstein, A. A. 1964. Convex programming in Hilbert space. Bulletin of the American Mathematical Society, 70, 709–710.
Goldstein, A. A. 1974. On gradient projection. Pages 38–40 of: Proceedings of the 12th Allerton Conference on Circuit and System Theory, Allerton Park, Illinois.
Golub, G. H., and van Loan, C. F. 1996. Matrix Computations. Third edition. Baltimore: The Johns Hopkins University Press.
Griewank, A., and Walther, A. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Second edition. Frontiers in Applied Mathematics. Philadelphia, PA: SIAM.
Hestenes, M. R. 1969. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4, 303–320.
Hestenes, M., and Stiefel, E. 1952. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6), 409–436.
Hu, B., Wright, S. J., and Lessard, L. 2018. Dissipativity theory for accelerating stochastic variance reduction: A unified analysis of SVRG and Katyusha using semidefinite programs. Pages 2038–2047 of: International Conference on Machine Learning (ICML).
Jaggi, M. 2013. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. Pages 427–435 of: International Conference on Machine Learning (ICML).
Jain, P., Netrapalli, P., Kakade, S. M., Kidambi, R., and Sidford, A. 2018. Accelerating stochastic gradient descent for least squares regression. Pages 545–604 of: Conference on Learning Theory (COLT).
Johnson, R., and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. Pages 315–323 of: Advances in Neural Information Processing Systems.
Kaczmarz, S. 1937. Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l’Académie Polonaise des Sciences et des Lettres. Classe des Sciences Mathématiques et Naturelles. Série A, Sciences Mathématiques, 35, 355–357.
Karimi, H., Nutini, J., and Schmidt, M. 2016. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. Pages 795–811 of: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer.
Kiwiel, K. C. 1990. Proximity control in bundle methods for convex nondifferentiable minimization. Mathematical Programming, 46(1–3), 105–122.
Kurdyka, K. 1998. On gradients of functions definable in o-minimal structures. Annales de l’Institut Fourier, 48, 769–783.
Lang, S. 1983. Real Analysis. Second edition. Reading, MA: Addison-Wesley.
Le Roux, N., Schmidt, M., and Bach, F. 2012. A stochastic gradient method with an exponential convergence rate for finite training sets. Advances in Neural Information Processing Systems, 25, 2663–2671.
Lee, C.-P., and Wright, S. J. 2018. Random permutations fix a worst case for cyclic coordinate descent. IMA Journal of Numerical Analysis, 39, 1246–1275.
Lee, Y. T., and Sidford, A. 2013. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. Pages 147–156 of: 2013 IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE.
Lemaréchal, C. 1975. An extension of Davidon methods to non differentiable problems. Pages 95–109 of: Nondifferentiable Optimization. Springer.
Lemaréchal, C., Nemirovskii, A., and Nesterov, Y. 1995. New variants of bundle methods. Mathematical Programming, 69(1–3), 111–147.
Lessard, L., Recht, B., and Packard, A. 2016. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1), 57–95.
Levitin, E. S., and Polyak, B. T. 1966. Constrained minimization problems. USSR Journal of Computational Mathematics and Mathematical Physics, 6, 1–50.
Li, X., Zhao, T., Arora, R., Liu, H., and Hong, M. 2018. On faster convergence of cyclic block coordinate descent-type methods for strongly convex minimization. Journal of Machine Learning Research, 18, 1–24.
Liu, J., and Wright, S. J. 2015. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1), 351–376.
Liu, J., Wright, S. J., Ré, C., Bittorf, V., and Sridhar, S. 2015. An asynchronous parallel stochastic coordinate descent algorithm. Journal of Machine Learning Research, 16, 285–322.
Łojasiewicz, S. 1963. Une propriété topologique des sous-ensembles analytiques réels. Les Équations aux Dérivées Partielles, 117, 87–89.
Lu, Z., and Xiao, L. 2015. On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming, Series A, 152, 615–642.
Luo, Z.-Q., Sturm, J. F., and Zhang, S. 2000. Conic convex programming and self-dual embedding. Optimization Methods and Software, 14, 169–218.
Maddison, C. J., Paulin, D., Teh, Y. W., O’Donoghue, B., and Doucet, A. 2018. Hamiltonian descent methods. arXiv preprint arXiv:1809.05042.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. 2009. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4), 1574–1609.
Nesterov, Y. 1983. A method for unconstrained convex problem with the rate of convergence O(1/k²). Doklady AN SSSR, 269, 543–547.
Nesterov, Y. 2004. Introductory Lectures on Convex Optimization: A Basic Course. Boston: Kluwer Academic Publishers.
Nesterov, Y. 2012. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(January), 341–362.
Nesterov, Y. 2015. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1–2), 381–404.
Nesterov, Y., and Nemirovskii, A. S. 1994. Interior Point Polynomial Methods in Convex Programming. Philadelphia, PA: SIAM.
Nesterov, Y., and Stich, S. U. 2017. Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization, 27(1), 110–123.
Nocedal, J., and Wright, S. J. 2006. Numerical Optimization. Second edition. New York: Springer.
Parikh, N., and Boyd, S. 2013. Proximal algorithms. Foundations and Trends in Optimization, 1(3), 123–231.
Polyak, B. T. 1963. Gradient methods for minimizing functionals (in Russian). Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 643–653.
Polyak, B. T. 1964. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4, 1–17.
Powell, M. J. D. 1969. A method for nonlinear constraints in minimization problems. Pages 283–298 of: Fletcher, R. (ed), Optimization. New York: Academic Press.
Rao, C. V., Wright, S. J., and Rawlings, J. B. 1998. Application of interior-point methods to model predictive control. Journal of Optimization Theory and Applications, 99, 723–757.
Recht, B., Fazel, M., and Parrilo, P. 2010. Guaranteed minimum-rank solutions to linear matrix equations via nuclear norm minimization. SIAM Review, 52(3), 471–501.
Richtarik, P., and Takac, M. 2014. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, Series A, 144(1), 1–38.
Richtarik, P., and Takac, M. 2016a. Distributed coordinate descent method for learning with big data. Journal of Machine Learning Research, 17, 1–25.
Richtarik, P., and Takac, M. 2016b. Parallel coordinate descent methods for big data optimization. Mathematical Programming, Series A, 156, 433–484.
Robbins, H., and Monro, S. 1951. A stochastic approximation method. Annals of Mathematical Statistics, 22(3), 400–407.
Rockafellar, R. T. 1970. Convex Analysis. Princeton, NJ: Princeton University Press.
Rockafellar, R. T. 1973. The multiplier method of Hestenes and Powell applied to convex programming. Journal of Optimization Theory and Applications, 12(6), 555–562.
Rockafellar, R. T. 1976a. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1, 97–116.
Rockafellar, R. T. 1976b. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14, 877–898.
Rosenblatt, F. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386.
Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. 2011. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1), 3–30.
Shi, B., Du, S. S., Jordan, M. I., and Su, W. J. 2018. Understanding the acceleration phenomenon via high-resolution differential equations. arXiv preprint arXiv:1810.08907.
Sion, M. 1958. On general minimax theorems. Pacific Journal of Mathematics, 8(1), 171–176.
Stellato, B., Banjac, G., Goulart, P., Bemporad, A., and Boyd, S. 2020. OSQP: An operator splitting solver for quadratic programs. Mathematical Programming Computation, 12(4), 637–672.
Strohmer, T., and Vershynin, R. 2009. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2), 262.
Su, W., Boyd, S., and Candès, E. 2014. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. Pages 2510–2518 of: Advances in Neural Information Processing Systems.
Sun, R., and Hong, M. 2015. Improved iteration complexity bounds of cyclic block coordinate descent for convex problems. Pages 1306–1314 of: Advances in Neural Information Processing Systems.
Teo, C. H., Vishwanathan, S. V. N., Smola, A., and Le, Q. V. 2010. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11(1), 311–365.
Tibshirani, R. 1996. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society B, 58, 267–288.
Todd, M. J. 2001. Semidefinite optimization. Acta Numerica, 10, 515–560.
Tseng, P., and Yun, S. 2010. A coordinate gradient descent method for linearly constrained smooth optimization and support vector machines training. Computational Optimization and Applications, 47(2), 179–206.
Vandenberghe, L. 2016. Slides for EE236C: Optimization Methods for Large-Scale Systems.
Vandenberghe, L., and Boyd, S. 1996. Semidefinite programming. SIAM Review, 38, 49–95.
Vapnik, V. 1992. Principles of risk minimization for learning theory. Pages 831–838 of: Advances in Neural Information Processing Systems.
Vapnik, V. 2013. The Nature of Statistical Learning Theory. Berlin: Springer Science & Business Media.
Wibisono, A., Wilson, A. C., and Jordan, M. I. 2016. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47), E7351–E7358.
Wolfe, P. 1975. A method of conjugate subgradients for minimizing nondifferentiable functions. Pages 145–173 of: Nondifferentiable Optimization. Springer.
Wright, S. J. 1997. Primal-Dual Interior-Point Methods. Philadelphia, PA: SIAM.
Wright, S. J. 2012. Accelerated block-coordinate relaxation for regularized optimization. SIAM Journal on Optimization, 22(1), 159–186.
Wright, S. J. 2018. Optimization algorithms for data analysis. Pages 49–97 of: Mahoney, M., Duchi, J. C., and Gilbert, A. (eds), The Mathematics of Data. IAS/Park City Mathematics Series, vol. 25. AMS.
Wright, S. J., and Lee, C.-P. 2020. Analyzing random permutations for cyclic coordinate descent. Mathematics of Computation, 89, 2217–2248.
Wright, S. J., Nowak, R. D., and Figueiredo, M. A. T. 2009. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(August), 2479–2493.
Zhang, T. 2004. Solving large scale linear prediction problems using stochastic gradient descent algorithms. Page 116 of: Proceedings of the Twenty-First International Conference on Machine Learning.


  • Bibliography
  • Stephen J. Wright, University of Wisconsin, Madison, Benjamin Recht, University of California, Berkeley
  • Book: Optimization for Data Analysis
  • Online publication: 31 March 2022