
Structure-preserving deep learning

Published online by Cambridge University Press:  27 May 2021

E. CELLEDONI
Affiliation:
Department of Mathematical Sciences, NTNU, N-7491 Trondheim, Norway emails: elena.celledoni@ntnu.no; brynjulf.owren@ntnu.no
M. J. EHRHARDT
Affiliation:
Institute for Mathematical Innovation, University of Bath, Bath BA2 7JU, UK email: m.ehrhardt@bath.ac.uk
C. ETMANN
Affiliation:
Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Wilberforce Road, Cambridge CB3 0WA, UK emails: cetmann@damtp.cam.ac.uk; cbs31@cam.ac.uk; fs436@cam.ac.uk
R. I. MCLACHLAN
Affiliation:
School of Fundamental Sciences, Massey University, Private Bag 11-222, Palmerston North, New Zealand email: r.mclachlan@massey.ac.nz
B. OWREN
Affiliation:
Department of Mathematical Sciences, NTNU, N-7491 Trondheim, Norway emails: elena.celledoni@ntnu.no; brynjulf.owren@ntnu.no
C.-B. SCHÖNLIEB
Affiliation:
Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Wilberforce Road, Cambridge CB3 0WA, UK emails: cetmann@damtp.cam.ac.uk; cbs31@cam.ac.uk; fs436@cam.ac.uk
F. SHERRY
Affiliation:
Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Wilberforce Road, Cambridge CB3 0WA, UK emails: cetmann@damtp.cam.ac.uk; cbs31@cam.ac.uk; fs436@cam.ac.uk

Abstract


Over the past few years, deep learning has risen to the foreground as a topic of massive interest, mainly as a result of successes obtained in solving large-scale image processing tasks. There are multiple challenging mathematical problems involved in applying deep learning: most deep learning methods require the solution of hard optimisation problems, and a good understanding of the trade-off between computational effort, amount of data and model complexity is needed to successfully design a deep learning approach for a given problem. Much of the progress in deep learning has been based on heuristic exploration, but there is a growing effort to mathematically understand the structure in existing deep learning methods and to systematically design new methods that preserve certain types of structure. In this article, we review a number of these directions: some deep neural networks can be understood as discretisations of dynamical systems; neural networks can be designed to have desirable properties such as invertibility or group equivariance; and new algorithmic frameworks based on conformal Hamiltonian systems and Riemannian manifolds have been proposed to solve the optimisation problems. We conclude our review of each of these topics by discussing some open problems that we consider to be interesting directions for future research.
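
To make the dynamical-systems viewpoint concrete, the sketch below (our illustration, not code from the paper) treats a residual block x_{k+1} = x_k + h f(x_k, θ_k) as one forward-Euler step of the ODE dx/dt = f(x(t), θ(t)); the step size h, the tanh vector field and all variable names are illustrative assumptions.

```python
# A minimal sketch (assumption: tanh vector field, fixed step size h) of the
# "ResNet as forward Euler" viewpoint: a residual block
#     x_{k+1} = x_k + h * f(x_k, theta_k)
# is one explicit Euler step of the ODE dx/dt = f(x(t), theta(t)).
import numpy as np

def residual_block(x, W, b, h=0.1):
    """One residual block = one forward-Euler step of dx/dt = tanh(W x + b)."""
    return x + h * np.tanh(W @ x + b)

def resnet_forward(x, params, h=0.1):
    """Compose L residual blocks: an L-step Euler discretisation of the ODE."""
    for W, b in params:
        x = residual_block(x, W, b, h)
    return x

# Example: four layers acting on a three-dimensional state.
rng = np.random.default_rng(0)
params = [(0.1 * rng.standard_normal((3, 3)), np.zeros(3)) for _ in range(4)]
print(resnet_forward(np.ones(3), params))
```

Reading the depth index as time in this way is what allows structure of the continuous system, such as stability or invertibility, to be preserved in the network by choosing an appropriate discretisation.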

Type
Papers
Creative Commons
Creative Commons Licence - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2021. Published by Cambridge University Press
