
References

Published online by Cambridge University Press: 05 May 2022

Daniel A. Roberts, Massachusetts Institute of Technology
Sho Yaida, Meta AI
Boris Hanin, Princeton University, New Jersey
Type: Chapter
In: The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks, pp. 439–444
Publisher: Cambridge University Press
Print publication year: 2022


References

Dirac, P. A. M., The Principles of Quantum Mechanics. No. 27 in The International Series of Monographs on Physics. Oxford University Press, 1930.
von Neumann, J., Mathematical Foundations of Quantum Mechanics. Princeton University Press, 1955.
Carnot, S., Reflections on the Motive Power of Heat and on Machines Fitted to Develop that Power. J. Wiley, 1890. Trans. by Thurston, R. H. from Réflexions sur la puissance motrice du feu et sur les machines propres à développer cette puissance (1824).
Bessarab, M., Landau. Moscow Worker, 1971. Trans. by B. Hanin from the original Russian source. www.ega-math.narod.ru/Landau/Dau1971.htm.
Polchinski, J., “Memories of a Theoretical Physicist,” arXiv:1708.09093 [physics.hist-ph].
Rosenblatt, F., Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, tech. rep., Cornell Aeronautical Lab, Inc., 1961.
McCulloch, W. S. and Pitts, W., “A logical calculus of the ideas immanent in nervous activity,” The Bulletin of Mathematical Biophysics 5 no. 4, (1943) 115–133.
Rosenblatt, F., “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychological Review 65 no. 6, (1958) 386.
Krizhevsky, A., Sutskever, I., and Hinton, G. E., “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, (2012) 1097–1105.
Fukushima, K., “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological Cybernetics 36 no. 4, (1980) 193–202.
LeCun, Y., Generalization and Network Design Strategies, tech. rep. CRG-TR-89-4, Department of Computer Science, University of Toronto, 1989.
LeCun, Y., Boser, B., Denker, J. S., et al., “Backpropagation applied to handwritten zip code recognition,” Neural Computation 1 no. 4, (1989) 541–551.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE 86 no. 11, (1998) 2278–2324.
Vaswani, A., Shazeer, N., Parmar, N., et al., “Attention is all you need,” in Advances in Neural Information Processing Systems 30, (2017) 5998–6008. arXiv:1706.03762 [cs.CL].
Rumelhart, D. E., Hinton, G. E., and Williams, R. J., “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, Rumelhart, D. E., McClelland, J. L., and the PDP Research Group, eds., Ch. 8, MIT Press, 1986, pp. 318–362.
LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R., “Efficient backprop,” in Neural Networks: Tricks of the Trade, pp. 9–48. Springer, 1998.
Gallant and White, “There exists a neural network that does not make avoidable mistakes,” in IEEE 1988 International Conference on Neural Networks, vol. 1, pp. 657–664. 1988.
Nair, V. and Hinton, G. E., “Rectified linear units improve restricted Boltzmann machines,” in International Conference on Machine Learning. 2010.
Glorot, X., Bordes, A., and Bengio, Y., “Deep sparse rectifier neural networks,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323, JMLR Workshop and Conference Proceedings. 2011.
Maas, A. L., Hannun, A. Y., and Ng, A. Y., “Rectifier nonlinearities improve neural network acoustic models,” in ICML Workshop on Deep Learning for Audio, Speech, and Language Processing. 2013.
Dugas, C., Bengio, Y., Bélisle, F., Nadeau, C., and Garcia, R., “Incorporating second-order functional knowledge for better option pricing,” in Advances in Neural Information Processing Systems 13, (2000) 472–478.
Ramachandran, P., Zoph, B., and Le, Q. V., “Searching for activation functions,” arXiv:1710.05941 [cs.NE].
Hendrycks, D. and Gimpel, K., “Gaussian error linear units (GELUs),” arXiv:1606.08415 [cs.LG].
Turing, A. M., “The chemical basis of morphogenesis,” Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 237 no. 641, (1952) 37–72.
Saxe, A. M., McClelland, J. L., and Ganguli, S., “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” arXiv:1312.6120 [cs.NE].
Zavatone-Veth, J. A. and Pehlevan, C., “Exact priors of finite neural networks,” arXiv:2104.11734 [cs.LG].
McGreevy, J., “Holographic duality with a view toward many-body physics,” Advances in High Energy Physics 2010 (2010) 723105, arXiv:0909.0518 [hep-th].
Neal, R. M., “Priors for infinite networks,” in Bayesian Learning for Neural Networks, pp. 29–53. Springer, 1996.
Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J., “Deep neural networks as Gaussian processes,” in International Conference on Learning Representations. 2018. arXiv:1711.00165 [stat.ML].
de G. Matthews, A. G., Rowland, M., Hron, J., Turner, R. E., and Ghahramani, Z., “Gaussian process behaviour in wide deep neural networks,” in International Conference on Learning Representations. 2018. arXiv:1804.11271 [stat.ML].
Yaida, S., “Non-Gaussian processes and neural networks at finite widths,” in Mathematical and Scientific Machine Learning Conference. 2020. arXiv:1910.00019 [stat.ML].
Dyson, F. J., “The S matrix in quantum electrodynamics,” Physical Review 75 (Jun, 1949) 1736–1755.
Schwinger, J., “On the Green’s functions of quantized fields. I,” Proceedings of the National Academy of Sciences 37 no. 7, (1951) 452–455.
Zeiler, M. D. and Fergus, R., “Visualizing and understanding convolutional networks,” in Computer Vision – ECCV 2014, (2014) 818–833.
Rahimi, A. and Recht, B., “Random features for large-scale kernel machines,” in Advances in Neural Information Processing Systems 20, (2008) 1177–1184.
Devlin, J., Chang, M., Lee, K., and Toutanova, K., “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv:1810.04805 [cs.CL].
Gell-Mann, M. and Low, F. E., “Quantum electrodynamics at small distances,” Physical Review 95 (Sep, 1954) 1300–1312.
Wilson, K. G., “Renormalization group and critical phenomena. I. Renormalization group and the Kadanoff scaling picture,” Physical Review B 4 (Nov, 1971) 3174–3183.
Wilson, K. G., “Renormalization group and critical phenomena. II. Phase-space cell analysis of critical behavior,” Physical Review B 4 (Nov, 1971) 3184–3205.
Stueckelberg de Breidenbach, E. C. G. and Petermann, A., “Normalization of constants in the quanta theory,” Helvetica Physica Acta 26 (1953) 499–520.
Goldenfeld, N., Lectures on Phase Transitions and the Renormalization Group. CRC Press, 2018.
Cardy, J., Scaling and Renormalization in Statistical Physics. Cambridge Lecture Notes in Physics. Cambridge University Press, 1996.
Minsky, M. and Papert, S. A., Perceptrons: An Introduction to Computational Geometry. MIT Press, 1988.
Coleman, S., Aspects of Symmetry: Selected Erice Lectures. Cambridge University Press, 1985.
Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S., “Exponential expressivity in deep neural networks through transient chaos,” in Advances in Neural Information Processing Systems 29, (2016) 3360–3368. arXiv:1606.05340 [stat.ML].
Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J., “On the expressive power of deep neural networks,” in International Conference on Machine Learning, pp. 2847–2854. 2017. arXiv:1606.05336 [stat.ML].
Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J., “Deep information propagation,” in 5th International Conference on Learning Representations. 2017. arXiv:1611.01232 [stat.ML].
He, K., Zhang, X., Ren, S., and Sun, J., “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034. 2015. arXiv:1502.01852 [cs.CV].
Kadanoff, L., “Critical behavior. Universality and scaling,” in Proceedings of the International School of Physics Enrico Fermi, Course LI (27 July – 8 August 1970). 1971.
Jaynes, E. T., Probability Theory: The Logic of Science. Cambridge University Press, 2003.
Froidmont, L., On Christian Philosophy of the Soul. 1649.
MacKay, D. J., “Probable networks and plausible predictions – a review of practical Bayesian methods for supervised neural networks,” Network: Computation in Neural Systems 6 no. 3, (1995) 469–505.
Williams, C. K. I., “Computing with infinite networks,” in Advances in Neural Information Processing Systems 9, (1996) 295–301.
Hinton, G., Vinyals, O., and Dean, J., “Distilling the knowledge in a neural network,” in NIPS Deep Learning and Representation Learning Workshop. 2015. arXiv:1503.02531 [stat.ML].
Hebb, D., The Organization of Behavior: A Neuropsychological Theory. Taylor & Francis, 2005.
Coleman, S., “Sidney Coleman’s Dirac lecture ‘quantum mechanics in your face’,” arXiv:2011.12671 [physics.hist-ph].
Jacot, A., Gabriel, F., and Hongler, C., “Neural tangent kernel: Convergence and generalization in neural networks,” in Advances in Neural Information Processing Systems 31, (2018) 8571–8580. arXiv:1806.07572 [cs.LG].
Hochreiter, S., Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München, 1991.
Bengio, Y., Frasconi, P., and Simard, P., “The problem of learning long-term dependencies in recurrent networks,” in IEEE International Conference on Neural Networks, vol. 3, pp. 1183–1188, IEEE. 1993.
Pascanu, R., Mikolov, T., and Bengio, Y., “On the difficulty of training recurrent neural networks,” in International Conference on Machine Learning, pp. 1310–1318, PMLR. 2013. arXiv:1211.5063 [cs.LG].
Kline, M., Mathematical Thought From Ancient to Modern Times: Volume 3. Oxford University Press, 1990.
Lee, J., Xiao, L., Schoenholz, S., et al., “Wide neural networks of any depth evolve as linear models under gradient descent,” in Advances in Neural Information Processing Systems 32, (2019) 8572–8583. arXiv:1902.06720 [stat.ML].
Fix, E. and Hodges, J., “Discriminatory analysis. Nonparametric discrimination: Consistency properties,” USAF School of Aviation Medicine, Project Number: 21-49-004, Report Number: 4 (1951).
Cover, T. and Hart, P., “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory 13 (1967) 21–27.
Einstein, A., “On the method of theoretical physics,” Philosophy of Science 1 no. 2, (1934) 163–169.
Hanin, B. and Nica, M., “Finite depth and width corrections to the neural tangent kernel,” in International Conference on Learning Representations. 2020. arXiv:1909.05989 [cs.LG].
Dyer, E. and Gur-Ari, G., “Asymptotics of wide networks from Feynman diagrams,” in International Conference on Learning Representations. 2020. arXiv:1909.11304 [cs.LG].
Giudice, G. F., “Naturally speaking: The naturalness criterion and physics at the LHC,” arXiv:0801.2562 [hep-ph].
Dyson, F. J., “Foreword,” in Classic Feynman: All the Adventures of a Curious Character, Leighton, R., ed., W. W. Norton & Company Ltd., 2006, pp. 5–9.
Chizat, L., Oyallon, E., and Bach, F., “On lazy training in differentiable programming,” in Advances in Neural Information Processing Systems 32, (2019) 2937–2947. arXiv:1812.07956 [math.OC].
MacKay, D. J., Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
Kaplan, J., McCandlish, S., Henighan, T., et al., “Scaling laws for neural language models,” arXiv:2001.08361 [cs.LG].
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O., “Understanding deep learning requires rethinking generalization,” arXiv:1611.03530 [cs.LG].
Wolpert, D. H., “The lack of a priori distinctions between learning algorithms,” Neural Computation 8 no. 7, (1996) 1341–1390.
Wolpert, D. H. and Macready, W. G., “No free lunch theorems for optimization,” IEEE Transactions on Evolutionary Computation 1 no. 1, (1997) 67–82.
Boltzmann, L., “On certain questions of the theory of gases,” Nature 51 no. 1322, (1895) 413–415.
Boltzmann, L., Lectures on Gas Theory. University of California Press, 1964. Trans. by S. G. Brush from Vorlesungen über Gastheorie (2 vols., 1896 & 1898).
Shannon, C. E., “A mathematical theory of communication,” The Bell System Technical Journal 27 no. 3, (1948) 379–423.
Shannon, C. E., “A mathematical theory of communication,” The Bell System Technical Journal 27 no. 4, (1948) 623–656.
Jaynes, E. T., “Information theory and statistical mechanics,” Physical Review 106 (May, 1957) 620–630.
Jaynes, E. T., “Information theory and statistical mechanics. II,” Physical Review 108 (Oct, 1957) 171–190.
LeCun, Y., Denker, J., and Solla, S., “Optimal brain damage,” in Advances in Neural Information Processing Systems 2, (1990) 598–605.
Frankle, J. and Carbin, M., “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” in International Conference on Learning Representations. 2019. arXiv:1803.03635 [cs.LG].
Banks, T. and Zaks, A., “On the phase structure of vector-like gauge theories with massless fermions,” Nuclear Physics B 196 no. 2, (1982) 189–204.
Linsker, R., “Self-organization in a perceptual network,” Computer 21 no. 3, (1988) 105–117.
Becker, S. and Hinton, G. E., “Self-organizing neural network that discovers surfaces in random-dot stereograms,” Nature 355 no. 6356, (1992) 161–163.
Gale, B., Zemeckis, R., Fox, M. J., and Lloyd, C., Back to the Future Part II. Universal Pictures, 1989.
He, K., Zhang, X., Ren, S., and Sun, J., “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. 2016.
Ioffe, S. and Szegedy, C., “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, pp. 448–456. 2015. arXiv:1502.03167 [cs.LG].
Veit, A., Wilber, M., and Belongie, S., “Residual networks behave like ensembles of relatively shallow networks,” in Advances in Neural Information Processing Systems 30, (2016) 550–558. arXiv:1605.06431 [cs.CV].
Ba, J. L., Kiros, J. R., and Hinton, G. E., “Layer normalization,” in Deep Learning Symposium, Neural Information Processing Systems. 2016. arXiv:1607.06450 [stat.ML].
