From HandWiki
Artificial neural networks (ANNs) are models created using machine learning to perform a number of tasks. While the computational implementations of ANNs relate to earlier discoveries in mathematics, their creation was inspired by biological neural circuitry. The first implementation of ANNs was the perceptron by Frank Rosenblatt.[‡ 1] Little research was conducted on ANNs in the 1970s and 1980s, with the AAAI calling this period an "AI winter".[1]
Later, advances in hardware and the development of the backpropagation algorithm, as well as recurrent neural networks and convolutional neural networks, renewed interest in ANNs. The 2010s saw the development of a deep neural network (i.e., one with many layers) called AlexNet.[‡ 2] It greatly outperformed other image recognition models, and is thought to have launched the ongoing AI spring.[2] The transformer architecture was first described in 2017 as a method to teach ANNs grammatical dependencies in language,[‡ 3] and is the predominant architecture used by large language models such as GPT-4. Diffusion models were first described in 2015, and became the basis of image generation models such as DALL-E in the 2020s.
Jürgen Schmidhuber suggests that the first neural network was the method of linear regression by least squares, first published by Adrien-Marie Legendre in 1805[3] and independently developed by Friedrich Gauss (who claimed use since 1795)[4] and Robert Adrain (1808).[5] Alexander Bain's Mind and Body (1873) proposed that thoughts and bodily activity result from neuronal processes, with each thought corresponding to a distinct neural grouping.[6] William James's The Principles of Psychology (1890) advanced two principles on a quasi-neurological basis: first, that when two brain processes are active together, one tends to propagate excitement into the other, and second, that activity at any brain point is the sum of tendencies from all other points discharging into it.[6][7][lower-alpha 1]
The neuron's place as the primary functional unit of the nervous system was first recognized in the late 19th century through the work of Santiago Ramón y Cajal, notably through his 1888 paper presenting staining of axons in the cerebellum of birds.[8]
Warren McCulloch and Walter Pitts's 1943 paper "A Logical Calculus of the Ideas Immanent in Nervous Activity" studied several abstract models for neural networks, using the symbolic logic of Rudolf Carnap and Principia Mathematica. The paper argued that several abstract models of neural networks (some learning, some not) have the same computational power as Turing machines.[‡ 4][9] This model paved the way for research to split into two approaches: one focused on biological processes, while the other focused on the application of neural networks to artificial intelligence.[citation needed] This also led to work on nerve networks and their link to finite automata.[10][importance?] Some[who?] consider McCulloch and Pitts to be the founders of connectionism, a theory of mind in opposition to classical computationalism.[11][6][clarification needed]
In his 1948 report "Intelligent Machinery", published posthumously in 1969, Alan Turing proposed randomly connected networks of neuron-like nodes trainable through "education", defining A-type machines with random networks of NAND gates and B-type machines with modifiable connections.[12]
In 1949, the psychologist Donald O. Hebb published The Organization of Behavior, which proposed a learning hypothesis based on the mechanism of neural plasticity which became known as Hebbian learning, summarized as "neurons that fire together, wire together".[‡ 5][13] Similar observations were made by Jerzy Konorski in 1948.[14] The concept was used in many early neural networks, such as Rosenblatt's perceptron and the Hopfield network.[citation needed] This evolved into models for long-term potentiation.[citation needed]
Belmont Farley and Wesley A. Clark (1954) were the first to use computational machines to simulate a Hebbian network.[‡ 6][15] Other neural network computational machines were simulated by Nathaniel Rochester, John Holland, Lois Haibt and William Duda (1956).[‡ 7][16]: 31
In 1959, a biological model was proposed by David H. Hubel and Torsten Wiesel based on their discovery of two types of cells in the primary visual cortex: simple cells and complex cells.[17][importance?]
The perceptron was created by Frank Rosenblatt in 1957 while working at the Cornell Aeronautical Laboratory, publishing the details the following year.[‡ 1] The perceptron was designed to classify objects into two categories, updating based on error feedback.[18][clarification needed] He initially simulated the perceptron on an IBM 704, later designing the Mark I Perceptron, the first hardware neural net.[19] In 1958, Rosenblatt proposed the multilayer perceptron (MLP) model, consisting of an input layer, a hidden non-learning layer with randomized weights, and an output layer with learnable connections. He published the book Principles of Neurodynamics in 1962, which also introduced variants and computer experiments, including a version (developed alongside Henry David Block and Bruce Knight) with four-layer perceptrons where the last two layers have learned weights.[‡ 8][‡ 9]
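The error-feedback update described above can be illustrated with a minimal sketch of a single threshold unit. This is a modern Python rendering under stated assumptions, not Rosenblatt's original implementation: the function names, the learning rate `lr`, the epoch limit, and the logical-AND data set are all chosen here for illustration.

```python
def predict(w, b, x):
    """Threshold unit: output 1 if the weighted sum exceeds 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Adjust weights only when the unit misclassifies a sample."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in zip(samples, labels):
            error = target - predict(w, b, x)  # -1, 0 or +1
            if error != 0:
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]
                b += lr * error
                mistakes += 1
        if mistakes == 0:  # every sample classified correctly
            break
    return w, b


# Logical AND: a linearly separable two-category problem (illustrative data).
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
T = [0, 0, 0, 1]
w, b = train_perceptron(X, T)
print([predict(w, b, x) for x in X])
```

Because the two categories here can be separated by a straight line, the updates stop after a finite number of passes.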
Bernard Widrow and his doctoral student Marcian Hoff developed ADALINE (Adaptive Linear Neuron) in 1960. Unlike Rosenblatt's perceptron, ADALINE adjusted weights based on their least mean squares (LMS) algorithm before applying the threshold function.[6][20] MADALINE, the multilayer extension, was used to eliminate echo on phone lines, likely the first artificial neural network applied to a real‑world engineering problem.[6][21]
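The difference from the perceptron rule can be sketched in a few lines: the LMS error is measured on the linear output, and the threshold is applied only when classifying. The code below is an illustrative sketch, not Widrow and Hoff's implementation; the step size `lr`, the number of epochs, and the bipolar AND data are assumptions made for the example.

```python
def net_input(w, b, x):
    """Linear output of the unit, before any threshold."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_adaline(samples, targets, epochs=50, lr=0.1):
    """Least-mean-squares (Widrow-Hoff) updates on the linear output."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, targets):
            error = target - net_input(w, b, x)  # real-valued, not thresholded
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b


def classify(w, b, x):
    """The threshold is applied only after training."""
    return 1 if net_input(w, b, x) >= 0 else -1


# Bipolar logical AND (illustrative data).
X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
T = [-1, -1, -1, 1]
w, b = train_adaline(X, T)
print([classify(w, b, x) for x in X])
```

With these settings the weights settle near the least-squares solution (0.5, 0.5, -0.5) rather than stopping at the first error-free pass.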
Group method of data handling, a method to train arbitrarily deep neural networks, was published by Alexey Ivakhnenko and Valentin Lapa in 1965; they regarded it as a form of polynomial regression[‡ 10] or a generalization of Rosenblatt's perceptron.[‡ 11] A 1971 paper described a deep network with the equivalent of eight layers trained by this method.[‡ 12][13]
The first deep learning multilayer perceptron trained by stochastic gradient descent was published in 1967 by Shun'ichi Amari.[‡ 13] According to Amari, in computer experiments conducted by his student Saito, a five-layer MLP with two modifiable layers learned internal representations to classify non-linearly separable pattern classes.[22] Subsequent developments in hardware and hyperparameter tuning have made end-to-end stochastic gradient descent the currently dominant training technique.[citation needed]
In 1969, Marvin Minsky and Seymour Papert published Perceptrons with the goal of showing the limitations of neural network systems.[16]: 97, 110 They proved single-layer perceptrons cannot compute non-linearly separable functions such as XOR. They also demonstrated that certain problems (e.g. determining parity) are impossible for single-layer networks under restriction of "conjunctive locality",[clarification needed] and the maximum number of connections required to compute others (e.g. connectedness) grows arbitrarily large with input size.[‡ 14][16]: 141–150
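The XOR limitation can be checked by hand. A single threshold unit with weights w1, w2 and bias b would need b ≤ 0, w1 + b > 0, w2 + b > 0 and w1 + w2 + b ≤ 0. Adding the two middle inequalities gives w1 + w2 + 2b > 0, which together with the last one forces b > 0, a contradiction. The sketch below illustrates this with a brute-force search; the grid of candidate weights and the function names are assumptions for illustration, and a finite search is a demonstration rather than a proof.

```python
from itertools import product


def threshold_unit(w1, w2, b, x1, x2):
    """Single-layer unit with a hard threshold at zero."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0


def find_weights(truth_table, grid):
    """Return the first (w1, w2, b) on the grid that reproduces the table."""
    for w1, w2, b in product(grid, repeat=3):
        if all(threshold_unit(w1, w2, b, x1, x2) == t
               for (x1, x2), t in truth_table.items()):
            return (w1, w2, b)
    return None


grid = [i / 2 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

print(find_weights(AND, grid))  # some weight triple exists
print(find_weights(XOR, grid))  # no triple on the grid works
```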
Although these conclusions were limited, for example to networks with at most two layers, funding from American and British agencies for neural network projects subsequently decreased[23][24] in favor of symbolic AI,[citation needed] and the number of computer scientists working in the field fell.[25]
Up until the 1970s, neural networks were limited by their capacity to learn and update their neurons. The "learning rule" used by Rosenblatt for the perceptron only allowed for training a single layer of a neural network. The terminology "back-propagating errors" was introduced by Rosenblatt in 1962 to describe a (hypothetical) multilayer generalization of his perceptron learning algorithm.[‡ 15][26] The aforementioned least mean squares (LMS) algorithm, also known as the Widrow–Hoff learning rule or the Delta rule, was more general but still limited to single layers.[citation needed]
Backpropagation is an efficient application of the chain rule derived by Gottfried Wilhelm Leibniz in 1673 to networks of differentiable nodes.[‡ 16] Henry J. Kelley had a continuous precursor of backpropagation in 1960 in the context of control theory,[‡ 17] also discovered by Arthur E. Bryson independently around the same time.[‡ 18] He presented a form of gradient descent to solve problems where neurons have continuous output, as opposed to the discrete (binary) output of existing neural networks.[27][28]
The modern form of backpropagation was developed multiple times in the early 1970s. The earliest published instance was Seppo Linnainmaa's 1970 master's thesis.[‡ 19][29] His FORTRAN code efficiently computed the derivatives of nested, differentiable functions by caching intermediate steps,[30] and was used to calculate arithmetic rounding errors for the results of complex expressions.[29] He published some of his results in English in 1976.[‡ 20][29] Paul Werbos developed it independently in 1971 or 1972,[31]: 342 and published it in his PhD thesis in 1974.[32][33] In 1982, he became the first person to apply backpropagation to neural networks.[‡ 21][30] In 1986, David E. Rumelhart et al. popularized backpropagation.[‡ 22]
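The idea of differentiating nested functions by caching intermediate results and applying the chain rule backwards can be sketched as follows. This is a small modern illustration, not a reconstruction of Linnainmaa's FORTRAN code; the `Var` class, the `backward` function, and the example expression are hypothetical names chosen for this sketch.

```python
import math


class Var:
    """Scalar node that caches its value and its local derivatives."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent node, d(self)/d(parent))
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def sin(x):
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))


def backward(output):
    """Propagate derivatives from the output back to the inputs."""
    order, seen = [], set()

    def visit(node):  # depth-first topological sort
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # consumers before producers
        for parent, local in node.parents:
            parent.grad += node.grad * local  # chain rule


x, y = Var(2.0), Var(3.0)
z = sin(x * y) + x * x
backward(z)
print(x.grad, y.grad)  # analytically: y*cos(x*y) + 2*x and x*cos(x*y)
```

Each intermediate value is computed once in the forward pass and reused in the backward pass, which is why the cost of obtaining all input derivatives stays proportional to the cost of evaluating the function itself.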
One origin of the recurrent neural network (RNN) was statistical mechanics. The Ising model was developed by Wilhelm Lenz[‡ 23] and Ernst Ising[‡ 24] in the 1920s[34] as a simple statistical mechanical model of magnets at equilibrium. Glauber in 1963 studied the Ising model evolving in time, as a process towards equilibrium (Glauber dynamics), adding the component of time.[‡ 25] Shun'ichi Amari in 1972 proposed modifying the weights of an Ising model by a Hebbian learning rule as a model of associative memory, adding the component of learning.[‡ 26] This was popularized as the Hopfield network (1982).[‡ 27]
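A minimal sketch of such an associative memory is given below: Ising-like units with states +1 or -1, weights set by a Hebbian outer-product rule, and asynchronous sign updates for recall. The two stored patterns, the 1/n scaling, and the function names are assumptions made for this illustration rather than details taken from the cited papers.

```python
def hebbian_weights(patterns):
    """W[i][j] accumulates p[i] * p[j] over the stored patterns (no self-links)."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j] / n
    return W


def recall(W, state, sweeps=5):
    """Asynchronous updates: each unit takes the sign of its local field."""
    s = list(state)
    for _ in range(sweeps):
        for i in range(len(s)):
            field = sum(W[i][j] * s[j] for j in range(len(s)))
            s[i] = 1 if field >= 0 else -1
    return s


# Two mutually orthogonal patterns of eight units (illustrative data).
p1 = [1, -1, 1, -1, 1, -1, 1, -1]
p2 = [1, 1, 1, 1, -1, -1, -1, -1]
W = hebbian_weights([p1, p2])

noisy = list(p1)
noisy[0] = -noisy[0]  # corrupt one unit
print(recall(W, noisy) == p1)
```

The corrupted unit is pulled back by the field from the other units, so the network settles on the stored pattern nearest to its starting state.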
Another origin of the RNN was neuroscience. The word "recurrent" is used to describe loop-like structures in anatomy. In 1901, Santiago Ramón y Cajal observed "recurrent semicircles" in the cerebellar cortex.[35] In 1933, Rafael Lorente de Nó discovered "recurrent, reciprocal connections" by Golgi's method, and proposed that excitatory loops explain certain aspects of the vestibulo-ocular reflex.[‡ 28][36] Hebb considered the "reverberating circuit" as an explanation for short-term memory.[37] McCulloch and Pitts (1943) considered neural networks that contain cycles, and noted that the current activity of such networks can be affected by activity indefinitely far in the past.
Two early influential works were the Jordan network (1986) and the Elman network (1990), which applied RNNs to the study of cognitive psychology.
Sepp Hochreiter's diploma thesis (1991)[‡ 30] proposed the neural history compressor, and identified and analyzed the vanishing gradient problem.[‡ 30][38] In 1993, a neural history compressor system solved a "Very Deep Learning" task that required more than 1000 subsequent layers in an RNN unfolded in time.[‡ 31][‡ 29] Hochreiter proposed recurrent residual connections to solve the vanishing gradient problem. This led to the long short-term memory (LSTM), published in 1995.[‡ 32] LSTM can learn "very deep learning" tasks[39] with long credit assignment paths that require memories of events that happened thousands of discrete time steps before. That LSTM was not yet the modern architecture, which requires a "forget gate", introduced in 1999;[‡ 33] the version with a forget gate became the standard RNN architecture.
Long short-term memory (LSTM) networks were invented by Hochreiter and Schmidhuber in 1995 and set accuracy records in multiple application domains.[‡ 32][‡ 34] LSTM became the default choice for RNN architecture.
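How an LSTM cell can carry information across many time steps is easiest to see in a scalar sketch. The code below uses the commonly taught gate equations with a forget gate; it is an illustration under assumptions, not the formulation of any one cited paper, and the parameter dictionary, gate names, and hand-picked weights are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    params maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre("input"))        # how much new content to write
    f = sigmoid(pre("forget"))       # how much old content to keep
    o = sigmoid(pre("output"))       # how much of the cell to expose
    g = math.tanh(pre("candidate"))  # proposed new content
    c = f * c_prev + i * g           # additive path through time
    h = o * math.tanh(c)
    return h, c


# Hand-picked weights: forget gate almost open, input gate almost closed.
params = {
    "input": (0.0, 0.0, -20.0),
    "forget": (0.0, 0.0, 20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 0.7
for _ in range(1000):
    h, c = lstm_step(1.0, h, c, params)
print(c)  # stays close to the initial 0.7
```

Because the cell state is updated by addition rather than repeated squashing, a value written once can survive a thousand steps almost unchanged when the forget gate stays near 1.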
Around 2006, LSTM started to revolutionize speech recognition, outperforming traditional models in certain speech applications.[‡ 35][‡ 36] LSTM also improved large-vocabulary speech recognition[‡ 37][‡ 38] and text-to-speech synthesis[‡ 39] and was used in Google voice search, and dictation on Android devices.[‡ 40]
LSTM broke records for improved machine translation,[‡ 41] language modeling,[40] and multilingual language processing.[‡ 42] LSTM combined with convolutional neural networks (CNNs) improved automatic image captioning.[‡ 43]
Kunihiko Fukushima introduced the neocognitron in 1980.[41][‡ 44][42] It was inspired by the work of David H. Hubel and Torsten Wiesel, who showed in the 1950s and 1960s that cat visual cortices contain neurons that individually respond to small regions of the visual field. The neocognitron introduced the two basic types of layers in CNNs: convolutional layers and downsampling layers.
A convolutional layer contains units whose receptive fields cover a patch of the previous layer. The weight vector (the set of adaptive parameters) of such a unit is often called a filter. Units can share filters.
Downsampling layers contain units whose receptive fields cover patches of previous convolutional layers. Such a unit typically computes the average of the activations of the units in its patch. Downsampling helps to correctly classify objects in visual scenes even when the objects are shifted.[citation needed]
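The two layer types can be sketched directly from these definitions. In the illustration below, every output unit of the convolutional layer applies the same shared filter to its own patch, and each downsampling unit averages a 2×2 patch of the resulting feature map. The image, the vertical-edge filter, and the function names are assumptions chosen for the example.

```python
def conv2d(image, kernel):
    """Slide one shared filter over every patch of the image (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def average_pool(fmap, size=2):
    """Each downsampling unit averages a non-overlapping size x size patch."""
    pooled = []
    for r in range(0, len(fmap) - size + 1, size):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, size):
            patch = [fmap[r + i][c + j]
                     for i in range(size) for j in range(size)]
            row.append(sum(patch) / len(patch))
        pooled.append(row)
    return pooled


# A vertical bar on a 5x5 image and a 2x2 vertical-edge filter.
image = [[0, 0, 1, 1, 0] for _ in range(5)]
edge_filter = [[-1, 1],
               [-1, 1]]

feature_map = conv2d(image, edge_filter)  # 4x4
pooled = average_pool(feature_map)        # 2x2
print(feature_map[0])
print(pooled)
```

The filter responds positively at the left edge of the bar and negatively at the right edge, and the pooled map keeps that information at half the resolution.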
In 1969, Fukushima introduced the ReLU (rectified linear unit) activation function.[‡ 45][22] The rectifier is the most popular activation function for CNNs and deep neural networks in general.[43]
The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel and was one of the first CNNs, as it achieved shift invariance.[‡ 46][clarification needed] It did so by sharing weights in combination with backpropagation training.[‡ 47] Thus, while using a pyramidal structure as in the neocognitron, it optimized weights globally instead of locally.[‡ 46]
In 1988, Wei et al. applied backpropagation to a CNN (a simplified neocognitron with convolutional interconnections between the image feature layers and the last fully connected layer) for alphabet recognition. They proposed a CNN for an optical computing system.[‡ 48][‡ 49]
Max pooling appears in Fukushima's 1982 publication on the neocognitron.[‡ 50] In 1989, Yann LeCun et al. trained a max pooling CNN with the purpose of recognizing handwritten ZIP codes on mail. While the algorithm worked, training required 3 days.[‡ 51][‡ 52] Learning was fully automatic, performed better than manual coefficient design, and was suited to a broader range of image recognition problems and image types. Subsequently, Wei et al. modified their model by removing the last fully connected layer. They applied it to medical image object segmentation in 1991[‡ 53] and breast cancer detection in mammograms in 1994.[‡ 54]
In a variant of the neocognitron called the cresceptron, instead of using Fukushima's spatial averaging, Weng et al. used max-pooling where a downsampling unit computes the maximum (rather than the average) of the activations in its patch.[‡ 55][‡ 56][‡ 57][‡ 58]
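The difference between the two downsampling rules is small in code. The sketch below applies both to the same feature map; the helper `pool2d`, the patch size, and the example values are assumptions made for illustration.

```python
def pool2d(fmap, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size patch."""
    return [[reduce_fn([fmap[r + i][c + j]
                        for i in range(size) for j in range(size)])
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 0, 0, 0],
    [0, 0, 0, 3],
    [0, 0, 2, 0],
    [0, 5, 0, 0],
]

print(pool2d(feature_map, 2, max))   # strongest activation per patch
print(pool2d(feature_map, 2, mean))  # spatial average per patch
```

Max pooling reports the strongest response wherever it falls inside the patch, whereas averaging dilutes an isolated response by the size of the patch.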
LeNet-5, a 7-level CNN by LeCun et al. in 1998[‡ 59] that classifies digits, was applied by banks to recognize hand-written numbers on checks digitized in 32x32 pixel images. The ability to process higher-resolution images required more and larger CNN layers.
In 2010, backpropagation training through max-pooling was accelerated by GPUs and shown to perform better than other pooling variants.[‡ 60]
Rprop (short for "resilient backpropagation") is a first-order optimization algorithm. It was created by Martin Riedmiller and Heinrich Braun (1992).[‡ 61][‡ 62] Sven Behnke (2003) relied on only the sign of the gradient (Rprop)[‡ 63] on problems such as image reconstruction and face localization.
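A sign-based update of this kind can be sketched as follows. The code is a simplified variant without weight backtracking, written for illustration; the growth and shrink factors 1.2 and 0.5, the step limits, the initial step, and the test function are assumptions of this sketch rather than values taken from the sources cited here.

```python
def sign(v):
    return (v > 0) - (v < 0)


def rprop_minimize(grad_fn, w, steps=100, step_init=0.1,
                   eta_plus=1.2, eta_minus=0.5,
                   step_min=1e-6, step_max=50.0):
    """Per-weight step sizes adapt to gradient sign changes only."""
    w = list(w)
    step = [step_init] * len(w)
    prev_grad = [0.0] * len(w)
    for _ in range(steps):
        grad = grad_fn(w)
        for i in range(len(w)):
            agreement = prev_grad[i] * grad[i]
            if agreement > 0:    # same direction as before: speed up
                step[i] = min(step[i] * eta_plus, step_max)
            elif agreement < 0:  # overshot a minimum: slow down
                step[i] = max(step[i] * eta_minus, step_min)
            w[i] -= sign(grad[i]) * step[i]  # gradient magnitude is ignored
            prev_grad[i] = grad[i]
    return w


# Badly scaled quadratic: f(w) = (w0 - 3)^2 + 1000 * (w1 + 1)^2
def grad_f(w):
    return [2.0 * (w[0] - 3.0), 2000.0 * (w[1] + 1.0)]


w = rprop_minimize(grad_f, [0.0, 0.0])
print(w)  # close to [3, -1]
```

Since only the sign of each partial derivative is used, the thousandfold difference in curvature between the two coordinates has no effect on the step sizes.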
The deep learning revolution started around CNN- and GPU-based computer vision. Although CNNs trained by backpropagation had been around for decades and GPU implementations of NNs for years,[‡ 64] including CNNs,[‡ 65] faster implementations of CNNs on GPUs were needed to progress on computer vision. Later, as deep learning became widespread, specialized hardware and algorithm optimizations were developed specifically for deep learning.[44]
A key driver of the deep learning revolution was advances in hardware, especially GPUs. Some early work dated back to 2004.[‡ 64][‡ 65] In 2009, Rajat Raina, Anand Madhavan, and Andrew Ng reported a 100M-parameter deep belief network trained on 30 Nvidia GeForce GTX 280 GPUs, an early demonstration of GPU-based deep learning. They reported up to 70 times faster training.[‡ 66]
In 2011, a CNN named DanNet[‡ 67][‡ 68] by Dan Ciresan, Ueli Meier, Jonathan Masci, Luca Maria Gambardella, and Jürgen Schmidhuber achieved for the first time superhuman performance in a visual pattern recognition contest, outperforming traditional methods by a factor of 3.[39] It then won more contests.[‡ 69][‡ 70] They also showed how max-pooling CNNs on GPU improved performance significantly.[‡ 71]
Many discoveries were empirical and focused on engineering. For example, in 2011, Xavier Glorot, Antoine Bordes and Yoshua Bengio found that the ReLU, used by Fukushima in 1969,[‡ 45] worked better than widely used activation functions prior to 2011.[citation needed]
In October 2012, AlexNet by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton won the large-scale ImageNet competition by a significant margin over shallow machine learning methods.[‡ 72] Further incremental improvements included the VGG-16 network by Karen Simonyan and Andrew Zisserman[‡ 73] and Google's Inceptionv3.[‡ 74]
The success in image classification was then extended to the more challenging task of generating descriptions (captions) for images, often as a combination of CNNs and LSTMs.[‡ 43][‡ 75][‡ 76]
In 2014, the state of the art was training "very deep neural networks" with 20 to 30 layers.[‡ 73] Stacking too many layers led to a steep reduction in training accuracy,[45] known as the "degradation" problem.[46] In 2015, two techniques were developed concurrently to train very deep networks: the highway network[‡ 77] and the residual neural network (ResNet).[‡ 78] The ResNet research team empirically tested various tricks for training deeper networks until they discovered the deep residual network architecture.[47]
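The core idea of a residual connection, in which a block outputs its input plus a learned correction, can be sketched in a few lines. This is a bare illustration under assumptions: the two-layer block, the all-zero weights, and the function names are chosen here to show the identity path and are not the published ResNet architecture.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(W, v):
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def plain_block(x, W1, W2):
    """Two stacked layers: the output depends entirely on the weights."""
    return linear(W2, relu(linear(W1, x)))


def residual_block(x, W1, W2):
    """The same layers, but added to the input as a correction."""
    return [x_i + f_i for x_i, f_i in zip(x, plain_block(x, W1, W2))]


zeros = [[0.0] * 3 for _ in range(3)]
x = [1.0, -2.0, 0.5]

h = x
for _ in range(50):  # a 50-block stack with untrained (zero) weights
    h = residual_block(h, zeros, zeros)

print(h)                              # the input passes through unchanged
print(plain_block(x, zeros, zeros))   # a plain block loses the input
```

With the skip path, a very deep stack starts out as the identity mapping, so added layers need only learn a deviation from it instead of having to reproduce the signal.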
In 1991, Jürgen Schmidhuber published "artificial curiosity", neural networks in a zero-sum game.[‡ 79] The first network is a generative model that models a probability distribution over output patterns. The second network learns by gradient descent to predict the reactions of the environment to these patterns. GANs can be regarded as a case where the environmental reaction is 1 or 0 depending on whether the first network's output is in a given set.[48] It was extended to "predictability minimization" to create disentangled representations of input patterns.[‡ 80][‡ 81]
Other people had similar ideas but did not develop them similarly. An idea involving adversarial networks was published in a 2010 blog post by Olli Niemitalo.[‡ 82][importance?] This idea was never implemented and did not involve stochasticity in the generator and thus was not a generative model. It is now known as a conditional GAN or cGAN. An idea similar to GANs was used to model animal behavior by Li, Gauci and Gross in 2013.[‡ 83]
Another inspiration for GANs was noise-contrastive estimation,[‡ 84] which uses the same loss function as GANs and which Goodfellow studied during his PhD in 2010–2014. Generative adversarial networks (GANs), as introduced by Ian Goodfellow et al. in 2014,[‡ 85] became state of the art in generative modeling during the 2014–2018 period. High image quality was achieved by Nvidia's StyleGAN (2018),[49] based on the Progressive GAN by Tero Karras et al.,[‡ 86] in which the GAN generator is grown from small to large scale in a pyramidal fashion. Image generation by GANs reached popular success and provoked discussions concerning deepfakes.[50] Diffusion models (2015)[‡ 87] have since eclipsed GANs in generative modeling, with systems such as DALL·E 2 (2022) and Stable Diffusion (2022).
Human selective attention has been studied in both neuroscience and cognitive psychology.[51] Selective attention of auditory inputs was studied by Colin Cherry in 1953, who first defined and named the cocktail party effect.[‡ 88] Donald Broadbent proposed the filter model of attention in 1958.[‡ 89] Selective attention of vision was studied in the 1960s by George Sperling using the partial report paradigm. Saccade control is modulated by cognitive processes, in that the eye moves preferentially towards areas of high salience. As the fovea of the eye is small, the eye cannot sharply resolve all of the visual field at once. The use of saccade control allows the eye to quickly scan important features of a scene.[‡ 90][importance?]
These research studies inspired algorithms such as a variant of the neocognitron.[52] [‡ 91] Developments in neural networks have inspired circuit models of biological visual attention.[53][54]
A key aspect of the attention mechanism is the use of multiplicative operations, which have been studied under the names of higher-order neural networks,[‡ 92] multiplication units,[‡ 93] sigma-pi units,[55] fast weight controllers,[‡ 94] and hyper-networks.[‡ 95]
During the deep learning era, the attention mechanism was developed to address problems in sequence encoding and decoding.[56][incomprehensible]
The idea of encoder-decoder sequence transduction had been developed in the early 2010s. Two papers from 2014 are most commonly cited as the originators of seq2seq.[‡ 96][‡ 97] The seq2seq architecture employs two RNNs, typically LSTMs, as an "encoder" and a "decoder" for sequence transduction tasks such as machine translation. Seq2seq became state of the art in machine translation and was instrumental in the development of the attention mechanism and the transformer.
An image captioning model that would encode an input image into a fixed-length vector was proposed in 2015, citing inspiration from the seq2seq model.[‡ 43] In 2015, Kelvin Xu et al. applied the attention mechanism as used in the seq2seq model to image captioning,[57] citing Bahdanau et al. 2014.[58]
One problem with seq2seq models was their use of recurrent neural networks, which cannot be parallelized, as both the encoder and the decoder process the sequence token by token. Decomposable attention attempted to solve this problem by processing the input sequence in parallel, before computing a "soft alignment matrix"; "alignment" is the terminology used by Bahdanau et al. 2014.[‡ 98] This allowed parallel processing.[citation needed]
The idea of using the attention mechanism for self-attention, instead of in an encoder-decoder (cross-attention), was also proposed during this period, such as in differentiable neural computers and neural Turing machines.[‡ 99] Using an attention mechanism for self-attention was termed intra-attention by Jianpeng Cheng et al.[‡ 100] Intra-attention occurs where an LSTM is augmented with a memory network as it encodes an input sequence.
These strands of development were combined in the transformer architecture, published in "Attention Is All You Need" in 2017. Subsequently, attention mechanisms were extended within the framework of the transformer architecture.
Seq2seq models with attention still suffered from the same issue as other recurrent networks: they are hard to parallelize, which prevented them from being accelerated on GPUs. In 2016, decomposable attention applied an attention mechanism to a feedforward network, which is easy to parallelize.[‡ 101] One of its authors, Jakob Uszkoreit, suspected that attention without recurrence is sufficient for language translation, thus the title "attention is all you need".[59]
In 2017, the original (100M-sized) encoder-decoder transformer model was proposed in the "Attention is all you need" paper. The focus of that research was on improving seq2seq for machine translation by removing its recurrence to process all tokens in parallel while preserving its dot-product attention mechanism to keep its text processing performance,[‡ 3] which were important factors in its widespread use in large neural networks.[60]
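The dot-product attention at the heart of that model is usually written as softmax(QKᵀ/√d_k)V. The sketch below implements this formula for small lists of vectors; it shows a single attention operation without the learned projections, multiple heads, or masking of the full model, and the function names and example matrices are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; each query row is handled independently."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)  # how strongly this query attends to each key
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs


K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
Q = [[10.0, 0.0], [0.0, 10.0]]

out = attention(Q, K, V)
print(out)  # first row is close to V[0], second row is close to V[1]
```

No query depends on the result for another query, so all rows can be computed at the same time, which is the property that recurrence lacks.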
Self-organizing maps (SOMs) were described by Teuvo Kohonen in 1982.[61][‡ 102] SOMs are neurophysiologically inspired[62] artificial neural networks that learn low-dimensional representations of high-dimensional data while preserving the topological structure of the data. They are trained using competitive learning.
SOMs create internal representations reminiscent of the cortical homunculus, a distorted representation of the human body, based on a neurological "map" of the areas and proportions of the human brain dedicated to processing sensory functions, for different parts of the body.
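Competitive learning in a self-organizing map can be sketched with a one-dimensional chain of units fitted to scalar data. For each input, the unit with the closest weight wins, and the winner and its neighbors on the chain are moved toward the input. The decay schedules, the Gaussian neighborhood, the starting weights, and the data below are assumptions made for this illustration.

```python
import math


def train_som(data, weights, epochs=30, lr0=0.5, sigma0=1.5):
    """Fit a 1-D chain of units to scalar inputs by competitive learning."""
    w = list(weights)
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # decays from 1 toward 0
        lr = lr0 * frac
        sigma = max(sigma0 * frac, 0.3)      # neighborhood shrinks over time
        for x in data:
            # Competition: the best-matching unit is the closest weight.
            bmu = min(range(len(w)), key=lambda k: abs(x - w[k]))
            for k in range(len(w)):
                # Cooperation: neighbors on the chain move too, but less.
                h = math.exp(-((k - bmu) ** 2) / (2.0 * sigma ** 2))
                w[k] += lr * h * (x - w[k])
    return w


data = [0.05, 0.95, 0.5, 0.2, 0.8, 0.35, 0.65]
units = train_som(data, [0.4, 0.45, 0.5, 0.55, 0.6])
print(units)
```

Because neighboring units are updated together, units that are adjacent on the chain tend to end up representing nearby regions of the input, which is the sense in which the map preserves topology.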
During 1985–1995, inspired by statistical mechanics, several architectures and methods were developed by Terry Sejnowski, Peter Dayan, Geoffrey Hinton, and others, including the Boltzmann machine,[‡ 103] restricted Boltzmann machine,[63] Helmholtz machine,[‡ 104] and the wake-sleep algorithm.[‡ 105] These were designed for unsupervised learning of deep generative models. However, they were more computationally expensive than backpropagation. The Boltzmann machine learning algorithm, published in 1985, was briefly popular before being eclipsed by the backpropagation algorithm in 1986.[64]: 112
Geoffrey Hinton et al. (2006) proposed learning a high-level internal representation using successive layers of binary or real-valued latent variables with a restricted Boltzmann machine (RBM)[65] to model each layer. This RBM is a generative stochastic feedforward neural network that can learn a probability distribution over its set of inputs. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top level feature activations.[‡ 106][66]
In 2012, Andrew Ng and Jeff Dean created a neural network that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos.[‡ 107][importance?]
Knowledge distillation or model distillation is the process of transferring knowledge from a large model to a smaller one. The idea of using the output of one neural network to train another neural network was studied as the teacher-student network configuration.[67] In 1992, several papers studied the statistical mechanics of teacher-student network configuration, where both networks are committee machines[‡ 108][‡ 109] or both are parity machines.[‡ 110]
Another early example of network distillation was also published in 1992, in the field of recurrent neural networks (RNNs). The problem was sequence prediction. It was solved by two RNNs. One of them ("atomizer") predicted the sequence, and another ("chunker") predicted the errors of the atomizer. Simultaneously, the atomizer predicted the internal states of the chunker. Once the atomizer managed to predict the chunker's internal states well, it would start fixing the errors, and soon the chunker became obsolete, leaving just one RNN in the end.[‡ 111]
A related methodology was model compression or pruning, where a trained network is reduced in size. It was inspired by neurobiological studies showing that the human brain is resistant to damage, and was studied in the 1980s, via methods such as Biased Weight Decay[‡ 112] and Optimal Brain Damage.[‡ 113]
The development of metal–oxide–semiconductor (MOS) very-large-scale integration (VLSI), combining millions or billions of MOS transistors onto a single chip in the form of complementary MOS (CMOS) technology, enabled the development of practical artificial neural networks in the 1980s.[68]
Computational devices were created in CMOS, for both biophysical simulation and neuromorphic computing inspired by the structure and function of the human brain. Nanodevices[69] for very large scale principal components analyses and convolution may create a new class of neural computing because they are fundamentally analog rather than digital (even though the first implementations may use digital devices).[70]