How Do Neural Networks Work?

There are various types of neural network designed for different uses:

  • Feedforward Neural Networks (FNNs) are the simplest. Information flows forward only through a number of layers. The nodes in each layer are connected to all the nodes in the previous layer.
  • Convolutional Neural Networks (CNNs) are designed for image and video recognition. They use convolution in one or more layers. Convolution is a mathematical operation on two functions that expresses how the shape of one is modified by the other.
  • Recurrent Neural Networks (RNNs) are designed to process serial data and involve loops so that information can be carried forward from one time step to the next. This enables patterns in time-based data to be found.
  • Long Short-Term Memory Networks (LSTMs) are a type of RNN designed to handle long-term dependencies in data.
  • Autoencoder Neural Networks are used for data compression, noise removal and anomaly detection.
  • Generative Adversarial Networks (GANs) pair a generator, which creates new data resembling the existing data, with a discriminator that tries to distinguish between the two.
  • There are others including deep belief networks, self-organising maps and Hopfield networks.

However, AI is in the news every day because of ChatGPT (Generative Pre-trained Transformer) and so I will focus on describing this type of neural network.

There are three stages in the life of a neural network. It is first programmed by a team of programmers who do not specify any words, data, information, goals or knowledge. They create a set of algorithms that manipulate complex data structures, mostly matrices (that is, tables of numbers), which will have data added during the second stage.

The second stage is training, which results in the matrices being filled not with words but with numbers. Think of a person learning: at the lowest level, all that happens is that the propensity for a synaptic connection to fire is enhanced or diminished. In this sense people do not learn words or facts; rather, their brains are altered in subtle ways that modify future electrical activity. It is important to realise that the neural network does not contain the words it was trained on. It is like a person who reads a novel and cannot remember any sentences or phrases from it but can describe the plot in detail.

The third stage is using the neural network. Many neural networks are fixed at the end of the training stage and are not modified as a result of use, which is why GPT is described as ‘pre-trained’. Among other benefits, this prevents malevolent users from corrupting the system. In use, the system transforms an input sequence of words into an output sequence of words.

The transformation process starts by converting the input sequence of words into a sequence of tokens. A token is a number representing a word, a part of a word (such as a word root or a word ending) or a punctuation character. Words are converted to tokens by looking them up in a dictionary, typically of about 50,000 words and parts of words such as ‘ing’ or ‘ed’. Upper and lower case are handled by giving each variant its own token, such as ‘next’, ‘Next’ and ‘NEXT’, plus these words with a space in front.
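
As a rough illustration of tokenisation, here is a small sketch assuming the open-source tiktoken library, which implements the byte-pair-encoding dictionary used by GPT-2 and GPT-3 (the exact token IDs printed depend on that dictionary):

```python
# Sketch only: assumes the open-source `tiktoken` package is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # the roughly 50,000-entry GPT-2/GPT-3 dictionary

# The same word in different cases, with or without a leading space, gets different token IDs.
for text in ["next", " next", "Next", "NEXT"]:
    print(repr(text), "->", enc.encode(text))

# A word not in the dictionary is split into sub-word pieces.
pieces = [enc.decode([token_id]) for token_id in enc.encode("unbelievably")]
print(pieces)
```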

The tokens are then converted into vectors by a process called input embedding. The vector for each token is a large data structure, typically with between 768 and over 4,000 entries, called dimensions. The number of dimensions depends on the LLM and is a balance between capturing complexity and fitting the training data so closely that the model cannot perform well when given new data, the problem known as overfitting. Note that each token has a pre-trained and then fixed set of, say, 768 entries, so the embedding matrix is the size of the dictionary, say 50,000, multiplied by 768.
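
A minimal sketch of the embedding lookup, assuming a 50,000-token dictionary and 768 dimensions (the matrix here is random rather than learned, and the token IDs are invented, purely to show the mechanics):

```python
import numpy as np

vocab_size, d_model = 50_000, 768              # dictionary size x number of dimensions
rng = np.random.default_rng(0)

# In a real LLM this matrix is learned during training and then fixed; here it is random.
embedding_matrix = rng.normal(size=(vocab_size, d_model))

token_ids = [464, 1893, 11]                    # hypothetical token IDs for a short input
vectors = embedding_matrix[token_ids]          # one 768-dimensional vector per token
print(vectors.shape)                           # (3, 768)
```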

The input sequence of words has been turned into a set of vectors and we next add information about the position of each token, called positional encoding. By the way, a vector is simply a sequence of numbers and can be thought of as a point in a multidimensional space.
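
One widely used scheme is the sinusoidal positional encoding from the original Transformer paper; GPT models actually learn their position vectors during training, so treat this as an illustrative sketch only:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """One d_model-wide vector per position, built from sine and cosine waves."""
    positions = np.arange(seq_len)[:, None]      # (seq_len, 1)
    dims = np.arange(d_model)[None, :]           # (1, d_model)
    angles = positions / np.power(10_000, (2 * (dims // 2)) / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return encoding

pe = positional_encoding(seq_len=3, d_model=768)
# token_vectors = token_vectors + pe             # added to the embedding vectors from above
print(pe.shape)                                  # (3, 768)
```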

We now get to the heart of the process. This consists of a sequence of encoder layers, also called transformer layers, each of which has a self-attention mechanism followed by a feedforward network. The network consists of nodes, which are equivalent to neurons, and each node has parameters, consisting of weights and biases, learned during the training stage. To give you some idea of the size, GPT-3 has 175 billion parameters and 96 transformer layers. The neural network is very wide and not very deep.

The self-attention mechanism is key to the efficiency of the overall design. We first take the vector representing each token and convert it into three different vectors, a query vector, a key vector and a value vector. This is done by multiplying the vector representing each token by a query weight matrix, a key weight matrix and a value weight matrix learnt during the training stage by a process called backpropagation and gradient descent, of which more later. Note that for efficiency and to improve its ability to generalise there is only one set of these three weight matrices for the entire input sequence.

These three vectors are important, but their names do not give a good idea of their function.

  • The query vector captures the intent or purpose of the input text. It is compared with the key vectors to determine how much attention or importance to place on each of the other tokens. In other words, it is used to determine the relevance of the different parts of the input text.
  • The key vector represents the associations between the different concepts or words in the input text.
  • The value vector carries the information and meaning about a particular word or concept.

When combined with the associated matrices they enable the model to compute attention scores that determine how much focus or importance should be placed on different values when generating an output.

Next, the model calculates what are called scores. For each pair of tokens, we multiply the corresponding elements of the query vector of one and the key vector of the other and add the results together (this is called taking the dot product), which gives a score. If this gives a big number then the two tokens are closely linked; for example, in the sentence “John kicked the ball and it broke the window”, “it” is closely linked to “ball”. We then have a set of scores which are normalised by turning them into probabilities that add up to one using what is called a softmax function. This tells us how much attention each token should pay to each other token. Finally, each value vector is multiplied by the normalised score for the corresponding token and the results are summed to produce the output vector for each token. This complex process captures dependencies between tokens in an efficient manner, even when they are far apart.
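
Putting those steps together, here is a minimal single-head self-attention sketch in NumPy. The weight matrices are random stand-ins for the learned ones, and the division by the square root of the key size is the scaling used in the original Transformer paper (omitted from the description above for simplicity):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # query, key and value vectors for every token
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # dot product of every query with every key
    weights = softmax(scores, axis=-1)          # each row sums to one: how much attention to pay
    return weights @ V                          # weighted sum of the value vectors

rng = np.random.default_rng(0)
d_model = 64                                    # small for illustration
X = rng.normal(size=(9, d_model))               # e.g. the 9 tokens of "John kicked the ball and it broke the window"
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)      # (9, 64): one output vector per token
```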

Next, each token from the self-attention mechanism is passed through a feed-forward neural network. A feed-forward network, as mentioned above, is one where each node is connected to all the nodes in the previous layer. Three processes are performed. First, a weight matrix is applied to the output of the self-attention layer, followed by a bias addition. The output is then passed through a non-linear activation function. There are many functions that can be used, but OpenAI found that a Gaussian Error Linear Unit (GELU) function leads to better performance. The details of the function are not important at this simplified level; what matters is that it is non-linear, in other words its graph is a curve, not a straight line. This stage is completed by applying another weight matrix, again learnt by backpropagation and gradient descent.
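
A sketch of that feed-forward block, using the common tanh approximation of GELU. The sizes and random weights are illustrative only; in practice the inner layer is typically four times wider than the model and the weights are learned:

```python
import numpy as np

def gelu(x):
    # Widely used tanh approximation of the Gaussian Error Linear Unit.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    hidden = gelu(x @ W1 + b1)                  # weight matrix, bias addition, non-linear activation
    return hidden @ W2 + b2                     # second learned weight matrix (and bias)

rng = np.random.default_rng(0)
d_model, d_hidden = 64, 256                     # inner layer four times wider than the model
W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)
x = rng.normal(size=(9, d_model))               # one vector per token from the self-attention step
print(feed_forward(x, W1, b1, W2, b2).shape)    # (9, 64)
```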

Within each encoder layer the output of these steps is also normalised, and the layer’s input is added back to the result. This is known as a residual connection; it prevents gradients from disappearing and so allows deep networks to work effectively.
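
Combining the pieces, a single encoder layer can be sketched as below. This reuses the self_attention and feed_forward functions from the earlier sketches; layer_norm and the two additions are the normalisation and residual connections just described:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, attention_weights, ffn_weights):
    # Residual connections: add the input of each step back to its output,
    # which stops gradients from vanishing in very deep networks.
    x = layer_norm(x + self_attention(x, *attention_weights))
    x = layer_norm(x + feed_forward(x, *ffn_weights))
    return x
```

A GPT-style model simply stacks many of these layers (96 in GPT-3), feeding the output of one into the next.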

Finally, the output of the final layer is transformed to produce a sequence of what are called logits. These are unnormalised scores for each possible next word or token. The logits are normalised using a softmax function to produce the probabilities of the possible next tokens.

The softmax function converts a vector of real numbers into a vector of probabilities all of which add up to one.
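
For example, with a toy five-token vocabulary and invented logits (the numbers are purely illustrative):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())           # subtract the max for numerical stability
    return e / e.sum()

vocab = ["obey", "troll", "ball", "window", "robot"]   # toy vocabulary
logits = np.array([4.1, 0.3, 1.2, -0.5, 2.0])          # unnormalised scores from the final layer
probs = softmax(logits)
print(dict(zip(vocab, probs.round(3))))                # probabilities that sum to one
print("most likely next token:", vocab[int(np.argmax(probs))])
```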

I skipped over gradient descent, which is critical during the learning stage. When learning, the output produced is compared with an optimal output derived from, or selected by, a human. The difference between the output produced and the ideal is the error. The best way to think about this is by drastically simplifying the multi-dimensional space to three dimensions. The aim is to minimise the error, which is equivalent in three dimensions to finding the lowest point on the surface that represents the range of errors. Simplistically, this can be done at any point by finding the direction which gives the steepest slope downwards. This has problems; for example, the process could become stuck at the bottom of a dip high up a mountain. One way to overcome this is to use a process that does not result in a precise point but a point somewhere in an area. The hope is that the area is large enough to get over the lip of the dip and proceed downwards to a lower point. This is usually done by randomly selecting a small batch of data rather than the entire data set, which also improves the efficiency of the process.

The gradient for each layer is calculated and used to adjust the weights for that layer. The result is then fed back to the previous layer and the process repeated until the initial layer is reached.
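
A minimal sketch of the idea, using mini-batch gradient descent to fit a single weight and bias to noisy data (the data, learning rate and batch size are invented for illustration; real networks apply the same principle to billions of parameters via backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=500)
y = 3.0 * X + 1.0 + rng.normal(scale=0.1, size=500)   # "ideal" outputs: y = 3x + 1 plus noise

w, b, learning_rate = 0.0, 0.0, 0.1                   # start from arbitrary parameters

for step in range(2_000):
    batch = rng.choice(500, size=32, replace=False)   # a small random batch, not the whole data set
    xb, yb = X[batch], y[batch]
    error = (w * xb + b) - yb                         # difference between the output and the ideal
    grad_w = 2 * np.mean(error * xb)                  # slope of the squared error with respect to w
    grad_b = 2 * np.mean(error)                       # ...and with respect to b
    w -= learning_rate * grad_w                       # step downhill in the steepest direction
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))                       # approaches 3.0 and 1.0
```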

It requires millions of iterations at the start of the learning process to adjust the weights to a meaningful set. Further training of a pre-trained system is a much quicker process because, once the weights have been learnt, additional knowledge results in only relatively small changes.

TRAINING

GPT uses Reinforcement Learning from Human Feedback (RLHF). Version 3.5 has 175 billion parameters, which require about 800GB of storage. It is not clear exactly what type of neural network ChatGPT uses, but it is believed to have about 100 layers, is feedforward and uses the transformer technology developed by Google in 2017. It is like an RNN but processes the entire input all at once, assigning weights to differentiate the significant parts. Transformers are now replacing RNN models such as LSTMs for all Natural Language Processing (NLP) tasks and, recently, for image recognition. They are able to process large datasets in parallel and so are faster.
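
As a rough check on the storage figure, assuming each parameter is held as a 32-bit number (the precision actually used is an implementation detail):

```python
parameters = 175e9          # 175 billion weights and biases
bytes_per_parameter = 4     # 32-bit floating point
print(parameters * bytes_per_parameter / 1e9, "GB")   # 700.0 GB, roughly the quoted 800GB once overheads are included
```
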
For ChatGPT, the first step is that a “prompt dataset” is defined, for example, all the useful text across the internet (including Wikipedia, scientific papers, informative websites and so on). Sequences of text are fed into the system and turned into tokens. The sequences are fed through the neural network to produce an output, which is compared with a required output. An error (based on the difference) is calculated and used to adjust the weights backwards through the network. This is called Supervised Fine Tuning (SFT). For example, take the sentence “The second law of robotics states a robot must obey humans”. Select an input sequence such as “a robot must”; the required output is “obey”. Initially the network could output anything (as it is initialised randomly). It might output “troll”; the error is calculated and used to adjust all the weights in the network before trying again. This continues, processing all the text in the sample set, e.g. the internet, as short sequences and trying to predict the next word. This is done millions of times.
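
A sketch of the error calculation for that example, using a toy vocabulary and the cross-entropy loss commonly used for next-word prediction (all of the numbers are invented for illustration):

```python
import numpy as np

vocab = ["obey", "troll", "humans", "robot", "law"]    # toy vocabulary
target = vocab.index("obey")                           # the required next token

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Early in training the network's outputs are effectively random:
logits = np.array([0.1, 2.3, -0.4, 0.7, 0.2])          # "troll" currently scores highest
probs = softmax(logits)
loss = -np.log(probs[target])                          # cross-entropy: large when "obey" is thought unlikely
print(vocab[int(np.argmax(probs))], round(float(loss), 2))
# Backpropagation adjusts the weights to reduce this loss, then the next sequence is processed.
```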

The prompt dataset is then used to produce four to nine different outputs and a team of humans rank the answers from best to worst. This is called training the Reward Model (RM) using Reinforcement Learning from Human Feedback (RLHF).

The third and final step is to use the RM to change the weights using Proximal Policy Optimisation (PPO) to fine-tune the Supervised Policy that was initially created by a team of people (the labellers) showing the model the desired output for a selected prompt.
Each question is processed before it is input to the network using an algorithm called tokenising. This involves splitting the text into words and sub-word pieces, with punctuation characters receiving their own tokens. Each token is a unique number called an index or ID; for example, ‘the’ is 464 in GPT-3. The sentence or sentences become a vector of these indexes. GPT-3 has a vocabulary of about 50,000 tokens. Every token is represented (in GPT-3) by a vector of 768 numbers. The output is produced by converting the final vector from the last hidden layer into a score for every token in the vocabulary (these scores are called logits); the scores are turned into probabilities and the highest is picked.

Each weight and bias in a neural network is a single parameter. The input layer has a few hundred or a few thousand nodes, and its job is to feed the hidden layers, which may number around a hundred and contain billions of parameters between them.

One of the most important advances in the development of conversational systems was a 2017 paper from Google by Vaswani et al. called “Attention Is All You Need”. The important concept was Attention, which improved the speed and introduced ‘meaning’ in the sense of incorporating references between the words used. The other important concept was the Transformer, a machine learning architecture. Since then, one version of the Transformer model called BERT (Bidirectional Encoder Representations from Transformers) has dominated. Google called it “one of the biggest leaps forward in the history of search”. The Transformer-based model has recently been extended to vision systems. See “The Illustrated Transformer” by Jay Alammar (jalammar.github.io). The original Transformer model is encoder/decoder based (with six encoder and six decoder blocks); BERT uses only the encoder stack, while GPT models use only the decoder stack. GPT-2 has 36 blocks.

All GPT models have input and output transformers which encode and decode and introduce ‘meaning’ with a multi-head attention mechanism. The encoder produces a query, key and value vector for each token. It takes the dot product of each query vector with each token’s key vector, normalises the resulting scores by feeding them into a softmax function (described above), and produces the final vector, which gives the importance of each token, by multiplying these weights by each token’s value vector and summing.
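
A sketch of the “multi-head” part: the query, key and value vectors are split into several smaller heads, each attending independently, and the results are concatenated (sizes and weights are illustrative; real models also apply a further learned output matrix to the concatenated result):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, n_heads):
    d_model = X.shape[1]
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outputs = []
    for h in range(n_heads):                              # each head attends independently
        cols = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, cols] @ K[:, cols].T / np.sqrt(d_head)
        outputs.append(softmax(scores) @ V[:, cols])
    return np.concatenate(outputs, axis=-1)               # heads are concatenated back together

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 64))
Wq, Wk, Wv = (rng.normal(size=(64, 64)) for _ in range(3))
print(multi_head_attention(X, Wq, Wk, Wv, n_heads=8).shape)   # (9, 64)
```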

Vector embedding is a way to represent data as points in a multi-dimensional space designed so that data points with a similar meaning are close together. One embedding is Word2Vec, published by Google in 2013. Embeddings can have thousands of dimensions. Data points can be words, e.g. find the nearest ten words to ‘king’, or even images, faces, movies, books or news items, so we can find the ten images, faces, movies, books or news items that are closest to this one. This is done by computing similarity scores using Euclidean distance, dot product, cosine distance or other functions. It can also be used to correct typos, as misspelt words have high similarity scores to the intended word. Another neat trick is that directions in the space correspond to aspects of meaning such as gender or tense, and so we can solve analogies such as “man is to woman as king is to __”.
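
A sketch of the similarity and analogy arithmetic using cosine similarity. The four-dimensional vectors below are invented toy values (real embeddings come from a trained model and have hundreds or thousands of dimensions):

```python
import numpy as np

# Hypothetical toy embeddings, hand-made so that one direction encodes royalty and another gender.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.3]),
    "queen": np.array([0.9, 0.1, 0.1, 0.3]),
    "man":   np.array([0.1, 0.8, 0.0, 0.2]),
    "woman": np.array([0.1, 0.1, 0.0, 0.2]),
}

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "man is to woman as king is to __": king - man + woman should land near queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda word: cosine_similarity(vectors[word], target))
print(best)   # queen
```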

ChatGPT model behaviour guidelines are written for the humans who mark the ChatGPT output so that it learns how to behave. They include prohibitions on hate speech, harassment, glorifying violence, self-harm, promoting sexual services, describing sexual activity, expressing political opinions and generating malware. If asked about ‘culture war’ topics it is taught to describe different points of view and break down topics into informational questions, and not to answer questions that could lead to a massive loss of life or promote dangerous ideas unless it is describing them from a historical perspective. It will ‘write an argument for x’ if it is not dangerous, e.g. ‘write an argument in favour of burning more fossil fuels’. It will not judge one group or opinion as good or bad or take either side. It avoids answering questions with a false premise, e.g. “Why did Napoleon want to invade Puerto Rico?” A detailed knowledge of the model behaviour guidelines would make it easier for a bad actor to circumvent them, so they should be kept secret. Independent review of adversarial inputs and their ability to jailbreak the system has found that all models are susceptible to hand-crafted adversarial inputs, although currently GPT-4 is the best-performing (https://doi.org/10.48550/arXiv.2311.04235).

ChatGPT has been accused of stealing information from the internet (text and images) but I believe this is a misunderstanding. If a person learns to perform a task better while working in a company and uses that skill in another company, it is not stealing. Stealing involves the literal taking of information, such as a list of customers. ChatGPT does not ‘load’ the internet into its neural network or remember any information literally. It learns by adjusting weights that help it predict likely outcomes, such as the next word in a reply. It is trained on a vast corpus of text-based information to predict the next word in a sequence given the previous words, i.e., the overall context. It uses self-attention (i.e., taking the whole context of the input into account) and positional encoding (encoding the relative position of each token). It is trained using backpropagation and stochastic gradient descent (described above). It can generate textual responses based on these learned relationships, but it does not have access to the internet or any other external data sources during its operation. One caveat is that some systems, such as Bard Chat, construct searches to check the text generated and then provide footnotes linking to the pages found. This improves the accuracy of the system. It is like an academic reference and is an acknowledgement of the source of the information. It also reinforces the point that the information is not stored in the LLM, as it must search for the references after it has generated the output text.

When we say LLMs simply “predict the next word” to generate the output, that phrase makes it sound statistical and blind. However, if we say it knows the next word given the context of the overall chat, the question, and the output so far, then it seems very similar to what humans do. As we speak in a normal conversation, the next word magically appears, seemingly without thought, i.e., we do not have to explicitly think about what word to use next. Our human neural network fires neurons resulting in the next word appearing in our vocal output. We say we ‘understand’ the question and have a model of the world which we can explain, but where is the understanding and where is the model? They are simply new sentences we are capable of uttering a word at a time. When we say we understand something we are expressing a belief that, if asked about that subject, we are able to answer questions correctly.

CREATIVITY

One misunderstanding I feel needs to be addressed at this point. I have heard many people say that computer systems cannot be creative as they are pre-programmed or, more subtly, that LLMs are fixed in their ways once trained. However, neural networks can be designed to be original and very creative.

What do I mean by creativity? In 2015, Google released DeepDream, which turned image recognition around. Instead of recognising what is in an image, it was asked to “enhance an input image in such a way as to elicit a particular interpretation”. It was given a picture of a cloud (Leonardo da Vinci used moss) and asked, “Whatever you see there, I want more of it!” It found birds, animals, castles, fish and so on in the images. In other words, it was creating original images from a random or semi-random input. This tendency to see images in random objects is known in humans as pareidolia. The question “Do Androids Dream of Electric Sheep?” comes to mind, the title of Philip K. Dick’s 1968 science fiction novel.

A more up-to-date example is simply asking ChatGPT to write a story or a poem. It is not a copy of what it found on the internet, nor even a cut-and-paste of bits of what it found. It is creating something new based on an ‘understanding’ (i.e. a weighted neural network of probabilities) of everything in its training set. Sometimes its poetry or its stories are described as not very good, but they are as good as or better than those of the average person, and we are just at the beginning of the possibilities for large language models like ChatGPT. Again, we return to where to set the bar that LLMs must jump over. We would not demand that a person write better poems than Shakespeare before agreeing they were a poet. The question to ask is: are they useful?