Attention mechanisms in neural networks are about memory access. That’s the first thing to remember about attention: it’s something of a misnomer. This YouTube video compares attention mechanism to convolutional networks and recurrent neural networks, explains which problems attention solves better than those other algorithms, and then shows how various attention mechanisms are implemented in Deeplearning4j.
“Attention” is defined as the “active direction of the mind to an object”, or more literarily as “giving heed”.1 The word describes the mind’s ability to allocate consideration unevenly across a field of sensation, thought and proprioception, to focus and bring certain inputs to the fore, while ignoring or diminishing the importance of others. So for neural networks, we’re basically talking about credit assignment. And the challenge with credit assignment, almost universally, are long-range dependencies; i.e. things that impact your predictions, but which happened a long time ago in a galaxy far, far way.
At any given moment, our minds concentrate on a subset of the total information available to them. For example, you are reading these words as a larger world flows around you: maybe you’re in a room with traffic coursing outside, maybe you’re in a plane and the pilot is making another annoying announcement about turbulence, but your focus is HERE.
This is important, because the field of sensation is wide, and the mind’s bandwidth to process information is narrow, and some inputs are indeed more important that others, with regard to any given goal. Just as a student of Buddhism channels their own attention to attain enlightenment, or an artist channels the attention of others to evoke emotion, a neural network can channel its attention to maximize the accuracy of its predictions.
Attention is, in a sense, the mind’s capital, the chief resource it can allocate and spend. Algorithms can also allocate attention, and they can learn how to do it better, by adjusting the weights they assign to various inputs. Attention is used for machine translation, speech recognition, reasoning, image captioning, summarization, and the visual identification of objects.
The fundamental task of all neural networks is credit assignment. Credit assignment is allocating importance to input features through the weights of the neural network’s model. Learning is the process by which neural networks figure out which input features correlate highly with the outcomes the net tries to predict, and their learnings are embodied in the adjusted quantities of the weights that result in accurate decisions about the data they’re exposed to.
But there are different ways to structure and channel the relationship of input features to outcomes. Feed-forward networks are a way of establishing a relationship between all input features (e.g. the pixels in a picture) and the predictions you want to make about the input (e.g. this photo represents a dog or a cat), and doing so all at the same time.
When we try to predict temporal sequences of things, like words in a sentence or measurements in a time series (e.g. temperatures or stock prices), we channel inputs in other ways. For example, a recurrent neural network like an LSTM is often used, since it takes account of information in the present time step as well as the context of past time steps. Below is one way to think about how a recurrent network operates: at each time step, it combines input from the present moment, as well as input from the memory layer, to make a decision about the data.
RNNs cram everything they know about a sequence of data elements into the final hidden state of the network. An attention mechanism takes into account the input from several time steps, say, to make one prediction. It distributes attention over several hidden states. And just as importantly, it accords different weights, or degrees of importance, to those inputs, reflected below in the lines of different thicknesses and color. In neural networks, attention is a memory-access mechanism.
The original work on a basic attention mechanism represented a leap forward for machine translation. That advance, like many increases in accuracy, came at the cost of increased computational demands. With attention, you didn’t have to fit the meaning of an entire English phrase into a single hidden state that you would translate to French.
Another way to think about attention models is like this:
Let’s say you are trying to generate a caption from an image. Each input could be part of an image fed into the attention model. The memory layer would feed in the words already generated, the context for future word predictions. The attention model would help the algorithm decide which parts of the image to focus on as it generated each new word (it would decide on the thickness of the lines), and those assignments of importance would be fed into a final decision layer that would generate a new word.
Above, a model highlights which pixels it is focusing on as it predicts the underlined word in the respective captions. Below, a language model highlights the words from one language, French, that were relevant as it produced the English words in the translation. As you can see, attention provides us with a route to interpretability. We can render attention as a heat map over input data such as words and pixels, and thus communicate to human operators how a neural network made a decision. (This could be the basis of a feedback mechanism whereby those humans tell the network to pay attention to certain features and not others.)
In autumn 2017, Google removed the attention mechanism from recurrent networks and showed that it could outperform RNNs alone, with an architecture called Transformer.
You could say that attention networks are a kind of short-term memory that allocates attention over input features they have recently seen. Attention mechanisms are components of memory networks, which focus their attention on external memory storage rather than a sequence of hidden states in an RNN.
Memory networks are a little different, but not too. They work with external data storage, and they are useful for, say, mapping questions as input to answers stored in that external memory.
That external data storage acts as an embedding that the attention mechanism can alter, writing to the memory what it learns, and reading from it to make a prediction. While the hidden states of a recurrent neural network are a sequence of embeddings, memory is an accumulation of those embeddings (imagine performing max pooling on all your hidden states – that would be like memory).
1) “Heed”, in turn, is related to the German hüten, “to guard or watch carefully”.
There’s an old Zen story: a student said to Master Ichu, “Please write for me something of great wisdom.” Master Ichu picked up his brush and wrote one word: “Attention.” The student said, “Is that all?” The master wrote, “Attention. Attention.” The student became irritable. “That doesn’t seem profound or subtle to me.” In response, Master Ichu wrote simply, “Attention. Attention. Attention.” In frustration, the student demanded, “What does this word ‘attention’ mean?” Master Ichu replied, “Attention means attention.”