Transformers
A famous neural network architecture, presented by Google in 2017; many things have not been the same since.
We have been explaining that LLMs are NNs whose objective is to generate the next word(s) from an input sequence, and so to be able to create new text content.
One way to implement them is using the Recurrent Neural Network (RNN) architecture:
But they have a very important problem: although they have an internal memory, when handling long chains of words the influence of the most distant words on the generation of the new word fades until it vanishes.
In a sentence, when we consider its global meaning or semantics, some words are more important than others. It is interesting to take this into account and pay 'attention' to this effect. This is the idea behind the attention mechanism in neural network architectures: measuring the importance of a word (element) with respect to the other words (elements).
So, word embeddings alone, the semantic classification of isolated words, are not enough for powerful NLP tasks. Some kind of 'attention' mechanism must be added, which means that the word embedding will be complemented by combining it with attention vectors (OK, we will skip the mathematical background, dot products, etc.)
With no complex maths:
Here is an example of a sentence and how attention is added to the word embedding in the sentence:
Sentence: "The cat sat on the mat."
Without attention, a traditional word embedding method would convert each word in the sentence into a fixed-size vector representation and pass them through the model. However, with attention, the model would also calculate attention weights for each word in the sentence. For example, the attention weights for this sentence might be:
"The" - 0.1
"cat" - 0.4
"sat" - 0.3
"on" - 0.05
"the" - 0.05
"mat" - 0.1
These attention weights indicate that the model is giving more importance to the words "cat" and "sat" in relation to the other words in the sentence, because those words matter more for understanding the meaning of the sentence.
The model then uses these attention weights to weigh the importance of each word vector when processing the sentence.
This allows the LLMs to focus more heavily on certain words in the input text, based on the context, which helps the model to understand the meaning of the text, and solve NLP tasks in a much more effective way.
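The weighting described above can be sketched in a few lines of numpy. The embedding vectors below are made-up illustrative values; only the attention weights come from the example above. The sentence representation is simply the attention-weighted sum of the word vectors:

```python
import numpy as np

# Toy 4-dimensional word embeddings for each token in the sentence.
# These vectors are invented for illustration only.
embeddings = {
    "The": np.array([0.1, 0.3, 0.2, 0.0]),
    "cat": np.array([0.9, 0.1, 0.4, 0.7]),
    "sat": np.array([0.2, 0.8, 0.5, 0.1]),
    "on":  np.array([0.0, 0.1, 0.1, 0.0]),
    "the": np.array([0.1, 0.3, 0.2, 0.0]),
    "mat": np.array([0.6, 0.2, 0.3, 0.5]),
}

# The attention weights from the example above (they sum to 1).
weights = {"The": 0.1, "cat": 0.4, "sat": 0.3, "on": 0.05, "the": 0.05, "mat": 0.1}

# Words with higher weights ("cat", "sat") contribute more to the result.
sentence_vector = sum(weights[w] * embeddings[w] for w in weights)
print(sentence_vector)
```

In a real transformer these weights are not fixed: they are computed from the words themselves (via dot products of learned query and key vectors), so the same word can receive different attention depending on its context.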
This technique also avoids another drawback of RNNs, processing word after word (sequentially), which is not desirable if we want to make the best of GPUs and TPUs. Positional encoding allows the position of each word to be added into the final embedding, so that sentences can be processed in parallel.
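The sinusoidal positional encoding proposed in 'Attention Is All You Need' can be sketched directly from its formulas, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)):

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'.

    Even dimensions use sine, odd dimensions use cosine, with wavelengths
    forming a geometric progression controlled by the dimension index.
    """
    positions = np.arange(num_positions)[:, np.newaxis]        # shape (pos, 1)
    div_terms = 10000 ** (np.arange(0, d_model, 2) / d_model)  # shape (d_model/2,)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)
    pe[:, 1::2] = np.cos(positions / div_terms)
    return pe

# Each position gets a unique vector that is simply added to the word
# embedding, so the model can process all tokens in parallel and still
# know their order.
pe = positional_encoding(num_positions=6, d_model=8)
```

Because the encoding is added (not concatenated), it must have the same dimension as the word embeddings.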
Why is everybody talking about GPT, ChatGPT, etc.? Well, some of their results were not really anticipated. Transformers, after being trained on huge corpora to implement huge LLMs, were expected to be useful... but not that much!
These are known as emergent abilities.
Emergent abilities refer to the capacity of large language models, such as transformer models, to learn and generate new knowledge and insights from the data they are trained on. Here is a list of some emergent abilities that large language models can exhibit:
Zero-shot learning: the ability to understand and generate new concepts and ideas that were not explicitly included in their training data.
Few-shot learning: the ability to learn and generalize from a small number of examples. If a task produces no result zero-shot, try providing a few solved examples first.
Compositionality: the ability to understand the meaning of individual words and phrases, and then use this understanding to generate new and complex meaning by combining them in different ways.
Generative abilities: the ability to generate new and original text, images, and other forms of data based on the patterns and relationships it has learned from the training data.
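The difference between zero-shot and few-shot prompting can be illustrated with a small sketch. The task and the reviews below are made-up examples, not from any real dataset:

```python
# Hypothetical sentiment-classification task, for illustration only.
task = "Classify the sentiment of the review as positive or negative."

# Zero-shot: the model receives only the instruction and the input.
zero_shot_prompt = f"{task}\nReview: 'I loved this film.'\nSentiment:"

# Few-shot: the same instruction, preceded by a couple of solved examples.
few_shot_prompt = (
    f"{task}\n"
    "Review: 'Great acting and a moving story.'\nSentiment: positive\n"
    "Review: 'A complete waste of time.'\nSentiment: negative\n"
    "Review: 'I loved this film.'\nSentiment:"
)
```

No parameters are updated in either case: the examples live entirely in the prompt, which is what makes few-shot learning an emergent ability rather than a training technique.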
Some of the emergent abilities of large language models were unexpected: these models are able to learn and generate new knowledge and insights from the data in ways that were not fully anticipated by researchers before the models were developed, such as zero-shot learning, few-shot learning, and generative abilities.
So, don't be so hard on GPT, ChatGPT, etc. when they don't give you what you expect... they are doing more than many expected.
Fine-tuning a language model based on transformers (and also other NN architectures) involves training the model on a smaller, task-specific dataset to adapt it to a specific use case or domain.
This process adjusts the model's parameters to better suit the new task or domain, while still leveraging the knowledge learned during the pre-training on a large dataset. The process typically involves unfreezing a portion of the model's parameters, and training the model on a new dataset with a task-specific objective, such as language translation or text classification.
In the original paper where transformers were presented ('Attention Is All You Need'), besides the attention mechanism, a new way of dealing with the position of the words was introduced: the positional encoding.