Language models
Language models in NLP
After a traditional approach based on porting syntactic and semantic language rules to software, the current tendency is an inductive one: using large datasets of text, in this case, to model the complexity of natural language rules from the data itself.
Language models
A language model can also be defined as a probability distribution over the next word in a sentence or sequence of words.
Essentially, the model is trained to predict the likelihood of a specific word appearing after a given sequence of words. For example, given the sequence "The dog jumped over the", the model would assign a high probability to the next word being "fence", and a lower probability to other words such as "book" or "flower". This ability to predict the probability of a word given a sequence of previous words is what allows language models to generate new text or understand the meaning of a given text.
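To make this concrete, here is a minimal, purely illustrative sketch in Python (not from the original text): a count-based bigram model that estimates next-word probabilities from co-occurrence counts. The toy corpus and the function name `next_word_probs` are hypothetical; real language models learn far richer statistics, but the idea of assigning a probability to each candidate next word is the same.

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus (hypothetical; any large text collection would do).
corpus = (
    "the dog jumped over the fence . "
    "the cat jumped over the fence . "
    "the dog jumped over the wall ."
).split()

# Count how often each word follows a given context (here, a single previous word).
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Estimate P(next word | previous word) from raw counts."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Words seen after "the" in the corpus get probability mass; unseen words get none.
print(next_word_probs("the"))
# e.g. {'dog': 0.33, 'fence': 0.33, 'cat': 0.17, 'wall': 0.17}
```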
Training process
One way to train a language model in an unsupervised way is through a technique called autoregressive language modeling. This approach trains the model to predict the next word in a sequence of words, given the previous words, without any labeled data. The model is trained on a large corpus of text, and it learns the distribution of words and the patterns of how words are related to each other by predicting the next word in a sequence.
The training process goes as follows:
1. The model is presented with a sequence of words from the unsupervised dataset, and its task is to predict the next word in the sequence.
2. The model's prediction is then compared to the actual next word in the sequence, and the model's parameters are adjusted to reduce the difference between the predicted and actual word.
3. This process is repeated many times using different sequences of words from the dataset (normally using a kind of window or frame that is moved one position to the right each time), and the model's parameters are gradually adjusted to minimize the overall prediction error (see the sketch after these steps).
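The sketch below is a rough illustration of this loop, assuming PyTorch. The model, hyperparameters, and the random token sequence standing in for a real tokenized corpus are all placeholders (names such as `TinyLM` are hypothetical), not a production implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy setup: a tiny vocabulary, and random token ids
# standing in for a real tokenized corpus.
torch.manual_seed(0)
vocab_size, embed_dim, context = 50, 32, 4
corpus = torch.randint(0, vocab_size, (500,))

class TinyLM(nn.Module):
    """Predict the next token from an averaged embedding of the context window."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, window):  # window: (batch, context) token ids
        return self.out(self.embed(window).mean(dim=1))  # (batch, vocab_size) logits

model = TinyLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Slide the window one position to the right at a time (step 3 above):
# the tokens inside the window are the input; the token just after it is the target.
for epoch in range(3):
    for i in range(len(corpus) - context):
        window = corpus[i:i + context].unsqueeze(0)  # input sequence (step 1)
        target = corpus[i + context].unsqueeze(0)    # actual next token
        logits = model(window)
        loss = F.cross_entropy(logits, target)       # predicted vs. actual (step 2)
        optimizer.zero_grad()
        loss.backward()                              # adjust the parameters
        optimizer.step()
```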
Once the whole corpus has been processed, the resulting network can, given an input sequence, return the most likely next word (token).
The 'creativity' or 'randomness' of the generated text, a rough measure of how innovative its generations are, can be controlled, e.g., by not always selecting the most likely next word but occasionally the second or third most likely instead. In many of these language models this behavior is governed by a parameter called temperature.
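As a hedged illustration of the usual mechanism, the sketch below (using NumPy, with made-up scores over a hypothetical four-word vocabulary) divides the model's raw scores (logits) by the temperature before turning them into probabilities: higher temperatures flatten the distribution so less likely words are sampled more often, while lower temperatures sharpen it toward the single most likely word.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_word(logits, temperature=1.0):
    """Sample a next-word index from the model's scores.

    Dividing the logits by the temperature flattens the distribution
    (temperature > 1, more 'creative') or sharpens it (temperature < 1,
    closer to always picking the most likely word).
    """
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Hypothetical scores over a 4-word vocabulary, e.g. ["fence", "wall", "book", "flower"].
logits = [3.0, 2.0, 0.5, 0.1]
print(sample_next_word(logits, temperature=0.5))  # almost always index 0 ("fence")
print(sample_next_word(logits, temperature=2.0))  # other words sampled more often
```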