Word embeddings
Neural networks (NN), like any other software solution, deal only with numbers. So words must be converted into numbers, and that is not trivial at all.
In deep learning, an embedding is a way of representing a discrete object, such as a word or an image, in a continuous high-dimensional space (a vector of real numbers).
This allows the model to learn a more compact and useful representation of the data, rather than treating it as a discrete and unordered set of symbols. Encoding the discrete elements, e.g. words, as plain indexes is not a good idea, since the neural network could interpret the magnitude of each index as meaningful (how big or small the number is), when in fact they are just non-quantitative elements we want to encode.
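A minimal sketch of this idea, assuming PyTorch and a toy four-word vocabulary made up here for illustration: the integer index is only a lookup key, and the meaning lives in the dense, learnable vector returned by the embedding layer.

```python
import torch
import torch.nn as nn

# A toy vocabulary: each word gets an arbitrary integer index.
vocab = {"cat": 0, "dog": 1, "car": 2, "truck": 3}

# An embedding layer maps each index to a dense, learnable vector.
# Here the embedding dimension is 5; real models use hundreds.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=5)

# Look up the vectors for a small batch of words.
indices = torch.tensor([vocab["cat"], vocab["dog"]])
vectors = embedding(indices)  # shape: (2, 5)
print(vectors)

# The indices themselves carry no meaning (dog=1 is not "bigger" than cat=0);
# the meaning is in the vectors, which are trained together with the network.
```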
Embeddings are often used in natural language processing tasks, but they can also be used, for example, in computer vision tasks such as image classification, where they can be learned from images and then used to compare the similarity between different images.
Embedding is the process of transforming discrete elements into points in a hyperspace. As mentioned, it is not a good idea to use indexes or other simple numerical representations of words, since the magnitude of the number would be misinterpreted by the neural network.
Some sort of vector representation is therefore much better. In the context of words, it is convenient that each dimension of the vector corresponds to a semantic feature. This opens up very interesting possibilities.
We could select these features by hand, as in the sketch below.
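A purely hypothetical sketch of hand-picked semantic features (the words, features and values below are invented for illustration): each dimension answers a question such as "is it alive?" or "is it a vehicle?", and with such vectors we can already measure semantic similarity.

```python
import numpy as np

# Hypothetical hand-picked semantic features, one per dimension:
# ["alive", "big", "vehicle", "dangerous"]
hand_made = {
    "cat":   np.array([1.0, 0.1, 0.0, 0.2]),
    "tiger": np.array([1.0, 0.7, 0.0, 0.9]),
    "car":   np.array([0.0, 0.6, 1.0, 0.5]),
}

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "cat" ends up closer to "tiger" than to "car" in this feature space.
print(cosine(hand_made["cat"], hand_made["tiger"]))  # high
print(cosine(hand_made["cat"], hand_made["car"]))    # lower
```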
However, as we know, a desirable property of a NN is to let it decide the features itself, removing the human factor from this selection.
On the other hand, it is well known that the number of dimensions, or semantic features, must be in the order of hundreds.
So we will use a NN to find these features, training it on a large corpus.
The techniques used here are somewhat complex and out of scope for now, with word2vec being one of the best-known.
Somehow, these methods take into account co-occurrences of words in sentences across large corpora. When they process the full corpus, words with similar semantics are mapped to nearby regions of the hyperspace of semantic features.
So, finally, words with similar semantic properties will be mapped to similar areas of the hyperspace of word embeddings.
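A minimal sketch of this idea, assuming the gensim library (4.x API) and a toy corpus that is far too small to give meaningful results: word2vec is trained on word co-occurrences within a context window, and we can then query which words end up close to each other in the embedding space.

```python
from gensim.models import Word2Vec

# A real run would use a large corpus; this tiny one only shows the API.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
    ["the", "cat", "sleeps", "on", "the", "sofa"],
]

# Train word2vec on word co-occurrences within a context window.
model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # number of semantic features (hundreds in practice)
    window=2,          # context window used to collect co-occurrences
    min_count=1,
    sg=1,              # skip-gram variant
)

# Each word is now a point in the embedding space ...
vec = model.wv["king"]  # a 100-dimensional vector

# ... and semantically related words end up close to each other
# (only with a large enough corpus, unlike this toy example).
print(model.wv.most_similar("king", topn=3))
```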
We can visualize a large vocabulary of word embeddings (with a 3D dimensionality reduction) at:
A good tutorial on the word2vec word embedding algorithm: