Stable Diffusion model
When magic comes true
The Stable Diffusion model, and diffusion-based generation models in general, combine the remarkable ability of transformers to capture the semantics and meaning of words and sentences with similar mechanisms that capture the essence and patterns of images: objects, actions, illumination, and other important features that emerge from training on thousands of millions of images.
We are going to take a fairly high-level view of the basic ideas behind these models, since their structure and architecture are quite complex.
The training material: thousands of millions of images, each with one (or several) texts describing the image (objects, actions, descriptive information, feelings, etc.).
We already know transformers, and that they are ultimately able to capture the semantics of words and sentences. Somehow, they achieve a certain degree of language understanding and a deep knowledge of language semantics.
Stable Diffusion tries to do something similar, but including images as well. The idea is not just to obtain a model able to capture the 'essence' of the patterns of images, and therefore of any image, but also to connect this knowledge with the text description and its semantics.
Stable Diffusion is a multimodal model: it aims to capture the semantics of text and images together and in conjunction.
The training idea is based on diffusion models, which are built around the concept of Gaussian noise.
During training, a specific amount of noise is added to the data step by step, and the multi-component NN learns to perform the opposite process: denoising.
In Stable Diffusion, the images are first transformed into a latent space with an encoder, which means the NN does not work with pixels but with a much more compact representation of the images. This was introduced in the original Stable Diffusion paper (latent diffusion models).
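As a rough illustration of what "working in latent space" means in practice, here is a minimal sketch using the pre-trained VAE distributed with Stable Diffusion via the Hugging Face diffusers library (the checkpoint name and the 0.18215 scaling factor are the ones commonly used for SD v1.x; treat this as an assumption-laden sketch, not the exact pipeline code):

```python
# Minimal sketch: move an image into latent space with the Stable Diffusion VAE.
import torch
import numpy as np
from PIL import Image
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

def to_latent(pil_image: Image.Image) -> torch.Tensor:
    # Convert an HxWx3 uint8 image to a 1x3xHxW float tensor in [-1, 1]
    x = torch.from_numpy(np.array(pil_image)).float() / 127.5 - 1.0
    x = x.permute(2, 0, 1).unsqueeze(0)
    with torch.no_grad():
        # Commonly used SD v1.x latent scaling factor (assumption for this sketch)
        latent = vae.encode(x).latent_dist.sample() * 0.18215
    return latent  # shape 1x4x(H/8)x(W/8): ~48x fewer values than the pixel image
```

All the noising and denoising described below happens on these compact latents, not on the pixels themselves.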
The embedding produced by the image encoder is complemented with an embedding generated for the prompt (text), using a CLIP-based process (we will see it later) that aligns both embeddings (text and image) based on a real relationship between them.
This is what makes the magic: during training, the noising/denoising process is conditioned on the prompt (using the attention mechanism of transformers), and the combination of both embeddings gives the NN strong insights into the relationship between meanings and patterns: between text (objects, actions, etc.) and images (shapes, patterns, etc.).
Using thousands of millions of image/caption pairs, the result is a NN that achieves something very close to a true understanding of the concepts expressed in natural language and of the shapes, illumination, patterns, and objects in images.
Image generation is carried out by providing a text, or 'prompt'. Optionally, the prompt can come with an input image that will be used to guide the image generation process.
Again, random values are part of the initial creation of the image; this contributes to the creativity and the unexpected results of the process. This randomness takes the form of a random noisy image.
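As a hedged, minimal sketch of what this looks like from the outside (using the Hugging Face diffusers library; the checkpoint name, prompts, and parameters are illustrative assumptions, not code from the original model release):

```python
# Sketch of text-to-image and image-to-image generation with diffusers.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Text-to-image: the process starts from pure random latent noise,
# so fixing the seed fixes which image the model "finds" in the noise.
generator = torch.Generator().manual_seed(42)
image = pipe("a portrait of an astronaut, studio lighting",
             generator=generator).images[0]

# Image-to-image: an input image guides the generation instead of pure noise.
img2img = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
variation = img2img(prompt="the same portrait, oil painting style",
                    image=image, strength=0.6).images[0]
```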
The idea of diffusion models is that you take an image and add a little bit of Gaussian noise to it, so you obtain a slightly noisy image. Then you repeat that process, so to the slightly noisy image you again add a little bit of Gaussian noise to obtain an even noisier image. You repeat this several times (up to ~1000 times) to obtain a fully noisy image.
While doing so, you know, for each step, the original image (or slightly noisy image) and its noisier version.
Then you train a neural network that takes the noisier example as input and has the task of predicting the denoised version of the image.
By doing this for many different steps, the neural network learns to denoise very noisy images, repeatedly, until it recovers the original image.
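A toy sketch of this training idea (a simplified pixel-space version following the standard DDPM noise-prediction objective; the model signature, schedule values, and step count are illustrative assumptions, not Stable Diffusion's actual training code):

```python
# Toy sketch of the forward noising process and the denoising training objective.
import torch
import torch.nn.functional as F

T = 1000                                    # number of noising steps
betas = torch.linspace(1e-4, 0.02, T)       # how much noise is added at each step
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t, noise):
    # Closed form for "apply t small Gaussian noising steps at once":
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

def training_step(model, x0):
    t = torch.randint(0, T, (x0.shape[0],))   # random noising step per image
    noise = torch.randn_like(x0)              # Gaussian noise
    x_t = add_noise(x0, t, noise)             # the noisier version
    noise_pred = model(x_t, t)                # hypothetical net predicting the noise
    return F.mse_loss(noise_pred, noise)      # learn to undo the noising
```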
Let's imagine we work only with images of faces. Using autoencoders, we can train them in a chained way, where the autoencoder is given an image with some degree of noise and another version of it with more noise.
If we train this model with thousands of faces, the model is able to work in a bidirectional way.
After this training, yes!! If the input is a totally noisy image, the final output will be a new face!! But... what face?
This is one of the magical strong points of the model. Nobody can predict what face will be generated.
At this stage, depending on these random points, on the particular random collection of initial pixels, the NN will intuit the shape of a potential face somewhere in the noise, a potential nose, ears... little by little, step by step, until a final face appears. Magically ;)
This is the problem now: only faces? We want to be able to generate anything, doing anything, with different styles, illumination, etc.
So, the very first consequence is that we need thousands of millions of images. OK.
However, this is not enough.
Since the objective is to start from a prompt, text is important here: text must be able to guide the process.
How do we achieve that? By making the training (noising/denoising) process work not only with image embeddings, but also take into account the text embeddings of the images!
CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on hundreds of millions of images and their associated captions. It learns how much a given text snippet relates to an image.
CLIP is used in Stable Diffusion to compute the embedding of the text that will be used in the training process.
CLIP models (OpenCLIP is the one now used in Stable Diffusion) are pre-trained with millions of images and texts describing them. They produce embeddings for text and image, and the training of these models tries to maximize the similarity of both embeddings when they belong together.
The idea is that, for a given image, the most similar text embedding belongs to the text that best describes the image, and vice versa.
This is how CLIP is used in the training of Stable Diffusion: the pre-trained model provides a text embedding (of the caption) that is very well aligned with the image.
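A minimal sketch of CLIP's core idea, scoring how well several candidate texts describe an image (this uses the openai/clip-vit-base-patch32 checkpoint through the Hugging Face transformers library purely as an illustration; the file name and texts are made up, and this is not the exact text-encoder wiring used inside Stable Diffusion):

```python
# Sketch: score how well each text matches an image with a pre-trained CLIP model.
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical input image
texts = ["a photo of a cat", "a photo of a dog", "a city at night"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image: similarity of the image embedding to each text embedding.
# Contrastive training pushed matching pairs together, so the best caption scores highest.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```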
So, the training is not just noising/denoising and nothing else. The embedding of the image is complemented by the embedding of the text (using CLIP embeddings), so when the model is trained with an image and its caption, both the image embedding and the caption embedding are taken into account in the learning process of the NN.
So, when we start from an image of pure noise and a text, besides the 'intuition' of the neurons to find something in the noise, that something is also conditioned by the text. Stable Diffusion starts with a random noisy image and a text, but the model, thanks to the intensive training on huge datasets of image-caption pairs, knows what it must try to find in the noise!
So, the noising/denoising process is guided by a very clever complement that allows the neurons of the model to really relate text to shapes, etc.
In this way, we get good results not only with faces, but with any input text prompt describing anything, since the Stable Diffusion NN will pay attention to what the text says about what to find in the image.
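To make the text conditioning concrete, here is a hedged sketch of one latent-diffusion training step assembled from the diffusers and transformers building blocks (simplified: the checkpoint name is illustrative, and a real training loop adds the optimizer, batching, devices, and many other details):

```python
# Simplified sketch of one text-conditioned training step in latent space.
# The text embedding enters the U-Net through cross-attention
# (encoder_hidden_states), which is how the prompt guides denoising.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

repo = "runwayml/stable-diffusion-v1-5"          # illustrative checkpoint
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")

def train_step(pixel_values, captions):
    # 1) Image -> latent space (with the commonly used SD v1.x scaling factor)
    latents = vae.encode(pixel_values).latent_dist.sample() * 0.18215
    # 2) Caption -> CLIP text embedding
    tokens = tokenizer(captions, padding="max_length", truncation=True,
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    text_emb = text_encoder(tokens.input_ids).last_hidden_state
    # 3) Add noise at a random timestep (forward diffusion)
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy_latents = scheduler.add_noise(latents, noise, t)
    # 4) Predict the noise, conditioned on the text via cross-attention
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
    # 5) Learn to undo the noising
    return F.mse_loss(noise_pred, noise)
```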
A more detailed description of this process: