Stable Diffusion - Architecture



Large text-to-image models have achieved remarkable success, enabling high-quality synthesis of images from text prompts. Stable Diffusion is one such model for image generation. It is based on a type of diffusion model called the Latent Diffusion Model, created by CompVis, LMU and RunwayML.

This diffusion model reduces memory use and computation time by applying the diffusion process over a lower-dimensional latent space, rather than the actual high-dimensional image space.
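To get a sense of the savings, consider Stable Diffusion v1, whose VAE downsamples each spatial dimension by a factor of 8 and produces 4 latent channels; a quick back-of-the-envelope check in Python:

```python
# A 512x512 RGB image versus its 4-channel, 8x-downsampled latent.
image_elements = 512 * 512 * 3                 # 786,432 values
latent_elements = 4 * (512 // 8) * (512 // 8)  # 16,384 values

print(image_elements / latent_elements)        # 48.0 -> ~48x fewer values
```

The diffusion process therefore operates on a tensor roughly 48 times smaller than the raw image.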

The three main components of Stable Diffusion, each of which can be loaded on its own as sketched below, are −

  • Variational Autoencoder (VAE)
  • U-Net
  • Text-Encoder
Figure: Architecture of Latent Diffusion
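Each component ships as a separate sub-module of a released checkpoint and can be loaded on its own. A minimal sketch using the Hugging Face diffusers and transformers libraries (the checkpoint name is one common Stable Diffusion v1 checkpoint, used here only for illustration):

```python
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"  # any SD v1.x checkpoint layout works

vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
```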

Variational Autoencoder (VAE)

The Variational Autoencoder (VAE) has two parts: an encoder and a decoder. During training, the encoder converts an image into a low-dimensional latent representation for the forward diffusion process, i.e., the process of gradually turning an image into noise. These small encoded versions are called latents; noise is applied to them at each training step, and the resulting noisy latents are fed as input to the U-Net model.
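A minimal sketch of this encode-and-noise step with diffusers, assuming the `vae` loaded earlier and a DDPM noise scheduler (0.18215 is the latent scaling factor used by SD v1 checkpoints):

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

# Stand-in for a real training image, normalized to [-1, 1].
image = torch.randn(1, 3, 512, 512)

# Encode to latents: (1, 3, 512, 512) -> (1, 4, 64, 64).
latents = vae.encode(image).latent_dist.sample() * 0.18215

# Forward diffusion: add noise corresponding to a random timestep.
noise = torch.randn_like(latents)
timestep = torch.randint(0, scheduler.config.num_train_timesteps, (1,))
noisy_latents = scheduler.add_noise(latents, noise, timestep)
```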

The decoder of the VAE transforms the low-dimensional representation back into an image. The denoised latents produced by the reverse diffusion process, i.e., the process of converting noise back into an image, are passed through the decoder to obtain the final image.
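Continuing the sketch, decoding undoes the scaling applied at encode time and maps the latents back to pixel space:

```python
with torch.no_grad():
    # `latents` here stands for fully denoised latents from reverse diffusion.
    decoded = vae.decode(latents / 0.18215).sample  # (1, 3, 512, 512)

# Rescale from roughly [-1, 1] to [0, 1] for display or saving.
image_out = (decoded / 2 + 0.5).clamp(0, 1)
```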

U-Net

U-Net is a convolutional neural network that predicts the denoised representation of a noisy latent. The input to the U-Net is the noisy latents, and its output is the noise contained in those latents. This step is carried out to recover the actual latents by removing the predicted noise from the noisy latents.
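A single denoising iteration looks roughly like the sketch below, reusing `unet`, `scheduler`, and `noisy_latents` from the earlier sketches; `text_embeddings` comes from the text encoder described in the next section:

```python
t = 999  # an illustrative timestep near the noisy end of the schedule

with torch.no_grad():
    # The U-Net predicts the noise component in the noisy latents,
    # conditioned on the text embeddings via cross-attention.
    noise_pred = unet(noisy_latents, t,
                      encoder_hidden_states=text_embeddings).sample

# The scheduler uses the predicted noise to compute a slightly
# less noisy latent for the next iteration of reverse diffusion.
latents = scheduler.step(noise_pred, t, noisy_latents).prev_sample
```

In a full sampling loop, this step repeats over a decreasing sequence of timesteps until the latents are clean enough to decode.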

The U-Net in this model consists of an encoder with 12 blocks, followed by a middle block, and then a decoder with 12 blocks. Of these 25 blocks, 8 are down-sampling or up-sampling convolution layers, and the rest are the main blocks, which consist of four ResNet layers and two Vision Transformers (ViTs).
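Note that the diffusers implementation groups these same layers into a coarser set of modules: for Stable Diffusion v1, four down-block modules, one mid block, and four up-block modules. The grouping can be inspected on the `unet` loaded earlier:

```python
# diffusers groups the U-Net's layers into coarser block modules:
# for SD v1, 4 down blocks, 1 mid block, and 4 up blocks.
print(len(unet.down_blocks), len(unet.up_blocks))  # 4 4

for block in unet.down_blocks:
    # e.g. CrossAttnDownBlock2D (ResNet + attention layers) or DownBlock2D
    print(type(block).__name__)
```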

Text Encoder

A text encoder is a transformer-based model that transforms a sequence of input tokens into a sequence of latent text embeddings. Stable Diffusion uses the pre-trained CLIP text encoder, which generates embeddings corresponding to the given input text. These embeddings serve as input to the U-Net and provide guidance for denoising the noisy latents during the U-Net's training process.
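A sketch of producing those embeddings with the `tokenizer` and `text_encoder` loaded earlier (77 is CLIP's fixed token sequence length, and 768 is the embedding width of the CLIP model used by SD v1):

```python
prompt = "a photograph of an astronaut riding a horse"

tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,  # 77 for CLIP
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    # Shape (1, 77, 768): one embedding per token position,
    # passed to the U-Net through its cross-attention layers.
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state
```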
