Understanding DiT (Diffusion Transformer) in One Article


1. Background Introduction

Since its introduction by Vaswani et al. in 2017, the Transformer model has made revolutionary progress in the field of natural language processing. However, as the complexity of generative tasks increases, traditional Transformer models face challenges. Diffusion Transformers (DiTs), proposed by Peebles and Xie in 2022, combine the generative capabilities of the diffusion process with the self-attention mechanism of Transformers to address these challenges.

2. Diffusion Models

2.1. Background of Diffusion Models

Diffusion models grew out of ideas from non-equilibrium thermodynamics (Sohl-Dickstein et al., 2015) and were popularized by denoising diffusion probabilistic models (DDPMs; Ho et al., 2020). They have since become one of the dominant families of generative models for high-resolution image synthesis.

2.2. Definition of Diffusion Models

Diffusion Models are generative models that learn to produce data resembling their training distribution by reversing a gradual noising process; they are capable of generating diverse, high-resolution images.

2.3. Core Idea of Diffusion Models

Inspired by non-equilibrium thermodynamics, Diffusion Models are generative models whose core idea is to gradually add noise to data through a diffusion process and then learn to reverse this process to construct the desired data samples from the noise.

2.4. Detailed Explanation of the Diffusion Process

The diffusion process typically includes two phases: the forward process and the reverse process. In the forward process, data is gradually corrupted toward a simple noise distribution (typically a standard Gaussian); in the reverse process, a learned model removes the noise step by step to recover the original data.
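To make this concrete, here is the standard DDPM formulation of the two phases (a reference sketch consistent with the description above, where β_t is the noise schedule and ᾱ_t = ∏ₛ(1−β_s)):

```latex
% Forward process: gradually corrupt x_0 with Gaussian noise
q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)

% Closed form: x_t can be sampled directly from x_0
q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)

% Learned reverse process: remove noise step by step
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```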

3. Transformer Architecture

3.1. Overview of Transformers

Transformers are models based on the self-attention mechanism that can process sequential data and capture long-range dependencies. The original architecture consists of an encoder and a decoder, each built from self-attention layers and feed-forward network layers that transfer and process information.

3.2. Self-Attention Mechanism

The self-attention mechanism allows the model to consider all elements of a sequence simultaneously while processing it, thereby capturing the global context.
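As an illustration, here is a minimal NumPy sketch of single-head scaled dot-product self-attention (the projection matrices and dimensions are illustrative, not taken from any specific model):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise similarity of all positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the whole sequence
    return weights @ V                               # each output attends to every input

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 64))                        # 10 tokens, model width 64
Wq, Wk, Wv = (0.1 * rng.normal(size=(64, 64)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                  # shape (10, 64)
```

Because every position attends to every other position in a single step, the receptive field is global from the very first layer.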

4. Combining Diffusion Transformers

4.1. Key Concepts of DiTs

4.1.1. Definition of DiT

A Diffusion Transformer is a diffusion model built on the Transformer architecture. It is used for image and video generation tasks and can efficiently capture dependencies in the data to produce high-quality results.

4.1.2. Essence of DiT

A Diffusion Transformer is a new type of diffusion model that combines the denoising diffusion probabilistic model (DDPM) with the Transformer architecture.


4.1.3. Core Idea of DiT

The core idea of the Diffusion Transformer is to use the Transformer as the backbone network for the diffusion model, instead of traditional convolutional neural networks (such as U-Net), to handle the latent representations of images.


4.2. Workflow of DiTs

A DiT generates or transforms images and videos by introducing noise into the data and training a Transformer-based network to reverse the noising process. The workflow consists of data preprocessing, noise introduction, model training, and final image or video generation.

4.2.1. Data Preprocessing

Convert the input image or video data into a format that can be processed by the model, such as dividing the image into fixed-size patches and then transforming these patches into feature vectors.
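A minimal PyTorch sketch of this patchify step (shown on raw pixels for simplicity; DiT applies the same operation to VAE latents with a smaller patch size, and the 1152-dim projection below matches DiT-XL but is otherwise just an example):

```python
import torch

def patchify(images, patch_size=16):
    """Split (B, C, H, W) images into non-overlapping patches, one flat vector each."""
    B, C, H, W = images.shape
    p = patch_size
    x = images.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 1, 3, 5)                  # (B, H/p, W/p, C, p, p)
    return x.reshape(B, (H // p) * (W // p), C * p * p)

patches = patchify(torch.randn(2, 3, 224, 224))      # (2, 196, 768)
embed = torch.nn.Linear(768, 1152)                   # project patches to the model width
tokens = embed(patches)                              # (2, 196, 1152), ready for the Transformer
```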

4.2.2. Noise Introduction

Gradually introduce noise into the preprocessed feature vectors, forming the forward diffusion process. This process can be seen as a transformation from the original data to noisy data.
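Under the standard DDPM schedule, this noising step has a closed form that lets us jump from clean data to any timestep t directly; a small sketch (the linear schedule values are the common DDPM defaults, chosen here for illustration):

```python
import torch

def add_noise(x0, t, alphas_cumprod):
    """Sample x_t from q(x_t | x_0): interpolate between data and Gaussian noise."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return xt, noise                                 # the noise doubles as a training target

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                # common linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
x0 = torch.randn(8, 4, 32, 32)                       # e.g. a batch of VAE latents
xt, eps = add_noise(x0, torch.randint(0, T, (8,)), alphas_cumprod)
```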

4.2.3. Model Training

Use the feature vectors with introduced noise as input to train the Diffusion Transformer model. The goal of the model is to learn how to reverse the noise addition process, i.e., to recover the original data from the noisy data.
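A sketch of one training step under the common noise-prediction objective (the `model(xt, t, cond)` signature is an assumption standing in for a DiT; `add_noise` is the helper from the previous snippet):

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, cond, alphas_cumprod, opt):
    """Sample a timestep, noise the data, and regress the model onto the added noise."""
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
    xt, eps = add_noise(x0, t, alphas_cumprod)       # forward process (previous snippet)
    eps_pred = model(xt, t, cond)                    # Transformer predicts the noise
    loss = F.mse_loss(eps_pred, eps)                 # simple DDPM training objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```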

4.2.4. Image or Video Generation

After training is complete, new images or videos are generated by feeding noisy data (or randomly sampled noise) into the model and letting it iteratively denoise the input. This sampling process exploits the noise-to-data mapping the model has learned.
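A minimal ancestral (DDPM) sampling loop matching this description; again, `model(x, t, cond)` is assumed to predict noise, and for a latent diffusion model the result would still need to be decoded by the VAE:

```python
import torch

@torch.no_grad()
def sample(model, shape, betas, cond=None):
    """Start from pure noise and denoise step by step back to data."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                           # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch, cond)                # predicted noise at this step
        x = (x - betas[t] / (1 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:                                    # add noise except at the final step
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                         # latents; decode with the VAE
```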

4.3. Architecture of DiT

The DiT architecture builds on the Latent Diffusion Model (LDM) framework, using a Vision Transformer (ViT) as the backbone network and constructing a scalable diffusion model by adapting ViT's normalization: standard layer norm is replaced with adaptive layer norm (adaLN), whose scale and shift parameters are regressed from the conditioning information. The main components are described below.
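Before walking through the components, here is a simplified sketch of one DiT block with the adaLN modulation just mentioned (dimensions and module layout are illustrative; the paper's adaLN-Zero variant additionally zero-initializes the gates):

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Transformer block whose layer norms are modulated by the conditioning vector."""
    def __init__(self, dim, num_heads, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )
        # Conditioning (timestep + label embedding) -> shift/scale/gate for attn and MLP
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x, c):                         # x: (B, N, dim) tokens, c: (B, dim)
        sh1, sc1, g1, sh2, sc2, g2 = self.adaLN(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + sc1.unsqueeze(1)) + sh1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + sc2.unsqueeze(1)) + sh2.unsqueeze(1)
        return x + g2.unsqueeze(1) * self.mlp(h)
```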


4.3.1. Input Layer

The input layer receives the noised latent patches together with conditioning information (such as the diffusion timestep and a class or text embedding), providing the necessary context for the generative process of DiTs.

4.3.2. Diffusion Layer

The diffusion layer is responsible for gradually introducing noise, generating diffused data.

4.3.3. Reverse Diffusion Layer

The reverse diffusion layer reverses the diffusion process, removing noise to generate the target data.

4.3.4. Self-Attention Module

The self-attention module plays a role in each diffusion and reverse diffusion step, helping the model capture global information.

5. Applications of DiTs

5.1. Sora

5.1.1. Definition of Sora

The Sora model is OpenAI's advanced video generation model. It generates videos in a distinctive way, forming the final imagery by gradually removing noise, which results in more detailed scenes and the ability to learn complex dynamics.

5.1.2. Core Components of Sora

The core components of the Sora model include the Diffusion Transformer (DiT), a Variational Autoencoder (VAE), and a Vision Transformer (ViT): DiT recovers original video data from noisy data, the VAE compresses video data into latent representations, and the ViT transforms video frames into feature vectors for DiT to process (a schematic sketch of how these pieces could fit together follows the list below).

  • Diffusion Transformer (DiT): Combining the advantages of diffusion models and Transformer architecture, DiT can generate high-quality, realistic video content by simulating the diffusion process from noise to data. In the Sora model, DiT is responsible for recovering original video data from noisy data.

  • Variational Autoencoder (VAE): A VAE is a generative model that compresses input images or video data into low-dimensional latent representations and restores these latent representations to the original data through a decoder. In the Sora model, the VAE is used as an encoder that compresses input video data into latent inputs for DiT, thereby guiding DiT to generate video content consistent with the input video.

  • Vision Transformer (ViT): ViT is an image processing model based on the Transformer that treats images as a series of patches and transforms these patches into feature vectors as inputs for the Transformer. In the Sora model, ViT may be used as a preprocessing step or as a component of the model.
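Sora's internals are not public, so the following is only a hypothetical sketch of how the three components described above could fit together; every name and signature here is an assumption for illustration:

```python
import torch

def generate_video(vae, dit, sampler, text_cond, latent_shape):
    """Hypothetical Sora-style pipeline: denoise latents with DiT, then decode."""
    # 1. Sample video latents from noise, conditioned on a text embedding.
    latents = sampler(dit, latent_shape, cond=text_cond)  # e.g. the DDPM loop above
    # 2. Decode the latents back to pixel space with the VAE decoder.
    return vae.decode(latents)                            # video frames
```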

5.2. Text-to-Image Generation

DiTs can generate images corresponding to textual descriptions, which has significant applications in artistic creation and content generation.

5.3. Image Super-Resolution

DiTs can be used to enhance the resolution of images while maintaining image detail and quality.

5.4. Style Transfer

DiTs can apply one artistic style to another image, achieving the effect of style transfer.

6. Codia AI's products

Codia AI has extensive experience in multimodal models, image processing, and AI.

1. Codia AI DesignGen: Prompt to UI for Website, Landing Page, Blog


2. Codia AI Design: Screenshot to Editable Figma Design


3. Codia AI VectorMagic: Image to Full-Color Vector/PNG to SVG


4. Codia AI Figma to code: HTML, CSS, React, Vue, iOS, Android, Flutter, Tailwind, Web, Native, ...

