How DALL·E 2 Works
Figure 1: variations from DALL·E 2 on a blackboard doodle by Lei Pan. The original doodle is in the center, and the generated variations are displayed around it.
DALL·E 2 is a system for text-to-image generation developed by my coauthors and me at OpenAI. When prompted with a caption, the system will attempt to generate a novel image from scratch that matches it. It also has additional capabilities like:
- Inpainting: perform edits to an image using language;
- Variations (Figure 1): generate new images that share the same essence as a given reference image, but differ in how the details are put together; and
- Text diffs (Figure 4): transform any aspect of an image using language.
The system underlying DALL·E 2, which we call unCLIP, is based on two key technologies: CLIP and diffusion. As stated in the blog post introducing it, CLIP is a model that “efficiently learns visual concepts from natural language supervision”. Diffusion is a technique for training a generative model of images by learning to undo the steps of a fixed corruption process. We briefly describe both of these technologies next.
Figure 2: illustration of the contrastive training objective for CLIP. During each step of training, CLIP receives \(N = 32{,}768\) images and their corresponding captions. From these, we form \(N\) matching image-caption pairs (corresponding to the diagonal elements of the matrix in the illustration), and \(N(N - 1)\) pairs of mismatching captions and images (corresponding to the off-diagonal elements).
CLIP consists of two neural networks – a text encoder and an image encoder – that are trained on a large, diverse collection of image-text pairs. Each encoder maps its input to a point on a globe (known as an embedding); this globe functions as a “concept space” shared by both modalities. During each step of training, CLIP receives a list of images and a corresponding list of captions that describe them. Using this data, we can form two types of image-text pairs: a matching pair, in which an image is paired up with its corresponding caption, and a mismatching pair, in which an image is paired up with any other caption. The encoders are trained to map the matching pairs to nearby points on the globe, and mismatching pairs to distant points.
This simple training objective (known as “contrastive training” in machine learning) encourages CLIP to learn about all of the features of an image that people are likely to write about online. These features include things like which objects are present, the aesthetic style, the colors and materials that are used, and so on. By contrast, CLIP is typically not incentivized to preserve information about the relative positions of objects, or information about which attributes apply to which objects. CLIP would therefore have a hard time distinguishing between, say, an image of a red cube on top of a blue cube and another image in which the positions of the two objects are swapped. The reason for this is the nature of the CLIP training objective: CLIP is only incentivized to learn the features of an image that are sufficient to match it up with the correct caption (as opposed to any of the others in the list). Unless it receives a counterexample (i.e., a caption that mentions a blue cube on top of a red cube), CLIP will not learn to preserve information about the objects’ relative positions.
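To make the contrastive objective concrete, here is a minimal sketch in PyTorch. The encoders are replaced with random stand-in embeddings so the snippet runs on its own, and the batch size, embedding dimension, and temperature are illustrative assumptions rather than CLIP’s actual settings.

```python
# A minimal sketch of a CLIP-style contrastive objective (illustrative only).
import torch
import torch.nn.functional as F

N, d = 8, 512                                        # batch size and embedding dim (illustrative)
image_emb = F.normalize(torch.randn(N, d), dim=-1)   # stand-in for image_encoder(images)
text_emb  = F.normalize(torch.randn(N, d), dim=-1)   # stand-in for text_encoder(captions)

# Similarity matrix: entry (i, j) compares image i with caption j.
# Diagonal entries are the N matching pairs; off-diagonals are the N(N-1) mismatches.
logits = image_emb @ text_emb.T / 0.07               # 0.07 is a typical temperature

targets = torch.arange(N)
loss = (F.cross_entropy(logits, targets) +           # match each image to its caption
        F.cross_entropy(logits.T, targets)) / 2      # and each caption to its image
print(loss.item())
```

Minimizing this loss pulls matching image-caption pairs toward the same point on the globe while pushing mismatching pairs apart.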
Figure 3: illustration of the process used to generate a new image with the diffusion model, created by Alex Nichol.
A diffusion model is trained to undo the steps of a fixed corruption process. Each step of the corruption process adds a small amount of noise (specifically, Gaussian noise) to an image, which erases some of the information in it. After the final step, the image becomes indistinguishable from pure noise. The diffusion model is trained to reverse this process, and in doing so learns to regenerate what might have been erased in each step. To generate an image from scratch, we start with pure noise and suppose that it was the end result of the corruption process applied to a real image. Then, we repeatedly apply the model to reverse each step of this hypothetical corruption process. This gradually makes the image more and more realistic, eventually yielding a pristine, noiseless image.
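The following is a minimal sketch of this idea in PyTorch, in the style of a standard DDPM. The noise schedule, the image shape, and the `toy_denoiser` placeholder (which stands in for the trained network that predicts the added noise) are illustrative assumptions, not the settings used by DALL·E 2.

```python
# A minimal sketch of diffusion training and sampling (DDPM-style), with a toy
# "denoiser" in place of a real neural network.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # noise added at each corruption step
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)       # fraction of signal kept after t steps

def corrupt(x0, t, noise):
    # Forward process: mix the clean image with Gaussian noise at step t.
    return alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * noise

def toy_denoiser(x_t, t):
    # Stand-in for the trained network that predicts the noise added at step t.
    return torch.zeros_like(x_t)

# Training objective (one step): predict the noise that was mixed in.
x0 = torch.randn(1, 3, 64, 64)                 # a "real" image (random here)
t = torch.randint(0, T, ()).item()
noise = torch.randn_like(x0)
loss = ((toy_denoiser(corrupt(x0, t, noise), t) - noise) ** 2).mean()

# Sampling: start from pure noise and reverse the corruption one step at a time.
x = torch.randn(1, 3, 64, 64)
for t in reversed(range(T)):
    eps = toy_denoiser(x, t)
    x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
    if t > 0:
        x = x + betas[t].sqrt() * torch.randn_like(x)   # re-inject a bit of noise
```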
DALL·E 2 generates images in a two-stage process, first by generating the “gist” of an image and then by filling in the remaining details to obtain a realistic image. In the first stage, a model which we call the prior generates the CLIP image embedding (intended to describe the “gist” of the image) from the given caption. (One might ask why this prior model is necessary: since the CLIP text encoder is trained to match the output of the image encoder, why not use the output of the text encoder as the “gist” of the image? The answer is that an infinite number of images could be consistent with a given caption, so the outputs of the two encoders will not perfectly coincide. Hence, a separate prior model is needed to “translate” the text embedding into an image embedding that could plausibly match it.) In the second stage, a diffusion model which we call unCLIP generates the image itself from this embedding. During each step of training, unCLIP receives both a corrupted version of the image it is trained to reconstruct and the CLIP image embedding of the clean image. This model is called unCLIP because it effectively reverses the mapping learned by the CLIP image encoder. Since unCLIP is trained to “fill in the details” necessary to produce a realistic image from the embedding, it will learn to model all of the information that CLIP deems irrelevant for its training objective and hence discards.
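As a rough sketch, the two-stage sampling pipeline looks like the following. The functions `clip_text_encoder`, `prior`, and `unclip_decoder` are hypothetical placeholders that return random tensors; in the real system each is a trained neural network.

```python
# A minimal sketch of the two-stage sampling process (placeholders, not a real API).
import torch

def clip_text_encoder(caption: str) -> torch.Tensor:
    return torch.randn(512)            # stand-in for the CLIP text embedding

def prior(text_emb: torch.Tensor) -> torch.Tensor:
    # Stage 1: generate a CLIP image embedding (the "gist") from the caption.
    return torch.randn(512)            # stand-in image embedding

def unclip_decoder(image_emb: torch.Tensor) -> torch.Tensor:
    # Stage 2: a diffusion model, conditioned on the embedding, fills in the
    # remaining details to produce a realistic image.
    return torch.rand(3, 256, 256)     # stand-in RGB image

caption = "a corgi playing a flame-throwing trumpet"
image = unclip_decoder(prior(clip_text_encoder(caption)))
```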
There are a few reasons why it’s advantageous to use this two-stage sampling process, and we discuss two of them here. (Our paper discusses further advantages of the two-stage sampling process.) Firstly, we can prioritize modeling the high-level semantics that make images meaningful to humans above other details. Images contain a lot of information, most of which describes fine-grained, imperceptible details. Only a relatively tiny sliver of this information is responsible for what makes images visually coherent and meaningful to us, and the CLIP image embedding captures much of it. Training a model directly on the CLIP image embedding allows us to focus on modeling these salient characteristics first, before filling in the details necessary to synthesize a realistic image in the second stage.
Figure 4: animation of a text diff used to transform a Victorian house into a modern one. The transformation is determined by the captions “a victorian house”, which describes the architecture of the house, and “a modern house”, which describes how the architecture of the house should be changed.
The second reason is that CLIP’s multimodal embedding space allows us to apply “before and after” transformations to images using a technique that we call text diffs. In 2013, word2vec showed that it is possible to obtain a “concept space” for text in which vector arithmetic becomes interpretable. For example, word2vec maps the word “queen” close to the result of computing \[
\textrm{“woman”} + \textrm{“king”} - \textrm{“man”},
\] which makes it possible to complete analogies of the sort one might encounter in a standardized test. CLIP takes this a step further and allows us to perform arithmetic using both text and images, as in \[
\textrm{(image of victorian house)} + \textrm{“a modern house”} - \textrm{“a victorian house”}.
\] Using unCLIP, we can translate points in CLIP’s concept space back into images and visually inspect the change that takes place as we move the embedding of the image in the direction specified by the “before” caption (“a victorian house”) and the “after” caption (“a modern house”).
Concretely, let \(f_i\) and \(f_t\) denote the CLIP image and text encoders, respectively, and suppose that we have an image of a Victorian house contained in a file house.png which we would like to transform into a modern house. To do this, we first compute \[
\begin{align}
z_{i0} &= f_i(\texttt{house.png}), \\
z_{t0} &= f_t(\textrm{“a photo of a victorian house”}), \\
z_{t1} &= f_t(\textrm{“a photo of a modern house”}),
\quad\textrm{and} \\
z_d &= (z_{t1} - z_{t0}) / \|z_{t1} - z_{t0}\|,
\end{align}
\] where \(z_d\) is known as the text diff vector. Next, to transform the house, we rotate between the image embedding \(z_{i0}\) and the text diff vector \(z_d\) using \(z_{i1} = \operatorname{slerp}(z_{i0}, z_d, \theta)\). Finally, we synthesize an image from \(z_{i1}\) using unCLIP. The animation in Figure 4 shows the trajectory as \(\theta\) is varied from 0 (which reconstructs the original image) to 0.50 (which results in a modernized version of the house), and provides us with visual confirmation that the image of a Victorian house we started out with is indeed being “modernized” as we might intuitively expect. Of course, text diffs are not limited to architecture: the transformation could be any “before and after” concept that can be expressed in language, which makes this a versatile and powerful tool.
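For concreteness, here is a minimal sketch of the text diff computation in NumPy. The `clip_image_encoder`, `clip_text_encoder`, and `unclip` names are hypothetical placeholders (the real encoders are trained networks that produce unit-norm embeddings), and the embedding dimension of 512 is an illustrative choice.

```python
# A minimal sketch of the text diff recipe, with stand-in encoders.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def slerp(a, b, theta):
    # Spherical interpolation between unit vectors a and b by a fraction theta.
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    return (np.sin((1 - theta) * omega) * a + np.sin(theta * omega) * b) / np.sin(omega)

def clip_image_encoder(path):
    return normalize(np.random.randn(512))       # stand-in unit-norm image embedding

def clip_text_encoder(text):
    return normalize(np.random.randn(512))       # stand-in unit-norm text embedding

z_i0 = clip_image_encoder("house.png")
z_t0 = clip_text_encoder("a photo of a victorian house")
z_t1 = clip_text_encoder("a photo of a modern house")
z_d = normalize(z_t1 - z_t0)                     # the text diff vector

for theta in np.linspace(0.0, 0.5, 6):           # 0 reconstructs; 0.5 "modernizes"
    z_i1 = slerp(z_i0, z_d, theta)
    # image = unclip(z_i1)                       # synthesize the image with unCLIP
```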
Acknowledgments: I’d like to thank Lei Pan, Aravind Srinivas, Rewon Child and Justin Mao-Jones for their feedback on this blog. I’d also like to thank my coauthors Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen.