Ask DALL·E 2, an image generation system created by OpenAI, to paint a picture of “goldfish slurping Coca-Cola on a beach,” and it will spit out surreal images of exactly that. The program would have encountered images of beaches, goldfish and Coca-Cola during training, but it’s highly unlikely it would have seen one in which all three came together. Yet DALL·E 2 can assemble the concepts into something that might have made Dalí proud.
DALL·E 2 is a type of generative model — a system that attempts to use training data to generate something new that’s comparable to the data in terms of quality and variety. This is one of the hardest problems in machine learning, and getting to this point has been a difficult journey.
The first important generative models for images used an approach to artificial intelligence called a neural network — a program composed of many layers of computational units called artificial neurons. But even as the quality of their images got better, the models proved unreliable and hard to train. Meanwhile, a powerful generative model — created by a postdoctoral researcher with a passion for physics — lay dormant, until two graduate students made technical breakthroughs that brought the beast to life.
DALL·E 2 is such a beast. The key insight that makes DALL·E 2’s images possible — as well as those of its competitors Stable Diffusion and Imagen — comes from the world of physics. The system that underpins them, known as a diffusion model, is heavily inspired by nonequilibrium thermodynamics, which governs phenomena like the spread of fluids and gases. “There are a lot of techniques that were initially invented by physicists and now are very important in machine learning,” said Yang Song, a machine learning researcher at OpenAI.
The power of these models has rocked industry and users alike. “This is an exciting time for generative models,” said Anima Anandkumar, a computer scientist at the California Institute of Technology and senior director of machine learning research at Nvidia. And while the realistic-looking images created by diffusion models can sometimes perpetuate social and cultural biases, she said, “we have demonstrated that generative models are useful for downstream tasks [that] improve the fairness of predictive AI models.”
To understand how creating data works for images, let’s start with a simple image made of just two adjacent grayscale pixels. We can fully describe this image with two values, based on each pixel’s shade (from zero being completely black to 255 being completely white). You can use these two values to plot the image as a point in 2D space.
If we plot multiple images as points, clusters may emerge — certain images and their corresponding pixel values that occur more frequently than others. Now imagine a surface above the plane, where the height of the surface corresponds to how dense the clusters are. This surface maps out a probability distribution. You’re most likely to find individual data points underneath the highest part of the surface, and few where the surface is lowest.
Now you can use this probability distribution to generate new images. All you need to do is randomly generate new data points while adhering to the restriction that you generate more probable data more often — a process called “sampling” the distribution. Each new point is a new image.
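To make the two-pixel case concrete, here is a minimal sketch that estimates the distribution by counting how often each pixel pair appears in a tiny, made-up training set, then samples new images in proportion to those counts. The training data and the helper names `fit_distribution` and `sample_images` are assumptions for illustration only. A counting model like this can only replay pairs it has already seen, whereas the generative models discussed here learn a smooth surface that also covers images absent from the training data.

```python
import random
from collections import Counter

def fit_distribution(images):
    """Estimate the probability of each (pixel_a, pixel_b) pair by counting occurrences."""
    counts = Counter(images)
    total = sum(counts.values())
    return {image: n / total for image, n in counts.items()}

def sample_images(distribution, k, rng):
    """Draw k images, picking more probable pixel pairs more often."""
    images = list(distribution)
    weights = [distribution[image] for image in images]
    return rng.choices(images, weights=weights, k=k)

# Hypothetical training set: mostly dark pixel pairs, plus one bright pair.
training_images = [(10, 12)] * 6 + [(11, 12)] * 3 + [(250, 245)] * 1

distribution = fit_distribution(training_images)
rng = random.Random(0)
new_images = sample_images(distribution, 1000, rng)

# The dark pair should turn up far more often than the bright one.
print(new_images.count((10, 12)), new_images.count((250, 245)))
```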
The same analysis holds for more realistic grayscale photographs with, say, a million pixels each. Only now, plotting each image requires not two axes, but a million. The probability distribution over such images will be some complex million-plus-one-dimensional surface. If you sample that distribution, you’ll produce a million pixel values. Print those pixels on a sheet of paper, and the image will likely look like a photo from the original data set.
The challenge of generative modeling is to learn this complicated probability distribution for some set of images that constitute training data. The distribution is useful partly because it captures extensive information about the data, and partly because researchers can combine probability distributions over different types of data (such as text and images) to compose surreal outputs, such as a goldfish slurping Coca-Cola on a beach. “You can mix and match different concepts … to create entirely new scenarios that were never seen in training data,” said Anandkumar.
In 2014, a model called a generative adversarial network (GAN) became the first to produce realistic images. “There was so much excitement,” said Anandkumar. But GANs are hard to train: They may not learn the full probability distribution and can get locked into producing images from only a subset of the distribution. For example, a GAN trained on images of a variety of animals may generate only pictures of dogs.
Machine learning needed a more robust model. Jascha Sohl-Dickstein, whose work was inspired by physics, would provide one.
Blobs of Excitement
Around the time GANs were invented, Sohl-Dickstein was a postdoc at Stanford University working on generative models, with a side interest in nonequilibrium thermodynamics. This branch of physics studies systems not in thermal equilibrium — those that exchange matter and energy internally and with their environment.
An illustrative example is a drop of blue ink diffusing through a container of water. At first, it forms a dark blob in one spot. At this point, if you want to calculate the probability of finding a molecule of ink in some small volume of the container, you need a probability distribution that cleanly models the initial state, before the ink begins spreading. But this distribution is complex and thus hard to sample from.
Eventually, however, the ink diffuses throughout the water, making it pale blue. This leads to a much simpler, more uniform probability distribution of molecules that can be described with a straightforward mathematical expression. Nonequilibrium thermodynamics describes the probability distribution at each step in the diffusion process. Crucially, each step is reversible — with small enough steps, you can go from a simple distribution back to a complex one.
Sohl-Dickstein used the principles of diffusion to develop an algorithm for generative modeling. The idea is simple: The algorithm first turns complex images in the training data set into simple noise — akin to going from a blob of ink to diffuse light blue water — and then teaches the system how to reverse the process, turning noise into images.
Here’s how it works. First, the algorithm takes an image from the training set. As before, let’s say that each of the million pixels has some value, and we can plot the image as a dot in million-dimensional space. The algorithm adds some noise to each pixel at every time step, equivalent to the diffusion of ink after one small time step. As this process continues, the values of the pixels bear less of a relationship to their values in the original image, and the pixels look more like a simple noise distribution. (The algorithm also nudges each pixel value a smidgen toward the origin, the zero value on all those axes, at each time step. This nudge prevents pixel values from growing too large for computers to easily work with.)
Do this for all images in the data set, and an initial complex distribution of dots in million-dimensional space (which cannot be described and sampled from easily) turns into a simple, normal distribution of dots around the origin.
“The sequence of transformations very slowly turns your data distribution into just a big noise ball,” said Sohl-Dickstein. This “forward process” leaves you with a distribution you can sample from with ease.
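The forward process can be sketched in a few lines. The version below assumes one common form of the update, in which each pixel is multiplied by a factor slightly below 1 (the nudge toward the origin) before Gaussian noise is added. The noise level `beta`, the number of steps, and the rescaling of pixel values to the range -1 to 1 are all assumptions made for this example, not details taken from the article.

```python
import math
import random

def forward_step(pixels, beta, rng):
    """One forward step: nudge every pixel toward zero, then add Gaussian noise."""
    keep = math.sqrt(1.0 - beta)   # slightly less than 1, so values shrink toward the origin
    spread = math.sqrt(beta)       # standard deviation of the noise added in this step
    return [keep * p + spread * rng.gauss(0.0, 1.0) for p in pixels]

def forward_process(pixels, num_steps, beta, rng):
    """Apply the forward step repeatedly and return the final noisy pixels."""
    for _ in range(num_steps):
        pixels = forward_step(pixels, beta, rng)
    return pixels

def mean_and_variance(values):
    """Return the average of the values and their average squared deviation from it."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

rng = random.Random(0)
image = [0.9] * 2000   # hypothetical all-bright image, pixels rescaled to [-1, 1]
noisy = forward_process(image, num_steps=500, beta=0.02, rng=rng)
noisy_mean, noisy_variance = mean_and_variance(noisy)
print(noisy_mean, noisy_variance)
```

After enough steps, the pixel values should have an average near 0 and a variance near 1, regardless of the image you started with. That is the "big noise ball."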
Next is the machine learning part: Give a neural network the noisy images obtained from a forward pass and train it to predict the less noisy images that came one step earlier. It’ll make mistakes at first, so you tweak the parameters of the network so it does better. Eventually, the neural network can reliably turn a noisy image, which is representative of a sample from the simple distribution, all the way into an image representative of a sample from the complex distribution.
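A heavily simplified sketch of that training loop follows. It assumes one-pixel "images," a single noise level, and a two-parameter linear model (`w` and `c`) standing in for the neural network. Real diffusion models train deep networks across many noise levels, so treat this only as an illustration of "predict the cleaner version, measure the mistake, tweak the parameters."

```python
import math
import random

rng = random.Random(0)
beta = 0.2   # assumed noise level for a single forward step

# Hypothetical one-pixel "images" and their noisier versions one step later.
clean = [rng.gauss(3.0, 0.5) for _ in range(500)]
noisy = [math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
         for x in clean]

def mean_squared_error(w, c):
    """Average squared mistake when predicting each clean value as w * noisy + c."""
    return sum((w * n + c - x) ** 2 for n, x in zip(noisy, clean)) / len(clean)

w, c = 0.0, 0.0          # untrained parameters
learning_rate = 0.05
initial_loss = mean_squared_error(w, c)

for _ in range(500):
    # Gradient of the mean squared error with respect to w and c.
    grad_w = sum(2.0 * (w * n + c - x) * n for n, x in zip(noisy, clean)) / len(clean)
    grad_c = sum(2.0 * (w * n + c - x) for n, x in zip(noisy, clean)) / len(clean)
    # Tweak the parameters a little in the direction that reduces the mistake.
    w -= learning_rate * grad_w
    c -= learning_rate * grad_c

final_loss = mean_squared_error(w, c)
print(initial_loss, final_loss)
```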
The trained network is a full-blown generative model. Now you don’t even need an original image on which to do a forward pass: You have a full mathematical description of the simple distribution, so you can sample from it directly. The neural network can turn this sample — essentially just static — into a final image that resembles an image in the training data set.
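Here is a toy sketch of generation that starts from static and runs the steps in reverse. To stay self-contained it uses no trained network at all: it assumes the "training data" are one-pixel images drawn from a bell curve, a case in which the ideal denoising direction is known in closed form, and the function `ideal_slope` stands in for what a network would have to learn. The constants, the function names and the specific reverse update (one common form) are assumptions for this example.

```python
import math
import random

DATA_MEAN, DATA_STD = 3.0, 0.5   # assumed "training data": one-pixel images ~ N(3, 0.5^2)
BETA, NUM_STEPS = 0.02, 300      # assumed noise level, the same at every step

def noised_mean_and_variance(t):
    """Mean and variance of the data after t forward steps (closed form for Gaussian data)."""
    signal = (1.0 - BETA) ** t   # fraction of the original signal's variance that survives
    return math.sqrt(signal) * DATA_MEAN, signal * DATA_STD ** 2 + (1.0 - signal)

def ideal_slope(x, t):
    """Slope of the log-density of step-t data; stands in for a trained network's output."""
    mean, variance = noised_mean_and_variance(t)
    return -(x - mean) / variance

def reverse_step(x, t, rng):
    """Undo one forward step: move x along the slope, rescale, then add a little noise."""
    denoised = (x + BETA * ideal_slope(x, t)) / math.sqrt(1.0 - BETA)
    noise = math.sqrt(BETA) * rng.gauss(0.0, 1.0) if t > 1 else 0.0   # no noise on the last step
    return denoised + noise

def generate(rng):
    """Sample static from the simple distribution, then walk it back toward the data."""
    x = rng.gauss(0.0, 1.0)
    for t in range(NUM_STEPS, 0, -1):
        x = reverse_step(x, t, rng)
    return x

rng = random.Random(0)
samples = [generate(rng) for _ in range(1000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(sample_mean, sample_std)
```

The generated values should cluster around 3 with a spread close to 0.5, even though every one of them started as a sample of pure noise centered on 0.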
Sohl-Dickstein recalls the first outputs of his diffusion model. “You’d squint and be like, ‘I think that colored blob looks like a truck,’” he said. “I’d spent so many months of my life staring at different patterns of pixels and trying to see structure that I was like, ‘This is way more structured than I’d ever gotten before.’ I was very excited.”
Envisioning the Future
Sohl-Dickstein published his diffusion model algorithm in 2015, but it was still far behind what GANs could do. While diffusion models could sample over the entire distribution and never get stuck spitting out only a subset of images, the images looked worse, and the process was much too slow. “I don’t think at the time this was seen as exciting,” said Sohl-Dickstein.
It would take two students, neither of whom knew Sohl-Dickstein or each other, to connect the dots from this initial work to modern-day diffusion models like DALL·E 2. The first was Song, a doctoral student at Stanford at the time. In 2019, he and his adviser published a novel method for building generative models that didn’t estimate the probability distribution of the data (the high-dimensional surface). Instead, it estimated the gradient of the distribution (think of it as the slope of the high-dimensional surface).
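To see why the slope alone is enough to generate data, consider the sketch below. It assumes a one-dimensional bell-curve distribution whose slope (strictly, the gradient of the logarithm of the density) can be written down exactly, and it samples with a standard technique known as Langevin dynamics: repeatedly step uphill along the slope while adding a little noise. In Song's method the slope was estimated by a neural network. The closed-form `slope` function and all the constants here are assumptions for illustration.

```python
import math
import random

DATA_MEAN, DATA_STD = 3.0, 0.5   # assumed data distribution: N(3, 0.5^2)

def log_density(x):
    """Logarithm of the bell-curve density: the height of the surface on a log scale."""
    return -0.5 * ((x - DATA_MEAN) / DATA_STD) ** 2 - math.log(DATA_STD * math.sqrt(2.0 * math.pi))

def slope(x):
    """Gradient of the log-density; it points toward where data is more likely."""
    return -(x - DATA_MEAN) / DATA_STD ** 2

def langevin_sample(rng, num_steps=500, step_size=0.01):
    """Follow the slope uphill while injecting noise, so samples spread over the distribution."""
    x = rng.gauss(0.0, 1.0)   # start from a point drawn from a simple distribution
    for _ in range(num_steps):
        x += 0.5 * step_size * slope(x) + math.sqrt(step_size) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [langevin_sample(rng) for _ in range(1000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(sample_mean, sample_std)
```

Note that the sampler never evaluates `log_density` itself. Only the slope is needed, which is the point of the approach.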
Song found his technique worked best if he first perturbed each image in the training data set with increasing levels of noise, then asked his neural network to predict the original image using gradients of the distribution, effectively denoising it. Once trained, his neural network could take a noisy image sampled from a simple distribution and progressively turn that back into an image representative of the training data set. The image quality was great, but his machine learning model was painfully slow to sample. And he did this with no knowledge of Sohl-Dickstein’s work. “I was not aware of diffusion models at all,” said Song. “After our 2019 paper was published, I received an email from Jascha. He pointed out to me that [our models] have very strong connections.”
In 2020, the second student saw those connections and realized that Song’s work could improve Sohl-Dickstein’s diffusion models. Jonathan Ho had recently finished his doctoral work on generative modeling at the University of California, Berkeley, but he continued working on it. “I thought it was the most mathematically beautiful subdiscipline of machine learning,” he said.
Ho redesigned and updated Sohl-Dickstein’s diffusion model with some of Song’s ideas and other advances from the world of neural networks. “I knew that in order to get the community’s attention, I needed to make the model generate great-looking samples,” he said. “I was convinced that this was the most important thing I could do at the time.”
His intuition was spot on. Ho and his colleagues announced this new and improved diffusion model in 2020, in a paper titled “Denoising Diffusion Probabilistic Models.” It quickly became such a landmark that researchers now refer to it simply as DDPM. According to one benchmark of image quality — which compares the distribution of generated images to the distribution of training images — these models matched or surpassed all competing generative models, including GANs. It wasn’t long before the big players took notice. Now, DALL·E 2, Stable Diffusion, Imagen and other commercial models all use some variation of DDPM.
Modern diffusion models have one more key ingredient: large language models (LLMs), such as GPT-3. These are generative models trained on text from the internet to learn probability distributions over words instead of images. In 2021, Ho — now a research scientist at a stealth company — and his colleague Tim Salimans at Google Research, along with other teams elsewhere, showed how to combine information from an LLM and an image-generating diffusion model to use text (say, “goldfish slurping Coca-Cola on a beach”) to guide the process of diffusion and hence image generation. This process of “guided diffusion” is behind the success of text-to-image models, such as DALL·E 2.
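The article does not spell out how the guiding works. One widely used recipe, sketched here under that caveat, has the denoising network make two noise predictions at each step, one with the text prompt and one without, and then extrapolates from the prompt-free prediction toward the prompted one. The function name `guided_noise`, the `guidance_scale` parameter and the numbers below are hypothetical.

```python
def guided_noise(unconditional, conditional, guidance_scale):
    """Blend two noise predictions; scales above 1 push harder toward the text prompt."""
    return [u + guidance_scale * (c - u) for u, c in zip(unconditional, conditional)]

# Hypothetical per-pixel noise predictions from the same network, without and with a prompt.
noise_without_prompt = [0.10, -0.20, 0.05]
noise_with_prompt = [0.30, -0.10, 0.00]

# A scale of 1 simply returns the prompted prediction.
plain = guided_noise(noise_without_prompt, noise_with_prompt, guidance_scale=1.0)

# A larger scale exaggerates whatever the prompt changed.
strong = guided_noise(noise_without_prompt, noise_with_prompt, guidance_scale=3.0)
print(plain, strong)
```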
“They are way beyond my wildest expectations,” said Ho. “I’m not going to pretend I saw all this coming.”
As successful as these models have been, images from DALL·E 2 and its ilk are still far from perfect. Large language models can reflect cultural and societal biases, such as racism and sexism, in the text they generate. That’s because they are trained on text taken off the internet, and often such texts contain racist and sexist language. LLMs that learn a probability distribution over such text become imbued with the same biases. Diffusion models are also trained on uncurated images taken off the internet, which can contain similarly biased data. It’s no wonder that combining LLMs with today’s diffusion models can sometimes result in images reflective of society’s ills.
Anandkumar has firsthand experience. When she tried to generate stylized avatars of herself using a diffusion model–based app, she was shocked. “So [many] of the images were highly sexualized,” she said, “whereas the things that it was presenting to men weren’t.” She’s not alone.
These biases can be lessened by curating and filtering the data (an extremely difficult task, given the immensity of the data set), or by putting checks on both the input prompts and the outputs of these models. “Of course, nothing is a substitute for carefully and extensively safety-testing” a model, Ho said. “This is an important challenge for the field.”
Despite such concerns, Anandkumar believes in the power of generative modeling. “I really like Richard Feynman’s quote: ‘What I cannot create, I do not understand,’” she said. An increased understanding has enabled her team to develop generative models to produce, for example, synthetic training data of under-represented classes for predictive tasks, such as darker skin tones for facial recognition, helping improve fairness. Generative models may also give us insights into how our brains deal with noisy inputs, or how they conjure up mental imagery and contemplate future action. And building more sophisticated models could endow AIs with similar capabilities.
“I think we are just at the beginning of the possibilities of what we can do with generative AI,” said Anandkumar.