Researchers Uncover Hidden Ingredients Behind AI Creativity

Introduction
We were once promised self-driving cars and robot maids. Instead, we’ve seen the rise of artificial intelligence systems that can beat us in chess, analyze huge reams of text and compose sonnets. This has been one of the great surprises of the modern era: physical tasks that are easy for humans turn out to be very difficult for robots, while algorithms are increasingly able to mimic our intellect.
Another surprise that has long perplexed researchers is those algorithms’ knack for their own, strange kind of creativity.
Diffusion models, the backbone of image-generating tools such as DALL·E, Imagen and Stable Diffusion, are designed to generate carbon copies of the images on which they’ve been trained. In practice, however, they seem to improvise, blending elements within images to create something new — not just nonsensical blobs of color, but coherent images with semantic meaning. This is the “paradox” behind diffusion models, said Giulio Biroli, an AI researcher and physicist at the École Normale Supérieure in Paris: “If they worked perfectly, they should just memorize,” he said. “But they don’t — they’re actually able to produce new samples.”
To generate images, diffusion models use a process known as denoising. They convert an image into digital noise (an incoherent collection of pixels), then reassemble it. It’s like repeatedly putting a painting through a shredder until all you have left is a pile of fine dust, then patching the pieces back together. For years, researchers have wondered: If the models are just reassembling, then how does novelty come into the picture? It’s like reassembling your shredded painting into a completely new work of art.
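To make the shredder analogy concrete, here is a minimal Python sketch of the two halves of the process. It is purely illustrative, not any real model's code: the toy "score" in the reverse pass cheats by peeking at the original image, standing in for what a trained network must estimate from data.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "image": an 8x8 grid of pixel intensities.
image = rng.random((8, 8))

# Forward pass (the shredder): mix in a little Gaussian noise at
# each step until almost nothing of the original remains.
beta, num_steps = 0.1, 50
noisy = image.copy()
for _ in range(num_steps):
    noisy = np.sqrt(1 - beta) * noisy + np.sqrt(beta) * rng.normal(size=noisy.shape)

# Reverse pass (denoising): at each step, a trained model estimates
# which direction in pixel space makes the image more probable (the
# score) and takes a small step that way. Here the "score" cheats by
# peeking at the original image -- a placeholder for what a real
# network must learn from training data.
restored = noisy.copy()
for _ in range(num_steps):
    fake_score = image - restored      # placeholder for a learned score
    restored += 0.1 * fake_score       # small step toward higher probability

print(np.abs(restored - image).mean())  # shrinks toward zero
```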
Now two physicists have made a startling claim: It's the technical imperfections in the denoising process itself that lead to the creativity of diffusion models. In a paper that will be presented at the International Conference on Machine Learning 2025, the duo developed a mathematical model of trained diffusion models to show that their so-called creativity is in fact a deterministic process, a direct and inevitable consequence of their architecture.
By illuminating the black box of diffusion models, the new research could have big implications for future AI research — and perhaps even for our understanding of human creativity. “The real strength of the paper is that it makes very accurate predictions of something very nontrivial,” said Luca Ambrogioni, a computer scientist at Radboud University in the Netherlands.
Bottoms Up
Mason Kamb, a graduate student studying applied physics at Stanford University and the lead author of the new paper, has long been fascinated by morphogenesis: the processes by which living systems self-assemble.
One way to understand the development of embryos in humans and other animals is through what’s known as a Turing pattern, named after the 20th-century mathematician Alan Turing. Turing patterns explain how groups of cells can organize themselves into distinct organs and limbs. Crucially, this coordination all takes place at a local level. There’s no CEO overseeing the trillions of cells to make sure they all conform to a final body plan. Individual cells, in other words, don’t have some finished blueprint of a body on which to base their work. They’re just taking action and making corrections in response to signals from their neighbors. This bottom-up system usually runs smoothly, but every now and then it goes awry — producing hands with extra fingers, for example.
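Turing's actual equations describe diffusing chemicals, but the spirit of a bottom-up system can be seen in an even simpler toy, a one-dimensional cellular automaton. This is an analogy rather than a model of morphogenesis: each cell updates from its immediate neighbors alone, with no blueprint anywhere, yet a rich global pattern emerges that no single cell planned.

```python
import numpy as np

# Rule 30: each cell's next state depends only on itself and its
# two neighbors -- a purely local rule, no global blueprint.
n, steps = 31, 15
cells = np.zeros(n, dtype=int)
cells[n // 2] = 1                      # a single "seed" cell

for _ in range(steps):
    print("".join("#" if c else "." for c in cells))
    left, right = np.roll(cells, 1), np.roll(cells, -1)
    cells = left ^ (cells | right)     # Wolfram's rule 30
```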
When the first AI-generated images started cropping up online, many looked like surrealist paintings, depicting humans with extra fingers. These immediately made Kamb think of morphogenesis: “It smelled like a failure you’d expect from a [bottom-up] system,” he said.
AI researchers knew by that point that diffusion models take a couple of technical shortcuts when generating images. The first is known as locality: They only pay attention to a single group, or “patch,” of pixels at a time. The second is that they adhere to a strict rule when generating images: If you shift an input image by just a couple of pixels in any direction, for example, the system will automatically adjust to make the same change in the image it generates. This feature, called translational equivariance, is the model’s way of preserving coherent structure; without it, it’s much more difficult to create realistic images.
In part because of these features, diffusion models don’t pay any attention to where a particular patch will fit into the final image. They just focus on generating one patch at a time and then automatically fit them into place using a mathematical model known as a score function, which can be thought of as a digital Turing pattern.
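A short Python sketch, with sizes and filter values chosen arbitrarily for illustration, can make these two properties concrete. The convolution below is the kind of patch-wise, position-blind computation these models rely on: each output pixel depends only on a small local window (locality), the same rule is applied at every position (weight sharing), and, as the final check confirms, shifting the input shifts the output identically (translational equivariance).

```python
import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(1)
image = rng.random((16, 16))

# A small filter applied identically at every location: this is
# locality (each output pixel sees only a 3x3 patch) plus weight
# sharing, which together give translational equivariance.
kernel = rng.random((3, 3))

def local_op(img):
    # 'wrap' (periodic) boundaries make the equivariance exact.
    return convolve(img, kernel, mode="wrap")

shifted_then_filtered = local_op(np.roll(image, shift=(2, 5), axis=(0, 1)))
filtered_then_shifted = np.roll(local_op(image), shift=(2, 5), axis=(0, 1))

# Shifting the input shifts the output by exactly the same amount.
print(np.allclose(shifted_then_filtered, filtered_then_shifted))  # True
```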
Researchers long regarded locality and equivariance as mere limitations of the denoising process, technical quirks that prevented diffusion models from creating perfect replicas of images. They didn’t associate them with creativity, which was seen as a higher-order phenomenon.
They were in for another surprise.
Made Locally
Kamb started his graduate work in 2022 in the lab of Surya Ganguli, a physicist at Stanford who also has appointments in neurobiology and electrical engineering. OpenAI released ChatGPT the same year, causing a surge of interest in the field now known as generative AI. As tech developers worked on building ever-more-powerful models, many academics remained fixated on understanding the inner workings of these systems.
Mason Kamb (left) and Surya Ganguli found that the creativity in diffusion models is a consequence of their architecture.
Charles Yang (left)
To that end, Kamb eventually developed a hypothesis that locality and equivariance lead to creativity. That raised a tantalizing experimental possibility: If he could devise a system that did nothing but optimize for locality and equivariance, it should behave like a diffusion model. This experiment was at the heart of his new paper, which he wrote with Ganguli.
Kamb and Ganguli call their system the equivariant local score (ELS) machine. It is not a trained diffusion model but rather a set of equations that analytically predicts the composition of denoised images based solely on the mechanics of locality and equivariance. They then took a series of images that had been converted to digital noise and ran them through both the ELS machine and a number of powerful trained diffusion models built on standard architectures such as ResNets and UNets.
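The paper's actual equations account for details, such as the time-dependent noise level, that are beyond the scope of this article, but the flavor of the idea can be sketched in a few lines of Python. In this heavily simplified rendering (the patch size, noise scale and Gaussian weighting are illustrative assumptions, not the paper's exact construction), each pixel of the denoised image is predicted from its local noisy patch alone, as a similarity-weighted vote over patches harvested from every location of every training image.

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny stand-in "training set": a few small grayscale images.
train_images = rng.random((5, 8, 8))
P = 3          # patch size: the local window each prediction sees
sigma = 0.5    # noise scale; the paper's equations vary this over time

def patches(img, p=P):
    """All p-by-p patches of an image, one per pixel, wrap-around edges."""
    H, W = img.shape
    tiled = np.pad(img, p // 2, mode="wrap")
    return np.array([tiled[i:i + p, j:j + p]
                     for i in range(H) for j in range(W)])

# Pool patches from every location of every training image --
# equivariance means their original positions are forgotten.
train_patches = np.concatenate([patches(im) for im in train_images])
train_centers = train_patches[:, P // 2, P // 2]

def els_denoise_pixel(noisy_patch):
    """Predict one clean pixel from its local noisy patch alone:
    a similarity-weighted vote over all training patches."""
    d2 = ((train_patches - noisy_patch) ** 2).sum(axis=(1, 2))
    # Subtracting the minimum is a numerical-stability trick; the
    # constant factor cancels in the normalized average below.
    w = np.exp(-(d2 - d2.min()) / (2 * sigma ** 2))
    return (w * train_centers).sum() / w.sum()

# Denoise a noisy version of one image, patch by patch, same rule everywhere.
noisy = train_images[0] + sigma * rng.normal(size=(8, 8))
denoised = np.array([els_denoise_pixel(q) for q in patches(noisy)]).reshape(8, 8)
print(np.abs(denoised - train_images[0]).mean())  # typically well below sigma
```

Because every pixel is filled in from local evidence only, with positions forgotten, such a machine is free to stitch together patches drawn from different training images, and that mixing is where novelty can enter.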
The results were "shocking," Ganguli said. Across the board, the ELS machine matched the outputs of the trained diffusion models with an average accuracy of 90%, a result he called "unheard of in machine learning."
The results appear to support Kamb's hypothesis. "As soon as you impose locality, [creativity] was automatic; it fell out of the dynamics completely naturally," he said. The very mechanisms that constrain diffusion models' window of attention during the denoising process, forcing them to focus on individual patches with no sense of where those patches will ultimately fit in the final product, are the same ones that enable their creativity, he found. The extra-fingers phenomenon seen in diffusion models was likewise a direct by-product of this hyperfixation on generating local patches of pixels without any broader context.
Experts interviewed for this story generally agreed that although Kamb and Ganguli’s paper illuminates the mechanisms behind creativity in diffusion models, much remains mysterious. For example, large language models and other AI systems also appear to display creativity, but they don’t harness locality and equivariance.
“I think this is a very important part of the story,” Biroli said, “[but] it’s not the whole story.”
Creating Creativity
For the first time, researchers have shown how the creativity of diffusion models can be thought of as a by-product of the denoising process itself, one that can be formalized mathematically and predicted with an unprecedented degree of accuracy. It's almost as if neuroscientists had put a group of human artists into an MRI machine and found a common neural mechanism behind their creativity that could be written down as a set of equations.
The comparison to neuroscience may go beyond mere metaphor: Kamb and Ganguli’s work could also provide insight into the black box of the human mind. “Human and AI creativity may not be so different,” said Ben Hoover, a machine learning researcher at the Georgia Institute of Technology who studies diffusion models. “We assemble things based on what we experience, what we’ve dreamed, what we’ve seen, heard or desire. AI is also just assembling the building blocks from what it’s seen and what it’s asked to do.” Both human and artificial creativity, according to this view, could be fundamentally rooted in an incomplete understanding of the world: We’re all doing our best to fill in the gaps in our knowledge, and every now and then we generate something that’s both new and valuable. Perhaps this is what we call creativity.