SeedAI Crash Course is a series designed to break down emerging and complex AI topics for non-technical and policy audiences. Is there a topic you’d like a Crash Course on? Let us know!
Some people may know that the “T” in ChatGPT stands for “transformer.” And while it’s true that transformer-based AI models have been a major driver of the generative AI boom, another class of models, built on a technique called “diffusion,” is an important building block of the generative AI ecosystem.
Diffusion-based AI models came to prominence with image generation models like DALL-E 2 and Stable Diffusion in 2022. Now, researchers are pushing them in new and exciting directions. Understanding the evolution of this class of technology is important as it has significant implications for the speed and cost of AI, as well as downstream applications like scientific research.
What is diffusion?
At the most basic level, diffusion models generate new data from random noise. Picture TV static slowly turning into a crisp photo. A diffusion model learns to run that process in reverse: take random noise and, step by step, remove the noise until a realistic output appears. During training, clean examples are corrupted with a tiny bit of noise at a time, and the model practices undoing each of those small steps. Once it can reliably take noise out, it can start from pure static and work backward to a brand-new output. Because the model edits the whole canvas at once, it can fix mistakes as it goes and keep the big picture consistent.
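For readers who want to peek under the hood, the short Python sketch below is a toy illustration of this loop, not the code behind any real product: a handful of clean one-dimensional examples are gradually corrupted with noise, and a hand-coded stand-in for the learned denoiser then walks pure static back toward realistic values, one small step at a time.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny "dataset": clean one-dimensional examples clustered around -2 and +2.
clean_data = np.concatenate([rng.normal(-2, 0.1, 500), rng.normal(2, 0.1, 500)])

T = 50                 # number of noising / denoising steps
noise_per_step = 0.3   # how much noise is added at each forward step

def add_noise(x):
    """Forward process: corrupt examples one small step at a time."""
    steps = [x]
    for _ in range(T):
        x = x + rng.normal(0, noise_per_step, size=np.shape(x))
        steps.append(x)
    return steps       # the last entry looks like pure static

def denoise_step(x_noisy):
    """Stand-in for the learned denoiser. A real diffusion model uses a
    trained neural network here; this toy just nudges each noisy value
    toward the nearer of the two clusters the clean data came from."""
    targets = np.where(x_noisy < 0, -2.0, 2.0)
    return x_noisy + 0.1 * (targets - x_noisy)

def generate(n_samples):
    """Reverse process: start from pure noise and denoise step by step."""
    x = rng.normal(0, noise_per_step * np.sqrt(T), size=n_samples)
    for _ in range(T):
        x = denoise_step(x)
    return x

noised = add_noise(clean_data[:3])
print(np.round(noised[0], 2), "->", np.round(noised[-1], 2))  # clean -> static
print(np.round(generate(5), 2))  # new samples land near -2 or +2, like the data
```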
Diffusion models are not new: they were first introduced in 2015 by Stanford researcher Jascha Sohl-Dickstein and colleagues. Importantly, “diffusion” refers to the generative process (iterative denoising), while the underlying architecture of the AI model can vary. Diffusion models can be built on convolutional neural networks (often specifically “U-Nets”), graph neural networks, and even transformers, sometimes in combination with other architectures.
How is diffusion used?
Audiovisual content
The “denoise toward an answer” idea made diffusion a natural fit for generating audiovisual content, especially images and video, which is what catapulted models like DALL-E 2 and Stable Diffusion into the public eye. Both are text-to-image models, meaning they produce an image from a written description of the desired result. Diffusion has similarly been applied to models that generate video and audio, and researchers continue to make steady progress in advancing its capabilities in audiovisual domains. The music generation service Suno uses a mix of diffusion and transformer-based models, and Google DeepMind’s Veo 3 model uses a technique called latent diffusion to produce high-quality video.
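As a concrete illustration of how text-to-image diffusion models are typically used in practice, the snippet below sketches a call to a publicly available Stable Diffusion checkpoint through the open-source Hugging Face diffusers library; the specific model name and settings are illustrative assumptions, not details of the products named above.

```python
# Illustrative text-to-image generation with the open-source Hugging Face
# `diffusers` library. The model name below is one publicly released
# Stable Diffusion checkpoint; availability and names can change over time.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")   # image generation is far faster on a GPU

image = pipe(
    prompt="a watercolor painting of a lighthouse at sunrise",
    num_inference_steps=30,   # how many denoising steps to run
    guidance_scale=7.5,       # how closely to follow the text prompt
).images[0]

image.save("lighthouse.png")
```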
Text generation
In May 2025, Google DeepMind announced Gemini Diffusion, an experimental model that uses diffusion to generate text. Unlike most language models, including those that power Claude and ChatGPT, which generate text one word (or word fragment) at a time, Gemini Diffusion generates an entire block of text and then iteratively “de-noises” it until a final output is produced. This can make longer outputs much faster to produce and helps keep the text coherent and consistent. That matters because certain generative applications, like working with math and code, require internal consistency across large amounts of text to function properly.
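The toy loop below illustrates that block-wise idea; it is not Gemini Diffusion’s actual algorithm. A fixed sentence stands in for what a trained model would predict, but it shows how an entire block can start as noise (placeholder tokens) and be refined in parallel over a few rounds rather than written strictly left to right.

```python
import random

random.seed(0)

# Stand-in for what a trained text-diffusion model would converge toward.
# A real model predicts every token in the block itself; here we simply
# reveal a fixed sentence to illustrate the block-wise refinement loop.
target = "diffusion models refine an entire block of text in parallel".split()

block = ["[MASK]"] * len(target)   # start with a fully "noisy" block
round_num = 0

while "[MASK]" in block:
    round_num += 1
    # Each round, "de-noise" roughly half of the remaining masked positions.
    masked = [i for i, tok in enumerate(block) if tok == "[MASK]"]
    for i in random.sample(masked, max(1, len(masked) // 2)):
        block[i] = target[i]
    print(f"round {round_num}: {' '.join(block)}")
```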
Strategic planning
Researchers from the Korea Advanced Institute of Science & Technology (KAIST) and Mila - Quebec AI Institute have recently developed diffusion models for planning that perform extremely well while using less computation. AI models or agents often need to plan out their actions to achieve a goal when they have only incomplete information. For example, Google DeepMind’s AlphaGo used a technique called Monte Carlo Tree Search (MCTS) to search through the enormous number of possible moves in a game of Go.
However, such approaches can be very computationally demanding. The KAIST and Mila researchers developed a new framework called Monte Carlo Tree Diffusion (MCTD) that combines the generative ability of diffusion models with the flexibility and scalability of tree search. Early results suggest that MCTD can outperform existing planning methods while improving speed and scalability.
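The sketch below shows the general intuition behind diffusion-based planners, not the MCTD algorithm itself: rather than exploring one move at a time the way tree search does, the planner treats the entire trajectory as the object being denoised, starting from a random path and nudging every waypoint toward a workable route at once (with a hand-coded stand-in for the learned denoiser).

```python
import numpy as np

rng = np.random.default_rng(0)

start, goal = np.array([0.0, 0.0]), np.array([10.0, 10.0])
n_waypoints = 8

# Begin with a completely random ("noisy") trajectory between start and goal.
plan = rng.normal(5.0, 4.0, size=(n_waypoints, 2))

# A trained diffusion planner would learn which trajectories are feasible;
# this hand-coded stand-in just pulls waypoints toward an evenly spaced path.
target = np.linspace(start, goal, n_waypoints + 2)[1:-1]

for _ in range(20):
    plan = plan + 0.2 * (target - plan)                 # denoise the WHOLE plan at once
    plan = plan + rng.normal(0, 0.05, size=plan.shape)  # a little residual noise

print(np.round(plan, 1))  # waypoints now trace a smooth path from start to goal
```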
Scientific research
While consumer-facing applications have drawn most of the attention to diffusion’s generative abilities in recent years, researchers have also leveraged diffusion for a variety of highly valuable scientific applications.
In 2023, researchers at MIT introduced an AI model called DiffDock, which uses diffusion to predict how a drug-like molecule might bind to a protein by denoising over possible 3D poses. DiffDock delivered highly accurate predictions while running up to 12 times faster than state-of-the-art methods, potentially speeding up biomedical research processes like drug discovery.
Also in 2023, researchers at Google DeepMind and UC Berkeley developed UniMat, a diffusion model that represents crystal structures on a periodic-table grid, capturing chemical regularities that graph-based methods often miss. This formulation scaled to millions of candidates and produced materials validated as stable in physics-based tests, offering a more effective route for computational materials discovery.
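To give a rough sense of what “representing crystal structures on a periodic-table grid” means, the sketch below encodes a material as a grid with one cell per chemical element, each cell holding the 3D coordinates of that element’s atoms. This is a simplified illustration of the idea, not UniMat’s exact data format.

```python
import numpy as np

ROWS, COLS = 9, 18      # rows and columns of the periodic table
MAX_ATOMS = 4           # cap on atoms of any one element in the unit cell
EMPTY = -1.0            # sentinel meaning "no atom in this slot"

# One material = a periodic-table-shaped grid of atom coordinates.
grid = np.full((ROWS, COLS, MAX_ATOMS, 3), EMPTY)

def place(grid, row, col, positions):
    """Write an element's atom coordinates into its periodic-table cell."""
    for i, xyz in enumerate(positions):
        grid[row, col, i] = xyz

# Toy example, rock salt (NaCl): sodium at the cell corner, chlorine at the center.
place(grid, row=2, col=0,  positions=[[0.0, 0.0, 0.0]])   # Na: period 3, group 1
place(grid, row=2, col=16, positions=[[0.5, 0.5, 0.5]])   # Cl: period 3, group 17

# A diffusion model trained on grids like this can then start from noise and
# denoise its way to coordinate grids describing new candidate materials.
print(grid.shape)       # (9, 18, 4, 3)
```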
Looking ahead
As the public and private sectors alike familiarize themselves with generative AI, it is important to understand that AI is far more than the transformer-based LLMs that have made ChatGPT a household name. Diffusion techniques are an increasingly powerful tool for AI researchers, helping unlock capabilities that range from image creation for entertainment to the development of breakthrough new drugs. Policymakers should ensure that any efforts to support the American AI Stack acknowledge the wide range of technologies, architectures, and techniques involved in powering AI.