Lesson 19: Deep Learning Foundations to Stable Diffusion

Jeremy Howard

In this video, Jeremy discusses what Dropout is and shows how to implement it in PyTorch. He also introduces a callback that keeps Dropout layers in training mode all the time, even at evaluation. The presenters then move on to diffusion models and the forward and backward processes involved in generative modeling, with a focus on denoising diffusion probabilistic models (DDPM). The simplified training objective used to train the diffusion model is also explained, including the epsilon-theta function used to predict the noise added to images. Lastly, they discuss the use of Greek letters in code before moving on to how the model predicts the amount of noise in an image.

00:00:00

In this section, Jeremy gives an update on the Fashion-MNIST challenge, saying that a user on the forum, Christopher Thomas, got better results for the challenge using Dropout. Additionally, a user named Piotr pointed out a bug in the ResBlock code which, once fixed, improved the results further. Jeremy then explains what Dropout is and how it works, showing a Dropout class that takes the probability of dropping out random activations, samples a binomial (0/1) mask with that probability, and multiplies it by the input tensor, with the whole process performed only during training. With a dropout probability of 0.1, the activations would on average end up about one tenth smaller, which is why the result is rescaled to keep the expected magnitude of the activations unchanged.
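A minimal sketch of the kind of Dropout layer described, assuming the standard inverted-dropout rescaling; it uses torch.bernoulli to draw the 0/1 mask rather than the binomial distribution object mentioned above:

```python
import torch
from torch import nn

class Dropout(nn.Module):
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training: return x                       # only drop activations during training
        # keep each activation with probability 1-p (a 0/1 binomial mask)
        mask = torch.bernoulli(torch.full_like(x, 1 - self.p))
        # rescale so the expected magnitude of the activations is unchanged
        return x * mask / (1 - self.p)
```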

00:05:00

In this section of the video, Jeremy discusses the concept of dropout and shows how to implement it in PyTorch. He demonstrates how dropout can prevent overfitting and improve model performance by adding randomness to the activations. He also explains the difference between Dropout and Dropout2d, which drops out an entire grid area instead of individual elements. Finally, he mentions a callback he wrote for test time dropout, which allows for using dropout during inference for test time augmentation.

00:10:00

In this section, the instructor discusses a callback that keeps the model's Dropout layers in training mode all the time, even at evaluation. Although it is unlikely to give better results, it provides a sense of the model's confidence: running inference several times produces different predictions, and a wide spread suggests the model has no idea. While this technique has been used in medical models, it is less popular and not well-known outside the medical world. However, it is an interesting idea worth exploring and studying further. Following this, the instructor hands over to Tanishq to discuss how to put together a diffusion model from scratch using the miniai library, starting from the paper "Denoising Diffusion Probabilistic Models" from UC Berkeley.
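A minimal sketch of the idea (not the exact callback from the lesson): put only the Dropout modules of an otherwise eval-mode model back into training mode, run several stochastic forward passes, and use the spread of the predictions as a rough confidence signal. The helper names here are illustrative.

```python
import torch
from torch import nn

def enable_test_time_dropout(model: nn.Module):
    "Put only the Dropout modules of an eval-mode model back into training mode."
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d)): m.train()

def mc_dropout_predict(model, x, n=10):
    "Run several stochastic forward passes; the spread hints at the model's confidence."
    model.eval()
    enable_test_time_dropout(model)
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n)])
    return preds.mean(0), preds.std(0)
```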

00:15:00

In this section, Tanishq and Jeremy discuss the evolution of diffusion models and the introduction of a new model that appears to be better than Stable Diffusion yet doesn't use diffusion at all. They also give an overview of the general idea behind generative modeling, which is to produce more data points like the ones given, such as new images of dogs for an image generation task, based on how likely those data points are to appear in real life. They use the example of height to explain the probability idea, where the x-axis is the height and the y-axis is the probability of encountering someone with that height.

00:20:00

In this section, Tanishq explains generative modeling and the importance of knowing the probability distribution p(x). If we know p(x), we can use it to sample new values and create new generations, rather than selecting uniformly random values. Various techniques exist for generative modeling, including GANs and VAEs, but the popular method now is diffusion models, which involve a forward process that gradually turns an image into pure noise and a reverse process that goes from pure noise back to an image. Tanishq gives a high-level explanation of the directed graphical model used to describe diffusion models.

00:25:00

In this section, Tanishq and Jeremy discuss the transition process in the diffusion model, which involves going from a less noisy image to a more noisy image and, in the reverse process, back again. This forward process is represented by a transition kernel that describes the equations governing the diffusion. In particular, the kernel is a Gaussian distribution whose mean and variance are controlled by a variable beta-t that increases as t increases. As beta-t grows, the variance grows while the mean shrinks, so less of the original image is retained.
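In the standard DDPM notation, the transition kernel being described is a Gaussian whose mean shrinks the previous image and whose variance grows with beta-t:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t\mathbf{I}\right)
```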

00:30:00

In this section, the concept of diffusion models is explained further, with a focus on the iterative process of adding noise: as the timestep increases, the contribution of the original image shrinks while the noise grows. At x1 the mean is still very close to the original image, because beta-t is tiny at early timesteps and so sqrt(1 - beta-t) is close to 1. The section also notes that q(x_t) can be written directly in terms of the original image x0 using the cumulative product alpha-bar-t, rather than stepping through every x(t-1); the contribution from the original image decreases as time goes on while the variance of the noise increases. The reverse process is a neural network learned during training, and only one network is needed because the variance of each reverse step is fixed rather than learned, so the network only has to predict the mean (equivalently, the noise).
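In DDPM notation, the closed form for jumping straight from the clean image to any timestep is:

```latex
\alpha_t = 1-\beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s, \qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)
```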

00:35:00

In this section, Tanishq and Jeremy discuss the simplified training objective used to train the diffusion model. They introduce the epsilon-theta function, the neural network that predicts the noise added to an image, which is trained with a basic mean squared error loss between the actual and the predicted noise. By predicting and then subtracting the noise, the aim is to move the image back towards the original data distribution, which is what makes it possible to generate new data points starting from pure noise.
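The simplified objective from the DDPM paper is a mean squared error between the true noise and the predicted noise on an image noised to a random timestep:

```latex
L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\Big[\big\lVert \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\big)\big\rVert^2\Big]
```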

00:40:00

In this section of the video, the presenters discuss the importance of noise prediction in deep learning and explain the iterative process of moving towards the data distribution. They introduce the Fashion-MNIST dataset and a U-Net, a neural network architecture for image-to-image tasks that takes an input image and outputs another image. They also discuss using an MSE loss and choosing a random timestep to add noise to the image before passing it to the model for prediction.

00:45:00

In this section, the speaker explains how they plotted the beta, sigma, and alphabar schedules from Tanishq's class: beta is created with linspace, sigma is its square root, and alphabar is the cumulative product of 1 minus beta. They show how these schedules change with increasing timesteps and how they determine the amount of noise added to the data. The speakers also discuss the use of callbacks to set up the batch passed into the learner, and the replacement of Greek letters with English spellings in the code.
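A minimal sketch of schedule tensors matching that description; the number of steps and the linspace endpoints are assumptions, not necessarily the values used in the notebook:

```python
import torch

n_steps = 1000                                  # number of diffusion timesteps (assumed value)
beta = torch.linspace(0.0001, 0.02, n_steps)    # linearly increasing noise schedule (assumed endpoints)
sigma = beta.sqrt()                             # standard deviation of the noise added at each step
alpha = 1.0 - beta
alphabar = alpha.cumprod(dim=0)                 # cumulative product of (1 - beta)
```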

00:50:00

In this section of the video, the presenters discuss whether using Greek letters in code is helpful for educational purposes. They agree that having Greek letters can be useful for studying and implementing mathematical equations, especially when matching the math to the code. They then move on to discuss the initialization and before-batch setup for their noise-predicting model, which takes in a noisy image and a timestep and predicts the amount of noise in the image. They explain how they generate the noise target variable, epsilon, and how they add noise to their clean images before passing them into the model for training.
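A minimal sketch of the kind of noising step described, using the schedules from the sketch above; the function and variable names are illustrative, not necessarily those used in the lesson:

```python
import torch

def noisify(x0, alphabar):
    "Pick a random timestep per image, add the matching amount of noise, and return inputs plus target."
    n = x0.shape[0]
    t = torch.randint(0, len(alphabar), (n,))              # one random timestep per image
    eps = torch.randn_like(x0)                             # the noise target, epsilon
    ab = alphabar[t].reshape(-1, 1, 1, 1).to(x0)           # broadcast over channel/height/width dims
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps            # noised input images
    return (xt, t.to(x0.device)), eps                      # model inputs and the MSE target
```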

00:55:00

In this section, the presenters discuss how to train a neural network to predict the noise in the diffusion model. Because they use a Hugging Face diffusers model, they override the way prediction works so that the two inputs stored in learn.batch (the noised image and the timestep) are passed to the model and the output is read from its .sample attribute. They also demonstrate how the training loop works by initializing the DDPM callback with appropriate arguments. Using an MSE loss, they show how the same code can be used to train both a diffusion model and a classifier.
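A minimal sketch of the kind of predict override described, assuming the course's miniai-style TrainCB class is in scope (the exact class and hook names are assumptions):

```python
class DDPMCB(TrainCB):  # TrainCB is assumed to come from the course's miniai library
    def predict(self, learn):
        # learn.batch is ((noised images, timesteps), target noise); the diffusers
        # UNet2DModel returns an output object, so read the prediction tensor off .sample.
        learn.preds = learn.model(*learn.batch[0]).sample
```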

01:00:00

In this section, the speaker discusses using the trained model to sample new images. The process uses the noise-predicting model to determine which direction to move a random data point in order to head towards the data distribution. The model gives an initial direction that is only tangential to the actual path at that location, so a single step is not enough to reach a correct data point: it does not follow the path along which noise was originally added. Instead, the process repeatedly re-evaluates the noise-prediction function to get updated, better estimates of the gradient and gradually converges onto an image.

01:05:00

In this section, Tanishq explains the process of creating an image with denoising diffusion probabilistic models. The process starts with a random noise image, which is not part of the data distribution, and goes through each timestep, predicting a direction to move towards. The noise-predicting model gives the noise-prediction direction, and coefficients are computed to form a weighted average of the estimated denoised image and the current noisy image, with additional noise added back in. The estimate gets closer to the final generated image as the timesteps progress. The sample function is implemented as a method of the DDPM callback and is called to generate sixteen 1-channel 32x32 images.
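In DDPM notation, the weighted average described is the posterior mean over the estimated clean image and the current noisy image, with fresh noise added at each step:

```latex
\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}, \qquad
x_{t-1} = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\hat{x}_0
        + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t
        + \sigma_t z, \qquad z \sim \mathcal{N}(0,\mathbf{I})
```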

01:10:00

In this section, the presenters look at how the sampling progresses over the many timesteps of DDPM, collecting the estimate at each timestep from around timestep 800 through to 1,000. However, the noise schedule used in the original DDPM paper has limitations, which is why later papers propose other noise schedules that make full use of all the timesteps. This is the beginning of the next journey: building better generative models for Fashion-MNIST and eventually moving towards Stable Diffusion and beyond.

01:15:00

In this section, Jeremy discusses how he tried to understand what was going on in Tanishq's notebook and experimented with it in various ways to understand it better. He drew pictures of the images to remind himself what they looked like and copied and pasted various things from Tanishq's notebook to make changes to it. He created a function and passed everything it needed, making the callback tiny. Lastly, he tried to experiment with as many different ways of doing things as possible to help others see the various ways to work with their framework.

01:20:00

In this section, the instructor explains how to replace the model by inheriting from UNet2DModel and overriding the forward function; by doing this, the custom predict override is no longer necessary. To speed up training, the number of channels in each block can be divided by two. The instructor also mentions group norm, which splits the channels into a certain number of groups, and notes that it is important to make sure each group has more than one item in it. To ensure that the model is initialized correctly, the instructor chats with an expert and looks at papers.
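A minimal sketch of the kind of wrapper described, assuming the Hugging Face diffusers UNet2DModel; the reduced channel sizes and group-norm setting here are illustrative, not necessarily the lesson's exact values:

```python
from diffusers import UNet2DModel

class UNet(UNet2DModel):
    def forward(self, inp):
        # inp is the (noised images, timesteps) tuple from the batch; the diffusers
        # model returns an output object, so unwrap .sample to hand back a plain tensor.
        return super().forward(*inp).sample

# Illustrative construction with halved channel counts to speed up training.
model = UNet(in_channels=1, out_channels=1,
             block_out_channels=(16, 32, 64, 64), norm_num_groups=8)
```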

01:25:00

In this section, the speaker discusses modifications made to the model, including zeroing out specific layers at initialization, using orthogonal weights, and changing the optimization algorithm. They also introduce the concept of mixed precision, which uses 16-bit floating-point values instead of 32-bit to speed up matrix multiplication; implementing mixed precision from scratch is left for lesson 20. Ultimately, these modifications led to improved performance and faster training times for the model.
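For reference, a sketch of the standard PyTorch mixed-precision pattern that the from-scratch version in lesson 20 reimplements; model, opt, loss_fn, and dataloader are assumed to exist:

```python
import torch

scaler = torch.cuda.amp.GradScaler()
for xb, yb in dataloader:
    with torch.autocast("cuda", dtype=torch.float16):
        preds = model(xb)                      # forward pass runs its matmuls in float16
        loss = loss_fn(preds, yb)
    scaler.scale(loss).backward()              # scale the loss so float16 gradients don't underflow
    scaler.step(opt)                           # unscale the gradients and take the optimizer step
    scaler.update()
    opt.zero_grad()
```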
