Jeremy Howard
In "Lesson 9: Deep Learning Foundations to Stable Diffusion, 2022", the speaker introduces Part 2 of the "Practical Deep Learning for Coders" series, which focuses on stable diffusion and generative modeling. The course teaches foundational knowledge to keep up with the rapid developments in the field of deep learning. The speaker recommends exploring the "diffusion-nbs" repository and using the Diffusers library from Hugging Face for stable diffusion. Users can adjust the specificity of the output image using the guidance scale and negative prompts, and can fine-tune images using the image to image pipeline. The video also explains a new way of thinking about stable diffusion and how to calculate the gradient of the probability that an image is a handwritten digit with respect to its pixels.
In this section, the speaker introduces Part 2 of the "Practical Deep Learning for Coders" series, called "Deep Learning Foundations to Stable Diffusion", which focuses on generative modeling and understanding the details behind it. The speaker notes that while they will be discussing Stable Diffusion and how to use it, the field is evolving so quickly that the software and details they discuss may be out of date by the time the video is watched in the future. However, the speaker hopes to provide a reasonable understanding of the topics discussed and suggests that viewers have some familiarity with deep learning basics before proceeding with the course.
In this section, the video lecturer explains that Stable Diffusion generative models are becoming faster and faster, with the number of steps required dropping from a thousand to four. Because the field moves so quickly, the course is designed to teach from foundational knowledge so that students can keep up with the pace of innovation. The lecturer also showcases strumer.com, a company that applies the course material to generate pictures of people and objects, as an example of how the material is relevant in popular culture. Additionally, the lecturer notes that the course will be heavily influenced by fast.ai alumni such as Jonathan Whitaker, Wasim, Pedro, and Tanishq, all of whom play an integral role in the course design.
In this section, the instructor emphasizes the importance of using available resources to maximize understanding of Stable Diffusion models, particularly for medical applications. He recommends visiting course.fast.ai for links to notebooks and additional materials, and the forums for further discussions and chat. However, the instructor cautions that Part 2 requires more compute power than Part 1, and that resources like Colab may have limitations or charges for use. He recommends trying Paperspace Gradient or Lambda Labs, but notes that pricing and availability can change rapidly. Finally, he suggests that buying one's own machine may be worth considering given current GPU prices.
In this section, the speaker recommends exploring the "diffusion-nbs" repository, which has some fun notebooks to try out, including suggested_tools.md, curated by Jonathan Whitaker. These notebooks have a hacker aesthetic and are designed to help users understand the capabilities and constraints of deep learning before exploring potential research opportunities. To get great outputs from these application-ready notebooks, users should start with a prompt and then add artists' names or styles, so that the model produces images resembling those whose captions contained those words. The speaker suggests that users should play a lot and explore Lexica, which hosts many AI artworks that demonstrate what an effective prompt looks like.
In this section, the instructor discusses the stable_diffusion.ipynb notebook from the diffusion-nbs repository, which uses the Diffusers library from Hugging Face for Stable Diffusion. They explain that pipelines are similar to fastai learners and how to save them to the Hugging Face Hub for others to use. The instructor also mentions that while Hugging Face is currently the recommended library, there may be other options in the future. Additionally, the instructor notes that the notebook will download several gigabytes of data from the internet when run for the first time.
In this section, the transcript excerpt describes how to get started with creating images using deep learning on Colab or your own machine. The pipeline, called "pipe", can be treated as a function: you pass in a prompt and it returns an image. By using the same random seed, you can get the same image every time you call the pipeline. Diffusion models work by starting with random noise and, at each step, making it slightly less noisy and slightly more like the thing you want. It takes many steps to create an image, though the number of steps required is decreasing with new advancements. A basic understanding of this process will be essential in the long run.
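To make this concrete, here is a minimal sketch of such a call, assuming the Diffusers API; the model id, seed, and prompt are illustrative rather than taken from the video:

```python
# Minimal sketch of calling a Stable Diffusion pipeline with a fixed seed.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Fixing the random seed makes the call reproducible: the same prompt and
# seed produce the same image every time.
generator = torch.Generator("cuda").manual_seed(1024)
image = pipe("a photograph of an astronaut riding a horse",
             generator=generator).images[0]
image.save("astronaut.png")
```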
In this section, the instructor demonstrates how to use the guidance scale and the negative prompt in Diffusers. By passing the prompt with different guidance scales, users can adjust how closely the output follows the specific caption versus just producing a plausible image. Users can also effectively subtract the guidance for one prompt from another, which is useful for generating, say, a non-blue "Labrador in the style of Vermeer". Additionally, users can pass in an existing image to the image-to-image pipeline to get a more refined output.
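A hedged sketch of those two knobs in code, continuing the hypothetical `pipe` from the previous sketch; the prompt, values, and seed are illustrative:

```python
# Higher guidance_scale follows the caption more literally; a negative prompt
# subtracts guidance for unwanted concepts (e.g. "blue").
prompt = "Labrador in the style of Vermeer"

def gen():  # same seed each call, so the comparisons start from the same noise
    return torch.Generator("cuda").manual_seed(1024)

strict   = pipe(prompt, guidance_scale=12.5, generator=gen()).images[0]
loose    = pipe(prompt, guidance_scale=3.0,  generator=gen()).images[0]
not_blue = pipe(prompt, negative_prompt="blue", guidance_scale=7.5,
                generator=gen()).images[0]
```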
In this section, the speaker discusses how to create new images using Stable Diffusion starting from an existing drawing plus random noise. The approach involves starting with a noisy version of a drawing and then creating something that matches a given caption while following that drawing as a guiding starting point. Parameters such as strength control how closely the output stays to the original drawing. The speaker also discusses fine-tuning Stable Diffusion, for example on images labeled with an image-captioning model, as well as Textual Inversion, which involves fine-tuning just a single embedding.
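A rough sketch of the image-to-image idea, assuming Diffusers' StableDiffusionImg2ImgPipeline (argument names have varied between library versions, and the starting-image file name is a placeholder):

```python
# Start from an existing drawing (plus noise) rather than pure noise.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

sketch = Image.open("rough_drawing.png").convert("RGB").resize((512, 512))

# strength controls how far the result may drift from the starting image:
# near 0.0 stays close to the original, near 1.0 mostly ignores it.
result = img2img(prompt="oil painting of a wolf howling at the moon",
                 image=sketch, strength=0.8, guidance_scale=7.5).images[0]
```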
In this section of the video, the speaker discusses how to get started playing around with Stable Diffusion, including using a specific token for training embeddings and utilizing Dreambooth to fine-tune a model to match specific images. The speaker also shares some examples of textual inversion training, including attempting to create a version of his daughter's teddy riding a horse. In the second part of the video, the speaker will explain the basic idea of how machine learning models are trained.
In this section, the instructor introduces a new way of thinking about Stable Diffusion, which is conceptually simpler than, and just as mathematically valid as, the traditional approach. He describes a hypothetical scenario involving a web API that returns the probability that an image is a handwritten digit, which can be used to generate new images. He also explains how this could be used to turn a messy, noisy 28 by 28 image into something that looks like a digit.
In this section, the speaker explains how to calculate the gradient of the probability that an image is a handwritten digit with respect to its pixels. They demonstrate how to adjust the darkness or lightness of each pixel, one at a time, and see how much that changes the probability, ultimately producing a gradient for every pixel of the 28 by 28 image. Each gradient value indicates how much the probability that the image is a digit increases when that specific pixel's value is increased or decreased. This approach results in 784 values for the 784 pixels in the image, telling us how to change the image to make it look more like a handwritten digit.
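A toy sketch of this finite-differencing idea, with a made-up stand-in for the "is this a digit?" scorer; everything here is illustrative rather than the lesson's code:

```python
# Nudge each pixel of a 28x28 image and measure how much a (hypothetical)
# classifier's "probability this is a digit" changes.
import torch

def p_digit(img: torch.Tensor) -> float:
    """Placeholder for the black-box scorer; returns a probability in [0, 1]."""
    return torch.sigmoid(-((img - 0.5) ** 2).mean()).item()  # made-up stand-in

def finite_diff_grad(img: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    grad = torch.zeros_like(img)
    base = p_digit(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            nudged = img.clone()
            nudged[i, j] += eps                     # lighten one pixel slightly
            grad[i, j] = (p_digit(nudged) - base) / eps
    return grad                                     # 784 values, one per pixel

img = torch.rand(28, 28)                            # start from random noise
grad = finite_diff_grad(img)
```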
In this section of the video, the presenter explains how to modify the pixels of an image using gradients, much as we modify weights when training neural networks. By adjusting each pixel by its gradient multiplied by a small constant, they get a new image that looks slightly more like a handwritten digit than before. Running this new image through the network yields a higher probability that it is indeed a handwritten digit. These modifications can be made repeatedly to refine the image and increase the probability even further.
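Continuing the toy sketch above (reusing `img`, `p_digit`, and `finite_diff_grad`), the update loop might look like this; the constant and step count are arbitrary:

```python
# Move each pixel in the direction that increases the "is a digit" probability,
# a small constant c at a time (the same idea as a gradient-descent step on a loss).
c = 0.1
for step in range(50):
    grad = finite_diff_grad(img)
    img = img + c * grad           # follow the gradient of the probability
print(p_digit(img))                # should be higher than for the original noise
```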
In this section, the speaker discusses the process of turning noisy inputs into valid ones by making each pixel darker or lighter according to how that change affects the probability that the image is a digit. They explain that instead of the slow finite-differencing method, analytic derivatives can be computed by calling f.backward() in Python. The speaker suggests training a neural net to determine which pixels to change in order to make an image look more like a handwritten digit, and proposes creating training data by passing in different kinds of inputs to get the desired information.
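A minimal sketch of swapping finite differences for autograd, assuming the scorer is an ordinary differentiable PyTorch module; the toy `net` below is a placeholder:

```python
# One backward pass gives all 784 derivatives at once.
import torch
from torch import nn

net = nn.Sequential(nn.Flatten(0), nn.Linear(784, 1), nn.Sigmoid())  # toy scorer

img = torch.rand(28, 28, requires_grad=True)
p = net(img)              # probability the image is a handwritten digit
p.backward()              # analytic gradients instead of 784 separate nudges
print(img.grad.shape)     # torch.Size([28, 28])
```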
In this section, the presenter explains how to use a neural net to predict the amount of noise added to real handwritten digits. Rather than trying to score how closely a noisy image resembles a digit, the approach is to predict the noise that was added to each digit. The neural net is set up as a simple input-output system: the noisy image goes in, the net predicts the noise that was added, and the loss compares that prediction to the actual noise. The noise itself is drawn from a normal distribution with a mean of zero for each pixel, and its variance controls how much noise is added.
In this section, the speaker explains how to use a neural network to generate images by removing noise from noisy handwritten digits. The process involves passing noisy digits through a trained neural network that predicts the noise added to the inputs, subtracting the predicted noise from the input, and obtaining a digit-like image. This process is repeated multiple times until the desired image is achieved. The Mean Squared Error is used as the loss function to update the weights of the neural network during training.
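A compact sketch of that training loop, with placeholder data and a stand-in for a real U-Net; names, shapes, and the noise-amount scheme are illustrative:

```python
# Train a model to predict the noise that was added to clean digit images,
# using Mean Squared Error between predicted and actual noise.
import torch
from torch import nn

unet = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(16, 1, 3, padding=1))      # stand-in for a real U-Net
opt = torch.optim.Adam(unet.parameters(), lr=1e-3)
mse = nn.MSELoss()

digits = [torch.rand(64, 1, 28, 28) for _ in range(10)]   # stand-in for real MNIST batches

for clean in digits:                              # clean: (batch, 1, 28, 28)
    noise = torch.randn_like(clean)               # mean-zero Gaussian noise
    amount = torch.rand(clean.shape[0], 1, 1, 1)  # how much noise for each image
    noisy = clean + amount * noise
    pred = unet(noisy)                            # predict the noise that was added
    loss = mse(pred, amount * noise)              # compare prediction to actual noise
    opt.zero_grad(); loss.backward(); opt.step()
```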
In this section, the instructor explains the first component of Stable Diffusion, the U-Net, a kind of neural network originally developed for medical imaging. It takes a noisy image as input and outputs the noise, such that subtracting that output from the input gives back the un-noisy image. However, training this model directly on high-definition images with millions of pixels would be time-consuming and inefficient. To address this, the instructor suggests a more efficient way to store pixel values, such as grouping similar colors together instead of storing each pixel's exact value.
In this section, the speaker demonstrates an interesting way to compress images using a neural network. By putting an image through a series of stride-2 convolutions with an increasing number of channels, and then squishing the channels down using ResNet blocks, the image can be compressed from 786,432 values (512×512×3) to just 16,384, a 48-times reduction. To get the image back, inverse (transposed) convolutions are used. The whole thing can be placed inside a single neural net, which takes images in and reconstructs them at the other end.
In this section, the speaker discusses training a model to produce exactly the same output as its input, using a Mean Squared Error loss. Such a model is called an autoencoder, and it can be split in half into an encoder and a decoder. This creates a powerful compression algorithm: the encoded version uses far less data and can be sent much faster. The autoencoder is trained not on just one image but on millions of images, giving a compression scheme that lets you transfer thousands of pictures by sharing only the 16,384-value encoded version of each one.
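A minimal sketch of such an autoencoder, with illustrative layer sizes chosen so the latent is 48 times smaller than the 512×512×3 input (a real VAE adds ResNet blocks and a probabilistic latent):

```python
# Stride-2 convolutions shrink the image; an MSE loss asks the decoder to
# reproduce the input; the trained encoder half alone gives a compact code.
import torch
from torch import nn

encoder = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 512 -> 256
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 256 -> 128
    nn.Conv2d(64, 4, 4, stride=2, padding=1),              # 128 -> 64, 4 channels
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(4, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
)

img = torch.rand(1, 3, 512, 512)             # 512*512*3 = 786,432 values
latent = encoder(img)                        # 64*64*4  =  16,384 values (~48x smaller)
recon = decoder(latent)
loss = nn.functional.mse_loss(recon, img)    # train to reproduce the input exactly
```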
In this section, the speaker explains how the U-Net can be trained using the VAE's latents, which are 48 times smaller than the original images. The VAE's encoder converts the original picture into latents (a smaller encoded version of the picture data), while its decoder takes latents and outputs an image. The latents serve as the input to the U-Net, which outputs denoised latents; these can then be fed back into the VAE's decoder to produce a clear image. This step is optional, but it saves a great deal of computing time and money. Additionally, the speaker discusses how the U-Net can be trained to generate specific images, such as a handwritten three, by passing in the digit it represents alongside the noisy input.
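A hedged sketch of working in latent space, assuming Diffusers' pretrained AutoencoderKL; the exact API calls and the 0.18215 scaling factor follow current Diffusers conventions and may differ between versions:

```python
# Encode images to small latents, denoise there, then decode back to pixels.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema")

images = torch.rand(1, 3, 512, 512)                  # stand-in batch of images
with torch.no_grad():
    latents = vae.encode(images).latent_dist.sample() * 0.18215   # (1, 4, 64, 64)

noise = torch.randn_like(latents)
noisy_latents = latents + noise
# pred_noise = unet(noisy_latents, ...)              # the U-Net works on the small latents
# denoised   = noisy_latents - pred_noise

with torch.no_grad():
    decoded = vae.decode(latents / 0.18215).sample   # back to pixel space
```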
In this section, the instructor talks about using a neural net to predict and remove noise from images by passing in a noisy input together with a one-hot encoded version of the digit it represents. By doing this, the model can learn to predict noise in a way that is consistent with that digit and remove it from the image. However, one-hot encoding can't be used for everything that isn't a digit, like a cute teddy bear. To address this, the instructor suggests building a model that can take a sentence like "a cute teddy" and return a vector of numbers that represents what cute teddies look like. For this, images can be downloaded from across the internet, each paired with the ALT text from its HTML image tag, providing the data needed to create an embedding that represents what a cute teddy looks like.
In this section, the speaker explains the concept of using two models, a text encoder, and an image encoder, to generate embeddings and line them up. The text and image encoders initially have random weights and generate random features. However, when images and text are fed into their respective models, ideally, they should create embeddings that match with each other, representing similarity. The similarity is checked using the dot product of the features from each model, as the speaker explains.
In this section, the speaker explains how this gives two models that put text and images into the same space, a multimodal set of models. A contrastive loss function is used to train the CLIP text encoder so that its feature vector for each caption is similar (a large dot product) to the features of the image it is paired with, and small for every mismatched, off-diagonal pair. The resulting text feature vectors can then be used to guide the U-Net image denoiser: by passing in the feature vector for something like a cute teddy bear, the U-Net can create a matching image based on the similar examples it has seen.
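A compact sketch of that contrastive setup, with tiny stand-in encoders (real CLIP uses a vision model and a text transformer plus a learned temperature):

```python
# Dot products between image and text embeddings should be large on the
# diagonal (matching pairs) and small everywhere else.
import torch
from torch import nn
import torch.nn.functional as F

image_encoder = nn.Linear(784, 256)     # placeholder for a real vision model
text_encoder = nn.Linear(300, 256)      # placeholder for a real text model

images = torch.rand(8, 784)             # a batch of 8 matching image/text pairs
texts = torch.rand(8, 300)

img_emb = F.normalize(image_encoder(images), dim=-1)
txt_emb = F.normalize(text_encoder(texts), dim=-1)

sims = img_emb @ txt_emb.T              # 8x8 matrix of dot-product similarities
targets = torch.arange(8)               # pair i should match caption i
loss = (F.cross_entropy(sims, targets) + F.cross_entropy(sims.T, targets)) / 2
```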
In this section, the instructor explains the inference process for deep diffusion models and the language used around it, which can be confusing. He clarifies that although the language refers to "time steps," it is not related to time in real life. Instead, it is a holdover from the first papers on diffusion models. He suggests avoiding the term "time steps" but explains that it refers to a schedule for adding noise to the model. The instructor further explains that people use the Greek letter Beta to refer to the standard deviation of noise used in training models and that users can use either Beta or t to determine how much noise to use. Finally, he clarifies that to train the model, users create mini-batches and randomly pick an image and noise amount or t to train the model's weights to predict noise.
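A small sketch of assembling such a mini-batch; the linear beta schedule and the use of its square root as a noise standard deviation are deliberate simplifications of the schedules used in practice:

```python
# Pick random images, pick a random t for each, look up that t's noise level
# in the schedule, and add that much noise.
import torch

n_steps = 1000
betas = torch.linspace(1e-4, 0.02, n_steps)       # noise level per step (illustrative)

clean = torch.rand(16, 1, 28, 28)                 # stand-in batch of training images
t = torch.randint(0, n_steps, (16,))              # a random t for each image
sigma = betas[t].sqrt().view(-1, 1, 1, 1)         # noise std-dev for each chosen t
noisy = clean + sigma * torch.randn_like(clean)   # inputs the model learns to denoise
```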
In this section, the speaker describes the process of generating a picture from pure noise at inference time. The model starts at a point of maximum noise and removes it gradually by predicting what the noise is and subtracting it from the noisy image. The speaker explains that the prediction of the noise is multiplied by a constant rather than subtracted in full, because jumping all the way in one step would produce images unlike anything that ever appeared in the training set. The speaker also draws a connection between this process and deep learning optimizers, suggesting that similar optimizer tricks could be used to improve the process of generating images.
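A minimal sketch of that sampling loop, with an untrained stand-in where a real trained U-Net would go; the step size and step count are illustrative:

```python
# Start from pure noise, predict the noise, subtract only a fraction of it, repeat.
import torch
from torch import nn

unet = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(16, 1, 3, padding=1))   # stand-in for a trained U-Net

x = torch.randn(1, 1, 28, 28)          # maximum noise: nothing but random pixels
c = 0.1                                # step size; don't jump straight to the answer
with torch.no_grad():
    for _ in range(50):
        pred_noise = unet(x)           # what the model thinks the noise is
        x = x - c * pred_noise         # remove a little of it and try again
```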
In this section, the speaker explains how differential equation solvers and optimizers share the idea of taking small steps and using information from them to take better, bigger steps. The speaker questions whether it is really necessary to include "t" in the input data of diffusion models, that is, to tell the model how much noise is present. Results suggest that thinking of the problem as an optimization problem rather than a differential-equation-solving problem can lead to more efficient approaches. By using more sophisticated loss functions, such as Perceptual Loss, it may be possible to improve outputs and remove the need to add noise back into the inputs. This new way of thinking could lead to novel research opportunities. The next lesson will focus on the code behind the scenes, building everything up from the foundations.
In this section, the speaker explains the goal of their deep learning course, which is to provide foundations for stable diffusion while exploring new research directions. The course aims to equip students with the skills and understanding of deep learning techniques necessary to tackle real-world problems. The speaker invites viewers to stay tuned for future lessons and new research findings.