Lesson 18: Deep Learning Foundations to Stable Diffusion

by Jeremy Howard

This video provides a detailed walkthrough of stochastic gradient descent (SGD) and linear regression, showing how both finite-difference and analytical estimates of the derivatives for the intercept and slope can be used to improve prediction accuracy. It introduces momentum and compares optimization algorithms such as momentum, RMSProp, and Adam, then shows how to use PyTorch optimizers and schedulers to adjust learning rates, exploring their APIs from a REPL environment. The video also explains the importance of initializing models correctly and looks at warmup techniques such as T-Fixup. The final section covers updates to the Learner function (including additional parameters to fit), ResNets and how deeper and wider models give neural nets more opportunities to learn, and OneCycle training with the scheduler and callbacks used in the lesson.

00:00:00

In this section of the video, the instructor uses Microsoft Excel to demonstrate a simple example of stochastic gradient descent (SGD) and linear regression using mean squared error. The data for the regression is generated from y = ax + b, and the goal is to identify that the intercept is 30 and the slope is 2. The instructor explains how to compute estimated and analytical derivatives for the intercept and slope and uses them to adjust those parameters to improve prediction accuracy.
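
As a rough illustration of the same idea outside the spreadsheet, a minimal Python version using the analytic gradients of mean squared error might look like the sketch below (the data, learning rate, and iteration count are made up):

```python
import numpy as np

# Generate data from y = 2x + 30 plus a little noise, then fit the slope and
# intercept by gradient descent on mean squared error using analytic derivatives.
rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, 100)
y = 2 * x + 30 + rng.normal(0, 1, 100)

a, b = 1.0, 1.0      # initial guesses for slope and intercept
lr = 0.01            # learning rate

for step in range(2000):
    err = a * x + b - y
    da = 2 * (err * x).mean()   # d(MSE)/d(slope)
    db = 2 * err.mean()         # d(MSE)/d(intercept)
    a -= lr * da
    b -= lr * db

print(a, b)  # approaches 2 and 30
```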

00:05:00

In this section, the video presenter demonstrates the process of calculating derivatives using the finite differencing approach and explains its usage for testing purposes. He then introduces a simpler version of mini-batch gradient descent called online gradient descent and goes through a small dataset to show how to calculate the slope and intercept for each row of data. The presenter also shows his audience how to create a VBA macro to automate the process of copying and pasting the slope and intercept for each iteration, which helps to monitor the root mean squared error over time. Finally, he uses this approach to calculate a better slope and intercept for the small dataset.
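
A tiny sketch of the finite-differencing check described here (illustrative numbers, not the spreadsheet itself): nudge one parameter by a small epsilon and compare the change in loss against the analytical derivative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 30

def mse(a, b):
    return ((a * x + b - y) ** 2).mean()

a, b, eps = 1.0, 1.0, 1e-6
da_est = (mse(a + eps, b) - mse(a, b)) / eps    # finite-difference estimate
da_true = 2 * ((a * x + b - y) * x).mean()      # analytical derivative
print(da_est, da_true)                          # should be nearly identical
```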

00:10:00

In this section, the presenter introduces the concept of momentum and shows its effect on linear regression. A new sheet is created where everything is the same as the previous sheet, but with the addition of a momentum term called beta. Beta is used when calculating the gradient for both the intercept and slope: the previous update is taken into account, producing a lerped version of the gradient. This momentum version of the gradient is then multiplied by the learning rate and used to update the intercept and slope values. The presenter also explains how to use VBA macros to automate the update process, which is faster and more efficient than the previous copy-paste approach.
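
Continuing the same toy problem in Python, a minimal sketch of that lerped-gradient update (the hyperparameters are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, 100)
y = 2 * x + 30 + rng.normal(0, 1, 100)

a, b = 1.0, 1.0
lr, beta = 0.02, 0.9
avg_da = avg_db = 0.0

for step in range(500):
    err = a * x + b - y
    da, db = 2 * (err * x).mean(), 2 * err.mean()
    # lerp the new gradient into a running average, then step with the average
    avg_da = beta * avg_da + (1 - beta) * da
    avg_db = beta * avg_db + (1 - beta) * db
    a -= lr * avg_da
    b -= lr * avg_db

print(a, b)  # approaches 2 and 30
```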

00:15:00

In this section, the instructor explains the difference between the momentum, RMSProp, and Adam optimization algorithms. He explains the three algorithms visually and then runs a simulation showing how decreasing the learning rate can help optimize the model. He also answers a question from the YouTube chat about how cell J33 is initialized. Finally, the instructor notes that manually changing the learning rate is annoying and suggests using a learning rate scheduler.
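
For reference, the three update rules for a single parameter p with gradient g look roughly like this (generic textbook forms with made-up hyperparameters, not the spreadsheet's cells):

```python
import math

def momentum_step(p, g, state, lr=0.01, beta=0.9):
    # keep a lerped average of the gradients and step with that
    state["avg"] = beta * state.get("avg", 0.0) + (1 - beta) * g
    return p - lr * state["avg"]

def rmsprop_step(p, g, state, lr=0.01, beta=0.99, eps=1e-8):
    # keep a lerped average of the *squared* gradients and divide by its square root
    state["sqr"] = beta * state.get("sqr", 0.0) + (1 - beta) * g * g
    return p - lr * g / (math.sqrt(state["sqr"]) + eps)

def adam_step(p, g, state, lr=0.01, beta1=0.9, beta2=0.99, eps=1e-8):
    # Adam combines the two, with bias correction of the running averages
    t = state["t"] = state.get("t", 0) + 1
    state["avg"] = beta1 * state.get("avg", 0.0) + (1 - beta1) * g
    state["sqr"] = beta2 * state.get("sqr", 0.0) + (1 - beta2) * g * g
    avg = state["avg"] / (1 - beta1 ** t)
    sqr = state["sqr"] / (1 - beta2 ** t)
    return p - lr * avg / (math.sqrt(sqr) + eps)

# usage: keep one state dict per parameter and call the chosen step function each iteration
```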

00:20:00

In this section, Jeremy Howard discusses an automatic scheduler he created on a spreadsheet tab called "Adam annealing", which automatically decreases the learning rate in certain situations by keeping track of the average of the squared gradients. He notes that this could be an interesting experiment for those using MiniAI. Howard then goes on to discuss annealing in PyTorch, noting that the implementation is simple since the basic idea of adjusting the learning rate has already been covered. By looking at the torch.optim.lr_scheduler module and using dir, Howard explains how to define your own learning rate scheduler or annealing within the MiniAI framework.
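
For example, a small list comprehension over dir() of that module surfaces the available scheduler classes (plus a few other imported names):

```python
from torch.optim import lr_scheduler

# names starting with a capital letter are mostly the scheduler classes
print([o for o in dir(lr_scheduler) if o[0].isupper()])
# e.g. ['ChainedScheduler', 'CosineAnnealingLR', 'OneCycleLR', 'StepLR', ...]
```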

00:25:00

In this section, the video explores how to work with PyTorch schedulers and PyTorch optimizers. The PyTorch optimizer has a different API, and a learner and optimizer are created by fitting with a single-batch callback. A list of the available schedulers can be built with a little list comprehension. The video shows the Cosine Annealing scheduler in use and explores the PyTorch optimizer's attributes, including param_groups and state, a dictionary keyed by the parameter tensors. These dictionaries store the information that lets the optimizer keep track of the parameters and the changes in their values over time.
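
A minimal standalone look at those attributes, with a placeholder model and step counts; the attribute names (param_groups, state, base_lrs) are the real PyTorch ones:

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 2)
opt = optim.SGD(model.parameters(), lr=0.1)
sched = optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)

print(opt.param_groups[0]["lr"])   # the current learning rate lives in param_groups
print(opt.state)                   # per-parameter state, keyed by the parameter tensors
print(sched.base_lrs)              # the scheduler stored the optimizer's starting lr

for _ in range(100):
    opt.step()                     # normally called after each training batch
    sched.step()                   # anneal the lr along the cosine curve
print(opt.param_groups[0]["lr"])   # close to 0 after a full T_max of steps
```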

00:30:00

In this section of the video, the presenter explains how the PyTorch optimizer works by storing state as a dictionary and potentially grouping parameters. The video details the process of creating a cosine annealing scheduler, which needs to know how many iterations will be done and stores the base learning rate it reads from the optimizer. This section also shows how learning rates are adjusted by stepping the scheduler.

00:35:00

In this section, the presenter demonstrates how to investigate and learn about APIs using a REPL environment. They explain how to create a scheduler callback around the optimizer so the learning rate is scheduled and updated every batch during training. To see the scheduling and the learning rate changes, a new recorder callback keeps track of the parameter groups, maintaining a dictionary that maps each recorded name to its values. Finally, the presenter shows how to plot all of the data recorded after each batch during training.
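
A rough sketch of such a scheduler callback, assuming a miniai-style Callback base class whose hooks receive the learner, and a Learner exposing opt and training; these names are assumptions written from memory, so treat this as illustrative:

```python
from functools import partial
from torch.optim.lr_scheduler import CosineAnnealingLR

class Callback: order = 0   # stand-in for the course's miniai Callback base class

class BatchSchedCB(Callback):
    def __init__(self, sched_fn): self.sched_fn = sched_fn
    def before_fit(self, learn): self.sched = self.sched_fn(learn.opt)  # build from the learner's optimizer
    def after_batch(self, learn):
        if learn.training: self.sched.step()   # step the learning rate once per training batch

# Usage sketch (hypothetical): pass BatchSchedCB(partial(CosineAnnealingLR, T_max=total_steps))
# in the Learner's callbacks, alongside a recorder callback that logs
# learn.opt.param_groups[0]["lr"] after every batch for plotting.
```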

00:40:00

In this section, the instructor demonstrates a cosine annealing callable and a scheduler that steps at the end of each epoch rather than at the end of each batch. The OneCycle training method is also introduced: it starts the learning rate at a low value, gradually increases it to a maximum, and then decreases it again, while doing the exact opposite for momentum. The instructor explains that starting with a low learning rate is essential for models that are not perfectly initialized, and mentions a 2019 paper showing how to properly initialize more complex models such as ResNets.
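
PyTorch's built-in OneCycleLR follows the same shape; the toy sketch below (placeholder model and step counts) records the learning rate and momentum so the opposite movement can be seen:

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 2)
opt = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
sched = optim.lr_scheduler.OneCycleLR(opt, max_lr=0.1, total_steps=100)

lrs, moms = [], []
for _ in range(100):
    opt.step()                                   # normally after each training batch
    sched.step()
    lrs.append(opt.param_groups[0]["lr"])
    moms.append(opt.param_groups[0]["momentum"])
# lrs rises to max_lr and then anneals back down; moms does the opposite.
```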

00:45:00

In this section, the instructor discusses the importance of initializing models correctly and the fact that tricks such as warmup and batch normalization are not necessary if models are initialized correctly. He also points to T-Fixup, which examines the difference between correct initialization with no warmup and normal initialization with warmup. The instructor then gives an overview of warmup: for networks that are not quite initialized correctly, training starts with a low learning rate that is gradually increased, so the weights do not jump off to areas that do not make sense, while momentum is gradually increased as the weights move into a part of the space that does make sense.

00:50:00

In this section, Jeremy discusses a few changes that were made to the code, including a new way of referencing the learner from the callbacks, which is now just "learn" instead of "self.learn". This change makes the code cleaner and reduces the risk of creating a cycle between the callbacks and the learner. He also shows a neat trick for quickly finding all of the changes made to the code in the last week by using the compare feature on GitHub. Additionally, Jeremy added a patch to the learner that lets users call the learning rate finder more easily by creating a new method called lr_find. This method calls self.fit and uses the learning rate finder callback as a callback parameter, which is only active during the learning rate finder fit and does not leave any traces in the learner's callbacks list.
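
The mechanism behind that patch is fastcore's @patch decorator, which attaches a new method to an existing class. A self-contained toy demonstration follows; the stand-in class, parameter names, and print statement are hypothetical, not the course code, which (as described above) calls self.fit with the LR-finder callback passed via cbs:

```python
from fastcore.all import patch

class Learner: ...   # stand-in; the real Learner is the one built in the course notebooks

@patch
def lr_find(self: Learner, start_lr=1e-5, max_epochs=10):
    # In the course version this would call self.fit(...) with the LR-finder
    # callback in cbs, so the callback is active only for that fit.
    print(f"would fit for up to {max_epochs} epochs starting at lr={start_lr}")

Learner().lr_find()   # the patched method is now available on the class
```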

00:55:00

In this section, the speaker explains the updates to the Learner function, including additional parameters to fit, such as a learning rate parameter and two booleans for toggling the training and validation loops. The speaker also discusses ResNets and how deeper and wider models can improve neural nets' ability to learn by giving them more opportunities to do so. The speaker then suggests using a stride of 1 to add an extra layer and increasing the number of channels to 128, making the model deeper and wider. Finally, the speaker discusses the standard BatchNorm2d and the OneCycle learning rate, along with the scheduler and callbacks they will use.
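
A rough sketch of "deeper and wider" for 28x28 single-channel images (layer sizes are illustrative, not the exact course model): a stride-1 convolution at the start adds a layer without shrinking the grid, and the channel count grows to 128 before the final layer.

```python
from torch import nn

def conv(ni, nf, ks=3, stride=2, act=True):
    layers = [nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2)]
    if act: layers += [nn.BatchNorm2d(nf), nn.ReLU()]
    return nn.Sequential(*layers)

model = nn.Sequential(
    conv(1, 8, stride=1),       # extra stride-1 layer: adds depth, keeps the 28x28 grid
    conv(8, 16),                # 14x14
    conv(16, 32),               # 7x7
    conv(32, 64),               # 4x4
    conv(64, 128),              # 2x2
    conv(128, 10, act=False),   # 1x1
    nn.Flatten(),
)
```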

01:00:00

In this section, the video explores the impact of increasing the depth of deep neural networks and the resulting phenomenon of vanishing gradients. Kaiming He's work on deep residual learning for image recognition introduced ResNets through a clever insight: simply adding more layers is not always beneficial, and beyond a certain depth plain networks stop working well. The solution was to add a shortcut connection that passes the initial input directly to later layers, so those layers only need to learn the residual function, which makes deep networks possible to optimize. The concept of skip connections is illustrated with two convolutions plus an added identity connection.

01:05:00

In this section, the video explains the concept of a ResNet block, a residual connection between input and output. This connection lets a network be deep while initially behaving as if it were shallower, because each block only has to learn the difference between its input and the desired output; that difference is the residual. To demonstrate, the video creates a conv block that performs two convolutions in a row and shows how adding a "1 by 1" convolution on the identity path handles changes in stride and number of filters, so the blocks can be assembled into an overall convolutional neural network.

01:10:00

In this section, the speaker introduces the ResBlock, which contains the convolutions plus an idconv on the identity path. The idconv is a no-op when the number of channels in equals the number of channels out; otherwise, a convolution with a kernel size of 1 and a stride of 1 changes the number of filters. If the stride is 2, the speaker uses average pooling to take the mean of each 2x2 patch of the grid. The activation function is applied to the result of the whole ResNet block, after the addition, an arrangement that works pretty well. Finally, the speaker replaces the convs in the previous code with ResBlocks to build the model.
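
An approximation of such a ResBlock (paraphrased, not the exact course code): the main path is two convolutions, the identity path is a no-op or a 1x1 stride-1 conv, average pooling handles a stride of 2, and the activation is applied to the sum.

```python
import torch
from torch import nn

def conv(ni, nf, ks=3, stride=1, act=True):
    layers = [nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2)]
    if act: layers += [nn.BatchNorm2d(nf), nn.ReLU()]
    return nn.Sequential(*layers)

def _conv_block(ni, nf, stride):
    return nn.Sequential(conv(ni, nf, stride=1), conv(nf, nf, stride=stride, act=False))

class ResBlock(nn.Module):
    def __init__(self, ni, nf, stride=1):
        super().__init__()
        self.convs = _conv_block(ni, nf, stride)
        self.idconv = nn.Identity() if ni == nf else conv(ni, nf, ks=1, stride=1, act=False)
        self.pool = nn.Identity() if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)
        self.act = nn.ReLU()

    def forward(self, x):
        # activation applied to (main path + identity path)
        return self.act(self.convs(x) + self.idconv(self.pool(x)))

# e.g. a block that doubles the channels and halves the grid:
print(ResBlock(16, 32, stride=2)(torch.zeros(1, 16, 28, 28)).shape)  # (1, 32, 14, 14)
```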

01:15:00

In this section, the instructor explores the structure of a more complex model and discusses the inputs and outputs of each layer using a function called "_print_shape". This function prints the name of the class, the input shape, and the output shape for each layer. Additionally, the instructor creates a method called "summary" that uses a markdown table to display the same information, as well as the number of parameters for each layer and the total number of parameters in the model. Finally, the instructor explains why the last layer has nearly all of the parameters in the model and provides a visual demonstration using an Excel convolutional spreadsheet.
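
A standalone approximation of that shape-printing idea using plain PyTorch forward hooks; the course builds it into the Learner as a summary method, so the helper name and toy model below are illustrative:

```python
import torch
import torch.nn as nn

def summarize(model, xb):
    # for each top-level module, print class name, input shape, output shape, parameter count
    def hook(mod, inp, out):
        nparams = sum(p.numel() for p in mod.parameters())
        print(f"{type(mod).__name__:20s} {tuple(inp[0].shape)!s:22s} {tuple(out.shape)!s:22s} {nparams:10,d}")
    handles = [m.register_forward_hook(hook) for m in model.children()]
    try:
        model(xb)
    finally:
        for h in handles: h.remove()

# usage sketch with a toy model and a fake batch of 28x28 single-channel images
model = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.Flatten(), nn.Linear(8 * 14 * 14, 10))
summarize(model, torch.zeros(16, 1, 28, 28))
```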

01:20:00

In this section, the speaker discusses the ResNet architecture and how it has improved the model's accuracy to 92.2%. The ResNet architecture is created by thoughtfully designing a basic architecture using common sense. The speaker also mentions that the ResNet18d model was found to be the best of the various PyTorch image models in terms of accuracy. The speaker plans to improve the model further by introducing data augmentation and exploring deeper and wider architectures.

01:25:00

In this section, Jeremy Howard explains how he adjusted the kernel size to 5 and doubled all the sizes to improve the ResNet, resulting in an impressive 92.7% accuracy without much trial and error. He then explains how to add more flexibility to the ResNet by creating a global average pooling layer. Finally, he introduces FLOPs, a count of the number of floating-point operations a model performs, and explains how he wrote a function that takes the weight matrix and the height and width of the grid to calculate a FLOPs value for each layer.
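
A sketch of that FLOPs estimate (the exact course function may differ): a conv weight is applied once per output grid location, so multiply the number of weight elements by the output height and width; for a linear weight, the element count alone is used.

```python
import torch

def flops(weight: torch.Tensor, h: int, w: int) -> int:
    if weight.ndim == 4:          # conv weight: (nf, ni, kh, kw)
        return weight.numel() * h * w
    return weight.numel()         # linear weight: (out, in)

# e.g. a 3x3 conv from 64 to 128 channels producing a 14x14 grid:
print(flops(torch.zeros(128, 64, 3, 3), 14, 14))  # 128*64*9*14*14 = 14,450,688
```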

01:30:00

In this section, the speaker discusses the different modifications he made to the ResNet to reduce the number of parameters and/or FLOPs in the model. The first modification he made was removing a stride-1 layer, which had the biggest impact on the number of parameters but only a small impact on FLOPs. The second modification was replacing a ResBlock with just one convolution, resulting in a significant reduction in FLOPs but no change in the number of parameters. The speaker cautions against assuming that a reduction in parameters means a faster model since there is no straightforward relationship between the two. The resulting model had significantly fewer FLOPs but similar accuracy.

01:35:00

In this section, the speaker discusses the impact of weight decay on BatchNorm layers. While weight decay is a commonly used way to regularize neural networks, it doesn't work well with BatchNorm layers: BatchNorm's own multiplicative parameters can significantly increase the effective scale of the weight matrix, so the network can essentially "cheat" and undo the effect of the weight decay. Instead of weight decay, the speaker recommends data augmentation to regularize the network, modifying images slightly via random changes to prevent overfitting. The speaker also mentions that fastai has great implementations of data augmentation functions.

01:40:00

In this section, the speaker discusses data augmentation and how it adds variety to the mini-batches used during training. After creating the BatchTransform callback and checking it with the single-batch callback trick, the speaker fits the model with a small amount of data augmentation (padding of 1) for 20 epochs using the OneCycle learning rate. This produces an accuracy of 93.8% in 20 epochs, which is impressive. The purpose of data augmentation is to create different versions of the images for training the model. While applying the transform to the whole batch means less variation within each mini-batch, the model sees a lot of mini-batches and therefore sees many different augmentations.
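
A rough sketch of such a batch-transform callback, assuming a miniai-style Callback base class and Learner attributes (learn.batch, learn.training); the torchvision transforms below stand in for the course's own augmentations:

```python
import torchvision.transforms as T

class Callback: order = 0   # stand-in for the course's miniai Callback base class

class BatchTransformCB(Callback):
    def __init__(self, tfm, on_train=True): self.tfm, self.on_train = tfm, on_train
    def before_batch(self, learn):
        # apply the transform to the whole mini-batch, only during training (by default)
        if learn.training == self.on_train:
            xb, yb = learn.batch
            learn.batch = (self.tfm(xb), yb)

# e.g. the small augmentation described here: a random crop with 1 pixel of padding
# plus a random horizontal flip (sizes assume 28x28 Fashion-MNIST-style images).
augs = T.Compose([T.RandomCrop(28, padding=1), T.RandomHorizontalFlip()])
# Usage sketch: pass BatchTransformCB(augs) in the Learner's callbacks.
```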

01:45:00

In this section, the presenter discusses test time augmentation, which involves applying the same BatchTransform callback on validation as was applied on training, except this time, it's non-random and always does a horizontal flip. By stacking and averaging the predictions from both the flipped and unflipped versions, the network gains more opportunities to understand what the image is a picture of, leading to a better result of 94.2% accuracy without additional training. Additionally, the presenter introduces the technique of random erasing, which involves deleting a small part of each picture and replacing it with some random Gaussian noise.
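
A minimal sketch of that averaging step, with `model` and `xb` as placeholders:

```python
import torch

def tta_predict(model, xb):
    # predict on the batch and on its horizontally flipped version,
    # then stack the two sets of predictions and average them
    model.eval()
    with torch.no_grad():
        preds = model(xb)
        preds_flipped = model(torch.flip(xb, dims=[-1]))  # flip along the width axis
    return torch.stack([preds, preds_flipped]).mean(dim=0)
```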

01:50:00

In this section, the presenter demonstrates how he implements random erasing as a data augmentation technique from scratch. The technique deletes a patch from each image and replaces it with pixels whose mean and standard deviation match the data, so the statistics of the batch don't change. To make sure the range of pixel values doesn't change either, he fills the patch with normally distributed random noise using the mini-batch's mean and standard deviation and then clamps the result to the batch's minimum and maximum values. The presenter creates a class to store the input parameters, including the percentage of each block to delete and the maximum number of blocks, and then calls the random-erasing function in the forward function.
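
A from-scratch sketch of the same idea (the patch size and in-place design are illustrative choices, not the exact course code):

```python
import random
import torch

def rand_erase_(xb: torch.Tensor, pct: float = 0.2):
    # pick a random patch, fill it with Gaussian noise matching the batch's mean and
    # std, and clamp to the batch's original min/max so the pixel range doesn't change
    mn, mx = xb.min().item(), xb.max().item()
    n, c, h, w = xb.shape
    ph, pw = int(h * pct), int(w * pct)
    top = random.randint(0, h - ph)
    left = random.randint(0, w - pw)
    noise = torch.randn(n, c, ph, pw, device=xb.device) * xb.std() + xb.mean()
    xb[:, :, top:top + ph, left:left + pw] = noise
    xb.clamp_(mn, mx)
    return xb

# e.g. erase one patch per image in a fake batch of 28x28 images:
xb = torch.randn(16, 1, 28, 28)
rand_erase_(xb)
```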

01:55:00

In this section, the instructor experiments with the random erasing technique to generate more data for image classification tasks. They realize that instead of adding noise they can copy another part of the picture over the deleted patch, which guarantees the pixels come from the right distribution and allows for better classification accuracy. They implement this method by hand and test it over multiple epochs, achieving 94% accuracy. They then attempt to beat this record by ensembling the predictions of two individual models but fail to achieve the desired result.

02:00:00

In this section, the speaker concludes the lesson by expressing his excitement at having built an absolutely state-of-the-art model from scratch using common sense at every stage. He emphasizes that deep learning is not magic and that these techniques work for both small and large datasets. The speaker sets two homework exercises for the students: the first is to create their own cosine annealing and OneCycle schedulers from scratch, and the second is to try to beat his models on the Fashion-MNIST dataset. The speaker encourages students to take a step-by-step approach and to learn by exploring and experimenting.

02:05:00

In this section, the instructor announces the next lesson, in which he will work with Johno and Tanishq to create a diffusion model from scratch. He mentions that they will take a few lessons to not only create a diffusion model but also explore other generative approaches. He thanks the viewers for joining him on this extensive journey and encourages them to share their progress with the community on forums.fast.ai.
