Lesson 17: Deep Learning Foundations to Stable Diffusion

Jeremy Howard


In this YouTube video, the instructor discusses minor changes made to the miniai library, including the addition of a TrainLearner and a HooksCallback, which wraps hooks in a callback and makes the code simpler and more flexible. He shows how to visualize what is happening inside a deep learning model and why the learning rate finder can be hard to interpret. He then reviews statistical concepts such as variance, covariance, the Pearson correlation coefficient, and Xavier initialization, explains the problem of disappearing gradients when using the ReLU activation function and how Kaiming (He) initialization resolves it, and covers the importance of initializing neural networks correctly, which can also be achieved with the LSUV method, an approach that works regardless of the activation function.

00:00:00

In this section, the instructor discusses minor changes he made to the miniai library this week. One change was adding a dunder getattr to the Callback class that forwards four attributes to self.learn, so callbacks can refer to them directly. Another was the addition of TrainLearner, a Learner subclass with the usual training behavior built in. He also added a HooksCallback to the activations notebook, which simplifies hooks by wrapping them in a callback: it filters the modules in the Learner, creates a Hooks object, and stores it, and hooks are now removed automatically after they finish. These changes make the code simpler and more flexible.
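
A minimal sketch of what such a delegating getattr might look like; the forwarded attribute names below are assumptions based on the summary, not miniai's actual list.

```python
class Callback:
    order = 0
    # Hypothetical attribute names: the summary only says four attributes are
    # forwarded to self.learn, not which ones.
    _forward = ('model', 'opt', 'batch', 'epoch')

    def __getattr__(self, name):
        # __getattr__ is only called when normal lookup fails, so the callback's
        # own attributes are unaffected. The learner is assumed to set
        # `self.learn` on each callback before running it.
        if name in self._forward:
            return getattr(self.learn, name)
        raise AttributeError(name)
```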

00:05:00

In this section, the instructor demonstrates how to use callbacks to create visualizations of what is happening inside a deep learning model. Adding the ActivationStats callback, with its colorful-dimension, dead-chart, and plot-stats methods, makes it easy to visualize the model. The goal is to reach 90% accuracy on Fashion-MNIST without making any architectural changes, demonstrated with a simple convolutional model. The instructor runs the learning rate finder on this model and explains why the architecture is effective.
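
A hedged usage sketch, assuming ActivationStats and TrainLearner are in scope from the miniai notebooks and that `model`, `dls`, and `loss_func` are already defined; the method names follow the summary's wording and may differ from the real code.

```python
import torch.nn as nn

# Record activation statistics for every ReLU in the model while training.
astats = ActivationStats(lambda m: isinstance(m, nn.ReLU))
learn = TrainLearner(model, dls, loss_func, lr=0.2, cbs=[astats])
learn.fit(1)

astats.color_dim()    # "colorful dimension" histograms of activations per layer over time
astats.dead_chart()   # fraction of near-zero (dead) activations per layer
astats.plot_stats()   # mean and standard deviation of each layer's activations
```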

00:10:00

In this section of the video, the speaker explains why it is difficult to determine what is going on with the learning rate finder: the current learning rate causes high instability during training. He reduces the multiplier, called gamma or lr_mult, to get a finer sweep and a more useful plot, then dials the learning rate back up. After trying different values, the highest learning rate that does not cause significant instability is 0.2, but even that is too high and leads to the classic problem of activations crashing during training, because the layers do not start out with zero-mean, unit-standard-deviation activations. The speaker also shares tips for cleaning out memory in Jupyter notebooks to avoid running out of GPU memory during training.
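
A rough sketch of the kind of memory-cleanup helper described, using only standard Python and PyTorch calls; this is not necessarily the notebook's exact function.

```python
import gc
import sys
import traceback
import torch

def clean_mem():
    # The stored traceback from the last exception can keep tensors alive in Jupyter.
    if hasattr(sys, 'last_traceback'):
        traceback.clear_frames(sys.last_traceback)
        delattr(sys, 'last_traceback')
    # Run the garbage collector, then release cached CUDA blocks back to the driver.
    gc.collect()
    torch.cuda.empty_cache()
```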

00:15:00

In this section, the video discusses why a zero mean and unit standard deviation are needed and how to achieve them. In a deep neural net, the input passes through many matrix multiplications with activation functions in between. If the weights are too big or too small, the activations blow up to NaNs or collapse to zeros. The weights therefore need to be scaled so that the standard deviation at every point stays at one and the mean at zero. The video refers to a paper that initializes neural networks with weights drawn uniformly within a specific bound and studies various activation functions.
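
A quick plain-PyTorch demonstration of the problem: repeated matrix multiplications with unscaled weights blow up, while 1/sqrt(n) scaling keeps the activations stable.

```python
import torch

x = torch.randn(200, 100)          # a batch of inputs: mean ~0, std ~1
for i in range(50):                # 50 "layers" of plain matrix multiplication
    x = x @ torch.randn(100, 100)  # unscaled weights
print(x.mean(), x.std())           # nan/inf: the activations have exploded

# Scaling each weight matrix by 1/sqrt(n_in) keeps the activations tame.
x = torch.randn(200, 100)
for i in range(50):
    x = x @ (torch.randn(100, 100) * (1 / 100**0.5))
print(x.mean(), x.std())           # roughly mean 0, std 1
```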

00:20:00

In this section, the speaker explains statistical concepts such as variance and standard deviation and how to measure them on a tensor, either with squared differences or with mean absolute deviation. He also discusses Glorot (Xavier) initialization, a technique that scales the initial weights of a deep learning model based on the number of inputs. Because they square the differences, variance and standard deviation are sensitive to outliers; mean absolute deviation is used less often but is also a valid measure of how much the data points vary.
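
The three measures of spread, computed directly on a small tensor (illustrative values only).

```python
import torch

t = torch.tensor([1., 2., 4., 18.])
m = t.mean()

var = ((t - m) ** 2).mean()        # variance: mean of squared differences
std = var.sqrt()                   # standard deviation
mad = (t - m).abs().mean()         # mean absolute deviation

# The outlier (18) inflates the squared-difference measures far more than MAD.
print(var, std, mad)
```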

00:25:00

In this section, the concepts of variance, covariance, and the Pearson correlation coefficient are discussed in the context of tensors. Variance describes how a single tensor varies on its own, whereas covariance describes how two tensors vary together. The Pearson correlation coefficient is the covariance divided by the product of the standard deviations. Xavier (Glorot) init is then derived by working through what happens to these statistics in a matrix multiplication.
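
Covariance and the Pearson coefficient written out for two tensors, computed the same way the summary describes.

```python
import torch

def std(t): return ((t - t.mean()) ** 2).mean().sqrt()

x = torch.randn(100)
y = 2 * x + torch.randn(100) * 0.5               # y varies together with x, plus noise

cov = ((x - x.mean()) * (y - y.mean())).mean()   # covariance of x and y
pearson = cov / (std(x) * std(y))                # Pearson correlation, between -1 and 1
print(cov, pearson)
```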

00:30:00

In this section, the instructor explains how Xavier initialization sets the standard deviation of the weights, combining the statistics above with the structure of the forward pass. He shows that it works for Gaussian inputs but breaks down with rectified linear units (ReLU), since ReLU changes the mean and standard deviation of its output. He then creates a linear layer function, two weight matrices, and two bias vectors, and applies the linear layer to compute the mean and standard deviation of its outputs, showing that ReLU throws off those statistics in deep neural networks even when Glorot initialization is used.
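
A short check of the same effect, using throwaway helper names: Glorot-style scaling keeps a linear layer's output near mean 0 and std 1, and ReLU then shifts both.

```python
import torch

def lin(x, w, b): return x @ w + b
def relu(x): return x.clamp_min(0.)

n_in, n_out = 100, 50
x = torch.randn(200, n_in)
w = torch.randn(n_in, n_out) / n_in**0.5   # Glorot-style 1/sqrt(n_in) scaling
b = torch.zeros(n_out)

z = lin(x, w, b)
print(z.mean(), z.std())          # roughly 0 and 1
a = relu(z)
print(a.mean(), a.std())          # mean well above 0, std well below 1
```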

00:35:00

In this section, the instructor discusses the problem of disappearing gradients when using the ReLU activation function and how Kaiming (He) initialization resolves it. This initialization, an extension of Glorot initialization, scales the weights by the square root of 2/n, where n is the number of inputs, i.e. an extra factor of root 2 on top of the 1/sqrt(n) scaling. To apply it to a whole network, the apply method, which calls a function on every nn.Module, can be used together with init.kaiming_normal_; PyTorch methods ending in an underscore modify their argument in place. The instructor then runs the learning rate finder and uses the MomentumLearner to achieve good results.
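
A sketch of applying Kaiming init to every conv and linear layer with Module.apply; the toy model here is only for illustration.

```python
import torch.nn as nn
from torch.nn import init

def init_weights(m):
    # kaiming_normal_ (trailing underscore: modifies the weight in place) is
    # designed for layers whose output feeds into a ReLU.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        init.kaiming_normal_(m.weight)

model = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
                      nn.Flatten())
model.apply(init_weights)   # apply() calls init_weights on every submodule
```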

00:40:00

In this section, the instructor answers a question about why the number of filters is doubled in successive convolutions: it compensates for the reduction in grid size caused by each stride-2 convolution, so the number of activations does not shrink too quickly and the model is forced to learn at each layer. However, they then realize the input normalization was not applied correctly, leaving a mean above 0 and a standard deviation below 1, which makes the model difficult to train. To fix this, they create a batch transform callback that modifies each input batch to have a mean of 0 and a standard deviation of 1.
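
A sketch of such a batch-transform callback, assuming the miniai-style Callback base and hook naming from the earlier sketch; `xmean` and `xstd` are assumed to be precomputed from the training set.

```python
class BatchTransformCB(Callback):
    "Apply `tfm` to each batch before it is used (hook name assumed, miniai-style)."
    def __init__(self, tfm): self.tfm = tfm
    def before_batch(self): self.learn.batch = self.tfm(self.learn.batch)

def normalize_batch(b):
    xb, yb = b
    # xmean and xstd would be computed once from the training data.
    return (xb - xmean) / xstd, yb

norm_cb = BatchTransformCB(normalize_batch)   # pass via cbs=[...] when creating the learner
```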

00:45:00

In this section, the video discusses normalizing the input data to improve training of a deep convolutional neural network. By subtracting the mean and dividing by the standard deviation on the x part of each batch, the network trains successfully and reaches an accuracy of 85 percent. The video also explains that the output of a ReLU activation cannot have a mean of 0, since ReLU removes all negative values, which conflicts with the goal of keeping every layer's activations correctly calibrated around zero.

00:50:00

In this section, the presenter discusses two ways of modifying the ReLU activation function: subtracting a constant from its output so the mean moves back toward zero, and adding a small slope on the negative side (a leaky ReLU). The presenter creates a plotting function in PyTorch and uses partial to build functions with built-in parameters, including the activation function and initialization weights, and writes a new convolutional function that lets the activation function and the number of filters in each layer be changed. With the leaky, shifted ReLU in place, the model reaches an accuracy of about 84.5%, similar to the previous model.
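
A sketch of a modified ReLU along these lines; the class and argument names are assumptions modeled on the description, and the example values are illustrative only.

```python
import torch.nn as nn
import torch.nn.functional as F

class GeneralRelu(nn.Module):
    "ReLU variant with an optional leaky negative slope and a constant subtraction."
    def __init__(self, leak=None, sub=None):
        super().__init__()
        self.leak, self.sub = leak, sub

    def forward(self, x):
        x = F.leaky_relu(x, self.leak) if self.leak is not None else F.relu(x)
        if self.sub is not None:
            x = x - self.sub   # pull the output mean back toward zero
        return x

act = GeneralRelu(leak=0.1, sub=0.4)   # example values; the lesson's may differ
```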

00:55:00

In this section, the speaker discusses the importance of initializing neural networks correctly and how most people do not do it properly. They explain the LSUV method, from the paper "All You Need Is a Good Init", which initializes any neural network correctly regardless of the activation function: create the model, put a single batch of data through it, and for each layer in turn adjust its weights and bias until the activations have the correct mean and standard deviation before moving on to the next layer. The speaker demonstrates how to use hooks with the LSUV method, iterating over the ReLUs and convs to initialize the network correctly.

01:00:00

In this section, the speaker demonstrates how to use LSUV ("Layer-sequential unit-variance") initialization to correct the biases and weights of a neural network model. This is done by looping through the ReLUs and convolutions in the model and calling lsuv_init on each module pair with the batch data, printing the mean and standard deviation before and after. LSUV is useful when there is no known analytic initialization for a model's activation function. The speaker also notes the similarity of LSUV to batch normalization, which is discussed in the next section.
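
A hedged sketch of the core LSUV step, using a forward hook to read the activation statistics; the real notebook's helper may differ in detail.

```python
import torch
import torch.nn as nn

def lsuv_init(model, module, act, xb, tol=1e-3, max_iters=50):
    "Rescale `module` so the output of `act` has roughly mean 0 and std 1 on batch `xb`."
    stats = {}
    def hook(m, inp, out): stats['mean'], stats['std'] = out.mean().item(), out.std().item()
    h = act.register_forward_hook(hook)
    with torch.no_grad():
        for _ in range(max_iters):
            model(xb)                                   # forward pass fills `stats`
            if abs(stats['mean']) < tol and abs(stats['std'] - 1) < tol: break
            module.bias -= stats['mean']                # shift the bias to zero the mean
            module.weight.data /= stats['std']          # rescale weights toward unit std
    h.remove()

# Typical usage: pair each conv with the activation that follows it, in order.
# relus = [m for m in model.modules() if isinstance(m, nn.ReLU)]
# convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
# for act, conv in zip(relus, convs): lsuv_init(model, conv, act, xb)
```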

01:05:00

In this section, the video discusses batch normalization and how it normalizes layer inputs during training to address the internal covariate shift problem, where the distribution of each layer's inputs changes as training progresses. The solution is to make normalization part of the model architecture and perform it for each mini-batch. The video then introduces a simpler alternative, layer normalization, which normalizes activations over the features of each item instead of over the batch, and walks through its code: computing the mean and variance, normalizing, and then scaling and shifting with learned parameters.
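
A from-scratch layer-norm sketch matching this description, with a single learned scale and shift; the real version may use per-feature parameters.

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    "Normalize each item over everything except the batch dimension, then scale and shift."
    def __init__(self, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.mult = nn.Parameter(torch.ones(1))   # learned scale
        self.add  = nn.Parameter(torch.zeros(1))  # learned shift

    def forward(self, x):
        # For an NCHW input, normalize each item over its channels and spatial dims.
        m = x.mean((1, 2, 3), keepdim=True)
        v = x.var((1, 2, 3), keepdim=True)
        x = (x - m) / (v + self.eps).sqrt()
        return x * self.mult + self.add
```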

01:10:00

In this section, the lecturer explains how normalization layers work in machine learning models. The learned self.mult and self.add parameters let SGD rescale and shift the normalized activations, so after the first few batches the data is no longer strictly zero-mean, unit-variance; in that sense layer and batch normalization are not true normalization, but constraining these statistics in the early stages makes training much easier. The lecturer then walks step by step through building convolutional layers with normalization as an integrated feature. Normalization layers greatly increase the efficiency and power of machine learning models, but they also add complexity.

01:15:00

In this section, the video delves into BatchNorm and how it differs from LayerNorm; initialization is becoming increasingly important as people try to move away from normalization layers. In BatchNorm there is one multiply and one add per channel: the mean and variance are taken over the batch, height, and width dimensions, giving one mean and one variance per channel. Another difference is that BatchNorm keeps exponentially weighted moving averages of the means and variances from recent batches during training. Lerp, a function that computes a weighted average sliding between two tensors, is presented as the building block for these moving averages.

01:20:00

In this section, the lecturer explains how Batch Norm works and why it makes training smoother. Batch Norm takes an exponentially weighted moving average of the means and variances and stores them in buffers, so the statistics the model saw during training are saved with the model. The lecturer also covers the family of normalization layers, including Batch Norm, Layer Norm, Instance Norm, and Group Norm, and explains how each one averages over a different combination of the batch, channel, and spatial dimensions, with some keeping a separate statistic per item in the mini-batch.
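
A from-scratch BatchNorm sketch along the lines described, with per-channel statistics, lerp-based moving averages, and buffers; details may differ from the notebook's version.

```python
import torch
import torch.nn as nn

class BatchNorm(nn.Module):
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom, self.eps = mom, eps
        self.mults = nn.Parameter(torch.ones(1, nf, 1, 1))   # one learned scale per channel
        self.adds  = nn.Parameter(torch.zeros(1, nf, 1, 1))  # one learned shift per channel
        # Buffers are saved with the model but are not updated by the optimizer.
        self.register_buffer('means', torch.zeros(1, nf, 1, 1))
        self.register_buffer('vars',  torch.ones(1, nf, 1, 1))

    def update_stats(self, x):
        m = x.mean((0, 2, 3), keepdim=True)   # per-channel mean over batch, H, W
        v = x.var((0, 2, 3), keepdim=True)
        with torch.no_grad():
            # lerp_ keeps exponentially weighted moving averages for use at inference time.
            self.means.lerp_(m, self.mom)
            self.vars.lerp_(v, self.mom)
        return m, v

    def forward(self, x):
        m, v = self.update_stats(x) if self.training else (self.means, self.vars)
        x = (x - m) / (v + self.eps).sqrt()
        return x * self.mults + self.adds
```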

01:25:00

In this section, the speaker reviews the different normalization techniques in deep learning models, including Batch Norm, Layer Norm, and Group Norm, and how they help with internal covariate shift and model performance. The speaker also brings in the MomentumLearner, which can help accelerate training, and suggests combining these techniques with good initialization methods and a smaller batch size to improve performance. Finally, they mention a fun trick with the GeneralRelu activation function that can be used to squeeze out further gains.

01:30:00

In this section, the instructor introduces clamping in deep learning and how it can be used to stop numbers from getting too big. He then builds a Stochastic Gradient Descent (SGD) optimizer class in PyTorch that updates each parameter by subtracting the gradient times the learning rate, and uses the .data attribute when zeroing the gradients so autograd is not affected. Finally, he discusses weight decay, or L2 regularization, and how adding the sum of squared weights to the loss function helps regularize deep learning models.
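
A minimal SGD optimizer class of the kind described, with weight decay folded into the update; a sketch, not the notebook's exact code.

```python
import torch

class SGD:
    def __init__(self, params, lr, wd=0.):
        self.params, self.lr, self.wd = list(params), lr, wd
        self.i = 0   # step counter (used later for Adam's bias correction)

    def step(self):
        with torch.no_grad():
            for p in self.params:
                self.reg_step(p)
                self.opt_step(p)
        self.i += 1

    def reg_step(self, p):
        # Weight decay: equivalent to adding wd/2 * sum(w**2) to the loss.
        if self.wd != 0: p *= 1 - self.lr * self.wd

    def opt_step(self, p):
        p -= p.grad * self.lr   # basic gradient descent update

    def zero_grad(self):
        for p in self.params:
            p.grad.data.zero_()  # zero via .data so autograd history is untouched
```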

01:35:00

In this section, the video explains momentum as an optimizer. Momentum keeps exponentially weighted moving averages of the gradients, which smooths out the variation between successive gradients so the optimizer keeps moving in a consistent direction rather than oscillating. As a result, it is less prone to oscillation and converges faster. By the end of this part of the video, a momentum-based optimizer is shown to reach a high level of accuracy when tested on the data.

01:40:00

In this section, the instructor explains the concept of momentum, which is useful when the loss function is very bumpy. The moving average follows the underlying curve, but offset slightly to the right, with the weight given to past gradients decreasing gradually. To implement it, the instructor overrides the optimization step inherited from the SGD class and creates a Momentum optimizer with a momentum of 0.9, updating the parameters with the exponentially weighted moving average of the gradients. With the big bumps removed from the loss, the learning rate can be raised to 1.5 and the model reaches an accuracy of 87.6%.
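
A sketch of momentum as a subclass that overrides the optimization step of the SGD sketch above.

```python
import torch

class Momentum(SGD):
    "SGD with an exponentially weighted moving average of the gradients."
    def __init__(self, params, lr, wd=0., mom=0.9):
        super().__init__(params, lr, wd=wd)
        self.mom = mom

    def opt_step(self, p):
        # Store the moving average directly on the parameter, one per parameter tensor.
        if not hasattr(p, 'grad_avg'): p.grad_avg = torch.zeros_like(p.grad)
        p.grad_avg = p.grad_avg * self.mom + p.grad * (1 - self.mom)
        p -= p.grad_avg * self.lr
```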

01:45:00

In this section of the video, the presenter discusses two optimization techniques in deep learning: Momentum and RMSProp. Momentum folds a momentum term into the optimization step so the current gradient is always blended with past gradients; this smooths the updates and yields better results than just increasing the batch size. The second technique, RMSProp, developed by Geoffrey Hinton in 2012, keeps a moving average of the squared gradients (a lerp on p.grad squared) and divides the gradient by its square root; because that denominator can be small, the steps can become large, so the learning rate has to be reduced. With a lower learning rate the model trains well, and the colorful-dimension plot looks very clean.
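
A sketch of RMSProp on the same SGD base class, using lerp for the moving average of squared gradients.

```python
class RMSProp(SGD):
    "Divide the gradient by the square root of a moving average of squared gradients."
    def __init__(self, params, lr, wd=0., sqr_mom=0.99, eps=1e-5):
        super().__init__(params, lr, wd=wd)
        self.sqr_mom, self.eps = sqr_mom, eps

    def opt_step(self, p):
        # Initialize to the first mini-batch's grad**2 rather than zeros (see next section).
        if not hasattr(p, 'sqr_avg'): p.sqr_avg = p.grad ** 2
        p.sqr_avg.lerp_(p.grad ** 2, 1 - self.sqr_mom)   # moving average of squared grads
        p -= self.lr * p.grad / (p.sqr_avg.sqrt() + self.eps)
```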

01:50:00

In this section, the speaker explains how to initialize RMSProp so that the very first update is not too large. Instead of initializing the moving average to zeros, he suggests initializing it to the first mini-batch's gradient squared, a useful trick for optimizing finicky architectures like EfficientNet. He then introduces Adam, which is simply RMSProp and momentum combined and is a very commonly used optimizer. Adam keeps the zero starting point but debiases the moving averages, dividing by one minus beta to the power of the step number, so the early steps are not distorted by the zero initialization.
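
A sketch of Adam on the same base, with the bias-correction (debiasing) step included; `self.i` is the step counter from the SGD sketch above.

```python
import torch

class Adam(SGD):
    "Momentum on the gradients plus RMSProp on the squared gradients, both debiased."
    def __init__(self, params, lr, wd=0., beta1=0.9, beta2=0.99, eps=1e-5):
        super().__init__(params, lr, wd=wd)
        self.beta1, self.beta2, self.eps = beta1, beta2, eps

    def opt_step(self, p):
        if not hasattr(p, 'avg'):
            p.avg = torch.zeros_like(p.grad)
            p.sqr_avg = torch.zeros_like(p.grad)
        p.avg = self.beta1 * p.avg + (1 - self.beta1) * p.grad
        unbias_avg = p.avg / (1 - self.beta1 ** (self.i + 1))        # debias the zero init
        p.sqr_avg = self.beta2 * p.sqr_avg + (1 - self.beta2) * p.grad ** 2
        unbias_sqr = p.sqr_avg / (1 - self.beta2 ** (self.i + 1))
        p -= self.lr * unbias_avg / (unbias_sqr.sqrt() + self.eps)
```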

01:55:00

In this section, the speaker discusses the beta1 and beta2 values in the code and how modifying them can improve on the momentum version. He mentions that some very cool material is coming in the next lesson, which will help get above 90%, and invites viewers to join. He expresses excitement about having implemented everything from scratch, starting from nothing but the Python standard library, and hopes the viewers feel the same way.
