Reproducing Cyclic Learning papers + SuperConvergence using fastai
Four papers by Leslie N. Smith, introducing cyclic learning and super-convergence, are reproduced in this post using PyTorch and fastai.
- Summary of hyper-parameters
- Things to remember
- Underfitting vs Overfitting
- Deep Dive into Underfitting and Overfitting
- Choosing Learning Rate
- Introducing Superconvergence
- Explanation behind Super-Convergence
- Choosing momentum value
- Choosing Weight Decay
- Train a final classifier model with the above param values
The following papers by Leslie N. Smith are covered in this notebook :-
- A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay. paper
- Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. paper
- Exploring loss function topology with cyclical learning rates. paper
- Cyclical Learning Rates for Training Neural Networks. paper
Although the main aim is to reproduce the papers, a lot of research has been done since then, so where needed I change some things to match state of the art practices. Most of these things are taught in the fastai courses, namely Practical Deep Learning for Coders, v3.
If you are not familiar with fastai, it is a deep learning library built on top of PyTorch that contains implementations of most state of the art practices, which keep changing over time. As a result, you can get state of the art results in most tasks by using the defaults of the library.
How this notebook is structured. I explain all the concepts discussed in the papers and provide a walkthrough with a CIFAR-100 example along the way, so each topic is explained first and then followed by the code for it. If you want to use these techniques for your own work, you can follow the notebook from top to bottom. For the implementation of some concepts I use the fastai built-in functions, as fastai provides a callback system that really helps when working on real projects. If you do not know fastai, you can watch the course mentioned above or read the docs, which contain ample examples. For Tensorflow users, as of now I am not aware of a library that provides as much functionality as fastai, but you can still follow along as the concepts discussed are general and only the implementation is different.
If you are learning fastai, this notebook can be a very good tutorial on how to use the vision API in fastai.
# and much more, so no need to import them again in our work.
from fastai import *
from fastai.vision import *
Although deep learning has produced dazzling successes for applications of image, speech and video processing in the past few years, most trainings are done with suboptimal hyper-parameters, requiring unnecessarily long training times. Setting the hyperparameters remains a black art that requires years of experience to acquire. So I present several efficient ways to set the hyper-parameters that significantly reduce the training time and improve performance. Specifically, we examine the training and validation/test loss for subtle clues of underfitting and overfitting and suggest guidelines for moving toward the optimal balance point.
And I agree with that. Most of the time there is not much need to use these libraries, and as we will soon find, most of the hyper-parameters are linked with each other, so we should tune them together. An argument can be made for Bayesian optimization, but I have not used it and I find the techniques discussed here much simpler and safer.
Summary of hyper-parameters
- Learning rate :- Use learning rate finder test and get the maximum value of learning rate that you would use in the 1cycle policy.
- Batch size :- The largest value that fits on your GPU. You can use batch sizes like 20 that are not powers of two. The performance drop that is discussed for batch sizes that are not powers of 2 is real, but it is small enough that we can ignore it if we want.
- Momentum :- Use cyclic momentum in most of the tasks. When you are using GANs use a constant value of momentum.
- Weight Decay :- Larger value when using smaller dataset and model. Smaller value when using bigger datasets and models. Use a constant value.
Things to remember
- Setting hyperparameters is very important. Every dataset has its own set of good hyperparameter values, and I would say that finding them should be your first priority.
- L2 regularization vs weight decay. In L2 regularization we add a penalty term to the loss function, while in weight decay we subtract a term directly in the parameter update step.
- When we use modern architectures like Resnet it is better to use weight decay than L2 regularization.
- You can set all your hyper-parameters in a few epochs.
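To make the regularization vs weight decay point concrete, here is a minimal sketch for a single scalar weight. The function names are mine and are not part of fastai or PyTorch, and the Adam step is simplified to the very first update from a zero state. It shows that for plain SGD the two formulations coincide, while for an adaptive optimizer like Adam they give different updates.

```python
import math

def sgd_l2_step(w, grad, lr, wd):
    # L2 regularization: the penalty 0.5*wd*w**2 is added to the loss,
    # so its gradient wd*w is added to the loss gradient.
    return w - lr * (grad + wd * w)

def sgd_decoupled_step(w, grad, lr, wd):
    # Weight decay: the decay term is subtracted in the update step itself.
    return w - lr * grad - lr * wd * w

def adam_first_step(w, grad, lr, wd, decoupled,
                    beta1=0.9, beta2=0.999, eps=1e-8):
    # Simplified first Adam update (moment estimates start at zero).
    g = grad if decoupled else grad + wd * w
    m_hat = ((1 - beta1) * g) / (1 - beta1)      # bias-corrected mean
    v_hat = ((1 - beta2) * g * g) / (1 - beta2)  # bias-corrected variance
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled decay is applied outside the adaptive scaling.
        w_new -= lr * wd * w
    return w_new

print(sgd_l2_step(2.0, 0.5, 0.1, 0.01), sgd_decoupled_step(2.0, 0.5, 0.1, 0.01))
print(adam_first_step(2.0, 0.5, 0.1, 0.01, decoupled=False),
      adam_first_step(2.0, 0.5, 0.1, 0.01, decoupled=True))
```

In the Adam case the L2 penalty gets rescaled by the adaptive denominator, so it no longer shrinks the weight in proportion to its size, which is one way to see why the two are not interchangeable.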
Underfitting vs Overfitting
The basis of this notebook is the concept of underfitting versus overfitting. Specifically, it consists of examining the training's test/validation loss for clues of underfitting and overfitting in order to strive for the optimal set of hyper-parameters. By observing and understanding the clues available early during training, we can tune our architecture and hyper-parameters with short runs of a few epochs. In particular, by monitoring validation/test loss early in the training, enough information is available to tune the architecture and hyper-parameters, and this eliminates the necessity of running complete grid or random searches.
One key finding in the paper is that the total regularization needs to be in balance for a given dataset and architecture. It was found that learning rate, momentum and regularization are tightly coupled and their optimal values must be determined together. This means that if you set the learning rate to a large value, then other regularizations like momentum must come down, so that the total remains preserved.
In Figure 2(a) early overfitting is observed but test loss still decreases a little after that. This can be misleading in some cases where one can get blindsided by reduction in the amount of test loss.
Underfitting
Underfitting is when the machine learning model is unable to reduce the error for either the test or training set, which is due to the under-capacity of the model, i.e. it is not powerful enough to fit the underlying complexities of the data distributions. Whenever your valid loss is less than the training loss, it means your model is underfitting.
Choosing Learning Rate
If the learning rate is too low overfitting can occur. Large learning rates help to regularize the training but if the learning rate is too large, the training will diverge. Hence a grid search of short runs to find learning rates that converge or diverge is possible but there is an easier way.
By training with high learning rates we can reach a model that gets 93% accuracy in 70 epochs, which is less than 7k iterations (as opposed to the 64k iterations, roughly 360 epochs, in the original Resnet paper).
Sylvain Gugger has written two very good blog posts explaining this topic, so I would recommend you to read those first.
Here I just summarize the topic with important details, for a detailed overview refer to the above two links.
# https://course.fast.ai/datasets.
# `Path` is a class from Python's `pathlib` module which makes working with
# directory names a lot easier, to import it use `from pathlib import Path`
path = Path('/home/kushaj/Desktop/Data/cifar100/')
path.ls()
# where every image is placed in a folder with the same name as the
# class of the image.
src = (ImageList.from_folder(path)
.split_by_folder(valid='test')
.label_from_folder())
data = (src.transform(get_transforms(), size=(32,32))
.databunch(bs=256, val_bs=512, num_workers=8)
.normalize(cifar_stats))
cifar_stats
data.show_batch(rows=3, figsize=(4,4))
After creating the databunch we have done the following things
- Added data augmentation with the transforms returned by get_transforms(), and the size of images is taken as (32,32)
- Normalized the images with cifar_stats
- I am using a batch size of 256 for the train set and 512 for the valid set
class AdaptiveConcatPool2d(nn.Module):
    "Layer that concats `AdaptiveAvgPool2d` and `AdaptiveMaxPool2d`."
    def __init__(self, sz=None):
        "Output will be 2*sz or 2 if sz is None"
        super().__init__()
        self.output_size = sz or 1
        self.ap = nn.AdaptiveAvgPool2d(self.output_size)
        self.mp = nn.AdaptiveMaxPool2d(self.output_size)

    def forward(self, x):
        return torch.cat([self.mp(x), self.ap(x)], 1)

class Flatten(nn.Module):
    "Flatten `x` to a single dimension, often used at the end of a model. `full` for rank-1 tensor"
    def __init__(self, full:bool=False):
        super().__init__()
        self.full = full

    def forward(self, x):
        return x.view(-1) if self.full else x.view(x.size(0), -1)
class BasicBlock(nn.Module):
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        if stride != 1 or c_in != c_out:
            self.shortcut = nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(c_out)
            )

    def forward(self, x):
        shortcut = self.shortcut(x) if hasattr(self, 'shortcut') else x
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += shortcut
        return F.relu(out)
class Resnet(nn.Module):
    def __init__(self, num_blocks=[9,9,9], num_classes=100):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(16)
        self.layer1 = self.make_group(16, 16, num_blocks[0], stride=1)
        self.layer2 = self.make_group(16, 32, num_blocks[1], stride=2)
        self.layer3 = self.make_group(32, 64, num_blocks[2], stride=2)
        self.head = nn.Sequential(
            AdaptiveConcatPool2d(),
            Flatten(),
            nn.BatchNorm1d(128),
            nn.Dropout(0.25),
            nn.Linear(128, 128, bias=True),
            nn.ReLU(inplace=True),
            nn.BatchNorm1d(128),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes, bias=True)
        )

    def make_group(self, c_in, c_out, num_blocks, stride):
        layers = [BasicBlock(c_in, c_out, stride)]
        for i in range(num_blocks-1):
            layers.append(BasicBlock(c_out, c_out, stride=1))
        return nn.Sequential(*layers)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.head(out)
        return out
learn = Learner(data, Resnet(), metrics=[accuracy, top_k_accuracy], callback_fns=[ShowGraph])
Till now we have done the following:
- Got our data, which is stored in data as a databunch
- A Resnet model is created
- A learner object is created, which is named learn
- AdamW is used as the optimization function
- CrossEntropyLoss is used as loss function
Cyclic Learning Rate
The essence of this learning rate policy comes from the observation that increasing the learning rate might have a short term negative effect and yet achieve a longer term beneficial effect. So we vary the learning rate from a small value to a large value and then back to a small value. This is termed one complete cycle.
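As a minimal sketch of one such cycle, the triangular policy from the Cyclical Learning Rates paper can be written in a few lines of plain Python. The function name and the example values are my own choices for illustration; `step_size` is the number of iterations in half a cycle.

```python
import math

def triangular_lr(iteration, base_lr, max_lr, step_size):
    # Which cycle we are in (cycles are numbered from 1).
    cycle = math.floor(1 + iteration / (2 * step_size))
    # Distance from the peak of the current cycle, scaled to [0, 1].
    x = abs(iteration / step_size - 2 * cycle + 1)
    # Linear interpolation between base_lr (x=1) and max_lr (x=0).
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

# One full cycle takes 2 * step_size iterations: up for 100, down for 100.
for it in (0, 50, 100, 150, 200):
    print(it, triangular_lr(it, base_lr=1e-3, max_lr=1e-2, step_size=100))
```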
This learning rate policy is taken from the original paper, but that was a few years back. Now we use a cosine policy that looks something like this.
learn.fit_one_cycle(1, max_lr=1e-2)
learn.recorder.plot_lr()
If you have any questions about why the learning rate policy is defined this way, or about the learning rate policy in fastai in general, I have answered them in the forums, which you can check here. There I explain why we use a learning rate policy that looks like the one shown above and the reasons behind the defaults. (The main reason is that we want to train our model at higher learning rates and then fine-tune it at lower learning rates.)
For the implementation of this, you can check KushajveerSingh/fastai_without_fastai where I implement the one-cycle policy in pure pytorch.
Difference from original paper
There are some changes from the original paper. First, the papers use accuracy as the metric in most cases, while I use loss values. The reason loss values are used to compare hyper-parameter values is that loss is the actual quantity being optimized, and it is what we want to reduce the most.
One-cycle policy summary
There are two phases. In the first phase the learning rate increases from a small value to the maximum value, and in the second phase it decreases from the maximum value to the minimum value (the minimum value is smaller than the starting value of the first phase). In implementation, you only need to define the maximum value of the learning rate and the minimum values are calculated appropriately.
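The two phases can be sketched in plain Python as below. This is not the fastai source. The names are mine, and the defaults (30% of the iterations spent increasing, a starting value of max_lr/25, and a final value of max_lr/(25*1e4)) are what I understand the fastai v1 defaults to be, so treat them as assumptions.

```python
import math

def annealing_cos(start, end, pct):
    # Cosine interpolation from `start` (pct=0) to `end` (pct=1).
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle_lrs(n_iter, max_lr, pct_start=0.3, div_factor=25.0, final_div=None):
    # Only max_lr is required; the start and end values are derived from it.
    if final_div is None:
        final_div = div_factor * 1e4
    start_lr, end_lr = max_lr / div_factor, max_lr / final_div
    n_up = int(n_iter * pct_start)   # iterations in phase 1 (increase)
    n_down = n_iter - n_up           # iterations in phase 2 (decrease)
    up = [annealing_cos(start_lr, max_lr, i / n_up) for i in range(n_up)]
    down = [annealing_cos(max_lr, end_lr, i / max(n_down - 1, 1))
            for i in range(n_down)]
    return up + down

lrs = one_cycle_lrs(n_iter=100, max_lr=1e-2)
print(lrs[0], max(lrs), lrs[-1])
```

The printed values show the start value, the peak and the much smaller final value, which is the shape you see when plotting the learning rate after fit_one_cycle.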
learn.lr_find()
learn.recorder.plot_lr()
In fastai this test can be done using learn.lr_find. After running this test we plot the loss values for the different values of learning rate and try to find the maximum value of learning rate that we can use.
learn.lr_find(start_lr=1e-7, end_lr=10, num_it=100)
learn.recorder.plot()
Now you need to look at the above figure and select the lr value that you want to use. I would use lr=1e-2. The art of selecting good values comes down to practice: just try the values you are unsure about and select the one that gives the best results. There are certain guidelines that you should follow while finding the lr value:
- Select the one where the loss is small for larger amount of time
- When loss begins to increase, select a value that is 10 times less
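The second guideline can be written down as a small helper. This is only a sketch: both function names are hypothetical (they are not fastai functions), the loss values below are made up purely for illustration, and I assume the range test sweeps the learning rate exponentially between its start and end values.

```python
def lr_range(start_lr, end_lr, num_it):
    # Exponentially spaced learning rates from start_lr to end_lr.
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_it - 1)) for i in range(num_it)]

def suggest_max_lr(lrs, losses, divide_by=10.0):
    # The loss begins to increase right after its minimum, so take the
    # learning rate at the minimum and back off by a factor of 10.
    best = min(range(len(losses)), key=losses.__getitem__)
    return lrs[best] / divide_by

# Synthetic (made-up) losses: flat, then decreasing, then diverging.
lrs = lr_range(1e-7, 10, 9)  # 1e-7, 1e-6, ..., 1, 10
losses = [4.6, 4.6, 4.6, 4.5, 4.2, 3.8, 3.5, 6.0, 50.0]
print(suggest_max_lr(lrs, losses))
```

A rule like this is only a starting point. Real loss curves are noisy, so it is still worth smoothing the curve or checking the candidates with a few short runs.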
For experimentation, let us see what the results are when I use max_lr=1e-2 and max_lr=1e-1
# So you don't have to code up anything
learn = Learner(data, Resnet(), metrics=[accuracy, top_k_accuracy], callback_fns=[ShowGraph])
learn.fit_one_cycle(3, max_lr=1e-2)
learn = Learner(data, Resnet(), metrics=[accuracy, top_k_accuracy], callback_fns=[ShowGraph])
learn.fit_one_cycle(3, max_lr=1e-1)
As you can see clearly, max_lr=1e-2 is better than max_lr=1e-1. Although this test took around 11 minutes, we easily got the value of lr that we should use. With enough practice you can predict the best value of max_lr directly from the lr_find graph, but whenever you are unsure, running some epochs can help.
There are two noteworthy things to see in this figure. First is the dip in the accuracy around LR=0.1. Second is the consistently high test accuracy over a large span of learning rates (LR=0.25 to 1.0), which is unusual. Seeing these unusual facts, experiments were carried out using cyclic learning rates and the following counterintuitive results appeared.
You can see an anomaly that occurs as the LR increases from 0.1 to 0.35. The training loss increases sharply by four orders of magnitude at a learning rate of approximately 0.255, but training convergence resumes at larger learning rates. In addition, there are divergent behaviours between the test accuracy and loss curves that are not easily explained. In the first cycle, when the learning rate is increasing from 0.13 to 0.18, the test loss increases but the test accuracy also increases. This simultaneous increase in the test loss and the test accuracy also occurs in the second cycle as the learning rate decreases from 0.35 to 0.1 and in various portions of subsequent cycles.
Another interesting fact can be seen from the second figure. The cyclic learning rate method is able to get 0.93 accuracy after just one cycle and it remains the same for the subsequent cycles, while the standard approach of using a constant learning rate needs about 5 times more iterations to reach an accuracy close to 0.93.
Testing linear interpolation results
In Exploring Loss Function Topology with Cyclical Learning Rates (only 3 pages) the important topic of network interpolation is discussed, which shows that the solutions found by different cycles differ from each other (intuitively you can think of the solutions as belonging to different valleys). So we can interpolate between solutions from different cycles and expect better accuracy. I test this claim now.
Interpolation between two values is defined as $w_{new}=\alpha * w_1+(1-\alpha)*w_2$.
From this plot it was observed that the solution found by each cycle indeed belonged to different minima. An additional noteworthy feature is that some amount of regularization is possible through interpolating between two solutions. The minima for training loss are at $\alpha=0.0$ and $\alpha=1.0$, but the test loss minima are slightly offset towards the center.
Coding Linear Interpolation
To check the results for the above topic the following things would be done:
- Do some training epochs on the training dataset and get the weights for different cycles
- Make a plot of interpolation between different weight values
To modify our training loop, callbacks are used. I consider callbacks to be the most important feature of fastai, as they allow endless customization without changing the training loop.
class GetWeights(Callback):
    "Save a CPU copy of the model weights at the end of the epochs in `save_list`."
    def __init__(self, learn, save_list=(5, 7)):
        self.learn = learn
        self.save_list = save_list

    def on_train_begin(self, **kwargs):
        self.weights = {}

    def on_epoch_end(self, **kwargs):
        if kwargs['epoch'] in self.save_list:
            # Use the learner passed to the callback (not a global `learn`) and
            # clone every tensor so later training cannot change the snapshot.
            self.weights[kwargs['epoch']] = {
                k: v.detach().cpu().clone()
                for k, v in self.learn.model.state_dict().items()}

    def get_weights(self):
        return self.weights
learn = Learner(data, Resnet(), metrics=[accuracy, top_k_accuracy], callback_fns=ShowGraph)
getWeights = GetWeights(learn)
learn.fit_one_cycle(8, max_lr=1e-2, callbacks=getWeights)
weights = getWeights.get_weights()
w1 = weights[5]
w2 = weights[7]
learn.model.load_state_dict(w1)
learn.validate()
def interpolate(alpha):
    w_new = {}
    keys = list(w1.keys())
    for key in keys:
        w_new[key] = alpha*w1[key] + (1 - alpha)*w2[key]
    return w_new
alpha_range = np.linspace(start=-0.5, stop=1.5, num=100)
train_loss = []
val_loss = []
for i, alpha in enumerate(alpha_range):
    print(f'{i}/{len(alpha_range)} started')
    w_new = interpolate(alpha)
    learn.model.load_state_dict(w_new)
    loss1, _, _ = learn.validate()
    loss2, _, _ = learn.validate(data.train_dl)
    val_loss.append(loss1)
    train_loss.append(loss2)
plt.figure(figsize=(10,6))
plt.plot(alpha_range, train_loss, 'b')
plt.plot(alpha_range, val_loss, 'r')
plt.show()
alpha_range = np.linspace(start=-10, stop=4, num=50)
train_loss2 = []
val_loss2 = []
for i, alpha in enumerate(alpha_range):
    print(f'{i}/{len(alpha_range)} started')
    w_new = interpolate(alpha)
    learn.model.load_state_dict(w_new)
    loss1, _, _ = learn.validate()
    loss2, _, _ = learn.validate(data.train_dl)
    val_loss2.append(loss1)
    train_loss2.append(loss2)
plt.figure(figsize=(10,6))
plt.plot(alpha_range, train_loss2, 'b')
plt.plot(alpha_range, val_loss2, 'r')
plt.show()
The losses went to nan for the initial values of alpha. Seeing this, we cannot conclude that combining the two solutions helps. The likely reason is that the paper used SGD, while here we are using AdamW with cyclic momentum by default, which changes the situation a lot. Still, it is a fun experiment.
Explanation behind Super-Convergence
As we discussed earlier, one of the indicators of super-convergence is consistently high accuracy as the learning rate increases. Cyclic learning, which allows super-convergence, is indeed a combination of curriculum learning and simulated annealing. Also, as the amount of data decreases, the gap in performance between standard training and super-convergence increases. Specifically, with a piecewise constant learning rate schedule the training encounters difficulties and diverges along the way.
The above figure gives an intuitive understanding of how super-convergence happens. The blue line in the figure represents the trajectory of the training while converging and the x's indicate the location of the solution at each iteration and indicates the progress made during the training.
The whole loss surface can be divided into 3 phases:-
- In early training, the learning rate must be small in order for the training to make progress in appropriate direction. As you can see in the figure a significant progress is made in those early iterations (the part where we descend the valley)
- Now as the slope decreases so does the amount of progress made per iteration and little improvement occurs over the bulk of the iteration. This is the reason why we increase the learning rate to high values, so that we can quickly move over this region.
- As we approach the bottom of the loss surface (you can think of it as the bottom of a valley with bumps), we need to slow down and get to the bottom of these bumps. That is why we decrease the learning rate to its minimum value, so that we can fine-tune our final result.
A quick summary
Initially, when training starts, there is a huge slope and we move down it quickly. Then there is a long flat path where we make very little progress with each iteration. At the end we enter a valley and have to move towards the minimum. Cyclic learning rates handle all three situations. We start with a small learning rate to get down the initial big slope, then we increase the learning rate to move quickly along the flat path, and then we decrease the learning rate again to move through the valley.
Choosing momentum value
Choosing the value of momentum depends on the task at hand. There are two options: either use cyclic momentum or a constant value of momentum. I will directly give you the results here
- If you are training GANs or any task where you are quickly shifting between different models (like in GANs we shift between generator and discriminator), you should use a constant value of momentum, because you will not have enough time to get the benefits of cyclic momentum in these tasks.
- Cyclic momentum should be your default choice for other tasks. And for the momentum values they should be high->low i.e. opposite of learning rate.
In implementation we would specify the minimum and maximum value of momentum and move from max->min and then min->max. Below I show the cyclic learning rate and cyclic momentum values side by side.
learn = Learner(data, Resnet())
learn.fit_one_cycle(1)
learn.recorder.plot_lr(show_moms=True)
The reason behind using this approach is that we have to keep the total regularization in check. If I use higher learning rates, which provide regularization on their own, I do not need high values of momentum (as a high momentum value would also provide a large amount of regularization). Lowering momentum while the learning rate is high allows convergence over a wider range of learning rates and gives faster convergence. The optimal learning rate depends on the momentum and the optimal momentum depends on the learning rate.
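To see the opposite movement of the two values in numbers, here is a small plain Python sketch that produces (learning rate, momentum) pairs for one cycle. It is not the fastai implementation. The function names, the cosine shape and the defaults (30% warm-up, start at max_lr/25, end at max_lr/(25*1e4)) are assumptions based on my understanding of the fastai v1 defaults.

```python
import math

def annealing_cos(start, end, pct):
    # Cosine interpolation from `start` (pct=0) to `end` (pct=1).
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle_lr_and_mom(n_iter, max_lr, moms=(0.95, 0.85),
                         pct_start=0.3, div_factor=25.0):
    n_up = int(n_iter * pct_start)
    n_down = n_iter - n_up
    schedule = []
    for i in range(n_iter):
        if i < n_up:
            # Phase 1: learning rate goes up, momentum goes down.
            pct = i / n_up
            lr = annealing_cos(max_lr / div_factor, max_lr, pct)
            mom = annealing_cos(moms[0], moms[1], pct)
        else:
            # Phase 2: learning rate goes down, momentum goes back up.
            pct = (i - n_up) / max(n_down - 1, 1)
            lr = annealing_cos(max_lr, max_lr / (div_factor * 1e4), pct)
            mom = annealing_cos(moms[1], moms[0], pct)
        schedule.append((lr, mom))
    return schedule

schedule = one_cycle_lr_and_mom(n_iter=100, max_lr=1e-2)
for i in (0, 30, 99):
    print(i, schedule[i])
```

The momentum reaches its minimum at exactly the iteration where the learning rate peaks, which is the "keep the total regularization in check" idea expressed as a schedule.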
Some good values of momentum to test
This is a difficult question to answer. Generally the default of (0.95 - 0.85) works well, but there are some other values that you can test
- 0.99 - 0.90
- 0.95 - 0.85
- 0.99
- 0.9
- 0.85
Note:- In practice, we would choose this value in combination with the value of weight decay. But for quick demonstration I show how to choose value of momentum only.
I test for the first two cases.
learn = Learner(data, Resnet(), metrics=[accuracy, top_k_accuracy], callback_fns=ShowGraph)
learn.fit_one_cycle(3, max_lr=1e-2, moms=(0.99, 0.90))
learn = Learner(data, Resnet(), metrics=[accuracy, top_k_accuracy], callback_fns=ShowGraph)
learn.fit_one_cycle(3, max_lr=1e-2, moms=(0.95, 0.85))
The testing took around 11 minutes and I think the value of 0.99-0.90 for momentum is better than 0.95-0.85. This is it. Using this method you can get creative and test out many of your hyperparameter choices without using any external libraries which in most cases do not even work.
One thing you may be wondering is what to do with large datasets like Imagenet, where every epoch takes hours to run. In that situation I would suggest taking a small sample of the dataset and adjusting your hyper-parameters using that. Generally you don't need a very large validation set; a subset of the validation set can also work, provided that the subset is a good representation of the actual validation set. You can also get creative with large datasets by treating a specific number of training batches as a single epoch. For example, if my dataset can be divided into 100 batches and I decide that every 20 batches are treated as a single epoch, then after 100 batches I would have done 5 epochs instead of 1.
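The 100 batches / 20 batches per epoch example can be sketched with a small generator. The name `pseudo_epochs` is hypothetical and this is plain Python, not a fastai feature; `range(100)` simply stands in for a real iterator over batches.

```python
def pseudo_epochs(batches, batches_per_epoch):
    """Group an iterable of batches into (epoch_index, chunk) pairs."""
    chunk, epoch = [], 0
    for batch in batches:
        chunk.append(batch)
        if len(chunk) == batches_per_epoch:
            yield epoch, chunk
            epoch += 1
            chunk = []
    if chunk:
        # Leftover batches form a final, shorter pseudo-epoch.
        yield epoch, chunk

# 100 batches with 20 batches per pseudo-epoch gives 5 pseudo-epochs.
chunks = list(pseudo_epochs(range(100), batches_per_epoch=20))
print(len(chunks))
```

Validating after each chunk gives five validation loss readings per pass over the data instead of one, which is what makes the early comparison of hyper-parameters possible.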
Note:- I did not even need 3 epochs to decide which value is better. If we just compare the results from the first two epochs, we can clearly make our choice without having to run the third epoch.
Choosing Weight Decay
Weight decay is not like momentum and learning rate: the best value should remain constant throughout the training. Since the network's performance is dependent on a proper weight decay value, a grid search is worthwhile and differences are visible early in the training. That is, the validation loss early in training is sufficient for determining a good value.
So to set the value of weight decay you should run combined runs using different values of weight decay and momentum and possibly learning rate. Generally what I found good is using the lr_range test to get the value of learning rate and then adjust momentum and weight decay accordingly using various combinations.
The reason we can avoid finding the values of learning rate, momentum and weight decay simultaneously is all of these hyper-parameters are coupled and if we set a low value for some param we can set higher values of other params, so as to keep the total regularization in check.
How to set the value
This requires a grid search to determine the proper magnitude but usually does not require more than one significant figure accuracy. Use your knowledge of the dataset and architecture to decide which values to use. For example, a more complex dataset requires less regularization so test smaller weight decay values such as 1e-4, 1e-5, 1e-6. A shallow architecture requires more regularization so test larger weight decay values such as 1e-2, 1e-3, 1e-4. The reason being complex datasets provide regularization on their own and other regularizations should be reduced.
So if you guess that 1e-4 should be a good value, then test 3e-5, 1e-4, 3e-4. How did I choose the factor 3? If you think the best weight decay value lies between 10^{-4} and 10^{-3}, then you should choose the value 10^{-3.5} (about 3x10^{-4}), i.e. take the average of the exponents. You can keep going in this way.
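The "average of the exponents" rule is just a midpoint on a log scale, and a few lines of Python make the arithmetic explicit. Both helper names are hypothetical and only serve this illustration.

```python
import math

def log10_midpoint(low, high):
    # Average the base-10 exponents, then go back to a plain number.
    return 10 ** ((math.log10(low) + math.log10(high)) / 2)

def candidates_around(center, step=0.5):
    # Values half a decade below and above `center` on the log scale.
    exponent = math.log10(center)
    return [10 ** (exponent - step), center, 10 ** (exponent + step)]

print(log10_midpoint(1e-4, 1e-3))  # about 3.16e-4, rounded to 3e-4 in practice
print(candidates_around(1e-4))     # about [3.16e-5, 1e-4, 3.16e-4]
```

Since weight decay rarely needs more than one significant figure, rounding 3.16 down to 3 loses nothing in practice.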
To make testing simpler I would use cyclic momentum as (0.99-0.90) and find the optimal value of weight decay. The reason I do not need to change the momentum value here is that momentum is already changing from 0.99 to 0.9 as we are using cyclic momentum, so in implementation we can first find a good cyclic momentum value and then test out weight decay values.
# moms must be a (max, min) tuple; (0.99-0.90) would evaluate to the float 0.09
learn = Learner(data, Resnet(), metrics=[accuracy, top_k_accuracy])
learn.fit_one_cycle(2, max_lr=1e-2, moms=(0.99, 0.90), wd=0)
learn = Learner(data, Resnet(), metrics=[accuracy, top_k_accuracy])
learn.fit_one_cycle(2, max_lr=1e-2, moms=(0.99, 0.90), wd=1e-4)
learn = Learner(data, Resnet(), metrics=[accuracy, top_k_accuracy])
learn.fit_one_cycle(2, max_lr=1e-2, moms=(0.99, 0.90), wd=1e-5)
learn = Learner(data, Resnet(), metrics=[accuracy, top_k_accuracy])
learn.fit_one_cycle(2, max_lr=1e-2, moms=(0.99, 0.90), wd=1e-3)
From the above results it is clear that WD=1e-4 is the best. So now I test around 1e-4 with the same rule as explained above. Taking the average of the exponents -4 and -3 gives -3.5, and 10^{-3.5} is approximately 3.16x10^{-4}, which I round to 3x10^{-4}. The same half step on the other side of 1e-4 gives 3x10^{-5}. So the values I test are 3x10^{-4} and 3x10^{-5}
# moms must be a (max, min) tuple; (0.99-0.90) would evaluate to the float 0.09
learn = Learner(data, Resnet(), metrics=[accuracy, top_k_accuracy])
learn.fit_one_cycle(2, max_lr=1e-2, moms=(0.99, 0.90), wd=3e-4)
learn = Learner(data, Resnet(), metrics=[accuracy, top_k_accuracy])
learn.fit_one_cycle(2, max_lr=1e-2, moms=(0.99, 0.90), wd=3e-5)
The final weight decay value that I choose after seeing the above results is 1e-4. In this way you can test different hyperparameter values and see which performs the best.
learn = Learner(data, Resnet(), metrics=[accuracy, top_k_accuracy], callback_fns=ShowGraph)
learn.fit_one_cycle(15, max_lr=1e-2, moms=(0.99, 0.90), wd=1e-4)
Congratulations, you made it to the end. You can now set any hyper-parameter value by just looking at the validation loss for a few epochs and seeing whether your model overfits or not. The content of this notebook has been taken from the four papers by Leslie N. Smith mentioned at the start of the notebook and from the fastai courses taught by Jeremy Howard.