What are the layers in ResNet?

1. Vanishing gradients and exploding gradients


1.1 Vanishing and exploding gradients

Reference: Vanishing and exploding gradients in deep neural networks

1.1.1 Gradient instability

Vanishing gradients and exploding gradients are collectively referred to as gradient instability, and their causes are similar. To illustrate how vanishing gradients arise, we simplify the problem and walk through the chain rule as an example.

1.1.2 Simplifying the problem: the chain rule

Let's consider the following simple deep neural network: each layer contains only one neuron, and there are three hidden layers:
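Written out with the chain rule (a standard reconstruction in my own notation: sigmoid activations σ, weights w_j, biases b_j, pre-activations z_j = w_j a_{j-1} + b_j, activations a_j = σ(z_j), and cost C), the gradient with respect to the first bias is:

$$
\frac{\partial C}{\partial b_1}
= \sigma'(z_1)\, w_2\, \sigma'(z_2)\, w_3\, \sigma'(z_3)\, w_4\, \sigma'(z_4)\, \frac{\partial C}{\partial a_4}
$$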


This equation also reflects how backpropagation operates: it starts from the output layer and computes the partial derivatives layer by layer back to the input layer. These partial derivatives are then used to update the corresponding weights and biases, which is the purpose of backpropagation.

However, notice that the gradient formula involves more and more per-layer factors as the number of layers increases. The influence of these factors compounds, multiplication after multiplication, which eventually leads to exploding or vanishing gradients.

1.1.3 Vanishing gradients

The sigmoid function is a frequent cause of gradient instability, so we take it as our example.
Its graph looks like this:


The graph of its derivative is as follows:

The maximum value of this derivative is 0.25, and as the absolute value of the input increases, the derivative decreases toward zero.

Since we usually initialize the weights by sampling from a standard normal distribution, the absolute value of a weight w is usually less than 1, so we can get:
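In symbols (reconstructed from the statements above; σ is the sigmoid, w_j and z_j the weight and pre-activation of layer j):

$$
\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr) \le \frac{1}{4}
\quad\Longrightarrow\quad
\bigl|\,w_j\,\sigma'(z_j)\,\bigr| < \frac{1}{4}.
$$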

To compute the gradient of the earliest layers of a deep network, we multiply many such terms together. By the inequality above, each factor has magnitude less than 1/4, so the final result shrinks exponentially with the number of layers. This is why the gradient vanishes.

1.1.4 Exploding gradients

The cause of exploding gradients is the opposite of vanishing gradients: if we choose larger weight values, each factor |w_j σ'(z_j)| will be greater than 1. When many such terms are multiplied together, the result grows exponentially.
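A small numeric sketch of both effects (my own illustration, not from the original text): multiply the per-layer factor w·σ'(z) across many layers and watch the product shrink or blow up depending on the size of w. The specific values of w, z, and the layer count are arbitrary.

```python
import numpy as np

def sigmoid_deriv(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z)), at most 0.25."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def gradient_scale(w, z=0.0, n_layers=30):
    """Magnitude of the product of w * sigmoid'(z) repeated over n_layers layers."""
    factor = abs(w * sigmoid_deriv(z))   # sigmoid'(0) = 0.25
    return factor ** n_layers

print(gradient_scale(w=0.8))   # 0.8 * 0.25 = 0.2 < 1  -> vanishes (~1e-21)
print(gradient_scale(w=8.0))   # 8.0 * 0.25 = 2.0 > 1  -> explodes (~1e+9)
```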

2.1 Background

The most fundamental motivation of ResNet is the so-called "degradation" problem, i.e. as the model gets deeper, the error rate goes up.

Since AlexNet, state-of-the-art CNN architectures have been getting deeper and deeper. AlexNet has only 5 convolutional layers, while VGG and GoogLeNet have 19 and 22 layers respectively.

However, we cannot increase the depth of a network simply by stacking more layers. The vanishing gradient problem makes deep networks hard to train: as the gradient is propagated back to earlier layers, repeated multiplication can make it arbitrarily small. As a result, as the network gets deeper, its performance saturates and may even degrade.

Before ResNet, researchers had found several ways to mitigate the vanishing gradient problem, such as adding an auxiliary loss at intermediate layers as extra supervision. However, no single method solved the problem completely.
Eventually ResNet was proposed, which improved the situation considerably.


A residual block

The main idea behind ResNet is to introduce a "shortcut connection" that skips one or more layers, as shown in the figure above. The curved arc in the figure is this shortcut connection, an identity mapping.

The authors of ResNet argue:

  1. Adding more layers should not hurt the performance of the network, because we could simply stack identity mappings on top of the current network and the resulting architecture would behave the same. This implies that the training error of a deeper model should not be higher than that of its shallower counterpart.
  2. They also hypothesized that it is easier for the stacked layers to fit a residual mapping than to fit the desired underlying mapping directly. The residual block shown in the figure above enables exactly this (see the formulation sketched below).
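In symbols (following the formulation in the ResNet paper, with H(x) the desired underlying mapping and F(x) the residual that the stacked layers actually learn):

$$
H(x) = F(x) + x
\quad\Longleftrightarrow\quad
F(x) = H(x) - x,
$$

so if the identity mapping were optimal, the layers would only need to push F(x) toward zero.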

In addition, ResNet uses the ReLU activation. The gradient of ReLU does not shrink as its input grows, which also helps prevent the gradient from vanishing.

So ResNet offers two advantages: identity mappings and ReLU activations.

2.2 Two building block structures


These two structures come from ResNet-34 (left) and ResNet-50/101/152 (right).
In general, each of these structures is called a "building block", and the one on the right is also known as the "bottleneck design".
The purpose of the design on the right is to reduce the number of parameters: the first 1x1 convolution reduces the 256-dimensional channels down to 64 dimensions, and the final 1x1 convolution restores them to 256.
The entire bottleneck structure uses 69,632 parameters (ignoring biases), whereas without the bottleneck it would be two 3x3, 256-channel convolutions with 1,179,648 parameters, a difference of about 16.94 times; the arithmetic is sketched below.
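Sketching the arithmetic behind the 16.94x figure (ignoring biases; each count is kernel_height × kernel_width × in_channels × out_channels):

$$
\underbrace{1\cdot1\cdot256\cdot64}_{16{,}384}
+ \underbrace{3\cdot3\cdot64\cdot64}_{36{,}864}
+ \underbrace{1\cdot1\cdot64\cdot256}_{16{,}384}
= 69{,}632,
\qquad
2\times(3\cdot3\cdot256\cdot256) = 1{,}179{,}648,
\qquad
\frac{1{,}179{,}648}{69{,}632} \approx 16.94.
$$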
The regular building block is used in networks with 34 or fewer layers, while the bottleneck design is used in deeper networks such as ResNet-101 to reduce computation and the number of parameters (for practical reasons).

2.3 Two shortcut connection methods

Some readers may wonder: since F(x) and x are added element-wise along the channel dimension, what happens if F(x) and x have different numbers of channels? How can they be added then?
Depending on whether the channel counts match, there are two cases to consider. Let's first look at the solid-line and dotted-line connection methods in the figure below:

  • Solid-line shortcuts (e.g. between the first and third pink rectangles) connect layers that both use 3x3, 64-channel convolutions; their channel counts are the same, so the computation is:
    y = F(x) + x
  • Dotted-line shortcuts (e.g. between the first and third green rectangles) connect 3x3, 64-channel and 3x3, 128-channel convolutions; their channel counts differ (64 vs. 128), so the computation is:
    y = F(x) + Wx

where W is a convolution (a 1x1 projection) used to match the channel dimension of x.
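As a concrete sketch of the two shortcut types (my own illustration using tf.keras, not the official implementation; the layer ordering and the use of batch normalization are assumptions following common practice):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    """Basic residual block: y = F(x) + x, or y = F(x) + Wx when shapes differ."""
    shortcut = x

    # F(x): two 3x3 convolutions with batch norm; ReLU between them.
    y = layers.Conv2D(filters, 3, strides=stride, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, strides=1, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)

    # Dotted-line case: the channel count (or spatial size) changes, so project x
    # with a 1x1 convolution -- the "W" in y = F(x) + Wx.
    if stride != 1 or x.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride, padding="same", use_bias=False)(x)
        shortcut = layers.BatchNormalization()(shortcut)

    # Solid-line case: shortcut is simply x (identity mapping).
    y = layers.Add()([y, shortcut])
    return layers.ReLU()(y)

# Example: one solid-line block followed by one dotted-line block.
inputs = tf.keras.Input(shape=(56, 56, 64))
h = residual_block(inputs, filters=64)           # y = F(x) + x
h = residual_block(h, filters=128, stride=2)     # y = F(x) + Wx
model = tf.keras.Model(inputs, h)
```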

2.4 The layer-by-layer architecture of each network


The table above lists the 5 depths proposed for ResNet: 18, 34, 50, 101 and 152 layers. ResNet-101 simply means that the convolutional and fully connected layers add up to 101; activation and pooling layers are not counted. The other ResNets are named analogously.

All of these networks are divided into 5 parts: conv1, conv2_x, conv3_x, conv4_x and conv5_x.

Pay particular attention to the 50-layer and 101-layer columns: the only difference between them is in conv4_x, where ResNet-50 has 6 blocks and ResNet-101 has 23. The difference of 17 blocks, each containing 3 layers, accounts for 17 x 3 = 51 extra layers, and 50 + 51 = 101.

3.1. Install Java JDK

Reference: How To Install Java with Apt-Get on Debian 8

3.2. Confirm the system version

Check the system version from the command line.

It turned out that my system was Ubuntu, so I followed the Ubuntu installation tutorial.

3.3. Install Bazel

Reference: Installing Bazel on Ubuntu

3.4. Download the dataset

Reference: Inception in TensorFlow
Just follow its Getting Started section; Bazel needs to be installed beforehand, which is what the steps above cover.

Since the download speed on the lab's virtual machine was too slow, I switched to CIFAR-10 as the training dataset.

Due to limited time and the difficulty involved, I only ran the model provided at https://github.com/tensorflow/models/tree/master/official/resnet and recorded the results. I tried modifying the code myself, but found the official code framework too large to work through. I hope to make up for this part in the future...

4.1 Add environment variables

You need to add the file path of the models repository to the environment variable; otherwise it can cause problems like the one shown.
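A minimal sketch of one way to make the repository importable from Python (the path below is a placeholder; setting the environment variable before launching the script achieves the same thing):

```python
import sys

# Placeholder: wherever you cloned https://github.com/tensorflow/models
MODELS_DIR = "/path/to/tensorflow/models"

# Make the repository importable for the current process, so that imports
# such as `from official.resnet import ...` can be resolved.
if MODELS_DIR not in sys.path:
    sys.path.insert(0, MODELS_DIR)
```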

4.2 Download and unzip the CIFAR-10 data
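As a self-contained sketch of this step (my own code, not the author's exact commands; the output directory name is a placeholder), the CIFAR-10 binary archive can be fetched and extracted like this:

```python
import os
import tarfile
import urllib.request

CIFAR10_URL = "https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz"
DATA_DIR = "cifar10_data"   # placeholder output directory

os.makedirs(DATA_DIR, exist_ok=True)
archive_path = os.path.join(DATA_DIR, "cifar-10-binary.tar.gz")

# Download the archive (roughly 170 MB) if it is not already present.
if not os.path.exists(archive_path):
    urllib.request.urlretrieve(CIFAR10_URL, archive_path)

# Extract; this produces the cifar-10-batches-bin/ directory.
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(DATA_DIR)
```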

4.3 Start the training

In the later stages of training, the training accuracy is over 99%, while due to overfitting the evaluation accuracy is around 92%.

training

evaluating

It can be seen that the test accuracy is slightly lower than the training accuracy.

4.4 Various problems and solutions

4.4.1 Problem: AttributeError: module 'tensorflow' has no attribute 'data'

Reference: AttributeError: module 'tensorflow' has no attribute 'data'

Yes, as @blairjordan mentions, tf.contrib.data has been upgraded to just tf.data in TensorFlow v1.4. So you need to make sure you're using v1.4.

The reason

The new and old versions have different interfaces and different function calls

Solution 1

Change the way the functions are called, or install a newer version of TensorFlow (v1.4 or v1.7).
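To confirm which version you are running (a trivial check, not from the original text):

```python
import tensorflow as tf
print(tf.__version__)   # should report 1.4 or later for tf.data to be available
```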

Why solution 1 works

At first I was hesitant: installing tensorflow-gpu normally requires matching versions of cudatoolkit and cudnn to be installed in advance, and since the versions preinstalled on the virtual machine were fairly old, I assumed I could only install an old version of TensorFlow. But when I ran the installation I found that conda now automatically installs the appropriate cuda and cudnn dependencies for you, as shown below:

conda installs cuda and cudnn for you

So you can safely install a recent version of TensorFlow and no longer have to worry about installing CUDA and cuDNN yourself; as long as the GPU supports it, the installation is straightforward.

Reference 2

The arguments accepted by the Dataset.map() transformation have also changed:

I'd check that you have the latest version. In my case, I still need to use the old tf.contrib.data.
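As far as I can reconstruct the change referred to above (my own sketch, not a quote from the linked thread): in tf.contrib.data, map() took threading and buffering arguments directly, whereas in tf.data (v1.4+) threading moved to num_parallel_calls and buffering became a separate prefetch() call. The file list and parse function below are placeholders.

```python
import tensorflow as tf

filenames = ["train_batch_1.bin", "train_batch_2.bin"]   # placeholder file list
parse_fn = lambda x: x                                    # placeholder parsing function

# Old style (TF <= 1.3, tf.contrib.data) -- shown as a comment only:
# dataset = tf.contrib.data.Dataset.from_tensor_slices(filenames)
# dataset = dataset.map(parse_fn, num_threads=4, output_buffer_size=1000)

# New style (TF >= 1.4, tf.data):
dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.map(parse_fn, num_parallel_calls=4).prefetch(1000)
```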

Solution 2 (not confirmed)

Use the old-style function calls, e.g. change the new data.map(...).prefetch(...) chain back to the old data.map(...) form.
This is just an idea that I have not verified, because I used Solution 1 to solve the problem, so I will not go into it further here.