Why do linear activation functions work poorly?

Why does a nonlinear activation function need to be used in a backpropagation neural network?


Answer:


The purpose of the activation function is to introduce non-linearity into the network.

This allows you to model a response variable (also known as a target variable, class label, or score) that does not vary linearly with its explanatory variables.

Non-linear means that the output cannot be reproduced from a linear combination of the inputs (which is not the same as output that renders to a straight line; the word for that is affine).

Another way of looking at it: without a nonlinear activation function in the network, an NN, regardless of how many layers it has, would behave just like a single-layer perceptron, because stacking these layers only produces another linear function (see the definition above).
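A quick NumPy sketch of that collapse (the layer sizes and the random weights are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked "linear" layers: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layer = W2 @ (W1 @ x + b1) + b2

# The exact same map, collapsed into a single linear layer: y = W @ x + b
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True
```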

A common activation function used in backprop (the hyperbolic tangent), evaluated from -2 to 2:

[figure: tanh activation function]

However, a linear activation function can be used in very limited cases. To better understand activation functions, it helps to look at ordinary least squares, i.e. simple linear regression. Linear regression aims to find the optimal weights that, combined with the input, produce the smallest vertical offset between the explanatory and the target variable. In short, if the expected output reflects a linear relationship as in the linear regression of the first figure, a linear activation function can be used. But for data like that in the second figure, a linear function does not produce the desired result, whereas a nonlinear function, as in the third figure, does.
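A rough NumPy sketch of the same point (the quadratic toy data is my own choice): the best straight-line fit, which is all a purely linear model can express, misses a nonlinear target badly, while the right nonlinear feature matches it exactly.

```python
import numpy as np

# Toy data whose target depends nonlinearly (quadratically) on the input.
x = np.linspace(-3, 3, 50)
y = x ** 2

# Ordinary least squares: best straight line y ~ w*x + b,
# which is all a purely linear model can express.
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
linear_pred = w * x + b

# With the right nonlinear feature (here simply x^2) the target is matched exactly.
nonlinear_pred = x ** 2

print("linear fit MSE:   ", np.mean((linear_pred - y) ** 2))     # large
print("nonlinear fit MSE:", np.mean((nonlinear_pred - y) ** 2))  # 0.0
```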

Activation functions cannot be linear, because neural networks with a linear activation function are effectively only one layer deep, regardless of how complex their architecture is. The input to a network is usually a linear transformation (input * weight), but the real world and its problems are non-linear. To make the incoming data non-linear, we use a non-linear mapping called an activation function. An activation function is a decision function that indicates the presence of a particular neural feature. It maps to values between 0 and 1, where zero means the feature is absent and one means it is present. Unfortunately, small changes in the weights cannot be reflected in activation values that can only take the values 0 or 1, so the nonlinear function must be continuous and differentiable over this range.

A neural network must be able to take any input from -infinity to +infinity, but it should map it to an output that in some cases lies between {0,1} or between {-1,1}, hence the need for an activation function. Nonlinearity is needed in activation functions because the goal of a neural network is to produce a nonlinear decision boundary from non-linear combinations of the weights and inputs.
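As a small illustration of the "continuous and differentiable" point, here is a NumPy sketch (the function names are my own) contrasting a hard 0/1 step with the sigmoid, which smoothly squashes any real input into (0, 1):

```python
import numpy as np

def step(z):
    # Hard threshold: outputs only 0 or 1, so tiny weight changes usually
    # change nothing at all (and the "derivative" is 0 almost everywhere).
    return (z > 0).astype(float)

def sigmoid(z):
    # Smooth squashing of (-inf, +inf) into (0, 1): small input changes
    # produce small, differentiable output changes.
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-20.0, -2.0, -0.1, 0.1, 2.0, 20.0])
print(step(z))     # [0. 0. 0. 1. 1. 1.]
print(sigmoid(z))  # values strictly between 0 and 1, smoothly increasing
```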





If we only allow linear activation functions in a neural network, the output is just a linear transformation of the input, which is not enough to form a universal function approximator. Such a network can only be represented as matrix multiplication, and you might not get very interesting behaviors from such a network.

The same is true if all neurons have affine activation functions (i.e. an activation function of the form f(x) = a*x + b, where a and b are constants, which is a generalization of linear activation functions): the result is only an affine transformation from input to output, which is not very exciting either.

A neural network can very well contain neurons with linear activation functions, for example in the output layer, but they must be accompanied by neurons with a non-linear activation function elsewhere in the network.

Note: An interesting exception is DeepMind's synthetic gradients, for which they use a small neural network to predict the gradient in the backpropagation pass given the activation values, and they find that they can get away with using a neural network with no hidden layers and only linear activations.


[figure: a feedforward neural network with two hidden layers and linear activations]

A feedforward neural network with linear activations and any number of hidden layers is equivalent to a linear neural network with no hidden layers. For example, consider a neural network like the one in the figure, with two hidden layers and no activation function.
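A sketch of the algebra that this network implies (the notation W_k, b_k for the layer weights and biases is mine):

```latex
\begin{aligned}
h_1 &= W_1 x + b_1 \\
h_2 &= W_2 h_1 + b_2 = W_2 W_1 x + (W_2 b_1 + b_2) \\
y   &= W_3 h_2 + b_3
     = \underbrace{W_3 W_2 W_1}_{W}\, x + \underbrace{W_3 W_2 b_1 + W_3 b_2 + b_3}_{b}
     = W x + b
\end{aligned}
```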

We can do the last step because the combination of multiple linear transformations can be replaced by one transformation and the combination of multiple bias terms is just a single bias. The result is the same even if we add linear activation.

So we could replace this neural network with a single-layer neural network, and this can be extended to any number of layers. This shows that adding layers does not increase the approximation power of a linear neural network at all. We need nonlinear activation functions to approximate nonlinear functions, and most real-world problems are highly complex and nonlinear. In fact, when the activation function is nonlinear, it can be proven that a two-layer neural network with a sufficiently large number of hidden units is a universal function approximator.

"This work uses the Stone-Weierstrass theorem and the Gallant and White cosine squasher to establish that standard multilayer feedforward network architectures using arbitrary squashing functions can approximate virtually any function of interest to any desired degree of accuracy, provided there are enough hidden units available." (Hornik et al., 1989, Neural Networks)

A squashing function is, for example, a non-linear activation function that, like the sigmoid activation function, maps onto [0, 1].
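As a toy illustration of that theorem, here is a NumPy sketch (the hidden-layer size, learning rate, and iteration count are arbitrary choices) that trains a one-hidden-layer tanh network by plain gradient descent to approximate sin(x), something no purely linear network can do well:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a clearly nonlinear function on [-pi, pi].
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)

# One hidden layer of tanh units, linear output unit.
H, lr = 20, 0.1
W1 = rng.normal(scale=0.5, size=(1, H)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.5, size=(H, 1)); b2 = np.zeros(1)

for _ in range(10000):
    # Forward pass
    h = np.tanh(x @ W1 + b1)          # hidden activations, shape (200, H)
    pred = h @ W2 + b2                # network output,      shape (200, 1)
    err = pred - y

    # Backward pass (gradients of the mean squared error)
    gW2 = h.T @ err / len(x); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)  # tanh'(z) = 1 - tanh(z)^2
    gW1 = x.T @ dh / len(x); gb1 = dh.mean(axis=0)

    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# The error should end up far smaller than any straight-line fit could achieve.
print("final MSE:", float(np.mean(err ** 2)))
```

If the tanh nonlinearity is removed (h = x @ W1 + b1), the same training loop can only ever produce a straight line, no matter how long it runs.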


There are times when a purely linear network can produce useful results. Suppose we have a network of three layers with shapes (3,2,3). By restricting the middle layer to only two dimensions, we get a result that is the "plane of best fit" in the original three-dimensional space.

There are simpler ways to find linear transformations of this form, such as NMF, PCA, etc. Still, this is one case where a multi-layer network does NOT behave like a single-layer perceptron.
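A NumPy sketch of that special case (the synthetic data is my own, and I read the optimal weights off an SVD rather than actually training the (3, 2, 3) network, relying on the classical equivalence between linear autoencoders and PCA):

```python
import numpy as np

rng = np.random.default_rng(0)

# Centered 3-D data that lies mostly near a 2-D plane.
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.1])
X -= X.mean(axis=0)

# Under squared error, the optimal (3, 2, 3) purely linear network reproduces the
# projection onto the top-2 principal components, so its weights can be read
# straight off an SVD instead of being trained.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V2 = Vt[:2].T            # top-2 principal directions, shape (3, 2)

codes = X @ V2           # 2-D bottleneck layer ("encoder")
recon = codes @ V2.T     # back to 3-D ("decoder"): the plane of best fit

print("reconstruction MSE:", np.mean((X - recon) ** 2))  # roughly the variance of the 3rd direction
```

Training the same three-layer linear network with gradient descent would be expected to recover essentially the same subspace, which is why PCA is the simpler tool here.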


To understand the logic behind nonlinear activation functions, you should first understand why activation functions are used at all. In general, real-world problems require nonlinear solutions, which are not trivial. So we need some functions to create that non-linearity. Basically, what an activation function does is create this non-linearity while mapping input values into a desired range.

However, linear activation functions can be used in the very limited cases where you don't need hidden layers, such as linear regression. Usually there is no point in building a neural network for these kinds of problems, because regardless of the number of hidden layers, such a network produces a linear combination of the inputs, which can be computed in a single step. In other words, it behaves like a single layer.

There are also some further desirable properties for activation functions, such as continuous differentiability. Since we use backpropagation, the function we use must be differentiable everywhere. I advise you to check the Wikipedia page on activation functions to get a better understanding of the subject.


There are some great answers here. It is also worth referring to the book "Pattern Recognition and Machine Learning" by Christopher M. Bishop for a deeper look into various ML-related concepts. Extract from page 229 (Section 5.1):

If the activation functions of all hidden units in a network are assumed to be linear, we can always find an equivalent network without hidden units for any such network. This follows from the fact that the composition of successive linear transformations is itself a linear transformation. However, if the number of hidden units is less than the number of input or output units, the transformations that the network can produce are not the most general possible linear transformations from inputs to outputs, since information is lost in the dimensional reduction of the hidden units. In Section 12.4.2 we show that networks of linear units lead to a principal component analysis. In general, however, there is little interest in multilayer networks of linear units.


As I remember, sigmoid functions are used because their derivative, which is needed in the BP algorithm, is easy to compute: something as simple as f(x)(1 - f(x)). I don't remember the exact math. In fact, any differentiable function can be used.
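A quick numerical check of that derivative identity, as a NumPy sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
f = sigmoid(x)

# The analytic form used in backprop: sigma'(x) = sigma(x) * (1 - sigma(x))
analytic = f * (1 - f)

# Numerical check with a central difference
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```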



A layered NN of multiple neurons can be used to learn problems that are not linearly separable. For example, the XOR function can be obtained with two layers using a step activation function, as sketched below.
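A sketch of that XOR construction (the 2-2-1 layout and the particular hand-picked weights are just one of many possible choices):

```python
import numpy as np

def step(z):
    return (z > 0).astype(float)

# A 2-2-1 network with a step activation and hand-picked weights:
# the hidden units compute OR and AND, and the output unit computes
# "OR and not AND", which is exactly XOR.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])      # columns: weights into the OR unit and the AND unit
b1 = np.array([-0.5, -1.5])      # thresholds: OR fires at sum > 0.5, AND at sum > 1.5

W2 = np.array([[1.0], [-1.0]])   # output: +OR - AND
b2 = np.array([-0.5])

h = step(X @ W1 + b1)
y = step(h @ W2 + b2)
print(y.ravel())                 # [0. 1. 1. 0.]  == XOR
```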


Let me explain it to you as simply as possible:

Neural networks are used for pattern recognition, right? And finding patterns is a highly non-linear task.

For the sake of argument, let's assume we use a linear activation function y = wX + b for every single neuron and set something like: if y > 0 -> class 1, otherwise class 0.

Now we can compute our loss using the squared error loss and backpropagate it so that the model learns well, right?

NOT CORRECT.

  • For the last hidden layer, the update is w{l} = w{l} - alpha * X.

  • For the second-to-last hidden layer, the update is w{l-1} = w{l-1} - alpha * w{l} * X.

  • For the i-th hidden layer (counting from the end), the update is w{i} = w{i} - alpha * w{l} * ... * w{i+1} * X.

This means we end up multiplying all of those weight matrices together, which leads to one of the following possibilities:

  • (A) w{i} barely changes, because of a vanishing gradient;

  • (B) w{i} changes dramatically and imprecisely, because of an exploding gradient;

  • (C) w{i} changes just well enough to give us a good fit.
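A minimal numeric illustration of (A) and (B): the early-layer update contains a product of all later weights, so repeated factors below 1 shrink it toward zero and factors above 1 blow it up (the depth of 30 and the factors 0.5 and 1.5 are arbitrary choices):

```python
depth = 30            # number of later layers whose weights get multiplied together
small, large = 0.5, 1.5

print("product of 30 weights of 0.5:", small ** depth)  # ~9.3e-10 -> vanishing update
print("product of 30 weights of 1.5:", large ** depth)  # ~1.9e+05 -> exploding update
```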

When (C) occurs, it means that our classification/prediction problem was most likely a simple linear- or logistic-regression problem and didn't need a neural network at all!

No matter how robust or well-tuned your NN is, if you use a linear activation function, you will never be able to solve nonlinear pattern-recognition problems.


It's not a requirement at all. In fact, the rectified linear activation function is very useful in large neural networks. Computing the gradient is much faster, and it induces sparsity by clamping negative values to 0.
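A minimal sketch of those two properties (the function names are my own):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # The gradient is just a 0/1 indicator of whether the unit is active.
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.  0.5 2. ]  -> many exact zeros: sparse activations
print(relu_grad(z))  # [0. 0. 0. 1. 1.]       -> trivially cheap gradient
```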

For more information, see the following: https://www.academia.edu/7826776/Mathematical_Intuition_for_Performance_of_Rectified_Linear_Unit_in_Deep_Neural_Networks


Edit:

There has been some discussion about whether the rectified linear activation function can be called a linear function.

Yes, it is technically a nonlinear function, because it is not linear at the point x = 0. However, it is still correct to say that it is linear everywhere else, so I don't think it is very useful to nitpick here.

I could have chosen the identity function and it would still be true, but I chose ReLU as an example because of its recent popularity.





