What is an Activation Function?
In deep learning, a network is made up of many neurons, each with inputs and an output. The output is computed from the neuron's weights and bias, and as training updates the weights, the outputs collectively converge toward a model that mimics something in reality. On their own, the weights and bias form a linear function; the activation function turns it into a non-linear one so that the network can learn complex patterns. In other words, the activation function shapes the outputs into the patterns the network is being trained to find, as sketched below.
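As a rough sketch (the inputs, weights, and bias below are made up for illustration), a single neuron computes a weighted sum plus a bias and then passes the result through an activation function:

import numpy as np

def neuron(x, w, b, activation):
    # Linear part: weighted sum of the inputs plus a bias.
    z = np.dot(w, x) + b
    # Non-linear part: the activation function.
    return activation(z)

# Example with ReLU as the activation.
out = neuron(x=np.array([0.5, -1.0]), w=np.array([0.8, 0.3]), b=0.1,
             activation=lambda z: np.maximum(0.0, z))
print(out)  # ≈ 0.2, since 0.8*0.5 + 0.3*(-1.0) + 0.1 = 0.2 and positive values pass through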
Here are some common activation functions with a simplified description of each.
Linear Activation Function
Also known as the ‘Identity Activation Function’.
A simple activation function used in The Perceptron.
Not useful with backpropagation: its derivative is a constant, so the gradient carries no information about the input.
Since it’s just linear, stacked layers collapse into a single linear layer, so it can’t learn complex patterns.
For linear regression models
Used in Output layers.
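In code, the identity activation simply returns its input unchanged (a trivial NumPy sketch):

import numpy as np

def linear(x):
    # Identity activation: the output equals the input.
    return x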
Binary Step Activation Function
A simple activation function.
Outputs 0 or 1 based on a threshold.
Good for binary classification problems
Provides a clear decision boundary
Its gradient is zero almost everywhere (the function is flat on both sides of the threshold), so it can’t be trained with backpropagation.
Used in Output layers.
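A minimal sketch, assuming a threshold of 0 (the threshold itself is a design choice):

import numpy as np

def binary_step(x, threshold=0.0):
    # Outputs 1 where the input reaches the threshold, otherwise 0.
    return np.where(x >= threshold, 1, 0)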
Sigmoid Activation Function
Suffers from Vanishing Gradient Problem
Output is between 0 and 1.
Output can be interpreted as probabilities.
Used in Output layer for Binary Classification.
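A minimal NumPy sketch of the sigmoid:

import numpy as np

def sigmoid(x):
    # Squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))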
Tanh Activation Function
Short for ‘Hyperbolic Tangent’.
Suffers from Vanishing Gradient problem.
Outputs between -1 and 1.
Outputs are zero-centered, ranging from negative to positive, which makes for better optimization.
Used in Hidden layers for better convergence.
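A minimal sketch using NumPy’s built-in hyperbolic tangent:

import numpy as np

def tanh(x):
    # Squashes the input into the range (-1, 1), centered on zero.
    return np.tanh(x)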
ReLU Activation Function
The standard activation function to which others are compared.
Positive input = returns the same value.
Negative input = returns 0.
Some neurons may ‘die’: if a neuron’s input is always negative, its output is stuck at a hard zero, its gradient is zero, and its weights can no longer update. This is the problem of ‘Dead Neurons’.
ReLU-type activation functions are good at mitigating the Vanishing Gradient problem because they don’t saturate for positive values, outputting the input directly when it’s positive.
Very effective in Convolutional Neural Networks (CNNs).
Used in Hidden layers.
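A minimal NumPy sketch of ReLU:

import numpy as np

def relu(x):
    # Passes positive values through unchanged; clips negative values to 0.
    return np.maximum(0.0, x)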
Leaky ReLU Activation Function
Addresses the ‘Dead Neurons’ problem of ReLU by eliminating the hard zero for negative inputs.
Positive input = returns the same value.
Negative input = multiplied by a small constant.
Like all ReLU variants, it mitigates the Vanishing Gradient problem because positive values don’t saturate.
Effective in Hidden layers.
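A minimal sketch; the slope of 0.01 used here for negative inputs is a common default, not a fixed requirement:

import numpy as np

def leaky_relu(x, alpha=0.01):
    # Positive inputs pass through; negative inputs are scaled by a small constant.
    return np.where(x > 0, x, alpha * x)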
Parametric ReLU Activation Function
Keeps neurons active, avoiding the ‘Dead Neurons’ problem of ReLU.
Positive input = returns the same value.
Negative input = multiplied by a small learnable parameter (value is updated during training).
Its flexibility can handle a wide range of patterns.
As with the other ReLU variants, it mitigates the Vanishing Gradient problem since positive values don’t saturate.
Effective for Hidden layers.
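A minimal sketch; here alpha is passed in explicitly, whereas in a real framework it would be a learnable parameter updated by gradient descent alongside the weights:

import numpy as np

def prelu(x, alpha):
    # Same shape as Leaky ReLU, but alpha is learned during training rather than fixed.
    return np.where(x > 0, x, alpha * x)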
Exponential Linear Unit Activation Function (ELU)
Output for negative values curves smoothly, leading to more stable convergence.
Steady Gradients, Smoother Learning.
Uses an exponential function for negative values.
Used in Hidden layers.
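A minimal sketch, assuming the commonly used alpha = 1.0:

import numpy as np

def elu(x, alpha=1.0):
    # Identity for positive inputs; a smooth exponential curve alpha*(e^x - 1) for negatives.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))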
Swish Activation Function
A relatively new activation function, defined as x · sigmoid(x).
Smoother transition for negative inputs than ReLU.
Can pick up linear patterns embedded within non-linear ones, making it very adaptable for deep learning.
Used in both Hidden and Output layers.
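A minimal sketch; with beta = 1.0 this is the standard form x · sigmoid(x):

import numpy as np

def swish(x, beta=1.0):
    # x * sigmoid(beta * x): smooth, and nearly linear for large positive inputs.
    return x / (1.0 + np.exp(-beta * x))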
Maxout Activation Function
More flexible, adaptable than traditional activation functions.
In a typical neural network, the weights and biases create one linear function. In Maxout, multiple linear functions are created because there are different sets of weights and biases.
Maxout outputs the maximum of these linear functions, the one that represents the features most strongly, which gives it high representational power (see the sketch below).
Used in both Hidden and Output layers.
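A rough sketch of a single Maxout unit, with made-up shapes (k linear pieces, each with its own weights and biases):

import numpy as np

def maxout(x, W, b):
    # Evaluate k different linear functions of x, then keep the element-wise maximum.
    z = np.einsum('kij,j->ki', W, x) + b   # shape (k, n_out): k linear pieces
    return z.max(axis=0)                   # strongest piece per output unit

# Example: k=3 linear pieces, 2 inputs, 4 outputs, random weights for illustration.
rng = np.random.default_rng(0)
y = maxout(x=rng.normal(size=2), W=rng.normal(size=(3, 4, 2)), b=rng.normal(size=(3, 4)))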
Softmax Activation Function
Enables well-behaved gradients for classification problems, especially when paired with a cross-entropy loss.
Used in Output layer for multi-class classification.
Outputs form a probability distribution: each value is between 0 and 1.
The outputs sum to 1 (100%).
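A minimal NumPy sketch, using the usual max-subtraction trick for numerical stability:

import numpy as np

def softmax(z):
    # Convert a vector of scores into probabilities that sum to 1.
    e = np.exp(z - np.max(z))   # subtracting the max avoids overflow; the result is unchanged
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.659, 0.242, 0.099]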
Summary
The choice of activation function depends on the task at hand, and experimentation is common. ReLU remains a popular choice because of its simplicity as well as its robustness to different initialization schemes and learning rates. New innovations in activation functions are continually being explored, such as adaptive activation functions that can dynamically adjust their behavior based on the input data or training progress, in effect behaving like a different type of function as conditions change.