Understanding Activation Functions in Machine Learning

Activation functions are the mathematical components that determine whether a neuron should activate based on its input. They introduce non-linearity into neural networks, enabling them to learn complex patterns and solve real-world problems like image recognition, natural language processing, and time series forecasting.

What is an Activation Function?

An activation function is a mathematical function applied to a neuron's output that determines whether the neuron should be activated or not. Without activation functions, neural networks would only perform linear transformations, severely limiting their ability to model complex relationships in data.

The primary purpose of activation functions is to introduce non-linearity into the network. This non-linearity allows neural networks to approximate any continuous function and learn intricate patterns that cannot be captured by simple linear models.

Importance of Non-linearity

Non-linearity is crucial because most real-world phenomena involve complex, non-linear relationships. Linear activation functions can only model simple additive relationships, while non-linear functions enable networks to capture sophisticated patterns like curves, interactions, and hierarchical features in data.
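To see why non-linearity matters, consider what happens without it: composing two linear layers is mathematically identical to a single linear layer, so no amount of stacking adds expressive power. A short NumPy sketch (with arbitrary example weights) demonstrates the collapse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function between them
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Composing the two linear maps...
two_layer = W2 @ (W1 @ x)

# ...is identical to one linear map whose weights are W2 @ W1
single_layer = (W2 @ W1) @ x

print(np.allclose(two_layer, single_layer))  # True
```

Inserting a non-linear activation between the two matrix multiplications breaks this equivalence, which is exactly what lets depth add representational power.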

Types of Activation Functions

Sigmoid Activation Function

The sigmoid function maps input values to a range between 0 and 1, creating an S-shaped curve. It's particularly useful for binary classification problems where outputs can be interpreted as probabilities.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Generate values across a range of inputs
x = np.linspace(-10, 10, 100)
y = sigmoid(x)

print("Sigmoid function examples:")
print(f"sigmoid(-5) = {sigmoid(-5):.4f}")
print(f"sigmoid(0) = {sigmoid(0):.4f}")  
print(f"sigmoid(5) = {sigmoid(5):.4f}")
Sigmoid function examples:
sigmoid(-5) = 0.0067
sigmoid(0) = 0.5000
sigmoid(5) = 0.9933

Drawbacks: The sigmoid function suffers from the vanishing gradient problem, where gradients become very small in deep networks, slowing down learning.
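The vanishing gradient problem follows directly from the sigmoid's derivative, σ'(x) = σ(x)(1 − σ(x)), which peaks at only 0.25 and shrinks rapidly for large |x|. A small sketch makes this concrete:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)  # maximum value is 0.25, at x = 0

for x in [0, 2, 5, 10]:
    print(f"sigmoid'({x}) = {sigmoid_derivative(x):.6f}")
```

Because backpropagation multiplies these derivatives layer by layer, a 10-layer network can scale its gradient by at most 0.25^10 ≈ 0.000001, which is why learning in early layers slows to a crawl.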

Tanh Activation Function

The hyperbolic tangent (tanh) function maps inputs to a range between -1 and 1. It's similar to sigmoid but produces zero-centered outputs, which can improve training efficiency.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

# Compare tanh and sigmoid
x_values = [-2, -1, 0, 1, 2]
print("Comparison of Tanh and Sigmoid:")
print("x\tTanh\t\tSigmoid")
print("-" * 30)

for x in x_values:
    tanh_val = tanh(x)
    sigmoid_val = sigmoid(x)
    print(f"{x}\t{tanh_val:.4f}\t\t{sigmoid_val:.4f}")
Comparison of Tanh and Sigmoid:
x	Tanh		Sigmoid
------------------------------
-2	-0.9640		0.1192
-1	-0.7616		0.2689
0	0.0000		0.5000
1	0.7616		0.7311
2	0.9640		0.8808
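The similarity between the two functions in the table is no accident: tanh is just a rescaled, shifted sigmoid, via the identity tanh(x) = 2·sigmoid(2x) − 1. A quick check confirms this:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Verify the identity tanh(x) = 2*sigmoid(2x) - 1 across a range of inputs
x = np.linspace(-5, 5, 11)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True
```

This rescaling is what centers tanh's output around zero, which tends to keep the inputs to the next layer balanced and can speed up training.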

Rectified Linear Unit (ReLU)

ReLU is the most popular activation function in modern deep learning. It outputs the input directly if positive, otherwise outputs zero. This simplicity makes it computationally efficient and helps mitigate the vanishing gradient problem.

import numpy as np

def relu(x):
    return np.maximum(0, x)

# Test ReLU function
test_values = [-5, -2, 0, 2, 5]
print("ReLU function examples:")
print("Input\tOutput")
print("-" * 15)

for val in test_values:
    output = relu(val)
    print(f"{val}\t{output}")

# ReLU derivative (useful for backpropagation)
def relu_derivative(x):
    return np.where(x > 0, 1, 0)

print("\nReLU derivatives:")
print("Input\tDerivative")
print("-" * 18)
for val in test_values:
    deriv = relu_derivative(val)
    print(f"{val}\t{deriv}")
ReLU function examples:
Input	Output
---------------
-5	0
-2	0
0	0
2	2
5	5

ReLU derivatives:
Input	Derivative
------------------
-5	0
-2	0
0	0
2	1
5	1

Advantage: Computationally efficient and reduces vanishing gradient problems.
Drawback: Can suffer from "dying ReLU" problem where neurons become permanently inactive.
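A common remedy for dying ReLU is the Leaky ReLU variant, which replaces the hard zero for negative inputs with a small slope (alpha, conventionally 0.01) so gradients never vanish entirely. A minimal sketch:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small slope alpha for negative inputs keeps gradients nonzero,
    # so a neuron stuck in the negative region can still recover
    return np.where(x > 0, x, alpha * x)

for val in [-5, -2, 0, 2, 5]:
    print(f"leaky_relu({val}) = {leaky_relu(val)}")
```

For positive inputs it behaves exactly like ReLU; for negative inputs it passes through a small fraction of the signal instead of clamping to zero.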

Softmax Activation Function

Softmax is used in multi-class classification problems. It converts a vector of real numbers into a probability distribution where all probabilities sum to 1.

import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))  # Subtract max for numerical stability
    return exp_x / np.sum(exp_x)

# Example with 4 classes
logits = np.array([2.0, 1.0, 0.1, 3.0])
probabilities = softmax(logits)

print("Multi-class classification example:")
print("Class\tLogit\tProbability")
print("-" * 30)

for i, (logit, prob) in enumerate(zip(logits, probabilities)):
    print(f"{i}\t{logit:.1f}\t{prob:.4f}")

print(f"\nSum of probabilities: {np.sum(probabilities):.4f}")
print(f"Predicted class: {np.argmax(probabilities)}")
Multi-class classification example:
Class	Logit	Probability
------------------------------
0	2.0	0.2689
1	1.0	0.0994
2	0.1	0.0402
3	3.0	0.5915

Sum of probabilities: 1.0000
Predicted class: 3

Comparison of Activation Functions

| Function | Range | Best For | Main Advantage | Main Drawback |
|----------|-------|----------|----------------|---------------|
| Sigmoid | (0, 1) | Binary classification | Probabilistic output | Vanishing gradients |
| Tanh | (-1, 1) | Hidden layers | Zero-centered output | Vanishing gradients |
| ReLU | [0, ∞) | Hidden layers (deep networks) | Simple, efficient | Dying ReLU problem |
| Softmax | (0, 1) | Multi-class output | Probability distribution | Only for output layer |

Conclusion

Activation functions are essential for enabling neural networks to learn complex, non-linear patterns. ReLU is preferred for hidden layers due to its efficiency, while softmax is ideal for multi-class classification outputs. The choice of activation function significantly impacts model performance and training efficiency.

---
Updated on: 2026-03-27T13:27:31+05:30
