What is Bellman Equation in Reinforcement Learning?

Anyone who has encountered reinforcement learning (RL) knows that the Bellman Equation is an essential component of RL and appears in many forms throughout the field. By tying together several RL functions, the Bellman equation helps an agent make more calculated decisions and achieve better outcomes. In this post, we will first go over some of the fundamental terms related to reinforcement learning, then look at some of the equations that are frequently used in RL, and finally take a deep dive into the Bellman Equation.

What is Reinforcement Learning?

Reinforcement learning is a form of machine learning that teaches a model to choose the best course of action while solving a problem. We create an environment using the problem description as a guide. The model interacts with this environment and finds solutions on its own, independent of human intervention. Rewarding it when it takes a step toward its objective and penalizing it when it takes a step away will steer it in the correct direction. Let's use an illustration to better grasp this.

Think back to when you were learning to ride a bicycle for the first time as a youngster. Your guardian or parent helped you maintain your balance and occasionally gave instructions. What matters most is that they did not completely oversee you while you were learning. Instead, you made mistakes on your own, learned from them, and kept trying. With enough practice, your brain ultimately adapted to this new skill, and you were finally able to keep your balance while riding the bicycle.

However, this learning process was neither entirely supervised nor completely unsupervised. Instead, it was rather loosely controlled. Keep in mind that RL (Reinforcement Learning) is a field distinct from both supervised and unsupervised learning. When you fell off the bicycle, you realized that was not the proper way to ride, so you tried something else. When you managed to maintain your balance for a longer period of time, you recognized that what you were doing was the correct method. The same principles apply to reinforcement learning. RL is a "trial and error" approach to learning. Although direct supervision is not available, feedback (rewards and punishments) makes up for it and drives the learning.

Basic Terminologies of Reinforcement Learning

Having understood the fundamental idea underlying RL, let's now go through the basic terms used in reinforcement learning, which will finally take us to the formal definition of RL.


Agent − In real life, an agent is a thing that attempts to figure out how to do something in the best manner possible. In our illustration, the youngster is the agent who masters bicycle riding.


Action − What the agent performs at each time step is the action in real-world terms. In our illustration, the action would be pedaling or steering while the youngster learns to ride the bicycle.


Reward − A reward is nothing more than a form of feedback sent to the agent based on the agent's behavior. Positive rewards are given for actions that succeed or have the potential to lead to success, and negative rewards for actions that do not. This is analogous to a youngster receiving praise from older children after successfully riding a bicycle for a longer period of time while maintaining balance.


Environment − In real life, the environment refers to the agent's surroundings, i.e., the actual context in which the agent operates.
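The four terms above can be sketched as a simple interaction loop. The toy environment below, in which the agent tries to stay balanced for five consecutive steps, is a hypothetical illustration (the class name, actions, and reward values are made up for this example, not part of any standard library):

```python
import random

class RideEnvironment:
    """Toy environment: the agent tries to stay balanced for 5 steps in a row."""
    def __init__(self):
        self.steps_balanced = 0

    def step(self, action):
        # "pedal" keeps the agent balanced most of the time
        if action == "pedal" and random.random() < 0.8:
            self.steps_balanced += 1
            reward = +1          # positive feedback: a step toward the goal
        else:
            self.steps_balanced = 0
            reward = -1          # negative feedback: fell off, start over
        done = self.steps_balanced >= 5
        return self.steps_balanced, reward, done

# The agent-environment loop: the agent acts, the environment
# responds with a new state and a reward, until the goal is reached.
env = RideEnvironment()
state, done = 0, False
while not done:
    action = "pedal"             # a fixed policy, just for illustration
    state, reward, done = env.step(action)
```

A real RL agent would use the reward signal to improve its choice of action over time; here the policy is fixed only to keep the loop readable.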

Understanding Bellman Equation

One of the essential building blocks of reinforcement learning is the Bellman equation. The equation shows us what long-term gain we can anticipate, given our current situation and assuming that we perform as best we can at this point and every following step.

The Bellman equation can be used to determine whether we have achieved the aim, because the main objective of reinforcement learning is to maximize the long-term reward. The value of the current state is revealed when the optimal course of action is selected. For deterministic problems, the Bellman equation is shown below.

V(s) = max_a [ R(s, a) + γ · V(s') ]
The equation has three parts −

  • The max function, which selects the action that maximizes the reward (max a)

  • The discount factor (gamma), a hyperparameter that can be tuned either to emphasize the long-term benefit or to make the model focus on low-hanging fruit and favor the best short-term solution

  • The function that computes the reward based on the selected action and the current state (R(s, a))
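The three parts combine into a single backup step. The snippet below evaluates one such step for a made-up deterministic problem (the state names, transition table, reward values, and successor-state values are all hypothetical, chosen only to make the arithmetic easy to follow):

```python
gamma = 0.9  # discount factor

# Hypothetical deterministic transitions: (state, action) -> next state s'
next_state = {("s0", "left"): "s1", ("s0", "right"): "s2"}

# Hypothetical reward function R(s, a)
R = {("s0", "left"): 1.0, ("s0", "right"): 2.0}

# Assume the values of the successor states are already known
V = {"s1": 10.0, "s2": 5.0}

# V(s0) = max_a [ R(s0, a) + gamma * V(s') ]
V_s0 = max(R[("s0", a)] + gamma * V[next_state[("s0", a)]]
           for a in ("left", "right"))

print(V_s0)  # left: 1 + 0.9*10 = 10.0; right: 2 + 0.9*5 = 6.5 -> max is 10.0
```

Note how the max picks "left" even though "right" pays a larger immediate reward: the discounted value of the successor state dominates, which is exactly the long-term view the equation encodes.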

The Bellman equation is a recursive function since it calls itself (s' is the state in the following step).

It can appear counterintuitive that the function calculated within the current step refers to the future step rather than the previous one.

That is because we can only calculate the value of actions once we have reached the terminal state. At this stage, we reverse the process, applying the discount factor and adding the reward function at each step until we reach the first step. The final component is the overall reward.
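The backward process described above can be sketched on a tiny deterministic chain of states. The four-state chain, per-step reward, and discount factor below are made-up example values:

```python
gamma = 0.9

# Deterministic chain: s0 -> s1 -> s2 -> s3 (terminal), reward 1.0 per move
states = ["s0", "s1", "s2", "s3"]
V = {"s3": 0.0}  # the terminal state's value is the first thing we know

# Starting from the terminal state, apply the discount factor and add the
# reward at each step, working back until we reach the first state.
for s, s_next in reversed(list(zip(states[:-1], states[1:]))):
    V[s] = 1.0 + gamma * V[s_next]   # R(s, a) + gamma * V(s')

print(V["s0"])  # 1 + 0.9*(1 + 0.9*(1 + 0.9*0)) = 2.71: the overall reward
```

The value of the first state, 2.71, is the total discounted reward of the whole path, which is the "final component" the text refers to.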


Reinforcement learning, a subset of machine learning, is about making logical decisions to choose the best course of action in a given situation. The Bellman Equation is the tool that lets an agent evaluate those choices and maximize its long-term reward.