We previously saw how Q-learning works, with the help of Q-values and the Q-table. Q-learning is a reinforcement learning algorithm in which an ‘agent’ takes the actions required to reach the optimal solution. It does this with the help of a Q-table, a lookup table of Q-values (which, for large problems, can be approximated by a neural network). The table helps the agent take the step that maximizes the reward, thereby moving toward the optimal solution.
Now, let us see how the agent uses the policy to decide on the next step it needs to take to achieve optimal results.
The policy considers the Q-values of all possible actions that could be taken from the agent's current state. The higher the Q-value, the better the action.
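As a minimal sketch of this selection step (the table values and sizes below are made up purely for illustration), picking the best action just means taking the argmax over the Q-values of the current state's row:

```python
import numpy as np

# Hypothetical Q-table: one row per state, one column per action.
# The values are illustrative, not learned.
q_table = np.array([
    [0.1, 0.5, 0.2],   # state 0
    [0.0, 0.3, 0.9],   # state 1
])

def greedy_action(q_table, state):
    """Pick the action with the highest Q-value for the given state."""
    return int(np.argmax(q_table[state]))

print(greedy_action(q_table, 0))  # action 1: its Q-value (0.5) is the largest in row 0
print(greedy_action(q_table, 1))  # action 2: its Q-value (0.9) is the largest in row 1
```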
There are times, however, when the policy chooses to ignore the Q-table, even though the table already holds the knowledge needed to pick the best-known next step. Instead, it takes a random action in order to explore the environment and find potentially higher rewards.
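This trade-off between exploiting the Q-table and exploring at random is commonly implemented as an epsilon-greedy policy. A minimal sketch, assuming a NumPy Q-table and a hypothetical exploration rate `epsilon` (the specific rule is one standard choice, not stated in the text above):

```python
import random
import numpy as np

def epsilon_greedy_action(q_table, state, epsilon=0.1, rng=random):
    """With probability epsilon, explore with a random action;
    otherwise exploit the best-known action from the Q-table."""
    n_actions = q_table.shape[1]
    if rng.random() < epsilon:
        return rng.randrange(n_actions)       # explore: random action
    return int(np.argmax(q_table[state]))     # exploit: best-known action

# Toy 4-state, 3-action table, all zeros (as at the start of training).
q_table = np.zeros((4, 3))
action = epsilon_greedy_action(q_table, state=0, epsilon=0.1)
assert 0 <= action < 3
```

In practice, epsilon is often started high and decayed over time, which matches the behavior described next: mostly random actions early on, mostly exploitation later.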
When an episode begins, the agent takes random actions, because the Q-table has not yet been populated and contains little information.
But as time progresses, the Q-table is gradually filled in, and the agent gains more knowledge about how it needs to interact with the environment to achieve the maximum reward.
The Q-value is updated after every new action the agent takes, with the help of the Bellman equation. It is important to understand that the updated Q-value is based on the newly received reward and the maximum Q-value achievable from the new state.
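The update described above can be sketched as follows. This is the standard one-step Q-learning rule; the learning rate `alpha` and discount factor `gamma` are hypothetical hyperparameters, not values from the text:

```python
import numpy as np

def q_update(q_table, s, a, reward, s_next, alpha=0.1, gamma=0.99):
    """One-step Q-learning update from the Bellman equation:
    Q(s, a) += alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))"""
    td_target = reward + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (td_target - q_table[s, a])

# Toy 2-state, 2-action table, initialized to zero.
q = np.zeros((2, 2))
q_update(q, s=0, a=1, reward=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(q[0, 1])  # 0.5 * (1.0 + 0.9 * 0.0 - 0.0) = 0.5
```

Note how the update blends the new information (reward plus the best Q-value of the next state) into the old estimate, rather than overwriting it outright.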
For a game like chess, the Q-table would be huge, since it would have to contain every possible board configuration and move, and this would take up far too much memory. Hence, a neural network is used in place of the Q-table, suggesting the optimal action to the agent for every state.
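To make the replacement concrete, here is a minimal sketch of a Q-function as a tiny network: instead of looking up a row per state, a forward pass maps a state feature vector to one Q-value per action. All sizes and weights are illustrative assumptions (a plain NumPy two-layer net, not a production deep Q-network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: an 8-dimensional state, 4 actions, 16 hidden units.
STATE_DIM, N_ACTIONS, HIDDEN = 8, 4, 16
W1 = rng.normal(scale=0.1, size=(STATE_DIM, HIDDEN))
W2 = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))

def q_values(state_vec):
    """Forward pass: state features -> one Q-value per action."""
    hidden = np.maximum(0.0, state_vec @ W1)  # ReLU activation
    return hidden @ W2

state = rng.normal(size=STATE_DIM)
qs = q_values(state)                # shape (N_ACTIONS,)
best_action = int(np.argmax(qs))    # the action the agent would exploit
```

The key point is that the network's weights, not a giant table, store what the agent has learned, so memory no longer grows with the number of possible states.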
Thanks to the use of neural networks, reinforcement learning algorithms have achieved strong performance on tasks such as Dota 2 and Go.