Q-learning is a type of reinforcement learning algorithm that contains an ‘agent’ that takes actions required to reach the optimal solution.
Reinforcement learning is a part of the ‘semi-supervised’ machine learning algorithms. When an input dataset is provided to a reinforcement learning algorithm, it learns from such a dataset, otherwise it learns from its experiences and surroundings.
When the ‘reinforcement agent’ performs an action, it is awarded or punished (awards and punishments are different, as they depend on the data available in hand) based on whether it predicted correctly (or took the right path or took a path that was least expensive).
If the ‘reinforcement agent’ gets an award, it moves in the same direction or on similar lines. Otherwise, if the agent is punished, it comes to the understanding that the solution it gave out was not correct or optimal, and that it needs to find better paths or outputs.
The reinforcement agent interacts with its surroundings, takes actions on certain issues thereby ensuring that the total amount of rewards/awards is maximized.
To understand this better, let us take the example of a game of chess. The idea is that every player in the game takes an action so as to win (perform a checkmate, take off all the pawns of the opponent player, and so on). The ‘agent’ would move the chess pawns, and change the state of the pawn. We can visualize the chess board as a graph that has vertices and the ‘agent’ moves from one edge to another.
Q-learning uses Q-table that helps the agent to understand and decide upon the next move that it should take. Q-table consists of rows and columns, where every row corresponds to every chess board configuration and columns correspond to all the possible moves (actions) that the agent could take. The Q-table also contains a value known as Q-value that contains the expected reward which the agent receives when they take an action and move from current state to next state.
Let us understand how it works.
In the beginning of the game, the Q-table is initialized with a random value.
Next, for every episode −
This goes on till the end stage for a particular episode is reached.
Note − One episode can be understood as an entire game of chess, in our example. Else, it is just one entire working of a problem in hand.