How to Use Pandas cut() and qcut()?


Pandas is a Python library for manipulating and analyzing structured data. Its cut() and qcut() functions create categorical variables from numerical data: cut() splits the values into discrete intervals, qcut() splits them into quantile-based bins, and both can assign a label to each bin. In this article, we will explore both functions with the help of several examples.

The cut() Function

The cut() function divides a continuous variable into discrete bins, either a given number of equal-width intervals or intervals defined by explicit bin edges. Each value is assigned to the bin whose range contains it.

Syntax

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)

The most commonly used parameters are:

  • x: The input data, which must be one-dimensional, such as a Pandas Series, a NumPy array, or a list.

  • bins: This can be an integer value specifying the number of equal-width bins to create, or a sequence of scalar values defining the bin edges. If an integer is provided, the range of values in x will be divided into that many equal-width bins.

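  • labels (optional): An array-like object of labels to assign to each bin; its length must match the number of bins. If not provided (None), each bin is labeled with its interval, such as (0, 30]. If set to False, integer bin indices are returned instead.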

  • right (optional): A boolean value indicating whether the intervals should be right-closed (includes the right bin edge) or left-closed (includes the left bin edge). By default, it is set to True.

  • include_lowest (optional): A boolean value indicating whether the first interval should also include its left edge. By default, it is set to False, so a value equal to the lowest bin edge falls outside every bin when right=True.
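The effect of right, include_lowest, and labels=False is easiest to see on a value that sits exactly on a bin edge. The following is a minimal sketch; the data and edges are made up for illustration, and values that fall outside every bin come back as missing (NaN).

```python
import pandas as pd

# Illustrative data; 10 and 60 sit exactly on the outer bin edges
data = [10, 20, 30, 40, 50, 60]
edges = [10, 30, 60]

# Default: right-closed intervals (10, 30] and (30, 60], so 10 is in no bin
default_bins = pd.cut(data, edges)

# include_lowest=True makes the first interval contain its left edge as well
with_lowest = pd.cut(data, edges, include_lowest=True)

# right=False gives left-closed intervals [10, 30) and [30, 60), so 60 is in no bin
left_closed = pd.cut(data, edges, right=False)

# labels=False returns integer bin indices instead of interval labels
codes = pd.cut(data, edges, labels=False, include_lowest=True)

print(default_bins.isna().tolist())  # True marks values left without a bin
print(with_lowest.isna().tolist())
print(left_closed.isna().tolist())
print(codes.tolist())
```

Checking isna() on the result is a quick way to confirm that the chosen edges cover the whole range of the data.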

Example 1: Equal-width Bins

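In the example below, we have a list of numeric data. We specify bins as 3, indicating that we want to divide the data into three equal-width bins. The output shows the interval each value falls into, followed by the list of categories. Note that the lowest edge is 9.91 rather than 10: when bins is an integer, pandas extends the range slightly (by 0.1% of the range) so that the minimum value falls inside the first right-closed interval.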

import pandas as pd

# Example 1: Equal-width bins
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
bins = 3
categories = pd.cut(data, bins)
print(categories)

Output

[(9.91, 40.0], (9.91, 40.0], (9.91, 40.0], (9.91, 40.0], (40.0, 70.0], (40.0, 70.0], (40.0, 70.0], (70.0, 100.0], (70.0, 100.0], (70.0, 100.0]]
Categories (3, interval[float64, right]): [(9.91, 40.0] < (40.0, 70.0] < (70.0, 100.0]]

Example 2: Custom Bin Edges and Labels

In the example below, we define custom bin edges [0, 30, 60, 100] and corresponding labels ['Low', 'Medium', 'High']. Because the intervals are right-closed by default, values in (0, 30] are labeled Low, values in (30, 60] Medium, and values in (60, 100] High.

import pandas as pd

# Example 2: Custom bin edges and labels
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
bins = [0, 30, 60, 100]
labels = ['Low', 'Medium', 'High']
categories = pd.cut(data, bins, labels=labels)
print(categories)

Output

['Low', 'Low', 'Low', 'Medium', 'Medium', 'Medium', 'High', 'High', 'High', 'High']
Categories (3, object): ['Low' < 'Medium' < 'High']
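Since the usual goal is a new categorical variable, the same call is often applied to a DataFrame column. The sketch below assumes a hypothetical DataFrame with a score column (the names df, score, and level are illustrative only) and counts how many rows land in each category.

```python
import pandas as pd

# Hypothetical DataFrame reusing the values from the example above
df = pd.DataFrame({"score": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# Store the binned result as a new categorical column
df["level"] = pd.cut(df["score"], bins=[0, 30, 60, 100],
                     labels=["Low", "Medium", "High"])

# Count rows per category, listed in category order
counts = df["level"].value_counts().sort_index()
print(counts)
```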

The qcut() Function

The qcut() function chooses its bin edges from the quantiles (percentiles) of the data, unlike cut(), which uses equal-width intervals or explicitly supplied edges. As a result, each bin holds approximately the same number of data points, which makes qcut() useful for creating evenly populated groups.
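The difference shows up most clearly on skewed data. The following sketch uses a made-up sample with one large outlier and compares how many values each function places in each of three bins.

```python
import pandas as pd

# Hypothetical skewed sample: mostly small values plus one large outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# cut(): three equal-width bins across the range 1..100
width_counts = pd.cut(values, 3).value_counts().sort_index().tolist()

# qcut(): three bins whose edges are the 1/3 and 2/3 quantiles
quantile_counts = pd.qcut(values, 3).value_counts().sort_index().tolist()

print(width_counts)     # [8, 0, 1]
print(quantile_counts)  # [3, 3, 3]
```

With equal-width bins, the outlier stretches the range so that almost every value lands in the first bin, while the quantile-based bins stay balanced.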

Syntax

pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')

The parameters used in the syntax are:

  • x: The input data, which must be one-dimensional, such as a Pandas Series, a NumPy array, or a list.

  • q: An integer value specifying the number of quantiles to create or a sequence of quantiles (values between 0 and 1) that define the cut-off points.

  • labels (optional): An array-like object of labels to assign to each bin; its length must match the number of bins. If not provided (None), each bin is labeled with its interval. If set to False, integer bin indices are returned instead.

  • retbins (optional): A boolean value indicating whether to return the bin edges along with the categories. By default, it is set to False.

  • precision (optional): An integer value specifying the precision at which the bin labels are stored and displayed. By default, it is set to 3.

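  • duplicates (optional): How to handle bin edges that are not unique, which can happen when many values in x are identical. The default, 'raise', raises a ValueError; 'drop' removes the duplicate edges, which produces fewer bins.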
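The retbins and duplicates parameters can be sketched as follows. The repeated list is a made-up sample, chosen so that several quantile edges coincide.

```python
import pandas as pd

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# retbins=True returns a (categories, bin_edges) tuple
categories, edges = pd.qcut(data, [0, 0.3, 0.6, 1], retbins=True)
print(edges)  # the quantile-based edges that qcut() computed

# Illustrative sample with many repeats: several quantile edges equal 1
repeated = [1, 1, 1, 1, 1, 1, 2, 3]
try:
    pd.qcut(repeated, 4)  # default duplicates='raise'
    raised = False
except ValueError:
    raised = True

# duplicates='drop' discards the repeated edges, leaving fewer bins
dropped = pd.qcut(repeated, 4, duplicates="drop")
print(raised, len(dropped.categories))
```

Inspecting the returned edges is also useful for reusing the same bins on new data with cut().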

Example 1: Equal Number of Quantiles

In the example below, we have the same numeric data as before. By specifying quantiles as 3, we divide the data into three quantile-based bins that hold as close to an equal number of values as possible (here 4, 3, and 3). The output displays the interval each value falls into, followed by the list of categories.

import pandas as pd

# Example 1: Equal number of quantiles
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
quantiles = 3
categories = pd.qcut(data, quantiles)
print(categories)

Output

[(9.999, 40.0], (9.999, 40.0], (9.999, 40.0], (9.999, 40.0], (40.0, 70.0], (40.0, 70.0], (40.0, 70.0], (70.0, 100.0], (70.0, 100.0], (70.0, 100.0]]
Categories (3, interval[float64, right]): [(9.999, 40.0] < (40.0, 70.0] < (70.0, 100.0]]

Example 2: Custom Quantiles and Labels

In the example below, we define custom quantiles [0, 0.3, 0.6, 1] and corresponding labels ['Low', 'Medium', 'High']. The bin edges are therefore the minimum, the 30th percentile, the 60th percentile, and the maximum of the data, and qcut() assigns each value the label of the bin it falls into.

import pandas as pd

# Example 2: Custom quantiles and labels
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
quantiles = [0, 0.3, 0.6, 1]
labels = ['Low', 'Medium', 'High']
categories = pd.qcut(data, quantiles, labels=labels)
print(categories)

Output

['Low', 'Low', 'Low', 'Medium', 'Medium', 'Medium', 'High', 'High', 'High', 'High']
Categories (3, object): ['Low' < 'Medium' < 'High']

Conclusion

In this article, we discussed how to use the pandas cut() and qcut() functions to create categorical variables from numerical data. The cut() function divides the data into equal-width or user-defined intervals, while qcut() splits the data at its quantiles so that the bins are roughly equally populated. Both functions can also assign labels to each bin, which helps in transforming numerical data into categorical data.

Updated on: 13-Oct-2023
