How to Use Pandas cut() and qcut()?
Pandas is a Python library for manipulating and analyzing structured data. Its cut() and qcut() functions create categorical variables from numerical data: cut() splits the values into discrete intervals, qcut() splits them into quantile-based bins, and both can assign a label to each bin. In this article, we will look at how cut() and qcut() work with the help of several examples.
The cut() Function
The cut() function divides a continuous variable into discrete bins or intervals, using either a requested number of equal-width bins or an explicit list of bin edges. It groups the data into categories based on the range of values present in the input.
Syntax
pandas.cut(x, bins, labels=None, right=True, include_lowest=False, ...)
The parameters used in the above syntax are:
x: The input data, which must be one-dimensional, such as a Pandas Series, a NumPy array, or a list.
bins: This can be an integer value specifying the number of equal-width bins to create, or a sequence of scalar values defining the bin edges. If an integer is provided, the range of values in x will be divided into that many equal-width bins.
labels (optional): An array-like object of labels to assign to each bin; its length must match the number of bins. If not provided (None), each value is labelled with the interval it falls into. If set to False, integer bin indices are returned instead.
right (optional): A boolean value indicating whether the intervals should be right-closed (includes the right bin edge) or left-closed (includes the left bin edge). By default, it is set to True.
include_lowest (optional): A boolean value indicating whether the first interval should also include its left edge, so that a value equal to the lowest bin edge is assigned to a bin. By default, it is set to False.
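The effect of right and include_lowest is easiest to see on values that sit exactly on a bin edge. The following is a minimal sketch, using made-up data and labels=False so that each value is reported as an integer bin index (or NaN when it falls outside every bin):

```python
import pandas as pd

# Illustrative data: every value sits exactly on a bin edge.
data = [0, 10, 20, 30]
edges = [0, 10, 20, 30]

# Default: right-closed intervals (0, 10], (10, 20], (20, 30].
# The value 0 lies outside the first interval, so it becomes NaN.
default_codes = pd.cut(data, edges, labels=False)

# include_lowest=True also closes the first interval on the left, so 0 is kept.
lowest_codes = pd.cut(data, edges, labels=False, include_lowest=True)

# right=False switches to left-closed intervals [0, 10), [10, 20), [20, 30).
# Now the value 30 lies outside the last interval and becomes NaN.
left_codes = pd.cut(data, edges, labels=False, right=False)

print(default_codes)
print(lowest_codes)
print(left_codes)
```

With the default settings the first value is unbinned, with include_lowest=True all four values receive an index, and with right=False it is the last value that is left out.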
Example 1: Equal-width Bins
In the example below, we have a list of numeric data. We specify bins as 3, indicating that we want to divide the data into three equal-width bins. The output shows the interval in which each value falls, followed by the list of categories. Note that pandas extends the lowest edge slightly (by 0.1% of the data range, here from 10 to 9.91) so that the minimum value is included in the first bin.
import pandas as pd

# Example 1: Equal-width bins
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
bins = 3
categories = pd.cut(data, bins)
print(categories)
Output
[(9.91, 40.0], (9.91, 40.0], (9.91, 40.0], (9.91, 40.0], (40.0, 70.0], (40.0, 70.0], (40.0, 70.0], (70.0, 100.0], (70.0, 100.0], (70.0, 100.0]]
Categories (3, interval[float64, right]): [(9.91, 40.0] < (40.0, 70.0] < (70.0, 100.0]]
Example 2: Custom Bin Edges and Labels
In the below example, we define custom bin edges [0, 30, 60, 100] and corresponding labels ['Low', 'Medium', 'High']. The cut() function assigns each value in the data to the appropriate category based on the provided bins and labels.
import pandas as pd

# Example 2: Custom bin edges and labels
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
bins = [0, 30, 60, 100]
labels = ['Low', 'Medium', 'High']
categories = pd.cut(data, bins, labels=labels)
print(categories)
Output
['Low', 'Low', 'Low', 'Medium', 'Medium', 'Medium', 'High', 'High', 'High', 'High']
Categories (3, object): ['Low' < 'Medium' < 'High']
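One thing to watch for with custom edges is that any value outside the outermost edges is not assigned to a bin and comes back as NaN. The sketch below uses a hypothetical scores Series (not part of the example above) to show this, along with one common workaround: using infinity as the last edge to make the top bin open-ended.

```python
import pandas as pd

# Hypothetical data: 150 lies above the largest bin edge used below.
scores = pd.Series([5, 30, 45, 100, 150])
labels = ["Low", "Medium", "High"]

# 150 is above the last edge (100), so it has no bin and becomes NaN.
bounded = pd.cut(scores, [0, 30, 60, 100], labels=labels)

# Using infinity as the last edge makes the top bin open-ended.
open_ended = pd.cut(scores, [0, 30, 60, float("inf")], labels=labels)

print(bounded.tolist())
print(open_ended.tolist())
```

Whether an open-ended bin is appropriate depends on the data; in some cases a NaN is the more honest signal that a value was unexpected.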
The qcut() Function
The qcut() function splits the data based on quantiles or percentiles, unlike the cut() function, which splits the data by value into equal-width or user-defined intervals. Each bin receives approximately the same number of data points, which makes qcut() useful for creating evenly populated groups.
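The difference between the two functions shows up most clearly on skewed data. As a rough sketch, assuming a made-up sample with a single large outlier, we can count how many values land in each bin:

```python
import pandas as pd

# Hypothetical skewed sample: most values are small, one is very large.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# cut(): two equal-width bins over the range 1..100 (split near 50.5).
width_counts = pd.cut(values, 2).value_counts().sort_index()

# qcut(): two bins split at the median (5.5).
quantile_counts = pd.qcut(values, 2).value_counts().sort_index()

print(width_counts.tolist())     # [9, 1]
print(quantile_counts.tolist())  # [5, 5]
```

Here cut() places nine values in the first bin and only the outlier in the second, while qcut() places five values in each.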
Syntax
pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
The parameters used in the syntax are:
x: The input data, which must be one-dimensional, such as a Pandas Series, a NumPy array, or a list.
q: An integer value specifying the number of quantiles to create or a sequence of quantiles (values between 0 and 1) that define the cut-off points.
labels (optional): An array-like object of labels to assign to each bin; its length must match the number of bins. If not provided (None), each value is labelled with the interval it falls into. If set to False, integer bin indices are returned instead.
retbins (optional): A boolean value indicating whether to return the bin edges along with the categories. By default, it is set to False.
precision (optional): An integer value specifying the precision at which the bin labels are stored and displayed. By default, it is set to 3.
duplicates (optional): How to handle bin edges that are not unique, which can happen when many data points share the same value. By default, it is set to 'raise', which raises a ValueError; 'drop' removes the duplicate edges instead, which results in fewer bins.
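The retbins and duplicates parameters can be sketched as follows. The repeated list is made-up data chosen so that several quantile edges coincide:

```python
import pandas as pd

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# retbins=True returns a (categories, edges) tuple instead of just categories.
categories, edges = pd.qcut(data, 4, retbins=True)
print(edges)

# Heavily repeated values can make several quantile edges coincide.
repeated = [1, 1, 1, 1, 1, 1, 1, 2, 3, 4]
try:
    pd.qcut(repeated, 4)  # default duplicates='raise'
except ValueError as err:
    print("qcut failed:", err)

# duplicates='drop' discards the repeated edges, leaving fewer bins.
dropped = pd.qcut(repeated, 4, duplicates="drop")
print(dropped.categories)
```

In this sketch, four quartiles were requested for the repeated data but only two bins remain after the duplicate edges are dropped, so any labels passed alongside duplicates='drop' would need to match the reduced bin count.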
Example 1: Equal Number of Quantiles
In the example below, we have the same numeric data as before. By specifying quantiles as 3, we divide the data into three quantile-based bins of roughly equal size. Because ten values cannot be split evenly into three groups, the bins hold four, three, and three values. The output displays the interval in which each value falls, followed by the list of categories.
import pandas as pd

# Example 1: Equal number of quantiles
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
quantiles = 3
categories = pd.qcut(data, quantiles)
print(categories)
Output
[(9.999, 40.0], (9.999, 40.0], (9.999, 40.0], (9.999, 40.0], (40.0, 70.0], (40.0, 70.0], (40.0, 70.0], (70.0, 100.0], (70.0, 100.0], (70.0, 100.0]]
Categories (3, interval[float64, right]): [(9.999, 40.0] < (40.0, 70.0] < (70.0, 100.0]]
Example 2: Custom Quantiles and Labels
In the example below, we define custom quantiles [0, 0.3, 0.6, 1] and corresponding labels ['Low', 'Medium', 'High']. For this data, the 0.3 and 0.6 quantiles fall at 37 and 64, so the qcut() function assigns values up to 37 to 'Low', values up to 64 to 'Medium', and the rest to 'High'.
import pandas as pd

# Example 2: Custom quantiles and labels
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
quantiles = [0, 0.3, 0.6, 1]
labels = ['Low', 'Medium', 'High']
categories = pd.qcut(data, quantiles, labels=labels)
print(categories)
Output
['Low', 'Low', 'Low', 'Medium', 'Medium', 'Medium', 'High', 'High', 'High', 'High']
Categories (3, object): ['Low' < 'Medium' < 'High']
Conclusion
In this article, we discussed how to use the pandas cut() and qcut() functions to create categorical variables from numerical data. The cut() function divides the data into discrete intervals, either equal-width or defined by custom bin edges, while the qcut() function splits the data into quantile-based bins that each hold roughly the same number of values. Both functions can also assign labels to each interval or quantile, which helps in transforming numerical data into categorical data.