How to Use Pandas cut() and qcut()?
Pandas is a Python library for manipulating and analyzing structured data. Its cut() and qcut() functions create categorical variables from numerical data. cut() splits numerical data into discrete intervals based on value ranges, while qcut() splits data into quantile-based bins that each hold roughly the same number of observations. In this article, we look at both functions with practical examples.
The cut() Function
The cut() function divides a continuous variable into discrete bins based on value ranges. You either ask for a number of equal-width bins or supply the bin edges yourself, and each value is assigned to the interval it falls in.
Syntax
pandas.cut(x, bins, labels=None, right=True, include_lowest=False, ...)
Parameters
x: The input data. It must be one-dimensional and array-like, for example a Pandas Series, a NumPy array, or a Python list.
bins: Either an integer specifying the number of equal-width bins to create, or a sequence of scalar values defining the bin edges.
labels (optional): An array-like object of labels to assign to the bins; its length must match the number of bins. If not provided, the labels are the intervals themselves. Passing labels=False returns integer bin indicators instead.
right (optional): A boolean value indicating whether each interval includes its right edge (default: True, which gives intervals such as (0, 30]).
include_lowest (optional): A boolean value indicating whether the first interval should also include its left edge (default: False).
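The effect of right and include_lowest is easiest to see with values that sit exactly on the bin edges. The following is a minimal sketch; the sample values and variable names are made up for illustration, and labels=False is used only so that the result shows plain bin numbers.

```python
import pandas as pd

# Hypothetical sample values chosen to sit exactly on the bin edges.
values = [0, 30, 60, 100]
edges = [0, 30, 60, 100]

# Default: right-closed intervals (0, 30], (30, 60], (60, 100].
# The value 0 lies outside the first interval, so it becomes NaN.
default_codes = pd.cut(values, edges, labels=False)

# include_lowest=True also closes the first interval on the left,
# so the value 0 is placed in bin 0.
lowest_codes = pd.cut(values, edges, labels=False, include_lowest=True)

# right=False switches to left-closed intervals [0, 30), [30, 60), [60, 100).
# Now the value 100 lies outside the last interval and becomes NaN.
left_codes = pd.cut(values, edges, labels=False, right=False)

print(default_codes)
print(lowest_codes)
print(left_codes)
```

A value that falls outside every interval is not an error; it is returned as NaN, so it is worth checking for missing values after binning with explicit edges.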
Example 1: Equal-width Bins
Here we divide numeric data into three equal-width bins:
import pandas as pd

# Example 1: Equal-width bins
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
bins = 3
categories = pd.cut(data, bins)
print(categories)
[(9.91, 40.0], (9.91, 40.0], (9.91, 40.0], (9.91, 40.0], (40.0, 70.0], (40.0, 70.0], (40.0, 70.0], (70.0, 100.0], (70.0, 100.0], (70.0, 100.0]]
Categories (3, interval[float64, right]): [(9.91, 40.0] < (40.0, 70.0] < (70.0, 100.0]]
Example 2: Custom Bin Edges and Labels
We can define custom bin edges and assign meaningful labels:
import pandas as pd

# Example 2: Custom bin edges and labels
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
bins = [0, 30, 60, 100]
labels = ['Low', 'Medium', 'High']
categories = pd.cut(data, bins, labels=labels)
print(categories)
['Low', 'Low', 'Low', 'Medium', 'Medium', 'Medium', 'High', 'High', 'High', 'High']
Categories (3, object): ['Low' < 'Medium' < 'High']
The qcut() Function
The qcut() function splits the data based on quantiles or percentiles, ensuring each bin contains approximately the same number of data points. This is useful for creating evenly distributed groups.
Syntax
pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
Parameters
x: The input data. It must be one-dimensional, for example a Pandas Series, a NumPy array, or a Python list.
q: An integer specifying the number of quantiles (for example, 4 for quartiles), or a sequence of quantile boundaries between 0 and 1.
labels (optional): An array-like object of labels to assign to the bins. If not provided, the labels are the intervals themselves.
retbins (optional): A boolean value indicating whether to also return the computed bin edges (default: False).
precision (optional): An integer specifying the precision at which the bin labels are stored and displayed (default: 3).
duplicates (optional): How to handle bin edges that are not unique: 'raise' raises a ValueError, 'drop' removes the duplicate edges (default: 'raise').
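The retbins and duplicates parameters can be sketched as follows. The list named repeated is hypothetical data chosen so that several quantiles coincide; the other names are illustrative as well.

```python
import pandas as pd

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# retbins=True returns the computed bin edges alongside the categories.
# For this data the inner edges come out at approximately 37 and 64.
categories, edges = pd.qcut(data, [0, 0.3, 0.6, 1], retbins=True)
print(edges)

# Hypothetical data with many repeated values: several quantiles coincide,
# so the default duplicates='raise' raises a ValueError.
repeated = [1, 1, 1, 1, 1, 1, 2, 3, 4, 5]
try:
    pd.qcut(repeated, 4)
except ValueError as exc:
    print("qcut failed:", exc)

# duplicates='drop' removes the repeated edges, which leaves fewer bins
# than the four that were requested.
dropped = pd.qcut(repeated, 4, duplicates="drop")
print(dropped.categories)
```

Note that with duplicates='drop' the number of resulting bins can be smaller than q, so any labels passed alongside it would have to match the reduced bin count.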
Example 1: Equal Number of Quantiles
Here we divide the data into three quantile-based bins of roughly equal size:
import pandas as pd

# Example 1: Equal number of quantiles
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
quantiles = 3
categories = pd.qcut(data, quantiles)
print(categories)
[(9.999, 40.0], (9.999, 40.0], (9.999, 40.0], (9.999, 40.0], (40.0, 70.0], (40.0, 70.0], (40.0, 70.0], (70.0, 100.0], (70.0, 100.0], (70.0, 100.0]]
Categories (3, interval[float64, right]): [(9.999, 40.0] < (40.0, 70.0] < (70.0, 100.0]]
Example 2: Custom Quantiles and Labels
We can define custom quantiles with specific percentile thresholds:
import pandas as pd

# Example 2: Custom quantiles and labels
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
quantiles = [0, 0.3, 0.6, 1]
labels = ['Low', 'Medium', 'High']
categories = pd.qcut(data, quantiles, labels=labels)
print(categories)
['Low', 'Low', 'Low', 'Medium', 'Medium', 'Medium', 'High', 'High', 'High', 'High']
Categories (3, object): ['Low' < 'Medium' < 'High']
Comparison
| Aspect | cut() | qcut() |
|---|---|---|
| Bin Creation | Equal-width or user-defined intervals | Quantile-based intervals |
| Data Distribution | Bins may have different counts | Bins have approximately equal counts |
| Best For | Value-based categorization | Percentile-based ranking |
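The difference in the table shows up most clearly on skewed data. The sketch below uses a made-up Series with a single large outlier (the data and variable names are assumptions for illustration) and counts how many values each function places in each bin.

```python
import pandas as pd

# Hypothetical right-skewed data: most values are small, one is very large.
skewed = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# cut(): three equal-width bins spanning the range 1 to 1000.
# Nine of the ten values land in the first bin.
width_counts = pd.cut(skewed, 3).value_counts().sort_index()

# qcut(): three bins whose edges are chosen from the data's quantiles,
# so each bin holds roughly a third of the values.
freq_counts = pd.qcut(skewed, 3).value_counts().sort_index()

print(width_counts.tolist())
print(freq_counts.tolist())
```

With cut() the outlier stretches the bin width so far that the middle bin is empty, while qcut() keeps the counts balanced at the cost of very unequal bin widths.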
Conclusion
Use cut() when you need value-based intervals with specific ranges, and qcut() when you need bins with roughly equal counts derived from the data's distribution. Both functions are useful for converting continuous data into categorical variables for analysis.
