Python Pandas - Unioning Categorical Data



Unioning categorical data means combining multiple categorical objects (a Categorical, a CategoricalIndex, or a Series with the category dtype) into a single categorical whose categories are the union of the inputs' categories. This operation is useful when combining data from different sources whose categories do not match exactly.

In the Concatenating Categorical Data tutorial, we saw that the dtype of a concatenated result, and therefore its memory usage, depends on whether the categories of the inputs match. Here, we will learn how to use the union_categoricals() function to combine categorical data while managing the categories consistently.

The union_categoricals() Function

The union_categoricals() function from pandas.api.types combines multiple categoricals into a single Categorical. The categories of the result are the union of the categories of all the inputs.

Syntax

Following is the syntax of this function −

pandas.api.types.union_categoricals(to_union, sort_categories=False, ignore_order=False)

Where,

  • to_union: A list-like of Categorical, CategoricalIndex, or Series objects with dtype='category'.

  • sort_categories: A boolean. If True, the resulting categories are lexsorted. Otherwise, they are kept in the order in which they appear across the inputs' categories.

  • ignore_order: A boolean. If True, the ordered attribute of the inputs is ignored and the result is an unordered categorical.

Example

Here is a basic example of combining categorical data with the pandas.api.types.union_categoricals() function. Series s1 has the categories 'cat' and 'dog', while s2 has 'cat', 'dog', and 'mouse'. The union_categoricals() function merges these into a single set of categories: 'cat', 'dog', and 'mouse'.

import pandas as pd
from pandas.api.types import union_categoricals

# Creating categorical Series
s1 = pd.Series(["cat", "dog"], dtype="category")
s2 = pd.Series(["cat", "mouse", 'dog'], dtype="category")

# Display the Input Series objects
print("Input Series 1:")
print(s1)
print("\nInput Series 2:")
print(s2)

# Unioning the categorical Series
result = union_categoricals([s1, s2])

print("\nSeries after Unioning the categorical Series':")
print(result)

When we run the above program, it produces the following result −

Input Series 1:
0    cat
1    dog
dtype: category
Categories (2, object): ['cat', 'dog']

Input Series 2:
0      cat
1    mouse
2      dog
dtype: category
Categories (3, object): ['cat', 'dog', 'mouse']

Series after Unioning the categorical Series':
['cat', 'dog', 'cat', 'mouse', 'dog']
Categories (3, object): ['cat', 'dog', 'mouse']
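
Note that union_categoricals() returns a Categorical rather than a Series. The following is a minimal sketch, reusing the same inputs as the example above, of wrapping the result in a Series and comparing its dtype with that of a plain pd.concat(). The variable names combined and concatenated are only illustrative.

```python
import pandas as pd
from pandas.api.types import union_categoricals

s1 = pd.Series(["cat", "dog"], dtype="category")
s2 = pd.Series(["cat", "mouse", "dog"], dtype="category")

# union_categoricals() returns a Categorical, so wrap it to get a Series back
combined = pd.Series(union_categoricals([s1, s2]))
print("union_categoricals dtype:", combined.dtype)

# pd.concat() does not keep the category dtype when the categories differ
concatenated = pd.concat([s1, s2], ignore_index=True)
print("concat dtype:", concatenated.dtype)
```

Both objects hold the same values; only the union keeps them stored as a categorical.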

Unioning and Sorting Categorical Data

By default, the categories in the resulting union are kept in the order in which they appear across the inputs' categories. If you want the categories to be lexsorted instead, pass the sort_categories=True parameter.

Example

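The following example demonstrates unioning and sorting categorical data using the union_categoricals() function with the sort_categories=True parameter. Because the combined categories here are already in alphabetical order, the output matches the previous example.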

import pandas as pd
from pandas.api.types import union_categoricals

# Creating categorical Series
s1 = pd.Series(["cat", "dog"], dtype="category")
s2 = pd.Series(["cat", "mouse", 'dog'], dtype="category")

# Display the Input Series objects
print("Input Series 1:")
print(s1)
print("\nInput Series 2:")
print(s2)

# Unioning with sorted categories
result = union_categoricals([s1, s2], sort_categories=True)

print("\nSeries after Unioning and Sorting the categorical Series':")
print(result)

Following is the output of the above code −

Input Series 1:
0    cat
1    dog
dtype: category
Categories (2, object): ['cat', 'dog']

Input Series 2:
0      cat
1    mouse
2      dog
dtype: category
Categories (3, object): ['cat', 'dog', 'mouse']

Series after Unioning and Sorting the categorical Series':
['cat', 'dog', 'cat', 'mouse', 'dog']
Categories (3, object): ['cat', 'dog', 'mouse']
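
To make the effect of sort_categories visible, the inputs need categories whose combined order is not already alphabetical. The following is a small sketch with hypothetical single-letter categories, chosen only for illustration.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Hypothetical inputs: the second one introduces 'a' after 'b' and 'c'
a = pd.Categorical(["b", "c"])
b = pd.Categorical(["a", "b"])

# Default: categories keep the order in which they appear across the inputs
default_union = union_categoricals([a, b])
print(list(default_union.categories))  # ['b', 'c', 'a']

# sort_categories=True: categories are lexsorted
sorted_union = union_categoricals([a, b], sort_categories=True)
print(list(sorted_union.categories))   # ['a', 'b', 'c']
```

The values themselves are the same in both results; only the order of the categories changes.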

Unioning Ordered Categorical Data

The union_categoricals() function can combine ordered categoricals only when they all have identical categories in the same order. If the categories are not identical, a TypeError is raised.

Example

The following example shows how the union_categoricals() function combines ordered categorical Series seamlessly.

import pandas as pd
from pandas.api.types import union_categoricals

# Creating categorical Series
a = pd.Categorical(["cat", "dog"], ordered=True)
b = pd.Categorical(["cat", 'dog', "cat"], ordered=True)

s1 = pd.Series(a)
s2 = pd.Series(b)

# Display the Input Series objects
print("Input Series 1:")
print(s1)
print("\nInput Series 2:")
print(s2)

# Unioning ordered categoricals
result = union_categoricals([s1, s2])

print("\nSeries after Unioning the ordered categorical Series':")
print(result)

While executing the above code we get the following output −

Input Series 1:
0    cat
1    dog
dtype: category
Categories (2, object): ['cat' < 'dog']

Input Series 2:
0    cat
1    dog
2    cat
dtype: category
Categories (2, object): ['cat' < 'dog']

Series after Unioning the ordered categorical Series':
['cat', 'dog', 'cat', 'dog', 'cat']
Categories (2, object): ['cat' < 'dog']
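
The ordering is easier to see with an explicit, non-alphabetical category order. The following is a minimal sketch that assumes a hypothetical list of levels ('low', 'medium', 'high') shared by both inputs. It checks that the result is still ordered and that order-dependent operations such as max() keep working.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Hypothetical ordered levels shared by both inputs
levels = ["low", "medium", "high"]
a = pd.Categorical(["low", "high"], categories=levels, ordered=True)
b = pd.Categorical(["medium", "low"], categories=levels, ordered=True)

result = union_categoricals([a, b])

print(result.ordered)           # the result keeps the ordered attribute
print(list(result.categories))  # categories stay in the declared order
print(result.max())             # max() follows the category order
```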

Handling Different Orders while Unioning

If you try to union two ordered categoricals that have different categories, or the same categories in a different order, a TypeError is raised. To avoid this exception, you can pass the ignore_order=True argument, which lets the union proceed. The result is then an unordered categorical.

Example

This example demonstrates unioning two ordered categoricals that have different categories. The first attempt raises a TypeError, which is caught and printed. The second attempt succeeds because it passes ignore_order=True.

import pandas as pd
from pandas.api.types import union_categoricals

# Ordered categoricals with different categories
a = pd.Categorical(["cat", "dog"], ordered=True)
b = pd.Categorical(["cat", 'mouse'], ordered=True)

s1 = pd.Series(a)
s2 = pd.Series(b)

# Display the Input Series objects
print("Input Series 1:")
print(s1)
print("\nInput Series 2:")
print(s2)

# Handling exception while unioning with different ordered categories
try:
    result = union_categoricals([a, b])
except TypeError as e:
    print("\nError:", e)

# Ignoring order to union
result = union_categoricals([a, b], ignore_order=True)
print("\nIgnoring order to union:")
print(result)

When we run the above program, it produces the following result −

Input Series 1:
0    cat
1    dog
dtype: category
Categories (2, object): ['cat' < 'dog']

Input Series 2:
0      cat
1    mouse
dtype: category
Categories (2, object): ['cat' < 'mouse']

Error: to union ordered Categoricals, all categories must be the same

Ignoring order to union:
['cat', 'dog', 'cat', 'mouse']
Categories (3, object): ['cat', 'dog', 'mouse']
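
The same applies when two ordered categoricals share the same categories but declare them in a different order. The following is a brief sketch with hypothetical 'small' and 'large' categories. The raised flag is only an illustrative helper for recording whether the first call failed.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Same categories, but declared in opposite orders (hypothetical data)
a = pd.Categorical(["small", "large"], categories=["small", "large"], ordered=True)
b = pd.Categorical(["large", "small"], categories=["large", "small"], ordered=True)

# Without ignore_order, the conflicting orders cannot be reconciled
try:
    union_categoricals([a, b])
    raised = False
except TypeError as e:
    raised = True
    print("Error:", e)

# With ignore_order=True, the union succeeds and the result is unordered
result = union_categoricals([a, b], ignore_order=True)
print(result.ordered)
print(list(result.categories))
```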