Python – Remove Columns of Duplicate Elements
When working with lists of lists, you may need to remove columns that contain duplicate elements within each row. Python provides an elegant solution using sets to track duplicates and list comprehension to filter columns.
Understanding the Problem
Given a list of lists, we want to remove every column (index position) at which some row contains an element that already appeared earlier in that same row.
Solution Using Set-Based Duplicate Detection
The approach uses a helper function that identifies duplicate positions within each row, then filters out those columns:
from itertools import chain

def find_duplicate_positions(row):
    seen = set()
    for i, elem in enumerate(row):
        if elem not in seen:
            seen.add(elem)
        else:
            yield i

# Sample data - list of lists
data = [[5, 1, 6, 7, 9], [6, 3, 1, 9, 1], [4, 2, 9, 8, 9], [5, 1, 6, 7, 3]]
print("Original list:")
print(data)

# Find all duplicate positions across all rows
duplicate_positions = set(chain.from_iterable(find_duplicate_positions(row) for row in data))

# Remove columns at duplicate positions
result = [[elem for i, elem in enumerate(row) if i not in duplicate_positions] for row in data]
print("After removing duplicate columns:")
print(result)
Original list:
[[5, 1, 6, 7, 9], [6, 3, 1, 9, 1], [4, 2, 9, 8, 9], [5, 1, 6, 7, 3]]
After removing duplicate columns:
[[5, 1, 6, 7], [6, 3, 1, 9], [4, 2, 9, 8], [5, 1, 6, 7]]
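The helper can also be exercised on a single row in isolation — a quick sketch, redefining the generator from the listing above:

```python
def find_duplicate_positions(row):
    # Yield the index of each element that repeats an earlier element of the row
    seen = set()
    for i, elem in enumerate(row):
        if elem not in seen:
            seen.add(elem)
        else:
            yield i

# The second sample row, [6, 3, 1, 9, 1], repeats the value 1 at index 4
print(list(find_duplicate_positions([6, 3, 1, 9, 1])))  # [4]
```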
How It Works
The algorithm works in three main steps:
- Duplicate Detection: for each row, the function tracks seen elements in a set and yields the positions where duplicates occur
- Position Aggregation: the duplicate positions from all rows are collected into a single set using chain.from_iterable()
- Column Filtering: a list comprehension builds new rows, excluding elements at the duplicate positions
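The three steps above can be cross-checked with a slicing-based sketch that tests each column directly (an alternative formulation, not the article's method, assuming all rows have equal length):

```python
data = [[5, 1, 6, 7, 9], [6, 3, 1, 9, 1], [4, 2, 9, 8, 9], [5, 1, 6, 7, 3]]

# A position is a duplicate position if, in some row, the element there
# already appeared earlier in that same row
def is_duplicate_position(i):
    return any(row[i] in row[:i] for row in data)

keep = [i for i in range(len(data[0])) if not is_duplicate_position(i)]
result = [[row[i] for i in keep] for row in data]
print(result)  # [[5, 1, 6, 7], [6, 3, 1, 9], [4, 2, 9, 8], [5, 1, 6, 7]]
```

This variant trades the O(1) set membership test for an O(n) slice scan per element, so it suits short rows where readability matters more than speed.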
Alternative Approach Using Pandas
For tabular data, pandas offers a DataFrame-based solution:
import pandas as pd

# Convert to DataFrame
data = [[5, 1, 6, 7, 9], [6, 3, 1, 9, 1], [4, 2, 9, 8, 9], [5, 1, 6, 7, 3]]
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Collect the columns where any row holds a repeated value
duplicate_cols = set()
for idx, row in df.iterrows():
    duplicate_cols.update(row[row.duplicated()].index)

# Keep only the remaining columns
clean_df = df.drop(columns=duplicate_cols)
print("\nAfter removing duplicate columns:")
print(clean_df)
Original DataFrame:
   0  1  2  3  4
0  5  1  6  7  9
1  6  3  1  9  1
2  4  2  9  8  9
3  5  1  6  7  3

After removing duplicate columns:
   0  1  2  3
0  5  1  6  7
1  6  3  1  9
2  4  2  9  8
3  5  1  6  7
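The iterrows() loop can also be replaced with a loop-free sketch that applies Series.duplicated along each row and masks the affected columns (a variant of the approach above, assuming pandas is installed):

```python
import pandas as pd

data = [[5, 1, 6, 7, 9], [6, 3, 1, 9, 1], [4, 2, 9, 8, 9], [5, 1, 6, 7, 3]]
df = pd.DataFrame(data)

# Per row, mark cells that repeat an earlier value in that row,
# then drop every column that contains such a cell
dup_mask = df.apply(pd.Series.duplicated, axis=1).any(axis=0)
clean_df = df.loc[:, ~dup_mask]
print(clean_df)
```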
Conclusion
Use set-based tracking with a list comprehension for simple duplicate-column removal on plain lists of lists. For tabular data, pandas expresses the same operation through DataFrame methods such as duplicated() and drop().
