Count Duplicate Lines in a Text File on Linux
There are several reasons why you might want to count duplicate lines in a text file on Linux. You may need to identify data inconsistencies, optimize files by removing duplicates, or analyze log files for repeated entries. Linux provides multiple powerful command-line tools to accomplish this task efficiently.
Preparation
Let's create a sample text file to demonstrate the different methods. Open a terminal and create a test file −
$ touch test.txt
Add the following content to the file using your preferred text editor −
Hello
World
Hello
Linux
Linux
Method 1: Using uniq Command
The uniq command filters out duplicate adjacent lines and can count occurrences with the -c flag. However, uniq only works on adjacent duplicates, so the input must be sorted first for accurate results.
$ sort test.txt | uniq -c
2 Hello
2 Linux
1 World
The output shows each unique line prefixed with its occurrence count. Lines appearing only once are not duplicates.
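To see why sorting matters, compare the results with and without it. This sketch assumes the sample file from the Preparation step; without sort, uniq only merges runs of adjacent identical lines, so the two separated "Hello" lines are counted independently.

```shell
# Recreate the sample file from the Preparation step
printf 'Hello\nWorld\nHello\nLinux\nLinux\n' > test.txt

# Without sorting, uniq only collapses adjacent duplicates:
# the two "Hello" lines are separated by "World", so each counts once.
uniq -c test.txt
#   1 Hello
#   1 World
#   1 Hello
#   2 Linux

# Sorting first makes identical lines adjacent, giving accurate totals.
sort test.txt | uniq -c
#   2 Hello
#   2 Linux
#   1 World
```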
Method 2: Using awk Command
The awk command provides a more flexible approach, using associative arrays to track line occurrences −
$ awk '{count[$0]++} END {for (line in count) if (count[line] > 1) print count[line], line}' test.txt
2 Hello
2 Linux
To count only the total number of duplicate lines −
$ awk '{seen[$0]++} END {duplicates=0; for (line in seen) if (seen[line] > 1) duplicates += seen[line]-1; print duplicates}' test.txt
2
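The same associative-array idea extends beyond whole lines: by keying the array on a field instead of $0, you can count duplicate values in a single column. The file users.txt below is a hypothetical example, not part of the tutorial's sample data.

```shell
# Hypothetical log-like file with a username in the first field
printf 'alice login\nbob login\nalice logout\n' > users.txt

# Key the count array on field 1 to find usernames appearing more than once
awk '{count[$1]++} END {for (k in count) if (count[k] > 1) print count[k], k}' users.txt
# 2 alice
```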
Method 3: Using sort, uniq, and wc Commands
To count only the lines that appear more than once, combine multiple commands −
$ sort test.txt | uniq -d | wc -l
2
The uniq -d flag displays only duplicate lines (one copy of each), and wc -l counts them.
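If you want every repeated occurrence rather than one copy per duplicate, GNU uniq also provides a -D flag (a GNU extension, not available in all uniq implementations). A quick sketch, assuming the sample file from the Preparation step:

```shell
printf 'Hello\nWorld\nHello\nLinux\nLinux\n' > test.txt

# -d prints one copy of each duplicated line
sort test.txt | uniq -d
# Hello
# Linux

# -D (GNU extension) prints every copy of each duplicated line
sort test.txt | uniq -D
# Hello
# Hello
# Linux
# Linux
```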
Method 4: Advanced awk for Detailed Analysis
For more detailed duplicate analysis, use this awk command −
$ awk '{count[$0]++} END {
    total_duplicates = 0
    unique_duplicate_lines = 0
    for (line in count) {
        if (count[line] > 1) {
            unique_duplicate_lines++
            total_duplicates += count[line] - 1
            print "\"" line "\" appears " count[line] " times"
        }
    }
    print "Total duplicate occurrences: " total_duplicates
    print "Unique lines with duplicates: " unique_duplicate_lines
}' test.txt
"Hello" appears 2 times
"Linux" appears 2 times
Total duplicate occurrences: 2
Unique lines with duplicates: 2
Comparison of Methods
| Method | Advantages | Best Use Case |
|---|---|---|
| `sort \| uniq -c` | Simple, shows all line counts | Quick overview of all line frequencies |
| `awk` | Flexible, programmable | Complex duplicate analysis |
| `sort \| uniq -d \| wc -l` | Returns a single count | Just need the number of distinct duplicated lines |
Conclusion
Linux offers multiple approaches to count duplicate lines in text files, each suited for different scenarios. The sort | uniq -c combination provides a quick overview, while awk offers maximum flexibility for complex analysis. Choose the method that best fits your specific duplicate counting needs.
