Count Duplicate Lines in a Text File on Linux


Introduction

There are several reasons why you might want to count the number of duplicate lines in a text file on a Linux system. For example, you may want to find out whether there are errors in your data, or you may want to shrink a file by removing duplicates. Whatever the reason, Linux provides several tools and commands you can use to do this.

Preparation

Before we dive into the commands, let's first create a text file with a few duplicate lines that we can use for testing. Open a terminal and create a new file using the touch command −

$ touch test.txt

Next, open the file in your favorite text editor (nano, vim, etc.) and add the following lines −

Hello
World
Hello
Linux
Linux

Save and close the file but keep the terminal open.
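If you prefer to skip the editor step, the same file can be created in one shot with printf. This is a minimal sketch that simply writes the five sample lines shown above −

```shell
# Write the five sample lines directly, avoiding touch plus a manual edit
printf 'Hello\nWorld\nHello\nLinux\nLinux\n' > test.txt

# Confirm the contents
cat test.txt
```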

Method 1: Use the Uniq Command

The uniq command is a utility that filters out adjacent duplicate lines in a text file. It can be used to count duplicates by passing the “-c” flag, which prefixes each output line with the number of consecutive times it appeared in the input. Note that uniq compares only adjacent lines, so duplicates that are not next to each other are reported separately.

To count the number of duplicate lines in our test.txt file using uniq, we can use the following command −

$ uniq -c test.txt
   1 Hello
   1 World
   1 Hello
   2 Linux

As you can see, the two "Hello" lines are not adjacent, so uniq reports each of them separately with a count of 1, while the adjacent "Linux" lines are correctly counted as 2. To count non-adjacent duplicates correctly, the input must be sorted first, which is exactly what the next method does.
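To see the adjacency rule in isolation, consider a file whose duplicates already sit next to each other; uniq alone then counts them correctly. This is a small sketch, and adjacent.txt is just a throwaway file name used for the demonstration −

```shell
# Here the duplicate "Hello" lines are adjacent, so uniq -c counts them as 2
printf 'Hello\nHello\nLinux\n' > adjacent.txt
uniq -c adjacent.txt
```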

Method 2: Use the Sort and Uniq Commands Together

Another way to count the number of duplicate lines in a text file is to use the sort and uniq commands together. The sort command sorts the lines in a text file, while the uniq command filters out adjacent duplicate lines. To count the number of duplicate lines using these commands, we can first sort the lines in our “test.txt” file using the sort command −

$ sort test.txt
Hello
Hello
Linux
Linux
World

We can then use the uniq command with the “-c” flag to count the number of duplicate lines −

$ sort test.txt | uniq -c
   2 Hello
   2 Linux
   1 World

As you can see, the output shows that the "Hello" line appears twice, the "Linux" line appears twice, and the "World" line appears once.
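Building on the same pipeline, the count column can be filtered so that only the real duplicates are shown. This sketch uses awk to keep rows whose first field (the count produced by uniq “-c”) is greater than 1; the printf line simply recreates the sample file so the sketch runs on its own −

```shell
# Recreate the sample file so this example is self-contained
printf 'Hello\nWorld\nHello\nLinux\nLinux\n' > test.txt

# uniq -c puts the count in field 1; awk keeps only counts above 1
sort test.txt | uniq -c | awk '$1 > 1'
```

Only the "Hello" and "Linux" rows survive, each prefixed with its count of 2.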

Method 3: Use the Awk Command

The awk command is a powerful tool for processing text files. It can count duplicate lines using an associative array (named seen in the command below) that is indexed by the contents of each line: the first time a line is read, seen[$0] is zero, and every later occurrence of the same line increments a counter.

To count the number of duplicate lines using awk, we can use the following command −

$ awk '{ if (seen[$0]++) { count++; } } END { print count }' test.txt
2

As you can see, the output shows that there are 2 duplicate lines in the “test.txt” file.
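A small variation on the awk approach reports each duplicated line together with its total count instead of a single overall number. This is a sketch under the same assumptions; note that awk's for (line in count) loop visits keys in no guaranteed order, so the output is piped through sort for a stable display −

```shell
# Recreate the sample file so this example is self-contained
printf 'Hello\nWorld\nHello\nLinux\nLinux\n' > test.txt

# Tally every line, then print only the lines seen more than once
awk '{ count[$0]++ } END { for (line in count) if (count[line] > 1) print count[line], line }' test.txt | sort -k2
```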

Method 4: Use the Grep and wc Commands

Another way to count the number of duplicate lines in a text file is to use the grep and wc commands together. The grep command selects lines that match given patterns, while the wc command counts the number of lines, words, and bytes in its input. To extract the duplicate lines from our “test.txt” file, we can feed grep a pattern list built with sort and uniq, using process substitution −

$ grep -Fx -f <(sort test.txt | uniq -d) test.txt | sort -u
Hello
Linux

The command inside <(...) sorts the file and pipes it to uniq “-d”, which prints only the duplicated lines. The outer grep treats that list as fixed-string (-F), whole-line (-x) patterns and matches them against “test.txt”, and sort “-u” collapses the repeated matches so that each duplicate line is listed once.

We can then pipe the result to the wc command with the “-l” flag to count the number of lines −

$ grep -Fx -f <(sort test.txt | uniq -d) test.txt | sort -u | wc -l
2

As you can see, the output shows that there are 2 duplicate lines in the “test.txt” file.
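For comparison, uniq's “-d” flag reaches the same number with less machinery: it prints each duplicated line exactly once, so running wc “-l” on its output counts the distinct duplicate lines directly. A minimal sketch, with the sample file recreated so it runs on its own −

```shell
# Recreate the sample file so this example is self-contained
printf 'Hello\nWorld\nHello\nLinux\nLinux\n' > test.txt

# -d prints each repeated line once; wc -l counts those lines
sort test.txt | uniq -d | wc -l
```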

Conclusion

In this article, we learned how to count the number of duplicate lines in a text file on a Linux system using the uniq, sort, awk, and grep commands. Each of these methods has its advantages and limitations, and you can choose the one that best suits your needs. Whichever method you choose, remember that these commands are just a few of the many tools available for processing text files on Linux. There are many other commands and utilities you can use to manipulate and analyze text data, and learning how to use them effectively can greatly improve your productivity as a Linux user.

Updated on: 17-Jan-2023
