Is There a Way to ‘uniq’ by Column on Linux?


Introduction

On Linux, the "uniq" command removes adjacent duplicate lines, which is why its input is usually sorted first. Sometimes, however, you need to remove duplicates based on a specific column rather than the entire row. This is particularly useful when working with column-based input files, such as CSV files. In this article, we'll explore several ways to perform this per-column "uniq'ing" on Linux.
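The examples below operate on a small price.csv file. Its exact contents are a hypothetical reconstruction for illustration; any CSV with a header line and repeated values in the first column will behave the same way −

```shell
# Create a sample price.csv: a header line plus product rows, with
# "Monitor" and "Wireless Mouse" duplicated in the first column
# (hypothetical data for illustration).
cat > price.csv <<'EOF'
Product,Price,Update
Monitor,218,2019-01-01
Keybord,20,2019-02-02
Monitor,229,2019-02-02
Wireless Mouse,25,2019-02-02
Wireless Mouse,30,2019-03-01
EOF

# Sanity check: the file has 6 lines in total.
wc -l < price.csv
```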

Method 1: Use sort command

The sort command is a simple and effective way to sort rows by a specific field and remove duplicates from the sorted result. For duplicate keys, only one instance is kept (with GNU sort, the first one that appears in the input). In this section, we will explore how to remove duplicate entries in a specific column using the sort command.

  • Using the tail command to remove the header line −

$ tail -n+2 price.csv | sort -u -t, -k1,1                        
Keybord,20,2019-02-02
Monitor,218,2019-01-01
Wireless Mouse,25,2019-02-02

The previous command uses the tail command to skip the first line of the file, which is usually a header line, and pipes the rest to the sort command. The "-t" option sets the field separator to a comma, "-k1,1" sorts by the first column only, and "-u" removes duplicates. Because "-u" compares only the sort key, two lines count as duplicates whenever their first fields match, regardless of the other columns.

  • Keeping the header line in the output −

$ head -n1 price.csv && tail -n+2 price.csv | sort -u -t, -k1,1 
Product,Price,Update
Keybord,20,2019-02-02
Monitor,218,2019-01-01
Wireless Mouse,25,2019-02-02

This command is similar to the previous one, but first uses the head command to print the header line; since head exits successfully, the command after "&&" then runs, stripping the header with tail and sorting and deduplicating the rest of the file as before.

Method 2: Use awk command

Another way to perform the "uniq" operation per column is to use the awk command. Awk is a powerful pattern-scanning and text-processing language that can select and manipulate specific columns in a file, and unlike uniq it does not require sorted input. To "uniq" by the first column, we can use the following command −

$ awk -F, 'NR==1 || !a[$1]++' price.csv 
Product,Price,Update
Monitor,218,2019-01-01
Keybord,20,2019-02-02
Wireless Mouse,25,2019-02-02

The above command sets the field separator to a comma with the "-F" option. The expression "NR==1" matches the header line, so it is always printed. The expression "!a[$1]++" is true only the first time a given first-column value is seen: the post-increment records a count for each value in the associative array "a", so later duplicates evaluate to false and are skipped. Note that the original line order is preserved and no prior sorting is needed.
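The same idiom works for any column. As a quick sketch with hypothetical data, deduplicating by the third column (the date) instead of the first only requires changing "$1" to "$3" −

```shell
# Hypothetical data for illustration: two rows share the date
# 2019-02-02 in the third column.
printf '%s\n' \
  'Product,Price,Update' \
  'Monitor,218,2019-01-01' \
  'Keybord,20,2019-02-02' \
  'Wireless Mouse,25,2019-02-02' > dates.csv

# Keep the header and the first row seen for each distinct date;
# the "Wireless Mouse" row is dropped as a duplicate date.
awk -F, 'NR==1 || !seen[$3]++' dates.csv
```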

Method 3: Use uniq command with the -f option

A third way to "uniq" per column on Linux is to use the uniq command with the “-f” option. The “-f N” option tells uniq to skip the first N fields before comparing lines. Note that uniq always treats runs of blanks (spaces or tabs) as the field separator, so this method does not work directly on comma-separated files. To ignore the first field and compare on the remaining ones, we can use the following command −

$ sort -k2 file.txt | uniq -f 1

This command sorts "file.txt" by the second and subsequent fields, so that lines with matching remainders become adjacent, and then uses uniq with “-f 1” to skip the first field (column) before comparing lines. This way, uniq considers only the second and subsequent fields when removing duplicates, effectively "uniqing" by column.
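A short sketch with hypothetical whitespace-separated data makes this concrete: the first field differs between two lines, but the remaining fields match, so "uniq -f 1" collapses them −

```shell
# Hypothetical whitespace-separated data: two different hosts report
# the same status and date in fields 2 and 3.
printf '%s\n' \
  'host1 OK 2019-01-01' \
  'host2 OK 2019-01-01' \
  'host3 DOWN 2019-01-02' > status.txt

# Sort by fields 2 onward so equal remainders are adjacent, then
# skip field 1 when comparing; host2's line is collapsed into host1's.
sort -k2 status.txt | uniq -f 1
```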

Conclusion

In this article, we have discussed three different methods for "uniq'ing" per column on Linux. The first method uses the sort command, the second uses the awk command, and the third uses the uniq command with the “-f” option. Each method can select and remove duplicate values from a specific column in a file, and the choice between them depends on your requirements and preferences. Note that the sort-based and uniq-based methods require sorted input, while the awk method works on unsorted files and preserves the original line order. Also note that the sort and awk examples here use a comma as the field separator; if your file uses a different separator, adjust the “-t” and “-F” options accordingly, and remember that uniq's “-f” option only understands whitespace-separated fields.

Updated on: 13-Feb-2023
