Python Program to Find the Number of Unique Words in Text File


This article shows how to find the number of unique words in a text file using Python. Two examples are given. In the first example, the words are read from the text file and converted to a set, which keeps only the unique words, and the size of that set is the answer. In example 2, the list of words is created and sorted, the duplicates are removed from the sorted list, and the unique words left are counted to give the final result.

Algorithm for Preprocessing

Step 1 − Login using Google account. Go to Google Colab. Open a new Colab Notebook and write the Python code in it.

Step 2 − First upload the txt file "file1.txt" to Google Colab.

Step 3 − Open the txt file for reading.

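Step 4 − Read the file content and convert it to lowercase, so that words such as "This" and "this" are treated as the same word.

Step 5 − To separate the words given in the txt file, use the split function. Called without arguments, split() separates the text on whitespace, so punctuation stays attached to the words (for example, 'file.').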

Step 6 − Print the list called ‘words_in_file’ having the words from the text file.

Text File Used for these examples

The content in the file1.txt is given here…

This is a new file.
This is made for testing purposes only.
There are four lines in this file.
There are four lines in this file.
There are four lines in this file.
There are four lines in this file.
Oh! No.. there are seven lines now.

Upload file1.txt to Colab

Figure: Uploading the file1.txt in Google Colab
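
If uploading is not convenient, the same file can also be created from code. The following is a minimal sketch, assuming the notebook is allowed to write to its working directory; the variable name sample_text is only illustrative and is not part of the original examples. It writes the content shown above to file1.txt and then runs the preprocessing steps on it.

```python
# Content of file1.txt as listed in this article
sample_text = (
    "This is a new file.\n"
    "This is made for testing purposes only.\n"
    "There are four lines in this file.\n"
    "There are four lines in this file.\n"
    "There are four lines in this file.\n"
    "There are four lines in this file.\n"
    "Oh! No.. there are seven lines now.\n"
)

# Create the file in the current working directory (assumed to be writable)
with open("file1.txt", "w", encoding="utf-8") as f:
    f.write(sample_text)

# Preprocessing: read, convert to lowercase, split on whitespace
with open("file1.txt", "r", encoding="utf-8") as f:
    words_in_file = f.read().lower().split()

print(len(words_in_file))  # total number of words, duplicates included: 47
```

The list words_in_file produced here is the same starting point that both approaches below use.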

Approach 1:- Finding the Number of Unique Words in Text File by using Python Sets

After the Preprocessing steps, the following steps are used for Approach 1

Step 1 − Start with the list ‘words_in_file’, after preprocessing steps.

Step 2 − Convert this list into a set. Here, the set will contain the unique words only.

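Step 3 − Show the set with all the unique words using the print statement. A set is unordered, so the order in which the words are printed can differ from one run to another.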

Step 4 − Find the set length.

Step 5 − Print the set length.

Step 6 − This gives the number of unique words in the given text file.

Example

# Open the text file and read its content, converted to lowercase.
# The with statement closes the file automatically.
with open("file1.txt", 'r') as file:
    thegiventxtfile = file.read().lower()

# Split the content into a list of words
words_in_file = thegiventxtfile.split()

print("The given txt file content is :\n")
print(thegiventxtfile)
print("\nThe words given in the txt file are :\n")
print(words_in_file)
print("\nThe unique words given in this txt file are :\n")

# Convert the list to a Python set, which keeps each word only once
uniqueWords = set(words_in_file)

print(uniqueWords)

# Find the number of words in this set
numberofuniquewords = len(uniqueWords)

print("\nThe number of unique words given in this txt file are :\n")
print(numberofuniquewords)

Output

The given txt file content is :

this is a new file.
this is made for testing purposes only.
there are four lines in this file.
there are four lines in this file.
there are four lines in this file.
there are four lines in this file.
oh! no.. there are seven lines now.


The words given in the txt file are :

['this', 'is', 'a', 'new', 'file.', 'this', 'is', 'made', 'for', 'testing', 'purposes', 'only.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'oh!', 'no..', 'there', 'are', 'seven', 'lines', 'now.']

The unique words given in this txt file are :

{'there', 'only.', 'testing', 'new', 'is', 'for', 'oh!', 'this', 'a', 'made', 'seven', 'are', 'purposes', 'in', 'file.', 'four', 'now.', 'no..', 'lines'}

The number of unique words given in this txt file are :

19
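
In the output above, punctuation stays attached to the words, so 'file.' and a bare 'file' would be counted as two different words. One possible refinement, which is not part of the original example, is to strip leading and trailing punctuation from each word before building the set. The sketch below assumes that this is the desired behavior; the function name unique_words_without_punctuation is hypothetical.

```python
import string

def unique_words_without_punctuation(text):
    """Return the set of lowercase words with leading/trailing punctuation removed."""
    stripped = (word.strip(string.punctuation) for word in text.lower().split())
    # Drop tokens that consisted only of punctuation and became empty
    return {word for word in stripped if word}

sample = "This is a new file. Is this file new?"
unique = unique_words_without_punctuation(sample)
print(sorted(unique))
print(len(unique))  # 5
```

Here 'file.' and 'file' are counted as one word, as are 'new' and 'new?'. Punctuation inside a word, such as an apostrophe, is left untouched by strip().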

Approach 2:- Finding the Number of Unique Words in Text File by using Python Dictionary

Step 1 − Open the required file and carry out the preprocessing steps to get the list ‘words_in_file’.

Step 2 − Sort this list and print it. The alphabetically sorted list still contains the repeated words, now placed next to each other.

Step 3 − To get rid of the duplicate words and retain only the unique ones, use dict.fromkeys(words_in_file). Dictionary keys are unique, so each word is kept only once.

Step 4 − Convert the result back to a list.

Step 5 − Print the list containing the unique words.

Step 6 − Calculate the length of the final list and display its value. This gives the number of unique words in the given text file.

Example

# Open the text file and read its content, converted to lowercase.
# The with statement closes the file automatically.
with open("file1.txt", 'r') as file:
    thegiventxtfile = file.read().lower()

# Split the content into a list of words
words_in_file = thegiventxtfile.split()

print("The given txt file content is :\n")
print(thegiventxtfile)
print("\nThe words given in the txt file are :\n")
print(words_in_file)
print("\nThe sorted words list from this txt file is :\n")

# Sort the list of words alphabetically
words_in_file.sort()

print(words_in_file)
print("\nThe sorted words list after removing duplicates from this txt file is :\n")

# Remove the duplicates: dictionary keys are unique and keep their insertion order
myuniquewordlist = list(dict.fromkeys(words_in_file))

# Count the unique words that are left
numberofuniquewords = len(myuniquewordlist)

print(myuniquewordlist)
print("\nThe number of unique words given in this txt file are :\n")
print(numberofuniquewords)

Output

The given txt file content is :

this is a new file.
this is made for testing purposes only.
there are four lines in this file.
there are four lines in this file.
there are four lines in this file.
there are four lines in this file.
oh! no.. there are seven lines now.


The words given in the txt file are :

['this', 'is', 'a', 'new', 'file.', 'this', 'is', 'made', 'for', 'testing', 'purposes', 'only.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'there', 'are', 'four', 'lines', 'in', 'this', 'file.', 'oh!', 'no..', 'there', 'are', 'seven', 'lines', 'now.']

The sorted words list from this txt file is :

['a', 'are', 'are', 'are', 'are', 'are', 'file.', 'file.', 'file.', 'file.', 'file.', 'for', 'four', 'four', 'four', 'four', 'in', 'in', 'in', 'in', 'is', 'is', 'lines', 'lines', 'lines', 'lines', 'lines', 'made', 'new', 'no..', 'now.', 'oh!', 'only.', 'purposes', 'seven', 'testing', 'there', 'there', 'there', 'there', 'there', 'this', 'this', 'this', 'this', 'this', 'this']

The sorted words list after removing duplicates from this txt file is :

['a', 'are', 'file.', 'for', 'four', 'in', 'is', 'lines', 'made', 'new', 'no..', 'now.', 'oh!', 'only.', 'purposes', 'seven', 'testing', 'there', 'this']

The number of unique words given in this txt file are :

19
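
The sorted list above shows the repeated words side by side, but the number of repetitions still has to be counted by eye. As an alternative that is not used in the original example, collections.Counter from the standard library counts the occurrences of each word directly. The sketch below is illustrative only; the function name count_words is hypothetical, and a short inline string is used instead of file1.txt.

```python
from collections import Counter

def count_words(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    return Counter(text.lower().split())

counts = count_words(
    "There are four lines in this file.\n"
    "There are four lines in this file."
)

print(len(counts))       # number of unique words: 7
print(counts["there"])   # occurrences of 'there': 2
```

Because a Counter is a dictionary, its length is the number of unique words, so it answers the same question as the two approaches while also keeping the per-word counts.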

Conclusion

This article showed two different approaches to finding the unique words in a given txt file. First, the txt file is uploaded to the Colab notebook. The file is then opened for reading, and its content is split into words that are stored as a list. This word list is used in both examples.

In example 1, the concept of a Python set is used. The list may contain duplicate words. When this list is converted to a set, only the unique words are left, and the len() function gives their count. In example 2, the word list obtained from the txt file is first sorted, which places the duplicate words next to each other so that they are easy to see. This sorted list is then passed to dict.fromkeys(words_in_file) to remove the duplicate words, and the length of the resulting list gives the count of unique words.
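
As a compact recap, the two approaches can be wrapped in small functions and checked against each other. This is only a sketch: the function names are hypothetical, and a short inline string stands in for the content of file1.txt. Sorting is left out here because it affects only the order of the words, not how many unique words there are.

```python
def count_unique_with_set(words):
    """Approach 1: a set keeps each word only once."""
    return len(set(words))

def count_unique_with_dict(words):
    """Approach 2: dictionary keys are unique (and keep first-seen order)."""
    return len(dict.fromkeys(words))

text = "This is a new file. This is made for testing purposes only."
words = text.lower().split()

# Both approaches must agree on the number of unique words
assert count_unique_with_set(words) == count_unique_with_dict(words)
print(count_unique_with_set(words))  # 10
```

The set version is the shorter of the two; the dictionary version is useful when the order in which the words first appear should be preserved.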

Updated on: 10-Jul-2023
