Find the Most Similar Sentence in a File to the Input Sentence | NLP


Natural Language Processing (NLP) allows computers to interpret and analyze human language. Finding the sentence most similar to a given input sentence is a common NLP problem, and Python offers several ways to solve it.

Required Resources

To follow along, install the nltk library. The example program also imports TfidfVectorizer from scikit-learn, so install that package as well. Run the following command in your terminal or command prompt.

pip install nltk scikit-learn

If the command fails, first check that Python and pip are installed and available on your PATH (for example, in Windows cmd), then run the installation again.

python --version
pip --version
pip install nltk scikit-learn

Once the library is successfully installed we can import it inside our code and use various modules from nltk to write a sentence finder program.
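Before moving on, it can help to confirm that the packages are visible to the interpreter you will actually run. The following is a minimal sketch using only the standard library; the helper name find_missing_packages is hypothetical and introduced here for illustration. Note that scikit-learn is imported under the name sklearn.

```python
import importlib.util

# Packages the example program imports; "sklearn" is the import name of scikit-learn.
REQUIRED_PACKAGES = ("nltk", "sklearn")


def find_missing_packages(packages=REQUIRED_PACKAGES):
    """Return the names of packages that cannot be found in this environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If a package is reported as missing, rerun the pip command with the same Python installation that runs this script.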

Example

We will create a Python program that takes an input sentence and finds the most similar sentence in a file. Let's explore how to do this using the Python NLTK package for preprocessing and the TfidfVectorizer class from scikit-learn for the TF-IDF (Term Frequency-Inverse Document Frequency) weighting.

Algorithm

Step 1: Install and import NLTK. You can use any of the methods explained above.

Step 2: Write code to load the sentences from the file. Read the lines and strip leading and trailing whitespace from each one to produce a list of sentences.

Step 3: Preprocess the input sentence and the sentences from the file in the same way.

Step 4: Perform tokenization to break each sentence into words.

Step 5: Remove stop words from the sentences so that only the meaningful words are compared.

Step 6: Assign each word a TF-IDF weight and compare the weighted vector of the input sentence with that of each sentence in the file. The sentence with the highest similarity score is the most similar one.
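To make the weighting and comparison in Step 6 concrete, here is a minimal pure-Python sketch of Steps 2 to 6. It is an illustration under stated assumptions, not the program used in this article. The stop-word list is a tiny hand-written stand-in for NLTK's list, tokenization is a simple regular expression, lemmatization is skipped, and the smoothed IDF formula is the one scikit-learn documents as its default. All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; NLTK's English list is much longer).
STOP_WORDS = {"a", "am", "an", "i", "is", "the", "this"}


def tokenize(sentence):
    """Lowercase the sentence, split it into words, and drop stop words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [word for word in words if word not in STOP_WORDS]


def tfidf_vectors(token_lists):
    """Return one {term: weight} dict per document, L2-normalised."""
    n_docs = len(token_lists)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Weight = term count * smoothed IDF, where IDF = ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def cosine(vec_a, vec_b):
    """Dot product of two normalised sparse vectors, i.e. their cosine similarity."""
    return sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())


def most_similar(query, sentences):
    """Return (best_sentence, score) for the query against the candidate sentences."""
    vectors = tfidf_vectors([tokenize(query)] + [tokenize(s) for s in sentences])
    query_vec, sentence_vecs = vectors[0], vectors[1:]
    scores = [cosine(query_vec, vec) for vec in sentence_vecs]
    # max() keeps the first index when several scores are tied.
    best = max(range(len(sentences)), key=scores.__getitem__)
    return sentences[best], scores[best]


if __name__ == "__main__":
    candidates = ["this is comedy movie.", "this is horror movie.",
                  "hello I am a girl.", "hello I am a boy."]
    print(most_similar("the movie is a comedy", candidates))
```

Because every vector is normalised to unit length, the dot product of two vectors equals their cosine similarity, which is why a plain matrix product is enough to score the sentences.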

Example

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Download NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that newer NLTK releases look for
nltk.download('wordnet')

# Load the file containing sentences
def load_sentences(file_path):
   with open(file_path, 'r', encoding='utf-8') as file:
      # Strip surrounding whitespace and skip blank lines
      return [line.strip() for line in file if line.strip()]

# Preprocess the input sentence
def preprocess_sentence(sentence):
   # Tokenize
   tokens = word_tokenize(sentence.lower())
   
   # Remove stopwords
   stop_words = set(stopwords.words('english'))
   tokens = [token for token in tokens if token not in stop_words]
   
   # Lemmatize
   lemmatizer = WordNetLemmatizer()
   tokens = [lemmatizer.lemmatize(token) for token in tokens]
   
   return ' '.join(tokens)

# Get the most similar sentence
def get_most_similar_sentence(user_input, sentences):
   # Preprocess input sentence
   preprocessed_user_input = preprocess_sentence(user_input)
   
   # Preprocess sentences
   preprocessed_sentences = [preprocess_sentence(sentence) for sentence in sentences]
   
   # Create TF-IDF vectorizer
   vectorizer = TfidfVectorizer()
   
   # Generate TF-IDF matrix
   tfidf_matrix = vectorizer.fit_transform([preprocessed_user_input] + preprocessed_sentences)
   
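   # Calculate similarity scores between the input (row 0) and each file sentence.
   # TF-IDF rows are L2-normalised, so the dot product is the cosine similarity.
   similarity_scores = (tfidf_matrix[0] @ tfidf_matrix[1:].T).toarray()[0]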
   
   # Find the index of the most similar sentence
   most_similar_index = similarity_scores.argmax()
   most_similar_sentence = sentences[most_similar_index]
   
   return most_similar_sentence

# Main program
def main():
   file_path = 'sentences.txt'  # Path to the file containing sentences
   sentences = load_sentences(file_path)
   
   user_input = 'hello I am a woman'
   
   most_similar_sentence = get_most_similar_sentence(user_input, sentences)
   print('Most similar sentence:', most_similar_sentence)

if __name__ == '__main__':
   main()
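The program above returns only the single best match, and when several sentences score equally, argmax picks the first one. If you want to see the scores or more than one candidate, a small ranking helper can be added. This is a minimal sketch; rank_sentences and its top_k parameter are hypothetical names, not part of NLTK or scikit-learn, and the demo scores are made up for illustration.

```python
def rank_sentences(sentences, scores, top_k=3):
    """Pair each sentence with its score and return the top_k pairs, best first."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # sorted() is stable, so tied sentences keep their original file order.
    ranked = sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    # Hypothetical scores for illustration only, not output from the program above.
    demo_sentences = ["this is comedy movie.", "hello I am a girl.", "hello I am a boy."]
    demo_scores = [0.0, 0.4, 0.4]
    for sentence, score in rank_sentences(demo_sentences, demo_scores, top_k=2):
        print(f"{score:.2f}  {sentence}")
```

Inside get_most_similar_sentence, such a helper could be called with sentences and similarity_scores in place of the argmax lookup.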

Text File Content: sentences.txt

this is comedy movie.

this is horror movie.

hello I am a girl.

hello I am a boy.

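Output

With the sentences.txt shown above, the program prints:

Most similar sentence: hello I am a girl.

The input shares only the word "hello" with "hello I am a girl." and "hello I am a boy.", so both receive the same score, and argmax returns the first of the two.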

Conclusion

We have used the NLTK library and common NLP techniques to find the sentence in a file that is most similar to a given input sentence. Preprocessing with tokenization, stop word removal, and lemmatization, followed by TF-IDF weighting, lets us compare sentences and pick the closest match.

You can use this approach in any application that needs a sentence similarity check, for example to match text entered by a user against a set of known sentences.

Updated on: 10-Aug-2023
