Find the most Similar Sentence in the file to the Input Sentence | NLP

Natural Language Processing (NLP) allows computers to interpret and analyze human language. Finding the most similar sentence to a given input is a common NLP task. Python provides several methods to accomplish this using libraries like NLTK and scikit-learn.

Installation Requirements

First, install the required libraries:

pip install nltk scikit-learn

Algorithm Overview

The sentence similarity algorithm follows these steps:

Step 1: Load sentences from a text file

Step 2: Preprocess both input sentence and file sentences

Step 3: Tokenize sentences into individual words

Step 4: Remove stop words to focus on meaningful content

Step 5: Apply lemmatization to normalize word forms

Step 6: Use TF-IDF vectorization to calculate similarity scores

Complete Implementation

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Download required NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # required by newer NLTK releases
nltk.download('wordnet', quiet=True)

# Sample sentences for demonstration
sample_sentences = [
    "this is comedy movie.",
    "this is horror movie.", 
    "hello I am a girl.",
    "hello I am a boy."
]

def preprocess_sentence(sentence):
    """Preprocess a sentence by tokenizing, removing stopwords, and lemmatizing"""
    # Tokenize and convert to lowercase
    tokens = word_tokenize(sentence.lower())
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    
    # Lemmatize words
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    return ' '.join(tokens)

def find_most_similar_sentence(user_input, sentences):
    """Find the most similar sentence using TF-IDF vectorization"""
    # Preprocess input and sentences
    preprocessed_input = preprocess_sentence(user_input)
    preprocessed_sentences = [preprocess_sentence(sentence) for sentence in sentences]
    
    # Create TF-IDF vectorizer
    vectorizer = TfidfVectorizer()
    
    # Generate TF-IDF matrix for input + all sentences
    all_sentences = [preprocessed_input] + preprocessed_sentences
    tfidf_matrix = vectorizer.fit_transform(all_sentences)
    
    # Calculate similarity scores between input and each sentence
    # (TfidfVectorizer L2-normalizes rows, so the dot product is the cosine similarity)
    similarity_scores = (tfidf_matrix @ tfidf_matrix.T).toarray()[0][1:]
    
    # Find index of most similar sentence
    most_similar_index = similarity_scores.argmax()
    similarity_score = similarity_scores[most_similar_index]
    
    return sentences[most_similar_index], similarity_score

# Test the implementation
user_input = "hello I am a woman"
most_similar, score = find_most_similar_sentence(user_input, sample_sentences)

print(f"Input sentence: {user_input}")
print(f"Most similar sentence: {most_similar}")
print(f"Similarity score: {score:.4f}")
Output

Input sentence: hello I am a woman
Most similar sentence: hello I am a girl.
Similarity score: 0.7071

How It Works

TF-IDF Vectorization: Converts text into numerical vectors where each dimension represents a word's importance. Words that appear frequently in a document but rarely across all documents get higher weights.
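As a rough illustration, the raw TF-IDF weight of a term can be computed by hand as tf × log(N/df). Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row, so its exact numbers differ, but the intuition is the same:

```python
import math

# Tiny corpus: three pre-tokenized one-line "documents"
docs = [
    ["comedy", "movie"],
    ["horror", "movie"],
    ["hello", "girl"],
]

def tf_idf(term, doc, corpus):
    # Term frequency: proportion of the document made up of this term
    tf = doc.count(term) / len(doc)
    # Document frequency: number of documents containing the term
    df = sum(1 for d in corpus if term in d)
    # Inverse document frequency (classic formula; scikit-learn smooths it)
    idf = math.log(len(corpus) / df)
    return tf * idf

# "movie" appears in 2 of 3 documents -> low weight
print(round(tf_idf("movie", docs[0], docs), 4))   # 0.2027
# "comedy" appears in only 1 document -> higher weight
print(round(tf_idf("comedy", docs[0], docs), 4))  # 0.5493
```

The rare term "comedy" gets more than twice the weight of the common term "movie", which is exactly why TF-IDF favors distinctive content words when comparing sentences.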

Cosine Similarity: Measures the cosine of the angle between two vectors. Values close to 1 indicate the vectors point in nearly the same direction (high similarity), while values close to 0 indicate little or no shared content.
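The formula behind this is simply the dot product divided by the product of the vector magnitudes. A minimal sketch in plain Python (no libraries beyond the standard math module) shows the two extremes:

```python
import math

def cosine_similarity(a, b):
    # Dot product over the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors -> similarity 1.0 (up to float rounding)
print(cosine_similarity([1, 2], [2, 4]))
# Orthogonal vectors (no shared terms) -> similarity 0.0
print(cosine_similarity([1, 0], [0, 1]))
```

Because TfidfVectorizer L2-normalizes its rows, the dot product of two TF-IDF rows in the implementation above is already this cosine value.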

Preprocessing Benefits: Removing stop words and lemmatizing ensures that the algorithm focuses on meaningful content words rather than common articles and different word forms.
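To see the effect in isolation, here is a deliberately simplified sketch: the hard-coded STOP_WORDS set and the trailing-"s" stripper are illustrative stand-ins for NLTK's full English stop-word list and the WordNet lemmatizer used in the real implementation.

```python
# Illustrative only: a tiny stop-word subset and a naive plural stripper
# stand in for NLTK's stopwords corpus and WordNetLemmatizer.
STOP_WORDS = {"this", "is", "a", "the", "i", "am"}

def naive_preprocess(sentence):
    tokens = sentence.lower().replace(".", "").split()
    # Drop stop words so only content words remain
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude normalization: strip a trailing "s" (real lemmatization is smarter)
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(naive_preprocess("This is a comedy movie."))
# ['comedy', 'movie']
print(naive_preprocess("I love comedy movies."))
# ['love', 'comedy', 'movie']
```

After preprocessing, "movie" and "movies" map to the same token, so the two sentences now share two of three terms instead of appearing mostly different.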

Working with File Input

To load sentences from a file, use this helper function:

def load_sentences_from_file(file_path):
    """Load sentences from a text file"""
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            sentences = [line.strip() for line in file if line.strip()]
        return sentences
    except FileNotFoundError:
        print(f"File {file_path} not found. Using sample sentences.")
        return sample_sentences

# Example usage
sentences = load_sentences_from_file('sentences.txt')
result, score = find_most_similar_sentence("I love comedy films", sentences)
print(f"Best match: {result} (Score: {score:.4f})")
Output

File sentences.txt not found. Using sample sentences.
Best match: this is comedy movie. (Score: 0.4472)
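To try the file path end to end without creating a file by hand, you can write a temporary sentences file and load it back. The helper is reproduced inline here (minus the fallback branch) so the snippet runs on its own:

```python
import os
import tempfile

def load_sentences_from_file(file_path):
    """Load non-empty lines from a text file as sentences."""
    with open(file_path, "r", encoding="utf-8") as file:
        return [line.strip() for line in file if line.strip()]

# Write a small sentences.txt-style file to a temporary location
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("this is comedy movie.\n\nhello I am a girl.\n")
    path = tmp.name

print(load_sentences_from_file(path))
# ['this is comedy movie.', 'hello I am a girl.']
os.remove(path)
```

Blank lines are skipped, so a file with one sentence per line and occasional empty lines loads cleanly.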

Conclusion

This NLP approach combines preprocessing techniques with TF-IDF vectorization to find sentence similarities effectively. The method works well for comparing semantic content and can be easily integrated into applications requiring text matching or recommendation systems.

Updated on: 2026-03-27T11:59:49+05:30
