Find the most Similar Sentence in the file to the Input Sentence | NLP
Natural Language Processing (NLP) allows computers to interpret and analyze human language. Finding the most similar sentence to a given input is a common NLP task. Python provides several methods to accomplish this using libraries like NLTK and scikit-learn.
Installation Requirements
First, install the required libraries:
pip install nltk scikit-learn
Algorithm Overview
The sentence similarity algorithm follows these steps:
Step 1: Load sentences from a text file
Step 2: Preprocess both input sentence and file sentences
Step 3: Tokenize sentences into individual words
Step 4: Remove stop words to focus on meaningful content
Step 5: Apply lemmatization to normalize word forms
Step 6: Use TF-IDF vectorization to calculate similarity scores
Complete Implementation
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download required NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # required by newer NLTK releases
nltk.download('wordnet', quiet=True)

# Sample sentences for demonstration
sample_sentences = [
    "this is comedy movie.",
    "this is horror movie.",
    "hello I am a girl.",
    "hello I am a boy."
]

def preprocess_sentence(sentence):
    """Preprocess a sentence by tokenizing, removing stopwords, and lemmatizing."""
    # Tokenize and convert to lowercase
    tokens = word_tokenize(sentence.lower())
    # Remove stopwords and non-alphabetic tokens (punctuation, numbers)
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    # Lemmatize words to normalize their forms
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(tokens)

def find_most_similar_sentence(user_input, sentences):
    """Find the most similar sentence using TF-IDF vectorization."""
    # Preprocess the input and the candidate sentences
    preprocessed_input = preprocess_sentence(user_input)
    preprocessed_sentences = [preprocess_sentence(sentence) for sentence in sentences]
    # Build one TF-IDF matrix over the input plus all candidate sentences
    vectorizer = TfidfVectorizer()
    all_sentences = [preprocessed_input] + preprocessed_sentences
    tfidf_matrix = vectorizer.fit_transform(all_sentences)
    # Cosine similarity between the input (row 0) and each candidate row
    similarity_scores = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:]).flatten()
    # Find the index of the most similar sentence
    most_similar_index = similarity_scores.argmax()
    similarity_score = similarity_scores[most_similar_index]
    return sentences[most_similar_index], similarity_score

# Test the implementation
user_input = "hello I am a woman"
most_similar, score = find_most_similar_sentence(user_input, sample_sentences)
print(f"Input sentence: {user_input}")
print(f"Most similar sentence: {most_similar}")
print(f"Similarity score: {score:.4f}")
Output
Input sentence: hello I am a woman
Most similar sentence: hello I am a girl.
Similarity score: 0.7071
How It Works
TF-IDF Vectorization: Converts text into numerical vectors where each dimension represents a word's importance. Words that appear frequently in a document but rarely across all documents get higher weights.
Cosine Similarity: Measures the angle between two vectors. Values closer to 1 indicate higher similarity, while values closer to 0 indicate lower similarity.
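The formula itself is short enough to check by hand (an illustrative helper, not part of the implementation above):

```python
import numpy as np

# Cosine similarity = dot(a, b) / (||a|| * ||b||)
def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine([1, 0], [1, 0]))  # identical direction -> 1.0
print(cosine([1, 0], [0, 1]))  # orthogonal vectors -> 0.0
print(cosine([1, 1], [1, 0]))  # 45 degrees apart -> ~0.7071
```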
Preprocessing Benefits: Removing stop words and lemmatizing ensures that the algorithm focuses on meaningful content words rather than common articles and different word forms.
Working with File Input
To load sentences from a file, use this helper function:
def load_sentences_from_file(file_path):
    """Load sentences from a text file, one per line."""
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            sentences = [line.strip() for line in file if line.strip()]
        return sentences
    except FileNotFoundError:
        print(f"File {file_path} not found. Using sample sentences.")
        return sample_sentences
# Example usage
sentences = load_sentences_from_file('sentences.txt')
result, score = find_most_similar_sentence("I love comedy films", sentences)
print(f"Best match: {result} (Score: {score:.4f})")
Output
File sentences.txt not found. Using sample sentences.
Best match: this is comedy movie. (Score: 0.4472)
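To exercise the file-reading path end to end, one option is to write a small sentences file first and load it with the same line-by-line logic (a self-contained sketch; the temporary path is created by Python's tempfile module rather than a real sentences.txt):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sentences.txt')
    # Write one sentence per line, with a blank line that should be skipped
    with open(path, 'w', encoding='utf-8') as f:
        f.write("this is comedy movie.\n\nthis is horror movie.\n")

    # Same loading logic as load_sentences_from_file: strip and drop blanks
    with open(path, 'r', encoding='utf-8') as f:
        sentences = [line.strip() for line in f if line.strip()]

print(sentences)  # ['this is comedy movie.', 'this is horror movie.']
```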
Conclusion
This NLP approach combines preprocessing techniques with TF-IDF vectorization to find sentence similarities effectively. Because it scores sentences by their shared, normalized vocabulary, the method works well when related sentences use overlapping words, and it can be easily integrated into applications requiring text matching or recommendation features.
