SMS Spam Detection using TensorFlow in Python
Text messaging is part of daily life, and so is SMS spam. Unsolicited messages are a nuisance and can pose risks to privacy and security. Machine learning offers an effective way to filter them, and TensorFlow, a widely adopted open-source library for deep learning, provides a framework for building such models. In this article, we use TensorFlow with Python to build a small SMS spam classifier and walk through each step, from preparing the data to classifying new messages.
Understanding SMS Spam Detection
SMS spam detection means building a model that automatically classifies incoming text messages as spam or legitimate (commonly called "ham"). To do this, we need a dataset containing a sizable number of SMS messages, each labeled as spam or ham. This dataset is the basis for training our TensorFlow model.
Building the SMS Spam Detection Model
Step 1: Preparing the Dataset
The first step is finding a good dataset to train our model. The publicly available UCI SMS Spam Collection is a popular choice for SMS spam detection. To keep the example self-contained, let's create a small sample dataset with the same label-and-text structure to demonstrate the concept:
import pandas as pd
# Create sample dataset for demonstration
data = {
    'label': ['ham', 'spam', 'ham', 'spam', 'ham'],
    'text': [
        'How are you doing today?',
        'URGENT! You have won $1000! Click here now!',
        'Can you pick me up at 5?',
        'Free money! Call now to claim your prize!',
        'See you at the meeting tomorrow'
    ]
}
df = pd.DataFrame(data)
print(df)
   label                                         text
0    ham                     How are you doing today?
1   spam  URGENT! You have won $1000! Click here now!
2    ham                     Can you pick me up at 5?
3   spam    Free money! Call now to claim your prize!
4    ham              See you at the meeting tomorrow
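For a real experiment, the five-message sample would be replaced by the UCI SMS Spam Collection mentioned above. The following is a minimal sketch of loading it, assuming the file is tab-separated with the label first and the message second, and has no header row. The in-memory `sample_file` is a stand-in for the downloaded file, and its two rows are invented for illustration:

```python
import csv
import io

import pandas as pd

# Stand-in for the downloaded file; these rows are invented for illustration.
sample_file = io.StringIO(
    "ham\tAre we still on for dinner tonight?\n"
    "spam\tYou have been selected for a \"free\" cruise. Reply YES now!\n"
)

# Assumed layout: label <TAB> message, no header row.
# QUOTE_NONE keeps quotation marks inside messages from being parsed as CSV quoting.
sms_df = pd.read_csv(
    sample_file,
    sep="\t",
    header=None,
    names=["label", "text"],
    quoting=csv.QUOTE_NONE,
)

print(sms_df.shape)
print(sms_df["label"].value_counts())
```

With a file on disk, the `io.StringIO` object would be replaced by the path to the downloaded file.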
Step 2: Data Preprocessing
Data preprocessing is an essential early step in any machine learning task. For SMS spam detection, it means cleaning and normalizing the raw text messages so that they can later be converted into a numerical representation our model can work with. Normalization frequently involves steps such as tokenization, stop word removal, and stemming or lemmatization.
Here's an example that preprocesses the text data using basic Python functions, covering lowercasing, punctuation removal, and whitespace cleanup:
import re
import string
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text
# Apply preprocessing to the dataset
df['processed_text'] = df['text'].apply(preprocess_text)
print(df[['text', 'processed_text']])
text processed_text
0 How are you doing today? how are you doing today
1 URGENT! You have won $1000! Click here now! urgent you have won 1000 click here now
2 Can you pick me up at 5? can you pick me up at 5
3 Free money! Call now to claim your prize! free money call now to claim your prize
4 See you at the meeting tomorrow see you at the meeting tomorrow
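The function above covers lowercasing and punctuation removal only. The tokenization and stop word removal mentioned earlier could be sketched as follows. The `STOP_WORDS` set is a tiny hand-picked list assumed for illustration, not a standard one, and `tokenize_and_filter` is a hypothetical helper name:

```python
# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "to", "at", "you", "are", "have", "is", "me"}

def tokenize_and_filter(processed_text):
    """Split already-cleaned text on whitespace and drop stop words."""
    tokens = processed_text.split()
    return [token for token in tokens if token not in STOP_WORDS]

print(tokenize_and_filter("urgent you have won 1000 click here now"))
# ['urgent', 'won', '1000', 'click', 'here', 'now']
print(tokenize_and_filter("see you at the meeting tomorrow"))
# ['see', 'meeting', 'tomorrow']
```

Whether dropping stop words helps a spam classifier is an empirical question, so it is worth comparing results with and without this step.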
Step 3: Feature Extraction
To capture the essence of the SMS messages after text preprocessing, it is important to select significant features. One widely used method for feature extraction is the Bag-of-Words model. This approach represents each text as a vector of word frequencies or presence indicators.
Let's extract features using the CountVectorizer from scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
# Initialize the CountVectorizer
vectorizer = CountVectorizer()
# Extract features from the processed text
features = vectorizer.fit_transform(df['processed_text'])
# Convert labels to binary (0 for ham, 1 for spam)
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(df['label'])
print("Feature shape:", features.shape)
print("Labels:", labels)
print("Label mapping:", dict(zip(label_encoder.classes_, range(len(label_encoder.classes_)))))
Feature shape: (5, 28)
Labels: [0 1 0 1 0]
Label mapping: {'ham': 0, 'spam': 1}
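To make the Bag-of-Words idea concrete, the following sketch builds count vectors by hand for two short made-up messages. It is a simplified illustration, not a reimplementation of `CountVectorizer` (which, for instance, ignores single-character tokens by default), and `bag_of_words` is a hypothetical helper name:

```python
from collections import Counter

def bag_of_words(documents):
    """Return a sorted vocabulary and one word-count vector per document."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["free money call now", "call me now now"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['call', 'free', 'me', 'money', 'now']
print(vectors)     # [[1, 1, 0, 1, 1], [1, 0, 1, 0, 2]]
```

Each column corresponds to one vocabulary word, so every message becomes a fixed-length vector regardless of how many words it contains.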
Step 4: Model Training
Now we'll build and train our TensorFlow model. We'll use Keras, TensorFlow's high-level API, to create a sequential neural network model:
import tensorflow as tf
from sklearn.model_selection import train_test_split
import numpy as np
# Convert features to dense array
X = features.toarray()
y = labels
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build the model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(X.shape[1],)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
# Train the model
history = model.fit(X_train, y_train,
                    epochs=50,
                    batch_size=2,
                    validation_split=0.2,
                    verbose=0)
print("Model training completed!")
print("Final training accuracy:", history.history['accuracy'][-1])
Model training completed!
Final training accuracy: 1.0
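The `binary_crossentropy` loss passed to `model.compile` penalizes confident wrong predictions heavily. For a single message with true label y (0 or 1) and predicted spam probability p, the standard formula is -(y * log(p) + (1 - y) * log(1 - p)). Below is a small sketch of that per-example computation. Keras also averages over the batch and clips probabilities for numerical stability, which this sketch only approximates with an illustrative `eps` parameter:

```python
import math

def binary_cross_entropy(y_true, p_spam, eps=1e-7):
    """Per-example loss for a true label (0 or 1) and a predicted spam probability."""
    # Clip to avoid log(0); eps is an illustrative choice.
    p = min(max(p_spam, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

print(round(binary_cross_entropy(1, 0.9), 4))  # 0.1054
print(round(binary_cross_entropy(1, 0.5), 4))  # 0.6931
print(round(binary_cross_entropy(0, 0.9), 4))  # 2.3026
```

A prediction of 0.5 costs about 0.69 whatever the true label is, which is the loss of a model that has learned nothing.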
Step 5: Model Evaluation
After training, we evaluate the model on the test set to see how well it generalizes to unseen data. Note that with only five messages and test_size=0.2, the test set here contains a single message:
# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
# Make predictions
y_pred = model.predict(X_test)
y_pred_binary = (y_pred > 0.5).astype(int)
print("Test Accuracy:", test_accuracy)
print("Test Loss:", test_loss)
print("Predictions:", y_pred_binary.flatten())
print("Actual labels:", y_test)
Test Accuracy: 1.0
Test Loss: 0.08955144882202148
Predictions: [1]
Actual labels: [1]
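Accuracy alone can mislead in spam filtering when ham outnumbers spam, so precision and recall for the spam class are common complements. Here is a minimal pure-Python sketch, using made-up label lists rather than the model's output (`precision_recall` is a hypothetical helper):

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive (spam) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented example: 1 = spam, 0 = ham.
actual    = [1, 0, 1, 1, 0, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 0, 1]
precision, recall = precision_recall(actual, predicted)
print(precision, recall)  # 0.75 0.75
```

Precision reflects how many flagged messages were really spam, while recall reflects how much of the spam was caught.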
Step 6: Making Predictions on New Messages
Once trained, we can use our model to classify new SMS messages as spam or ham:
def predict_spam(message):
    # Preprocess the message
    processed_msg = preprocess_text(message)
    # Transform using the same vectorizer
    msg_vector = vectorizer.transform([processed_msg]).toarray()
    # Make prediction
    prediction = model.predict(msg_vector)[0][0]
    # Convert to label
    if prediction > 0.5:
        return "SPAM", prediction
    else:
        return "HAM", prediction
# Test with new messages
test_messages = [
    "Congratulations! You've won a free iPhone!",
    "Hey, want to grab lunch today?",
    "URGENT: Click this link to claim your prize!"
]
for msg in test_messages:
    result, confidence = predict_spam(msg)
    print(f"Message: '{msg}'")
    print(f"Classification: {result} (confidence: {confidence:.3f})")
    print("-" * 50)
Message: 'Congratulations! You've won a free iPhone!'
Classification: HAM (confidence: 0.494)
--------------------------------------------------
Message: 'Hey, want to grab lunch today?'
Classification: HAM (confidence: 0.493)
--------------------------------------------------
Message: 'URGENT: Click this link to claim your prize!'
Classification: SPAM (confidence: 0.501)
--------------------------------------------------
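The scores above all sit within about 0.01 of the 0.5 cutoff, so the labels depend heavily on where the threshold is placed. The sketch below shows that sensitivity by reusing the three printed scores (`label_for` is a hypothetical helper, and the alternative thresholds are arbitrary):

```python
def label_for(score, threshold=0.5):
    """Map a predicted spam probability to a label using a strict cutoff."""
    return "SPAM" if score > threshold else "HAM"

scores = [0.494, 0.493, 0.501]  # values from the output above

for threshold in (0.49, 0.5, 0.51):
    labels = [label_for(score, threshold) for score in scores]
    print(threshold, labels)
# 0.49 ['SPAM', 'SPAM', 'SPAM']
# 0.5 ['HAM', 'HAM', 'SPAM']
# 0.51 ['HAM', 'HAM', 'HAM']
```

Scores this close to the cutoff suggest the model has learned little from five messages; with more training data the scores would be expected to separate more clearly.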
Model Performance Summary
| Metric | Value | Description |
|---|---|---|
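| Training Accuracy | 100% | Perfect fit on a handful of training messages; expected with so little data |
| Test Accuracy | 100% | Based on a single test message; not a reliable estimate of generalization |
| Model Size | 3 Dense layers | Simple neural network with one dropout layer |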
Conclusion
SMS spam detection with TensorFlow in Python follows a clear pipeline: preprocess the text, extract features, train a neural network, and evaluate it before classifying new messages. With only five messages, the model built here is purely illustrative, as the near-0.5 scores on new messages show. The same steps apply to larger, real-world datasets such as the UCI SMS Spam Collection, where the model has enough data to learn meaningful patterns.
