On a late Sunday evening, July 6, 2025, I dove into a new machine learning adventure: creating a spam detector using Python. Inspired by my recent sentiment classifier project, this guide walks you through classifying emails or messages as “spam” or “ham” (not spam) in a practical, beginner-friendly way. Whether you’re a data enthusiast or a professional sharpening your skills, this five-step process will help you build and deploy your own ML model. Let’s get started!
Step 1: Understanding Model Training Basics
Machine learning is all about teaching models to spot patterns. For this spam detector, the goal is to analyze text and classify it as spam or ham. The workflow includes:
- Data Preparation: Gathering and cleaning text data.
- Choosing a Model: Selecting an algorithm like Naive Bayes.
- Training: Feeding data to learn spam patterns.
- Evaluation: Measuring the model’s accuracy.
- Deployment: Applying it to classify new messages.
What I Did:
- Installed Python libraries (scikit-learn, pandas, numpy) using:
pip install scikit-learn pandas numpy
- Set up a clear plan to build a spam detector, leveraging text classification techniques.
This step laid the groundwork for a streamlined ML project.
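Before going step by step, the five-stage workflow above can be compressed into a few lines. This is only a minimal sketch, assuming scikit-learn's `make_pipeline` helper and a handful of made-up messages (the `messages`, `labels`, and `workflow` names are illustrative and not part of the project code that follows):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages, for illustration only
messages = [
    "win a free prize now",
    "free cash offer click now",
    "lunch at noon tomorrow",
    "see you at the meeting",
]
labels = ["spam", "spam", "ham", "ham"]

# Data preparation + model choice: vectorizer and classifier chained together
workflow = make_pipeline(CountVectorizer(), MultinomialNB())

# Training: learn word counts per class
workflow.fit(messages, labels)

# Evaluation: scored on the training data here, only to show the call
train_accuracy = workflow.score(messages, labels)

# Deployment: classify a message the model has not seen
new_prediction = workflow.predict(["free prize offer"])[0]
print("Training accuracy:", train_accuracy)
print("Prediction for new message:", new_prediction)
```

Scoring on the training data says nothing about how the model generalizes; it appears here only so each stage of the workflow has a matching line of code.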
Step 2: Preparing the Data
Data powers any ML model. I used a small sample dataset of messages for simplicity, but real-world applications benefit from larger datasets like the SMS Spam Collection.
What I Did:
- Created a dataset with five messages and their labels (spam/ham).
- Cleaned text by converting to lowercase and removing punctuation.
- Transformed text into numerical features using CountVectorizer.
Here’s the code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import re
# Sample dataset
data = {
    'message': [
        'Win a free iPhone now! Click here.',
        'Meeting at 10 AM tomorrow, see you there.',
        'Get rich quick with this offer!',
        'Hi, just checking in about the project.',
        'Claim your prize today, urgent!'
    ],
    'label': ['spam', 'ham', 'spam', 'ham', 'spam']
}
# Create a DataFrame
df = pd.DataFrame(data)
# Clean text data
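def clean_text(text):
    # Lowercase so "Free" and "free" map to the same token
    text = text.lower()
    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    return text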
df['message'] = df['message'].apply(clean_text)
# Convert text to numerical features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['message'])
y = df['label']
print("Feature matrix shape:", X.shape)
print("Labels:", y)
This code converted raw text into a numerical format, creating a feature matrix (X) and labels (y) for the model.
Step 3: Choosing and Training the Model
I chose Multinomial Naive Bayes, a common baseline for text classification because it models word counts directly and trains quickly, even on small datasets.
What I Did:
- Split data into 80% training and 20% testing sets.
- Trained the Naive Bayes model on the training data.
Here’s the code:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train the model
model = MultinomialNB()
model.fit(X_train, y_train)
print("Model trained successfully!")
print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)
The model learned to differentiate spam from ham based on word patterns.
Step 4: Evaluating the Model
To verify performance, I tested the model on the unseen test set and calculated key metrics.
What I Did:
- Made predictions on the test set.
- Computed accuracy and generated a classification report for precision, recall, and F1-score.
Here’s the code:
from sklearn.metrics import accuracy_score, classification_report
# Make predictions
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# Print results
print("Test Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Note: With only five messages, the 20% test split contains a single sample, so accuracy can only be 0 or 1 and the per-class precision, recall, and F1 values are not meaningful. For robust results, use a larger dataset like the SMS Spam Collection.
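When data is scarce, cross-validation gives a somewhat steadier estimate than a single split, because every message is used for testing exactly once. The following is a hedged sketch using `cross_val_score` with stratified folds; the twelve messages are hypothetical examples written for this illustration, not real data, and the scores they produce should not be read as a benchmark:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical messages, for illustration only
texts = [
    "win a free prize now",
    "claim your cash reward today",
    "free offer click here now",
    "urgent prize waiting claim now",
    "get rich quick free offer",
    "click now to win cash",
    "meeting at ten tomorrow",
    "see you at lunch",
    "checking in about the project",
    "can you send the report",
    "call me when you are free",
    "thanks for the update today",
]
labels = ["spam"] * 6 + ["ham"] * 6

# A pipeline refits the vectorizer inside each fold, so test words never leak into training
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# Stratified folds keep the spam/ham ratio the same in every split
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=cv)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```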
Step 5: Deploying the Model in a Project
I wrapped the model in a small interactive command-line script that classifies new messages, a first step toward real-world use.
What I Did:
- Saved the model and vectorizer for reuse.
- Built a function to predict labels and an interactive script for user inputs.
Here’s the code:
import joblib
# Save the model and vectorizer
joblib.dump(model, 'spam_detector_model.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')
# Function to predict spam
def predict_spam(message, vectorizer, model):
    message = clean_text(message)
    message_vector = vectorizer.transform([message])
    prediction = model.predict(message_vector)
    return prediction[0]
# Interactive script
print("Spam Detector: Enter a message to check if it's spam or ham.")
while True:
    user_message = input("Enter your message (or type 'exit' to quit): ")
    if user_message.lower() == 'exit':
        break
    label = predict_spam(user_message, vectorizer, model)
    print(f"Predicted Label: {label}\n")
# Example usage
sample_messages = [
    "Win a free trip today! Click now.",
    "Let’s schedule a call for tomorrow."
]
for message in sample_messages:
    label = predict_spam(message, vectorizer, model)
    print(f"Message: {message}\nPredicted Label: {label}\n")
This script lets users enter messages and get an immediate spam/ham prediction, a simple stand-in for a real deployment.
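Saving the model only pays off if a later session loads it again. Here is a minimal sketch of that round trip with `joblib.load`. It is self-contained, so it retrains on the five sample messages and writes to a temporary directory; in a real script you would load the `.pkl` files from wherever they were saved. The names `loaded_model`, `loaded_vectorizer`, and `reloaded_label` are illustrative:

```python
import re
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def clean_text(text):
    # Same cleaning as in Step 2: lowercase, then strip punctuation
    text = text.lower()
    return re.sub(r'[^\w\s]', '', text)


raw_messages = [
    'Win a free iPhone now! Click here.',
    'Meeting at 10 AM tomorrow, see you there.',
    'Get rich quick with this offer!',
    'Hi, just checking in about the project.',
    'Claim your prize today, urgent!',
]
labels = ['spam', 'ham', 'spam', 'ham', 'spam']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform([clean_text(m) for m in raw_messages])
model = MultinomialNB().fit(X, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = str(Path(tmp_dir) / 'spam_detector_model.pkl')
    vectorizer_path = str(Path(tmp_dir) / 'vectorizer.pkl')

    # Save both artifacts: the model is useless without the matching vocabulary
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)

    # Later (for example, in a fresh process): load them back
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

new_message = clean_text("Win a free trip today! Click now.")
reloaded_label = loaded_model.predict(loaded_vectorizer.transform([new_message]))[0]
original_label = model.predict(vectorizer.transform([new_message]))[0]
print("Reloaded model predicts:", reloaded_label)
```

Only load pickle files you created yourself or obtained from a trusted source, since unpickling can execute arbitrary code.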
What’s Next?
This spam detector is just the beginning! Here are some ideas to take it further:
- Scale Up: Use a larger dataset for improved accuracy.
- Deploy as a Web App: Create a user-friendly interface with Flask or Streamlit.
- Enhance Features: Add keyword filtering or email header analysis.
Building this project fueled my excitement for applying ML to real-world challenges. I’d love to see your ML projects or hear your thoughts—connect with me on LinkedIn and let’s keep the conversation going! 🚀
#MachineLearning #Python #DataScience #AI #SpamDetection
