Category: Artificial Intelligence

  • Day 5 of My Learning Journey: Building a Multilingual Sentiment Analysis Model

    On Day 5 of my learning journey, I dove into the fascinating world of Natural Language Processing (NLP) by building a multilingual sentiment analysis model using Python. This project was an exciting step toward understanding how machine learning can interpret human emotions from text data, even across different languages. Below, I share the key components of this project, the challenges I faced, and the lessons I learned.

    Project Overview

    The goal was to create a system that analyzes movie reviews and predicts whether they express a positive or negative sentiment. What made this project particularly exciting was its ability to handle reviews in multiple languages, such as English, Spanish, French, German, Japanese, and Russian, by incorporating language detection and translation.

    The project was structured into five key steps:

    1. Data Preparation: Loading and cleaning the IMDB dataset.
    2. Model Training: Training a logistic regression model on the processed data.
    3. Multilingual Testing: Adding language detection and translation to handle non-English reviews.
    4. Model Evaluation: Assessing the model’s performance using accuracy and classification metrics.
    5. Interactive Application: Building a simple interface for users to input reviews and get sentiment predictions.

    Step-by-Step Breakdown

    1. Data Preparation

    I started by loading the IMDB dataset, a collection of movie reviews labeled as positive or negative. Using pandas, I read the CSV file and performed initial checks to ensure the dataset contained the expected columns (review and sentiment). To handle potential inconsistencies in column names, I implemented logic to dynamically identify relevant columns.

    The text data was cleaned by:

    • Converting reviews to lowercase.
    • Removing punctuation using regular expressions (re).
    • Transforming the text into numerical features using CountVectorizer from scikit-learn, which creates a bag-of-words representation.

    The processed data (X for features, y for labels) and the vectorizer were saved using pickle for later use.
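
    For reference, here is a condensed sketch of this preparation step; the CSV file name, the saved-artifact names, and the column-matching heuristic are illustrative assumptions based on the description above:

    import re
    import pickle
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    
    # Load the IMDB dataset (file name assumed)
    df = pd.read_csv('IMDB Dataset.csv')
    
    # Dynamically identify the review and sentiment columns by name
    review_col = next(c for c in df.columns if 'review' in c.lower())
    label_col = next(c for c in df.columns if 'sentiment' in c.lower())
    
    def clean_text(text):
        text = text.lower()                   # lowercase
        return re.sub(r'[^\w\s]', '', text)   # strip punctuation
    
    df[review_col] = df[review_col].apply(clean_text)
    
    # Bag-of-words features
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(df[review_col])
    y = df[label_col]
    
    # Persist the processed data and the fitted vectorizer for later steps
    with open('processed_data.pkl', 'wb') as f:
        pickle.dump((X, y), f)
    with open('vectorizer.pkl', 'wb') as f:
        pickle.dump(vectorizer, f)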

    2. Model Training

    For the classification task, I chose Logistic Regression for its simplicity and effectiveness in binary classification. The dataset was split into 80% training and 20% testing sets using train_test_split. After fitting the model, I saved both the trained model and the held-out test data for later evaluation.
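
    A minimal sketch of this step, assuming the pickled artifacts from the data preparation sketch above (artifact names are illustrative):

    import pickle
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    
    with open('processed_data.pkl', 'rb') as f:
        X, y = pickle.load(f)
    
    # 80/20 train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    
    model = LogisticRegression(max_iter=1000)  # extra iterations help convergence on sparse text features
    model.fit(X_train, y_train)
    
    # Save the model and the held-out test split for the evaluation step
    with open('trained_model.pkl', 'wb') as f:
        pickle.dump(model, f)
    with open('test_data.pkl', 'wb') as f:
        pickle.dump((X_test, y_test), f)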

    3. Multilingual Sentiment Analysis

    To make the model multilingual, I integrated langdetect for language detection and deep_translator for translating non-English reviews into English. This allowed the model to process reviews in languages like Spanish, French, German, Japanese, and Russian. The workflow, with the helper functions sketched just after this list, was:

    • Detect the language of the input review.
    • If non-English, translate it to English using Google Translate.
    • Clean the text and transform it into numerical features using the saved vectorizer.
    • Predict sentiment using the trained model.
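
    The detect_language and translate_to_english helpers live in a small utils module; a plausible sketch of them, built on the two libraries named above, looks like this:

    from langdetect import detect
    from deep_translator import GoogleTranslator
    
    def detect_language(text):
        # langdetect can fail on very short or ambiguous input,
        # so fall back to 'unknown' rather than crashing
        try:
            return detect(text)
        except Exception:
            return 'unknown'
    
    def translate_to_english(text):
        # GoogleTranslator auto-detects the source language
        return GoogleTranslator(source='auto', target='en').translate(text)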

    4. Model Evaluation

    To evaluate the model’s performance, I used the test set to calculate:

    • Accuracy: The proportion of correct predictions.
    • Classification Report: Precision, recall, and F1-score for both positive and negative classes.
    • Confusion Matrix: To visualize true positives, true negatives, false positives, and false negatives.

    The model’s performance provided insights into its strengths and areas for improvement, such as handling imbalanced data or improving translation accuracy.
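
    A short evaluation sketch along these lines, assuming the model and test split were pickled during training (file names are illustrative):

    import pickle
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    
    with open('trained_model.pkl', 'rb') as f:
        model = pickle.load(f)
    with open('test_data.pkl', 'rb') as f:
        X_test, y_test = pickle.load(f)
    
    y_pred = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))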

    5. Interactive Application

    Finally, I created an interactive script that allows users to input movie reviews and receive sentiment predictions in real-time. The script uses the saved model and vectorizer to process user input, detect the language, and predict sentiment. I also tested the system with sample reviews in multiple languages to demonstrate its multilingual capabilities.

    Challenges and Lessons Learned

    • Data Cleaning: Ensuring consistent text preprocessing was critical. For example, removing punctuation and handling special characters improved the model’s performance.
    • Multilingual Processing: Language detection occasionally failed for short or ambiguous texts, leading to a fallback to English. This highlighted the importance of robust language detection libraries.
    • Model Limitations: The bag-of-words approach with CountVectorizer is simple but ignores word order and context. Exploring word embeddings or contextual models such as BERT could enhance performance.
    • Scalability: Saving and loading large datasets and models using pickle was efficient, but I learned about potential issues with pickle compatibility across Python versions.

    Key Takeaways

    • NLP Fundamentals: I gained hands-on experience with text preprocessing, feature extraction, and classification.
    • Multilingual NLP: Integrating language detection and translation opened up possibilities for global applications.
    • Evaluation Metrics: Understanding accuracy, precision, recall, and confusion matrices deepened my knowledge of model evaluation.
    • Practical Application: Building an interactive script showed me how to bridge the gap between a trained model and a user-facing application.

    Next Steps

    Moving forward, I plan to:

    • Experiment with richer text representations, from TF-IDF weighting to transformer models like BERT.
    • Improve language detection accuracy for short texts.
    • Deploy the model as a web application using frameworks like Flask or FastAPI to make it accessible to a broader audience (a minimal sketch follows this list).
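
    On the deployment point, here is a minimal Flask sketch of the idea; the /predict route and response shape are illustrative, and language detection and text cleaning are omitted for brevity:

    import pickle
    from flask import Flask, request, jsonify
    
    app = Flask(__name__)
    
    # Load the artifacts saved during training
    with open('trained_model.pkl', 'rb') as f:
        model = pickle.load(f)
    with open('vectorizer.pkl', 'rb') as f:
        vectorizer = pickle.load(f)
    
    @app.route('/predict', methods=['POST'])
    def predict():
        review = request.get_json().get('review', '')
        vector = vectorizer.transform([review])
        # str() keeps the response JSON-serializable regardless of label dtype
        return jsonify({'sentiment': str(model.predict(vector)[0])})
    
    if __name__ == '__main__':
        app.run(debug=True)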

    Code Highlight

    Below is a snippet of the interactive script for sentiment prediction:

    import pickle
    import re
    from utils import detect_language, translate_to_english
    
    # Load the trained model and vectorizer
    with open('trained_model.pkl', 'rb') as f:
        model = pickle.load(f)
    with open('vectorizer.pkl', 'rb') as f:
        vectorizer = pickle.load(f)
    
    def clean_text(text):
        text = text.lower()
        text = re.sub(r'[^\w\s]', '', text)
        return text
    
    def predict_sentiment(review, vectorizer, model):
        detected_lang = detect_language(review)
        if detected_lang != 'en' and detected_lang != 'unknown':
            review = translate_to_english(review)
        review = clean_text(review)
        review_vector = vectorizer.transform([review])
        return model.predict(review_vector)[0]
    
    # Interactive loop
    print("Sentiment Classifier: Enter a movie review to predict its sentiment.")
    while True:
        user_review = input("Enter your review (or type 'exit' to quit): ")
        if user_review.lower() == 'exit':
            break
        sentiment = predict_sentiment(user_review, vectorizer, model)
        print(f"Predicted Sentiment: {sentiment}\n")
  • Day 4 of Our Learning Journey: Building an AI-Powered Website Design Generator

    Welcome to Day 4 of our coding adventure! Today, we tackled an exciting project: an AI-Powered Website Design Generator that turns natural language prompts into custom HTML and CSS code. This tool makes web design accessible to everyone, allowing users to describe their vision—like “a modern portfolio with a dark theme and bold buttons”—and instantly get professional-grade code. As beginners, we’re thrilled to share our progress, the requirements behind this project, and a link to the code on GitHub.

    The Mission: Web Design for All

    Our goal was to create a tool that empowers anyone, from entrepreneurs to hobbyists, to generate website designs without coding. By combining AI with a user-friendly interface, we’re making web design fast, intuitive, and inclusive. Day 4 was the perfect opportunity to stretch our skills and build something impactful.

    The Tech Stack

    We used a beginner-friendly stack to bring this project to life:

    • Backend: Flask (Python) for the API, integrated with Google’s Gemini 1.5 Flash AI model to generate HTML and CSS.
    • Frontend: React for a dynamic, dark-themed interface that displays live previews and generated code.
    • AI: Google Gemini to process prompts and output structured JSON with HTML and CSS.
    • Deployment: Backend on a custom server, frontend hosted on Vercel for seamless access.

    Requirements

    To build and run the Website Design Generator, here’s what we needed:

    Backend Dependencies (requirements.txt)

    These Python packages power the Flask backend and AI integration:

    • Flask==3.0.3
    • blinker==1.9.0
    • click==8.2.1
    • colorama==0.4.6
    • itsdangerous==2.2.0
    • jinja2==3.1.6
    • markupsafe==3.0.2
    • werkzeug==3.1.3
    • idna==3.10
    • python-dotenv==1.1.1
    • requests==2.32.4
    • urllib3==2.5.0
    • charset_normalizer==3.4.2
    • certifi==2025.6.15
    • annotated-types==0.7.0
    • cachetools==5.5.2
    • google-ai-generativelanguage==0.6.15
    • google-api-core==2.25.1
    • google-api-python-client==2.175.0
    • google-auth==2.40.3
    • google-auth-httplib2==0.2.0
    • google-generativeai==0.8.5
    • googleapis-common-protos==1.70.0
    • grpcio==1.73.1
    • grpcio-status==1.71.2
    • httplib2==0.22.0
    • proto-plus==1.26.1
    • protobuf==5.29.5
    • pyasn1==0.6.1
    • pyasn1-modules==0.4.2
    • pydantic==2.11.7
    • pydantic-core==2.33.2
    • pyparsing==3.2.3
    • rsa==4.9.1
    • tqdm==4.67.1
    • typing-extensions==4.14.1
    • typing-inspection==0.4.1
    • uritemplate==4.2.0
    • flask-cors==6.0.1

    Frontend Dependencies

    The React frontend relies on:

    • React (v18.x)
    • TypeScript for type safety
    • Vercel for deployment
    • Basic HTML/CSS for styling (inline styles in the component)

    Additional Requirements

    • Gemini API Key: A Google API key for accessing the Gemini 1.5 Flash model, stored in a .env file (the sketch after this list shows one way to load it).
    • Node.js: For running the React frontend locally.
    • Python 3.8+: For the Flask backend.
    • Internet Access: For API calls to Gemini and frontend-backend communication.
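
    Putting these backend pieces together, a minimal Flask route calling Gemini could look like the sketch below; the /generate endpoint, the GEMINI_API_KEY variable name, and the prompt wording are illustrative assumptions rather than our exact code:

    import os
    import google.generativeai as genai
    from dotenv import load_dotenv
    from flask import Flask, request, jsonify
    from flask_cors import CORS
    
    load_dotenv()  # read the API key from the .env file
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash")
    
    app = Flask(__name__)
    CORS(app)  # let the Vercel-hosted frontend call this API
    
    @app.route("/generate", methods=["POST"])
    def generate():
        prompt = request.get_json().get("prompt", "")
        # Ask Gemini for structured JSON with the HTML and CSS
        response = model.generate_content(
            "Return only JSON with keys 'html' and 'css' for this design: " + prompt
        )
        return jsonify({"raw": response.text})
    
    if __name__ == "__main__":
        app.run(debug=True)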

    What We Learned on Day 4

    This project was a whirlwind of new skills:

    • Backend Development: Setting up Flask routes, handling JSON, and integrating with an AI model taught us about APIs and server logic.
    • Frontend Development: Building a React interface with state management and live previews showed us how to create dynamic UIs.
    • AI Integration: Crafting prompts for Gemini and parsing its output helped us understand AI’s potential and quirks.
    • Deployment: Hosting on Vercel and configuring CORS gave us hands-on experience with production environments.
    • Problem-Solving: Handling errors, like inconsistent AI responses, pushed us to write robust code.

    Check Out the Code!

    We’ve shared the full project on GitHub for you to explore, run, or contribute to; the repository links are below. Try it out, experiment with prompts, and let us know what you think!

    Frontend Code: https://github.com/manojtsx/Website-Component-Design-Generator-Frontend

    Backend Code: https://github.com/manojtsx/Website-Component-Design-Generator-Backend

    What’s Next?

    On Day 5, we plan to enhance the generator with features like:

    • Customizable design tweaks via sliders or additional prompts.
    • Support for JavaScript to add interactivity.
    • A component library for reusable elements like navbars or footers.

    Let’s Connect!

    Day 4 has been a game-changer, showing us how AI can transform web development. If you’re learning to code, passionate about AI, or curious about web design, let’s connect! Share your thoughts in the comments, try our tool, or reach out to collaborate. Here’s to more learning and building!

    #WebDevelopment #AI #CodingJourney #Day4 #React #Flask

  • Day 3 of Learning OCR: Building a Modular Python OCR System

    On Day 3 of my journey into Optical Character Recognition (OCR), I took a significant step forward by organizing a Python-based OCR project into a modular, scalable structure. Using powerful libraries like OpenCV and Tesseract, I built a system capable of extracting text from images with improved preprocessing techniques. Below, I’ll share the project structure, the complete code for each file, and the key lessons I learned along the way.

    Why Modularize?

    As my OCR project grew, I realized the importance of keeping code organized and reusable. By splitting the functionality into separate files—each handling a specific task like image loading, preprocessing, or text extraction—I made the codebase easier to maintain, debug, and extend. This approach mirrors real-world software engineering practices, making it a valuable lesson for building production-ready applications.

    The Project Structure

    I designed a clean folder structure to keep everything tidy:

    universal_ocr/
    ├── images/               # Folder for input images
    ├── main.py               # Entry point of the application
    ├── ocr/
    │   ├── __init__.py       # Makes ocr a Python package
    │   ├── loader.py         # Handles image loading
    │   ├── processor.py      # Manages image preprocessing
    │   └── reader.py         # Performs text extraction
    

    Below is the complete code for each file, along with explanations of what I learned while building them.

    1. ocr/loader.py

    This module handles loading images from a specified folder, filtering for common image formats like PNG, JPG, and more.

    import os
    
    def load_images_from_folder(folder):
        images = []
        for filename in os.listdir(folder):
            if filename.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.tiff')):
                images.append(os.path.join(folder, filename))
        return images
    

    Key Learning: Using os.listdir() and os.path.join() makes file handling platform-independent. The case-insensitive check with filename.lower() ensures robustness across different image formats.

    2. ocr/processor.py

    This module preprocesses images to improve OCR accuracy. It includes steps like converting to grayscale, resizing, applying Gaussian blur, sharpening, adaptive thresholding, and skew correction.

    import cv2
    import numpy as np
    
    def preprocess_image(image):
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    
        scale_percent = 150
        width = int(gray.shape[1] * scale_percent / 100)
        height = int(gray.shape[0] * scale_percent / 100)
        gray = cv2.resize(gray, (width, height), interpolation=cv2.INTER_LINEAR)
    
        blur = cv2.GaussianBlur(gray, (5,5), 0)
    
        kernel_sharpen = np.array([[0,-1,0], [-1,5,-1], [0,-1,0]])
        sharpened = cv2.filter2D(blur, -1, kernel_sharpen)
    
        thresh = cv2.adaptiveThreshold(
            sharpened, 255, 
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C, 
            cv2.THRESH_BINARY, 31, 10)
    
        # cast to float32: cv2.minAreaRect rejects the int64 arrays np.where returns on some platforms
        coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
        angle = cv2.minAreaRect(coords)[-1]
        if angle < -45:
            angle = -(90 + angle)
        else:
            angle = -angle
    
        (h, w) = thresh.shape[:2]
        center = (w // 2, h // 2)
        M = cv2.getRotationMatrix2D(center, angle, 1.0)
        rotated = cv2.warpAffine(thresh, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
    
        return rotated
    

    Key Learning: Preprocessing is the backbone of effective OCR. Each step—grayscale conversion, resizing, blurring, sharpening, thresholding, and skew correction—addresses specific challenges like noise, low resolution, or text rotation. Tuning parameters like the thresholding block size (31) and constant (10) was critical for handling diverse image qualities.

    3. ocr/reader.py

    This module uses Tesseract to extract text from preprocessed images, leveraging the preprocessing function from processor.py.

    import cv2
    import pytesseract
    from .processor import preprocess_image
    
    # Optional: specify path if not in PATH
    # pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
    
    def extract_text_from_image(image_path):
        image = cv2.imread(image_path)
        preprocessed = preprocess_image(image)
        text = pytesseract.image_to_string(preprocessed)
        return text
    

    Key Learning: Tesseract’s performance heavily depends on image quality, making preprocessing essential. I also learned that specifying the Tesseract executable path is necessary in some environments, like Windows, if it’s not in the system PATH.

    4. main.py

    The main script ties everything together, loading images and extracting text while incorporating basic error handling.

    from ocr.loader import load_images_from_folder
    from ocr.reader import extract_text_from_image
    
    def main():
        image_folder = 'images'
        image_paths = load_images_from_folder(image_folder)
        
        for path in image_paths:
            print(f"\nExtracting from: {path}")
            try:
                text = extract_text_from_image(path)
                print("Text:\n", text.strip())
            except Exception as e:
                print("Failed to process image:", e)
    
    if __name__ == "__main__":
        main()
    

    Key Learning: A clean entry point simplifies execution and testing. Using try-except blocks ensures the program doesn’t crash on problematic images, and the if __name__ == "__main__": construct allows the script to be imported as a module without running the main logic.

    Running the Project

    To run the project, I placed images in the images/ folder and executed:

    python main.py
    

    The script processes each image, applies preprocessing, and prints the extracted text. This setup is simple yet flexible, allowing for future enhancements like logging or image previews.
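
    For completeness, the Python-side dependencies can be installed with pip; the Tesseract engine itself is installed separately (e.g., via the Windows installer or a system package manager):

    pip install opencv-python pytesseract numpy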

    Challenges and Takeaways

    • Challenge: Finding the right preprocessing parameters was tricky. For example, adjusting the adaptive thresholding parameters (31 and 10) required experimentation to handle different image qualities effectively.
    • Takeaway: Modular design not only improves code readability but also simplifies debugging and testing. By isolating preprocessing, I could refine it independently without affecting other components.
    • Next Steps: I plan to add logging to track errors and successes, implement image previews for visual debugging, and explore advanced preprocessing techniques for handling noisy or multilingual documents.

    Why This Matters

    This project is more than a learning exercise—it’s a step toward building real-world applications like document digitization, automated form processing, or assistive technologies for visually impaired users. Mastering OCR equips me with skills that have practical impact across industries, from finance to healthcare.

    Final Thoughts

    Day 3 taught me the power of modular design and the critical role of preprocessing in OCR. I’m excited to continue this journey, building on this foundation to tackle more complex challenges like multilingual text extraction or optimizing for low-quality images. If you’re working on OCR or computer vision projects, I’d love to hear your experiences and tips in the comments!

    #Python #OCR #ComputerVision #MachineLearning #Day3

  • Day 2 of Building a Spam Detector with Python: A Hands-On Machine Learning Journey

    On a late Sunday evening, July 6, 2025, I dove into a new machine learning adventure: creating a spam detector using Python. Inspired by my recent sentiment classifier project, this guide walks you through classifying emails or messages as “spam” or “ham” (not spam) in a practical, beginner-friendly way. Whether you’re a data enthusiast or a professional sharpening your skills, this five-step process will help you build and deploy your own ML model. Let’s get started!


    Step 1: Understanding Model Training Basics

    Machine learning is all about teaching models to spot patterns. For this spam detector, the goal is to analyze text and classify it as spam or ham. The workflow includes:

    • Data Preparation: Gathering and cleaning text data.
    • Choosing a Model: Selecting an algorithm like Naive Bayes.
    • Training: Feeding data to learn spam patterns.
    • Evaluation: Measuring the model’s accuracy.
    • Deployment: Applying it to classify new messages.

    What I Did:

    • Installed Python libraries (scikit-learn, pandas, numpy) using: pip install scikit-learn pandas numpy
    • Set up a clear plan to build a spam detector, leveraging text classification techniques.

    This step laid the groundwork for a streamlined ML project.


    Step 2: Preparing the Data

    Data powers any ML model. I used a small sample dataset of messages for simplicity, but real-world applications benefit from larger datasets like the SMS Spam Collection.

    What I Did:

    • Created a dataset with five messages and their labels (spam/ham).
    • Cleaned text by converting to lowercase and removing punctuation.
    • Transformed text into numerical features using CountVectorizer.

    Here’s the code:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    import re
    
    # Sample dataset
    data = {
        'message': [
            'Win a free iPhone now! Click here.',
            'Meeting at 10 AM tomorrow, see you there.',
            'Get rich quick with this offer!',
            'Hi, just checking in about the project.',
            'Claim your prize today, urgent!'
        ],
        'label': ['spam', 'ham', 'spam', 'ham', 'spam']
    }
    
    # Create a DataFrame
    df = pd.DataFrame(data)
    
    # Clean text data
    def clean_text(text):
        text = text.lower()
        text = re.sub(r'[^\w\s]', '', text)
        return text
    
    df['message'] = df['message'].apply(clean_text)
    
    # Convert text to numerical features
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(df['message'])
    y = df['label']
    
    print("Feature matrix shape:", X.shape)
    print("Labels:", y)
    

    This code converted raw text into a numerical format, creating a feature matrix (X) and labels (y) for the model.


    Step 3: Choosing and Training the Model

    I chose Multinomial Naive Bayes, a go-to algorithm for text classification due to its strength with word frequency data.

    What I Did:

    • Split data into 80% training and 20% testing sets.
    • Trained the Naive Bayes model on the training data.

    Here’s the code:

    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Initialize and train the model
    model = MultinomialNB()
    model.fit(X_train, y_train)
    
    print("Model trained successfully!")
    print("Training data shape:", X_train.shape)
    print("Testing data shape:", X_test.shape)
    

    The model learned to differentiate spam from ham based on word patterns.


    Step 4: Evaluating the Model

    To verify performance, I tested the model on the unseen test set and calculated key metrics.

    What I Did:

    • Made predictions on the test set.
    • Computed accuracy and generated a classification report for precision, recall, and F1-score.

    Here’s the code:

    from sklearn.metrics import accuracy_score, classification_report
    
    # Make predictions
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    # Print results
    print("Test Accuracy:", accuracy)
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    

    Note: The small dataset (five messages) limited the test set to one sample, reducing metric reliability. For robust results, use a larger dataset like the SMS Spam Collection.
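
    Swapping in that dataset only changes the loading step; here is a sketch assuming the UCI SMS Spam Collection file sits locally (it ships as a headerless, tab-separated file):

    import pandas as pd
    
    # Each line is "<label>\t<message>"; the file name assumes a local download
    df = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'message'])
    print(df.shape)                      # roughly 5,500 messages
    print(df['label'].value_counts())    # ham vs. spam counts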


    Step 5: Deploying the Model in a Project

    I created an interactive application to classify new messages, showcasing real-world applicability.

    What I Did:

    • Saved the model and vectorizer for reuse.
    • Built a function to predict labels and an interactive script for user inputs.

    Here’s the code:

    import joblib
    
    # Save the model and vectorizer
    joblib.dump(model, 'spam_detector_model.pkl')
    joblib.dump(vectorizer, 'vectorizer.pkl')
    
    # Function to predict spam
    def predict_spam(message, vectorizer, model):
        message = clean_text(message)
        message_vector = vectorizer.transform([message])
        prediction = model.predict(message_vector)
        return prediction[0]
    
    # Interactive script
    print("Spam Detector: Enter a message to check if it's spam or ham.")
    while True:
        user_message = input("Enter your message (or type 'exit' to quit): ")
        if user_message.lower() == 'exit':
            break
        label = predict_spam(user_message, vectorizer, model)
        print(f"Predicted Label: {label}\n")
    
    # Example usage
    sample_messages = [
        "Win a free trip today! Click now.",
        "Let’s schedule a call for tomorrow."
    ]
    for message in sample_messages:
        label = predict_spam(message, vectorizer, model)
        print(f"Message: {message}\nPredicted Label: {label}\n")
    

    This script allows users to input messages and receive instant spam/ham predictions, demonstrating practical deployment.


    What’s Next?

    This spam detector is just the beginning! Here are some ideas to take it further:

    • Scale Up: Use a larger dataset for improved accuracy.
    • Deploy as a Web App: Create a user-friendly interface with Flask or Streamlit.
    • Enhance Features: Add keyword filtering or email header analysis.

    Building this project fueled my excitement for applying ML to real-world challenges. I’d love to see your ML projects or hear your thoughts—connect with me on LinkedIn and let’s keep the conversation going! 🚀

    #MachineLearning #Python #DataScience #AI #SpamDetection

  • Day 1 of Building a Sentiment Classifier with Python: A Beginner-Friendly Machine Learning Project

    Machine learning (ML) is revolutionizing data analysis and predictive modeling. In this article, I’ll guide you through creating a simple sentiment classifier to predict whether a movie review is positive or negative. This beginner-friendly project uses Python and scikit-learn, and by the end, you’ll have a functional model ready for real-world applications. Let’s explore the five steps I took to build this project from scratch!


    Step 1: Understanding Model Training Basics

    Model training involves teaching an ML model to identify patterns in data. For this project, we aim to classify movie reviews as “positive” or “negative” based on their text. The process includes:

    • Data Preparation: Collecting and cleaning data.
    • Choosing a Model: Selecting an algorithm (e.g., Logistic Regression).
    • Training: Feeding data to the model to learn.
    • Evaluation: Testing the model’s performance.
    • Deployment: Integrating the model into a project.

    What I Did:

    • Installed Python (3.8+) and required libraries: scikit-learn, pandas, and numpy.
    • Ran this command to set up my environment: pip install scikit-learn pandas numpy
    • Chose to build a sentiment classifier for movie reviews, a classic ML task.

    This step established the foundation, ensuring I had the tools and a clear objective.


    Step 2: Preparing the Data

    Data is the backbone of any ML model. For simplicity, I used a small sample dataset of movie reviews, but you can scale up with datasets like IMDb from Kaggle.

    What I Did:

    • Created a sample dataset with five reviews and their sentiments.
    • Cleaned the text by converting it to lowercase and removing punctuation.
    • Converted text into numerical features using a bag-of-words model with CountVectorizer.

    Here’s the code:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    import re
    
    # Sample dataset
    data = {
        'review': [
            'This movie was amazing and I loved it!',
            'Terrible film, really boring.',
            'Great acting and wonderful story.',
            'Awful, I hated this movie.',
            'Fantastic experience, highly recommend!'
        ],
        'sentiment': ['positive', 'negative', 'positive', 'negative', 'positive']
    }
    
    # Create a DataFrame
    df = pd.DataFrame(data)
    
    # Clean text data
    def clean_text(text):
        text = text.lower()  # Convert to lowercase
        text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
        return text
    
    df['review'] = df['review'].apply(clean_text)
    
    # Convert text to numerical features
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(df['review'])  # Features
    y = df['sentiment']  # Labels
    
    print("Feature matrix shape:", X.shape)
    print("Labels:", y)
    

    This code transformed raw text into a numerical format, creating a feature matrix (X) and labels (y) for the model.


    Step 3: Choosing and Training the Model

    I selected Logistic Regression, a robust algorithm for binary classification (positive vs. negative).

    What I Did:

    • Split the data into training (80%) and testing (20%) sets for later evaluation.
    • Trained the Logistic Regression model on the training data.

    Here’s the code:

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Initialize and train the model
    model = LogisticRegression()
    model.fit(X_train, y_train)
    
    print("Model trained successfully!")
    print("Training data shape:", X_train.shape)
    print("Testing data shape:", X_test.shape)
    

    The model learned patterns from the training data, preparing it for predictions.


    Step 4: Evaluating the Model

    To assess performance, I tested the model on the unseen test set and calculated metrics like accuracy.

    What I Did:

    • Made predictions on the test set.
    • Computed accuracy and generated a classification report for detailed metrics.

    Here’s the code:

    from sklearn.metrics import accuracy_score, classification_report
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    
    # Print results
    print("Test Accuracy:", accuracy)
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    

    Note: With only five reviews, the test set had one sample, making metrics less reliable. For robust results, use a larger dataset like IMDb with thousands of reviews.
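
    Scaling up changes only the loading step; here is a sketch assuming Kaggle's "IMDB Dataset of 50K Movie Reviews" CSV, which uses review and sentiment columns (the file name is an assumption):

    import pandas as pd
    
    # Single CSV with 'review' and 'sentiment' columns
    df = pd.read_csv('IMDB Dataset.csv')
    print(df.shape)                        # ~50,000 rows
    print(df['sentiment'].value_counts())  # balanced positive/negative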


    Step 5: Deploying the Model in a Project

    I built a simple application to predict sentiments for new reviews, showcasing real-world usability.

    What I Did:

    • Saved the model and vectorizer for reuse.
    • Created a function to preprocess and predict sentiment for new reviews.
    • Developed an interactive script for user inputs.

    Here’s the code:

    import joblib
    
    # Save the model and vectorizer
    joblib.dump(model, 'sentiment_model.pkl')
    joblib.dump(vectorizer, 'vectorizer.pkl')
    
    # Function to predict sentiment
    def predict_sentiment(review, vectorizer, model):
        review = clean_text(review)
        review_vector = vectorizer.transform([review])
        prediction = model.predict(review_vector)
        return prediction[0]
    
    # Interactive script
    print("Sentiment Classifier: Enter a movie review to predict its sentiment.")
    while True:
        user_review = input("Enter your review (or type 'exit' to quit): ")
        if user_review.lower() == 'exit':
            break
        sentiment = predict_sentiment(user_review, vectorizer, model)
        print(f"Predicted Sentiment: {sentiment}\n")
    
    # Example usage
    sample_reviews = [
        "This movie was fantastic and thrilling!",
        "I didn’t enjoy the plot, it was confusing."
    ]
    for review in sample_reviews:
        sentiment = predict_sentiment(review, vectorizer, model)
        print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n")
    

    This script enables users to input reviews and receive instant sentiment predictions, demonstrating practical deployment.


    What’s Next?

    This project is a great starting point! Here are ways to enhance it:

    • Improve the Model: Experiment with algorithms like Naive Bayes or use larger datasets.
    • Deploy as a Web App: Use Flask or Streamlit for a user-friendly interface.
    • Explore New Domains: Apply the workflow to predict stock trends, customer feedback, or spam.

    Building this sentiment classifier was an exciting way to dive into the ML workflow. Whether you’re new to ML or advancing your skills, I hope this inspires your next project!

    Try It Yourself: Grab the code, install the libraries, and experiment with your dataset. Share your projects or questions on LinkedIn—let’s connect and learn together!

    #MachineLearning #Python #DataScience #SentimentAnalysis #ScikitLearn