Building an SMS Spam Detector Using Text Classification
In today’s world, spam messages flood our inboxes, making it difficult to distinguish between important and irrelevant texts. Companies often lose customers because they can't properly filter spam, which frustrates users. What if we could build a system that automatically identifies spam messages and flags them?
This project solves that problem by building an SMS Spam Classifier that uses Natural Language Processing (NLP) techniques to automatically label SMS messages as spam or ham (not spam).
Solution: Text Classification Using Bag of Words
To solve this problem, we’ll break down the solution into several key steps:
Data Preprocessing:
- We’ll clean the text data by converting everything to lowercase, removing unnecessary characters, and applying stopwords filtering (to remove common English words that don’t add much meaning, like “the”, “is”, “in”).
- Stemming will be used to reduce words to their root forms (e.g., "running" becomes "run").
Bag of Words Model:
- We’ll use Bag of Words (BoW) to transform the text data into numerical format that the machine learning algorithm can process. BoW treats each word in the text as a separate feature.
- This model will convert each SMS into a vector based on word frequencies or presence/absence of specific words.
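To see what these vectors look like, here is a tiny toy example with scikit-learn's `CountVectorizer` (the three sample messages are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free prize win now", "call me now", "win free cash"]

cv = CountVectorizer()
X = cv.fit_transform(docs).toarray()

# 7 unique words across the 3 messages -> each message becomes a 7-dimensional vector
print(X.shape)            # (3, 7)
print(cv.vocabulary_)     # maps each word to its column index
```

Each row is one message; each column counts how often one vocabulary word appears in it.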
Model Training:
- Once the text data is transformed, we’ll use a classification algorithm (like Naive Bayes, Logistic Regression, or others) to train the model to classify messages as spam or ham.
Here is the code for the program:
# %% [markdown]
# Importing all the necessary libraries
# %%
import pandas as pd
import numpy as np
# %%
messages = pd.read_csv('smsspamcollection.csv')
messages
# %% [markdown]
# ##Data Cleaning and Preprocessing
# (lowercase the sentences, remove stopwords, and apply stemming)
# %%
import re
import nltk
nltk.download('stopwords')
# %%
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# %%
corpus = []
for i in range(0, len(messages)):
    # Note: the character class must be [^a-zA-Z]; [^a-zA-z] also matches punctuation
    review = re.sub('[^a-zA-Z]', ' ', messages['message'][i])
    review = review.lower()
    review = review.split()
    # Assign the result back, otherwise the stemmed words are discarded
    review = [ps.stem(word) for word in review if word not in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)
# %%
corpus
# %% [markdown]
# ##Bag of Words
# %%
from sklearn.feature_extraction.text import CountVectorizer
##for a Binary Bag of Words, set binary=True
cv = CountVectorizer(max_features=100, binary=True)
# %%
x=cv.fit_transform(corpus).toarray()
x
# %%
x.shape
# %% [markdown]
# ##N-Grams
# %%
cv.vocabulary_
# %%
##Create the Bag of Words model with n-grams
from sklearn.feature_extraction.text import CountVectorizer
##ngram_range=(1,2) keeps single words and adds two-word pairs (bigrams) as features;
##(1,1) would produce only unigrams, i.e. the same model as before
cv = CountVectorizer(max_features=100, binary=True, ngram_range=(1, 2))
x = cv.fit_transform(corpus).toarray()
x=cv.fit_transform(corpus).toarray()
# %%
cv.vocabulary_
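The cells above stop at feature extraction. The training step described earlier can be sketched with Multinomial Naive Bayes; the `train_spam_classifier` helper and the synthetic messages below are illustrative assumptions, not part of the original notebook. In practice you would pass `corpus` and the dataset's actual label column.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


def train_spam_classifier(texts, labels):
    """Fit a Bag-of-Words + Multinomial Naive Bayes model and report held-out accuracy."""
    cv = CountVectorizer(max_features=100, binary=True)
    X = cv.fit_transform(texts).toarray()
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=0)
    model = MultinomialNB()
    model.fit(X_train, y_train)
    return model, cv, accuracy_score(y_test, model.predict(X_test))


# Tiny synthetic demo (replace with `corpus` and the real labels):
texts = ["win free prize now", "see you at lunch", "free cash claim now",
         "are we still meeting", "claim your free prize", "call me tonight",
         "urgent prize waiting", "lunch tomorrow works",
         "free offer just for you", "running late see you soon"]
labels = ["spam", "ham", "spam", "ham", "spam",
          "ham", "spam", "ham", "spam", "ham"]

model, cv, acc = train_spam_classifier(texts, labels)
prediction = model.predict(cv.transform(["claim your free cash"]).toarray())
```

Naive Bayes works well here because BoW features are word counts and the classifier's independence assumption, while crude, is a good fit for short messages.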