Author ORCID Identifier
0000-0001-9990-2968
Date of Award
17-8-2025
Document Type
Thesis
First Advisor
Dr. V. Ramaswamy
Keywords
Spam Email Classification, Phishing Detection, Cyber Attacks, Machine Learning, DDoS Attacks Detection
Abstract
E-mail is among the fastest and most commonly used modes of communication, and it can be used for both legal and illegal purposes. Features that could help detect email fraud are under constant investigation. Phishing attacks, in which criminals use the Internet to trick users into visiting fake websites, are very common and cause significant harm to victims. Several methods of filtering phishing emails have been devised, but a complete solution remains elusive. Email phishing is one of the attacks addressed in this research. Machine learning is a popular and efficient technique for classifying emails in tasks such as spam detection, email prioritization and routing emails to the proper folders. An email contains various fields such as Subject, Body, To, From, Cc, Bcc, Date and Time. For classifying emails as spam or ham, the subject and body are considered. After preprocessing, the words of the subject and body are taken as features. The Rule Based Subject Analysis (RBSA) algorithm is applied to the subject of an email to compute a weight, from which the presence of spam words is detected.
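The abstract does not spell out the RBSA rules, so the following is only a minimal sketch of the general idea of deriving a subject weight from spam words. The word list, the weights and the function names (`tokenize`, `subject_weight`, `has_spam_words`) are hypothetical and are not taken from the thesis.

```python
import re

# Hypothetical spam-word weights chosen for illustration only;
# the actual RBSA rules are not given in the abstract.
SPAM_WORD_WEIGHTS = {"free": 2.0, "winner": 3.0, "urgent": 1.5, "prize": 2.5}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def subject_weight(subject, weights=SPAM_WORD_WEIGHTS):
    """Sum the weights of the listed spam words that occur in the subject."""
    return sum(weights.get(token, 0.0) for token in tokenize(subject))


def has_spam_words(subject, weights=SPAM_WORD_WEIGHTS):
    """A positive weight means at least one listed spam word is present."""
    return subject_weight(subject, weights) > 0.0
```

Under these assumptions, a subject such as "URGENT: claim your FREE prize" receives a positive weight, while "Meeting agenda for Monday" receives a weight of zero.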
Another algorithm, Semantic Based Feature Selection (SBFS), is applied to the body text to reduce the number of words by removing meaningless words. The resulting textual features are converted into numerical values, since machine learning algorithms operate only on numerical data. The Bag of Words (BoW) model is used for this conversion. Machine learning algorithms such as Support Vector Machine (SVM), Multinomial Naïve Bayes, Gaussian Naïve Bayes and Bernoulli Naïve Bayes are used to build models. The performance of these algorithms is compared in terms of precision, recall, F1-score, false positive (FP) rate, error rate and accuracy. It is observed that SVM along with the RBSA and SBFS algorithms gives the highest accuracy of 97% and the lowest FP rate of 0.01. For comparison with the existing methodology, SVM without RBSA and SBFS gives 95% accuracy.
The Rule Based Subject Analysis (RBSA) algorithm is also applied to the subject feature of emails of different user groups.
User groups such as Researchers, Students, Business, IT Professionals, Professors, Readers and Customers are considered for evaluating the proposed algorithm. The Weight Distribution on User Group (WDUG) algorithm has been proposed to compute the weight of each email, which is then used to classify emails. A real dataset is used. The proposed algorithm works on the probability of spam words for each user group. The probabilities of spam and ham words for each user group are computed using the Find_Probability_Spam( ) and Find_Probability_Ham( ) algorithms. The output of the WDUG algorithm, along with the associated labels, is fed as input to various machine learning algorithms for classification. Multinomial Naïve Bayes, Gaussian Naïve Bayes, Support Vector Machine, Logistic Regression and Random Forest are applied to classify emails as spam or ham. The classification performance of all machine learning algorithms is compared with that of the RBSA algorithm.
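The abstract names Find_Probability_Spam( ) and Find_Probability_Ham( ) but does not define them, nor the WDUG weighting formula. The sketch below is therefore only one plausible reading: it assumes a word's probability is the fraction of a group's spam (or ham) emails that contain it, and that an email's weight is the summed spam probability minus the summed ham probability of its words. All function names and the toy data are hypothetical.

```python
from collections import Counter


def word_probabilities(emails):
    """Map each word to the fraction of the given emails that contain it."""
    counts = Counter()
    for text in emails:
        counts.update(set(text.lower().split()))
    total = len(emails)
    return {word: n / total for word, n in counts.items()} if total else {}


def group_tables(labelled_emails):
    """Build per-group spam and ham word-probability tables.

    labelled_emails: iterable of (group, label, text) tuples,
    where label is "spam" or "ham".
    """
    buckets = {}
    for group, label, text in labelled_emails:
        buckets.setdefault(group, {"spam": [], "ham": []})[label].append(text)
    return {
        group: {label: word_probabilities(texts) for label, texts in by_label.items()}
        for group, by_label in buckets.items()
    }


def email_weight(text, group, tables):
    """Assumed weighting: spam probability minus ham probability, summed over words."""
    spam, ham = tables[group]["spam"], tables[group]["ham"]
    words = set(text.lower().split())
    return sum(spam.get(w, 0.0) - ham.get(w, 0.0) for w in words)


# Toy data for illustration only; it is not drawn from the thesis dataset.
toy_emails = [
    ("Students", "spam", "free scholarship offer"),
    ("Students", "spam", "free laptop offer"),
    ("Students", "ham", "lecture notes attached"),
    ("Business", "ham", "free trial invoice"),
]
toy_tables = group_tables(toy_emails)
```

With this toy data the same word can carry a different weight per group: "free" counts toward spam for Students but toward ham for Business. The resulting per-email weight, together with its label, is the kind of numeric feature that could then be passed to a classifier.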
Performance metrics such as precision, recall, F1-score, FP rate, error rate and accuracy have been measured for all algorithms. It is observed that WDUG with Random Forest yields 91% and WDUG with Multinomial Naïve Bayes yields 93.26% accuracy on the real dataset.
A Language Pack-based Tuned Transformer Language (LPTTL) framework has been proposed to detect phishing emails by analysing the structure and content of the email body text. The LPTTL framework consists of cacography algorithms, namely Language Pack Tuned Bidirectional Encoder Representations from Transformers (LPT-BERT) and Language Pack Tuned Text-to-Text Transfer Transformer (LPT-T5), which detect phishing attacks through tokenization of the embedded email body text. These cacography algorithms have been tuned using various language packs to find cacographic (misspelled) words in the email body. These words are corrected by the BERT and T5 models, and word embeddings are created. Vectors are created from these word embeddings and fed as inputs to a Recurrent Neural Network (RNN). The proposed classification algorithms, Lyrebird Optimization Algorithm-Long Short-Term Memory (LOA-LSTM), Hippopotamus Optimization-Gated Recurrent Unit (HO-GRU) and Meerkat Optimization Algorithm-Bidirectional Long Short-Term Memory (MOA-BiLSTM), are used for phishing email classification.
The LPTTL framework achieves an accuracy of 95.47%, precision of 96.8%, recall of 95.63% and F1-score of 96.21% with LPT-T5-HO-GRU, which is the highest when compared with the LSTM and BiLSTM variants.
An efficient framework for email phishing attack detection using the AdaBoost classification algorithm has been proposed and tested on a Kaggle dataset. After preprocessing, a document-term matrix is built to convert textual features into numerical values. The Ensemble Adaboost Classification Algorithm (EACA) is applied to the document-term matrix for classification. The performance of EACA is compared with that of an Artificial Neural Network, a Recurrent Neural Network and Long Short-Term Memory. It is observed that EACA attains the highest accuracy of 98.84%.
Distributed Denial-of-Service (DDoS) attacks have become a critical issue in cybersecurity. They can lead to a temporary or even prolonged loss of service for users. These attacks mainly target e-commerce platforms, online services and financial institutions. Because they cause serious disruption, DDoS attacks need to be detected.
Although various mechanisms are available to detect DDoS attacks, such attacks remain prevalent on the Internet. The features and characteristics that differentiate malicious traffic from legitimate activity can be analysed using machine learning models, rule-based detection and statistical techniques. In this work, a Principal Component Analysis (PCA) based Enhanced DDoS Attack Detection (EDAD) methodology has been proposed for detecting attacks in network traffic datasets. PCA has been applied to the datasets for selecting important features. The machine learning algorithms Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), K-Nearest Neighbours (KNN) and Decision Tree (DT) have been used. Three datasets, namely CICIDS2017, CICIDS2018 and CICDDoS-2019, have been used to evaluate the performance of EDAD. Various performance metrics have also been studied.
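The metrics reported throughout this abstract (precision, recall, F1-score, FP rate, error rate and accuracy) all follow from the four counts of a binary confusion matrix. The sketch below uses the standard definitions and assumes the positive class is the one being detected (spam, phishing or attack traffic). The function name and the example counts are illustrative and are not results from the thesis.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp/fp: true/false positives, tn/fn: true/false negatives.
    Any ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fp + tn + fn

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fp_rate": ratio(fp, fp + tn),
        "error_rate": ratio(fp + fn, total),
        "accuracy": ratio(tp + tn, total),
    }


# Illustrative counts only; these are not taken from the thesis.
example = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Note that accuracy and error rate always sum to 1, and that the FP rate is measured against the legitimate (negative) class only, which is why a classifier can combine high accuracy with a very low FP rate.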
Recommended Citation
S, Abiramasundari Ms, "Design and Development of Machine Learning and Deep Learning based Algorithms for Cyberattacks Detection" (2025). Theses and Dissertations. 177.
https://knowledgeconnect.sastra.edu/theses/177