Volume 18, Number 1
Optimized Naive Bayes for Phishing Website Detection using Hybrid TF–IDF and Character-Level URL Features
Authors
Hieu Ngo Van, Tin Trinh Quang, Phuong Nguyen Thi Thanh, Huong Mai Quoc and Dung Nguyen Thi Thuy, Duy Tan University, Vietnam
Abstract
The rapid growth of phishing websites poses significant challenges to users and online systems, particularly because many existing detection approaches rely on webpage content analysis or computationally expensive deep learning models. This paper proposes a lightweight phishing URL detection method that integrates token-level Term Frequency–Inverse Document Frequency (TF–IDF) features and character-level n-grams within a Multinomial Naive Bayes classifier. The proposed approach is evaluated on three public datasets: the UCI Phishing Website dataset, PhishTank, and URLNet. Experimental results show that the model achieves F1-scores ranging from 0.904 to 0.940 when trained and tested on individual datasets, indicating robust detection performance. When the three datasets are merged for training, the model attains an F1-score of 0.931, with Recall improving by 2.0 percent compared to the average single-dataset results, reflecting enhanced generalization across diverse data sources. Because the method is lightweight, it classifies URLs quickly and can be deployed in practice under hardware constraints.
Keywords
Phishing detection; URL analysis; Machine learning; Naive Bayes; TF–IDF
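
As an illustration of the approach summarized in the abstract, the following is a minimal sketch of how token-level TF–IDF features and character-level n-grams could be combined in a Multinomial Naive Bayes classifier. It is not the authors' implementation. The n-gram range (2 to 4), the smoothing value alpha, the smoothed IDF formula, the L2 normalization, the helper names (extract_features, TfidfNaiveBayes), and the toy URLs are all assumptions made for this example; the paper's actual settings and data are not given here.

```python
import math
import re
from collections import Counter, defaultdict


def extract_features(url, n_min=2, n_max=4):
    """Bag of token-level and character n-gram features for one URL.

    The n-gram range 2..4 is an assumption, not a value from the paper.
    """
    url = url.lower()
    feats = Counter()
    # Token-level features: alphanumeric runs split on URL delimiters.
    for tok in re.findall(r"[a-z0-9]+", url):
        feats["tok:" + tok] += 1
    # Character-level features: all substrings of length n_min..n_max.
    for n in range(n_min, n_max + 1):
        for i in range(len(url) - n + 1):
            feats["chr:" + url[i:i + n]] += 1
    return feats


class TfidfNaiveBayes:
    """Multinomial Naive Bayes over TF-IDF weights (hypothetical helper)."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha  # additive smoothing; value chosen for illustration

    def _tfidf(self, bag):
        # Weight each known feature by tf * idf, then L2-normalize.
        vec = {f: tf * self.idf[f] for f, tf in bag.items() if f in self.idf}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        return {f: v / norm for f, v in vec.items()}

    def fit(self, urls, labels):
        bags = [extract_features(u) for u in urls]
        n_docs = len(bags)
        df = Counter()
        for bag in bags:
            df.update(bag.keys())
        # Smoothed inverse document frequency (one common variant).
        self.idf = {f: math.log((1 + n_docs) / (1 + c)) + 1.0
                    for f, c in df.items()}
        self.classes = sorted(set(labels))
        # Accumulate TF-IDF mass per feature and class.
        weight = {c: defaultdict(float) for c in self.classes}
        for bag, y in zip(bags, labels):
            for f, w in self._tfidf(bag).items():
                weight[y][f] += w
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            self.log_prior[c] = math.log(list(labels).count(c) / n_docs)
            total = sum(weight[c].values()) + self.alpha * len(self.idf)
            self.log_like[c] = {
                f: math.log((weight[c][f] + self.alpha) / total)
                for f in self.idf
            }
        return self

    def predict(self, url):
        vec = self._tfidf(extract_features(url))
        scores = {
            c: self.log_prior[c]
            + sum(w * self.log_like[c][f] for f, w in vec.items())
            for c in self.classes
        }
        return max(scores, key=scores.get)


# Toy, made-up URLs for illustration only (1 = phishing, 0 = legitimate).
urls = [
    "http://secure-login.bank-verify.test/account/update",
    "http://paypa1-security.test/login/confirm",
    "http://192.0.2.10/signin/verify",
    "https://www.example.org/docs/index.html",
    "https://news.example.com/articles/today",
    "https://shop.example.net/products/list",
]
labels = [1, 1, 1, 0, 0, 0]

model = TfidfNaiveBayes().fit(urls, labels)
print(model.predict("http://account-verify.test/login"))
print(model.predict("https://www.example.org/articles/index.html"))
```

Both feature families share one vocabulary here, distinguished by the "tok:" and "chr:" prefixes, so a single classifier sees them jointly. Training and prediction reduce to counting and a weighted sum of log-probabilities, which is consistent with the low computational cost the abstract emphasizes. In practice an established library implementation would normally be preferred over this hand-written version.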
