During a software development project, I collaborated closely with developers to optimize a spam detection system using machine learning techniques. After preprocessing 62,000 email metadata records with SQL and Python, I used cross-validation and a confusion matrix to identify Random Forest as the most effective model. The chosen model achieved an 80% reduction in the error rate and 99% accuracy in classifying spam messages. This effort not only enhanced the performance of the spam detection system but also highlighted the potential for similar improvements in other systems that use machine learning techniques.
Web forms have also become a new target for malware. Spammers often send bulk messages that can overwhelm an inbox, steal data, or harm websites by planting bugs or displaying harmful content on web pages. These attacks can damage a business's reputation and digital presence and lead to financial loss.
FormCheck is a spam classifier that detects spam form submissions on websites using a rule-based approach. Each rule is measured, and an overall score is calculated statistically. However, the system has a high false-positive rate because the rules are modified manually, and it needs a more efficient way to handle spam messages as data volumes grow.
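To make the rule-based idea concrete, here is a minimal R sketch of how such a scorer might work. The rule names, weights, and threshold are purely illustrative assumptions, not FormCheck's actual rules.

```r
# Minimal sketch of a rule-based scorer (illustrative only; rule names,
# weights, and the threshold are assumptions, not FormCheck's real rules).
score_submission <- function(submission) {
  rules <- list(
    contains_url     = list(test = function(s) grepl("https?://", s$message), weight = 2.0),
    blank_user_agent = list(test = function(s) is.na(s$user_agent) || s$user_agent == "", weight = 1.5),
    no_cookies       = list(test = function(s) !isTRUE(s$is_cookies), weight = 1.0)
  )
  total <- sum(vapply(rules, function(r) if (r$test(submission)) r$weight else 0, numeric(1)))
  list(score = total, is_spam = total >= 2.5)  # threshold tuned manually in a rule-based system
}

score_submission(list(message = "Buy now http://spam.example", user_agent = "", is_cookies = FALSE))
```

A rule-based scorer like this is easy to reason about, but every new spam pattern means another hand-written rule and another manual threshold adjustment, which is exactly the maintenance burden described above.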
To improve the efficiency of the spam filter, I decided to develop a model using a machine-learning algorithm that could be incorporated into the existing rule-based system. A literature review of relevant studies was conducted to find suitable algorithms. Next, a series of data analytics techniques was applied to uncover spam patterns and insights from the current features. Finally, the important features that can improve the performance of the spam filter are listed, along with recommendations for further adjustments.
Here is the process of developing the spam filter:
There are 60,021 rows and 19 columns in the original dataset, and the target variable, label, is binary with two classes: spam and ham. The image below shows the distribution of spam and ham messages in the original dataset.
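As a rough illustration, the class balance can be checked with a few lines of R. The file name below is a placeholder, since the dataset itself is not public; only the column name label comes from the project.

```r
# Quick look at the class balance of the target variable
# (file path is an assumption; the dataset is not public).
library(ggplot2)

forms <- read.csv("form_submissions.csv", stringsAsFactors = TRUE)
dim(forms)                      # expect 60,021 rows and 19 columns
table(forms$label)              # counts of ham vs. spam
prop.table(table(forms$label))  # class proportions

ggplot(forms, aes(x = label)) +
  geom_bar() +
  labs(title = "Spam vs. ham distribution", x = "Class", y = "Count")
```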
As suggested by Bhowmick and Hazarika (2016), Sheu et al. (2016), and Arras, Horn, Montavon, Muller, and Samuel (2017), header information is an essential feature, as it provides sender data that can significantly impact the performance of spam recognition. Moreover, header analysis is relatively simple to implement compared to language processing and text tokenization.
Analyzing sender behaviour is a spam detection technique that identifies spam patterns from current and past user activity and user connections, using features such as name, phone, and email. The occurrence value of each categorical feature is calculated as a percentage, and the results are presented in the boxplot below, which illustrates the distribution of occurrence values for each category.
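A minimal sketch of this occurrence calculation in R, continuing from the data-loading snippet above; the sender features listed (name, phone, email) come from the text, but the exact preprocessing is an assumption.

```r
# Sketch of the "occurrence value" analysis: for each categorical sender
# feature, compute how often each distinct value occurs (as a percentage of
# submissions), then compare the distributions with a boxplot.
library(dplyr)
library(tidyr)
library(ggplot2)

sender_features <- c("name", "phone", "email")   # illustrative feature set

occurrence <- forms %>%
  select(all_of(sender_features)) %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
  count(feature, value, name = "n") %>%
  group_by(feature) %>%
  mutate(occurrence_pct = 100 * n / sum(n)) %>%
  ungroup()

ggplot(occurrence, aes(x = feature, y = occurrence_pct)) +
  geom_boxplot() +
  labs(x = "Feature", y = "Occurrence (%)",
       title = "Distribution of occurrence values per categorical feature")
```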
Based on the density graph analysis, the following observations can be made:
These observations suggest that analyzing sender behaviour, including activity patterns and feature occurrences, can be a valuable technique for detecting spam form submissions and profiling spammer behaviour.
The table below summarises the selected classification algorithms for this research.
When building a machine learning model, having too many features can lead to issues such as the curse of dimensionality, increased memory usage, longer processing times, and higher computational requirements. In this case, Boruta (Miron, 2020) is applied because it works with both categorical and numeric predictors and provides better outcomes than a correlation matrix alone. It also serves as a supplementary method for selecting the important features used to fit the logistic regression model.
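Below is a hedged sketch of how the Boruta package can be applied to this kind of dataset; the seed and maxRuns setting are illustrative defaults rather than the values used in the project.

```r
# Boruta feature selection on the full feature set (sketch; settings are
# illustrative, not the project's exact configuration).
library(Boruta)

set.seed(42)
boruta_out <- Boruta(label ~ ., data = forms, doTrace = 1, maxRuns = 100)
print(boruta_out)

# Resolve any attributes still marked "tentative", then keep the confirmed ones
boruta_final      <- TentativeRoughFix(boruta_out)
selected_features <- getSelectedAttributes(boruta_final, withTentative = FALSE)
selected_features
```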
The table below compares three methods with the complete model, Logit.all.
AIC is used to select a logistic model; fewer features are preferred for modelling efficiency, so Logit.b72 is chosen to compete with the tree-based models.
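The comparison could be reproduced roughly as follows. The train/test split, the seed, and the use of the Boruta-selected features as the reduced model are assumptions; Logit.all mirrors the full model named in the table, and the reduced model stands in for a smaller candidate such as Logit.b72.

```r
# Hold-out split (assumed 70/30) plus full and reduced logistic models,
# compared by AIC. `selected_features` comes from the Boruta sketch above.
set.seed(42)
idx   <- sample(seq_len(nrow(forms)), size = round(0.7 * nrow(forms)))
train <- forms[idx, ]
test  <- forms[-idx, ]

logit.all     <- glm(label ~ ., data = train, family = binomial)
logit.reduced <- glm(reformulate(selected_features, response = "label"),
                     data = train, family = binomial)

AIC(logit.all, logit.reduced)   # lower AIC with fewer predictors is preferred
```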
Decision Tree Algorithms
The decision tree functions tree(), rpart(), and ctree() are utilized in this stage. The complexity parameter (CP) value with the lowest cross-validation error is used to prune an optimal decision tree. The predictors used at the decision nodes of this model are server_protocol, MESSAGE_URL_SPAM, accept_language, is_cookies, and IP_REPUTATION.
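A sketch of growing and pruning one of these trees with rpart(); the formula simply reuses the predictors listed above, and the CP-based pruning follows the standard rpart workflow rather than the project's exact settings.

```r
# Grow a classification tree, then prune it at the CP value with the lowest
# cross-validation error (sketch; formula reuses the predictors named above).
library(rpart)

tree_fit <- rpart(label ~ server_protocol + MESSAGE_URL_SPAM + accept_language +
                    is_cookies + IP_REPUTATION,
                  data = train, method = "class")

printcp(tree_fit)  # cross-validation error (xerror) for each CP value

best_cp     <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_fit, cp = best_cp)
```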
The RF model initially fits all predictors with the default settings of the randomForest() function. Choosing the number of trees is crucial for improving the false-positive rate: a significant drop in the error rate is observed by around 100 trees, with no further improvement beyond 300 trees. The tuneRF() function is then used to tune the mtry parameter of the model.
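The random forest workflow described above could look roughly like this; ntreeTry, stepFactor, and improve are common defaults, not necessarily the values used in the project.

```r
# Fit a random forest with defaults, inspect the error curve against the
# number of trees, then tune mtry with tuneRF() (sketch).
library(randomForest)

set.seed(42)
rf_fit <- randomForest(label ~ ., data = train, ntree = 500, importance = TRUE)
plot(rf_fit)   # OOB error drops sharply early on and flattens before ~300 trees

# Tune mtry (number of predictors tried at each split) around the default value
tuned <- tuneRF(x = train[, setdiff(names(train), "label")],
                y = train$label,
                ntreeTry = 300, stepFactor = 1.5, improve = 0.01,
                trace = TRUE, doBest = TRUE)
```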
Three models, Logistic Regression, Decision Tree, and Random Forest, are evaluated on accuracy, error rate, and false-positive rate.
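A sketch of the evaluation step, assuming the hold-out split and fitted forest from the earlier snippets; "spam" is treated as the positive class, so the false-positive rate is taken as one minus specificity.

```r
# Confusion matrix on held-out data and the three reported metrics (sketch).
library(caret)

pred <- predict(rf_fit, newdata = test)
cm   <- confusionMatrix(pred, test$label, positive = "spam")

accuracy   <- unname(cm$overall["Accuracy"])
error_rate <- 1 - accuracy
fpr        <- 1 - unname(cm$byClass["Specificity"])   # FPR = 1 - specificity

c(accuracy = accuracy, error_rate = error_rate, fpr = fpr)
```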
Here, I used the varImpPlot() function to output the essential features of the proposed models. Variable importance is measured by the mean decrease in accuracy, where lower values indicate a smaller contribution to the model. The mean decrease in Gini measures the purity of the terminal nodes of the trees, with higher values indicating a greater impact of the feature on decision-making.
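For reference, the importance plot can be produced directly from the forest fitted earlier (provided it was trained with importance = TRUE, as in the sketch above).

```r
# Plot and print variable importance for the fitted random forest.
varImpPlot(rf_fit, main = "Variable importance (Random Forest)")
importance(rf_fit)   # mean decrease in accuracy and in Gini per predictor
```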
The features used in the logistic regression are very different from those used in the other two models. This would require further investigation and experiments with different combinations of variables. However, this approach allows us to confirm that the variables server_protocol, is_cookies, accept_language, ACCEPT_LANG_NULL, flag_count, and MESSAGE_URLS are essential for spam detection in this research.
This research shows improved performance of all proposed classifiers in terms of accuracy, error rate, and specificity, validating the effectiveness of a machine learning-based spam filter. The random forest model exhibited the best performance among the tested classifiers. However, as spam detection techniques evolve, advanced techniques and influential factors may need to be considered. Three recommendations are provided as follows:
In this project, I used three fundamental classification algorithms to construct classifiers. It's worth noting that other widely recognized models, such as Naïve Bayes, Support Vector Machines, and Neural Networks, are frequently employed for solving classification problems but were not included in this study. Therefore, one might question the generalizability of the research results.
Nonetheless, through this analysis, I identified five crucial features that led to an 80% reduction in the error rate of the spam detection system. My research significantly enhanced the project's overall effectiveness, serving as a testament to my ability to apply data science knowledge and techniques in real-world scenarios. This experience was invaluable.