Machine Learning
Data Analytics
Spam Detection

FormCheck - Spam filter development

An effective spam filter alternative to reCAPTCHA that helps website owners reduce spam messages by 99%.

author image
Carol Hsu
June 3, 2022

Photo by Mediocre Studio on Unsplash

Summary

During a software development project, I collaborated closely with developers to optimize a spam detection system using machine learning techniques. After preprocessing 62,000 email metadata records with SQL and Python, I used cross-validation and a confusion matrix to identify Random Forest as the most effective model. The chosen model achieved an 80% reduction in the error rate and 99% accuracy in classifying spam messages. This effort not only enhanced the performance of the spam detection system but also highlighted the potential for similar improvements in other systems that use machine learning techniques.

Background

Web forms have become a new target for spammers and malware. Spammers often send bulk messages that can overwhelm an inbox, steal data, or harm websites by planting bugs or displaying harmful messages on web pages. These outcomes can damage a business's reputation and digital presence and cause financial loss.

Challenge

FormCheck is a spam classifier that detects spam form submissions on websites using a rule-based approach. Each rule is measured, and a score is calculated from statistical calculations. However, the system has a high false-positive rate due to manual rule modifications, and it needs a more efficient way to handle spam messages as data volume grows.

Approach

To improve the efficiency of the spam filter, I decided to develop a model using a machine-learning algorithm that could be incorporated into the existing rule-based system. A literature review of relevant studies was conducted to find suitable algorithms. Next, a series of data analytics techniques were applied to uncover spam patterns and insights from the current features. Finally, important features that could improve the performance of the spam filter were listed, along with recommendations for further adjustments.

Here is the process of developing the spam filter:

  1. Evaluated Data Quality and Current System Performance: assessed the quality of the data and the existing system's performance. Based on my analysis, I developed an action plan to prepare the data for model fitting.
  2. Data Preprocessing & Feature Engineering: cleaned and preprocessed a dataset of 62,000 email metadata records using SQL, and applied regex and conditional formatting to generate new features based on the literature review and secondary research.
  3. Data Analysis: conducted graphical analysis (Python & R) to check data distributions and find meaningful features for modelling.
  4. Automation with Python: designed a Python script that automates the data cleaning tasks, resulting in a 50% reduction in work time.
  5. Building Classification Models & Evaluation: in R, constructed classification models using Logistic Regression, Decision Tree, and Random Forest techniques, and used a confusion matrix, cross-validation, ROC, and AUROC curves to assess the classifiers.
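The cross-validation and confusion-matrix evaluation in the final step can be sketched in Python with scikit-learn; this is a hypothetical minimal example on synthetic data (the original models were built in R, and the real email metadata is not public):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the email-metadata features.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Cross-validation on the training split ...
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)

# ... and a confusion matrix on the held-out split.
clf.fit(X_train, y_train)
cm = confusion_matrix(y_test, clf.predict(X_test))
print(round(cv_scores.mean(), 3))
print(cm)
```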

Data Source

There are 60,021 rows and 19 columns in the original dataset, and the target variable, label, is binary with two classes: spam and ham. The image below shows the distribution of spam and ham messages in the original dataset.

label
Target variable: Label
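The class distribution above can be tallied in a few lines; a minimal Python sketch with illustrative counts (the real label column is not public, so the spam/ham split here is invented for demonstration):

```python
from collections import Counter

# Illustrative labels standing in for the 60,021-row target column.
labels = ["spam"] * 35000 + ["ham"] * 25021

counts = Counter(labels)
total = sum(counts.values())
for cls, n in counts.items():
    print(f"{cls}: {n} ({n / total:.1%})")
```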

Data Analysis

Header Analysis

As suggested by Bhowmick and Hazarika (2016), Sheu et al. (2016), and Arras, Horn, Montavon, Müller, and Samek (2017), header information is an essential feature, as it provides sender data that can significantly impact the performance of spam recognition. Moreover, header analysis is simpler to implement than language processing and text tokenization.

  • Server protocol: The usage of HTTP/1.0 server protocol is not common nowadays, making it a critical indicator for spam detection. Web forms with server_protocol values of HTTP/1.0, HTTP/1.1, and unknown tend to have a higher occurrence of spam, while 95% of legitimate users have server_protocol values of HTTP/2.0. The image below illustrates the number of web forms in each class and their corresponding server_protocol values.
server_protocol
Predict variable: server_protocol
  • Accept-Language: The "accept_language" field in the header refers to the browser language used by the user. In spam messages, this field often does not exist. When "accept_language = 0" is observed in the metadata, it indicates a 100% likelihood of spam. This can be a strong indicator for detecting spam web forms. On the other hand, when "accept_language = 1" is present, it indicates a 59% likelihood of spam and a 41% likelihood of ham (legitimate messages). The image below presents the number of web forms in each class and their corresponding "accept_language" values.
accept_language
Predict variable: accept_language
  • Cookies: The "is_cookie" field in the request header refers to the presence of stored HTTP cookies. In the majority of legitimate message headers, this field can be found. However, if "is_cookie" is not present in the header field, the message is likely spam. Almost 100% of cases where "is_cookie = 0" indicate spam, which can be used as a reliable indicator for separating forms. The image below displays the number of web forms in each class and their corresponding "is_cookie" values.
  • flag_count: refers to the number of spam checks applied to each observation. As seen in the graph below, the histogram on the left shows that flag counts range from 4 to 14; in contrast, flag counts in spam messages centre around 23, spread widely from 3 to 25. Moreover, the boxplot on the right shows the quantile distribution of flag_count. Once a web form triggers more than 14 rules, it is more likely to be spam; therefore, flag_count can be a good indicator for detecting spam forms.
Predict variable: flag_count
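Header-based features like those above can be derived with simple presence checks; a hypothetical sketch of the feature engineering (the field names follow the article, but the parsing logic is an assumption):

```python
def header_features(headers: dict) -> dict:
    """Derive binary header features from a parsed request-header dict."""
    proto = headers.get("server_protocol", "unknown")
    return {
        # HTTP/2.0 dominates legitimate traffic; older protocols lean spam.
        "is_http2": int(proto == "HTTP/2.0"),
        # Spam submissions often omit the Accept-Language header entirely.
        "accept_language": int(bool(headers.get("accept_language"))),
        # Legitimate browsers usually carry stored cookies.
        "is_cookie": int(bool(headers.get("cookie"))),
    }

legit = {"server_protocol": "HTTP/2.0", "accept_language": "en-US", "cookie": "sid=abc"}
spammy = {"server_protocol": "HTTP/1.0"}
print(header_features(legit))   # all features 1
print(header_features(spammy))  # all features 0
```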

Behaviour Analysis

Analyzing sender behaviour is a spam detection technique that involves identifying spam patterns based on user activity, past activity, and user connections, using features such as name, phone, and email. The occurrence value of each categorical feature is calculated as a percentage, and the results are presented in a boxplot below, illustrating the distribution of occurrence values for each category.

continuous variables
Box Plot: Continuous variables
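The occurrence values described above can be computed per submission; a minimal stdlib sketch (the data and feature are illustrative, not from the real dataset):

```python
from collections import Counter

def occurrence_pct(values):
    """For each submission, the share of all submissions sharing its value."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

# Illustrative sender emails: the spammer reuses one address.
emails = ["spam@x.com", "spam@x.com", "spam@x.com", "a@b.com", "c@d.com"]
print(occurrence_pct(emails))  # [0.6, 0.6, 0.6, 0.2, 0.2]
```

High occurrence values for name, email, or hostname then surface as the spam-leaning distributions seen in the box plot.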

Based on the density graph analysis, the following observations can be made:

  • The number of flags applied to spam messages varies widely, ranging from 0 to 25. This indicates that spammers may use multiple techniques to try to evade detection.
  • Spammers tend to be active throughout the day, with no distinct patterns of activity. In contrast, legitimate users are more active during typical business hours, from 7 am to 8 pm. This may suggest that spammers operate continuously, while legitimate users tend to use web forms during regular working hours.
  • The occurrence of certain features such as name, domain name, and hostname tends to be higher in spam web forms compared to legitimate users. This may indicate that spammers tend to reuse certain information across multiple web forms, resulting in higher occurrences of these features in spam messages.
  • In contrast, legitimate users show fewer repetitions across all activity features, indicating that they may only use contact forms a few times to reach out to a web host. Spammers, on the other hand, are more likely to repeatedly send junk messages, resulting in higher occurrences of certain activity features.

These observations suggest that analyzing sender behaviour, including activity patterns and feature occurrences, can be a valuable technique for detecting spam web forms and identifying spammer behaviour.

density
Density Plot: Continuous variables

Model Building

The table below summarises the classification algorithms selected for this research.

A comparison of selected classification algorithms

Feature Selection with Boruta

When building a machine learning model, having too many features can lead to issues such as the curse of dimensionality, increased memory usage, longer processing times, and higher computational requirements. In this case, Boruta (Kursa, 2020) is applied, as it works with both categorical and numeric predictors and provides better outcomes than using a correlation matrix alone. It can also serve as a supplementary method for selecting important features to use when fitting the logistic regression model.

Boruta
Feature Selection: Boruta
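Boruta's core idea is to compare each real feature's importance against shuffled "shadow" copies that carry no signal. A simplified single-pass sketch in Python (the original work used the Boruta R package; this is not the full iterative algorithm):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

# Shadow features: each column shuffled independently, destroying any signal.
shadows = rng.permuted(X, axis=0)
X_aug = np.hstack([X, shadows])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_aug, y)
real_imp = rf.feature_importances_[:6]
shadow_max = rf.feature_importances_[6:].max()

# Keep features that beat the best shadow (one Boruta round).
keep = [i for i, imp in enumerate(real_imp) if imp > shadow_max]
print(keep)
```

Full Boruta repeats this comparison over many rounds with statistical testing before confirming or rejecting each feature.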

Fitting Models

Experiment 1: Logistic Regression with three feature-selection methods

  • Model 1 Logit.all: fit a full model with all features
  • Model 2 Logit.82: use p-values to select features that are statistically significant
  • Model 3 Logit.73: use the drop1 method to reach the lowest AIC
  • Model 4 Logit.b72: use Boruta to select features for the logistic regression model

The table below compares the three methods with the complete model, Logit.all.

Logistic Regression Models comparison

AIC is used to select the logistic model; fewer features are preferred for modelling efficiency, so Logit.b72 is chosen to compete with the tree-based models.
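AIC trades goodness of fit against model size, AIC = 2k - 2 ln L, so a smaller model can win despite a slightly worse fit. A minimal sketch with illustrative numbers (not the actual model likelihoods):

```python
def aic(k, log_likelihood):
    """Akaike Information Criterion: 2 * parameters - 2 * log-likelihood."""
    return 2 * k - 2 * log_likelihood

# Hypothetical: the 72-feature model fits slightly worse but is smaller.
full_model = aic(k=82, log_likelihood=-1500.0)
reduced = aic(k=72, log_likelihood=-1505.0)
print(full_model, reduced)  # 3164.0 3154.0 -> the reduced model wins
```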

Experiment 2: Decision Tree

The decision tree algorithms tree(), rpart(), and ctree() are used at this stage. The complexity parameter (CP) value with the lowest cross-validation error is chosen to create an optimal decision tree. The predictors used as terminal nodes for decision-making in this model are server_protocol, MESSAGE_URL_SPAM, accept_language, is_cookies, and IP_REPUTATION.

pruning_tree
Classification Tree Pruning
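rpart's CP tuning has a close analogue in scikit-learn's cost-complexity pruning; a hypothetical sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=1)

# Candidate complexity parameters, analogous to rpart's CP table.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)

# Pick the alpha with the best cross-validated accuracy (lowest CV error).
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas[:-1]:  # the last alpha prunes down to the root
    score = cross_val_score(
        DecisionTreeClassifier(ccp_alpha=alpha, random_state=1), X, y, cv=5
    ).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score
print(best_alpha, round(best_score, 3))
```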

Experiment 3: Random Forest

The RF model initially fits all predictors with default settings using the randomForest() package. Tuning the number of trees is crucial for improving the false-positive rate. A significant drop in error rate is observed at 100 trees, with no further improvement beyond 300 trees. The tuneRF() function and the mtry parameter are used to tune the model.
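The tree-count tuning described above can be reproduced with out-of-bag error curves; a hypothetical scikit-learn sketch on synthetic data (the original work used randomForest and tuneRF in R):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=800, n_features=10, random_state=2)

# OOB error vs. forest size: the curve typically flattens after a few hundred trees.
oob_errors = {}
for n in (50, 100, 300):
    rf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                random_state=2).fit(X, y)
    oob_errors[n] = 1 - rf.oob_score_
print(oob_errors)
```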

rf_pruning
Random Forest Pruning

Confusion_matrix
Confusion Matrix

Result

Three models, Logistic Regression, Decision Tree, and Random Forest, are evaluated on accuracy, error rate, and false-positive rate.

Evaluation of FormCheck and the proposed classifiers' performance

Models Evaluation
cv-test
Cross Validation

Identifying Important Features for Spam Filtering

Here, I used the varImpPlot() function to output the essential features of the proposed models. Variable importance is measured by the mean decrease in accuracy, where lower values indicate a smaller contribution to the model. The mean decrease in Gini measures the purity of the terminal nodes of the trees; higher values indicate a greater contribution of the feature to decision-making.
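Both importance measures have scikit-learn analogues: Gini-based importances correspond to R's MeanDecreaseGini, and permutation importance approximates MeanDecreaseAccuracy. A hypothetical sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=3)
rf = RandomForestClassifier(n_estimators=200, random_state=3).fit(X, y)

# Gini-based importance (MeanDecreaseGini analogue) ...
gini_imp = rf.feature_importances_
# ... and accuracy-based importance via permutation (MeanDecreaseAccuracy analogue).
perm_imp = permutation_importance(rf, X, y, n_repeats=5,
                                  random_state=3).importances_mean
print(gini_imp.round(3), perm_imp.round(3))
```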

variable_importance
Variable Importance

The results of the variable importance measures for the proposed models

Variable Importance

The features used in the logistic regression differ considerably from those in the other two models, which would require further investigation and experiments with different combinations of variables. However, this approach confirms that the variables server_protocol, is_cookies, accept_language, ACCEPT_LANG_NULL, flag_count, and MESSAGE_URLS are essential for spam detection in this research.

Conclusion & Recommendation

This research shows improved performance of all proposed classifiers in terms of accuracy, error rate, and specificity, validating the effectiveness of a machine-learning-based spam filter. The Random Forest model exhibited the best performance among the tested classifiers. However, as spam techniques evolve, more advanced methods and additional influential factors may need to be considered. Three recommendations follow:

  • Establish data collection and processing strategies: During the data preprocessing stage, we identified missing data fields caused by the lack of guidelines for data collection. It is therefore essential to establish clear guidelines and data collection strategies to improve the data quality of the spam filter system.
  • Adjust essential features and rules: We found that header features such as server_protocol, is_cookies, accept_language, ACCEPT_LANG_NULL, flag_count, and MESSAGE_URLS play significant roles in spam recognition, consistent with the literature and previous research. These features are recommended additions to the existing rule-based system. However, caution should be taken to address data overlap and collinearity with existing rules.
  • Explore alternative ML methods: Although the ultimate goal of the spam filter is to output binary values, a model that can flag uncertain messages could reduce the negative impact of mislabeling legitimate users as spammers. Exploring machine learning methods such as deep learning or recursive learning techniques could therefore be ideal for handling multi-class tasks.

Reflection

In this project, I used three fundamental classification algorithms to construct classifiers. It's worth noting that other widely recognized models, such as Naïve Bayes, Support Vector Machines, and Neural Networks, are frequently employed for solving classification problems but were not included in this study. Therefore, one might question the generalizability of the research results.

Nonetheless, through this analysis, I identified five crucial features that led to an 80% reduction in the error rate of the spam system. My research significantly enhanced the project's overall effectiveness, serving as a testament to my ability to apply data science knowledge and techniques in real-world scenarios. This experience was invaluable.

Other Projects:

Are you also into Data Analytics?

Let's
Connect