Spam filtering is a classification problem. The common metrics used to measure the efficacy of a classifier are built from the following four counts:
True positives: Data points where the model predicts spam and the document is actually spam.
True negatives: Data points where the model predicts not spam and the document is actually not spam.
False positives: Data points where the model predicts spam but the document is actually not spam.
False negatives: Data points where the model predicts not spam but the document is actually spam.
- Accuracy : (True positives + True negatives) / total number of data points
While this is the most intuitive metric, it can be very misleading for highly imbalanced datasets, such as the spam filtering problem, where most emails are not spam. Say you have a corpus of 1000 emails, of which 20 are spam. An overly simplistic model could classify every email as not spam and still be 98% accurate (980 correct out of 1000), even though it catches no spam at all.
- Precision : True positives / (True positives + False positives) : Out of all emails you identified as spam, how many are really spam?
- Recall : True positives / (True positives + False negatives) : Out of all emails that are actually spam, how many have you identified as spam?
- F1 score : 2 * precision * recall / (precision + recall) : The harmonic mean of precision and recall, which seeks a balance between the two. A high F1 score implies a good classifier where both precision and recall are high. Ideal value = 1
- AUC / ROC : If we choose a model that gives the probability of a document being spam instead of a binary spam / not spam result, we can compute AUC. Typically a threshold is applied so that all documents with a score above the threshold are classified as spam and the rest as not spam. The ROC curve is obtained by plotting the false positive rate, FPR = False positives / (False positives + True negatives), on the X axis and the true positive rate, TPR (the same as recall), on the Y axis for different threshold values (each threshold value gives one point, and together these points trace the curve). The area under this curve is the AUC metric. A perfect classifier has an AUC of 1; an AUC of 0.5 means no separation capability, which is equivalent to random guessing. More on AUC : https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
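The metrics above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`confusion_counts`, `classification_metrics`, `roc_auc`) are hypothetical, labels are encoded as 1 for spam and 0 for not spam, a score at or above the threshold counts as spam, and the toy data is made up to match the 1000-email example. In practice a library such as scikit-learn would normally be used instead.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN, treating label 1 as spam and 0 as not spam."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1; undefined ratios fall back to 0.0."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """Area under the ROC curve (FPR on x, TPR on y) via the trapezoidal rule.

    Assumes y_true contains at least one spam and one non-spam label.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    # Each distinct score is tried as a threshold, from strictest to loosest.
    for th in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 0)
        points.append((fp / neg, tp / pos))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy corpus from the accuracy example: 20 spam emails out of 1000.
y_true = [1] * 20 + [0] * 980
always_not_spam = [0] * 1000  # over-simplistic model
baseline = classification_metrics(y_true, always_not_spam)
print(baseline)

# Made-up scores for four emails, used only to exercise roc_auc.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
print(toy_auc)
```

On the toy corpus, the always-not-spam model scores high on accuracy while its precision, recall and F1 are all zero, which is why accuracy alone is not enough for imbalanced data.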