Abstract
This paper presents an extensive empirical evaluation of memory-based learning in the context of anti-spam filtering, a novel cost-sensitive application of text categorization that attempts to identify automatically unsolicited commercial messages that flood mailboxes. Focusing on anti-spam filtering for mailing lists, a thorough investigation of the effectiveness of a memory-based anti-spam filter is performed using a publicly available corpus. The investigation includes different attribute and distance-weighting schemes, and studies on the effect of the neighborhood size, the size of the attribute set, and the size of the training corpus. Three different cost scenarios are identified, and suitable cost-sensitive evaluation functions are employed. We conclude that memory-based anti-spam filtering for mailing lists is practically feasible, especially when combined with additional safety nets. Compared to a previously tested Naive Bayes filter, the memory-based filter performs on average better, particularly when the misclassification cost for non-spam messages is high.
Article PDF
Similar content being viewed by others
References
Aha WD(1992)Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms. International Journal of Man-Machine Studies, 36:267-287.
Aha WD, Kibler D and Albert MK (1991) Instance-based learning algorithms. Machine Learning, 6:37-66.
Androutsopoulos I, Koutsias J, Chandrinos KV, Paliouras G and Spyropoulos CD (2000a) An evaluation of Naive Bayesian anti-spam filtering. In: Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (ECML 2000), Barcelona, Spain, pp. 9-17.
Androutsopoulos I, Koutsias J, Chandrinos KV and Spyropoulos CD(2000b)An experimental comparison of Naive Bayesian and keyword-based anti-spam filtering with encrypted personal e-mail messages. In: Proceedings of the 23rd Annual International ACMSIGIR Conference on Research and Development in Information Retrieval (SIGIR 2000), Athens, Greece, pp. 160-167.
Androutsopoulos I, Paliouras G, Karkaletsis V, Sakkis G, Spyropoulos CD and Stamatopoulos P (2000c) Learning to filter spam e-mail: A comparison of a Naive Bayesian and a memory-based approach. In: Proceedings of the Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2000), Lyon, France, pp. 1-3.
Bailey T and Jain AK(1978)A note on distance-weighted k-nearest neighbor rules, IEEE Transactions on Systems, Man, and Cybernetics, 8(4):311-313.
Cohen WW (1996) Learning rules that classify e-mail. In: Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, Palo Alto, US, pp. 18 -25.
Cohen WW and Singer Y (1999) Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems, 17(2):141-173.
Cover T and Hart P (1967) Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21-27.
Cranor LF and LaMacchia BA (1998) Spam!, Communications of ACM, 41(8):74-83.
Cristianini N and Shawe-Taylor J (2000) An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press.
Daelemans W, Van den Bosch A and Weijters A (1997) IGTREE: Using trees for compression and classification in lazy learning algorithms. Artificial Intelligence Review, 11:407-423.
Daelemans W, Zavrel J, van der Sloot K and van den Bosch A (2000) TiMBL: Tilburg Memory Based Learner, version 3.0, Reference Guide. ILK, Computational Linguistics, Tilburg University. http:/ilk.kub.nl/∼ilk/papers.
Drucker HD, Wu D and Vapnik V (1999) Support vector machines for spam categorization. IEEE Transactions On Neural Networks, 10(5):1048-1054.
Duda RO and Hart PE (1973) Bayes decision theory. Chapter 2 in Pattern Classification and Scene Analysis, John Wiley, pp. 10-43.
Dudani AS (1976) The distance-weighted k-nearest neighbor rule. IEEE Transactions on Systems, Man and Cybernetics, 6(4):325-327.
Giraud-Carrier C and Martinez RT (1995) An efficient metric for heterogeneous inductive learning applications in the attribute-value language. Intelligent Systems, pp. 341-350.
Gómez Hidalgo JM, Maña López M and Puertas Sanz E (2000) Combining text and heuristics for costsensitive spam filtering. In: Fourth Computational Natural Language LearningWorkshop, CoNLL-2000, Lisbon, Portugal, pp. 99-102.
Hall RJ (1998) How to avoid unwanted email. Communications of ACM, 41(3):88-95.
Hull D and Robertson S (2000) The TREC-8 filtering track final report. In: Proceedings of the Eighth Text Retrieval Conference (TREC-8), NIST Special Publication 500-246, pp. 35-56.
Joachims T (1997) A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In: Proceedings of ICML-97, 14th International Conference on Machine Learning, Nashville, US, pp. 143-151.
Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), Morgan Kaufmann, pp. 1137-1143.
Lang K (1995) Newsweeder: Learning to filter netnews. In: Proceedings of the 12th International Conference on Machine Learning, Stanford, CA, pp. 331-339.
Lewis D (1995) Evaluating and optimizing autonomous text classification systems. In: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), New York, pp. 246-254.
MacLeod ESJ, Luk A and Titterington DM (1987) A re-examination of the distance-weighted k-nearest neighbor classification rule. IEEE Transactions on Systems, Man, and Cybernetics, 17(4):689-696.
Mitchell TM (1997) Machine Learning. McGraw-Hill.
Pantel P and Lin D (1998) SpamCop: A spam classification and organization program. Learning for Text Categorization-Papers from the AAAI Workshop, Madison Wisconsin, pp. 95-98. AAAI Technical Report WS-98-05.
Payne TR and Edwards P (1997) Interface agents that learn: An investigation of learning issues in a mail agent interface. Applied Artificial Intelligence, 11(1):1-32.
Quinlan JR (1986) Induction of Decision Trees. Machine learning, 1(1):81-106.
Quinlan JR (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California.
Rocchio, J. (1971). Relevance feedback information retrieval. In: Salton G, ed., The Smart Retrieval System- Experiments in Automatic Document Processing, Prentice-Hall, Englewood Cliffs, NJ, pp. 313-323.
Sahami M, Dumais S, Heckerman D and Horvitz E (1998) A Bayesian approach to filtering junk e-mail. Learning for Text Categorization-Papers from the AAAI Workshop, Madison Wisconsin, pp. 55-62. AAAI Technical Report WS-98-05.
Sakkis G, Androutsopoulos I, Paliouras G, Karkaletsis V, Spyropoulos CD and Stamatopoulos P (2001) Stacking classifiers for anti-spam filtering of e-mail. In: Proceedings of the Sixth Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), Carnegie Mellon University, Pittsburgh, PA, USA, pp. 44-50.
Salton G and Buckley C (1988) Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(3):513-523.
Salton G and McGill MJ (1983) Introduction to Modern Information Retrieval. McGraw-Hill.
Schapire RE and Singer Y(2000) BoosTexter:Aboosting-based system for text categorization. Machine Learning, 39(2/3):135-168.
Sebastiani F (2001) Machine Learning in Automated Text Categorization. Revised version of Technical Report IEI-B4-31-1999, Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy.
Wettschereck D (1994) A Study of Distance-Based Machine Learning Algorithms. PhD thesis, Oregon State University.
Wettschereck D, Aha WD and Mohri T (1995) A Review and Comparative Evaluation of Feature Weighting Methods for Lazy Learning Algorithms, Technical Report AIC-95-012, Washington, DC: Naval Research Laboratory, Navy Center for Applied Research in Artificial Intelligence.
Wilson DR (1997) Advances in Instance-Based Learning Algorithms. PhD thesis, Brigham Young University.
Wilson DR and Martinez RT (1997) Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6(1):1-34.
Wolpert D (1992) Stacked generalization. Neural Networks, 5(2):241-260.
Yang Y and Pedersen JO (1997) A comparative study on feature selection in text categorization. In: Proceedings of ICML-97, 14th International Conference on Machine Learning, Nashville, US, pp. 412-420.
Zavrel J (1997) An empirical re-examination of weighted voting for k-NN. In: Proceedings of the 7th Belgian-Dutch Conference on Machine Learning (BENELEARN-97), Tilburg, The Netherlands.
Author information
Authors and Affiliations
Rights and permissions
About this article
Cite this article
Sakkis, G., Androutsopoulos, I., Paliouras, G. et al. A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists. Information Retrieval 6, 49–73 (2003). https://doi.org/10.1023/A:1022948414856
Issue Date:
DOI: https://doi.org/10.1023/A:1022948414856