A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists

Sakkis, Georgios; Androutsopoulos, Ion; Paliouras, Georgios; Karkaletsis, Vangelis; Spyropoulos, Constantine D.; Stamatopoulos, Panagiotis

doi:10.1023/A:1022948414856

A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists

Published: January 2003

Volume 6, pages 49–73, (2003)
Cite this article

Download PDF

Information Retrieval Aims and scope Submit manuscript

A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists

Download PDF

Georgios Sakkis¹,
Ion Androutsopoulos²,
Georgios Paliouras¹,
Vangelis Karkaletsis¹,
Constantine D. Spyropoulos¹ &
…
Panagiotis Stamatopoulos²

449 Accesses
121 Citations
Explore all metrics

Abstract

This paper presents an extensive empirical evaluation of memory-based learning in the context of anti-spam filtering, a novel cost-sensitive application of text categorization that attempts to identify automatically unsolicited commercial messages that flood mailboxes. Focusing on anti-spam filtering for mailing lists, a thorough investigation of the effectiveness of a memory-based anti-spam filter is performed using a publicly available corpus. The investigation includes different attribute and distance-weighting schemes, and studies on the effect of the neighborhood size, the size of the attribute set, and the size of the training corpus. Three different cost scenarios are identified, and suitable cost-sensitive evaluation functions are employed. We conclude that memory-based anti-spam filtering for mailing lists is practically feasible, especially when combined with additional safety nets. Compared to a previously tested Naive Bayes filter, the memory-based filter performs on average better, particularly when the misclassification cost for non-spam messages is high.

Article PDF

E-Mail Spam Filtering: A Review of Techniques and Trends

The Improved Bayesian Algorithm to Spam Filtering

Spam E-Mail Classification Based on the IFWB Algorithm

References

Aha WD(1992)Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms. International Journal of Man-Machine Studies, 36:267-287.
Google Scholar
Aha WD, Kibler D and Albert MK (1991) Instance-based learning algorithms. Machine Learning, 6:37-66.
Google Scholar
Androutsopoulos I, Koutsias J, Chandrinos KV, Paliouras G and Spyropoulos CD (2000a) An evaluation of Naive Bayesian anti-spam filtering. In: Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (ECML 2000), Barcelona, Spain, pp. 9-17.
Androutsopoulos I, Koutsias J, Chandrinos KV and Spyropoulos CD(2000b)An experimental comparison of Naive Bayesian and keyword-based anti-spam filtering with encrypted personal e-mail messages. In: Proceedings of the 23rd Annual International ACMSIGIR Conference on Research and Development in Information Retrieval (SIGIR 2000), Athens, Greece, pp. 160-167.
Androutsopoulos I, Paliouras G, Karkaletsis V, Sakkis G, Spyropoulos CD and Stamatopoulos P (2000c) Learning to filter spam e-mail: A comparison of a Naive Bayesian and a memory-based approach. In: Proceedings of the Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2000), Lyon, France, pp. 1-3.
Bailey T and Jain AK(1978)A note on distance-weighted k-nearest neighbor rules, IEEE Transactions on Systems, Man, and Cybernetics, 8(4):311-313.
Google Scholar
Cohen WW (1996) Learning rules that classify e-mail. In: Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, Palo Alto, US, pp. 18 -25.
Cohen WW and Singer Y (1999) Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems, 17(2):141-173.
Google Scholar
Cover T and Hart P (1967) Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21-27.
Google Scholar
Cranor LF and LaMacchia BA (1998) Spam!, Communications of ACM, 41(8):74-83.
Google Scholar
Cristianini N and Shawe-Taylor J (2000) An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press.
Daelemans W, Van den Bosch A and Weijters A (1997) IGTREE: Using trees for compression and classification in lazy learning algorithms. Artificial Intelligence Review, 11:407-423.
Google Scholar
Daelemans W, Zavrel J, van der Sloot K and van den Bosch A (2000) TiMBL: Tilburg Memory Based Learner, version 3.0, Reference Guide. ILK, Computational Linguistics, Tilburg University. http:/ilk.kub.nl/∼ilk/papers.
Drucker HD, Wu D and Vapnik V (1999) Support vector machines for spam categorization. IEEE Transactions On Neural Networks, 10(5):1048-1054.
Google Scholar
Duda RO and Hart PE (1973) Bayes decision theory. Chapter 2 in Pattern Classification and Scene Analysis, John Wiley, pp. 10-43.
Dudani AS (1976) The distance-weighted k-nearest neighbor rule. IEEE Transactions on Systems, Man and Cybernetics, 6(4):325-327.
Google Scholar
Giraud-Carrier C and Martinez RT (1995) An efficient metric for heterogeneous inductive learning applications in the attribute-value language. Intelligent Systems, pp. 341-350.
Gómez Hidalgo JM, Maña López M and Puertas Sanz E (2000) Combining text and heuristics for costsensitive spam filtering. In: Fourth Computational Natural Language LearningWorkshop, CoNLL-2000, Lisbon, Portugal, pp. 99-102.
Hall RJ (1998) How to avoid unwanted email. Communications of ACM, 41(3):88-95.
Google Scholar
Hull D and Robertson S (2000) The TREC-8 filtering track final report. In: Proceedings of the Eighth Text Retrieval Conference (TREC-8), NIST Special Publication 500-246, pp. 35-56.
Google Scholar
Joachims T (1997) A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In: Proceedings of ICML-97, 14th International Conference on Machine Learning, Nashville, US, pp. 143-151.
Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), Morgan Kaufmann, pp. 1137-1143.
Lang K (1995) Newsweeder: Learning to filter netnews. In: Proceedings of the 12th International Conference on Machine Learning, Stanford, CA, pp. 331-339.
Lewis D (1995) Evaluating and optimizing autonomous text classification systems. In: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), New York, pp. 246-254.
MacLeod ESJ, Luk A and Titterington DM (1987) A re-examination of the distance-weighted k-nearest neighbor classification rule. IEEE Transactions on Systems, Man, and Cybernetics, 17(4):689-696.
Google Scholar
Mitchell TM (1997) Machine Learning. McGraw-Hill.
Pantel P and Lin D (1998) SpamCop: A spam classification and organization program. Learning for Text Categorization-Papers from the AAAI Workshop, Madison Wisconsin, pp. 95-98. AAAI Technical Report WS-98-05.
Payne TR and Edwards P (1997) Interface agents that learn: An investigation of learning issues in a mail agent interface. Applied Artificial Intelligence, 11(1):1-32.
Google Scholar
Quinlan JR (1986) Induction of Decision Trees. Machine learning, 1(1):81-106.
Google Scholar
Quinlan JR (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California.
Google Scholar
Rocchio, J. (1971). Relevance feedback information retrieval. In: Salton G, ed., The Smart Retrieval System- Experiments in Automatic Document Processing, Prentice-Hall, Englewood Cliffs, NJ, pp. 313-323.
Google Scholar
Sahami M, Dumais S, Heckerman D and Horvitz E (1998) A Bayesian approach to filtering junk e-mail. Learning for Text Categorization-Papers from the AAAI Workshop, Madison Wisconsin, pp. 55-62. AAAI Technical Report WS-98-05.
Sakkis G, Androutsopoulos I, Paliouras G, Karkaletsis V, Spyropoulos CD and Stamatopoulos P (2001) Stacking classifiers for anti-spam filtering of e-mail. In: Proceedings of the Sixth Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), Carnegie Mellon University, Pittsburgh, PA, USA, pp. 44-50.
Google Scholar
Salton G and Buckley C (1988) Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(3):513-523.
Google Scholar
Salton G and McGill MJ (1983) Introduction to Modern Information Retrieval. McGraw-Hill.
Schapire RE and Singer Y(2000) BoosTexter:Aboosting-based system for text categorization. Machine Learning, 39(2/3):135-168.
Google Scholar
Sebastiani F (2001) Machine Learning in Automated Text Categorization. Revised version of Technical Report IEI-B4-31-1999, Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy.
Google Scholar
Wettschereck D (1994) A Study of Distance-Based Machine Learning Algorithms. PhD thesis, Oregon State University.
Wettschereck D, Aha WD and Mohri T (1995) A Review and Comparative Evaluation of Feature Weighting Methods for Lazy Learning Algorithms, Technical Report AIC-95-012, Washington, DC: Naval Research Laboratory, Navy Center for Applied Research in Artificial Intelligence.
Google Scholar
Wilson DR (1997) Advances in Instance-Based Learning Algorithms. PhD thesis, Brigham Young University.
Wilson DR and Martinez RT (1997) Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6(1):1-34.
Google Scholar
Wolpert D (1992) Stacked generalization. Neural Networks, 5(2):241-260.
Google Scholar
Yang Y and Pedersen JO (1997) A comparative study on feature selection in text categorization. In: Proceedings of ICML-97, 14th International Conference on Machine Learning, Nashville, US, pp. 412-420.
Zavrel J (1997) An empirical re-examination of weighted voting for k-NN. In: Proceedings of the 7th Belgian-Dutch Conference on Machine Learning (BENELEARN-97), Tilburg, The Netherlands.

Download references

Author information

Authors and Affiliations

National Centre for Scientific Research (NCSR) “Demokritos”, Institute of Informatics and Telecommunications, GR-153 10 Ag. Paraskevi, Athens, Greece
Georgios Sakkis, Georgios Paliouras, Vangelis Karkaletsis & Constantine D. Spyropoulos
Department of Informatics, Athens University of Economics and Business, Patission 76, GR-104 34, Athens, Greece
Ion Androutsopoulos & Panagiotis Stamatopoulos

Authors

Georgios Sakkis
View author publications
You can also search for this author in PubMed Google Scholar
Ion Androutsopoulos
View author publications
You can also search for this author in PubMed Google Scholar
Georgios Paliouras
View author publications
You can also search for this author in PubMed Google Scholar
Vangelis Karkaletsis
View author publications
You can also search for this author in PubMed Google Scholar
Constantine D. Spyropoulos
View author publications
You can also search for this author in PubMed Google Scholar
Panagiotis Stamatopoulos
View author publications
You can also search for this author in PubMed Google Scholar

Rights and permissions

Reprints and permissions

About this article

Cite this article

Sakkis, G., Androutsopoulos, I., Paliouras, G. et al. A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists. Information Retrieval 6, 49–73 (2003). https://doi.org/10.1023/A:1022948414856

Download citation

Issue Date: January 2003
DOI: https://doi.org/10.1023/A:1022948414856

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists

Abstract

Article PDF

Similar content being viewed by others

E-Mail Spam Filtering: A Review of Techniques and Trends

The Improved Bayesian Algorithm to Spam Filtering

Spam E-Mail Classification Based on the IFWB Algorithm

References

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Navigation

A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists

Abstract

Article PDF

Similar content being viewed by others

E-Mail Spam Filtering: A Review of Techniques and Trends

The Improved Bayesian Algorithm to Spam Filtering

Spam E-Mail Classification Based on the IFWB Algorithm

References

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation