Skip to main content
Log in

Finding Frequent Patterns in a Large Sparse Graph*

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

Graph-based modeling has emerged as a powerful abstraction capable of capturing in a single and unified framework many of the relational, spatial, topological, and other characteristics that are present in a variety of datasets and application areas. Computationally efficient algorithms that find patterns corresponding to frequently occurring subgraphs play an important role in developing data mining-driven methodologies for analyzing the graphs resulting from such datasets. This paper presents two algorithms, based on the horizontal and vertical pattern discovery paradigms, that find the connected subgraphs that have a sufficient number of edge-disjoint embeddings in a single large undirected labeled sparse graph. These algorithms use three different methods for determining the number of edge-disjoint embeddings of a subgraph and employ novel algorithms for candidate generation and frequency counting, which allow them to operate on datasets with different characteristics and to quickly prune unpromising subgraphs. Experimental evaluation on real datasets from various domains show that both algorithms achieve good performance, scale well to sparse input graphs with more than 120,000 vertices or 110,000 edges, and significantly outperform previously developed algorithms.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Figure 1
Figure 2
Figure 3
Algorithm 1
Algorithm 2
Algorithm 3
Algorithm 4
Figure 4
Algorithm 5
Figure 5
Figure 6
Table 3
Figure 7

Similar content being viewed by others

Notes

  1. The symbol v i in the figure is a vertex ID, not a vertex label, and blank elements in the adjacency matrix means there is no edge between the corresponding pair of vertices.

  2. SiGraM stands for Single Graph Miner.

  3. http://cygnus.uta.edu/subdue/databases/index.html

  4. http://www.cs.cornell.edu/projects/kddcup/datasets.html

  5. DTP 2D and 3D Structural Information. http://dtp.nci.nih.gov/docs/3d_database/structural_information/structural_data.html

  6. http://dip.doe-mbi.ucla.edu/

  7. http://vlsicad.cs.ucla.edu/~cheese/ispd98.html

  8. http://www.google.com/programming-contest/

  9. Although this version is not the latest one, it runs significantly faster than the current latest version, 5.0.8.

  10. ftp://ftp.comlab.ox.ac.uk/pub/Packages/ILP/Datasets/carcinogenesis/progol/carcinogenesis.tar.Z

References

  • Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. 1998. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. of 1998 ACM SIGMOD International Conference on Management of Data, pp. 94–105.

  • Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proc. of the 20th International Conference on Very Large Data Bases (VLDB), J.B. Bocca, M. Jarke, and C. Zaniolo (Eds.), Morgan Kaufmann, pp. 487–499.

  • Agrawal, R. and Srikant, R. 1995. Mining sequential patterns. In Proc. of the 11th International Conference on Data Engineering (ICDE), P.S. Yu and A.L.P. Chen (Eds.), IEEE Press, pp. 3–14.

  • Berendt, B., Hotho, A., and Stumme, G. 2002. Towards semantic web mining. In International Semantic Web Conference (ISWC), pp. 264–278.

  • Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H. et al. 2000. The protein data bank. Nucleic Acids Research, 28:235–242.

    Article  PubMed  Google Scholar 

  • Berman, P. and Fujito, T. 1995. On the approximation properties of independent set problem in degree 3 graphs. In Proc. of Workshop on Algorithms and Data Structures, pp. 449–460.

  • Blake, C.L. and Merz, C.J. 1998. UCI Repository of Machine Learning Databases.

  • Borgelt, C. and Berthold, M.R. 2002. Mining molecular fragments: Finding relevant substructures of molecules. In Proc. of 2002 IEEE International Conference on Data Mining (ICDM), pp. 51–58.

  • Chew, L.P., Huttenlocher, D., Kedem, K., and Kleinberg, J. 1999. Fast detection of common geometric substructure in proteins. In Proc. of the 3rd ACM RECOMB International Conference on Computational Molecular Biology, pp. 104–114.

  • Cohen, M. and Gudes, E. 2004. Diagonally subgraphs pattern mining. In Proc. of the 9th ACM SIGMOD workshop on research issues in Data Mining and Knowledge Discovery DMKD '04, pp. 51–58.

  • Cook, D.J. and Holder, L.B. 1994. Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1:231–255.

    Google Scholar 

  • Cook, D.J. and Holder, L.B. 2000. Graph-based data mining. IEEE Intelligent Systems, 15(2):32–41.

    Article  Google Scholar 

  • Cook, D.J., Holder, L.B., and Djoko, S. 1995. Knowledge discovery from structural data. Journal of Intelligent Information Systems, 5(3):229–245.

    Article  Google Scholar 

  • Dehaspe, L., Toivonen, H., and King, R.D. 1998. Finding frequent substructures in chemical compounds. In Proc. of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-98), R. Agrawal, P. Stolorz, and G. Piatetsky-Shapiro (Eds.), AAAI Press, pp. 30–36.

  • De Raedt, L. and Kramer, S. 2001. The level-wise version space algorithm and its application to molecular fragment finding. In Proc. of the 17th International Joint Conference on Artificial Intelligence (IJCAI-01), pp. 853–862.

  • Deshpande, M., Kuramochi, M., and Karypis, G. 2002. Automated approaches for classifying structures. In Proc. of the 2nd Workshop on Data Mining in Bioinformatics (BIOKDD '02) pp. 11–18.

  • Deshpande, M., Kuramochi, M., and Karypis, G. 2003. Frequent sub-structure based approaches for classifying chemical compounds. In Proc. of 2003 IEEE International Conference on Data Mining (ICDM), pp. 35–42.

  • Feige, U., Goldwasser, S., Lovasz, L., Safra, S., and Szegedy, M. 1991. Approximating clique is almost NP-complete. In Proc. of the 32nd IEEE Symposium on Foundations of Computer Science (FOCS), pp. 2–12.

  • Fortin, S. 1996. The Graph Isomorphism Problem (Tech. Rep. No. TR96-20). Department of Computing Science, University of Alberta.

  • Garey, M.R. and Johnson, D.S. 1979. Computers and Intractability: A Guide to the Theory of np-completeness. New York: W. H. Freeman and Company.

    MATH  Google Scholar 

  • Ghazizadeh, S. and Chawathe, S. 2002a. Discovering Freuqent Structures Using Summaries (Tech. Rep. No. CS-TR-4364). Department of Computer Science, University of Maryland.

  • Ghazizadeh, S. and Chawathe, S. 2002b. SEuS: Structure extraction using summaries. In Proc. of the 5th International Conference on Discovery Science, pp. 71–85.

  • Gonzalez, J., Holder, L.B. and Cook, D.J. 2001. Application of graph-based concept learning to the predictive toxicology domain. In Proc. of the Predictive Toxicology Challenge Workshop.

  • Grindley, H.M., Artymiuk, P.J., Rice, D.W., and Willett, P. 1993. Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. Journal of Molecular Biology, 229:707–721.

    Article  PubMed  Google Scholar 

  • Guralnik, V. and Karypis, G. 2001. A scalabale algorithm for clustering sequence datasets. In Proc. of 2001 IEEE International Conference on Data Mining (ICDM).

  • Halldórsson, M.M., and Radhakrishnan, J. 1997. Greed is good: Approximating independent sets in sparse and bounded-degree graphs. Algorithmica, 18(1):145–163.

    Article  MATH  MathSciNet  Google Scholar 

  • Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proc. of ACM SIGMOD International Conference on Management of Data Dallas, TX, pp. 1–12.

  • Hochbaum, D.S. 1983. Efficient bounds for the stable set, vertex cover, and set packing problems. Discrete Applied Mathematics, 6:243–254.

    Article  MATH  MathSciNet  Google Scholar 

  • Holder, L.B., Cook, D.J., and Djoko, S. 1994. Substructure discovery in the SUBDUE system. In Proc. of the AAAI Workshop on Knowledge Discovery in Databases, pp. 169–180.

  • Hong, M., Zhou, H., Wang, W., and Shi, B. 2003. An efficient algorithm of frequent connected subgraph extraction. In Proc. of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-03), Springer-Verlag, Vol. 2637, pp. 40–51.

  • Huan, J., Wang, W., and Prins, J. 2003. Efficient mining of frequent subgraph in the presence of isomophism. In Proc. of 2003 IEEE International Conference on Data Mining (ICDM'03), pp. 549–552.

  • Inokuchi, A., Washio, T., and Motoda, H. 2000. An apriori-based algorithm for mining frequent substructures from graph data. In Proc. of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'00), Lyon, France, pp. 13–23.

  • Inokuchi, A., Washio, T., and Motoda, H. 2003. Complete mining of frequent patterns from graphs: Mining graph data. Machine Learning, 50(3):321–354.

    Article  MATH  Google Scholar 

  • Inokuchi, A., Washio, T., Nishimura, K., and Motoda, H. 2002. A Fast Algorithm for Mining Frequent Connected Subgraphs (Tech. Rep. No. RT0448). IBM Research, Tokyo Research Laboratory.

  • Jensen, D. and Goldberg, H. (Eds.). 1998. Artificial Intelligence and Link Analysis Papers from the 1998 Fall Symposium. AAAI Press.

  • Jonyer, I., Cook, D.J., and Holder, L.B. 2001a. Discovery and evaluation of graph-based hierarchical conceptual clusters. Journal of Machine Learning Research, 2:19–43.

    Article  Google Scholar 

  • Jonyer, I., Holder, L.B., and Cook, D.J. 2001b. Hierarchical conceptual structural clustering. International Journal on Artificial Intelligence Tools, 10(1–2):107–136.

    Article  Google Scholar 

  • Khanna, S., Motwani, R., Sudan, M., and Vazirani, U.V. 1994. On syntactic versus computational views of approximability. In Proc. of IEEE Symposium on Foundations of Computer Science, pp. 819–830.

  • Kleinberg, J.M. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632.

    Article  MATH  MathSciNet  Google Scholar 

  • Kleinberg, J.M., Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A.S. 1999. The Web as a graph: Measurements, models and methods. Lecture Notes in Computer Science, 1627:1–17.

    MathSciNet  Google Scholar 

  • Ko, C. 2000. Logic induction of valid behavior specifications for intrusion detection. In IEEE Symposium on Security and Privacy (S&P), pp. 142–155.

  • Koch, I., Lengauer, T., and Wanke, E. 1996. An algorithm for finding maximal common subtopoloties in a set of protein structures. Journal of Computational Biology, 3(2):289–306.

    Article  PubMed  Google Scholar 

  • Kramer, S., De Raedt, L., and Helma, C. 2001. Molecular feature mining in HIV data. In Proc. of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-01), pp. 136–143.

  • Kuramochi, M., Deshpande, M., and Karypis, G. 2004. Data mining: Next generation challenges and future directions. In H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha (eds.), AAAI Press (in press), pp. 315–334.

  • Kuramochi, M. and Karypis, G. 2001. Frequent subgraph discovery. In Proc. of 2001 IEEE International Conference on Data Mining (ICDM), pp. 313–320.

  • Kuramochi, M. and Karypis, G. 2002. An Efficient Algorithm for Discovering Frequent Subgraphs (Tech. Rep. No. 02-026). University of Minnesota, Department of Computer Science.

  • Kuramochi, M. and Karypis, G. 2004a. An efficient algorithm for discovering frequent subgraphs. IEEE Transactions on Knowledge and Data Engineering, 16(9):1038–1051.

    Article  Google Scholar 

  • Kuramochi, M. and Karypis, G. 2004b. Finding frequent patterns in a large sparse graph. In Proc. of the 2004 SIAM International Conference on Data Mining (SDM04).

  • Lee, W. and Stolfo, S.J. 2000. A framework for constructing features and models for intrusion detection systems. ACM Transactions on Information and System Security, 3(4):227–261.

    Article  Google Scholar 

  • Leibowitz, N., Fligelman, Z.Y., Nussinov, R., and Wolfson, H.J. 1999. Multiple structural alignment and core detection by geometric hashing. In Proc. of the 7th international conference on intelligent systems in molecular biology, Heidelberg: Germany, pp. 169–177.

  • Leibowitz, N., Nussinov, R., and Wolfson, H.J. 2001. MUSTA—a general, efficient, automated method for multiple structure alignment and detection of common motifs: Application to proteins. Journal of Computational Biology, 8(2):93–121.

    Article  PubMed  Google Scholar 

  • Li, W., Han, J., and Pei, J. 2001. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proc. of 2001 IEEE International Conference on Data Mining (ICDM), pp. 369–376.

  • Liu, B., Hsu, W., and Ma, Y. 1998. Integrating classification and association rule mining. In Proc. of the 4th Internation Conference on Knowledge Discovery and Data Mining (kdd-98), pp. 80–86.

  • McKay, B.D. (n.d.). Nauty users guide. http://cs.anu.edu.au/~bdm/nauty/

  • McKay, B.D. 1981. Practical graph isomorphism. Congressus Numerantium, 30:45–87.

    MathSciNet  Google Scholar 

  • Mitchell, E.M., Artymiuk, P.J., Rice, D.W., and Willett, P. 1989. Use of techniques derived from graph theory to compare secondary structure motifs in proteins. Journal of Molecular Biology, 212:151–166.

    Article  Google Scholar 

  • Mooney, R.J., Melville, P., Tang, L.R., Shavlik, J., Castro Dutra, I. de, Page, D. et al. 2004. Relational data mining with inductive logic programming for link discovery. In AAAI Press/The MIT Press, pp. 239–255.

  • Muggleton, S.H. 1999. Scientific knowledge discovery using Inductive Logic Programming. Communications of the ACM, 42(11):42–46.

    Article  Google Scholar 

  • Östergård, P.R.J. 2002. A fast algorithm for the maximum clique problem. Discrete Applied Mathematics, 120:195–205.

    Article  Google Scholar 

  • Palmer, C.R., Gibbons, P.B., and Faloutsos, C. 2002. ANF: A fast and scalable tool for data mining in massive graphs. In Proc. of the 8th ACM SIGKDD Internal Conference on Knowlege Discovery and Data Mining (KDD-2002) Edmonton, AB, Canada, pp. 81–90.

  • Pennec, X. and Ayache, N. 1998. A geometric algorithm to find small but highly simialar 3D substructures in proteins. Bioinformatics, 14(6):516–522.

    Article  PubMed  Google Scholar 

  • Raymond, J.W. 2002. Heuristics for similarity searching of chemical graphs using a maximum common edge subgraph algorithm. J. Chem. Inf. Comput. Sci., 42:305–316.

    Article  PubMed  Google Scholar 

  • Read, R.C. and Corneil, D.G. 1977. The graph isomorph disease. Journal of Graph Theory, 1:339–363.

    Article  MATH  MathSciNet  Google Scholar 

  • Robson, J.M. 1986. Algorithms for maximum independent sets. Journal of Algorithms, 7:425–440.

    Article  MATH  MathSciNet  Google Scholar 

  • Srinivasan, A., King, R.D., Muggleton, S.H., and Sternberg, M.J.E. 1997. Carcinogenesis predictions using ILP. In Proc. of the 7th International Workshop on Inductive Logic Programming S. Džeroski and N. Lavrač (Eds.), Springer-Verlag, (Vol. 1297, pp. 273–287).

  • Vanetik, N., Gudes, E., and Shimony, S.E. 2002. Computing frequent graph patterns from semistructured data. In Proc. of 2002 IEEE International Conference on Data Mining (ICDM), pp. 458–465.

  • Wang, X., Wang, J.T.L., Shasha, D., Shapiro, B.A., Rigoutsos, I., and Zhang, K. 2002. Finding patterns in three dimensional graphs: Algorithms and applications to scientific data mining. IEEE Transactions on Knowledge and Data Engineering, 14(4):731–749.

    Article  Google Scholar 

  • Wasserman, S., Faust, K., and Iacobucci, D. 1994. Social network analysis : Methods and applications. Cambridge University Press.

  • Yan, X. and Han, J. 2002. gSpan: Graph-based substructure pattern mining. In Proc. of 2002 IEEE International Conference on Data Mining (ICDM), pp. 721–724.

  • Yan, X. and Han, J. 2003. CloseGraph: Mining closed frequent graph patterns. In Proc. of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp. 286–295.

  • Yoshida, K. and Motoda, H. 1995. CLIP: Concept learning from inference patterns. Artificial Intelligence, 75(1):63–92.

    Article  Google Scholar 

  • Yoshida, K., Motoda, H., and Indurkhya, N. 1994. Graph-based induction as a unified learning framework. Journal of Applied Intelligence, 4:297–328.

    Article  Google Scholar 

  • Zaki, M.J. and Gouda, K. 2003. Fast vertical mining using diffsets. In Proc. of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp. 326–335.

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Michihiro Kuramochi.

Additional information

*This work was supported in part by NSF CCR-9972519, EIA-9986042, ACI-9982274, ACI-0133464, and ACI-0312828; the Digital Technology Center at the University of Minnesota; and by the Army High Performance Computing Research Center (AHPCRC) under the auspices of the Department of the Army, Army Research Laboratory (ARL) under Cooperative Agreement number DAAD19-01-2-0014. The content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Kuramochi, M., Karypis, G. Finding Frequent Patterns in a Large Sparse Graph* . Data Min Knowl Disc 11, 243–271 (2005). https://doi.org/10.1007/s10618-005-0003-9

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-005-0003-9

Keywords

Navigation