SUBDUE is a graph based knowledge discovery system that finds structural, relational patterns in data representing entities and relationships. SUBDUE represents data using a labeled, directed graph in which entities are represented by labeled vertices or subgraphs, and relationships are represented by labeled edges between the entities.  SUBDUE uses the minimum description length (MDL) principle to identify patterns that minimize the number of bits needed to describe the input graph after being compressed by the pattern. SUBDUE can perform several learning tasks, including unsupervised learning, supervised learning, clustering and graph grammar learning. SUBDUE has been successfully applied in a number of areas, including bioinformatics, web structure mining, counter-terrorism, social network analysis, aviation and geology.

Graph-based Unsupervised Learning (Discovery)



Related Work

Related Links



In discovery mode SUBDUE uses heuristic search guided by MDL to find patterns minimizing the description length of the entire graph compressed with the pattern.  Once a pattern is found, SUBDUE can compress the graph using this pattern and repeat the process on the compressed graph to look for more abstract patterns, possibly defined in terms of previously-discovered patterns. The illustration below shows how SUBDUE finds four instances of a pattern S1 in the input graph, the graph compressed with pattern S1, the pattern S2 found in the next iteration, and the graph compressed with pattern S2.


Graph Based Supervised Learning

If graphs depicting both positive and negative examples of a phenomenon are available for input, then SUBDUE enters supervised-learning mode, searching for a pattern that compresses the positive graphs, but not the negative graphs. For example, given positive graphs describing criminal networks and negative graphs describing normal social networks, SUBDUE can learn patterns distinguishing the two, and these patterns can be used as a predictive model to identify emerging criminal networks.





Graph Based Hierarchical Clustering

The ability of SUBDUE to iteratively discover patterns and compress the graph can be used to generate a clustering of the input graph.  Essentially, clustering mode forces SUBDUE to iterate until the input graph can be compressed no further.  The resulting patterns form a cluster lattice, such that if a pattern S is defined in terms of one or more previously-discovered patterns, then these patterns are parents of S in the lattice. In the above example, S2 is defined in terms of S1. Each cluster in the lattice is defined conceptually by the graphical pattern discovered by SUBDUE, producing a hierarchical, conceptual clustering of the input data.



Graph Grammar Learning

Graph grammars are similar to string grammars, but where terminals and non-terminals can represent arbitrary graphs. SUBDUE learns context-free, node-replacement graph grammars by looking for common connections between the instances of a substructure S.