Test datasets for the QUARTET METHOD OF HIERARCHICAL CLUSTERING

by Sergio Consoli

 

 

 The Quartet method (Cilibrasi and Vitanyi, 2005, 2006).is a novel hierarchical clustering method which, given a set of objects to be classiffied, does not require the number of clusters to be produced as input, but halts when the optimal hierarchy of the objects, according to a specific cost evaluation, is obtained. Given n objects to cluster, the Quartet method accepts as input a distance matrix, which is a matrix containing the distances, taken pair wise, among the n objects. It is therefore a symmetric n x n matrix containing non-negative real numbers, normalized between 0 and 1, as entries. The value 1 represents the highest distance between two objects; as close to 1 the distance between two objects is, as far the objects are. Obviously, all the elements in the diagonal of the distance matrix are equal to 0. The Quartet method produces as output a dendrogram with a special topology, called full unrooted binary tree with n leaves, that agrees with the distance matrix in input as much as possible according to the cost evaluation. A dendrogram is a full unrooted binary tree if all the internal nodes have degree exactly three and there is no distinction between parent and child nodes. In order to agree with the distance matrix, the Quartet method places the n objects to be clustered as leaves of the full unrooted binary tree, such as objects that have a short relative distance will be represented in near positions in the tree. A full unrooted binary tree with n leaves will have exactly n - 2 internal nodes, for a total of 2n - 2 nodes. This special dendrogram is sometimes called boron tree (or ternary tree), since such a tree, with 2n - 2 total nodes, has n - 2 nodes of valency 3 (corresponding to boron atoms) and n nodes of valency (corresponding to hydrogen atoms). Boron trees are of primary interest in clustering contexts because, of all trees on a fixed number of nodes, they have the richest internal structure (most differentiated paths between nodes). They are therefore the most sensitive for representing the structure of a set of objects.

In order to construct the boron tree which approximates the optimal hierarchy according to the distance matrix in input, the Quartet method is based on an NP-hard graph optimization problem, called minimum quartet tree cost (MQTC) problem (Cilibrasi and Vitanyi, 2005, 2006). Therefore, any practical approach to obtain or approximate the optimal solution requires heuristics.

 

 

LITERATURE REVIEW

 

The Quartet method of hierarchical clustering was first introduced in (Cilibrasi et al., 2004). This paper proposed a robust automating music classification procedure consisting of two steps. The first step consisted of extracting the Normalized Compression Distances (NCD) (Li and Vitanyi, 1997) among some considered pieces of music. The Normalized Compression Distance is a new similarity metric based on string compression which mimics the ideal performance of Kolmogorov complexity (Li and Vitanyi, 1997). NCD is able to extract consistent pair wise distances among the pieces of music without considering numerical features related to pitch, rhythm, harmony, or other intrinsic information, as in the other popular automatic music classification methods in the literature. The second step consisted of efficiently visualize the extracted pair wise distances by means a novel and efficient hierarchical clustering method: the Quartet method.

To substantiate the claims of universality and robustness of this automating classification method, evidence of other successful applications in areas as diverse as genomics, virology, languages, literature, handwritten, astronomy and combinations of objects from completely different domains, were reported in (Cilibrasi and Vitanyi, 2005). In particular, Cilibrasi and Vitanyi (2007) reported an interesting application of this theory, consisting of the automatic extraction of similarities among words and phrases from WWW using the Google page counts. The WWW is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality.

In (Cilibrasi and Vitanyi, 2006), the authors presented the Quartet method of hierarchical clustering in a more formal way. They showed the main concepts, components, advantages and disadvantages of the method, particularly underlying the similarities and differences with respect to other methods from biological phylogeny. Cilibrasi and Vitanyi (2006) also proved that the Quartet method is based on an NP-hard graph optimization problem, called minimum quartet tree cost (MQTC) problem, and provided a Randomized Hill Climbing metaheuristic to obtain approximate solutions. Several experiments with natural data, like genomic and phylogenetic data, texts or music, and data of completely different types, were further presented.

 

 

ACKNOWLEDGES

 

Eng. Sergio Consoli is supported by an E.U. Marie Curie Fellowship for Early Stage Researcher Training (EST-FP6) under grant number MEST-CT-2004-006724 at Brunel University (project NET-ACE). He gratefully acknowledges this support.

 

 

REFERENCES

 

[1]        Cilibrasi, R. 2007. Statistical inference through data compression. Ph.d. thesis, Institute for Logic, Language and Computation, Universiteit van Amsterdam, The Netherlands.

[2]      Cilibrasi, R., A. L. Cruz, S. de Rooij, M. Keijzer, M. Li, P. M. B. Vitanyi. 2007. Complearn. [online]. URL http://www.complearn.org.

[3]      Cilibrasi, R., P. M. B. Vitanyi. 2005. Clustering by compression. IEEE Transactions on Information Theory 51 (4) 1523-1545.

[4]      Cilibrasi, R., P. M. B. Vitanyi. 2006. A new quartet tree heuristic for hierarchical clustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics, submitted, URL: http://arxiv.org/abs/cs.DS/0606048.

[5]      Cilibrasi, R., P. M. B. Vitanyi. 2007. The google similarity distance. IEEE Transactions on Knowledge and Data Engineering 19 (3) 370-383.

[6]      Cilibrasi, R., P. M. B. Vitanyi, R. de Wolf. 2004. Algorithmic clustering of music based on string compression. Computer Music Journal 28 (4) 49-67.

[7]      Li, M., P. M. B. Vitanyi. 1997. An introduction to Kolmogorov complexity and its applications. 2nd ed. Springer-Verlag, New York.

 

 


DOWNLOADS: INSTANCES FOR THE QUARTET METHOD

·        artificial data.

·        natural data.

·        geographical data.

·        data extracted from the web.

 


USEFUL LINKS

·        COMPLEARN Home Page.

·        RUDI CLIBRASI Home Page.

·        PAUL VITANYI home Page

·        MING LI home Page

 

 

 

 

 

 

 

Created by Sergio Consoli, august 2008, Last modification: august 2008.

All Rights reserved to Sergio Consoli