Menu:

Cool news:

Oct 04, 2011:
Nobel physics prize honours accelerating Universe find

Three researchers behind the discovery that our Universe's expansion is accelerating have been awarded this year's Nobel prize for physics.

Read more...

View Sergio Consoli's profile on LinkedIn

Info on my city

Catania is located in the Mediterranean area at the base of the active volcano “Mount Etna”, in Sicily. Here you can find the most popular Sicilian beach resorts, like Taormina, amazing tours to historic sites by Greeks, Romans, and Arabs, and exquisite examples of Baroque architecture! See some pictures of Catania!

Useful links:

- Philips Research
Ken Darby-Dowman's homepage
- CompLearn Home
- Rudi Cilibrasi's homepage
- Paul Vitanyi's homepage
- Ming Li's homepage

Version: 1.1
Last modification:
December 2011

Clustering

The Quartet method of hierarchical clustering

The Quartet method (Cilibrasi and Vitanyi, 2005; 2006) is a novel hierarchical clustering method which, given a set of objects to be classified, does not require the number of clusters to be produced as input, but halts when the optimal hierarchy of the objects, according to a specific cost evaluation, is obtained.

Given n objects to cluster, the Quartet method accepts as input a distance matrix, which is a matrix containing the distances, taken pairwise, among the n objects. It is therefore a symmetric n x n matrix containing non-negative real numbers, normalized between 0 and 1, as entries. The value 1 represents the highest distance between two objects; as close to 1 the distance between two objects is, as far the objects are. Obviously, all the elements in the diagonal of the distance matrix are equal to 0.

The Quartet method produces as output a dendrogram with a special topology, called full unrooted binary tree with n leaves, that agrees with the distance matrix in input as much as possible according to the cost evaluation. A dendrogram is a full unrooted binary tree if all the internal nodes have degree exactly three and there is no distinction between parent and child nodes. In order to agree with the distance matrix, the Quartet method places the n objects to be clustered as leaves of the full unrooted binary tree, such as objects that have a short relative distance will be represented in near positions in the tree. A full unrooted binary tree with n leaves will have exactly n - 2 internal nodes, for a total of 2n - 2 nodes.

This special dendrogram is sometimes called boron tree (or ternary tree), since such a tree, with 2n - 2 total nodes, has n - 2 nodes of valency 3 (corresponding to boron atoms) and n nodes of valency (corresponding to hydrogen atoms). Boron trees are of primary interest in clustering contexts because, of all trees on a fixed number of nodes, they have the richest internal structure (i.e. most differentiated paths between nodes). They are therefore the most sensitive for representing the structure of a set of objects.

Figure 1 shows an example of an input distance matrix created arbitrarily by the authors of (Consoli et al., 2010) between a set of n = 8 objects from completely different domains.

Kaka
Jan Korst
Ancelotti
Seedorf
Sergio Consoli
Radiohead
Gijs Geleijnse
Metallica
Kaka
0.00
1.00
0.30
0.20
0.80
1.00
0.90
1.0
Jan Korst
1.00
0.00
1.00
1.00
0.20
0.80
0.05
0.90
Ancelotti
0.30
1.00
0.00
0.40
0.60
1.00
0.80
0.95
Seedorf
0.20
1.00
0.40
0.00
0.89
0.99
0.90
1.00
Sergio Consoli
0.80
0.20
0.60
0.89
0.00
0.93
0.30
1.00
Radiohead
1.00
0.80
1.00
0.99
0.93
0.00
0.90
0.05
Gijs Geleijnse
0.90
0.05
0.80
0.90
0.30
0.90
0.00
0.55
Metallica
1.00
0.90
0.95
1.00
1.00
0.05
0.55
0.00

Figure 1: Example of distance matrix between n = 8 objects (Consoli et al., 2010).

Figure 2 shows the full unrooted binary tree of the optimal hierarchy of the n = 8 objects. The two famous bands Metallica and Radiohead form a cluster. Then Kaka, Seedorf, and Ancelotti, who belong to the world of soccer form another cluster, with Kaka and Seedorf closer together as football players, and coach Ancelotti further away. The final cluster is that of Sergio Consoli, Gijs Geleijnse, and Jan Korst, who are research scientists, co-authors of (Consoli et al., 2010).

distance matrix

Figure 2: Boron tree of the optimal hierarchy of the example (Consoli et al., 2010).

In order to construct the boron tree which approximates the optimal hierarchy according to the distance matrix in input, the Quartet method is based on an NP-hard graph optimization problem, called minimum quartet tree cost (MQTC) problem (Cilibrasi and Vitanyi, 2005, 2006). Given a set N of n ≥ 4 objects, for each set of four objects {a; b; c; d}N, there exist exactly three different dendrograms with four leaves (i.e. two internal nodes), also known as simple quartet topologies: ab|cd, ac|bd, ad|bc (Figure 3).

distance matrix

Figure 3: The three different simple quartet topologies for the set {a; b; c; d}.

The vertical bar in a simple quartet topology divides the two pairs of objects, where each pair is represented by two leaf nodes, labelled by the corresponding objects and attached to the same internal node. For example, in the simple quartet topology ab|cd, the objects a and b are connected to the same internal node, differently from c and d which are connected to another internal node.

Each simple quartet topology has associated a real valued cost by means of a cost function C : Q → ℜ+, where Q is the set of simple quartet topologies. The cost assigned to each simple quartet topology is defined as the sum of the distances (taken from the distance matrix) between each pair of neighbouring leaves (Cilibrasi and Vitanyi, 2005, 2006). For example, the cost associated with the simple quartet topology ab|cd is Cab|cd = d(a,b) + d(c,d), where d(a,b) and d(c,d) indicate, respectively, the distances between the two neighbouring objects (a and b) and (c and d), obtained from the distance matrix.

Consider the set Θ of full unrooted binary trees with 2n - 2 nodes (i.e. n leaves and n - 2 internal nodes), obtained by placing the n objects to cluster as leaf nodes of the trees. For each boron tree t ∈ Θ, precisely one of the three possible simple quartet topologies for any set of four leaves is consistent (Cilibrasi and Vitanyi, 2005, 2006). Let Qt be the set of consistent quartet topologies embedded in t. Then, the cost Ct associated with a boron tree t ∈ Θ is defined as the sum of the costs of its consistent simple quartet topologies, i.e. Ct = ∑ Cab|cd, ∀{a; b; c; d}Qt.

In most cases, it is not possible to create a boron tree which embeds all the simple quartet topologies with the minimum cost for all the sets of four objects (especially for a large number of objects n), due to inconsistency. Thus, it is a matter of making the most balanced choice of the quartet topologies to embed. This is the goal of the quartet method of hierarchical clustering: trying to find (or approximate as closely as possible) the boron tree t ∈ Θ with the minimum total cost. This boron tree t will embed the combination of consistent simple quartet topologies of Q with the minimum costs, with respect to a full unrooted binary tree representation of the distance matrix. This is the optimization problem on which the Quartet method is based and, as already stated, it is called minimum quartet tree cost (MQTC) problem (Cilibrasi and Vitanyi, 2005, 2006).

The MQTC problem is formally defined as follows. Given a set N of n ≥ 4 objects to be clustered, and a symmetric distance matrix n x n containing their pairwise distances, find the full unrooted binary tree t ∈ Θ with the minimum total cost Ct, i.e. min Ct = min (∑ Cab|cd, ∀{a; b; c; d}Qt).

In a hierarchical clustering context, we do not even have a priori knowledge that certain simple quartet topologies are objectively true and must be embedded. Thus, the quartet method of hierarchical clustering assigns a cost value to each simple quartet topology, in order to express the relative importance of the simple quartet topologies to be embedded in the full unrooted binary tree having the n objects as leaves. The boron tree t ∈ Θ with the minimum cost Ct, produced by the quartet method, balances the importance of embedding different quartet topologies against others, leading to a boron tree that visually represents the symmetric distance matrix n x n as well as possible.

As shown in (Cilibrasi and Vitanyi, 2005, 2006), the minimum quartet tree cost problem is an NP-hard optimization problem by reduction from the maximum quartet consistency problem (Steel, 1992). Therefore, any practical approach to obtain or approximate the optimal solution requires heuristics. In the next section it is reported the literature review on the Quartet method.


Literature review

The Quartet method of hierarchical clustering was first introduced in (Cilibrasi et al., 2004). This paper proposed a robust automating music classification procedure consisting of two steps. The first step consisted of extracting the Normalized Compression Distances (NCD) (Li and Vitanyi, 1997) among some considered pieces of music. The Normalized Compression Distance is a new similarity metric based on string compression which mimics the ideal performance of Kolmogorov complexity (Li and Vitanyi, 1997). NCD is able to extract consistent pairwise distances among the pieces of music without considering numerical features related to pitch, rhythm, harmony, or other intrinsic information, as in the other popular automatic music classification methods in the literature. The second step consisted of efficiently visualize the extracted pairwise distances by means a novel and efficient hierarchical clustering method: the Quartet method.

To substantiate the claims of universality and robustness of this automating classification method, evidence of other successful applications in areas as diverse as genomics, virology, languages, literature, handwritten, astronomy and combinations of objects from completely different domains, were reported in (Cilibrasi and Vitanyi, 2005). In particular, Cilibrasi and Vitanyi (2007) reported an interesting application of this theory, consisting of the automatic extraction of similarities among words and phrases from WWW using the Google page counts. The WWW is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality.

In (Cilibrasi and Vitanyi, 2006), the authors presented the Quartet method of hierarchical clustering in a more formal way. They showed the main concepts, components, advantages and disadvantages of the method, particularly underlying the similarities and differences with respect to other methods from biological phylogeny. Cilibrasi and Vitanyi (2006) also proved that the Quartet method is based on an NP-hard graph optimization problem, the minimum quartet tree cost problem, and provided a Randomized Hill Climbing metaheuristic to obtain approximate solutions. Several experiments with natural data, like genomic and phylogenetic data, texts or music, and data of completely different types, were further presented.

Consoli et al. (2010) presented several metaheuristic approaches to solve the minimum quartet tree cost problem and approximate the optimal hierarchy of the Quartet method. In particular, a Greedy Randomized Adaptive Search Procedure, a Simulated Annealing approach, a Variable Neighbourhood Search, and a Reduced Variable Neighbourhood Search have been proposed for the problem. Considering a wide range of problem instances, the authors compared these heuristics with the Randomized Hill Climbing by Cilibrasi and Vitanyi (2006), the most popular heuristic in the literature for the Quartet method of hierarchical clustering. In the experiments, data from different fields were considered in order to evaluate how the algorithms are influenced by the nature of the objects. Some examples from nature concerning a study in genomics with DNA sequences of different placental mammalian species have been considered, along with other examples with real geographic distances between famous cities, and other data obtained by mining of the World Wide Web through an automatic web information extraction method. It has been showed, for example, how to build ontologies of famous historical persons or popular musical artists by means of data extracted from the World Wide Web. Based on this experimental analysis, it was shown that the considered algorithms are able to obtain high quality solutions in short computational running times and, in particular, the best performance was obtained by Reduced Variable Neighbourhood Search. Reduced Variable Neighbourhood Search was shown to be a fast, simple, and particularly effective heuristic to solve the minimum quartet tree cost problem.


Test datasets for the Quartet method

In the following, it is possibile to download some instances of the Quartet method of hierarchical clustering. These are the instances used in the computations by the authors of (Consoli et al., 2010). They considered 26 different datasets with a number of objects to cluster (n) from 10 up to 224. Data from different fields have been considered in order to evaluate how the algorithms are influenced by the nature of the objects.

The following link allows to download some instances of the Quartet method from artificial data.

Artificial data for the Quartet method

These artificial data are some instances without inconsistency, that is data for which the exact solution is known and have a normalized tree benefit score equal to one, in order to test the accuracy of the quartet-based tree reconstruction. These data were produced artificially as described in (Consoli et al., 2010).

The following link allows to download some instances of the Quartet method from natural data.

Natural data for the Quartet method

These natural data are some instances obtained from nature (Cilibrasi et al., 2007), concerning a study in genomics with DNA sequences of different placental mammalian species. The distance matrices from the genomic data were computed by using the automated software method by Cilibrasi and Vitanyi (2005; 2006). Three sets of data with n = 10, n = 24, and n = 34 are considered.

The following link allows to download some instances of the Quartet method from geographical data.

Geographical data for the Quartet method

These geographical data are instances obtained by real geographic distances between famous cities, normalized in the interval [0, 1]. We considered data instances with a number of objects n from 13 to 37.

The following link allows to download some instances of the Quartet method from data extracted from the World Wide Web.

Data extracted from the WWW for the Quartet method

These data extracted from the World Wide Web have been obtained by mining of the WWW through an automatic web information extraction method by Geleijnse et al. (2006). Specifically, it has been focussed on data concerning musical artists, in order to easily show subjective artist categories such as genre of their music. We considered data instances with a number of artists n from 15 to 224.


References

Cilibrasi, R., 2007. Statistical inference through data compression. Ph.d. thesis, Institute for Logic, Language and Computation, Universiteit van Amsterdam, The Netherlands.

Cilibrasi, R., A. L. Cruz, S. de Rooij, M. Keijzer, M. Li, P. M. B. Vitanyi, 2007. The CompLearn toolkit [online].

Cilibrasi, R., P. M. B. Vitanyi, 2007. The google similarity distance. IEEE Transactions on Knowledge and Data Engineering 19 (3) 370-383.

Cilibrasi, R., P. M. B. Vitanyi, 2006. A new quartet tree heuristic for hierarchical clustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics, submitted.

Cilibrasi, R., P. M. B. Vitanyi, 2005. Clustering by compression. IEEE Transactions on Information Theory 51 (4) 1523-1545.

Cilibrasi, R., P. M. B. Vitanyi, R. de Wolf, 2004. Algorithmic clustering of music based on string compression. Computer Music Journal 28 (4) 49-67.

Consoli, S., K. Darby-Dowman, G. Geleijnse, J. Korst, S. Pauws, 2010. Heuristic approaches for the quartet method of hierarchical clustering, IEEE Transactions on Knowledge and Data Engineering, 22(10), 1428-1443, doi:http://doi.ieeecomputersociety.org/10.1109/TKDE.2009.188.

Geleijnse, G., J. Korst, V. de Boer, 2006. Instance classification using co-occurrences on the web, in Proceedings of the ISWC 2006 workshop on Web Content Mining (WebConMine), Athens, GA.

Li, M., P. M. B. Vitanyi, 1997. An introduction to Kolmogorov complexity and its applications (2nd edition). Springer-Verlag, New York.

Steel M. A., 1992. The complexity of reconstructiong trees from qualitative characters and subtrees, Journal of Classification, 9, 91-116.

Quartet clustering