|
Test datasets for the
QUARTET METHOD OF HIERARCHICAL CLUSTERING
The Quartet
method (Cilibrasi and Vitanyi, 2005, 2006).is
a novel hierarchical clustering method which, given
a set of objects to be classiffied, does not
require the number of clusters to be produced as input, but halts when the
optimal hierarchy of the objects, according to a specific cost evaluation, is
obtained. Given n objects to cluster, the Quartet method accepts as
input a distance matrix, which is a matrix containing the distances,
taken pair wise, among the n objects. It is therefore a symmetric n
x n matrix containing non-negative real numbers, normalized between 0 and
1, as entries. The value 1 represents the highest distance between two
objects; as close to 1 the distance between two objects is, as far the
objects are. Obviously, all the elements in the diagonal of the distance
matrix are equal to 0. The Quartet method produces as output a dendrogram with a special topology, called full unrooted binary tree with n leaves,
that agrees with the distance matrix in input as much as possible
according to the cost evaluation. A dendrogram is a
full unrooted binary tree if all the internal nodes
have degree exactly three and there is no distinction between parent and
child nodes. In order to agree with the distance matrix, the Quartet method
places the n objects to be clustered as leaves of the full unrooted binary tree, such as objects that have a short
relative distance will be represented in near positions in the tree. A full unrooted binary tree with n leaves will have
exactly n - 2 internal nodes, for a total of 2n - 2 nodes. This
special dendrogram is sometimes called boron
tree (or ternary tree), since such a tree, with 2n - 2
total nodes, has n - 2 nodes of valency 3
(corresponding to boron atoms) and n nodes of valency
(corresponding to hydrogen atoms). Boron trees are of primary interest in
clustering contexts because, of all trees on a fixed number of nodes, they
have the richest internal structure (most differentiated paths between
nodes). They are therefore the most sensitive for representing the structure
of a set of objects. In order to construct the boron tree which
approximates the optimal hierarchy according to the distance matrix in input,
the Quartet method is based on an NP-hard graph optimization problem, called minimum quartet tree cost (MQTC)
problem (Cilibrasi and Vitanyi,
2005, 2006). Therefore, any practical approach to obtain or approximate the
optimal solution requires heuristics. LITERATURE REVIEW The Quartet method of
hierarchical clustering was first introduced in (Cilibrasi
et al., 2004). This paper proposed a robust automating music classification
procedure consisting of two steps. The first step consisted of extracting the
Normalized Compression Distances
(NCD) (Li and Vitanyi, 1997) among some considered
pieces of music. The Normalized Compression Distance is a new similarity
metric based on string compression which mimics the ideal performance of Kolmogorov complexity (Li and Vitanyi,
1997). NCD is able to extract consistent pair wise distances among the pieces
of music without considering numerical features related to pitch, rhythm,
harmony, or other intrinsic information, as in the other popular automatic
music classification methods in the literature. The second step consisted of
efficiently visualize the extracted pair wise distances by means a novel and
efficient hierarchical clustering method: the Quartet method. To substantiate the claims
of universality and robustness of this automating classification method,
evidence of other successful applications in areas as diverse as genomics,
virology, languages, literature, handwritten, astronomy and combinations of
objects from completely different domains, were reported in (Cilibrasi and Vitanyi, 2005).
In particular, Cilibrasi and Vitanyi
(2007) reported an interesting application of this theory, consisting of the
automatic extraction of similarities among words and phrases from WWW using
the Google page counts. The WWW is the largest database on earth, and the
context information entered by millions of independent users averages out to
provide automatic semantics of useful quality. In (Cilibrasi
and Vitanyi, 2006), the authors presented the
Quartet method of hierarchical clustering in a more formal way. They showed
the main concepts, components, advantages and disadvantages of the method,
particularly underlying the similarities and differences with respect to other
methods from biological phylogeny. Cilibrasi and Vitanyi (2006) also proved that the Quartet method is
based on an NP-hard graph optimization problem, called minimum quartet tree
cost (MQTC) problem, and provided a Randomized Hill Climbing metaheuristic to obtain approximate solutions. Several
experiments with natural data, like genomic and phylogenetic
data, texts or music, and data of completely different types, were further
presented. ACKNOWLEDGES Eng. Sergio Consoli is
supported by an E.U. Marie Curie Fellowship for Early Stage Researcher
Training (EST-FP6) under grant number MEST-CT-2004-006724 at REFERENCES
[1]
Cilibrasi, R. 2007. Statistical inference through data compression. Ph.d. thesis, Institute for Logic, Language and
Computation, Universiteit van Amsterdam, The
Netherlands. [2]
Cilibrasi, R., A. L. Cruz, S. de Rooij, M. Keijzer, M. Li, P. M. B. Vitanyi. 2007. Complearn. [online].
URL http://www.complearn.org. [3]
Cilibrasi, R., P. M. B. Vitanyi. 2005. Clustering by
compression. IEEE Transactions on
Information Theory 51 (4) 1523-1545. [4]
Cilibrasi, R., P. M. B. Vitanyi. [5]
Cilibrasi, R., P. M. B. Vitanyi. 2007. The google similarity distance. IEEE Transactions on Knowledge and Data Engineering 19 (3)
370-383. [6]
Cilibrasi, R., P. M. B. Vitanyi, R. de Wolf. 2004. Algorithmic clustering
of music based on string compression. Computer
Music Journal 28 (4) 49-67. [7]
Li, M., P. M. B. Vitanyi. 1997. An introduction
to Kolmogorov complexity and its applications.
2nd ed. DOWNLOADS:
INSTANCES FOR THE QUARTET METHOD ·
natural data. ·
data
extracted from the web. USEFUL
LINKS |