Partitioning Clustering Algorithms

This section gives a brief overview of the characteristics of partitioning clustering algorithms. Yet, as the following paragraphs will show, the process of conceptual clarification is still unfinished: the survey below points out several concepts that remain unclear.

Partitioning clustering algorithms partition a set of tuples into several subsets of tuples, i.e. clusters. The number of clusters to be found is typically a user-given parameter, although there exist approaches to determine the optimal number of clusters automatically. Typical partitioning clustering algorithms are $k$-means, CLARANS, and $k$-modes.

Partitioning clustering algorithms try to find partitions that are optimal according to a given criterion. Minimising the sum of squared distances of the tuples to their clusters' centres is a popular criterion; $k$-means is one algorithm that uses it (see below).
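As an illustration of this criterion, the following minimal sketch computes the sum of squared distances of all tuples to their clusters' centres; the function name and array shapes are assumptions made for this example:

```python
import numpy as np

def within_cluster_sum_of_squares(points, labels, centres):
    """Sum of squared Euclidean distances of each tuple to its cluster's centre.

    `points` is an (n, d) array, `labels` an (n,) array of cluster indices,
    `centres` a (k, d) array; names and shapes are illustrative assumptions.
    """
    diffs = points - centres[labels]   # per-tuple offset from its own centre
    return float(np.sum(diffs ** 2))   # the optimality criterion to minimise

# Example: a partition into two tight clusters yields a small value.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
centres = np.array([pts[labels == 0].mean(axis=0), pts[labels == 1].mean(axis=0)])
print(within_cluster_sum_of_squares(pts, labels, centres))  # 0.01
```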

Each cluster of the partitioned data set must contain at least one tuple. Hence, an empty cluster would make the partition invalid.

Ester and Sander additionally demand that a tuple must be contained in exactly one cluster. Yet, according to this demand, the EM clustering algorithm would not be a partitioning clustering algorithm because a tuple can belong to several EM clusters, although Ester and Sander categorise EM as a partitioning clustering algorithm. According to Han's definition of model-based clustering algorithms, EM would be a model-based clustering algorithm. However, several authors consider $k$-means and EM to be closely related algorithms. Descriptions of both $k$-means and EM can be found in the following subsections.

Due to the similarity of EM to partitioning clustering algorithms, we categorise EM as a partitioning clustering algorithm and weaken the condition of Ester and Sander by dropping the requirement that each tuple belong to exactly one cluster.

According to the argumentation above, a partitioning clustering algorithm assigns each tuple to at least one cluster; no tuple may remain unassigned. Some non-partitioning clustering algorithms, such as density-based clustering algorithms, consider only a subset of the data set as relevant. The unconsidered data is called noise. Leaving tuples unclustered is tolerable if the purpose of a given cluster analysis is only to find concentrations in the data. Yet, if the purpose of a cluster analysis is to partition a data set in order to find better classifiers, it is not.

Partitioning clustering algorithms present the result of the clustering as a set of features of the clusters they found. Yet, the type of feature depends on the algorithm used. Therefore, the following subsection surveys the currently available types of partitioning clustering algorithms.

Available Types of Partitioning Clustering Algorithms

The type of feature a partitioning clustering algorithm returns is the factor that best discriminates between partitioning clustering algorithms.

For instance, $k$-means returns the centroids of all clusters; the centroid of a cluster is the arithmetic mean of the vectors representing the tuples of that cluster. As $k$-means assigns each tuple to its nearest mean, the set of means is sufficient to determine the affiliation of a tuple to a specific cluster.
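A minimal sketch of these two steps, assignment to the nearest mean and centroid re-computation, under assumed numpy array shapes:

```python
import numpy as np

def assign_to_nearest_mean(points, means):
    """Return, for each tuple, the index of its nearest centroid."""
    # Pairwise squared distances between all tuples and all means: (n, k).
    d2 = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def recompute_centroids(points, labels, k):
    """Centroid of a cluster = arithmetic mean of the tuples assigned to it.

    Assumes every cluster is non-empty, in line with the requirement above.
    """
    return np.array([points[labels == j].mean(axis=0) for j in range(k)])
```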

Analogously, $k$-medoid algorithms such as PAM and CLARANS return the clusters' medoids, which can be used to determine the cluster affiliation of a tuple. The medoid of a cluster is the tuple of that cluster that minimises the total distance of all other tuples of the cluster to it. In contrast to a centroid, a medoid is an existing tuple.
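A minimal sketch of selecting a medoid, again with assumed array shapes:

```python
import numpy as np

def medoid(cluster_points):
    """Return the tuple of the cluster that minimises the total distance
    to all other tuples of the cluster. Unlike a centroid, the result is
    an existing tuple of the data set."""
    # Pairwise Euclidean distances within the cluster: (m, m).
    d = np.linalg.norm(cluster_points[:, None, :] - cluster_points[None, :, :], axis=2)
    return cluster_points[d.sum(axis=1).argmin()]
```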

$k$-modes is a partitioning clustering algorithm that uses the modus of a cluster as the feature of that cluster. The modus of a cluster consists of the most frequent attribute values of the tuples in that cluster.
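A minimal sketch of a componentwise reading of the modus, using assumed categorical tuples:

```python
from collections import Counter

def cluster_modus(cluster_tuples):
    """Componentwise modus of a cluster of categorical tuples: for every
    attribute, take the most frequent value among the cluster's tuples."""
    return tuple(Counter(column).most_common(1)[0][0]
                 for column in zip(*cluster_tuples))

# Example with two categorical attributes (colour, size):
print(cluster_modus([("red", "s"), ("red", "m"), ("blue", "m")]))  # ('red', 'm')
```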

The Expectation Maximisation (EM) algorithm returns a statistical model consisting of $k$ statistical variables, each with a mean and a standard deviation. Typically, these variables are assumed to be normally distributed, but other distributions could be used. EM is very similar to $k$-means, but EM also takes the deviation of a cluster into account. Where $k$-means assigns a tuple to the cluster with the nearest mean, EM assigns a tuple to the cluster with the greatest posterior probability given the tuple. If the deviations of all clusters were assumed to be identical, $k$-means and EM would return the same means, given that both algorithms use the same initialisation: with identical deviations, the conditions "greatest probability" and "nearest mean" coincide.
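A minimal sketch of the EM assignment step for a one-dimensional Gaussian mixture; the parameter names are assumptions, and a full EM run alternates this step with a parameter-update step:

```python
import numpy as np

def responsibilities(points, means, std_devs, weights):
    """Posterior probability of each cluster given each tuple (the E-step).

    `points` is an (n,) array, `means`, `std_devs`, and `weights` are (k,)
    arrays. With identical deviations and equal weights, the largest
    posterior belongs to the nearest mean, matching the $k$-means rule."""
    x = points[:, None]                                   # (n, 1)
    # Normal densities of every tuple under every cluster: (n, k).
    dens = np.exp(-0.5 * ((x - means) / std_devs) ** 2) / (std_devs * np.sqrt(2 * np.pi))
    joint = weights * dens                                # prior times likelihood
    return joint / joint.sum(axis=1, keepdims=True)       # normalise per tuple
```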

The scale of the attributes typically determines the type of partitioning clustering algorithm to use. As the arithmetic mean of a cluster is only available for continuously scaled attributes, $k$-means is limited to data sets with continuously scaled attributes. Due to EM's similarity with $k$-means, the same argumentation holds for EM. Determining the distances to a medoid requires attributes that are at least ordinally scaled; thus, $k$-medoid algorithms are limited to ordinally or continuously scaled attributes. $k$-modes is applicable to data sets with any type of scale, but the expressiveness of the modus of a cluster is too low for many applications.

General Functioning of Partitioning Clustering Algorithms

Although partitioning clustering algorithms differ in the type of result they return, they share a common pattern of computing these results. This section presents the common functioning of partitioning clustering algorithms.

Partitioning clustering algorithms improve an initial solution over several iterations. Algorithm 1 below shows the pseudo code of a generalised prototype of a partitioning clustering algorithm. A set of features represents the current solution of a partitioning clustering algorithm. Depending on the type of algorithm, as presented in the previous subsection, these features are either centroids, medoids, modes, or statistical variables of clusters. In each iteration, the algorithm assigns tuples to clusters and re-computes the features of the current solution.

The way a partitioning algorithm finds an initial solution (the initialisation step of Algorithm 1) is not fixed. Hence, there exist several heuristics for finding good initial solutions. Applying the algorithm to a small sample is a very common technique to find an initial solution: the resulting solution of the run on the sample becomes the initial solution of the second run on the original data set.
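A minimal sketch of this sampling heuristic, assuming a hypothetical `cluster_fn` that runs the actual algorithm on a data set and returns its features:

```python
import numpy as np

def initial_solution_from_sample(points, k, sample_size, cluster_fn, rng=np.random):
    """Heuristic initialisation: run the clustering algorithm on a small
    random sample and use the resulting features as the initial solution
    for the run on the full data set. `cluster_fn` is an assumed callable
    that clusters a data set into k clusters and returns the features."""
    idx = rng.choice(len(points), size=sample_size, replace=False)
    return cluster_fn(points[idx], k)
```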

Some algorithms re-order the steps shown in Algorithm 1 to enhance performance. For instance, the MacQueen variant of $k$-means performs the update of the features right after re-assigning a tuple to a different cluster, as the following sketch illustrates.
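A minimal sketch of such an immediate update, assuming numpy arrays and hypothetical bookkeeping of cluster sizes; the incremental-mean formula avoids re-computing all centroids from scratch:

```python
def macqueen_reassign(point, old_label, new_label, means, counts):
    """Move one tuple between clusters and update only the two affected
    centroids via the incremental-mean formula, instead of re-computing
    all features in a separate pass. Assumes numpy arrays and that the
    old cluster keeps at least one tuple after the move."""
    counts[old_label] -= 1
    means[old_label] -= (point - means[old_label]) / counts[old_label]  # remove tuple
    counts[new_label] += 1
    means[new_label] += (point - means[new_label]) / counts[new_label]  # add tuple
```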

Algorithm 1: Generalised prototype of a partitioning clustering algorithm
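A minimal Python sketch of this generalised prototype, with assumed helper callables `initialise`, `assign`, `update_features`, and `stop` standing in for the concrete steps of the algorithms surveyed above:

```python
def partitioning_clustering(points, k, initialise, assign, update_features, stop):
    """Generalised prototype: improve an initial solution over several
    iterations. The features are centroids, medoids, modes, or statistical
    variables, depending on the concrete algorithm; all helper callables
    are assumptions for illustration."""
    features = initialise(points, k)        # initialisation step (see above)
    iteration = 0
    while True:
        labels = assign(points, features)   # assign each tuple to a cluster
        features = update_features(points, labels, k)  # re-compute the features
        iteration += 1
        if stop(points, labels, features, iteration):  # stop criterion (see below)
            return labels, features
```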

Partitioning clustering algorithms iterate until a stop criterion is met. Typical stop criteria of partitioning clustering algorithms are listed below; a sketch combining two of them follows the list.

  • a given number of iterations is complete: CLARANS has this type of stop criterion; additionally, several implementations of algorithms in commercial systems use a maximum number of iterations to guarantee termination within a reasonable amount of time
  • the quality of the solution does not improve from one iteration to the next: $k$-means, $k$-modes, and EM have this type of stop criterion
  • the estimated cost of a further iteration exceeds its estimated benefit.
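A minimal sketch combining the first two criteria, assuming a hypothetical `quality_fn` to be minimised (such as the sum-of-squared-distances function above) and a `tolerance` below which an improvement counts as stagnation:

```python
def make_stop_criterion(max_iterations, tolerance, quality_fn):
    """Build a stop criterion for the prototype above: terminate after a
    maximum number of iterations (guaranteeing termination) or as soon as
    the quality stops improving between iterations."""
    state = {"last_quality": None}

    def stop(points, labels, features, iteration):
        quality = quality_fn(points, labels, features)
        improved = (state["last_quality"] is None
                    or state["last_quality"] - quality > tolerance)
        state["last_quality"] = quality
        return iteration >= max_iterations or not improved

    return stop
```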