Partitioning Clustering Algorithms

This section gives a brief overview of the characteristics of partitioning clustering algorithms. As the following paragraphs will show, the conceptual clarification of this class of algorithms is still unfinished; the survey therefore also points out concepts that remain unclear.

Partitioning clustering algorithms partition a set of tuples into several subsets of tuples, i.e. clusters. The number of clusters to be found is typically a user-given parameter, although there are approaches to determine the optimal number of clusters automatically. Typical partitioning clustering algorithms are $k$-means, CLARANS, and $k$-modes.

Partitioning clustering algorithms try to find partitions according to a given criterion of optimality. Minimising the sum of squared distances to the clusters’ centres is a popular criterion that partitioning clustering algorithms use; $k$-means is one such algorithm (see the section on $k$-means below).

Each subset of the partitioned data set must contain at least one tuple. Hence, an empty cluster is not a valid partition.

Ester and Sander additionally demand that a tuple must be contained in exactly one cluster. Yet, according to this demand the EM clustering algorithm would not be a partitioning clustering algorithm, because in EM a tuple can belong to several clusters, although Ester and Sander categorise EM as a partitioning clustering algorithm. According to Han’s definition of model-based clustering algorithms, EM would be a model-based clustering algorithm. However, several authors consider $k$-means and EM to be closely related algorithms. Descriptions of both $k$-means and EM can be found in the following subsections.

Due to EM’s similarity to partitioning clustering algorithms, we categorise EM as a partitioning clustering algorithm and weaken the condition of Ester and Sander by dropping the requirement that a tuple must be assigned to exactly one cluster.

According to the argumentation above, a partitioning clustering algorithm assigns each tuple to at least one cluster. This requirement excludes tuples remaining unassigned. Some non-partitioning clustering algorithms, such as density-based clustering algorithms, consider only a subset of the data set as relevant. The disregarded data is called noise. Leaving tuples unclustered is tolerable if the purpose of a given cluster analysis is only to find concentrations in the data. Yet, if the purpose of a cluster analysis is to partition a data set in order to find better classifiers, it is not.

Partitioning clustering algorithms present the result of the clustering as a set of features of the clusters they found. Yet, the type of feature depends on the algorithm that was used. Therefore, the following subsection surveys the different types of partitioning clustering algorithms that are currently available.

Available Types of Partitioning Clustering Algorithms

The type of feature a partitioning clustering algorithm returns is the factor that discriminates best between the different types of partitioning clustering algorithms.

For instance, $k$-means returns the centroids of all clusters; the centroid of a cluster is a vector that is the arithmetic mean of the vectors representing the tuples of that cluster. As $k$-means assigns each tuple to its nearest centroid, the set of centroids is sufficient to determine the affiliation of a tuple to a specific cluster.
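A minimal Python sketch of these two operations could look as follows; the function names centroid and nearest_centroid as well as the list-of-vectors data layout are chosen here purely for illustration:

  import math

  def centroid(cluster):
      # Arithmetic mean of the vectors of a cluster, computed per dimension.
      dims = len(cluster[0])
      return [sum(t[d] for t in cluster) / len(cluster) for d in range(dims)]

  def nearest_centroid(tuple_, centroids):
      # Index of the centroid with the smallest Euclidean distance to the tuple.
      return min(range(len(centroids)),
                 key=lambda j: math.dist(tuple_, centroids[j]))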

In an analogous way, $k$-medoid algorithms such as PAM and CLARANS return the clusters’ medoids, which can be used to determine the cluster affiliation of a tuple. A medoid of a cluster is a tuple of that cluster that minimises the total distance of all other tuples of that cluster to it. In contrast to a centroid, a medoid is an existing tuple of the data set.
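A medoid can be determined, for example, by an exhaustive search over the tuples of a cluster. The following sketch assumes Euclidean distances for simplicity, although $k$-medoid algorithms can work with arbitrary dissimilarity measures:

  import math

  def medoid(cluster):
      # The tuple of the cluster that minimises the sum of distances
      # to all other tuples of that cluster.
      return min(cluster,
                 key=lambda m: sum(math.dist(m, t) for t in cluster))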

$k$-modes is a partitioning clustering algorithm that uses the mode of a cluster as the feature of that cluster. The mode of a cluster combines, for each attribute, the most frequent attribute value among the tuples of that cluster.
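Under this per-attribute reading of the mode, a mode can be computed with a simple frequency count; the function name and data layout below are chosen for illustration:

  from collections import Counter

  def mode(cluster):
      # For each attribute, the most frequent value among the tuples of the cluster.
      num_attributes = len(cluster[0])
      return tuple(Counter(t[a] for t in cluster).most_common(1)[0][0]
                   for a in range(num_attributes))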

The Expectation Maximisation (EM) algorithm returns a statistical model consisting of $k$ statistical variables, each with a mean and a standard deviation. Typically, these variables are assumed to be normally distributed, but other distributions could be used alternatively. EM is very similar to $k$-means, but EM also takes the deviation of a cluster into account. Where $k$-means assigns a tuple to the cluster with the nearest mean, EM assigns a tuple to the cluster with the greatest posterior probability. If the deviations of all clusters were assumed to be identical, $k$-means and EM would return the same means, given that both algorithms use the same initialisation: if the deviation is identical for all clusters, the conditions “greatest probability” and “nearest cluster” coincide.
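Using standard mixture-model notation (chosen here for illustration and not taken from the works cited above), the posterior probability that tuple $x_i$ belongs to cluster $j$ of a univariate Gaussian mixture with mixing weights $\pi_j$, means $\mu_j$, and variances $\sigma_j^2$ is

  $\gamma_{ij} = \dfrac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}{\sum_{l=1}^{k} \pi_l \, \mathcal{N}(x_i \mid \mu_l, \sigma_l^2)}$

If all $\sigma_j$ and all $\pi_j$ are equal, the cluster with the greatest $\gamma_{ij}$ is exactly the cluster with the nearest mean, which is the assignment rule of $k$-means.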

The scale of attributes typically determines the type of partitioning clustering algorithm to use. As an arithmetic mean of a cluster is only available for continuously scaled attributes, $k$-means is limited to data sets with continuously scaled attributes. Due to EM’s similarity to $k$-means, the same argumentation holds for EM. Again, determining the distances to a medoid requires attributes that are at least ordinally scaled. Thus, $k$-medoid algorithms are limited to ordinally or continuously scaled attributes. $k$-modes is applicable to data sets with any type of scale, but the expressiveness of the mode of a cluster is too low for many applications.

General Functioning of Partitioning Clustering Algorithms

Although partitioning clustering algorithms differ in the type of result they return, they share a common pattern of how they compute their results. This section presents the common functioning of partitioning clustering algorithms.

Partitioning clustering algorithms improve an initial solution in several iterations. The algorithm below shows the pseudo code of a generalised prototype of a partitioning clustering algorithm. A set of features represents the current solution of a partitioning clustering algorithm. Depending on the type of algorithm as presented in the previous subsection, these features are either centroids, medoids, modes, or statistical variables of clusters. In each iteration, the algorithm assigns tuples to clusters and re-computes the features of the current solution.

The way a partitioning clustering algorithm finds an initial solution (the initialisation step of the algorithm below) is not fixed. Hence, there exist several heuristics to find good initial solutions. Applying the algorithm to a small sample is a very common technique for finding an initial solution: the resulting solution of the run on the sample becomes the initial solution of the second run on the original data set.

Some algorithms re-order the steps of the algorithm below to enhance performance. For instance, MacQueen’s variant of $k$-means performs the update of the features right after re-assigning a tuple to a different cluster instead of at the end of an iteration.

Algorithm: generalised prototype of a partitioning clustering algorithm (pseudo code)
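The pseudo code can be sketched in Python as follows; assign and update are placeholders for the algorithm-specific assignment and feature-update steps, and max_iterations is an illustrative safeguard rather than a part of every algorithm:

  def partitioning_clustering(data, initial_features, assign, update, max_iterations):
      # Generic pattern: iteratively improve an initial set of cluster features
      # (centroids, medoids, modes, or statistical variables).
      features = initial_features
      assignment = None
      for _ in range(max_iterations):
          # Assign every tuple to a cluster based on the current features.
          new_assignment = [assign(t, features) for t in data]
          # Stop criterion: no tuple changed its cluster.
          if new_assignment == assignment:
              break
          assignment = new_assignment
          # Re-compute the features of the current solution.
          features = update(data, assignment)
      return features, assignment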

Partitioning clustering algorithms iterate until a stop criterion is met. Typical stop criteria of partitioning clustering algorithms are the following (a small sketch of how such criteria can be combined is given after the list):

  • a given number of iterations has been completed; CLARANS has this type of stop criterion, and several implementations of algorithms in commercial systems additionally use a maximum number of iterations to guarantee termination in a reasonable amount of time
  • the quality of the solution does not improve between the former and the current iteration; $k$-means, $k$-modes, and EM have this type of stop criterion
  • the estimated cost of a further iteration exceeds its estimated benefit.
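Such criteria are often combined. The following sketch stops either after a maximum number of iterations or when the quality of the solution no longer improves; the function name and the tolerance value are chosen here for illustration:

  def should_stop(iteration, max_iterations, previous_quality, current_quality, tolerance=1e-6):
      # Stop after a fixed number of iterations ...
      if iteration >= max_iterations:
          return True
      # ... or when the quality no longer improves between iterations.
      if previous_quality is not None and abs(previous_quality - current_quality) <= tolerance:
          return True
      return False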

$k$-means and EM are part of many commercial data mining tools such as SAS Enterprise Miner, SPSS Clementine, or Teradata Warehouse Miner. Additionally, both algorithms are broadly discussed in the literature. As both algorithms require a Euclidean vector space to operate in, many improvements exploiting the conditions of a Euclidean vector space are applicable to both algorithms. Thus, it is sufficient to discuss only one of them in detail.

Due to the usage of $k$-means in commercial products and the large set of articles discussing improvements of $k$-means, we will focus on $k$-means to show how the concepts presented in this dissertation can be implemented for a $k$-means application.

The significant contribution of this dissertation is to pre-process more tasks than in the conventional way for potential future applications such as $k$-means clustering applications. $k$-means serves as a running example demonstrating the concepts presented in this dissertation. It is left to the reader to transfer the way of implementing these concepts shown herein to other clustering algorithms.

The $k$-means algorithm

The $k$-means clustering algorithm is a partitioning clustering algorithm that partitions a set of data into $k$ subsets and returns the centroid of each subset. Each tuple is represented by a vector in a Euclidean vector space. Consequently, only attributes with a continuous, numerical scale are available for $k$-means clustering because other attributes cannot span a Euclidean vector space. The centroid of a cluster is the arithmetic mean of all vectors of that cluster.

Parameter $k$, which denotes the number of clusters to be found, is the only parameter of $k$-means. It is also the parameter that gives $k$-means its name.

The general procedure of $k$-means follows the procedure of partitioning clustering algorithms as presented in the previous subsection. The following paragraphs describe how $k$-means implements the generic pattern of a partitioning clustering algorithm.

Like other partitioning clustering algorithms, $k$-means starts with an initial solution which it improves in several iterations. Here, the initial solution consists of a set of $k$ centroids.

The distance of a tuple to a cluster’s centroid determines the affiliation to a specific cluster: $k$-means assigns each tuple to the cluster that minimises the distance between the tuple and the cluster’s centroid.
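With centroids $\mu_1, \dots, \mu_k$ (notation chosen here for illustration), this assignment rule can be written as

  $c(x) = \arg\min_{j \in \{1, \dots, k\}} \lVert x - \mu_j \rVert$

where $c(x)$ denotes the cluster to which tuple $x$ is assigned.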

$k$-means terminates when no tuple was re-assigned from one cluster to another cluster within an iteration. Hence, the minimum number of iterations is two: one iteration to assign tuples to clusters and a second iteration to detect that this assignment remains stable.

There are two major variants of $k$-means, namely Forgy’s variant and MacQueen’s variant. Both are identical except for the point in time at which a cluster’s centroid is updated.

Forgy’s variant of $k$-means updates the centroids at the end of each iteration. In other words, the location of a centroid is constant during an iteration. This is not the case for MacQueen’s variant of $k$-means.

MacQueen’s variant of $k$-means updates a centroid each time the cluster of that centroid either gains or loses a tuple, i.e. the location of a centroid can change during an iteration. Consequently, updating the centroids at the end of an iteration is unnecessary. The continuous update of centroids has a single exception.

During the first iteration, a cluster’s initial centroid is not updated when that cluster gains tuples. This exception is necessary to prevent the tuple that is read first from biasing the result of the clustering. If the algorithm performed the first iteration in the same way as the succeeding iterations, then after reading the first tuple it would update the centroid of that tuple’s cluster such that the centroid and the location of the first tuple coincide in the same point, making any initialisation obsolete.
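A minimal sketch of the incremental centroid update (a running mean; the function names and the list-based vectors are chosen here for illustration, and clusters are assumed to keep at least one tuple) could look as follows:

  def gain_tuple(centroid, size, x):
      # Update the centroid incrementally when its cluster gains tuple x.
      size += 1
      centroid = [c + (xi - c) / size for c, xi in zip(centroid, x)]
      return centroid, size

  def lose_tuple(centroid, size, x):
      # Update the centroid incrementally when its cluster loses tuple x.
      size -= 1
      centroid = [(c * (size + 1) - xi) / size for c, xi in zip(centroid, x)]
      return centroid, size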

MacQueen’s variant of $k$-means is said to require only a few iterations, typically five to ten. However, some of our tests contradict this observation when the number of tuples exceeds a million and there is no significant concentration of tuples, i.e. the concentration of tuples is only somewhat higher in the vicinity of a centroid than in the rest of the data set. In such a case we observed a few tuples that permanently changed their affiliation to clusters from iteration to iteration. As their contribution to the resulting clusters was minor, we terminated $k$-means after a fixed number of iterations when it did not terminate otherwise.

The number of iterations needed by MacQueen’s variant of $k$-means is less than or equal to the number of iterations needed by Forgy’s variant because the centroids move to their final positions faster. However, updating the centroids at the end of an iteration can also be beneficial under some circumstances. Schaubschläger has demonstrated in his master’s thesis that Forgy’s variant of $k$-means is easier to compute in a distributed environment because it needs fewer messages between the participating computers for communicating intermediate results.
