  * the quality of the clustering does not improve between the former and the current iteration; $k$-means, $k$-modes, and EM use this type of stop criterion
  * the estimated cost of a further iteration exceeds its estimated benefit.

$k$-means and EM are part of many commercial data mining tools such as SAS Enterprise Miner, SPSS Clementine, or Teradata Warehouse Miner. Additionally, both algorithms are discussed extensively in the literature. As both algorithms require a Euclidean vector space to operate in, many improvements that exploit the properties of a Euclidean vector space are applicable to both algorithms. Thus, it is sufficient to discuss only one of them in detail.

Due to the use of $k$-means in commercial products and the large number of articles discussing improvements of $k$-means, we will focus on $k$-means to show how the concepts presented in this dissertation can be implemented for a $k$-means application.

The central contribution of this dissertation is to pre-process more tasks than is conventional, for potential future applications such as $k$-means clustering applications. $k$-means serves as a running example demonstrating the concepts presented in this dissertation. It is left to the reader to transfer the way of implementing these concepts shown here to other clustering algorithms.

==== The k-means algorithm ====

The $k$-means clustering algorithm is a partitioning clustering algorithm that partitions a data set into $k$ subsets and returns the centroid of each subset. Each tuple is represented by a vector in a Euclidean vector space. Consequently, only attributes with a continuous, numerical scale are available for $k$-means clustering because other attributes are unsuited to span a Euclidean vector space. The centroid of a cluster is the arithmetic mean of all vectors of that cluster.
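
To illustrate the definition of a centroid as the arithmetic mean of a cluster’s vectors, the following sketch computes it in plain Python. It is only an illustration; the function name and the representation of tuples as Python tuples of floats are our own choices.

<code python>
def centroid(vectors):
    """Arithmetic mean of a non-empty list of equally long vectors (tuples of floats)."""
    n = len(vectors)
    dim = len(vectors[0])
    return tuple(sum(v[d] for v in vectors) / n for d in range(dim))

# Example: the centroid of three two-dimensional tuples
print(centroid([(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]))  # (3.0, 2.0)
</code>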

Parameter $k$, which denotes the number of clusters to be found, is the only parameter of $k$-means. It is also the parameter that gives $k$-means its name.

The general procedure of $k$-means follows that of partitioning clustering algorithms as presented in the previous subsection. The following paragraphs describe how $k$-means implements the generic pattern of a partitioning clustering algorithm.

Like other partitioning clustering algorithms, $k$-means starts with an initial solution, which it improves over several iterations. Here, the initial solution consists of a set of $k$ centroids.
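
The text leaves open how the initial centroids are obtained. One common choice, shown here purely as an assumption and not as the method of this dissertation, is to sample $k$ tuples from the data set.

<code python>
import random

def initial_centroids(tuples, k, seed=None):
    """Pick k tuples of the data set as initial centroids (one common initialisation, not the only one)."""
    rng = random.Random(seed)
    return rng.sample(tuples, k)
</code>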

The distance of a tuple to a cluster’s centroid determines its affiliation to a specific cluster: $k$-means assigns each tuple to the cluster that minimises the distance between tuple and centroid.
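
A sketch of this assignment step is given below. The Euclidean distance follows from the vector-space setting; the function names are ours.

<code python>
import math

def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroid(tuple_, centroids):
    """Index of the centroid that minimises the distance to the given tuple."""
    return min(range(len(centroids)), key=lambda i: distance(tuple_, centroids[i]))
</code>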

$k$-means terminates when no tuples were re-assigned from one cluster to another within an iteration. Hence, the minimum number of iterations is two: one iteration to assign tuples to clusters and a second iteration to detect that the initial assignment was already optimal.

There are two major variants of $k$-means, namely Forgy’s variant and MacQueen’s variant. Both are identical except for the time at which a cluster’s centroid is updated.

Forgy’s variant of $k$-means updates centroids at the end of each iteration. In other words, the location of a centroid remains constant during an iteration. Not so MacQueen’s variant of $k$-means.
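
A minimal sketch of Forgy’s variant, reusing the helper functions centroid, initial_centroids, and nearest_centroid from the sketches above: centroids are recomputed only after all tuples of an iteration have been assigned, and the loop stops once no tuple changes its cluster. Keeping the old centroid of an empty cluster is our own simplification.

<code python>
def kmeans_forgy(tuples, k, seed=None):
    """Sketch of Forgy's variant: centroids are updated at the end of each iteration."""
    centroids = initial_centroids(tuples, k, seed)
    assignment = [None] * len(tuples)
    changed = True
    while changed:                        # stop criterion: no re-assignments within an iteration
        changed = False
        for i, t in enumerate(tuples):
            c = nearest_centroid(t, centroids)
            if c != assignment[i]:
                assignment[i] = c
                changed = True
        # end of iteration: recompute each centroid as the mean of its cluster
        for c in range(k):
            members = [t for i, t in enumerate(tuples) if assignment[i] == c]
            if members:                   # keep the old centroid if a cluster became empty
                centroids[c] = centroid(members)
    return centroids, assignment
</code>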

MacQueen’s variant of $k$-means updates a centroid each time the cluster of that centroid either gains or loses a tuple, i.e. the location of a centroid can change during an iteration. Consequently, updating centroids at the end of an iteration is unnecessary. The continuous update of centroids has a single exception.

During the first iteration, a cluster’s initial centroid is not updated when that cluster gains tuples. This exception is necessary to prevent the tuple that is read first from biasing the result of the clustering. If the algorithm performed the first iteration in the same way as succeeding iterations, then after reading the first tuple it would update the centroid of that tuple’s cluster such that the centroid and the location of the first tuple coincide in the same point, making any initialisation obsolete.
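
A sketch of MacQueen’s variant under the same assumptions as above. The per-cluster member lists and the handling of the first iteration are our own way of expressing the rule described in the text; in particular, recomputing all centroids once at the end of the first iteration is our reading of how the exception connects to the following iterations.

<code python>
def kmeans_macqueen(tuples, k, seed=None):
    """Sketch of MacQueen's variant: a centroid is recomputed whenever its cluster gains or loses a tuple."""
    centroids = initial_centroids(tuples, k, seed)
    members = [[] for _ in range(k)]      # tuples currently assigned to each cluster
    assignment = [None] * len(tuples)
    first_iteration = True
    changed = True
    while changed:
        changed = False
        for i, t in enumerate(tuples):
            c = nearest_centroid(t, centroids)
            old = assignment[i]
            if c != old:
                changed = True
                assignment[i] = c
                members[c].append(t)
                if old is not None:
                    members[old].remove(t)
                    if members[old]:
                        centroids[old] = centroid(members[old])
                if not first_iteration:   # exception: initial centroids stay fixed during the first iteration
                    centroids[c] = centroid(members[c])
        if first_iteration:
            # after the first iteration the centroids are derived from the gained tuples
            centroids = [centroid(m) if m else centroids[c] for c, m in enumerate(members)]
            first_iteration = False
    return centroids, assignment
</code>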

MacQueen’s variant of $k$-means is said to require only a few iterations, typically five to ten. However, some of our tests contradict this observation when the number of tuples exceeds a million and there is no significant concentration of tuples, i.e. the concentration of tuples is only somewhat higher in the vicinity of a centroid than in the rest of the data set. In such cases we observed a few tuples that permanently changed their cluster affiliation from iteration to iteration. As their contribution to the resulting clusters was minor, we terminated $k$-means after a fixed number of iterations when it did not terminate otherwise.

The number of iterations needed by MacQueen’s variant of $k$-means is less than or equal to the number needed by Forgy’s variant because the centroids move to their final positions faster. However, updating a centroid at the end of an iteration can also be beneficial under some circumstances. Schaubschläger has demonstrated in his master’s thesis that Forgy’s variant of $k$-means is easier to compute in a distributed environment because it needs fewer messages between participating computers for communicating intermediate results.
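
To illustrate why Forgy’s variant gets by with little communication, the following sketch shows one plausible message layout, reusing the helpers from above: each participating computer sends only per-cluster vector sums and counts once per iteration, from which a coordinator can recompute exact centroids. The function names and the sum-and-count encoding are our own illustration, not Schaubschläger’s implementation.

<code python>
def local_statistics(local_tuples, centroids):
    """Per-cluster vector sums and counts for one computer's partition (one message per iteration)."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for t in local_tuples:
        c = nearest_centroid(t, centroids)
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += t[d]
    return sums, counts

def merge_statistics(messages, old_centroids):
    """Coordinator: combine the per-computer messages into new centroids."""
    k, dim = len(old_centroids), len(old_centroids[0])
    new_centroids = []
    for c in range(k):
        total = sum(counts[c] for _, counts in messages)
        if total == 0:
            new_centroids.append(old_centroids[c])    # keep the old centroid of an empty cluster
        else:
            new_centroids.append(tuple(sum(sums[c][d] for sums, _ in messages) / total
                                       for d in range(dim)))
    return new_centroids
</code>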