When clustering is used for hypothesis generation, it returns the hypotheses that best fit the given data, which might cause overfitting. Hence, it is necessary to validate the resulting hypotheses by testing each one against data that has not been used for clustering. For instance, when past data has been used to generate a set of hypotheses, one can use current data to try to falsify them.

==== Clustering to Improve Quality of Other Techniques ====

A basic assumption of clustering is that objects in a cluster are similar in structure and behaviour, or at least that objects in the same cluster are more similar to each other than to objects of other clusters.

Under this assumption it is reasonable to treat the objects of a cluster as drawn from the same distribution.

Partitioning a data set into several subsets with similar distributions reduces the expected classification error. Assume that attribute $Y$ represents the affinity to a specific class, and let $X$ denote an attribute used to predict the class attribute $Y$. The ranges of both attributes are arbitrary; they might be discrete or continuous, bounded or unbounded. A classifier is then a deterministic function $f: X \rightarrow Y$ that assigns to each value of attribute $X$ a corresponding value of attribute $Y$. Thus, a classifier assumes a deterministic association between the classifying attribute $X$ and the class attribute $Y$.

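As a minimal sketch of this functional view (the attribute, the class labels, and the 0.5 threshold are illustrative assumptions, not taken from the text), a deterministic classifier is simply a function that maps each attribute value to a class label:

```python
# A deterministic classifier f: X -> Y as a plain function.
# X is a numeric attribute, Y a class label; the threshold is arbitrary.
def f(x: float) -> str:
    return "positive" if x >= 0.5 else "negative"

print(f(0.7))
print(f(0.2))
```

Because $f$ is deterministic, every tuple with the same value of $x$ receives the same class, which is exactly the assumption relaxed by the stochastic view.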
Yet, the association between classifying attribute and class attribute is rarely deterministic; in most cases it is uncertain. One can use a stochastic model to express the degree of this uncertainty. Assume that the random variable $\Theta$ represents external influences on the association between attributes $X$ and $Y$; it summarises all non-observable influences that determine the class of a tuple.

The probability function of the triple $(X, Y, \Theta)$ represents the association between class, classifying attribute, and unknown parameters. Its value denotes how likely it is to observe a tuple $(x, y) \in X \times Y$ in a data set. Thus, one can use the probability function to determine the likelihood that a tuple with attribute value $x$ belongs to class $y$.

Yet, the distribution underlying this probability function can be very complex. Moreover, it can be an arbitrary function and need not belong to any known family such as the binomial or normal distribution. Thus, a classifier can only approximate it.

If one partitions the data set into several clusters whose members are very similar, approximation becomes more promising because one can assume the same distribution for all tuples of a cluster. Alternatively, one can search for an approximation for each cluster individually; for instance, the best approximation of the probability function of one cluster might be a binomial distribution, while the best one for another cluster might be a uniform distribution. As each cluster has only a small deviation compared with the total data set, the deviation of a classifier trained on it is also smaller, and the lower the deviation of a classifier, the lower its error. For instance, assume that the deviation of a class $y$ is $\sigma$. Because clustering groups similar tuples together and thereby decreases the deviation of attributes within clusters, the deviation $\sigma'$ of class $y$ within a cluster is typically smaller, too. Decreasing deviation is not a necessary consequence of clustering but a commonly observed phenomenon.

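The claim that clustering typically reduces deviation can be illustrated with a toy data set (the values and the two groups are invented for illustration): the deviation within each well-separated group is far smaller than the deviation over the whole data set.

```python
from statistics import pstdev

# Two well-separated groups of attribute values; a clustering that
# recovers these groups reduces the within-cluster deviation.
cluster_a = [1.0, 1.2, 0.9, 1.1]
cluster_b = [9.0, 9.3, 8.8, 9.1]
whole = cluster_a + cluster_b

sigma = pstdev(whole)        # deviation over the total data set
sigma_a = pstdev(cluster_a)  # deviation within cluster a
sigma_b = pstdev(cluster_b)  # deviation within cluster b

print(sigma, sigma_a, sigma_b)
```

Here both within-cluster deviations are smaller than the overall deviation, matching the argument above; with overlapping groups the effect would be weaker.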
A classification algorithm whose error has a smaller deviation than that of another classification algorithm is superior to it in terms of quality.

If there is a clustering of a data set that significantly reduces the distances within each cluster, then clustering can improve the quality of classification algorithms that classify the data set cluster by cluster.
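A cluster-by-cluster classifier can be sketched as follows; the data, the fixed centroids, and the majority-vote rule per cluster are all illustrative assumptions rather than a method prescribed by the text:

```python
from collections import Counter

# Tuples (x, y): attribute value and class label, forming two groups.
data = [(0.9, "a"), (1.1, "a"), (1.0, "b"),   # group near centroid 1.0
        (9.0, "b"), (9.2, "b"), (8.9, "a")]   # group near centroid 9.0
centroids = [1.0, 9.0]  # assumed cluster centres

def nearest(x):
    """Index of the centroid closest to attribute value x."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))

# Partition the data set cluster by cluster.
clusters = {i: [] for i in range(len(centroids))}
for x, y in data:
    clusters[nearest(x)].append(y)

# A trivial per-cluster classifier: the cluster's majority class.
majority = {i: Counter(ys).most_common(1)[0][0] for i, ys in clusters.items()}

def classify(x):
    """Classify a new tuple via the classifier of its nearest cluster."""
    return majority[nearest(x)]

print(classify(1.2))
print(classify(8.5))
```

In practice one would replace the fixed centroids with a clustering algorithm and the majority vote with a proper per-cluster classifier; the structure of the approach stays the same.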
  
clusteringpurposes.txt · Last modified: 2017/06/11 11:06 by mgoller