Assume that two customers who are in a similar living condition, i.e. who have the same sex, a similar age, and a similar educational and cultural background, also share some of their interests concerning specific types of products. Yet, the tuples representing customers in the “customer” table are in most cases unique combinations of attribute values; usually, no two customers have the same name, sex, birthday, and address. But when clustering the data in the “customer” table, the company receives a set of typical customers, e.g. the female teenager. The resulting clusters can be combined with the result of the association rule analysis. By doing so, the company is able to predict the set of products a potential customer is interested in. The company might use this knowledge for several purposes such as marketing events customised to the interests of specific customer segments, e.g. enclosing a catalogue with books that are primarily read by female teenagers in a girls’ magazine.
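A minimal sketch of the clustering step only, assuming a hypothetical “customer” table with the attributes sex, age, and education; the column names, the choice of $k$-means, and the number of clusters are illustrative assumptions, and the combination with association rules is not shown.

<code python>
# Sketch: derive "typical customers" by clustering a hypothetical customer table.
# Column names, the encoding, and the number of clusters are illustrative assumptions.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the "customer" table.
customers = pd.DataFrame({
    "sex":       ["f", "f", "m", "m", "f", "m"],
    "age":       [16, 17, 45, 52, 15, 38],
    "education": ["school", "school", "university", "university", "school", "vocational"],
})

# One-hot encode categorical attributes and scale age so that no attribute dominates.
features = pd.get_dummies(customers, columns=["sex", "education"])
features["age"] = StandardScaler().fit_transform(features[["age"]]).ravel()

# Each cluster mean describes one typical customer, e.g. "the female teenager".
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
customers["segment"] = kmeans.labels_
print(customers.sort_values("segment"))
</code>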
+ | |||
+ | ==== Clustering for Hypothesis Generation ==== | ||
+ | |||
+ | Some features describing a cluster are suited for being interpreted as a statistical hypothesis. Assume that the application of a partitioning clustering algorithm, see Section for details, such as $k$-means has returned three clusters with means $\vec{\mu}_1, | ||
+ | \vec{\mu}_2,$ and $\vec{\mu}_3$. $k$-means is a partitioning clustering algorithm that partitions a set of data in $k$ disjoint subsets and returns the mean of each subset—$k$ is a user-given parameter. When the best value of parameter $k$ is unknown, $k$-means is iterated several times to determine the best value of parameter $k$. A potential hypothesis is that there are three independent statistical variables $\vec{X}_1, \vec{X}_2,$ and $\vec{X}_3$ with means $\vec{\mu}_1, \vec{\mu}_2,$ and $\vec{\mu}_3$. Frequent co-occurrences of attribute values can be source of hypothesis, too. For instance, if the mean vector $\vec{\mu}_1$ has the dimensions “income” and “total sales per month”, then a small value of the mean in dimension “income” and a high value of the mean in dimension “total sales per month” means that persons with low own income buy a lot of things. | ||
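The text leaves open how the best value of $k$ is identified; a common approach is to score each candidate $k$ with a cluster-quality measure. The following sketch assumes the silhouette coefficient as that measure and uses synthetic toy data; both choices are illustrative, not prescribed by the text.

<code python>
# Sketch: run k-means for several candidate values of k and keep the best run.
# The silhouette coefficient as quality criterion and the toy data are assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # toy data

best_k, best_score, best_means = None, -1.0, None
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    score = silhouette_score(X, km.labels_)
    if score > best_score:
        best_k, best_score, best_means = k, score, km.cluster_centers_

# The means of the best run can be read as a hypothesis: best_k independent
# statistical variables whose means are the returned cluster means.
print(best_k, best_means.round(2))
</code>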
+ | |||
+ | It is common practise in statistics to generate a hypothesis before testing whether data supports it or not. Otherwise, the hypotheses would be trimmed to fit a specific data set that might cause the so-called overfitting problem. Overfitting occurs when the underlying statistical model of a hypothesis fits almost exactly to the data that has been used to generate that model but only poorly fits to other data having the same schema. | ||
+ | |||
+ | When clustering is used for hypothesis generation, it returns those hypotheses that best fit the given data—which might cause overfitting. Hence, it is necessary to validate the resulting hypotheses by testing each hypothesis with a set of data that has not been used for clustering. For instance, when one has used the data of the past to determine a set of hypotheses, one can use current data in order to try to falsify those hypotheses. | ||
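As a minimal sketch of such a validation step, assume the hypothesis is that a cluster mean found on past data also describes the current data of that segment; the one-sample $t$-test, the 5% significance level, and all numbers below are illustrative assumptions.

<code python>
# Sketch: test a cluster-derived hypothesis on data not used for clustering.
# The one-sample t-test, the 5% level, and all numbers are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothesis from past data: cluster mean in dimension "total sales per month".
mu_hypothesis = 120.0

# Current data of the same customer segment, not used for clustering.
current_sales = rng.normal(loc=118.0, scale=15.0, size=40)

# H0: the current data has mean mu_hypothesis.
t_stat, p_value = stats.ttest_1samp(current_sales, popmean=mu_hypothesis)
print("hypothesis falsified" if p_value < 0.05 else "hypothesis not rejected",
      "(p = %.3f)" % p_value)
</code>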
+ | |||
+ | ==== Clustering to Improve Quality of Other Techniques ==== | ||
+ | |||
+ | A basic assumption of clustering is that objects in a cluster are similar in structure and behaviour—or, if not, then the behaviour and structure of objects in a cluster are at least more similar to each other than an object of that cluster is to an object of another cluster. | ||
+ | |||
+ | According to this basic assumption it is valid to assume the same distribution of objects in a cluster. | ||
+ | |||
+ | Partitioning a data set in several subsets with similar distribution reduces the expected error of classification. Assume that attribute $Y$ represents the affinity to a specific class. Further, let $X$ denote an attribute that is used to predict the class attribute $Y$. The ranges of both attributes are arbitrary—they might be discrete or continuous, limited or unlimited. Then, a classifier is a deterministic function $f:X\rightarrow Y$ that assigns each value of attribute $X$ an according value of attribute $Y$. Thus, a classifier assumes a deterministic association between the classifying attribute $X$ and the class attribute $Y$. | ||
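For illustration only, such a classifier is nothing more than a total function from attribute values to class labels; the attribute, the threshold, and the labels below are made up.

<code python>
# Sketch of a classifier as a deterministic function f: X -> Y.
# The attribute ("income"), the threshold, and the class labels are made-up examples.
def f(income: float) -> str:
    """Assign every value of X (income) exactly one value of Y (customer class)."""
    return "high-volume buyer" if income < 20000 else "occasional buyer"

print(f(15000))  # -> high-volume buyer
</code>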
+ | |||
+ | Yet, the association between classifying attribute and class attribute is rarely deterministic but uncertain in most times. One can use a stochastic model to express the degree of this uncertainty. Assume that the statistical variable $\Theta$ represents external influences on the association between attributes $X$ and $Y$. Thus, it summarises all non-observable influences that determine the class of a tuple. | ||
+ | |||
+ | The probability function of of the triple $(X,Y,\Theta)$ represents the association between class, classifying attributes, and unknown parameters. The outcome of this function denotes how likely it is to have a tuple $(x,y)\in X\times Y$ in a data set. Thus, one can determine the likelihood that a tuple with attribute value $x$ is part of class $y$ using the probability function. | ||
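One way to make this concrete, under the additional assumptions that $Y$ is discrete and that $\Theta$ is continuous and can be marginalised out, is to write this likelihood as a conditional probability derived from the joint probability function $p(x,y,\theta)$:

$$ P(Y=y \mid X=x) = \frac{\int_{\Theta} p(x,y,\theta)\,d\theta}{\sum_{y'\in Y}\int_{\Theta} p(x,y',\theta)\,d\theta} $$

A classifier that approximates the probability function well therefore also approximates this conditional class likelihood well.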
+ | |||
+ | Yet, the distribution of the probability function can be very complex. Moreover, the distribution can be an arbitrary function and no known function such as binomial or normal distribution. Thus, a classifier can only approximate it. | ||
+ | |||
+ | If one partitions the data set into several clusters which are very similar, then approximating is more promising because one can assume the same distribution for all tuples of a cluster. Or, one can search approximations for each cluster individually—for instance, the best approximation for a the probability function of a cluster might be binomial distribution while the best one of another cluster might be uniform distribution. As these clusters have only a small deviation compared with the total data set, the deviation of a classifier is also smaller. Yet, the lower the deviation of a classifier is the lower is the error. For instance, assume that the deviation of a class $y$ is $\sigma$. As clustering decreases the deviation of attributes in clusters because it groups those tuples into the same cluster which are similar to each other, the deviation $\sigma'$ of class $y$ is typically smaller, too. Decreasing deviation is no necessary condition but a commonly-observed phenomenon. | ||
+ | |||
+ | A classification algorithm that has a smaller deviation of error than another classification algorithm is superior to this other classification algorithm in terms of quality. | ||
+ | |||
+ | If there is a clustering of a data set that significantly reduces the distances within a cluster, then clustering can improve quality of classification algorithms classifying the data set cluster by cluster. | ||
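A minimal sketch of this cluster-by-cluster scheme, assuming synthetic data, $k$-means with $k=3$, and decision trees as the classifier (all of which are illustrative choices, not prescribed by the text):

<code python>
# Sketch: classify a data set cluster by cluster and compare with one global classifier.
# The synthetic data, k = 3, and decision trees are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: a single classifier for the whole data set.
global_clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("global accuracy:", accuracy_score(y_test, global_clf.predict(X_test)))

# Cluster the data first, then train one classifier per cluster.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
per_cluster = {
    c: DecisionTreeClassifier(max_depth=3, random_state=0).fit(
        X_train[kmeans.labels_ == c], y_train[kmeans.labels_ == c])
    for c in range(3)
}

# At prediction time, route each tuple to the classifier of its cluster.
test_clusters = kmeans.predict(X_test)
y_pred = np.array([per_cluster[c].predict(x.reshape(1, -1))[0]
                   for c, x in zip(test_clusters, X_test)])
print("cluster-wise accuracy:", accuracy_score(y_test, y_pred))
</code>

Whether the cluster-wise variant actually wins depends on how much the clustering reduces the within-cluster deviation, which is exactly the condition stated above.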