DrGollerResearchWiki


Purposes of Clustering

Clustering for Data Reduction

Clustering algorithms scan a set of data and return a set of clusters. The data set is typically very large, while the number of clusters is small. Most clustering algorithms do not store the data that belong to a cluster but only a small set of features describing each cluster, so the features of a cluster can be used in place of the cluster’s data. In other words, the features that characterise a cluster are a condensed representation of a subset of the data, and the features of all clusters together form a compact representation of the whole data set.
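The condensed representation described above can be sketched in a few lines of Python. This is only an illustration under assumptions: it uses plain k-means as the clustering algorithm, takes the centroid and the cluster size as the describing features, and the customer tuples (age, yearly spend) as well as the names `kmeans` and `squared_distance` are made up for the example.

```python
from statistics import mean


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Plain k-means; returns one small feature record per cluster.

    Only the cluster size and the centroid are returned, not the
    member points, which is what makes the result a data reduction.
    """
    groups = []
    for _ in range(iterations):
        # Assignment step: put every point into the group of its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid).
        centroids = [
            tuple(mean(axis) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return [{"size": len(g), "centroid": c} for g, c in zip(groups, centroids)]


# Hypothetical customer tuples: (age, yearly spend).
customers = [(16, 200), (17, 250), (15, 220), (45, 900), (50, 1000), (47, 950)]

# Assumption: the first and the fourth tuple serve as initial centroids.
features = kmeans(customers, centroids=[customers[0], customers[3]])
for record in features:
    print(record)
```

Here two feature records stand in for six tuples. With a realistic data set the ratio is far larger, because the number of clusters stays small while the number of tuples grows.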

Several typical applications profit from the data-reducing effect of clustering. Customer segmentation is only one of them. Its aim is to group all customers of a company into a few sets of customers, each represented by characteristic features, e.g. by an “average customer”.

Data reduction is a necessary pre-processing step whenever a large set of unique but similar objects is to be used for data mining. Data mining techniques try to find frequently occurring patterns within the data. Yet it is impossible to find frequently occurring patterns if each instance is unique, because unique items can never be frequent. Clustering can solve this problem. Before clustering, a specific tuple cannot be used for data mining because it is unique; after clustering, the same tuple can be used when it is treated as an element of a cluster. Many tuples may be elements of the same cluster even though each tuple is unique. Hence, it becomes possible to find frequent patterns.
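The effect on frequency counts can be made concrete with a small sketch. The customer tuples and the cluster assignment below are invented for illustration; in particular, `cluster_of` is a hard-coded stand-in for the output of a real clustering algorithm.

```python
from collections import Counter

# Hypothetical customer tuples: (name, sex, age). Each tuple is unique.
customers = [
    ("Alice", "f", 16),
    ("Berta", "f", 17),
    ("Carla", "f", 15),
    ("Dieter", "m", 46),
    ("Emil", "m", 51),
]

# Assumed cluster assignment, standing in for a clustering result.
cluster_of = {
    "Alice": "female teenager",
    "Berta": "female teenager",
    "Carla": "female teenager",
    "Dieter": "middle-aged man",
    "Emil": "middle-aged man",
}

# Support of the raw tuples: every tuple occurs exactly once.
raw_support = Counter(customers)

# Support after replacing each tuple by the label of its cluster.
cluster_support = Counter(cluster_of[name] for name, _sex, _age in customers)

print(max(raw_support.values()))
print(cluster_support.most_common())
```

No raw tuple occurs more than once, so no frequency threshold above one can ever be met. The cluster labels, in contrast, repeat and can therefore take part in frequent patterns.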

The following simple example illustrates why clustering is needed as the pre-processing step described above:

Suppose a mail order company wants to analyse which customers buy which kinds of products. The company stores sales data and customer data in the two tables “sales” and “customer”, respectively, of a relational database system. If the company performs an association rule analysis on the “sales” table, it obtains rules of the form “If a customer buys the set of products ‘A’, he/she will probably buy the set of products ‘B’, too”. Yet it is impossible to tell what kind of products a potential customer who has never bought anything is interested in. A customer must have bought at least one item before the company can guess which other goods he/she might be interested in.

Assume that two customers in similar living conditions (i.e. of the same sex, similar age and similar educational and cultural background) also share some of their interests in specific types of products. Yet the tuples representing customers in the “customer” table are, in most cases, unique combinations of attribute values, i.e. usually no two customers have the same name, sex, birthday and address. By clustering the data in the “customer” table, however, the company obtains a set of typical customers, e.g. the female teenager. The resulting clusters can be combined with the result of the association rule analysis. By doing so, the company is able to predict the set of products a potential customer is likely to be interested in. The company might use this knowledge for several purposes, such as marketing campaigns customised to the interests of specific customer segments, e.g. enclosing a catalogue of books that are primarily read by female teenagers in a girls’ magazine.
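How the clusters and the sales data might be combined for a customer without any purchase history is sketched below. The sketch rests on several simplifying assumptions: clusters are described by a centroid over age only, the association rule analysis is replaced by a plain per-cluster product count, and all data as well as the names `nearest_cluster`, `top_products` and `recommend` are hypothetical.

```python
from collections import Counter

# Assumed cluster centroids (age only), as a clustering of the
# "customer" table might yield them.
centroids = {"teenager": 16.0, "middle-aged": 48.0}

# Hypothetical "sales" rows, already joined with the cluster of the buyer:
# (cluster of the buying customer, product category).
sales = [
    ("teenager", "fantasy novel"),
    ("teenager", "fantasy novel"),
    ("teenager", "comic"),
    ("middle-aged", "cookbook"),
    ("middle-aged", "cookbook"),
    ("middle-aged", "fantasy novel"),
]


def nearest_cluster(age):
    """Assign a new customer to the cluster with the closest centroid."""
    return min(centroids, key=lambda name: abs(centroids[name] - age))


def top_products(cluster, n=1):
    """Most frequently bought product categories within one cluster."""
    counts = Counter(product for c, product in sales if c == cluster)
    return [product for product, _count in counts.most_common(n)]


def recommend(age, n=1):
    """Suggest products for a customer who has not bought anything yet."""
    return top_products(nearest_cluster(age), n)


print(recommend(15))
print(recommend(52))
```

The new customer contributes no row to the sales data, yet a suggestion is possible because the purchases of the other members of the same cluster are used instead.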
