===== Expressing Dissimilarity with a Distance Function or a Distance Matrix =====
A clustering algorithm must be able to determine the similarity of tuples in order to assign each tuple to the cluster it is most similar to. Therefore, some kind of measure is required that indicates the similarity or dissimilarity of tuples.

Hence, this section introduces distances, distance functions, and distance matrices as measures to express the dissimilarity of tuples.

It is common practice to express the dissimilarity of tuples as the so-called distance of tuples. The distance between two tuples can be a physically meaningful distance, but it can also be a purely fictive measurement. For instance, if two persons earn €10 and €30 per hour, one person earns €20 more than the other. But if three persons have the values 1.2, 1.4, and 2.0 on an artificial index, we can only determine that the first two persons are more similar to each other than to the third one.

For clustering tuples, there must be a measurement of distances for each attribute and a way to combine the distances of different attributes. Distance functions combine the distances of several continuously scaled and interval-scaled attributes into a common distance. Distance matrices are used in cases where it is impossible to find a suitable distance function. See below for details.
==== Popular Distance Functions ====
A distance function is a function that takes two tuples as its input and returns a real number, the distance, which indicates the dissimilarity of the two tuples.
The Euclidean distance, the Manhattan distance, and the Minkowski distance are popular distance functions for interval-scaled attributes.
The Euclidean distance $d_E$ of two tuples is the length of the difference vector of the vectors representing both tuples in a Euclidean vector space, i.e., if two tuples are represented by the vectors $\vec{x}$ and $\vec{y}$, the Euclidean distance is the length of the vector $\vec{x}-\vec{y}$, which is $$d_E(\vec{x},\vec{y})=||\vec{x}-\vec{y}||=\sqrt{\sum_{i=1}^d (x_i-y_i)^2}.$$
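As a minimal sketch, the Euclidean distance of two equal-length tuples can be computed as follows; the function name and the example tuples are illustrative only:

<code python>
import math

def euclidean_distance(x, y):
    """Length of the difference vector x - y of two equal-length tuples."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
</code>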
+ | |||
+ | |||
+ | |||
The Manhattan distance $d_M$ is the sum of the absolute values of the differences of the tuples in each attribute, i.e. $$d_M(\vec{x},\vec{y})=\sum_{i=1}^d |x_i-y_i|.$$ The name Manhattan distance originates from the way roads in Manhattan are organised: streets run in East-West direction while avenues run from North to South. In analogy to that, to get from point $\vec{x}$ to point $\vec{y}$, one has to pass several blocks in one direction before changing the direction orthogonally. The Euclidean distance, in contrast, takes the straight way from point $\vec{x}$ to point $\vec{y}$.
The Minkowski distance is the generalisation of both the Manhattan distance and the Euclidean distance. The Minkowski distance has a parameter $p$ that determines the exponent applied to the differences of the attribute values of the tuples, i.e. $$d_p(\vec{x},\vec{y})=\sqrt[p]{\sum_{i=1}^d |x_i-y_i|^p}.$$ The Manhattan distance is the Minkowski distance with parameter $p=1$; the Euclidean distance is the Minkowski distance with parameter $p=2$.
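A sketch of the Minkowski distance under the same assumptions as above; choosing $p=1$ or $p=2$ reproduces the Manhattan and the Euclidean distance, respectively:

<code python>
def minkowski_distance(x, y, p):
    """Minkowski distance with exponent p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

print(minkowski_distance((0, 0), (3, 4), 1))  # 7.0 (Manhattan)
print(minkowski_distance((0, 0), (3, 4), 2))  # 5.0 (Euclidean)
</code>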
+ | |||
The specific requirements of the process instance determine which distance function is to be preferred. The Euclidean distance is the most appropriate distance function when analysing geographical data because it represents the real geographical distance between two points. The Euclidean distance is also chosen for many other technical applications with metric data only.
The Manhattan distance is applicable whenever the Euclidean distance is applicable, but the Manhattan distance favours changes in only a single attribute. Let vector $\vec{x}$ be $(0,0)$ and vector $\vec{y}$ be $(2,2)$. Further, let vector $\vec{z}$ be $(0,3)$. According to the Manhattan distance, vector $\vec{z}$ is nearer to vector $\vec{x}$ than vector $\vec{y}$ is, i.e. a distance of $3$ compared to a distance of $4$. According to the Euclidean distance, vector $\vec{y}$ is the nearest vector to vector $\vec{x}$ because the distances are $d_E(\vec{x},\vec{y})=2\sqrt 2\approx 2.8$ and $d_E(\vec{x},\vec{z})=3$, respectively.
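This comparison can be checked numerically with the Minkowski sketch above (illustrative code, reusing the example vectors):

<code python>
x, y, z = (0, 0), (2, 2), (0, 3)

print(minkowski_distance(x, y, 1))  # 4.0
print(minkowski_distance(x, z, 1))  # 3.0   -> Manhattan: z is nearer to x
print(minkowski_distance(x, y, 2))  # ~2.83 -> Euclidean: y is nearer to x
print(minkowski_distance(x, z, 2))  # 3.0
</code>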
+ | |||
The Manhattan distance is also applicable to data sets with ordinal attributes. In such a case, the Manhattan distance of two tuples $x$ and $y$ is the minimal number of steps needed to get from the attribute value combination of $x$ to the attribute value combination of $y$ in the matrix of all potential attribute value combinations, as shown in the table below.
^ attribute value ^ $a_1$ ^ $a_2$ ^ $a_3$ ^
| $b_1$ | | | |
| $b_2$ | | | $y$ |
| $b_3$ | $x$ | $\rightarrow$ | $\uparrow$ |

tuples $x=(a_1,b_3)$, $y=(a_3,b_2)$

Manhattan distance $d_M(x,y)=3$
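One way to compute this, sketched here under the assumption that each ordinal attribute value is encoded by its rank, is to map the values to integer positions and apply the Manhattan distance (the encoding dictionaries and the function name are illustrative):

<code python>
# Hypothetical rank encodings for the two ordinal attributes of the example.
rank_a = {"a1": 0, "a2": 1, "a3": 2}
rank_b = {"b1": 0, "b2": 1, "b3": 2}

def ordinal_manhattan(t, u):
    """Minimal number of single-rank steps between two (a, b) combinations."""
    return abs(rank_a[t[0]] - rank_a[u[0]]) + abs(rank_b[t[1]] - rank_b[u[1]])

print(ordinal_manhattan(("a1", "b3"), ("a3", "b2")))  # 3
</code>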
+ | |||
For comparing objects that consist of a non-empty set of objects, there exists a distance function that is computed as the number of differing items in relation to the number of items in both sets, i.e. $$d_S(X,Y)=\frac{||(X\cup Y) - (X\cap Y)||}{||X\cup Y||},$$ also known as the Jaccard distance. Distance function $d_S$ can be used to cluster transactions of arbitrary length, which is needed when pre-clustering or post-clustering of an association rule analysis is required.
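A minimal sketch of $d_S$ with Python sets; the symmetric difference `x ^ y` corresponds to $(X\cup Y)-(X\cap Y)$ (function name illustrative, union assumed non-empty):

<code python>
def set_distance(x, y):
    """Share of items that occur in only one of the two sets."""
    x, y = set(x), set(y)
    return len(x ^ y) / len(x | y)  # symmetric difference over union

print(set_distance({"bread", "milk"}, {"bread", "butter"}))  # 0.666...
</code>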
+ | |||
The distance $d_S$ is also an appropriate distance function for data sets with categorical attributes because a tuple of a data set with categorical attributes can be interpreted as a set of attribute values with a fixed number of items. If we omit the ordering of the attribute values in the table above, we receive a relation with two categorical attributes. Assuming that tuple $X$ assumes $(a_1, b_2)$ and tuple $Y$ assumes $(a_2, b_2)$, the distance $d_S$ equals $d_S=\frac{||\{a_1,a_2\}||}{||\{a_1,a_2,b_2\}||}=2/3$.
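Interpreting the two tuples as sets of attribute values reproduces this result with the set_distance sketch above:

<code python>
X = {"a1", "b2"}
Y = {"a2", "b2"}
print(set_distance(X, Y))  # 0.666... == 2/3
</code>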
+ | |||
The range of distance $d_S$ is $[0,1]$, where $0$ is the resulting value if and only if both compared sets are identical. Conversely, $1$ is the resulting distance if and only if no element of one set is an element of the other set.
==== Distance Matrix ====
A distance matrix can replace a distance function in cases where no appropriate distance function is available or computing the result of a distance function would be too expensive. For instance, let the attribute “job” be an attribute of interest in table “person”. It is impossible to find a distance function that determines the distance of two different jobs because the similarity of jobs is subjective. One person might think that “nurse” and “doctor” are closely related jobs because nurses and doctors work together in hospitals. Conversely, another person might think that “advocate” and “doctor” are more closely related jobs because both jobs require a university degree.

Although it is impossible to give a distance function that represents the similarity a user associates with some pair of objects, e.g. the similarity of jobs, it is possible to express the user’s association of similarity of objects with a distance matrix.
A distance matrix is a matrix that includes the distances of all combinations of tuples, i.e. in the “jobs” example a distance matrix has an element for each combination of two jobs that indicates how closely related the user assumes these jobs to be.
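For instance, a distance matrix can be represented as a symmetric lookup table; the job names and distance values below are purely illustrative assumptions:

<code python>
# Hypothetical, subjective distances between jobs (0 = identical).
job_distance = {
    ("doctor", "nurse"): 0.2,
    ("advocate", "doctor"): 0.5,
    ("advocate", "nurse"): 0.8,
}

def matrix_distance(job1, job2):
    """Look up the distance of two jobs in the symmetric distance matrix."""
    if job1 == job2:
        return 0.0
    return job_distance[tuple(sorted((job1, job2)))]

print(matrix_distance("nurse", "doctor"))  # 0.2
</code>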
+ | |||
Note that the resulting distance matrix expresses a subjective opinion of a user, or of a group of users provided they agree on the distance values stored in the distance matrix. Consequently, the result of a clustering using such a distance matrix is a subjective result, too. Thus, it is important to present the distance matrix together with the results to prevent users from taking those results for granted.

Using a distance matrix is limited to finite sets of tuples because the matrix must include all combinations of tuples; infinite sets of tuples would require a distance matrix with infinite space. Hence, continuously scaled attributes and ordinal-scaled attributes with unbounded domains prevent the usage of distance matrices.
==== Using Normalisation When Attributes Vary Significantly in Range ====
Especially in cases where one attribute has a much broader range than the others, the attribute with the broader range dominates the result of the clustering because its distances are higher than the distances of the other attributes.

Normalisation of attributes is used to prevent the attribute with the broadest range from dominating the clustering. Normalisation of attributes is a transformation of an attribute such that the ranges of all transformed attributes are approximately identical.
$Z$-normalisation or $z$-scoring is a popular way of normalising attributes. The $z$-score $z$ of an attribute value $x$ is the transformed value we receive by first subtracting the mean $\mu$ of the attribute from the attribute value and then dividing the result of the subtraction by the standard deviation $\sigma$ of that attribute: $$z:=\frac{x-\mu}{\sigma}$$
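A minimal sketch of $z$-scoring a list of attribute values, using the population mean and standard deviation from the Python standard library:

<code python>
from statistics import mean, pstdev

def z_scores(values):
    """Transform values so they have mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

print(z_scores([10, 20, 30]))  # [-1.224..., 0.0, 1.224...]
</code>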
+ | |||
The succeeding subsections introduce partitioning clustering algorithms and hierarchical clustering algorithms, both of which are used in CHAD, a clustering algorithm that combines two types of clustering algorithms to achieve a higher overall performance.