\section{Introduction}

Knowledge discovery in databases (KDD) and data mining offer
enterprises and other large organisations a set of methods to
discover previously unknown relations among data---which are often
called patterns, e.g. \cite{han}\cite{Frawley}\cite{ester}. With
these patterns, enterprises gain deeper insight into their context
and are able to use them for better decision-making.

Frawley, Piatetsky-Shapiro and Matheus define \emph{knowledge
discovery} as follows: ``Knowledge discovery is the nontrivial
extraction of implicit, previously unknown, and potentially useful
information from data''~\cite[p. 3]{Frawley}. It is common practice
to use the term \emph{knowledge discovery in databases}, or \kdd
for short, instead, to stress the most common use case. Yet, it is
convenient to use the term \kdd also for knowledge discovery in
data that are not kept in databases but in texts or data streams.

The term \emph{data mining} denotes the step of the \kdd process
in which sophisticated methods analyse the data for patterns of
interest \cite{han}\cite{ester}. The techniques used in this step
are further called \emph{data mining techniques}. Clustering,
classification, and association rule mining are the most commonly
used data mining techniques.

The quality of the patterns found by data mining algorithms and
the time needed to find them are the measures of interest in \kdd
projects. The quality of a found pattern determines its
usability---for instance, to guide a decision. Additionally, when
mining data in large databases, time is an issue, too.

This dissertation focusses on the problems of pattern quality and
runtime in \kdd projects. Recently, many approaches have been
introduced to improve specific \kdd analyses. However, these
approaches try to optimise a single analysis.

In contrast to existing work, this dissertation presents a novel
way to improve \kdd projects in terms of shorter project time and
higher quality of results by optimising a set of analyses instead
of optimising each analysis individually.

Therefore, the remainder of this section is organised as follows:
to be able to show the open issues of past work, it first
introduces the \kdd process as it is. It then shows where the open
issues of the \kdd process lie. Last but not least, this section
gives the basic idea of how to improve the \kdd process in terms
of runtime and quality of results.

\begin{figure}
\centering
\includegraphics[width=12cm]{kdd.eps}
\caption{KDD process} \label{kdd process}
\end{figure}

\subsection{A First Introduction into the \kdd Process}

This subsection gives a short introduction into the \kdd process.
It provides the information that is necessary to show the open
issues which are the topic of the next subsection. A more detailed
discussion of the \kdd process will follow in section \ref{kdd
process discussion}.

Figure \ref{kdd process} shows the \kdd process with all its
phases. The \kdd process consists of the phases pre-processing,
data mining, and evaluating the resulting patterns. Using
evaluated patterns for decision-making is not part of the \kdd
process, but Figure \ref{kdd process} contains this item to denote
that the \kdd process does not exist per se but is embedded in a
broader economic context.

The first step in a \kdd process is selecting data that might
contain patterns one is interested in. Some data mining algorithms
require as input data that are consistent with a specific data
schema. The \emph{pre-processing} phase consists of all activities
that select or transform data according to the needs of a data
mining analysis.

In the \emph{data mining} phase a data mining expert---further
referred to as the analyst---analyses the data using sophisticated
data mining algorithms. The outcome of these algorithms is a set
of patterns that contain potentially new knowledge.

\begin{figure}
\centering
\includegraphics[width=12cm]{iteratingnecessary.eps}
\caption{Issue A: Insufficient results necessitate additional
iterations} \label{iteratingnecessary}
\end{figure}

The resulting patterns must be evaluated before they can be used.
This happens during the \emph{evaluating} phase, in which an
analyst tests whether the patterns are new, valid and interesting
or not.

If there are no patterns that fulfil all of the above conditions,
the \kdd process or a part of it has to be repeated once more. The
reason why a \kdd process has to be iterated can be a bad choice
of parameter values or a bad selection of data.

\subsection{Open Issues of the \kdd Process}

Although there exist approaches that improve the runtime of
instances of the \kdd process, time is still an open issue of the
\kdd process---quality of results is the other one. Hence, we will
discuss the open issues concerning the runtime of instances of the
\kdd process first, before discussing the issues related to
quality.

Algorithms with long runtimes and many iterations during an
instance of the \kdd process increase the runtime of this
instance. If the data set which an algorithm is analysing is very
large, then even the runtime of well-scaling algorithms is high.
The consequence of a long runtime of an algorithm is that the
analyst who started it is slowed down in his or her work by the
system until the algorithm terminates.

Iterating steps in an instance of the \kdd process is necessary
because the best values of the used algorithm's parameters and the
best set of data for a specific analysis are initially unknown.
Assuming the opposite would mean that an analyst performs an
analysis whose result he or she already knows. Yet, that would
conflict with the definition of \kdd, which requires patterns to
be new.

\begin{figure}
\centering
\includegraphics[width=12cm]{multipleinstances.eps}
\caption{Issue B: Multiple isolated \kdd instances use the same
data} \label{multipleinstances}
\end{figure}

Consequently, it is common practice to run data mining algorithms
several times---each time with a different combination of
parameter values and data. More precisely, the analyst chooses the
data and values that, to the best of his or her current knowledge,
will most likely bring the best results. Even an unsuccessful
attempt commonly gives the analyst a better insight into the
relations among the analysed data, making future analyses more
likely to succeed. Figure \ref{iteratingnecessary} depicts an
instance of the \kdd process that requires two iterations.

When an analyst re-does an analysis with different settings, these
analyses are not identical, but several tasks are very similar.
More specifically, both analyses share several intermediate
results, as later chapters of this dissertation will show.
Obviously, processing the same intermediate results more than once
means wasting time. Previous work introduces techniques that are
able to re-use the results of a previous analysis when parameters
change. However, if an analyst uses a different subset of the
data, he or she must compute all results of the new analysis from
scratch.

Sharing the computation of some intermediate results is not
limited to analyses within the same instance of the \kdd process.
A set of analyses from different instances can also share several
tasks or parts of them. Especially if these analyses use the same
data set and the same type of analysis, several intermediate
results can be identical. Figure \ref{multipleinstances} shows two
instances of the \kdd process that analyse the same data set.
Although the goals of both instances are different, both instances
need to cluster the same table; only parameter values and the used
attributes differ. Later chapters will show that it is possible to
pre-compute common intermediate results for analyses when only
attributes and parameters vary.

Re-doing the same steps of analyses would be tolerable if the time
needed to compute an analysis were negligible---which is not the
case, as the next paragraphs illustrate.

\emph{Pre-processing} and most \emph{data mining methods} require
accessing all data---which is very time-consuming in large
databases exceeding 1 GB in size. Some algorithms are inapplicable
to large data sets without supplying proper database index
structures. For instance, a single run of the DBSCAN \cite{DBSCAN}
clustering algorithm scanning ten thousand tuples (less than 1 MB
in size) took approximately eleven hours on a Sun SPARC
workstation \cite[pp. 149-151]{stoettinger}. The reason for this
long runtime is that DBSCAN requires multiple scans of the
selected data.

Although some data mining algorithms like the clustering
algorithms BIRCH \cite{zhang96birch} and CLARANS
\cite{ng94efficient} are very efficient and scale better in the
number of tuples, the minimum time needed for
\emph{pre-processing} and \emph{data mining} is still too long for
interactively analysing the data. During the tests for this
dissertation, the Oracle 9i database required a few minutes to
answer a simple \emph{SELECT COUNT(*) FROM table} query when the
size of the table exceeded half a gigabyte---each time the query
required accessing the values of a specific attribute, the
response time increased significantly, up to ten minutes.

The databases of mail order companies or large warehouses exceed
one gigabyte by far. For instance, think of a company operating
supermarkets. Even if there are only $300$ small supermarkets,
each visited by only $300$ customers buying $5$ items per day, and
the database requires only $512$ bytes to store a sold item, the
data stored daily require roughly a quarter of a gigabyte of disc
space.

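Spelling this back-of-the-envelope calculation out makes the order
of magnitude explicit:
\[
300 \times 300 \times 5 \times 512\,\mathrm{B}
  = 230\,400\,000\,\mathrm{B} \approx 0.23\,\mathrm{GB}
  \quad \text{per day.}
\]
Over a year, even this deliberately small scenario accumulates to
more than $80$ GB of transaction data.
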
A long runtime of an algorithm or a pre-processing task means that
an analyst spends more time waiting for the computer to finish its
job than analysing the resulting patterns.

Zhang et alii state that the most time-consuming part of data
mining algorithms is scanning the data, while operations in main
memory are negligible \cite{zhang96birch}. This statement is
consistent with the test results of this dissertation. As a
consequence, approaches to improving data mining algorithms must
focus on minimising the number of required database scans.

Avoiding scanning the database at least once is impossible. Even
taking a sample requires at least one scan of the
data---otherwise, the sample taken would not be representative.

Some data mining algorithms already require only this minimum of
one database scan. BIRCH is a hierarchical clustering algorithm
requiring only one scan, Bradley et alii have proposed an EM
(expectation maximisation) clustering algorithm in \cite{Bradley},
and Kanungo et alii have presented a $k$-means-like algorithm in
\cite{kanungo-efficient}, to name only a few examples of
clustering approaches that require a single scan of the data.

If existing data mining algorithms are nearly as fast as the
theoretical limit, but even this minimal time is too long for
interactive working, the source of major improvements in runtime
must be elsewhere. The next subsection gives the basic idea of the
approach of this dissertation, which is capable of reducing the
required runtime below the current limit. The research questions
of this dissertation are a direct consequence of this approach.

In many approaches, which we will discuss later, one can decrease
runtime at the expense of the quality of the results of the \kdd
process. Hence, quality is always an additional issue when
discussing improvements of runtime. Yet, improving quality and
improving runtime do not exclude each other. Thus, the effect on
quality is an additional research question for any approach
changing the \kdd process. The benefit of improving the quality of
results depends on the specific instance of the \kdd process. For
instance, if a company can improve the classification of customers
who are about to cancel their contracts with the company, then
each increase in quality, i.e. the accuracy of the classification,
is beneficial.

\subsection{Basic Idea of Approach and Research Questions}

The \kdd process as it is shown in figure \ref{kdd process}
describes the sequence of steps of a single analysis. In
particular, the \kdd process does not take into account that
analysts do several analyses---not necessarily all of them at the
same time, but many analyses within a year. The discussion of the
last subsection indicated that further improvement of analyses is
necessary to enable working continuously on data analyses.
However, improving a single analysis is insufficient because the
theoretical limit of one scan is not fast enough. Changing this
isolated point of view can be a source of major improvement, as
will be shown later.

\bigskip \emph{The core idea of this dissertation is to re-think the
knowledge discovery in databases process. Instances of the KDD
process should no longer be regarded as single independent process
instances but should be regarded as dependent instances that can
profit from each other.}

\emph{Further, if the specific setting of a future KDD process
instance is unknown but several tasks of the instance are known,
these known tasks should be done in anticipation.}
\bigskip

If two or more \kdd process instances share the same
time-consuming tasks or parts of them, then it is reasonable to do
those tasks only in the first occurring instance---it would be
wasted time to re-do such a task in the second and future
instances.

\bigskip \emph{The underlying basic assumption is that there are
process instances that have time-consuming tasks in
common.}\bigskip

The research questions this dissertation intends to answer are
consequences of the above-mentioned basic idea and basic
assumption. To be specific, the research questions of this
dissertation are as follows:

\begin{enumerate}
\item How do \kdd analyses differ from each other?

\item Are there intermediate results shared by multiple
analyses?

\item Can (intermediate) results of an analysis affect the
quality of the result of another analysis?
\end{enumerate}

If the variability of analyses is known, it is possible to
identify items that are common to all analyses of a specific type
of analysis. Additionally, one can identify the range of some
parameters. If there are only a few values from which the analyst
can choose for a specific parameter, one can pre-compute the
results for all of these potential values. When the analyst
finally chooses the actual value for an analysis, the result has
already been computed by the system.

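As a minimal sketch of this idea---the function names and the
clustering routine are hypothetical and merely stand in for
whatever analysis the analyst actually runs---the system can
materialise the result for every candidate parameter value before
the analyst decides:

\begin{verbatim}
# Sketch: pre-compute results for every plausible parameter value.
# 'analysis' stands for any data mining routine whose parameter can
# only take a handful of values the analyst may choose from.

def precompute_all(data, candidate_values, analysis):
    """Run the analysis once per candidate value and cache the results."""
    return {value: analysis(data, value) for value in candidate_values}

# Later, when the analyst fixes the parameter, the result is already
# available, e.g.:
#   results = precompute_all(table, [2, 3, 4, 5], k_means)
#   chosen  = results[3]
\end{verbatim}
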
If there are intermediate results of an analysis that other
analyses might need, then these other analyses can profit from
these intermediate results. Hence, the answer to the second
research question is needed to improve the \kdd process: if common
intermediate results are identified, one can pre-compute them in
anticipation to save time in succeeding analyses.

The third research question focusses on the aspect of quality of
analyses. Taking several analyses into account can also affect the
quality of the results of each analysis. For instance, knowledge
gained in one analysis can help to obtain better results in
another analysis by biasing the succeeding analysis.

A specific type of analysis needs to perform tasks of a specific
type. Some of these tasks depend on parameters chosen by the
analyst, others do not. It is possible to pre-compute the results
of tasks which are independent of the analyst's input.

For such parameter-independent tasks of analyses of the same type,
the data set which the system processes is the only differing
factor. As the number of tables in a database is limited, it is
possible to pre-compute intermediate results of commonly used
types of analyses for all large tables.

However, there still remain tasks depending on parameters which
cannot be pre-computed. Thus, it is necessary to split analyses
into parts that can be processed in anticipation and tasks that
cannot, as illustrated in figure \ref{somekddinstances}. A single
instance of a supporting process pre-computes intermediate results
for two instances of an altered \kdd process that consists only of
parameter-specific tasks. Later sections will introduce the
supported process as the parameter-specific \kdd process and the
supporting process as the parameter-independent \kdd process.

The instance of the parameter-independent \kdd process
pre-computes intermediate results and keeps them up to date when a
table that has been used for pre-computation changes, i.e. when
there are new, removed, or altered tuples in the table.

The instances of the parameter-specific \kdd process use the
pre-computed intermediate results either to compute the final
result with higher performance or to bias the computation to
obtain better results. Improving speed with pre-computed
intermediate results corresponds to the second research question,
while improving quality by biasing corresponds to the third.

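A schematic sketch of this division of labour may help at this
point; the function names and the toy data are hypothetical and do
not anticipate the interfaces defined later in this dissertation:

\begin{verbatim}
# Sketch of the split into a supporting (parameter-independent) and a
# supported (parameter-specific) part, on toy one-dimensional data.

def parameter_independent_phase(table):
    # Runs once per table and is refreshed whenever tuples are added,
    # removed, or altered: pre-compute per-group subtotals (count, sum).
    groups = {}
    for group, value in table:
        n, s = groups.get(group, (0, 0.0))
        groups[group] = (n + 1, s + value)
    return groups

def parameter_specific_phase(intermediate, selected_groups):
    # Runs per analysis and answers an analyst-specific question (here:
    # the mean of the selected groups) from the subtotals alone.
    n = sum(intermediate[g][0] for g in selected_groups)
    s = sum(intermediate[g][1] for g in selected_groups)
    return s / n

table = [("a", 1.0), ("a", 1.2), ("b", 5.1), ("b", 4.9)]
subtotals = parameter_independent_phase(table)
print(parameter_specific_phase(subtotals, ["a"]))   # 1.1
\end{verbatim}
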
The above-mentioned concept of anticipatory pre-computing ideally
supports analysing data warehouses containing few but very large
fact tables. As data warehouses and \kdd are commonly combined in
companies having mass data, this concept offers much potential for
many companies.

Some approaches in related work already arise from the same idea
of pre-computation but are limited to complete tasks---most of
them tasks of the pre-processing phase. This dissertation
introduces an approach that splits tasks into smaller chunks and
is thereby able to increase the number of pre-computable tasks.
Additionally, this approach extends the idea of pre-computing to
the data mining phase. Hence, it is more broadly applicable than
existing work.

\begin{figure}
\centering
\includegraphics[width=12cm]{somekddinstances.eps}
\caption{Improving the \kdd process by using pre-computed
intermediate results and auxiliary data instead of tuples}
\label{somekddinstances}
\end{figure}

This dissertation will present a technique to increase the
performance of instance-specific pre-processing tasks, making them
so fast that analysts can work on analyses without being slowed
down by the data mining system.

In addition to the \emph{pre-processing} phase, the \emph{data
mining} phase also has high potential for improvement.

This dissertation presents techniques to enhance the performance
of data mining techniques and thereby intends to improve the \kdd
process in terms of quality and speed. To be more specific, it is
useful to distinguish at this point between the data mining
techniques association rule mining, classification, clustering,
and regression.

Association rule mining requires finding frequently occurring sets
of items before finding association rules among them. Once the
frequent item sets that are valid in the set of all transactions
have been determined, it is possible to determine the frequent
item sets that are valid in subsets of all transactions. This has
been sufficiently shown in other work. This dissertation will
discuss those works for reasons of completeness but will
contribute only a minor new idea.

Classification algorithms use a set of data to train a function,
called a classifier, which is able to predict the class of a tuple
using the tuple's attribute values. The set of data used for
training typically contains only a few tuples---consequently,
scaling is not a problem. The quality of the resulting classifier
is the main concern of classification. This work will discuss
quality-improving classification approaches and contribute some
new ideas on how to improve the quality of a classifier by using
auxiliary data gained as by-products of other \kdd process
instances.

Clustering algorithms produce a set of clusters. All clustering
algorithms except density-based ones represent clusters with a
small set of describing attributes, which varies from algorithm to
algorithm. For instance, \EM clustering algorithms generate a
probabilistic model consisting of $k$ random variables and store
the mean and deviation of each variable, while $k$-means and
related algorithms represent clusters by their Euclidean mean. It
is easy to compute the mean and standard deviation of a cluster
when the number of the cluster's tuples, the linear sum of its
tuples, and their sum of squares are known, which will be
demonstrated later.

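As a brief preview of that demonstration (stated here per
attribute; the detailed derivation follows in later chapters),
with $N$ tuples, linear sum $LS$ and sum of squares $SS$, the
describing statistics are
\[
\mu = \frac{LS}{N}, \qquad
\sigma = \sqrt{\frac{SS}{N} - \left(\frac{LS}{N}\right)^{2}} .
\]
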
If the tuples within a subset of the data are very similar to each
other, they will be part of the same cluster most of the time.
Thus, it is possible to build a subtotal over all of these tuples.

Determining the mean by summing up all tuples or by summing up
subtotals will produce the same result as long as no tuples are
included that are part of a different cluster. If there are such
tuples, then the result might, but need not, be erroneous. The
experimental results will show that this error is tolerable.

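The following small sketch (plain Python with toy one-dimensional
values, made up purely for illustration) shows the arithmetic
identity behind this argument: the mean computed from subtotals
equals the mean computed directly from the tuples; an error can
only creep in when a subtotal mixes tuples of different clusters
but is attributed to one of them.

\begin{verbatim}
# Toy data: two groups of similar tuples (one-dimensional values).
tuples = [1.0, 1.2, 0.9, 5.1, 4.8, 5.3]
subsets = [[1.0, 1.2, 0.9], [5.1, 4.8, 5.3]]   # similar tuples grouped

# Mean computed directly from all tuples.
direct_mean = sum(tuples) / len(tuples)

# Mean computed from subtotals: (linear sum, count) per subset.
subtotals = [(sum(s), len(s)) for s in subsets]
total_sum = sum(ls for ls, n in subtotals)
total_n = sum(n for ls, n in subtotals)
subtotal_mean = total_sum / total_n

# Both ways of summing yield the same mean.
assert abs(direct_mean - subtotal_mean) < 1e-12
\end{verbatim}
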
This dissertation will discuss in detail approaches that improve
clustering by aggregating tuples into sub-clusters and will
present new ideas. In particular, the re-use of sub-clusters will
be enhanced in comparison to existing approaches like
\cite{zhang96birch} or \cite{bradley98scaling}, such that it is
possible to find sub-clusters in anticipation.

Additionally, this dissertation will discuss other popular
approaches to improving the performance of analyses, such as
sampling.

A running example shall illustrate the concepts of this
dissertation. Therefore, the next section introduces a running
example that is used to demonstrate the more complex parts of this
dissertation's concepts.