\section{Introduction}

Knowledge discovery in databases (KDD) and data mining offer
enterprises and other large organisations a set of methods to
discover previously unknown relations among data---which are often
called patterns, e.g. \cite{han}\cite{Frawley}\cite{ester}. With
these patterns, enterprises gain deeper insight into their context
and can use them for better decision-making.

Frawley, Piatetsky-Shapiro and Matheus define \emph{knowledge
discovery} as follows: ``Knowledge discovery is the nontrivial
extraction of implicit, previously unknown, and potentially useful
information from data'' \cite[p. 3]{Frawley}. It is common practice
to use the term \emph{knowledge discovery in databases}, or \kdd
for short, to stress the most common use case. Yet, it is
convenient to use the term \kdd also for knowledge discovery in
data that are kept not in databases but in texts or data streams.

The term \emph{data mining} denotes the step of the \kdd process
in which sophisticated methods analyse the data for patterns of
interest \cite{han}\cite{ester}. The techniques used therein are
called \emph{data mining techniques}. Clustering, classification,
and association rule mining are the most commonly used data
mining techniques.

The quality of the patterns found by data mining algorithms and
the time needed to find them are the measures of interest in \kdd
projects. The quality of a found pattern determines its
usability---for instance, to guide a decision. Additionally, when
mining data in large databases, time is an issue, too.

This dissertation focusses on the problems of pattern quality and
time in \kdd projects. Recently, many approaches have been
introduced to improve specific \kdd analyses. However, these
approaches optimise a single analysis in isolation.

In contrast to existing work, this dissertation presents a novel
way to improve \kdd projects in terms of shorter project time and
higher quality of results by optimising a set of analyses instead
of optimising each analysis individually.

The remainder of this section is organised as follows: to be able
to show the open issues of past work, it first introduces the
\kdd process as it is. It then shows where the open issues of the
\kdd process lie. Finally, it presents the basic idea of how to
improve the \kdd process in terms of runtime and quality of
results.

\begin{figure}
\centering
\includegraphics[width=12cm]{kdd.eps}
\caption{KDD process} \label{kdd process}
\end{figure}

\subsection{A First Introduction to the \kdd Process}

This subsection gives a short introduction to the \kdd process.
It provides the information necessary to show the open issues,
which are the topic of the next subsection. A more detailed
discussion of the \kdd process follows in section \ref{kdd
process discussion}.

Figure \ref{kdd process} shows the \kdd process with all its
phases. The \kdd process consists of the phases pre-processing,
data mining, and evaluating the resulting patterns. Using
evaluated patterns for decision-making is not part of the \kdd
process, but Figure \ref{kdd process} contains this item to denote
that the \kdd process does not exist per se but is embedded in a
broader economic context.

The first step in a \kdd process is selecting data that might
contain patterns one is interested in. Some data mining algorithms
require input data that are consistent with a specific data
schema. The \emph{pre-processing} phase consists of all activities
that select or transform data according to the needs of a data
mining analysis.

In the \emph{data mining} phase, a data mining expert---referred
to as the analyst in the following---analyses the data using
sophisticated data mining algorithms. The outcome of these
algorithms is a set of patterns that contain potentially new
knowledge.

\begin{figure}
\centering
\includegraphics[width=12cm]{iteratingnecessary.eps}
\caption{Issue A: Insufficient results inflict additional
iterations} \label{iteratingnecessary}
\end{figure}

The resulting patterns must be evaluated before they can be used.
This happens during the \emph{evaluating} phase, in which an
analyst tests whether the patterns are new, valid, and
interesting.

If no patterns fulfil all of the above conditions, the \kdd
process or a part of it has to be repeated. The reason why a \kdd
process has to be iterated can be a bad choice of parameter
values or a bad selection of data.

\subsection{Open Issues of the \kdd Process}

Although there are approaches that improve the runtime of
instances of the \kdd process, time is still an open issue of the
\kdd process---quality of results is the other one. Hence, we
discuss open issues concerning the runtime of instances of the
\kdd process first, before discussing issues related to quality.

Algorithms with long runtimes and many iterations during an
instance of the \kdd process increase the runtime of that
instance. If the data set which an algorithm analyses is very
large, then even well-scaling algorithms have a high runtime. The
consequence of a long runtime of an algorithm is that the analyst
who started it is slowed down in his/her work by the system until
the algorithm terminates.

Iterating steps in an instance of the \kdd process is necessary
because the best values of the used algorithm's parameters and the
best set of data for a specific analysis are initially unknown.
Assuming the opposite would mean that an analyst performs an
analysis whose result he/she already knows. Yet, that would
conflict with the definition of \kdd, which demands that patterns
be new.

\begin{figure}
\centering
\includegraphics[width=12cm]{multipleinstances.eps}
\caption{Issue B: Multiple isolated \kdd instances use the same
data} \label{multipleinstances}
\end{figure}

Consequently, it is common practice to run data mining
algorithms several times---each time with a different combination
of parameter values and data. More precisely, the analyst chooses
the data and values that, to the best of his/her current
knowledge, will most likely bring the best results. Even an
unsuccessful attempt commonly gives the analyst better insight
into the relations among the analysed data, making future
analyses more likely to succeed. Figure \ref{iteratingnecessary}
depicts an instance of the \kdd process that requires two
iterations.

When an analyst re-does an analysis with different settings, these
analyses are not identical, but several tasks are very similar.
More specifically, both analyses share several intermediate
results, as later chapters of this dissertation will show.
Obviously, processing the same intermediate results more than once
wastes time. Previous work has introduced techniques that can
re-use the results of a previous analysis when parameters change.
However, if an analyst uses a different subset of the data, he/she
must compute all results of the new analysis from scratch.

Sharing the computation of intermediate results is not limited to
analyses of the same instance of the \kdd process. A set of
analyses can also share several tasks or parts of them. Especially
if these analyses use the same data set and also share the same
type of analysis, several intermediate results can be identical.
Figure \ref{multipleinstances} shows two instances of the \kdd
process that analyse the same data set. Although the goals of the
two instances differ, both need to cluster the same table---only
parameter values and used attributes differ. Later chapters will
show that it is possible to pre-compute common intermediate
results for analyses when only attributes and parameters vary.

Re-doing the same steps of analyses would be tolerable if the time
needed to compute an analysis were negligible---which is not the
case, as the next paragraphs illustrate.

\emph{Pre-processing} and most \emph{data mining methods} require
accessing all data---which is very time-consuming in large
databases exceeding 1 GB in size. Some algorithms are inapplicable
to large data sets without proper database index structures. For
instance, a single run of the DBSCAN \cite{DBSCAN} clustering
algorithm scanning ten thousand tuples (less than 1 MB in size)
took approximately eleven hours on a Sun SPARC workstation
\cite[pp. 149-151]{stoettinger}. The reason for this long runtime
is that DBSCAN requires multiple scans of the selected data.

Although some data mining algorithms, such as the clustering
algorithms BIRCH \cite{zhang96birch} and CLARANS
\cite{ng94efficient}, are very efficient and scale better in the
number of tuples, the minimum time needed for
\emph{pre-processing} and \emph{data mining} is still too long for
interactively analysing the data. During the tests for this
dissertation, the Oracle 9i database required a few minutes to
answer a simple \emph{SELECT COUNT(*) FROM table} query when the
size of the table exceeded half a gigabyte---each time the query
required accessing values of a specific attribute, the response
time increased significantly, up to ten minutes.

The databases of mail order companies or large warehouses exceed
one gigabyte by far. For instance, think of a company operating
supermarkets. Even if there are only $300$ small supermarkets, in
each of which only $300$ customers buy $5$ items per day, and the
database requires only $512$ bytes to store a sold item, the data
stored daily require almost $0.25$ GB of disc space.

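For transparency, the arithmetic behind this back-of-the-envelope
estimate is simply the product of the stated figures:
\[
300 \times 300 \times 5 \times 512\,\mathrm{B}
= 230\,400\,000\,\mathrm{B} \approx 0.23\,\mathrm{GB}
\quad\text{per day.}
\]
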
A long runtime of an algorithm or a pre-processing task means that
an analyst spends more time waiting for the computer to finish its
job than analysing the resulting patterns.

Zhang et alii state that the most time-consuming part of
data mining algorithms is scanning the data, while operations in
main memory are negligible \cite{zhang96birch}. This statement is
consistent with the test results of this dissertation. As a
consequence, approaches to improving data mining algorithms must
focus on minimising the number of required database scans.

Avoiding scanning the database at least once is impossible. Even
taking a sample requires at least one scan of the
data---otherwise, the sample so taken would not be representative.

Some data mining algorithms already require only the minimum of
one database scan. BIRCH is a hierarchical clustering algorithm
requiring only one scan, Bradley et alii have proposed an EM
(expectation maximisation) clustering algorithm \cite{Bradley},
and Kanungo et alii have presented a $k$-means-like algorithm
\cite{kanungo-efficient}, to name only a few examples of
clustering approaches that require a single scan of the data.

If existing data mining algorithms are nearly as fast as the
theoretical limit, but even this minimal time is too long for
interactive work, the source of major improvements in runtime
must lie elsewhere. The next subsection gives the basic idea of
this dissertation's approach, which is capable of reducing the
needed runtime below the current limit. The research questions of
this dissertation are a direct consequence of this approach.

In many approaches, which we will discuss later, one can decrease
runtime at the expense of the quality of the results of the \kdd
process. Hence, quality is always an additional issue when
discussing improvements in runtime. Yet, improving quality and
runtime do not exclude each other. Thus, the effect on quality is
an additional research question of any approach changing the \kdd
process. The benefit of improving the quality of results depends
on the specific instance of the \kdd process. For instance, if a
company can improve the classification of customers who are about
to cancel their contracts with the company, then each increase in
quality, i.e. the accuracy of classification, is beneficial.

\subsection{Basic Idea of Approach and Research Questions}

The \kdd process as shown in figure \ref{kdd process} describes
the sequence of steps of a single analysis. In particular, the
\kdd process does not take into account that analysts do several
analyses---not necessarily all of them at the same time, but many
analyses within a year. The discussion in the last subsection
indicated that further improvement of analyses is necessary to
enable continuous work on data analyses. However, improving a
single analysis is insufficient because even the theoretical
limit of one scan is not fast enough. Changing this isolated
point of view can be a source of major improvement, as will be
shown later.

\bigskip \emph{The core idea of this dissertation is to re-think
the knowledge discovery in databases process. Instances of the KDD
process should no longer be regarded as single independent process
instances but as dependent instances that can profit from each
other.}

\emph{Further, if the specific setting of a future KDD process
instance is unknown but several tasks of the instance are known,
these known tasks should be done in anticipation.}
\bigskip

If two or more \kdd process instances share the same
time-consuming tasks or parts of them, then it is reasonable to do
those tasks only in the first occurring instance---it would be
wasted time to re-do such a task in the second and future
instances.

\bigskip \emph{The underlying basic assumption is that there are
process instances that have time-consuming tasks in
common.}\bigskip

The research questions this dissertation intends to answer follow
from the above-mentioned basic idea and basic assumption. To be
specific, the research questions of this dissertation are as
follows:

\begin{enumerate}
  \item How do \kdd analyses differ from each other?

  \item Are there intermediate results shared by multiple
  analyses?

  \item Can (intermediate) results of an analysis affect the
  quality of the result of another analysis?
\end{enumerate}

If the variability of analyses is known, it is possible to
identify items that are common to all analyses of a specific type.
Additionally, one can identify the range of some parameters. If
there are only a few values of a specific parameter the analyst
can choose from, one can pre-compute the results for all of these
potential values. When the analyst finally chooses the actual
value for an analysis, the result has already been computed by the
system, as the sketch below illustrates.

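As a minimal illustration of this idea, the following Python
sketch (all names are hypothetical) pre-computes the result of an
expensive, parameter-dependent analysis for every value of a
small, known parameter domain, so that the interactive step
reduces to a mere lookup:

\begin{verbatim}
# Minimal sketch, assuming the analyst may only choose the
# parameter k from a small, known domain. run_analysis stands in
# for any expensive, parameter-dependent analysis.

def run_analysis(data, k):
    """Hypothetical analysis: here, the k smallest values."""
    return sorted(data)[:k]

PARAMETER_DOMAIN = (2, 3, 5)     # the few admissible values of k

def precompute(data):
    """Run the analysis for every admissible k in anticipation."""
    return {k: run_analysis(data, k) for k in PARAMETER_DOMAIN}

cache = precompute([7, 1, 4, 9, 3])
print(cache[3])                  # analyst picks k = 3: instant answer
\end{verbatim}
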
If there are intermediate results of an analysis that other
analyses might need, then these other analyses can profit from
those intermediate results. Hence, the answer to the second
research question is needed to improve the \kdd process: if common
intermediate results are identified, one can pre-compute them in
anticipation to save time in succeeding analyses.

The third research question focusses on the quality of analyses.
Taking several analyses into account can also affect the quality
of the results of each analysis. For instance, knowledge gained in
one analysis can help to obtain better results in another analysis
by biasing the succeeding analysis.

A specific type of analysis needs to perform tasks of a specific
type. Some of these tasks depend on parameters chosen by the
analyst, others do not. It is possible to pre-compute the results
of tasks which are independent of the analyst's input.

The data set which the system processes in a specific task is the
only factor that differs between the parameter-independent tasks
of analyses of the same type. As the number of tables in a
database is limited, it is possible to pre-compute intermediate
results of commonly used types of analyses for all large tables.

However, there remain tasks depending on parameters which cannot
be pre-computed. Thus, it is necessary to split analyses into
parts that can be processed in anticipation and tasks that cannot,
as illustrated in figure \ref{somekddinstances}. A single instance
of a supporting process pre-computes intermediate results for two
instances of an altered \kdd process that consists only of
parameter-specific tasks. Later sections will introduce the
supported process as the parameter-specific \kdd process and the
supporting process as the parameter-independent \kdd process.

An instance of the parameter-independent \kdd process pre-computes
intermediate results and keeps them up to date when a table which
has been used for pre-processing changes, i.e. when tuples are
added to, removed from, or altered in the table.

Instances of the parameter-specific \kdd process use the
pre-computed intermediate results either to compute the final
result faster or to bias the computation towards better results.
Improving speed with pre-computed intermediate results corresponds
to the second research question, while improving quality by
biasing corresponds to the third.

The above-mentioned concept of anticipatory pre-computing ideally
supports analysing data warehouses containing few but very large
fact tables. As the combination of data warehouses and \kdd is
common in companies having mass data, this concept offers much
potential for many companies.

Some approaches in related work already arise from the same idea
of pre-processing but are limited to complete tasks---most of them
tasks of the pre-processing phase. This dissertation introduces an
approach that splits tasks into smaller chunks and is thereby able
to increase the number of pre-computable tasks. Additionally, this
approach extends the idea of pre-computing to the data mining
phase. Hence, it is more broadly applicable than existing work.

\begin{figure}
\centering
\includegraphics[width=12cm]{somekddinstances.eps}
\caption{Improving the \kdd process by using pre-computed
intermediate results and auxiliary data instead of tuples}
\label{somekddinstances}
\end{figure}

This dissertation will present a technique to increase the
performance of instance-specific pre-processing tasks to make them
so fast that analysts can work on analyses without being slowed
down by the data mining system.

In addition to the \emph{pre-processing} phase, the \emph{data
mining} phase also has high potential for improvement.

This dissertation presents techniques to enhance the performance
of data mining techniques and thereby intends to improve the \kdd
process in terms of quality and speed. To be more specific, it is
useful at this point to distinguish between the data mining
techniques association rule mining, classification, clustering,
and regression.

Association rule mining requires finding frequently occurring sets
of items before finding association rules among them. Once the
frequent item sets that are valid in the set of all transactions
have been determined, it is possible to determine the frequent
item sets that are valid in subsets of all transactions. This has
been sufficiently shown in other work. This dissertation will
discuss those works for reasons of completeness but will
contribute only a minor new idea.

Classification algorithms use a set of data to train a function
called a classifier, which is able to predict the class of a tuple
using the tuple's attribute values. The data set used for training
typically contains only few tuples---consequently, scaling is not
a problem. The quality of the resulting classifier is the most
pressing issue of classification. This work will discuss
quality-improving classification approaches and contribute some
new ideas on how to improve the quality of a classifier by using
auxiliary data gained as by-products of other \kdd process
instances.

Clustering algorithms produce a set of clusters. All clustering
algorithms except density-based ones represent clusters with a
small set of describing attributes, which varies from algorithm to
algorithm. For instance, \EM clustering algorithms generate a
probabilistic model consisting of $k$ random variables and store
the mean and deviation of each variable, while $k$-means and
related algorithms represent clusters by their Euclidean mean. It
is easy to compute the mean and standard deviation of a cluster
when the number of the cluster's tuples, the linear sum of its
tuples, and their sum of squares are known, as will be
demonstrated later.

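Anticipating that demonstration, the relations are the standard
ones for these sufficient statistics: with $N$ the number of a
cluster's tuples, $\mathrm{LS}$ their linear sum, and
$\mathrm{SS}$ their sum of squares,
\[
\mu = \frac{\mathrm{LS}}{N}, \qquad
\sigma^2 = \frac{\mathrm{SS}}{N}
- \left(\frac{\mathrm{LS}}{N}\right)^{2}.
\]
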
If the tuples within a subset of the data are very similar to each
other, they will in most cases be part of the same cluster. Thus,
it is possible to build a subtotal of all of these tuples.

Determining the mean by summing up all tuples or by summing up
subtotals produces the same result as long as no tuples are
included that belong to a different cluster. If there are such
tuples, the result may, but need not, be erroneous. The
experimental results will show that this error is tolerable.

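The following Python sketch (names hypothetical, one-dimensional
tuples for brevity) shows why subtotals suffice: the statistics
$(N, \mathrm{LS}, \mathrm{SS})$ are additive, so merging subtotals
yields exactly the same mean and variance as summing the raw
tuples.

\begin{verbatim}
# Minimal sketch: (N, LS, SS) subtotals are additive, so a
# cluster's mean and variance can be computed from subtotals
# instead of from the raw tuples.

def subtotal(tuples):
    """Sufficient statistics of a set of 1-d tuples."""
    n = len(tuples)
    ls = sum(tuples)                  # linear sum
    ss = sum(x * x for x in tuples)   # sum of squares
    return n, ls, ss

def merge(a, b):
    """Merging two subtotals is component-wise addition."""
    return tuple(x + y for x, y in zip(a, b))

def mean_var(stats):
    n, ls, ss = stats
    mean = ls / n
    return mean, ss / n - mean * mean

part1, part2 = [1.0, 2.0, 3.0], [4.0, 5.0]
merged = merge(subtotal(part1), subtotal(part2))
print(mean_var(merged))   # (3.0, 2.0), same as for the raw tuples
\end{verbatim}
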
This dissertation will discuss in detail approaches that improve
clustering by aggregating tuples into sub-clusters, and will
present new ideas. In particular, the re-use of sub-clusters will
be enhanced in comparison to existing approaches such as
\cite{zhang96birch} or \cite{bradley98scaling}, such that it is
possible to find sub-clusters in anticipation.

Additionally, this dissertation will also discuss other popular
approaches to improving the performance of analyses, such as
sampling.

A running example shall illustrate the concepts of this
dissertation. Therefore, the next section introduces a running
example which is used to demonstrate the more complex parts of
this dissertation's concepts.