Knowledge discovery in databases (KDD) and data mining offer enterprises and other large organisations a set of methods to discover previously unknown relations among data, which are often called patterns, e.g.\ \cite{han,Frawley,ester}. With these patterns, enterprises gain deeper insight into their context and can use them for better decision-making.
Frawley, Piatetsky-Shapiro and Matheus define knowledge discovery as follows:
Knowledge discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data \cite[p. 3]{Frawley}.
It is common practice to use the term knowledge discovery in databases, or KDD for short, to stress the most common use case. Yet, it is convenient to use the term KDD also for knowledge discovery in data that are not kept in databases but in texts or data streams.
The term data mining denotes the step of the KDD process in which sophisticated methods analyse the data for patterns of interest \cite{han,ester}. The techniques used herein are further called data mining techniques; clustering, classification, and association rule mining are the most commonly used ones.
The quality of the patterns found by data mining algorithms and the time needed to find them are the measures of interest in KDD projects. The quality of a found pattern determines its usability, for instance, to guide a decision. Additionally, when mining data in large databases, time is an issue, too.
This dissertation focusses on the problems of pattern quality and runtime in KDD projects. Recently, many approaches have been introduced to improve specific KDD analyses. However, these approaches try to optimise a single analysis.
In contrast to existing work, this dissertation presents a novel way to improve KDD projects in terms of shorter project time and higher quality of results by optimising a set of analyses instead of optimising each analysis individually.
The remainder of this section is organised as follows: it first introduces the KDD process as it is, in order to be able to point out open issues of past work. It then shows where the open issues of the KDD process lie. Finally, it presents the basic idea of how to improve the KDD process in terms of runtime and quality of results.
This subsection gives a short introduction to the KDD process. It provides the information necessary to show the open issues, which are the topic of the next subsection. A more detailed discussion of the KDD process follows in Section~\ref{kdd process discussion}.
Figure~\ref{kdd process} shows the KDD process with all its phases. The KDD process consists of the phases pre-processing, data mining, and evaluating the resulting patterns. Using the evaluated patterns for decision-making is not part of the KDD process, but Figure~\ref{kdd process} contains this item to indicate that the KDD process does not exist per se but is embedded in a broader economic context.
The first step in a KDD process is selecting data that might contain patterns one is interested in. Some data mining algorithms require input data that conform to a specific data schema. The pre-processing phase consists of all activities that select or transform data according to the needs of a data mining analysis.
In the data mining phase, a data mining expert, henceforth referred to as the analyst, analyses the data using sophisticated data mining algorithms. The outcome of these algorithms is a set of patterns that contain potentially new knowledge.
The resulting patterns must be evaluated before they can be used. This happens during the evaluation phase, in which an analyst tests whether or not the patterns are new, valid, and interesting.
If no patterns fulfil all of the above conditions, the KDD process or a part of it has to be repeated. Reasons for iterating a KDD process include a bad choice of parameter values or a bad selection of data.
Although there are approaches that improve the runtime of instances of the KDD process, time is still an open issue of the KDD process; quality of results is the other one. Hence, we first discuss open issues concerning the runtime of instances of the KDD process before discussing issues related to quality.
Algorithms with long runtimes and many iterations during an instance of the KDD process increase the runtime of that instance. If the data set an algorithm analyses is very large, even the runtime of well-scaling algorithms is high. The consequence of a long runtime is that the analyst who started the algorithm is slowed down by the system until the algorithm terminates.
Iterating steps in an instance of the KDD process is necessary because the best values of the algorithm's parameters and the best set of data for a specific analysis are initially unknown. Assuming the opposite would mean that an analyst performs an analysis whose result he or she already knows. Yet, that would conflict with the requirement of the KDD definition that patterns must be new.
Consequently, it is common practice to run data mining algorithms several times, each time with a different combination of parameter values and data. More precisely, the analyst chooses the data and parameter values that will most likely yield the best results to the best of his or her current knowledge. Even an unsuccessful attempt commonly gives the analyst better insight into the relations among the analysed data, making future analyses more likely to succeed. Figure~\ref{iteratingnecessary} depicts an instance of the KDD process that requires two iterations.
When an analyst re-does an analysis with different settings, the two analyses are not identical, but several tasks are very similar. More specifically, both analyses share several intermediate results, as later chapters of this dissertation will show. Obviously, computing the same intermediate results more than once wastes time. Previous work introduces techniques that are able to re-use the results of a previous analysis when parameters change. However, if an analyst uses a different subset of the data, he or she must compute all results of the new analysis from scratch.
Sharing the computation of some intermediate results is not limited to analyses of the same instance of the KDD process. A set of analyses can also share several tasks or parts of them. Especially if these analyses use the same data set and are of the same type, several intermediate results can be identical. Figure~\ref{multipleinstances} shows two instances of the KDD process that analyse the same data set. Although the goals of the two instances differ, both need to cluster the same table; only the parameter values and the attributes used differ. Later chapters will show that it is possible to pre-compute common intermediate results for analyses when only attributes and parameters vary.
Re-doing the same steps of analyses would be tolerable if the time needed to compute an analysis were negligible, which is not the case, as the next paragraphs illustrate.
Pre-processing and most data mining methods require accessing all data, which is very time-consuming in large databases exceeding 1 GB in size. Some algorithms are inapplicable to large data sets unless proper database index structures are supplied. For instance, a single run of the DBSCAN clustering algorithm \cite{DBSCAN} scanning ten thousand tuples (less than 1 MB in size) took approximately eleven hours on a SUN Sparc workstation \cite[pp. 149-151]{stoettinger}. The reason for this long runtime is that DBSCAN requires multiple scans of the selected data.
Although some data mining algorithms, like the clustering algorithms BIRCH \cite{zhang96birch} and CLARANS \cite{ng94efficient}, are very efficient and scale better in the number of tuples, the minimum time needed for pre-processing and data mining is still too long for interactively analysing the data. During the tests for this dissertation, the Oracle 9i database required a few minutes to answer a simple \texttt{SELECT COUNT(*) FROM table} query when the size of the table exceeded half a gigabyte; each time the query required accessing the values of a specific attribute, the response time increased significantly, up to ten minutes.
The databases of mail order companies or large warehouses exceed one gigabyte by far. For instance, think of a company operating supermarkets. Even if there are only $300$ small supermarkets with only $300$ customers each buying $5$ items per day, and the database requires only $512$ byte to store a sold item, the data set stored daily requires more than $0.2$ GB of disc space.
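Spelled out under these assumptions, the daily data volume is $300 \cdot 300 \cdot 5 \cdot 512$ byte $= 230\,400\,000$ byte, i.e. approximately $0.23$ GB per day.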
A long runtime of an algorithm or a pre-processing task means that an analyst spends more time waiting for the computer to finish its job than analysing the resulting patterns.
Zhang et alii state that the most time-consuming part of data mining algorithms is scanning the data, while operations in main memory are negligible \cite{zhang96birch}. This statement is consistent with the test results of this dissertation. As a consequence, approaches to improving data mining algorithms must focus on minimising the number of required database scans.
Avoiding scanning the database at least once is impossible. Even taking a sample requires at least one scan of the data; otherwise, the sample taken would not be representative.
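A minimal sketch may illustrate this point. It uses reservoir sampling, a standard one-pass technique that is not specific to this dissertation: although only $k$ tuples are kept, every tuple has to be read exactly once to obtain a uniform sample.

\begin{verbatim}
import random

def reservoir_sample(tuples, k):
    """Draw a uniform random sample of k tuples in a single pass
    (reservoir sampling)."""
    sample = []
    for i, t in enumerate(tuples):    # every tuple is read exactly once
        if i < k:
            sample.append(t)
        else:
            j = random.randint(0, i)  # keep t with probability k/(i+1)
            if j < k:
                sample[j] = t
    return sample

print(reservoir_sample(range(1_000_000), 5))
\end{verbatim}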
Some data mining algorithms already require only the minimum of one database scan. BIRCH is a hierarchical clustering algorithm requiring only one scan, Bradley et alii have proposed an EM (expectation maximisation) clustering algorithm \cite{Bradley}, and Kanungo et alii have presented a $k$-means-like algorithm \cite{kanungo-efficient}, to name only a few examples of clustering approaches that require only one scan of the data.
If existing data mining algorithms are nearly as fast as the theoretical limit but even this minimal time is too long for interactive working, the source of major improvements in runtime must be elsewhere. The next subsection gives the basic idea of the approach of this dissertation, which is capable of reducing the needed runtime below the current limit. The research questions of this dissertation are a direct consequence of this approach.
In many approaches which we will discuss later, one can decrease runtime at the expense of the quality of the results of the KDD process. Hence, quality is always an additional issue when discussing improvements of runtime. Yet, improving quality and runtime do not exclude each other. Thus, the effect on quality is an additional research question of any approach changing the KDD process. The benefit of improving the quality of results depends on the specific instance of the KDD process. For instance, if a company can improve the classification of customers who are about to cancel their contracts with the company, then each increase in quality, i.e. the accuracy of the classification, is beneficial.
The KDD process as shown in Figure~\ref{kdd process} describes the sequence of steps of a single analysis. In particular, the KDD process does not take into account that analysts perform several analyses, not necessarily all of them at the same time, but many analyses within a year. The discussion in the last subsection indicated that further improvement of analyses is necessary to enable continuous work on data analyses. However, improving a single analysis is insufficient because even the theoretical limit of one scan is not fast enough. Changing this isolated point of view can be a source of major improvement, as will be shown later.
The core idea of this dissertation is to re-think the knowledge discovery in databases process. Instances of the KDD process should no longer be regarded as single, independent process instances but as dependent instances that can profit from each other.
Further, if the specific setting of a future KDD process instance is unknown but several tasks of the instance are known, these known tasks should be done in anticipation.
If two or more KDD process instances share the same time-consuming tasks or parts of them, then it is reasonable to perform those tasks only in the first occurring instance; it would be a waste of time to re-do such a task in subsequent instances.
The underlying basic assumption is that there are process instances that have time-consuming tasks in common.
The research questions this dissertation intends to answer are consequences of the above-mentioned basic idea and basic assumption. To be specific, the research questions of this dissertation are as follows:
If the variability of analyses is known, it is possible to identify items that are common to all analyses of a specific type. Additionally, one can identify the range of some parameters. If there are only a few values from which the analyst can choose for a specific parameter, one can pre-compute the results for all of these potential values. When the analyst finally chooses the actual value for an analysis, the result has already been computed by the system.
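The following minimal sketch illustrates this idea under simplifying assumptions; the function names and the toy analysis are purely illustrative and do not refer to an existing system.

\begin{verbatim}
# Illustrative sketch: pre-compute the result for every candidate parameter
# value so that the analyst's eventual choice is answered from a cache.
from typing import Callable, Dict, Hashable, Iterable

def precompute(run_analysis: Callable[[Hashable], object],
               candidate_values: Iterable[Hashable]) -> Dict[Hashable, object]:
    """Run the analysis once per candidate value while the system is idle."""
    return {value: run_analysis(value) for value in candidate_values}

def lookup(cache: Dict[Hashable, object], chosen_value: Hashable) -> object:
    """When the analyst picks a value, the result is already available."""
    return cache[chosen_value]

# Toy 'analysis': counting tuples above a threshold.
data = [3, 7, 1, 9, 4, 6]
cache = precompute(lambda threshold: sum(1 for x in data if x > threshold),
                   candidate_values=(2, 5, 8))
print(lookup(cache, 5))   # -> 3, returned without re-scanning the data
\end{verbatim}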
If there are intermediate results of an analysis that other analyses might need, then these other analyses can profit from these intermediate results. Hence, the answer to the second research question is needed to improve the KDD process: if common intermediate results are identified, one can pre-compute them in anticipation to save time in succeeding analyses.
The third research question focusses on the aspect of quality of analyses. Taking several analyses into account can also affect the quality of the results of each analysis. For instance, knowledge gained in one analysis can help to obtain better results in another analysis by biasing the succeeding analysis.
A specific type of analysis needs to perform tasks of a specific type. Some of these tasks depend on parameters chosen by the analyst, others do not. It is possible to pre-compute the results of tasks which are independent of the analyst's input.
For parameter-independent tasks of analyses of the same type, the data set which the system processes is the only differing factor. As the number of tables in a database is limited, it is possible to pre-compute intermediate results of commonly used types of analyses for all large tables.
However, there still remain parameter-dependent tasks which cannot be pre-computed. Thus, it is necessary to split analyses into tasks that can be processed in anticipation and tasks that cannot, as illustrated in Figure~\ref{somekddinstances}. There, a single instance of a supporting process pre-computes intermediate results for two instances of an altered KDD process that consists only of parameter-specific tasks. Later sections will introduce the supported process as the parameter-specific KDD process and the supporting process as the parameter-independent KDD process.
The instance of the parameter-independent KDD process pre-computes intermediate results and keeps them up to date when a table that has been used for pre-computation changes, i.e. when there are new, removed, or altered tuples in the table.
The instances of the parameter-specific KDD process use the pre-computed intermediate results either to compute the final result with higher performance or to bias the computation to obtain better results. Improving speed with pre-computed intermediate results corresponds to the second research question, while improving quality by biasing corresponds to the third research question.
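A minimal sketch of this split, with illustrative names and deliberately simple intermediate results (counts, linear sums, and sums of squares), may clarify the division of labour; it is not the implementation presented later.

\begin{verbatim}
# Illustrative split of one analysis type into a parameter-independent part,
# run once per table in anticipation, and a parameter-specific part, run per
# analysis on the analyst's request.
from typing import Dict, List, Tuple

def parameter_independent(table: List[Tuple[float, ...]]) -> Dict[str, object]:
    """Scan the table once and keep compact intermediate results."""
    n = len(table)
    dims = len(table[0]) if table else 0
    linear_sum = [sum(row[d] for row in table) for d in range(dims)]
    sum_squares = [sum(row[d] ** 2 for row in table) for d in range(dims)]
    return {"n": n, "ls": linear_sum, "ss": sum_squares}

def parameter_specific(pre: Dict[str, object],
                       attributes: List[int]) -> List[float]:
    """Answer the analyst's request (here: means of the chosen attributes)
    without touching the table again."""
    n = pre["n"]
    return [pre["ls"][a] / n for a in attributes]

table = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
pre = parameter_independent(table)        # done in anticipation
print(parameter_specific(pre, [0]))       # -> [2.0], no further table scan
\end{verbatim}

The parameter-independent part scans the table once in anticipation; the parameter-specific part answers the analyst's request from the stored intermediate results.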
The above-mentioned concept of anticipatory pre-computation is ideally suited to analysing data warehouses containing few but very large fact tables. As data warehouses and KDD are commonly combined in companies with mass data, this concept offers much potential for many companies.
Some approaches in related work already arise from the same idea of anticipatory computation but are limited to complete tasks; most of these are tasks of the pre-processing phase. This dissertation introduces an approach that splits tasks into smaller chunks and thereby increases the number of pre-computable tasks. Additionally, this approach extends the idea of pre-computation to the data mining phase. Hence, it is more broadly applicable than existing work.
This dissertation will present a technique to increase the performance of instance-specific pre-processing tasks to make them so fast that analysts can work on analyses without being slowed down by the data mining system.
In addition to the \emph{pre-processing} phase, the \emph{data mining} phase also has high potential for improvement.
This dissertation presents techniques to enhance the performance of data mining techniques and thereby intends to improve the KDD process in terms of quality and speed. To be more specific, it is useful to distinguish the data mining techniques association rule mining, classification, clustering, and regression at this point.
Association rule mining requires finding frequently occurring sets of items before finding association rules among them. Once the frequent item sets that are valid in the set of all transactions have been determined, it is possible to determine the frequent item sets that are valid in subsets of all transactions. This has been sufficiently shown in other work. This dissertation will discuss those works for reasons of completeness but will contribute only a minor new idea.
Classification algorithms use a set of data to train a function, called a classifier, which is able to predict the class of a tuple using the tuple's attribute values. The set of data used for training typically contains only few tuples; consequently, scaling is not a problem. The quality of the resulting classifier is the most pressing issue of classification. This work will discuss quality-improving classification approaches and contribute some new ideas on how to improve the quality of a classifier by using auxiliary data gained as by-products of other KDD process instances.
Clustering algorithms produce a set of clusters. All clustering algorithms but density-based ones represent clusters with a small set of describing attributes, which varies from algorithm to algorithm. For instance, \EM clustering algorithms generate a probabilistic model consisting of $k$ random variables and store the mean and deviation of each variable, while $k$-means and related algorithms represent clusters by their Euclidean mean. It is easy to compute the mean and standard deviation of a cluster when the number of the cluster's tuples, the linear sum of its tuples, and their sum of squares are known, as will be demonstrated later.
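As a brief sketch, using the notation $N$ for the number of tuples, $\mathit{LS}$ for their linear sum, and $\mathit{SS}$ for their sum of squares (per attribute), mean and standard deviation follow as
\[
\mu = \frac{\mathit{LS}}{N}, \qquad \sigma = \sqrt{\frac{\mathit{SS}}{N} - \left(\frac{\mathit{LS}}{N}\right)^{2}}.
\]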
If the tuples within a subset of the data are very similar to each other, they will belong to the same cluster in most cases. Thus, it is possible to build a subtotal of all of these tuples.
Determining the mean by summing up all tuples or by summing up subtotals produces the same result as long as no tuples of a different cluster are included. If there are such tuples, the result may, but need not, be erroneous. The experimental results will show that this error is tolerable.
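The underlying reason is the additivity of these subtotals: if a cluster is assembled from disjoint subsets of its tuples with statistics $(N_i, \mathit{LS}_i, \mathit{SS}_i)$, then
\[
N = \sum_{i} N_i, \qquad \mathit{LS} = \sum_{i} \mathit{LS}_i, \qquad \mathit{SS} = \sum_{i} \mathit{SS}_i,
\]
so the mean computed from the subtotals is identical to the mean computed from the individual tuples.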
This dissertation will discuss approaches that improve clustering by aggregating tuples into sub-clusters in detail and will present new ideas. In particular, the re-use of sub-clusters will be enhanced in comparison to existing approaches like \cite{zhang96birch} or \cite{bradley98scaling} such that it is possible to find sub-clusters in anticipation.
Additionally, this dissertation will discuss other popular approaches that improve the performance of analyses, such as sampling.
A running example shall illustrate the concepts of this dissertation. Therefore, the next section introduces this running example, which is used to demonstrate the more complex parts of this dissertation's concepts.