====== Introduction ======

Knowledge discovery in databases (KDD) and data mining offer enterprises and other large organisations a set of methods to discover previously unknown relations among data---which are often called patterns, e.g. [[diss_bibliography#han|han]], [[diss_bibliography#Frawley|Frawley]], [[diss_bibliography#ester|ester]]. With these patterns enterprises gain deeper insight into their context and are able to use them for better decision-making.

Frawley, Piatetsky-Shapiro and Matheus define //knowledge discovery// as follows:
> Knowledge discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [p. 3][[diss_bibliography#Frawley|Frawley]].
It is common practice to use the term //knowledge discovery in databases//, or //kdd// for short, to stress the most common use case. Yet, it is convenient to use the term //kdd// also for knowledge discovery in data that are kept not in databases but in texts or data streams.

The term //data mining// denotes the step of the //kdd// process in which sophisticated methods analyse the data for patterns of interest [[diss_bibliography#han|han]], [[diss_bibliography#ester|ester]]. The techniques used in this step are further called //data mining techniques//. Clustering, classification, and association rule mining are the most commonly used data mining techniques.

The quality of the patterns found by data mining algorithms and the time needed to find them are the measures of interest in //kdd// projects. The quality of a pattern determines its usability---for instance, its ability to guide a decision. Additionally, when mining data in large databases, time is an issue, too.

This dissertation focusses on the problems of pattern quality and runtime in //kdd// projects. Recently, many approaches have been introduced to improve specific //kdd// analyses. However, these approaches try to optimise a single analysis.

In contrast to existing work, this dissertation presents a novel way to improve //kdd// projects in terms of shorter project time and higher quality of results by optimising a set of analyses instead of optimising each analysis individually.

The remainder of this section is therefore organised as follows: to be able to show open issues of past work, it first introduces the //kdd// process as it is. It then shows where the open issues of the //kdd// process lie. Finally, it presents the basic idea of how to improve the //kdd// process in terms of runtime and quality of results.

{{ wiki:kdd.png | KDD process }}

===== A First Introduction to the kdd Process =====

This subsection gives a short introduction to the //kdd// process. It provides the information necessary to show the open issues that are the topic of the next subsection. A more detailed discussion of the //kdd// process will follow in section \ref{kdd process discussion}.

Figure \ref{kdd process} shows the //kdd// process with all its phases. The //kdd// process consists of the phases pre-processing, data mining, and evaluating the resulting patterns. Using evaluated patterns for decision-making is not part of the //kdd// process, but Figure \ref{kdd process} contains this item to denote that the //kdd// process does not exist per se but is embedded in a broader economic context.

The first step in a //kdd// process is selecting data that might contain patterns one is interested in. Some data mining algorithms require input data that are consistent with a specific data schema. The //pre-processing// phase consists of all activities that select or transform data according to the needs of a data mining analysis.

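As an illustration, the following is a minimal sketch of such a pre-processing step in Python; the table and attribute names are invented for the example and are not taken from this dissertation.

<code python>
# A minimal pre-processing sketch: select attributes and rescale them.
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Select only the attributes the planned analysis needs.
    df = df[["customer_id", "amount", "quantity"]].dropna()
    # Transform: rescale numeric attributes to [0, 1], since many
    # distance-based mining algorithms expect comparable ranges.
    for col in ("amount", "quantity"):
        lo, hi = df[col].min(), df[col].max()
        df[col] = (df[col] - lo) / (hi - lo) if hi > lo else 0.0
    return df
</code>
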
In the //data mining// phase a data mining expert---further referred to as the analyst---analyses the data using sophisticated data mining algorithms. The outcome of these algorithms is a set of patterns that contain potentially new knowledge.

{{ wiki:iteratingnecessary.png | Issue A: Insufficient results cause additional iterations }}

The resulting patterns must be evaluated before they can be used. This happens during the //evaluating// phase, in which an analyst tests whether or not the patterns are new, valid, and interesting.

If there are no patterns that fulfil all the above conditions, the //kdd// process or a part of it has to be repeated once more. The reason why a //kdd// process has to be iterated can be a bad choice of parameter values or a bad selection of data.

===== Open Issues of the kdd Process =====

Although there are approaches that improve the runtime of instances of the //kdd// process, time is still an open issue of the //kdd// process---quality of results is the other one. Hence, we will discuss open issues concerning the runtime of instances of the //kdd// process first, before discussing issues related to quality.

Algorithms with long runtimes and many iterations during an instance of the //kdd// process increase the runtime of that instance. If the data set which an algorithm is analysing is very large, then even the runtime of well-scaling algorithms is high. The consequence of a long runtime of an algorithm is that the analyst who started it is slowed down in his/her work by the system until the algorithm terminates.

Iterating steps in an instance of the //kdd// process is necessary because the best values of the used algorithm's parameters and the best set of data for a specific analysis are initially unknown. Assuming the opposite would mean that an analyst performs an analysis whose result he/she already knows. Yet, that would conflict with the definition of //kdd//, which requires patterns to be new.

{{ wiki:multipleinstances.png | Issue B: Multiple isolated //kdd// instances use the same data }}

Consequently, it is common practice to run data mining algorithms several times---each time with a different combination of parameter values and data. More precisely, the analyst chooses the data and values that, to the best of his/her current knowledge, will most likely bring the best results. Even an unsuccessful attempt commonly gives the analyst a better insight into the relations among the analysed data, so future analyses will be more likely to succeed. Figure \ref{iteratingnecessary} depicts an instance of the //kdd// process that requires two iterations.

When an analyst re-does an analysis with different settings, these analyses are not identical, but several tasks are very similar. More specifically, both analyses share several intermediate results, as later chapters of this dissertation will show. Obviously, processing the same intermediate results more than once means wasting time. Previous work introduces techniques that are able to re-use the results of a previous analysis when parameters change. Yet, if an analyst uses a different subset of the data, he/she must compute all results of this new analysis from scratch.

Sharing the computation of some intermediate results is not limited to analyses of the same instance of the //kdd// process. A set of analyses can also share several tasks or parts of them. Especially if these analyses use the same data set and also share the same type of analysis, several intermediate results can be identical. Figure \ref{multipleinstances} shows two instances of the //kdd// process that analyse the same data set. Although the goals of both instances are different, both instances need to cluster the same table. Yet, parameter values and the used attributes differ. Later chapters will show that it is possible to pre-compute common intermediate results for analyses when only attributes and parameters vary.

Re-doing the same steps of analyses would be tolerable if the time needed to compute an analysis were negligible---which is not the case, as illustrated in the next paragraphs.

//Pre-processing// and most //data mining methods// require accessing all data---which is very time-consuming in large databases exceeding 1 GB in size. Some algorithms are inapplicable to large data sets without proper database index structures. For instance, a single run of the DBSCAN [[diss_bibliography#DBSCAN|DBSCAN]] clustering algorithm scanning ten thousand tuples (less than 1 MB in size) took approximately eleven hours on a Sun SPARC workstation [pp. 149-151][[diss_bibliography#stoettinger|stoettinger]]. The reason for this long runtime is that DBSCAN requires multiple scans of the selected data.

Although some data mining algorithms, such as the clustering algorithms BIRCH [[diss_bibliography#zhang96birch|zhang96birch]] and CLARANS [[diss_bibliography#ng94efficient|ng94efficient]], are very efficient and scale better in the number of tuples, the minimum time needed for //pre-processing// and //data mining// is still too long for interactively analysing the data. During the tests for this dissertation, the Oracle 9i database required a few minutes to answer a simple //SELECT COUNT(*) FROM table// query when the size of the table exceeded half a gigabyte---each time the query required accessing the values of a specific attribute, the response time increased significantly, to up to ten minutes.

The databases of mail order companies or large warehouses exceed one gigabyte by far. For instance, think of a company operating supermarkets. Even when there are only $300$ small supermarkets, each with only $300$ customers buying $5$ items per day, and the database requires only $512$ bytes to store a sold item, the data set that is stored daily requires roughly a quarter of a gigabyte of disc space.

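The figure follows from a simple back-of-the-envelope calculation using the numbers above:

<code python>
# Back-of-the-envelope calculation for the supermarket example.
stores = 300
customers_per_store = 300
items_per_customer = 5
bytes_per_item = 512

bytes_per_day = stores * customers_per_store * items_per_customer * bytes_per_item
print(bytes_per_day / 10 ** 9)  # 0.2304 -- roughly a quarter of a gigabyte per day
</code>
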
A long runtime of an algorithm or a pre-processing task means that an analyst has to wait longer for the computer to finish its job than he/she needs to analyse the resulting patterns.

Zhang et alii state that the most time-consuming part of data mining algorithms is scanning the data, while operations in main memory are negligible [[diss_bibliography#zhang96birch|zhang96birch]]. This statement is consistent with the test results of this dissertation. As a consequence, approaches to improving data mining algorithms must focus on minimising the number of required database scans.

Avoiding scanning the database at least once is impossible. Even taking a sample requires at least one scan of the data---otherwise, the sample taken would not be representative.

Some data mining algorithms already require only the minimum of one database scan. BIRCH is a hierarchical clustering algorithm requiring only one scan, Bradley et alii have proposed an EM (expectation maximisation) clustering algorithm in [[diss_bibliography#Bradley|Bradley]], and Kanungo et alii have presented a k-means-like algorithm in [[diss_bibliography#kanungo-efficient|kanungo-efficient]], to name only a few examples of clustering approaches that require only one scan of the data.

If existing data mining algorithms are nearly as fast as the theoretical limit, but even this minimal time is too long for interactive work, the source of major improvements in runtime must lie elsewhere. The next subsection gives the basic idea of the approach of this dissertation, which is capable of reducing the needed runtime below the current limit. The research questions of this dissertation are a direct consequence of this approach.

In many approaches, which we will discuss later, one can decrease runtime at the expense of the quality of the results of the //kdd// process. Hence, quality is always an additional issue when discussing improvements of runtime. Yet, improving quality and runtime do not exclude each other. Thus, the effect on quality is an additional research question of any approach changing the //kdd// process. Still, the benefit of improving the quality of results depends on the specific instance of the //kdd// process. For instance, if a company can improve the classification of customers who are about to cancel their contracts with the company, then each increase in quality, i.e. the accuracy of classification, is beneficial.

===== Basic Idea of Approach and Research Questions =====

The //kdd// process as it is shown in figure \ref{kdd process} describes the sequence of steps of a single analysis. In particular, the //kdd// process does not take into account that analysts do several analyses---not necessarily all of them at the same time, but many analyses within a year. The discussion of the last subsection indicated that further improvement of analyses is necessary to enable working continuously on data analyses. However, improving a single analysis is insufficient because the theoretical limit of one scan is not fast enough. Changing this isolated point of view can be a source of major improvement, as will be shown later.

//The core idea of this dissertation is to re-think the knowledge discovery in databases process. Instances of the KDD process should no longer be regarded as single independent process instances but as dependent instances that can profit from each other.//

//Further, if the specific setting of a future KDD process instance is unknown but several tasks of the instance are known, these known tasks should be done in anticipation.//

If two or more //kdd// process instances share the same time-consuming tasks or parts of them, then it is reasonable to do those tasks only in the first occurring instance---it would be wasted time to re-do such a task in the second and future instances.

//The underlying basic assumption is that there are process instances that have time-consuming tasks in common.//

The research questions this dissertation intends to answer are consequences of the above-mentioned basic idea and basic assumption. To be specific, the research questions of this dissertation are as follows:

  - How do //kdd// analyses differ from each other?
  - Are there intermediate results shared by multiple analyses?
  - Can (intermediate) results of an analysis affect the quality of the result of another analysis?

If the variability of analyses is known, it is possible to identify items that are common to all analyses of a specific type of analysis. Additionally, one can identify the range of some parameters. If there are only a few values of a specific parameter the analyst can choose from, one can pre-compute the results for all of these potential values. When the analyst finally chooses the actual value for an analysis, the result has already been computed by the system.

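As an illustration, the following is a minimal sketch of this idea in Python; the analysis function and the candidate parameter values are hypothetical, not this dissertation's actual design.

<code python>
# Pre-compute results for every value the analyst may choose from.
def run_analysis(k: int) -> str:
    # Placeholder for an expensive mining task, e.g. clustering
    # a large table into k clusters; hypothetical result value.
    return f"patterns for k={k}"

CANDIDATE_KS = [2, 3, 5, 8]  # the few possible parameter values

# Compute all results in anticipation ...
cache = {k: run_analysis(k) for k in CANDIDATE_KS}

# ... so the analyst's eventual choice is answered instantly.
result = cache[5]
</code>
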
If there are intermediate results of an analysis that other analyses might need, then these other analyses can profit from these intermediate results. Hence, the answer to the second research question is needed to improve the //kdd// process: if common intermediate results are identified, one can pre-compute them in anticipation to save time in succeeding analyses.

The third research question focusses on the quality of analyses. Taking several analyses into account can also affect the quality of the results of each analysis. For instance, knowledge gained in one analysis can help to get better results in another analysis by biasing the succeeding analysis.

A specific type of analysis needs to perform tasks of a specific type. Some of these tasks depend on parameters given by the analyst, others do not. It is possible to pre-compute the results of tasks that are independent of the analyst's input.

The data set which the system processes in a specific task is the only differing factor among the parameter-independent tasks of analyses of the same type. As the number of tables in a database is limited, it is possible to pre-compute intermediate results of commonly used types of analyses for all large tables.

However, there still remain tasks depending on parameters which cannot be pre-computed. Thus, it is necessary to split analyses into parts that can be processed in anticipation and tasks that cannot, as illustrated in figure \ref{somekddinstances}. A single instance of a supporting process pre-computes intermediate results for two instances of an altered //kdd// process that consists only of parameter-specific tasks. Later sections will introduce the supported process as the parameter-specific //kdd// process and the supporting process as the parameter-independent //kdd// process.

The instance of the parameter-independent //kdd// process pre-computes intermediate results and keeps them up to date when a table which has been used for pre-computation has changed, i.e. when there are new, removed, or altered tuples in the table.

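A minimal sketch of such maintenance follows, assuming a hypothetical additive aggregate of tuple count and per-attribute sums; the class and its hooks are illustrative only.

<code python>
# Incrementally maintain a pre-computed aggregate without re-scanning
# the table: each change to the table updates the aggregate directly.
class MaintainedAggregate:
    def __init__(self, dim: int):
        self.n = 0                      # number of tuples
        self.linear_sum = [0.0] * dim   # per-attribute sums

    def on_insert(self, tup):
        self.n += 1
        self.linear_sum = [s + v for s, v in zip(self.linear_sum, tup)]

    def on_delete(self, tup):
        self.n -= 1
        self.linear_sum = [s - v for s, v in zip(self.linear_sum, tup)]

    def on_update(self, old, new):
        # An altered tuple is a deletion followed by an insertion.
        self.on_delete(old)
        self.on_insert(new)
</code>
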
The instances of the parameter-specific //kdd// process use the pre-computed intermediate results either to compute the final result with higher performance or to bias the computation in order to receive better results. Improving speed with pre-computed intermediate results corresponds to the second research question, while improving quality by biasing corresponds to the third research question.

The above-mentioned concept of anticipatory pre-computation ideally supports analysing data warehouses containing few but very large fact tables. As the combination of data warehouses and //kdd// is common in companies having mass data, this concept offers much potential for many companies.

Some approaches of related work already arise from the same idea of pre-computation but are limited to complete tasks---most of them tasks of the pre-processing phase. This dissertation introduces an approach that splits tasks into smaller chunks and is able to increase the number of pre-computable tasks that way. Additionally, this approach extends the idea of pre-computation to the data mining phase. Hence, it is more broadly applicable than existing work.

{{ wiki:somekddinstances.png | Improving the //kdd// process by using pre-computed intermediate results and auxiliary data instead of tuples }}

This dissertation will present a technique to increase the performance of instance-specific pre-processing tasks in order to make them so fast that analysts can work on analyses without being slowed down by the data mining system.

In addition to the //pre-processing// phase, the //data mining// phase also has high potential for improvement.

This dissertation presents techniques to enhance the performance of data mining techniques and intends to improve the //kdd// process in terms of quality and speed that way. To be more specific, it is useful to distinguish at this point between the data mining techniques association rule mining, classification, clustering, and regression.

Association rule mining requires finding frequently occurring sets of items before finding association rules among them. Once the frequent item sets that are valid in the set of all transactions have been determined, it is possible to determine the frequent item sets that are valid in subsets of all transactions. This has been sufficiently shown in other work. This work will discuss those works for reasons of completeness but will contribute only a minor new idea.

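As an illustration, the following minimal sketch of support counting uses invented transactions and an invented item set. It shows the additivity of support counts over a partition of the transactions, which is the kind of property such re-use builds on.

<code python>
# Support of an item set = number of transactions containing it.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
]

def support(itemset: set, txns: list) -> int:
    return sum(itemset <= t for t in txns)  # <= is the subset test on sets

# Counts are additive over disjoint subsets of the transactions, so
# counts gathered for the parts can be combined instead of re-scanned.
part_a, part_b = transactions[:2], transactions[2:]
whole = support({"bread", "milk"}, transactions)
assert whole == support({"bread", "milk"}, part_a) + support({"bread", "milk"}, part_b)
</code>
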
Classification algorithms use a set of data to train a function called a classifier, which is able to predict the class of a tuple using the tuple's attribute values. The set of data that is used for training typically contains only few tuples---consequently, scaling of the algorithms is not a problem. The quality of the resulting classifier is the most pressing issue of classification. This work will discuss quality-improving classification approaches and contribute some new ideas on how to improve the quality of a classifier by using auxiliary data gained as by-products of other //kdd// process instances.

Clustering algorithms produce a set of clusters. All clustering algorithms except density-based clustering algorithms represent clusters with a small set of describing attributes, which varies from algorithm to algorithm. For instance, EM clustering algorithms generate a probabilistic model consisting of $k$ random variables and store the mean and deviation of each variable, while $k$-means and related algorithms represent clusters by their Euclidean mean. It is easy to compute the mean and standard deviation of a cluster when the number of the cluster's tuples, the linear sum of its tuples, and their sum of squares are known, as will be demonstrated later.

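A minimal sketch of this computation for the one-dimensional case, with invented numbers; it mirrors the standard identity rather than any specific algorithm's code.

<code python>
# Derive a cluster's mean and standard deviation from three sufficient
# statistics: tuple count n, linear sum ls, and sum of squares ss.
import math

def mean_and_std(n, ls, ss):
    mean = ls / n
    variance = ss / n - mean ** 2   # E[X^2] - E[X]^2
    return mean, math.sqrt(variance)

tuples = [1.0, 2.0, 4.0]
n, ls, ss = len(tuples), sum(tuples), sum(x * x for x in tuples)
print(mean_and_std(n, ls, ss))  # identical to computing from the raw tuples
</code>
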
If the tuples within a subset of the data are very similar to each other, they will in most cases be part of the same cluster. Thus, it is possible to build a subtotal of all of these tuples.

Determining the mean by summing up all tuples or by summing up subtotals will produce the same result as long as no tuples are included that are part of a different cluster. If there are such tuples, then the result might, but need not, be erroneous. The experimental results will show that this error is tolerable.

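The following minimal sketch, again with invented numbers, illustrates why: the statistics are additive over disjoint subtotals.

<code python>
# Summing pre-aggregated subtotals yields exactly the same mean as
# summing all tuples, as long as each subtotal holds tuples of one
# cluster only.
cluster = [1.0, 2.0, 4.0, 5.0]
direct_mean = sum(cluster) / len(cluster)

# Two subtotals (count, linear sum) over disjoint subsets of the cluster.
subtotals = [(2, 1.0 + 2.0), (2, 4.0 + 5.0)]
n = sum(c for c, _ in subtotals)
ls = sum(s for _, s in subtotals)
assert ls / n == direct_mean
</code>
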
This dissertation will discuss in detail approaches that improve clustering by aggregating tuples into sub-clusters, and will present new ideas. In particular, the re-use of sub-clusters will be enhanced in comparison with existing approaches like [[diss_bibliography#zhang96birch|zhang96birch]] or [[diss_bibliography#bradley98scaling|bradley98scaling]], such that it is possible to find sub-clusters in anticipation.

Additionally, this dissertation will also discuss other popular approaches to improving the performance of analyses, such as sampling.

A running example shall illustrate the concepts of this dissertation. Therefore, the next section introduces this running example, which is used to demonstrate the more complex parts of this dissertation's concepts.