====== Introduction ======

Knowledge discovery in databases (KDD) and data mining offer
enterprises and other large organisations a set of methods to
discover previously unknown relations among data, often called
patterns, e.g. [[diss_bibliography#han|han]], [[diss_bibliography#Frawley|Frawley]], [[diss_bibliography#ester|ester]]. With
these patterns, enterprises gain deeper insight into their context
and can make better-informed decisions.

Frawley, Piatetsky-Shapiro and Matheus define //knowledge
discovery// as follows:
> Knowledge discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [p. 3][[diss_bibliography#Frawley|Frawley]].
It is common practice to use the term //knowledge discovery in
databases//, or //kdd// for short, to stress the most common use
case. Yet, it is also convenient to use the term //kdd// for
knowledge discovery in data that are kept not in databases but in
texts or data streams.

The term //data mining// denotes the step of the //kdd// process
in which sophisticated methods analyse the data for patterns of
interest [[diss_bibliography#han|han]], [[diss_bibliography#ester|ester]]. The techniques used therein are
referred to as //data mining techniques//. Clustering,
classification, and association rule mining are the most commonly
used data mining techniques.

The quality of the patterns found by data mining algorithms and
the time needed to find them are the measures of interest in //kdd//
projects. The quality of a found pattern determines its usability,
for instance, to guide a decision. Additionally, when mining large
databases, time becomes an issue, too.

This dissertation focusses on the problems of pattern quality and
runtime in //kdd// projects. Recently, many approaches have been
introduced to improve specific //kdd// analyses. However, these
approaches optimise a single analysis in isolation.

In contrast to existing work, this dissertation presents a novel
way to improve //kdd// projects in terms of shorter project time and
higher quality of results by optimising a set of analyses instead
of optimising each analysis individually.

The remainder of this section is organised as follows: to be able
to point out open issues of past work, it first introduces the
//kdd// process as it is. It then shows where the open issues of the
//kdd// process lie. Finally, it presents the basic idea of how to
improve the //kdd// process in terms of runtime and quality of
results.

{{ wiki:kdd.png | KDD process}}

===== A First Introduction to the kdd Process =====

This subsection gives a short introduction to the //kdd// process.
It provides the information necessary to point out the open
issues, which are the topic of the next subsection. A more
detailed discussion of the //kdd// process follows in section
\ref{kdd process discussion}.

Figure \ref{kdd process} shows the //kdd// process with all its
phases. The //kdd// process consists of the phases pre-processing,
data mining, and evaluating the resulting patterns. Using
evaluated patterns for decision-making is not part of the //kdd//
process, but Figure \ref{kdd process} contains this item to denote
that the //kdd// process does not exist per se but is embedded in a
broader economic context.

The first step in a //kdd// process is selecting data that might
contain patterns one is interested in. Some data mining algorithms
require input data that conform to a specific data schema. The
//pre-processing// phase consists of all activities that select or
transform data according to the needs of a data mining analysis.

In the //data mining// phase a data mining expert, referred to in
the following as the analyst, analyses the data using
sophisticated data mining algorithms. The outcome of these
algorithms is a set of patterns that contain potentially new
knowledge.

{{ wiki:iteratingnecessary.png | Issue A: Insufficient results necessitate additional iterations }}

The resulting patterns must be evaluated before they can be used.
This happens during the //evaluating// phase, in which an analyst
tests whether the patterns are new, valid, and interesting.

If no patterns fulfil all of the above conditions, the //kdd//
process or a part of it has to be repeated. Reasons for iterating
a //kdd// process include a bad choice of parameter values or a bad
selection of data.

===== Open Issues of the kdd Process =====

Although approaches exist that improve the runtime of instances of
the //kdd// process, time is still an open issue of the //kdd//
process; quality of results is the other one. Hence, we will
discuss open issues concerning the runtime of instances of the
//kdd// process first, before discussing issues related to quality.

Algorithms with long runtimes and many iterations during an
instance of the //kdd// process increase the runtime of that
instance. If the data set an algorithm is analysing is very large,
then even well-scaling algorithms have a high runtime. The
consequence of an algorithm's long runtime is that the analyst who
started it is held up in his/her work until the algorithm
terminates.

Iterating steps in an instance of the //kdd// process is necessary
because the best values of the used algorithm's parameters and the
best set of data for a specific analysis are initially unknown.
Assuming the opposite would mean that an analyst performs an
analysis whose result he/she already knows. Yet, that would
conflict with the definition of //kdd//, which requires patterns to
be new.

{{ wiki:multipleinstances.png | Issue B: Multiple isolated //kdd// instances use the same data }}

Consequently, it is common practice to run data mining algorithms
several times, each time with a different combination of parameter
values and data. More precisely, the analyst chooses the data and
values that, to the best of his/her current knowledge, will most
likely yield the best results. Even an unsuccessful attempt
commonly gives the analyst better insight into the relations among
the analysed data, making future analyses more likely to succeed.
Figure \ref{iteratingnecessary} depicts an instance of the //kdd//
process that requires two iterations.

When an analyst re-does an analysis with different settings, the
analyses are not identical, but several tasks are very similar.
More specifically, both analyses share several intermediate
results, as later chapters of this dissertation will show.
Obviously, processing the same intermediate results more than once
wastes time. Previous work has introduced techniques that are able
to re-use the results of a previous analysis when parameters
change. However, if an analyst uses a different subset of the
data, he/she must compute all results of the new analysis from
scratch.

Sharing the computation of intermediate results is not limited to
analyses within the same instance of the //kdd// process. A set of
analyses can also share several tasks or parts of them. Especially
if these analyses use the same data set and share the same type of
analysis, several intermediate results can be identical. Figure
\ref{multipleinstances} shows two instances of the //kdd// process
that analyse the same data set. Although the goals of the two
instances differ, both instances need to cluster the same table;
only the parameter values and the used attributes differ. Later
chapters will show that it is possible to pre-compute common
intermediate results for analyses when only attributes and
parameters vary.

Re-doing the same steps of analyses would be tolerable if the time
needed to compute an analysis were negligible, which is not the
case, as the next paragraphs illustrate.

//Pre-processing// and most //data mining methods// require
accessing all data, which is very time-consuming in large
databases exceeding 1 GB in size. Some algorithms are inapplicable
to large data sets without proper database index structures. For
instance, a single run of the DBSCAN [[diss_bibliography#DBSCAN|DBSCAN]]
clustering algorithm scanning ten thousand tuples (less than 1 MB
in size) took approximately eleven hours on a SUN SPARC
workstation [pp. 149-151][[diss_bibliography#stoettinger|stoettinger]]. The reason for this
long runtime is that DBSCAN requires multiple scans of the
selected data.

Although some data mining algorithms, such as the clustering
algorithms BIRCH [[diss_bibliography#zhang96birch|zhang96birch]] and CLARANS
[[diss_bibliography#ng94efficient|ng94efficient]], are very efficient and scale well in the
number of tuples, the minimum time needed for //pre-processing//
and //data mining// is still too long for interactively analysing
the data. During the tests for this dissertation, the Oracle 9i
database required a few minutes to answer a simple //SELECT
COUNT(*) FROM table// query when the size of the table exceeded
half a gigabyte; each time the query required accessing the values
of a specific attribute, the response time increased significantly,
up to ten minutes.

The databases of mail order companies or large warehouses exceed
one gigabyte by far. For instance, think of a company operating
supermarkets. Even if there are only $300$ small supermarkets, in
each of which only $300$ customers buy $5$ items per day, and the
database requires only $512$ bytes to store a sold item, the data
stored daily requires more than $0.2$ GB of disc space.
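
This back-of-envelope figure can be checked directly; the
following sketch merely multiplies the numbers stated above:

<code python>
# Back-of-envelope check of the daily data volume stated above.
stores = 300            # small supermarkets
customers_per_store = 300
items_per_customer = 5  # items bought per customer per day
bytes_per_item = 512    # storage needed per sold item

items_per_day = stores * customers_per_store * items_per_customer
bytes_per_day = items_per_day * bytes_per_item
print(f"{items_per_day:,} items/day -> {bytes_per_day / 1e9:.2f} GB/day")
# 450,000 items/day -> 0.23 GB/day
</code>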
+ | |||
+ | A long runtime of an algorithm or a pre-processing task means that | ||
+ | an analyst has to wait longer for a computer to finish its job | ||
+ | than he/she can analyse the resulting patterns. | ||
+ | |||
Zhang et alii state that the most time-consuming part of data
mining algorithms is scanning the data, while operations in main
memory are negligible [[diss_bibliography#zhang96birch|zhang96birch]]. This statement is
consistent with the test results of this dissertation. As a
consequence, approaches to improving data mining algorithms must
focus on minimising the number of required database scans.

Avoiding scanning the database at least once is impossible. Even
taking a sample requires at least one scan of the data; otherwise,
the sample would not be representative.
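
Reservoir sampling illustrates this lower bound: one sequential
scan suffices to draw a uniform sample. The following is a minimal
sketch of this standard technique (not taken from the cited works;
the data here is simulated):

<code python>
import random

def reservoir_sample(rows, k, rng=random.Random(42)):
    """Draw a uniform random sample of k rows in one sequential scan."""
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)       # fill the reservoir first
        else:
            j = rng.randint(0, i)    # row replaces a slot with probability k/(i+1)
            if j < k:
                sample[j] = row
    return sample

# One pass over a (here: simulated) large table suffices.
print(reservoir_sample(range(1_000_000), k=5))
</code>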
+ | |||
+ | Some data mining algorithms already require only the minimum | ||
+ | number of one database scan. BIRCH is a hierarchical clustering | ||
+ | algorithm requiring only one scan, Bradley et alii have proposed | ||
+ | in [[diss_bibliography#Bradley|Bradley]] an EM (expectation maximisation) clustering | ||
+ | algorithm, or Kanungo et alii have presented in | ||
+ | [[diss_bibliography#kanungo-efficient|kanungo-efficient]] a k-means-like algorithm, to name only | ||
+ | few examples of clustering approaches that require only one scan | ||
+ | of the data. | ||
+ | |||
If existing data mining algorithms are nearly as fast as the
theoretical limit, but even this minimal time is too long for
interactive work, the source of major improvements in runtime must
lie elsewhere. The next subsection presents the basic idea of this
dissertation's approach, which is capable of reducing the needed
runtime below the current limit. The research questions of this
dissertation are a direct consequence of this approach.

In many approaches, which we will discuss later, one can decrease
runtime at the expense of the quality of the results of the //kdd//
process. Hence, quality is always an additional issue when
discussing improvements of runtime. Yet, improving quality and
runtime do not exclude each other. Thus, the effect on quality is
an additional research question of any approach changing the //kdd//
process. The benefit of improving the quality of results depends
on the specific instance of the //kdd// process. For instance, if a
company can improve the classification of customers who are about
to cancel their contracts with the company, then each increase in
quality, i.e. the accuracy of the classification, is beneficial.

===== Basic Idea of Approach and Research Questions =====

The //kdd// process as shown in figure \ref{kdd process} describes
the sequence of steps of a single analysis. In particular, the
//kdd// process does not take into account that analysts perform
several analyses, not necessarily all of them at the same time,
but many analyses within a year. The discussion of the last
subsection indicated that further improvement of analyses is
necessary to enable working continuously on data analyses.
However, improving a single analysis is insufficient because even
the theoretical limit of one scan is not fast enough. Changing
this isolated point of view can be a source of major improvement,
as will be shown later.

//The core idea of this dissertation is to re-think the
knowledge discovery in databases process. Instances of the KDD
process should no longer be regarded as single independent process
instances but as dependent instances that can profit from each
other.//

//Further, if the specific setting of a future KDD process
instance is unknown but several tasks of the instance are known,
these known tasks should be done in anticipation.//

If two or more //kdd// process instances share the same
time-consuming tasks, or parts of them, then it is reasonable to
perform those tasks only in the first occurring instance; it would
be wasted time to re-do such a task in the second and later
instances.

//The underlying basic assumption is that there are
process instances that have time-consuming tasks in
common.//

The research questions this dissertation intends to answer are
consequences of the above-mentioned basic idea and basic
assumption. To be specific, the research questions of this
dissertation are as follows:

  - How do //kdd// analyses differ from each other?
  - Are there intermediate results shared by multiple analyses?
  - Can (intermediate) results of an analysis affect the quality of the result of another analysis?

If the variability of analyses is known, it is possible to
identify items that are common to all analyses of a specific type
of analysis. Additionally, one can identify the range of some
parameters. If there are only a few values of a specific parameter
the analyst can choose from, one can pre-compute the results for
all of these potential values. When the analyst finally chooses
the actual value for an analysis, the result has already been
computed by the system.
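
A minimal sketch of this idea follows; all names are hypothetical,
and ''run_analysis'' stands in for an arbitrary expensive,
deterministic analysis task:

<code python>
# Hypothetical sketch: pre-compute an analysis for every admissible
# parameter value so that the analyst's eventual choice is answered
# instantly from a cache.

def run_analysis(table, k):
    # Stand-in for an expensive, deterministic analysis,
    # e.g. clustering the table into k clusters.
    return f"result of analysing {table} with k={k}"

ADMISSIBLE_K = [2, 3, 4, 5]           # known, small parameter range
cache = {k: run_analysis("sales", k)  # computed in anticipation
         for k in ADMISSIBLE_K}

chosen_k = 3                          # the analyst's actual choice, made later
print(cache[chosen_k])                # served without re-computation
</code>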
+ | |||
+ | If there are intermediate results of an analysis that other | ||
+ | analyses might need than these other analyses can profit of these | ||
+ | intermediate results. Hence, the answer to the second research | ||
+ | question is needed to improve the //kdd// process: If common | ||
+ | intermediate results are identified, one can pre-compute them in | ||
+ | anticipation to save time in succeeding analyses. | ||
+ | |||
The third research question focusses on the quality of analyses.
Taking several analyses into account can also affect the quality
of the results of each analysis. For instance, knowledge gained in
one analysis can help to obtain better results in another analysis
by biasing the succeeding analysis.

A specific type of analysis needs to perform tasks of a specific
type. Some of these tasks depend on parameters supplied by the
analyst, others do not. It is possible to pre-compute the results
of tasks which are independent of the analyst's input.

The data set which the system processes in a specific task is the
only factor that differs between the parameter-independent tasks
of analyses of the same type. As the number of tables in a
database is limited, it is possible to pre-compute intermediate
results of commonly used types of analyses for all large tables.

However, there remain tasks depending on parameters which cannot
be pre-computed. Thus, it is necessary to split analyses into
parts that can be processed in anticipation and tasks that cannot,
as illustrated in figure \ref{somekddinstances}. A single instance
of a supporting process pre-computes intermediate results for two
instances of an altered //kdd// process that consists only of
parameter-specific tasks. Later sections will introduce the
supported process as the parameter-specific //kdd// process and the
supporting process as the parameter-independent //kdd// process.

The instance of the parameter-independent //kdd// process
pre-computes intermediate results and keeps them up-to-date when a
table that has been used for pre-computation changes, i.e. when
tuples are inserted into, removed from, or altered in the table.
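
A hedged sketch of such maintenance, assuming (purely for
illustration) that the intermediate result is the triple of count,
linear sum, and sum of squares that the clustering discussion
below relies on:

<code python>
# Illustrative sketch: keeping a pre-computed intermediate result
# (count, linear sum, sum of squares) up-to-date under table changes.

class RunningStats:
    def __init__(self):
        self.n, self.ls, self.ss = 0, 0.0, 0.0  # count, linear sum, sum of squares

    def insert(self, value):    # a tuple was added to the table
        self.n += 1
        self.ls += value
        self.ss += value * value

    def remove(self, value):    # a tuple was deleted (altered = remove + insert)
        self.n -= 1
        self.ls -= value
        self.ss -= value * value

stats = RunningStats()
for v in [4.0, 7.0, 9.0]:
    stats.insert(v)
stats.remove(7.0)
print(stats.n, stats.ls, stats.ss)   # 2 13.0 97.0
</code>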
+ | |||
+ | The instances of the parameter-specific //kdd// process use the | ||
+ | pre-computed intermediate results either to process the final | ||
+ | result for higher performance or to bias the computation to | ||
+ | receive better results. Improving speed with pre-computed | ||
+ | intermediate results corresponds with the second research | ||
+ | question, while improving quality by biasing corresponds with the | ||
+ | third research question. | ||
+ | |||
The above-mentioned concept of anticipatory pre-computation is
ideally suited to analysing data warehouses containing few but
very large fact tables. As the combination of data warehouses and
//kdd// is common in companies having mass data, this concept offers
much potential for many companies.

Some approaches in related work arise from the same idea of
pre-processing but are limited to complete tasks, most of them
tasks of the pre-processing phase. This dissertation introduces an
approach that splits tasks into smaller chunks and is thereby able
to increase the number of pre-computable tasks. Additionally, this
approach extends the idea of pre-computation to the data mining
phase. Hence, it is more broadly applicable than existing work.

{{ wiki:somekddinstances.png | Improving the //kdd// process by using pre-computed intermediate results and auxiliary data instead of tuples }}

This dissertation will present a technique to increase the
performance of instance-specific pre-processing tasks, making them
so fast that analysts can work on analyses without being slowed
down by the data mining system.

In addition to the //pre-processing// phase, the //data mining//
phase also has high potential for improvement.

This dissertation presents techniques to enhance the performance
of data mining techniques and thereby intends to improve the //kdd//
process in terms of quality and speed. To be more specific, it is
useful to distinguish at this point between the data mining
techniques association rule mining, classification, clustering,
and regression.

Association rule mining requires finding frequently occurring sets
of items before finding association rules among them. Once the
frequent item sets that are valid in the set of all transactions
have been determined, it is possible to derive the frequent item
sets that are valid in subsets of all transactions. This has been
sufficiently shown in other work. This dissertation will discuss
those works for reasons of completeness but will contribute only a
minor new idea.
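
For illustration, a minimal sketch of the support-counting step on
which every frequent-item-set method builds (the transactions and
the threshold are invented for this example):

<code python>
from collections import Counter
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
min_support = 2   # frequent = occurs in at least 2 transactions

# Count every item set of size 1 or 2 in a single pass over the data.
counts = Counter()
for t in transactions:
    for size in (1, 2):
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1

frequent = {s: c for s, c in counts.items() if c >= min_support}
print(frequent)   # e.g. ('bread',): 3, ('bread', 'milk'): 2, ...
</code>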
+ | |||
+ | Classification algorithms use a set of data to train a function | ||
+ | that is called a classifier which is able to predict the class of | ||
+ | a tuple using the tuples' attribute values. The set of data that | ||
+ | is used for training typically contains only few | ||
+ | tuples---consequentially, scaling algorithms is not a problem. The | ||
+ | quality of the resulting classifier is the most concerning issue | ||
+ | of classification. This work will discuss quality-improving | ||
+ | classification approaches and contribute some new ideas how to | ||
+ | improve the quality of a classifier by using auxiliary data gained | ||
+ | as by-products of other //kdd// process instances. | ||
+ | |||
Clustering algorithms produce a set of clusters. All clustering
algorithms except density-based ones represent clusters by a small
set of describing attributes, which varies from algorithm to
algorithm. For instance, EM clustering algorithms generate a
probabilistic model consisting of $k$ random variables and store
the mean and deviation of each variable, while $k$-means and
related algorithms represent clusters by their Euclidean mean. It
is easy to compute the mean and standard deviation of a cluster
when the number of the cluster's tuples, the linear sum of its
tuples, and their sum of squares are known, as will be
demonstrated later.
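
A short sketch of that computation, shown one-dimensionally for
brevity (the triple of statistics mirrors the clustering features
of BIRCH [[diss_bibliography#zhang96birch|zhang96birch]]):

<code python>
import math

tuples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # one cluster's tuples

n  = len(tuples)                      # number of tuples
ls = sum(tuples)                      # linear sum of the tuples
ss = sum(x * x for x in tuples)       # sum of squares of the tuples

mean = ls / n
std  = math.sqrt(ss / n - mean ** 2)  # population standard deviation
print(mean, std)                      # 5.0 2.0
</code>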
+ | |||
+ | If the tuples within a subset of the data are very similar to each | ||
+ | other, they will be part of the same cluster in most times. Thus, | ||
+ | it is possible to build a subtotal of all of these tuples. | ||
+ | |||
Determining the mean by summing up all tuples or by summing up
subtotals produces the same result as long as no tuples are
included that belong to a different cluster. If there are such
tuples, the result may, but need not, be erroneous. The
experimental results will show that this error is tolerable.
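
The following sketch demonstrates this additivity on the toy data
from above: merging subtotals yields exactly the statistics of
summing all tuples directly, so the cluster mean is unchanged.

<code python>
# Two subtotals (count, linear sum) over disjoint subsets of one cluster.
subset_a = [2.0, 4.0, 4.0, 4.0]
subset_b = [5.0, 5.0, 7.0, 9.0]
sub_a = (len(subset_a), sum(subset_a))   # (4, 14.0)
sub_b = (len(subset_b), sum(subset_b))   # (4, 26.0)

# Merging subtotals is component-wise addition ...
n, ls = sub_a[0] + sub_b[0], sub_a[1] + sub_b[1]

# ... and yields the same mean as summing all tuples directly.
assert ls / n == sum(subset_a + subset_b) / len(subset_a + subset_b)
print(ls / n)   # 5.0
</code>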
+ | |||
+ | This dissertation will discuss approaches improving clustering by | ||
+ | aggregating tuples to sub-clusters in detail and will present new | ||
+ | ideas. Especially, the re-use of sub-clusters will be enhanced in | ||
+ | comparison to existing approaches like [[diss_bibliography#zhang96birch|zhang96birch]] or | ||
+ | [[diss_bibliography#bradley98scaling|bradley98scaling]] such that it is possible to find | ||
+ | sub-clusters in anticipation. | ||
+ | |||
Additionally, this dissertation will discuss other popular
approaches to improving the performance of analyses, such as
sampling.

A running example shall illustrate the concepts of this
dissertation. Therefore, the next section introduces this running
example, which will be used to demonstrate the more complex parts
of this dissertation's concepts.