====== Introduction ======

Knowledge discovery in databases (KDD) and data mining offer enterprises and other large organisations a set of methods to discover previously unknown relations among data---which are often called patterns, e.g. [[diss_bibliography#han|han]], [[diss_bibliography#Frawley|Frawley]], [[diss_bibliography#ester|ester]]. With these patterns, enterprises gain deeper insight into their context and can use them for better decision-making.

Frawley, Piatetsky-Shapiro and Matheus define //knowledge discovery// as follows:
> Knowledge discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [p. 3][[diss_bibliography#Frawley|Frawley]].
It is common practice to use the term //knowledge discovery in databases//, or //kdd// for short, to stress the most common use case. Yet, it is convenient to use the term //kdd// also for knowledge discovery in data that are not kept in databases but in texts or data streams.

The term //data mining// denotes the step of the //kdd// process in which sophisticated methods analyse the data for patterns of interest [[diss_bibliography#han|han]], [[diss_bibliography#ester|ester]]. The techniques used in this step are further called //data mining techniques//. Clustering, classification, and association rule mining are the most commonly used data mining techniques.

The quality of the patterns found by data mining algorithms and the time needed to find them are the measures of interest in //kdd// projects. The quality of a found pattern determines its usability---for instance, to guide a decision. Additionally, when mining data in large databases, time is an issue, too.

This dissertation focusses on the problems of pattern quality and time in //kdd// projects. Recently, many approaches have been introduced to improve specific //kdd// analyses. However, these approaches try to optimise a single analysis.

In contrast to existing work, this dissertation presents a novel way to improve //kdd// projects in terms of shorter project time and higher quality of results by optimising a set of analyses instead of optimising each analysis individually.

The remainder of this section is therefore organised as follows: To be able to show open issues of past work, it first introduces the //kdd// process as is. It then shows where the open issues of the //kdd// process lie. Finally, it gives the basic idea of how to improve the //kdd// process in terms of runtime and quality of results.

{{ wiki:kdd.png | KDD process}}

===== A First Introduction to the kdd Process =====

This subsection gives a short introduction to the //kdd// process. It provides the information necessary to show open issues, which is the topic of the next subsection. A more detailed discussion of the //kdd// process follows in section \ref{kdd process discussion}.

Figure \ref{kdd process} shows the //kdd// process with all its phases. The //kdd// process consists of the phases pre-processing, data mining, and evaluating the resulting patterns. Using evaluated patterns for decision-making is not part of the //kdd// process, but Figure \ref{kdd process} contains this item to denote that the //kdd// process does not exist per se but is embedded in a broader economic context.

The first step in a KDD process is selecting data that might contain patterns one is interested in. Some data mining algorithms require as input data that are consistent with a specific data schema. The //pre-processing// phase consists of all activities that select or transform data according to the needs of a data mining analysis.

In the //data mining// phase a data mining expert---further referenced as analyst---analyses the data using sophisticated data mining algorithms. The outcome of these algorithms is a set of patterns that contain potentially new knowledge.

{{ wiki:iteratingnecessary.png | Issue A: Insufficient results inflict additional iterations }}

The resulting patterns must be evaluated before they can be used. This happens during the //evaluating// phase, in which an analyst tests whether the patterns are new, valid, and interesting or not.

If no patterns fulfil all the above conditions, the KDD process or a part of it has to be repeated. The reason why a KDD process has to be iterated can be a bad choice of parameter values or a bad selection of data.

===== Open Issues of the kdd Process =====

Although there are approaches that improve the runtime of instances of the //kdd// process, time is still an open issue of the //kdd// process---quality of results is the other one. Hence, we will discuss open issues concerning the runtime of instances of the //kdd// process first before discussing issues related to quality.

Algorithms with long runtimes and many iterations during an instance of the //kdd// process increase the runtime of this instance. If the data set which an algorithm is analysing is very large, then even the runtime of well-scaling algorithms is high. The consequence of a long runtime of an algorithm is that the analyst who started it is slowed down in his/her work by the system until the algorithm terminates.

Iterating steps in an instance of the //kdd// process is necessary because the best values of the used algorithm's parameters and the best set of data for a specific analysis are initially unknown. Assuming the opposite would mean that an analyst performs an analysis whose result he/she already knows. Yet, that would conflict with the requirement in the definition of //kdd// that patterns must be new.

{{ wiki:multipleinstances.png | Issue B: Multiple isolated //kdd// instances use the same data }}

Consequently, it is common practice to run data mining algorithms several times---each time with a different combination of parameter values and data. More precisely, the analyst chooses the data and values that will most likely bring the best results according to his/her current knowledge. Even an unsuccessful attempt commonly gives the analyst better insight into the relations among the analysed data, so future analyses are more likely to be successful. Figure \ref{iteratingnecessary} depicts an instance of the //kdd// process that requires two iterations.

When an analyst re-does an analysis with different settings, these analyses are not identical, but several tasks are very similar. More specifically, both analyses share several intermediate results, as later chapters of this dissertation will show. Obviously, computing the same intermediate results more than once means wasting time. Previous work introduces techniques which are able to re-use results of a previous analysis when parameters change. However, if an analyst uses a different subset of the data, he/she must compute all results of this new analysis from scratch.

Sharing the computation of intermediate results is not limited to analyses of the same instance of the //kdd// process. Moreover, a set of analyses can also share several tasks or parts of them. Especially if these analyses use the same data set and also share the same type of analysis, several intermediate results can be identical. Figure \ref{multipleinstances} shows two instances of the //kdd// process that analyse the same data set. Although the goals of both instances are different, both instances need to cluster the same table. Yet, parameter values and used attributes differ. Later chapters will show that it is possible to pre-compute common intermediate results for analyses when only attributes and parameters vary.

Re-doing the same steps of analyses would be tolerable if the time needed to compute an analysis were negligible---which is not the case, as illustrated in the next paragraphs.

//Pre-processing// and most //data mining methods// require accessing all data---which is very time-consuming in large databases exceeding 1 GB in size. Some algorithms are inapplicable to large data sets without supplying proper database index structures. For instance, a single run of the DBSCAN [[diss_bibliography#DBSCAN|DBSCAN]] clustering algorithm scanning ten thousand tuples (less than 1 MB in size) took approximately eleven hours on a SUN Sparc workstation [pp. 149-151][[diss_bibliography#stoettinger|stoettinger]]. The reason for this long runtime is that DBSCAN requires multiple scans of the selected data.

Although some data mining algorithms like the clustering algorithms BIRCH [[diss_bibliography#zhang96birch|zhang96birch]] and CLARANS [[diss_bibliography#ng94efficient|ng94efficient]] are very efficient and scale better in the number of tuples, the minimum time needed for //pre-processing// and //data mining// is still too long for interactively analysing the data. During the tests for this dissertation, the Oracle 9i database required a few minutes to answer a simple //SELECT COUNT(*) FROM table// query when the size of the table exceeded half a gigabyte---each time the query required accessing values of a specific attribute, response time increased significantly, to up to ten minutes.

The databases of mail order companies or large warehouses exceed one gigabyte by far. For instance, think of a company operating supermarkets. Even if there are only $300$ small supermarkets, in each of which only $300$ customers buy $5$ items per day, and the database requires only $512$ bytes to store a sold item, the data set that is stored daily requires roughly $0.23$ GB of disc space.
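
For illustration, a quick back-of-the-envelope calculation reproduces this estimate; the figures below are the example's assumptions, not measured values:

<code python>
# Daily data volume in the hypothetical supermarket example.
supermarkets = 300           # small supermarkets
customers_per_market = 300   # customers per supermarket and day
items_per_customer = 5       # items bought per customer
bytes_per_item = 512         # storage needed per sold item

daily_bytes = supermarkets * customers_per_market * items_per_customer * bytes_per_item
print(daily_bytes / 10**9)   # roughly 0.23 GB per day, i.e. over 80 GB per year
</code>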

A long runtime of an algorithm or a pre-processing task means that an analyst spends more time waiting for the computer to finish its job than analysing the resulting patterns.

Zhang et alii state that the most time-consuming part of data-mining algorithms is scanning the data, while operations in main memory are negligible [[diss_bibliography#zhang96birch|zhang96birch]]. This statement is consistent with the test results of this dissertation. As a consequence, approaches to improving data mining algorithms must focus on minimising the number of required database scans.

Avoiding scanning the database at least once is impossible. Even taking a sample requires at least one scan of the data---otherwise, the sample taken this way would not be representative.

Some data mining algorithms already require only the minimum number of one database scan. BIRCH is a hierarchical clustering algorithm requiring only one scan, Bradley et alii have proposed an EM (expectation maximisation) clustering algorithm in [[diss_bibliography#Bradley|Bradley]], and Kanungo et alii have presented a k-means-like algorithm in [[diss_bibliography#kanungo-efficient|kanungo-efficient]], to name only a few examples of clustering approaches that require only one scan of the data.

If existing data mining algorithms are nearly as fast as the theoretical limit but even this minimal time is too long for interactive working, the source of major improvements in runtime must lie elsewhere. The next subsection gives the basic idea of the approach of this dissertation, which is capable of reducing the needed runtime below the current limit. The research questions of this dissertation are a direct consequence of this approach.

In many approaches which we will discuss later, one can decrease runtime at the expense of the quality of the results of the //kdd// process. Hence, quality is always an additional issue when discussing improvements of runtime. Yet, improving quality and runtime do not exclude each other. Thus, the effect on quality is an additional research question of any approach changing the //kdd// process. The benefit of improving the quality of results depends on the specific instance of the //kdd// process. For instance, if a company can improve the classification of customers who are about to cancel their contracts with the company, then each increase in quality, i.e. the accuracy of the classification, is beneficial.

===== Basic Idea of Approach and Research Questions =====

The //kdd// process as shown in figure \ref{kdd process} describes the sequence of steps of a single analysis. In particular, the //kdd// process does not take into account that analysts do several analyses---not necessarily all of them at the same time, but many analyses within a year. The discussion of the last subsection indicated that further improvement of analyses is necessary to enable working continuously on data analyses. However, improving a single analysis is insufficient because the theoretical limit of one scan is not fast enough. Changing this isolated point of view can be a source of major improvement, as will be shown later.

//The core idea of this dissertation is to re-think the knowledge discovery in databases process. Instances of the KDD process should no longer be regarded as single independent process instances but should be regarded as dependent instances that can profit from each other.//

//Further, if the specific setting of a future KDD process instance is unknown but several tasks of the instance are known, these known tasks should be done in anticipation.//

If two or more //kdd// process instances share the same time-consuming tasks or parts of them, then it is reasonable to do those tasks only in the first occurring instance---it would be wasted time to re-do such a task in the second and future instances.

//The underlying basic assumption is that there are process instances that have time-consuming tasks in common.//

The research questions this dissertation intends to answer are consequences of the above-mentioned basic idea and basic assumption. To be specific, the research questions of this dissertation are as follows:

  - How do //kdd// analyses differ from each other?
  - Are there intermediate results shared by multiple analyses?
  - Can (intermediate) results of an analysis affect the quality of the result of another analysis?

If the variability of analyses is known, it is possible to identify items that are common to all analyses of a specific type of analysis. Additionally, one can identify the range of some parameters. If there are only a few values of a specific parameter the analyst can choose from, one can pre-compute the results for all of these potential values. When the analyst finally chooses the actual value for an analysis, the result has already been computed by the system.
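
The following sketch illustrates this idea; the analysis function and the parameter used here are purely illustrative placeholders, not the dissertation's actual implementation:

<code python>
# Pre-compute results for every candidate value of a parameter whose
# range is small and known in advance (e.g. a number of clusters k).
def run_analysis(data, k):
    # Placeholder for an arbitrary, expensive analysis depending on k.
    return sum(data) / (len(data) * k)

candidate_values = [2, 3, 4, 5]   # the few values the analyst may choose from
data = [1.0, 2.0, 3.0, 4.0]

# Done in anticipation, before the analyst decides:
precomputed = {k: run_analysis(data, k) for k in candidate_values}

# When the analyst finally chooses a value, the result is a mere lookup:
chosen_k = 3
result = precomputed[chosen_k]
</code>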

If there are intermediate results of an analysis that other analyses might need, then these other analyses can profit from these intermediate results. Hence, the answer to the second research question is needed to improve the //kdd// process: If common intermediate results are identified, one can pre-compute them in anticipation to save time in succeeding analyses.

The third research question focusses on the aspect of quality of analyses. Taking several analyses into account can also affect the quality of the results of each analysis. For instance, knowledge gained in one analysis can help to get better results in another analysis by biasing the succeeding analysis.

A specific type of analysis needs to perform tasks of a specific type. Some of these tasks depend on parameters supplied by the analyst, others do not. It is possible to pre-compute the results of tasks which are independent of the analyst's input.

The data set which the system processes in a specific task is the only factor in which the analyst-independent tasks of analyses of the same type differ. As the number of tables in a database is limited, it is possible to pre-compute intermediate results of commonly used types of analyses for all large tables.

However, there still remain tasks depending on parameters which cannot be pre-computed. Thus, it is necessary to split analyses into parts that can be processed in anticipation and tasks that cannot, as illustrated in figure \ref{somekddinstances}. A single instance of a supporting process pre-computes intermediate results for two instances of an altered //kdd// process that consists only of parameter-specific tasks. Later sections will introduce the supported process as the parameter-specific //kdd// process and the supporting process as the parameter-independent //kdd// process.
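
A minimal sketch of this split is shown below; the per-attribute summaries used here are merely an assumed example of a parameter-independent intermediate result:

<code python>
# Splitting an analysis into a parameter-independent part (pre-computable
# per table) and a parameter-specific part that uses the analyst's input.

def parameter_independent_part(table):
    """Depends only on the data: summarise each attribute of the table.
    'table' maps attribute names to lists of numeric values."""
    summaries = {}
    for attribute, values in table.items():
        summaries[attribute] = {
            "count": len(values),
            "linear_sum": sum(values),
            "sum_of_squares": sum(v * v for v in values),
        }
    return summaries

def parameter_specific_part(summaries, chosen_attribute):
    """Depends on the analyst's choice: derive the mean of the chosen
    attribute from the pre-computed summary instead of re-scanning tuples."""
    s = summaries[chosen_attribute]
    return s["linear_sum"] / s["count"]

# The supporting (parameter-independent) process runs in anticipation ...
summaries = parameter_independent_part({"price": [1.0, 2.0, 3.0], "amount": [2.0, 2.0, 5.0]})
# ... and each parameter-specific instance only works on the summaries.
mean_price = parameter_specific_part(summaries, "price")
</code>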

The instance of the parameter-independent //kdd// process pre-computes intermediate results and keeps them up-to-date when a table which has been used for pre-computation has changed, i.e. when there are new, removed, or altered tuples in the table.

The instances of the parameter-specific //kdd// process use the pre-computed intermediate results either to compute the final result with higher performance or to bias the computation in order to receive better results. Improving speed with pre-computed intermediate results corresponds to the second research question, while improving quality by biasing corresponds to the third research question.

The above-mentioned concept of anticipatory pre-computation ideally supports analysing data warehouses containing few but very large fact tables. As the combination of data warehouses and //kdd// is common in companies having mass data, this concept offers much potential for many companies.

Some approaches in related work already arise from the same idea of pre-computation but are limited to complete tasks---most of them tasks of the pre-processing phase. This dissertation introduces an approach that splits tasks into smaller chunks and is able to increase the number of pre-computable tasks that way. Additionally, this approach extends the idea of pre-computation to the data mining phase. Hence, it is more broadly applicable than existing work.

{{ wiki:somekddinstances.png | Improving the //kdd// process by using pre-computed intermediate results and auxiliary data instead of tuples }}

This dissertation will present a technique to increase the performance of instance-specific pre-processing tasks to make them so fast that analysts can work on analyses without being slowed down by the data mining system.

In addition to the //pre-processing// phase, the //data mining// phase also has high potential for improvement.

This dissertation presents techniques to enhance the performance of data mining techniques and intends to improve the //kdd// process in terms of quality and speed that way. To be more specific, it is useful to distinguish at this point between the data mining techniques association rule mining, classification, clustering, and regression.

Association rule mining requires finding frequently occurring sets of items before finding association rules among them. Once the frequent item sets that are valid in the set of all transactions have been determined, it is possible to determine the frequent item sets that are valid in subsets of all transactions. This has been sufficiently shown in other work. This work will discuss those works for reasons of completeness but will contribute only a minor new idea.

Classification algorithms use a set of data to train a function called a classifier, which is able to predict the class of a tuple using the tuple's attribute values. The set of data that is used for training typically contains only few tuples---consequently, scaling is not a problem for these algorithms. The quality of the resulting classifier is the most pressing issue of classification. This work will discuss quality-improving classification approaches and contribute some new ideas on how to improve the quality of a classifier by using auxiliary data gained as by-products of other //kdd// process instances.

Clustering algorithms produce a set of clusters. All clustering algorithms but density-based ones represent clusters with a small set of describing attributes which varies from algorithm to algorithm. For instance, EM clustering algorithms generate a probabilistic model consisting of $k$ random variables and store the mean and deviation of each variable, while $k$-means and related algorithms represent clusters by their Euclidean mean. It is easy to compute the mean and standard deviation of a cluster when the number of the cluster's tuples, the linear sum of its tuples, and their sum of squares are known, which will be demonstrated later.
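
As a brief illustration of this relation (a sketch, not the later derivation of this dissertation), the mean and standard deviation of a one-dimensional cluster follow directly from these three summary values:

<code python>
import math

def mean_and_std(n, linear_sum, sum_of_squares):
    """Mean and standard deviation of a cluster, derived only from its
    tuple count n, its linear sum LS, and its sum of squares SS."""
    mean = linear_sum / n
    variance = sum_of_squares / n - mean ** 2
    return mean, math.sqrt(max(variance, 0.0))

tuples = [1.0, 2.0, 3.0, 4.0]
n, ls, ss = len(tuples), sum(tuples), sum(x * x for x in tuples)
print(mean_and_std(n, ls, ss))   # (2.5, ~1.118), identical to a direct computation
</code>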

If the tuples within a subset of the data are very similar to each other, they will mostly be part of the same cluster. Thus, it is possible to build a subtotal of all of these tuples.

Determining the mean by summing up all tuples or by summing up subtotals produces the same result as long as no tuples are included that are part of a different cluster. If there are such tuples, the result might, but need not, be erroneous. The experimental results will show that this error is tolerable.
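
The following small example illustrates why summing subtotals is equivalent, assuming each subtotal stores the count and the linear sum of its tuples:

<code python>
# Each subtotal stores (count, linear sum) of a group of similar tuples.
subtotal_a = (3, 6.0)    # e.g. the tuples 1.0, 2.0, 3.0
subtotal_b = (2, 9.0)    # e.g. the tuples 4.0, 5.0

# Merging subtotals is component-wise addition ...
n, ls = subtotal_a[0] + subtotal_b[0], subtotal_a[1] + subtotal_b[1]

# ... and yields the same mean as summing up all five tuples directly.
print(ls / n)                                # 3.0
print(sum([1.0, 2.0, 3.0, 4.0, 5.0]) / 5)    # 3.0
</code>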

This dissertation will discuss in detail approaches that improve clustering by aggregating tuples into sub-clusters and will present new ideas. In particular, the re-use of sub-clusters will be enhanced in comparison to existing approaches like [[diss_bibliography#zhang96birch|zhang96birch]] or [[diss_bibliography#bradley98scaling|bradley98scaling]] such that it is possible to find sub-clusters in anticipation.

Additionally, this dissertation will also discuss other popular approaches to improving the performance of analyses, such as sampling.

A running example shall illustrate the concepts of this dissertation. Therefore, the next section introduces a running example which is used to demonstrate the more complex parts of this dissertation's concepts.
  