For the purpose of illustration, this dissertation contains a running example, which is used to demonstrate the presented concepts wherever it is applicable. Concepts to which this example does not apply, such as the necessary changes for applying the ideas of this dissertation to data streams, will be illustrated with their own examples.
Assume there is a large publishing company maintaining a data warehouse for both strategic and operational decision making. The data warehouse contains a vast set of data mainly concerning the sales of the company's products—books, magazines, journals and on-line articles. Data about transactions with printing companies form another large part of the data warehouse.
The company has optimised its data warehouse for OLAP but not yet for data mining. Later sections of this dissertation will present ways in which the company can improve the quality and speed of its \kdd analyses.
\begin{figure} \centering \includegraphics[width=12cm]{weltkarte.eps} \caption{network of a globally operating company} \label{globalmap} \end{figure}
The publishing company operates globally. It therefore has six major locations that administrate the operations of smaller units. Each of these six administration centres operates a data warehouse that contains the data of the units reporting to it.
The location of the company's headquarters coincides with one of the centres of administration.
Figure \ref{globalmap} shows the distribution of the six major locations, symbolised by rectangles with crosses, and their affiliated units, represented by circles, around the globe. Arrows and lines denote the volume of data exchanged between locations.
\begin{figure} \centering \includegraphics[width=12cm]{sales_golfarelli.eps} \caption{schema of cube sales in Golfarelli notation} \label{sales_golfarelli} \end{figure}
The global data warehouse and all local data warehouses share the same schema. Each data warehouse contains several cubes, but the `sales' cube contains by far the highest number of tuples—its size exceeds the sum of the sizes of all other cubes. Table \ref{schema} shows the relational schema of the tables that store the data of the `sales' cube.
The schema of the `sales' cube is a starflake schema, as shown in figure \ref{sales_golfarelli}. Figure \ref{sales_golfarelli} uses the notation for data warehouse schemas introduced by Golfarelli in \cite{golfarelli}. The `sales' cube has two measures, namely the quantity of items sold and the price per sold unit. The cube has the four dimensions product, customer, time, and unit.
Products are assigned to product groups, which are themselves hierarchically ordered in groups and sub-groups; the number of levels in this hierarchy is potentially unlimited. Hence, dimension product has a hierarchy with a recursion.
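The recursive hierarchy of the product dimension can be traversed with a simple loop that follows the {\tt ispartof} attribute until it reaches a root group. The sketch below illustrates this with the sample entries of table {\tt productgroup}; the helper function {\tt ancestors} is a hypothetical illustration, not part of the schema.

```python
# Sample entries of table `productgroup' from the running example:
# groupid -> (groupname, ispartof); NULL is represented as None.
productgroup = {
    1: ("book", None),
    2: ("belletristic", 1),
    3: ("scientificmonography", 1),
    4: ("article", None),
    5: ("webarticle", 4),
    6: ("online journal", None),
}

def ancestors(groupid):
    """Return the group names from the given group up to its root group."""
    chain = []
    while groupid is not None:
        name, parent = productgroup[groupid]
        chain.append(name)
        groupid = parent  # follow the recursive `ispartof' edge
    return chain

print(ancestors(5))  # ['webarticle', 'article']
```

Because the recursion depth is unbounded, a fixed number of joins cannot resolve the hierarchy in general; an iterative traversal such as this one (or a recursive query) is required.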
Dimension customer is structured by the location of the customer and, alternatively, by the type of customer. There are the two types individual and corporate customer, where the group of corporate customers includes all kinds of organisations, such as companies, but also incorporated societies or non-governmental organisations. The company delivers books and other printed material by mail only. On-line purchases require a prior registration, in which a valid address is required. Thus, there are no anonymous sales.
Dimension time is organised conventionally, except that years are omitted for simplicity.
A roll-up operation in any dimension causes the quantity measure to be summed up, while an average price per unit is determined.
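The aggregation behaviour of the two measures can be sketched as follows; the sales tuples and the grouping by {\tt prid} are illustrative assumptions, not data from the running example.

```python
# Illustrative sales tuples (prid, units, priceperunit); the values
# are made up for this sketch.
sales = [
    (1, 10, 20.0),
    (1, 30, 24.0),
    (2,  5,  8.0),
]

def roll_up(tuples):
    """Merge tuples per prid: sum the units, average the price per unit."""
    grouped = {}
    for prid, units, price in tuples:
        grouped.setdefault(prid, []).append((units, price))
    return {
        prid: (sum(u for u, _ in cell),            # quantity: SUM
               sum(p for _, p in cell) / len(cell))  # price: AVG
        for prid, cell in grouped.items()
    }

print(roll_up(sales))  # {1: (40, 22.0), 2: (5, 8.0)}
```

A unit-weighted average (total revenue divided by total units) would be an alternative aggregation for the price measure; the plain average shown here is one possible reading of the text.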
The data warehouse schema is realised in a set of relations of a relational database. Table \ref{schema} contains the schema of these relations.
\begin{table}
\centering \begin{tabular}{|lp{95mm}|} \hline
Relations & {\tt sales(units, priceperunit, timestamp, \it{prid}, \it{shopid}, \it{cid})} \\
& {\tt product(\underline{prid}, name)} \\
& {\tt productgroup(\underline{groupid}, groupname, \it{ispartof})} \\
& {\tt isproductofgroup(\underline{\it{prid}}, \underline{\it{groupid}})} \\
& {\tt unit(\underline{shopid}, name, citycode, statecode)} \\
& {\tt customer(\underline{cid}, name, customertype, citycode, statecode)} \\
& \\ \hline
Inclusion & {\tt sales(prid) $\subseteq$ product(prid)} \\
Dependencies & {\tt isproductofgroup(prid) $\subseteq$ product(prid)} \\
& {\tt isproductofgroup(groupid) $\subseteq$ productgroup(groupid)} \\
& {\tt productgroup(ispartof) $\subseteq$ productgroup(groupid)} \\
& {\tt sales(cid) $\subseteq$ customer(cid)} \\ \hline \hline
Legend & relations are shown in the form \\
& {\tt <relation name>(<attribute list>)} \\
& attributes of the primary key are underlined, attributes of foreign keys are written in italics \\ \hline
\end{tabular} \caption{relational schema of the running example}\label{schema} \medskip
\end{table}
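The relational schema can be realised, for instance, in SQLite, with foreign keys enforcing the inclusion dependencies; the column types below are assumptions, as the text does not specify them.

```python
import sqlite3

# In-memory sketch of the running example's relational schema.
# Foreign keys realise the inclusion dependencies; types are assumed.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE product (prid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE productgroup (
    groupid INTEGER PRIMARY KEY, groupname TEXT,
    ispartof INTEGER REFERENCES productgroup(groupid));
CREATE TABLE isproductofgroup (
    prid INTEGER REFERENCES product(prid),
    groupid INTEGER REFERENCES productgroup(groupid),
    PRIMARY KEY (prid, groupid));
CREATE TABLE unit (shopid INTEGER PRIMARY KEY, name TEXT,
    citycode TEXT, statecode TEXT);
CREATE TABLE customer (cid INTEGER PRIMARY KEY, name TEXT,
    customertype TEXT, citycode TEXT, statecode TEXT);
CREATE TABLE sales (units INTEGER, priceperunit REAL, timestamp TEXT,
    prid INTEGER REFERENCES product(prid),
    shopid INTEGER REFERENCES unit(shopid),
    cid INTEGER REFERENCES customer(cid));
""")
```

Note that the self-referencing foreign key on {\tt productgroup.ispartof} implements the recursive part of the product hierarchy.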
\begin{table}
\centering \begin{tabular}{|c|c|} \hline \emph{prid}&\emph{name}\\ \hline 1&Theory and Implementation of Anticipatory Data Mining\\ \hline \end{tabular} \caption{Sample Entries of Table \emph{product}} \label{sampleProduct}
\end{table}
\begin{table}
\centering \begin{tabular}{|c|c|c|} \hline
\emph{groupid}&\emph{groupname}&\emph{ispartof}\\ \hline
1&book&NULL\\ \hline
2&belletristic&1\\ \hline
3&scientificmonography&1\\ \hline
4&webarticle&4\\ \hline
5&webarticle&4\\ \hline
6&online journal&NULL\\ \hline
\end{tabular} \caption{Sample Entries of Table \emph{productgroup}} \label{sampleProductgroup}
\end{table}
Each day, the smaller units send the data about that day's sales to the centre of administration they report to. There, a large ETL\index{ETL} (Extract-Transform-Load) process inserts these data into the local data warehouse.
The amount of data arriving daily at each centre of administration exceeds 1\,GB.
Besides OLAP queries, the company regularly performs data mining analyses on the data kept in the data warehouse. These analyses are of two kinds: many are re-runs of former analyses that validate their results, while others are completely new.
Both analysts residing at the headquarters and local analysts regularly use the contents of the data warehouses for data mining analyses. The analyses of local analysts might require all data stored in all data warehouses or specific parts of it. In particular, an analysis is not limited to the content of a local data warehouse.
To be more specific, suppose the following analysis is about to be performed on the current date: estimating the current demand for publications. This demand varies temporally, and many factors such as the publication type or the reputation of the author influence the demand for a specific publication.
The publishing company uses a multivariate linear regression method to estimate the current demand for its publications. The resulting regression model predicts the number of purchases for each publication; it is a model of the life-cycle of a given type of publication. As the preferences of customers might change, the company regularly re-constructs the linear regression model to cope with changes in customer purchase behaviour.
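Such a regression can be fitted by ordinary least squares, for example with NumPy. The design matrix below is a minimal sketch: the two features (age of the publication and a reputation score for the author) and all numbers are hypothetical stand-ins for the demand factors mentioned in the text, not the company's actual model.

```python
import numpy as np

# Hypothetical design matrix: intercept column plus two assumed
# demand factors (age of the publication in weeks, author reputation).
X = np.array([
    [1.0, 1.0, 0.8],
    [1.0, 2.0, 0.8],
    [1.0, 3.0, 0.5],
    [1.0, 4.0, 0.5],
])
y = np.array([120.0, 100.0, 60.0, 40.0])  # observed weekly purchases

# Ordinary least squares: beta minimises ||X @ beta - y||.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ beta  # estimated number of purchases per publication
```

Re-running this fit on fresh sales data corresponds to the periodic re-construction of the model described above: the coefficients in {\tt beta} shift as customer purchase behaviour changes.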