

\section{Running Example: Sales in a Data Warehouse}

For the purpose of illustration, this dissertation contains a running example. It is used to demonstrate the presented concepts wherever applicable. Concepts to which this example cannot sensibly be applied, such as the changes necessary for applying the ideas of this dissertation to data streams, are illustrated with their own examples.

Assume there is a large publishing company that maintains a data warehouse for both strategic and operational decision making. The data warehouse contains a vast set of data, mainly concerning the sales of the company's products: books, magazines, journals and on-line articles. Data about transactions with printing companies forms another large part of the data warehouse.

The company has optimised its data warehouse for OLAP but not yet for data mining. Later sections of this dissertation present ways in which the company can improve its \kdd analyses in terms of quality and speed.

\begin{figure} \centering \includegraphics[width=12cm]{weltkarte.eps} \caption{network of a globally operating company} \label{globalmap} \end{figure}

The publishing company operates globally. There are six major locations that administrate the operations of smaller units. Each of these six administration centres operates a data warehouse that contains the data of the units reporting to that centre of administration.

The location of the company's headquarters coincides with one of the centres of administration.

Figure \ref{globalmap} shows the distribution around the globe of the six major locations, symbolised by rectangles with crosses, and their affiliated units, represented by circles. Arrows and lines denote the amount of data exchanged between locations.

\begin{figure} \centering \includegraphics[width=12cm]{sales_golfarelli.eps} \caption{schema of cube sales in Golfarelli notation} \label{sales_golfarelli} \end{figure}

The global data warehouse and all local data warehouses share the same schema. Each data warehouse contains several cubes, but the `sales' cube is by far the largest: its size exceeds the sum of the sizes of all other cubes. Table \ref{schema} shows the relational schema of the tables that store the data of the `sales' cube.

The schema of the `sales' cube is a starflake schema, as shown in figure \ref{sales_golfarelli}, which uses the notation for data warehouse schemas introduced by Golfarelli in \cite{golfarelli}. The `sales' cube has two measures, namely the quantity of items sold and the price per sold unit. The cube has the four dimensions product, customer, time, and unit.

Products are assigned to product groups, which are themselves hierarchically ordered in groups and sub-groups; the number of levels in this hierarchy is potentially unlimited. Hence, dimension product has a recursive hierarchy.
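The recursive hierarchy of product groups can be resolved with a simple upward walk along the parent reference (attribute {\tt ispartof} in table \ref{schema}). The following sketch, with sample groups mirroring the sample entries of the {\tt productgroup} table, illustrates this; it is not part of the dissertation itself:

```python
productgroup = {
    # groupid: (groupname, ispartof) -- sample entries of relation productgroup
    1: ("book", None),
    2: ("belletristic", 1),
    3: ("scientificmonography", 1),
    4: ("article", None),
    5: ("webarticle", 4),
    6: ("online journal", None),
}

def ancestors(groupid):
    """Walk the recursive ispartof reference from a group up to its root."""
    chain = []
    while groupid is not None:
        name, parent = productgroup[groupid]
        chain.append(name)
        groupid = parent
    return chain

print(ancestors(5))  # ['webarticle', 'article']
```

Because the number of hierarchy levels is unlimited, the traversal loops until it reaches a group whose {\tt ispartof} value is NULL rather than assuming a fixed depth.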

Dimension customer is structured by the location of the customer and, alternatively, by the type of customer. There are two types, individual and corporate customers, where the group of corporate customers includes all kinds of organisations: companies, but also incorporated societies and non-governmental organisations. The company delivers books and other printed material only by mail. On-line purchases require prior registration, which includes providing a valid address. Thus, there are no anonymous sales.

Dimension time is organised conventionally, except that years are omitted for simplicity.

A roll-up operation in any dimension sums up the quantity measure while the average price per unit is determined.
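This roll-up semantics can be sketched as follows. The sample tuples are invented, and the price is averaged as an unweighted mean over the aggregated tuples; a units-weighted mean would be an equally plausible reading of the text:

```python
from collections import defaultdict

# invented fact tuples: (product group, units sold, price per unit)
facts = [
    ("book", 10, 20.0),
    ("book", 30, 24.0),
    ("article", 5, 3.0),
]

# accumulate per group: total units, sum of prices, number of tuples
acc = defaultdict(lambda: [0, 0.0, 0])
for group, units, price in facts:
    acc[group][0] += units
    acc[group][1] += price
    acc[group][2] += 1

# roll-up: sum the quantity measure, average the price per unit
rolled = {g: (units, price_sum / n) for g, (units, price_sum, n) in acc.items()}
print(rolled)  # {'book': (40, 22.0), 'article': (5, 3.0)}
```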

The data warehouse schema is realised in a set of relations of a relational database. Table \ref{schema} contains the schema of these relations.

\begin{table}
\centering
\begin{tabular}{|lp{95mm}|}
  \hline
  Relations & {\tt sales(units, priceperunit, timestamp, {\it prid}, {\it shopid}, {\it cid})} \\
  & {\tt product(\underline{prid}, name)} \\
  & {\tt productgroup(\underline{groupid}, groupname, {\it ispartof})} \\
  & {\tt isproductofgroup(\underline{\it prid}, \underline{\it groupid})} \\
  & {\tt unit(\underline{shopid}, name, citycode, statecode)} \\
  & {\tt customer(\underline{cid}, name, customertype, citycode, statecode)} \\
  & \\
  \hline
  Inclusion & {\tt sales(prid) $\subseteq$ product(prid)} \\
  Dependencies & {\tt isproductofgroup(prid) $\subseteq$ product(prid)} \\
  & {\tt isproductofgroup(groupid) $\subseteq$ productgroup(groupid)} \\
  & {\tt productgroup(ispartof) $\subseteq$ productgroup(groupid)} \\
  & {\tt sales(cid) $\subseteq$ customer(cid)} \\
  \hline
  \hline
  Legend & relations are shown in the form \\
  & {\tt <relation name>(<attribute list>)} \\
  & attributes of the primary key are underlined,
    attributes of foreign keys are written in italics \\
  \hline
\end{tabular}
\caption{relational schema of the running example}\label{schema}
\end{table}
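The schema of table \ref{schema} can be expressed as executable DDL. The following sketch uses SQLite and models the inclusion dependencies as foreign keys; the concrete column types are assumptions, since the dissertation specifies only attribute names, keys, and inclusion dependencies:

```python
import sqlite3

# assumed column types; names and keys follow the relational schema above
ddl = """
CREATE TABLE product (prid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE productgroup (
    groupid INTEGER PRIMARY KEY, groupname TEXT,
    ispartof INTEGER REFERENCES productgroup(groupid));
CREATE TABLE isproductofgroup (
    prid INTEGER REFERENCES product(prid),
    groupid INTEGER REFERENCES productgroup(groupid),
    PRIMARY KEY (prid, groupid));
CREATE TABLE unit (shopid INTEGER PRIMARY KEY, name TEXT,
    citycode TEXT, statecode TEXT);
CREATE TABLE customer (cid INTEGER PRIMARY KEY, name TEXT,
    customertype TEXT, citycode TEXT, statecode TEXT);
CREATE TABLE sales (units INTEGER, priceperunit REAL, timestamp TEXT,
    prid INTEGER REFERENCES product(prid),
    shopid INTEGER REFERENCES unit(shopid),
    cid INTEGER REFERENCES customer(cid));
"""

con = sqlite3.connect(":memory:")
con.executescript(ddl)
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

The self-referencing foreign key on {\tt productgroup.ispartof} realises the recursive product hierarchy.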

\begin{table}

\centering
\begin{tabular}{|c|c|}
\hline
\emph{prid}&\emph{name}\\
\hline
1&Theory and Implementation of Anticipatory Data Mining\\
\hline
\end{tabular}
\caption{Sample Entries of Table \emph{product}}
\label{sampleProduct}

\end{table}

\begin{table}
\centering
\begin{tabular}{|c|c|c|}
\hline
\emph{groupid}&\emph{groupname}&\emph{ispartof}\\
\hline
1&book&NULL\\
2&belletristic&1\\
3&scientificmonography&1\\
4&article&NULL\\
5&webarticle&4\\
6&online journal&NULL\\
\hline
\end{tabular}
\caption{Sample Entries of Table \emph{productgroup}}
\end{table}

Each day, the smaller units send the data about that day's sales to the centre of administration they report to. There, a large ETL\index{ETL} (Extract-Transform-Load) process inserts these data into the local data warehouse.

The amount of data arriving daily at each centre of administration exceeds 1~GB.

Besides OLAP queries, the company regularly performs data mining analyses using the data kept in the data warehouse. These analyses are of two kinds: many are re-runs of former analyses to validate their results, while others are completely new.

Analysts residing in the headquarters as well as local analysts regularly use the contents of the data warehouses for data mining analyses. The analyses of local analysts might require all data stored in all data warehouses or specific parts of it. In particular, an analysis is not limited to the content of the local data warehouse.

To be more specific, suppose the following analyses are to be performed on the current date:

\begin{enumerate}
\item The number of sale transactions of a specific publication varies over time. Many factors, such as the publication type or the reputation of the author, influence the demand for a specific publication. The publishing company uses multi-variate linear regression to estimate the current demand for its publications. The resulting regression model predicts the number of purchases of each publication; it is a model of the life-cycle of a given type of publication. As the preferences of customers might change, the company regularly re-constructs the linear regression model to cope with changes in customer purchase behaviour.
\item Customers can buy on-line articles individually or as part of an annual subscription to an on-line journal. An annual subscription is implemented as a separate product. The publishing company wants to find a model that predicts which subscribers will not extend their subscriptions. The company pays much attention to those customers because the regular turnover from subscriptions offers high margins. Additionally, retaining unsatisfied customers is known to be generally cheaper than acquiring new customers.
\item The marketing division compiles catalogues and hand-outs for specific customer segments that have not been identified yet. An analyst of the company shall find a suitable segmentation of customers. Clustering is his preferred method for completing this task.
\end{enumerate}
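As an illustration of the first analysis, the following sketch fits a multi-variate linear regression with ordinary least squares. The chosen features (weeks since release and an author reputation score) and all numbers are hypothetical; the dissertation does not specify which factors the company actually uses:

```python
import numpy as np

# hypothetical training data
# feature columns: weeks since release, author reputation score
X = np.array([[1, 8.0], [2, 8.0], [3, 8.0],
              [1, 3.0], [2, 3.0], [3, 3.0]], dtype=float)
# target: purchases per day
y = np.array([50.0, 44.0, 38.0, 20.0, 14.0, 8.0])

# prepend an intercept column and solve the least-squares problem
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(weeks, reputation):
    """Predict purchases per day from the fitted regression model."""
    return coef[0] + coef[1] * weeks + coef[2] * reputation

print(round(predict(4, 8.0), 1))  # 32.0
```

Re-running this fit on fresh sales data corresponds to the regular re-construction of the regression model mentioned above.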
diss_runningexample.1473194607.txt.gz · Last modified: 2016/09/06 22:43 by mgoller