Data Warehouse

The Enhanced Star-Schema: Part 1 - Dimension Basics

Here is an ER-diagram of the basic model for dimensions in a relational data warehouse core:

[Figure: ER diagram "Dimension Basics"]

Before you start to wonder at the cryptic names of the tables and columns, I’d like to stress the importance of naming conventions.

Here, all the tables are assigned to a schema called “cmdm”, where “mdm” means “multi dimensional model” and “c” is the prefix for “Cubica”, the name of the data warehouse framework I have developed.

The ER-diagram contains three tables, which is the minimum number of tables required for a dimension. Please note that this model is very basic and does not contain any tables or columns for versioning or time-variance. As we go along and cover further aspects of data warehousing, the model will be enhanced step by step.

So why are there at least three tables instead of just one dimension table like in a “normal” star-schema? Here’s why:

  1. there can be raw data on any hierarchy level
  2. one raw data member can be a part of any number of consolidation paths
  3. there can be non-leaf members with data and three different ways to aggregate them
  4. the hierarchy can be irregular and include diamond shapes
  5. the data model should be as generic as possible

Let’s examine the tables in detail:



DWH Modeling Rule #2: Build a generic, data-driven core

Almost any Data Warehouse architecture can be divided into five sections:

  1. Raw data from external data sources
  2. Staging area
  3. Consolidated and enriched raw data
  4. Multidimensional data (dimension tables, fact tables, etc.)
  5. Data marts

When we look at the data flow in the warehouse process, we find the most client-specific or application-specific requirements at the transitions 1->2, 2->3, and 4->5. In most cases, the transition 2->3 is particularly specific, complex, and elaborate.

On the other hand, the transition 3->4 is a good candidate for a generic approach. The complete process can be defined by a set of meta data and a set of procedures that, based on the meta data, dynamically build and execute the required SQL statements. The transition 3->4 can still be very complex and elaborate, but it is far less application-specific than the other transitions.
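To make the idea of a meta-data-driven 3->4 transition more concrete, here is a minimal sketch in Python with SQLite. It is not the framework's actual implementation: the table names, the column names, and the layout of the meta data (`AGGREGATION_META`, `build_aggregation_sql`) are all assumptions made up for illustration.

```python
import sqlite3

# Hypothetical meta data: one entry per aggregate table to build in the 3->4 step.
AGGREGATION_META = [
    {
        "target": "fact_sales_by_region",   # assumed name of the fact table
        "source": "sales_enriched",         # assumed name of the enriched raw data
        "group_by": ["region"],
        "measures": {"revenue": "SUM", "order_id": "COUNT"},
    },
]


def build_aggregation_sql(meta):
    """Build a CREATE TABLE ... AS SELECT statement from one meta data entry.

    Identifiers are interpolated directly, so the meta data must come from a
    trusted repository, never from user input.
    """
    keys = ", ".join(meta["group_by"])
    measures = ", ".join(
        f"{func}({col}) AS {func.lower()}_{col}"
        for col, func in meta["measures"].items()
    )
    return (
        f"CREATE TABLE {meta['target']} AS "
        f"SELECT {keys}, {measures} "
        f"FROM {meta['source']} GROUP BY {keys}"
    )


def run_demo():
    """Load a few sample rows, run all generated statements, return the result."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE sales_enriched (order_id INTEGER, region TEXT, revenue REAL)"
    )
    con.executemany(
        "INSERT INTO sales_enriched VALUES (?, ?, ?)",
        [(1, "north", 100.0), (2, "north", 50.0), (3, "south", 70.0)],
    )
    for meta in AGGREGATION_META:
        con.execute(build_aggregation_sql(meta))
    return con.execute(
        "SELECT region, sum_revenue, count_order_id "
        "FROM fact_sales_by_region ORDER BY region"
    ).fetchall()
```

Adding another aggregate then means adding another meta data entry rather than writing another SQL script, which is the point of keeping this transition generic.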

Sections 3 and 4 form what I call the Warehouse Core. This is the place where you can normally find some sort of star- or snowflake-schemas. The warehouse process of the core is primarily made up of the following steps:

  • Aggregations
  • Management of historical changes
  • Management of structural changes

One might object that these operations can be very application-specific. That is absolutely right, but, compared to the other sections, they can easily be customized and configured by slightly enhancing the star-schema and the meta data.

This post just sets the stage for the upcoming detailed description of the enhanced star-schema and the associated meta data, which are both part of the Data Warehouse Framework I have developed over the past years. With the help of this framework I have successfully implemented a number of Data Warehouse solutions, many of them containing irregular and ragged hierarchies and non-additive measures.


DWH Managing Rule #1: The single most important prerequisite for success is a complete set of meta data

In my opinion, one of the very first things a DWH project manager should strive for is the definition of a complete and consistent set of meta data.

If this is done, requirements engineering, specification, documentation, and project management are nothing more than collecting meta data and assessing the completeness of the meta data set. Through priorities and processing sequences it is possible to completely define a procedural model for the DWH project.

When I speak of meta data, I mean not only the more or less technical data that describes dimensions and facts, but also data that describes the warehouse process (ETL) and, most importantly, “political” data like target groups, stakeholders, team members, and other important people.

To get the most out of the meta data and to ease the collection and administration of the meta data set, I frequently use a relational database. That allows me to generate a GUI for entering data and a number of different reports. Plus, this database can be used as a central repository for each member of the project team. For the project manager it can be of great help if it contains typical project information like target date, status, estimated effort, remaining effort, responsibilities, etc. for the relevant entities.
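As a rough illustration of such a repository, here is a minimal sketch using Python's built-in `sqlite3` module. The table `meta_entity`, its columns, and the two report functions are hypothetical; they merely mirror the project attributes listed above and are not the meta data model described in this blog.

```python
import sqlite3


def create_repository():
    """Create an in-memory repository with one hypothetical meta data table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        """
        CREATE TABLE meta_entity (
            name                  TEXT PRIMARY KEY,
            kind                  TEXT NOT NULL,
            responsible           TEXT,
            target_date           TEXT,
            status                TEXT,
            estimated_effort_days REAL,
            remaining_effort_days REAL
        )
        """
    )
    return con


def incomplete_entities(con):
    """Report entities that still lack a responsible person or a target date."""
    rows = con.execute(
        "SELECT name FROM meta_entity "
        "WHERE responsible IS NULL OR target_date IS NULL "
        "ORDER BY name"
    )
    return [row[0] for row in rows]


def remaining_effort(con):
    """Report the total remaining effort over all entities, in days."""
    return con.execute(
        "SELECT COALESCE(SUM(remaining_effort_days), 0) FROM meta_entity"
    ).fetchone()[0]


def load_sample(con):
    """Insert made-up sample entities for demonstration purposes only."""
    con.executemany(
        "INSERT INTO meta_entity VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            ("customer", "dimension", "Anna", "2024-03-01", "in progress", 8.0, 5.0),
            ("product_hierarchy", "dimension", None, None, "open", 4.0, 2.5),
        ],
    )
```

A completeness report like `incomplete_entities` is one simple way to spot gaps in the meta data set early; a real repository would of course cover many more entity types and attributes.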

A big advantage of a complete meta data set is that certain pitfalls and showstoppers can be identified at a very early stage of the project.

Here is an example from one of my projects: I’m always especially paranoid about historical variability like slowly moving dimensions (which often turn out to be rapidly changing dimensions). Hence there are a number of attributes in my meta data model that describe SMDs. In the (meta data based) process of specification and requirements engineering I asked the client about the historical variability of the product hierarchy. The people I asked were very surprised; apparently, nobody in the company had ever thought about it. The question was: What happens with historical data when the product hierarchy changes? Does the change have to be applied to the historical data (especially aggregated data)? Through the procedural model implied by the meta data we were able to address the implications of the historical variability at a very early stage in the project, and we could force the client’s management to make a reliable decision. Very often, issues of this kind surface only when the BI system is already in production, jeopardizing the success of the entire project.

In one of my next posts, I’m going to describe the meta data model in more detail by identifying the different sections of the model and describing the attributes that make up the different meta data entities.


DWH Modeling Rule #1: Most aggregations have to be done in the Data Warehouse directly

If you have read my first post about “my real world experience” with out-of-the-box BI systems like Cognos, you might have gotten the impression that I was bashing Cognos. This is definitely not the case, since Cognos and other BI systems are great software products that offer a wide range of functionality. The point I was trying to make is that even the leading product in the BI market was, and still is, not able to cope with certain data structures. It’s not that these data structures are especially weird or uncommon; they have occurred in every data warehouse project I have been involved in so far.

[Figure: Hierarchy with diamond shape]

The picture on the left depicts a typical hierarchy, which can often be found as the structure of a sales force.

C1-C4 are clients, who are assigned to the sales reps R1 and R2. The sales reps are both managed by regional manager M1.

An important measure for sales reps, managers, sales units, and, of course, the company as a whole is the number of associated clients.

How would a typical BI tool be set up to calculate the number of clients based on the hierarchy on the left?

  • The client level with members C1-C4 is defined as the raw data level. Each member has a client-id as a primary key.
  • The measures for the upper levels, the sales reps and the managers, are aggregated by the system. These aggregations are either pre-calculated or take place on-the-fly.
    The aggregation rule is “count(distinct client-id)”.
  • First, the measures for the sales reps are calculated with the following results: R1: 2, R2: 3
  • Based on the results for the reps, the measures for the managers are calculated. The result for M1 would be 2+3=5, which is obviously wrong: there are only four clients (C1-C4), so a client who is assigned to both reps has been counted twice.
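The double counting can be reproduced in a few lines of Python. The concrete assignment of clients to reps below is an assumption, since the diagram is not reproduced here; it is chosen only so that it matches the counts R1: 2 and R2: 3 with C2 as the shared client. The function names are likewise made up for this sketch.

```python
# Assumed assignment, consistent with the counts R1: 2 and R2: 3;
# C2 is the client shared by both reps (the "diamond").
REP_CLIENTS = {
    "R1": {"C1", "C2"},
    "R2": {"C2", "C3", "C4"},
}
MANAGER_REPS = {"M1": ["R1", "R2"]}


def clients_per_rep():
    """Distinct client count per sales rep (correct at this level)."""
    return {rep: len(clients) for rep, clients in REP_CLIENTS.items()}


def naive_manager_count(manager):
    """Sum the already aggregated rep counts, as a level-by-level rollup does."""
    counts = clients_per_rep()
    return sum(counts[rep] for rep in MANAGER_REPS[manager])


def distinct_manager_count(manager):
    """Count distinct clients by going back to the raw client ids."""
    clients = set()
    for rep in MANAGER_REPS[manager]:
        clients |= REP_CLIENTS[rep]
    return len(clients)
```

With this data, `naive_manager_count("M1")` yields 5, while `distinct_manager_count("M1")` yields 4. A distinct count is not additive across levels, so the manager's value has to be derived from the raw client ids rather than from the reps' results.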
