DWH Modeling Rule #1: Most aggregations have to be done in the Data Warehouse directly

If you have read my first post about “my real world experience” with out-of-the-box BI systems like Cognos, you might have gotten the impression, that I was bashing Cognos. This is definitely not the case, since Cognos and other BI systems are great software products, which offer a wide range of functionality. The point I was trying to make is, that even the leading product in the BI market was and still is not able to cope with certain data structures. It’s not that these data structures are especially weird or uncommon, no, they have occurred in each data warehouse project I have been involved so far.

Hierarchy with diamond shape The picture on the left depicts a typical hierarchy, which can often be found as the structure of a sales force.

C1-C4 are clients, who are assigned to the sales reps R1 and R2. The sales reps are both managed by regional manager M1.

A quite important measure for sales reps, managers, sales unit, and, of course, the company as a whole is the number of associated clients.

How would a typical BI tool be set up to calculate the number of clients based on the hierarchy on the left?

  • The client level with members C1-C4 is defined as the raw data level. Each member has a client-id as a primary key.
  • The measures for upper levels for the sales reps and the managers are aggregated by the system. These aggregations are either pre-calculated or take place on-the-fly.
    The aggregation rule is “count(distinct client-id)”.
  • First, the measures for the sales reps are calculated with the following results: R1: 2, R2: 3
  • Based on the results for the reps, the measures for the managers are calculated. The result for M1 would be 2+3=5, which is obviously wrong!

OK, most Data Warehouse and BI people know that problem, which is often referred to as “diamond shapes in hierarchies”. This term is derived from the typical diamond shape, which arises when two knots have a child knot and a parent knot in common. You can (vaguely ;-) ) see the diamond formed by C2, R1, R2, and M1 in the picture above.

The diamond shape problem is sometimes solved with weighting factors for each parent, which share a common child. To achieve correct results on the higher levels, the sum of the weighting factors has to be 1. In the hierarchy above, R1 und R2 would both have a factor of 0.5.
The weighted measure would be:

  • M1: 1+1+0.5+0.5=4, which is absolutely right!
  • R1: 1+0.5=1.5; R2:1+1+0.5=2.5, which is, at least in my opinion, not only wrong, but also very cruel! Human beings are cut in half!

Another fine example for diamond shapes: In my first post I mentioned the “legacy” analytical software we wanted to replace with a new solution. One of the eye-opening flaws of that software occured when we were aggregating sales data for Bayer. As you might know, Bayer is the producer of Aspirine. Now, Aspirine can be used for different indications, e.g. as a pain reliever (analgetic) or as a blood thinner.A typical example for a hierarchy knot with more than one parent. I don’t have to tell you, that all the calculated measures on the upper levels were wrong.

The quintessence of all this is, that, in order to achieve correct aggregations, you have to calculate them yourself in the Data Warehouse before you render the results in a BI application or report. You cannot expect out-of-the-box systems to be able to deliver the correct results.

How to do it right and how to model your relational Data Warehouse by using an enhanced star schema will be described and discussed in the upcoming posts.

I will also address the problem of having raw data on different hierarchy levels.