Here is an ER-diagram of the basic model for dimensions in a relational data warehouse core:
Before you start to wonder at the cryptic names of the tables and columns, I’d like to stress the importance of naming conventions.
Here, all the tables are assigned to a schema called “cmdm”, where “mdm” means “multi dimensional model” and “c” is the prefix for “Cubica”, the name of the data warehouse framework I have developed.
The ER-diagram contains three tables, which will be the mininum number of required tables for a dimension. Please note, that this model is very basic and does not contain any tables or columns for versioning or time-variance. As we go along and cover further aspects of data warehousing the model will successively be enhanced.
So why are there at least three tables instead of just one dimension table like in a “normal” star-schema? Here’s why:
- there can be raw data on any hierarchy level
- one raw data member can be a part of any number of consolidation paths
- there can be non-leaf members with data and three different ways to aggregate them
the hierarchy can be irregular and include diamond shapes
- the data model should be as generic as possible
Let’s examine the tables in detail:
- Root Dimension Table (Prefix “dim_r_”)
A word about terminology first: I consider the lower, fine-grained levels of a hierarchy as root levels. In the Microsoft world they are called leaf levels (see also the term “non-leaf members with data”). The name “Root Dimension Table” comes from the fact, that most raw data can be found at the lower (root or leaf) levels of a hierarchy. If you like you can also say “Raw Dimension Tables” (as long as you keep the “R” as the first letter ).The Root Dimension Table contains all hierarchy members, which potentially can be assigned to raw data. E.g. in a sales force, clients can be visited by regular sales reps as well as by regional managers, who are placed on a higher level in the regional hierarchy. That means, that sales raw data can occur on different hierarchy levels. Other members of the hierarchy might never be assigned to raw data and their data is exclusively calculated be aggregating the raw data of their descendants. Those members are not included in the Root Dimension Table.As you can see, the Root Dimension Table is absolutely flat. There is no hierarchy or some other kind of relationship between the data rows. It is just a plain collection of all members, which can potentially be assigned to raw data. Here is a description of the three fields of a Root Dimension Table:
Root Surrogate Key: The primary key of the Root Dimension Table. It is an artificial (surrogate) key and wherever the name “r_skey” can be found in the data model it always has the same meaning: the primary surrogate key of a Root Dimension Table.
This is an integer field to determine the type of entity (or hierarchy node) of the hierarchy member. Normally the hierarchy level can be used, but there can be very weird hierarchies that do not have classical levels. Thus, the term “node type” is used for the sake of generity.
Entity Key: Each node type refers to a specific entity type. In a geographical dimension there can be stores, districts, cities, regions, states, etc., which are all entities of different types. To provide a connection to the “outside” world, each dimension member must have a business key. Because of the different node types the number and the data types of the key components are not the same for each member. To handle that, the components of the business keys are concatenated (using an appropriate seperator between the components) into one varchar field (of arbitrary size). Alternatively it is also possible to have a collection of all business keys of all entity (or node) types.
- Dimension Table (prefix: “dim_”)
This table is very similar to a dimension table in a normal star-schema. It contains all the members of the dimension including the members with raw data.The Dimension table contains the following attributes:
Surrogate Key: The primary key of the Dimension Table. It is an artificial (surrogate) key and wherever the name “skey” can be found in the data model it always has the same meaning: the primary surrogate key of a Dimension Table.
Same as above
The skey of the parent of the current member. That means, that the Dimension Table is generally designed as a parent-child hierarchy with a self-referencing foreign key.
By now, you already know what an r_skey is: the primary surrogate key of a Root Dimension Table. It’s a foreign key referencing the Root Dimension Table. This key is not null for each row in the Dimension Table that refers to a member that can potentially have raw data. For all the other members r_skey is set to null.
The business keys of all relevant entitiy (or node) types. Because the Dimension Table can be accessed from outside the data warehouse core, it is appropriate to a have seperate fields for each relevant entity business key. The field types must not necessarily be varchar (as in the above example).
You might ask yourself why we have two member keys here: skey and r_skey. This is because a Dimension Table does not contain dimension members!
“WHAT?” - you might ask. In the common data warehouse literature a dimension table is always defined as a collection of dimension members. This is with no doubt true in many cases, but not in a generic multi-dimensional data model.
I will discuss that in my next post, and I’m also going to describe the third table in the ER-diagram above. Until then, stay tuned.