The ETL process – Part 2 – Guidelines & Standards – cont.

Guidelines lead the way

Guidelines lead the way

In one of the previous posts I have introduced the first two topics of the top four guidelines and standards for the ETL process.

Especially point one regarding big-scale ETL systems may have provoked some disagreement, disconcernment, or confusion. Everything I said is only relevant when there is the possibility and willingness of having a (at least temporary) team of dedicated developers. This might not be feasible for smaller projects. However, if there are any issues a commercial ETL tool cannot cope with, it would often be better to develop a new and independent system than tweaking the tool with weird workarounds. I have also heared of clients that have abandoned requirements, just because the ETL system could not fulfill them.

Here are topics three and four:

  • Don’t throw away anything that ever entered the data warehouse.

    Psychologists know the syndrome of compulsive hoarding. While this is a serious disease, it might come in handy in data warehousing. A general goal should be to be able to restore any state of the warehouse that ever existed. Quite frequently, the client establishes new structures in dimensions or hierarchies. There might even be major paradigm shifts in business models. In this context, clients often decide to forget about the structures and data of the past, just to find out later that some comparison or A/B analysis of present and past could be quite enlightening. It could also be interesting to aggreate and analyze historic data according to actual structures.
    I have also undergone scenarios where the client recognized that the whole restructuring or paradigm shift didn’t lead to the desired results. So they eventually decided to do a rollback to the old structures.
    Another scenario is the worst case of a technical failure that leads to a defunct database. To make things worse, an up-to-date backup is also no longer available.
    In all those cases, a well designed and fostered archive of all the interface data can be a lifesaver.
    There will be a detailed discussion about archiving in one of the upcoming posts.

  • Exclusively use human-readable files as input for interfaces.

    First of all: No direct-reads from data sources. This is not only a direct consequence of the previous topic, but there are also some more things to consider about directly reading from the data source:

    • The data source could be in an inconsistent state.
    • Querying the data source could seriously slow down other attached online systems.
    • The state of the data in the data source might be volatile.
    • The dependency on the structure of the data source requires a higher grade of coordination and management.
    • Auditing of the data transferred is more difficult.
    • Special connectors to specific data sources are necessary.

If it is only possible, try to convince the people concerned of delivering all interface data in human-readable flat files. Possible formats are CSV, TXT, or more nosql-oriented formats like JSON and XML. External data providers often have their own proprietary formats, but most of them are at least human-readable.
Why human-readable? Because this makes life much easier when auditing or lookups are necessary. We can open an interface file in an editor and look directly at or for the interesting data. Plus, in case of a data quality problem that needs immediate mending, we could quickly patch the data in the interface file. As mentioned in point two of guidelines & standards, please apply this only as an absolute exception and only to avoid showstoppers, since the responsibility for data quality should be in the realm of the respective data source.
Last but not least, it’s quite easy to compress, archive, backup and restore flat files. Conversely, archiving and backing up direct-read data requires extra steps to export it from the warehouse.

This concludes the top four list of guidelines & standards for the ETL process. The next posts are all about metadata for the ETL process.

Please feel free to register any time. 🙂