The ETL process – Part 2 – Guidelines & Standards

Guidelines lead the way

Guidelines lead the way

After achieving some results from the analysis of the ETL process (Part 1 – The Analysis), it quickly becomes evident that it is not sensible to finally come up with an “egg-laying woolly-milk-sow” (as we say in German: “eierlegende Wollmilchsau” :-)).
However, if the analysis has been complete and painstaking, the requirements for the ETL process should be clear after that. Despite the fact that there should be room for extensions and new functionalities (especially in an agile environment), clear-cut red lines should be drawn.

Here are the first two of my top four list of guidelines:

  • Beware of too much proprietary or closed-source software.

    My best friends, the big-scale ETL tools and frameworks, fall into that category.
    Some people, especially staff-members of big consulting companies, would strictly contradict. Usually, their chief argument is that the use of tools or frameworks decreases the degree of dependence on software developers. That is nothing but the truth!
    But what’s finally also true is that the use of tools or frameworks increases the degree of dependence on staff-members of big consulting companies or software vendors, and on special developers, who are proficient in those tools and frameworks.
    Without a doubt you won’t find people like these around every corner. To top it all off they are usually significantly more expensive than developers, who are not that highly specialized in the development or customizing for a very specific product. And finally there is the cost for the products themselves.
    On the other hand, chances of finding some really brilliant developers with excellent skills in the fields mentioned below are much better. Even after most of them have left the team after the system has gone into production, it should not be too difficult to find new developers when it’s necessary. If there hasn’t been a knowledge drain and the system is well documented, the integration of new team members should be quite smooth.

    Skills needed or appropriate for development of the entire ETL process (strongly IMHO):

    • SQL, T-SQL, PL/SQL for Stored Procedures
    • C# for SQL Server CLR Stored Procedures
    • Java for Oracle Java Stored Procedures
    • bash or power shell for shell script programming
    • Shell tools like grep, awk, sed, etc.
    • ftp script programming
    • cron job or scheduler programming

  • Data sources are responsible for their own data quality.

    In literally all of my data warehouse projects and without exception we have discovered data quality issues in the source systems. Fortunately, this usually happens at a quite early stage of the project. The start of a data warehouse project can eventually even take credit for the discovery of serious flaws in source and/or legacy systems.
    The data warehouse process must not be the sweeper that eliminates the slip-ups from earlier stages of the data flow. We all know that nothing lives longer than a quick workaround. There is no doubt about the necessity to provide these workarounds to avoid showstoppers; especially at an early stage of the production phase. But, by all means try to get rid of them as quickly as possible. They turn out to be a heavy burden in the long run.
    Sometimes the need to promote data quality on an enterprise-wide level might occur. It could be necessary to escalate those issues through the corporate or department hierarchy; sometimes even up to the CIO and/or the CTO.

The subject of next post will be the remaining points of the top four guidelines list.

Please feel free to register any time. 🙂