Copyright © 2018 DataScience.US All Rights Reserved.
ETL is not quite dead
I’d like to set some context around Paul Balas’ latest article, https://www.linkedin.com/pulse/etl-long-tooth-paul-balas?trk=hb_ntf_MEGAPHONE_ARTICLE_POST. He and I have had several conversations in the past using the phrase “ETL is dead” while neither of us believes it is nor should be truly dead. We have been using that as a catch-phrase to help people realize that in many cases they are over using ETL (or ELT). You should be very selective when making another copy of the data. Make sure this copy has some value supporting your strategic business objectives.
Note: in this discussion, when I use the term data warehouse, I’m also including the more modern idea of a data lake where data is landed in files and folders in whatever format you choose.
First point – ETL should be minimized for BI projects. There are several valid reasons for using ETL which I will enumerate later, but for BI projects I like to get as close to the concept of “land the data once and consume it directly”. If you can achieve this, then ETL is not needed. In this case you find yourself following the Self-Service BI principles where the users get direct access to the original source systems of the data. For most projects this works well if you have good front end tools, users who are comfortable with massaging and merging data from different sources and if you have a good meta data repository. A good data stewardship program is also invaluable to make this work. Successful Self Service BI programs usually have a good Master Data Management solution in place.
Second point – Land the data once and consume it directly. The first attempt at self service BI usually is done by giving users direct access to source systems. Many times this works well but there are times when it is too hard or too slow. Performance is often the first challenge, giving you a good reason to move the data from various source systems into a common data warehouse. A prime example of this is when you have source systems spread around the world and querying them from a centralized point is very slow. This is a valid use case for ETL. And it gives you an opportunity to do some data cleansing if needed. It also gives you the opportunity to transform the data to make it easier and faster to query; i.e. star schemas or simply creating aggregate tables. Security is often another challenge because not all source systems have the same security features. Some will have row level security, some will have cell level security and some will have neither. Moving the data into a common data warehouse makes it easier to control and manage the security.
Another topic under this point is when using any of cloud vendors to host your BI solutions. If your compute and query tools are in the cloud it makes sense to move your data to the cloud. Not only is it cheaper, but it is too slow to constantly retrieve the same data over and over again via a WAN or internet link, even if you have fast connections.
[bctt tweet=”The data fabric is the next middleware.” –Todd Papaioannou” username=”TeamBisilo”]
Third point – valid uses for ETL. As mentioned above, I will enumerate use cases where ETL is fine to use:
- M&A – when companies merge or are acquired the first thing they do is get control of the money in a centralized location. So the first systems to consolidate are the AR, AP, Payroll, Finance, etc. We’ll let the business managers figure out which company’s is the survivor, but once this decision is made, ETL is a primary tool to move and transform the data.
- GEO-Distributed systems – if your source systems are located around the globe you usually end up creating a centralized data warehouse strictly for performance reasons. This is too bad. If we had better bandwidth, would this even be necessary?
- Performance impact on source systems – enough said.
- Best way to support business requirements – security, compliance, performance and other factors weigh in the decision to create a centralized data warehouse.
In conclusion, if you are doing ETL, please think about why you are doing it and whether it is really necessary.