Progressive organizations recognize data as a strategic asset and rely upon it for critical decision making. Business intelligence spending has been steadily increasing and is forecast to exceed $16 billion worldwide in the next year. Major investment and effort are spent on extracting, transforming and loading (ETL) data from source systems into data warehouses and data marts. Incorrect decisions based on poor data can be disastrous, so how can we ensure that we are using the proper data to begin with? To do so, we must address the following data quality considerations:
- Is the data accurate?
- Is the data timely?
- Is the data complete?
- Is the data consistent?
- Is the data relevant to the decision?
- Is the data fit for use?
This challenge has been complicated further by exponential data growth. Some studies show that up to 90% of the world’s data has been created in the past two years alone. This trend is accelerating, making data quality assurance even more challenging. It is compounded by increasing complexity in the data landscape. Most corporations have a variety of software applications and database silos scattered across heterogeneous platforms, with a spider web of point-to-point interfaces moving data back and forth. This includes ERP solutions and externally hosted SaaS solutions. The result is usually a high level of data redundancy and inconsistency.
It becomes extremely complex to trace data to its origin, or source of truth. In response, many of the ETL routines mentioned above perform some type of data cleansing to make the data usable at the point of consumption. However, this can be quite risky if we do not truly understand the data and the changes that have occurred on its journey through the organization’s systems. Unless the lineage is mapped and understood, ETL will simply be transforming the wrong information, or an outdated version of it that does not fit the intended purpose.
This is analogous to the problems that occurred in manufacturing production lines prior to the early 1980s: complex products were built from thousands of parts and sub-assemblies, then inspected for quality conformance after they rolled off the assembly line. Inspection does not improve the product. It simply identifies the defects that need to be addressed. Defective items were scrapped or reworked at significant cost, but the origin of the defects often went undetected, so the problems persisted. The quality movement of the 1980s addressed this on many fronts. A few are stated here because they are directly relevant to our discussion of data:
- Validation of the inputs to every discrete process, preventing usage of defective components
- Traceability of components and subassemblies within finished goods to point of origin
- Empowerment of frontline workers to address problems, even if it meant halting the entire production line
- Continuous improvement of all processes
Unlike defects in physical products, defects in data can be extremely difficult to detect and identify. However, we can apply the approach and lessons learned by the quality discipline. To succeed, a collaborative culture must be established with a commitment to data quality, from senior executives through to the frontline workers who create and modify data on a daily basis. Procedures must be put in place to ensure that data is accurately captured and recorded as it is created and modified through each business process. From a data architecture perspective, we must map the lineage of the data and the processes that act upon it, including the transformations. We must also be able to provide the context of why the data is being modified. It is also crucial that workers are empowered to correct any data that is wrong as part of their daily job function (with proper audit trails). If data originates outside the organization, it must be validated prior to use. Data governance and stewardship must be established so that responsibilities are understood and agreed to by all parties.
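As an illustration only, the two controls described above (validating externally sourced data before use, and recording an audit trail with the reason for every correction) can be sketched in a few lines of Python. The `CustomerStore` class, its field names and the two-letter country rule are all hypothetical choices made for this example; a real implementation would sit on top of a database and the organization's own validation rules.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditEntry:
    """One recorded change: who changed which field, from what to what, and why."""
    record_id: str
    field_name: str
    old_value: object
    new_value: object
    changed_by: str
    reason: str
    changed_at: datetime


@dataclass
class CustomerStore:
    """Hypothetical in-memory store used only for this sketch."""
    records: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def validate(self, record: dict) -> list:
        """Return a list of problems; an empty list means the record passes."""
        problems = []
        for required in ("id", "name", "country"):
            if not record.get(required):
                problems.append(f"missing {required}")
        country = record.get("country")
        # Assumed rule for illustration: countries are stored as 2-letter codes.
        if country and len(country) != 2:
            problems.append("country must be a 2-letter code")
        return problems

    def accept(self, record: dict) -> bool:
        """Admit an incoming record only if it passes validation."""
        if self.validate(record):
            return False
        self.records[record["id"]] = dict(record)
        return True

    def correct(self, record_id, field_name, new_value, changed_by, reason):
        """Apply a correction and log the old value, the author and the reason."""
        record = self.records[record_id]
        self.audit_log.append(AuditEntry(
            record_id, field_name, record.get(field_name), new_value,
            changed_by, reason, datetime.now(timezone.utc)))
        record[field_name] = new_value


store = CustomerStore()
store.accept({"id": "C1", "name": "Acme", "country": "USA"})  # rejected by validation
store.accept({"id": "C2", "name": "Acme", "country": "US"})   # accepted
store.correct("C2", "name", "Acme Corp", "jsmith", "legal name per contract")
```

The point of the design is that a correction cannot be made without also capturing who made it and why, which preserves the context of the change for anyone tracing the data later.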
The primary challenge is to first understand and map the current data landscape, while retaining the flexibility to adapt and update that map as the business and the underlying landscape continue to change over time. The most effective means of doing so is through data models that describe the data (and metadata), together with process models that describe the business processes that create, consume and change the data. This allows data to be understood in context, and is the basis for identifying redundancy and inconsistency. All manifestations of each critical business data object must be identified and cataloged. Typically, the most critical business data objects are also master data, as they are used in most transactions (for example: customer, product, location and employee). Without context, it is extremely difficult to ensure that the proper data is being used for reporting and analytical purposes, and hence for decision making. To complete the understanding, the models must be supported by integrated business glossaries and terms that are owned by the business stakeholders responsible for each area. It is imperative that the business team has tools that allow its members to collaborate among themselves, as well as with the technical staff who are assisting them.
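To make the idea of cataloging manifestations concrete, here is a minimal Python sketch. It assumes a hypothetical catalog in which each entry records one manifestation of a business data object in one system, with attribute definitions reduced to a (type, length) pair. The system names and attribute definitions are invented for illustration; in practice this information would come from the data models and the shared repository.

```python
from collections import defaultdict

# Hypothetical catalog: one entry per manifestation of a business data object.
CATALOG = [
    {"object": "customer", "system": "crm",
     "attributes": {"name": ("string", 100), "country": ("string", 2)}},
    {"object": "customer", "system": "erp",
     "attributes": {"name": ("string", 40), "country": ("string", 2)}},
    {"object": "customer", "system": "billing",
     "attributes": {"name": ("string", 40)}},
]


def manifestations(catalog):
    """Map each business data object to the sorted list of systems that hold it."""
    systems = defaultdict(set)
    for entry in catalog:
        systems[entry["object"]].add(entry["system"])
    return {obj: sorted(found) for obj, found in systems.items()}


def find_inconsistencies(catalog):
    """Return attributes whose definition differs between systems."""
    seen = defaultdict(dict)
    for entry in catalog:
        for attr, definition in entry["attributes"].items():
            seen[(entry["object"], attr)][entry["system"]] = definition
    return {key: by_system for key, by_system in seen.items()
            if len(set(by_system.values())) > 1}


redundant = manifestations(CATALOG)           # customer lives in three systems
conflicts = find_inconsistencies(CATALOG)     # customer.name differs in length
```

Here `redundant` shows where the same object is duplicated, and `conflicts` flags `customer.name`, which is 100 characters in one system and 40 in the others, as a candidate for truncation problems when data moves between them.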
Business analysts, data analysts, modelers and architects build the required conceptual and logical models in continual consultation with business stakeholders. Physical data models are used to describe the underlying system implementations, including data lineage. When combined with data flows, true enterprise data lineage can be understood and documented. This is the point at which we have established true traceability, which is vital for comprehension.
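One way to picture traceability is as a graph walk: if every data flow is recorded as a source, a target and the transformation applied, then any reported value can be traced back to its origins. The following Python sketch assumes a hypothetical list of flows with invented element names; it is not tied to any particular modeling tool or repository.

```python
from collections import defaultdict

# Hypothetical documented data flows: (source, target, transformation).
FLOWS = [
    ("crm.customer.name", "staging.customer.name", "trim whitespace"),
    ("staging.customer.name", "warehouse.dim_customer.name", "uppercase"),
    ("erp.customer.name", "staging.customer.name", "truncate to 40 chars"),
    ("warehouse.dim_customer.name", "mart.sales_report.customer", "copy"),
]


def trace_to_origin(target, flows):
    """Return every upstream path to `target`, ordered from origin to target.

    Each path is a list of (source, transformation, destination) hops.
    """
    upstream = defaultdict(list)
    for source, dest, transformation in flows:
        upstream[dest].append((source, transformation))

    paths = []

    def walk(node, hops, visited):
        parents = upstream.get(node, [])
        if not parents:
            # No inbound flow recorded: this node is an origin.
            paths.append(list(reversed(hops)))
            return
        for source, transformation in parents:
            if source in visited:
                continue  # guard against cycles in the documented flows
            walk(source, hops + [(source, transformation, node)],
                 visited | {source})

    walk(target, [], {target})
    return paths


for path in trace_to_origin("mart.sales_report.customer", FLOWS):
    print(" -> ".join(f"{src} [{how}]" for src, how, _ in path))
```

For the sample flows, the report column traces back to two origins, one in the CRM and one in the ERP, each with a different sequence of transformations. That is exactly the kind of ambiguity a lineage map is meant to expose.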
To achieve true collaboration, all of the models, metadata and glossaries must be integrated through a common repository. Approved artifacts need to be published in a medium that is easily consumed, typically a web-based user interface. In addition, the models themselves become the means to analyze, design, evaluate and implement changes going forward.
Due to the size and complexity of most environments, this must be done on a prioritized basis, starting with the most critical business data objects. Metrics are established to quantify relative importance as well as to evaluate progress. As with any continuous improvement initiative, breadth and depth are increased incrementally. Establishing a data culture and improving data quality is not a one-time project. It is an ongoing discipline that, when executed correctly, delivers breakthrough results and competitive advantage.