Enterprise Data Confusion – Did We Bring This Onto Ourselves? – authored by David Loshin, Knowledge Integrity, Inc.

Sep 25, 2014

This is the first in a series of six posts from David Loshin. Read his next post.

A few years back, I had the opportunity to work with Embarcadero Technologies, having been invited to put together some white papers and participate in some web seminars on the relationship between data governance, stewardship, and good data modeling practices. I am quite excited to be working with Embarcadero again, especially because I have been asked to share some thoughts about innovative ways to use metadata to stimulate conversations that bridge the gaps between the data practitioners and the business domains.

To begin, let’s ask a straightforward question: why are there gaps between the data practitioners and the business domains? Understanding the root cause of variances among value formats, data element structure, and business term definitions requires a little bit of a history lesson, examining the organic development of the enterprise application ecosystem.
During the early days of computing, the risk of data variation was relatively small. All applications were run in batch on a single mainframe system. Data was stored in flat files accessed by the batch applications.
However, as workgroup computing came into vogue, business areas began to invest in their own infrastructure and develop their own applications. Each business area’s set of applications was intended to support transactional or operational processing, and the functional requirements reflected the specific needs for initiating business process workflows and completing the corresponding transactions.
As an example, consider a typical bank deposit transaction in which the customer presents a check for deposit into a named bank account. This specific process focuses on the operational aspect of logging the transaction: finding the named account and increasing the stored balance by the amount of the deposit. The identity of the customer, her behavior patterns, and other analytical aspects are irrelevant to achieving the outcome of the transaction, so there was no demand for high fidelity in capturing or even documenting that information.
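
To make that concrete, here is a minimal sketch, with entirely hypothetical account numbers and field names, of how such a deposit application might model the work: find the named account and increase the balance, and nothing more, because nothing more is needed to complete the operation.

```python
from dataclasses import dataclass

# In-memory stand-in for the account master file, keyed by account number.
accounts = {"001-4477": 250.00}

@dataclass
class DepositTransaction:
    account_number: str  # the named account on the deposit slip
    amount: float        # the amount of the presented check
    # Note: no customer identity, household, or behavioral fields;
    # nothing the transaction does not strictly need in order to complete.

def post_deposit(txn: DepositTransaction) -> float:
    """Find the named account and increase the stored balance."""
    accounts[txn.account_number] += txn.amount
    return accounts[txn.account_number]

print(post_deposit(DepositTransaction("001-4477", 100.00)))  # 350.0
```
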
However, as operational reporting and business intelligence evolved, interest grew in repurposing data created or captured as a side effect of the transaction. For example, the bank might want to know which accounts are associated with which individuals, and what the total transaction volume is by customer, not just by account.
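
As a rough illustration (with made-up transactions and a hypothetical account-to-customer cross-reference), the sketch below shows why that repurposing is awkward: the transaction log identifies only accounts, so rolling volume up to the customer depends on reference data the operational systems were never asked to keep.

```python
from collections import defaultdict

# Transactions captured as a side effect of operations: accounts only.
transactions = [
    {"account": "001-4477", "amount": 100.00},
    {"account": "002-9810", "amount": 75.00},
    {"account": "001-4477", "amount": 40.00},
]

# Hypothetical cross-reference assembled after the fact for analytics;
# the operational systems never needed to maintain it.
account_to_customer = {"001-4477": "CUST-17", "002-9810": "CUST-17"}

volume_by_account = defaultdict(float)
volume_by_customer = defaultdict(float)
for t in transactions:
    volume_by_account[t["account"]] += t["amount"]
    volume_by_customer[account_to_customer[t["account"]]] += t["amount"]

print(dict(volume_by_account))   # {'001-4477': 140.0, '002-9810': 75.0}
print(dict(volume_by_customer))  # {'CUST-17': 215.0}
```
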
Data sets collected from across different functional silos are intended to be repurposed for reporting and analytics. The challenge is that, because of their isolated design and development, each of these business function areas has potentially used the same or similar concepts in its data models and applications in ways that differ slightly in definition, data format, and allowable values. In other words, the organic evolution of siloed business applications, focused on the functional requirements of transactional or operational processing, has allowed inconsistency and low fidelity across uses of the same or similar business terms and their corresponding data element representations.
These differences do not matter significantly as long as the applications continue to operate within their own functional context and domains. But when the data sets are accumulated for reporting and analysis, small differences suddenly have significant impact. A very simple example involves sharing records containing unique identifiers that have been assigned by different authorities. This happens frequently when data records created in different environments are exchanged across domains, as with healthcare identifiers, bank account numbers, or social service case identifiers, or when similar sets of records are accumulated for reporting and analysis.
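
To illustrate with invented patient records: when two authorities each assign their own identifiers, a naive merge keyed on those identifiers can both conflate different people and fail to link records for the same person.

```python
# Two authorities each assign their own identifiers to the people they serve.
clinic_a = {"1001": "Jane Smith", "1002": "Carlos Ruiz"}
clinic_b = {"1001": "Wei Chen", "1003": "Jane Smith"}

# A naive merge keyed on the identifier conflates Jane Smith (A:1001) with
# Wei Chen (B:1001), and never links A:1001 with B:1003 as the same person.
merged = {**clinic_a, **clinic_b}
print(merged)  # {'1001': 'Wei Chen', '1002': 'Carlos Ruiz', '1003': 'Jane Smith'}
```
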
As these data sets are forwarded to analytical environments (data marts, data warehouses, or even streamed into desktop self-service BI tools), the types of variances begin to unfold – differences in the values (such as names being spelled differently), differences in the structure (a last name field in one data set is 25 characters long while a last name field in another data set is 30 characters long), and differences in the definitions (“last name” refers to a residential customer’s family surname in one data set while “last name” means the previous name used for a commercial customer in the other data set).
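
The short sketch below, using invented field sizes, names, and definitions, walks through how those three kinds of variance might surface once the two feeds land in the same analytical store.

```python
# Structure variance: the same surname arrives truncated differently because
# one feed allows 25 characters and the other allows 30.
surname = "Featherstonehaugh-Cholmondeley"  # 30 characters
print(surname[:25])  # what the 25-character field can hold
print(surname[:30])  # what the 30-character field can hold

# Value variance: the same name spelled differently in two sources, so a
# simple equality join between the feeds will not match the records.
print("MacDonald" == "McDonald")  # False

# Definition variance: both columns are called "last_name" but mean different
# things, so blending them in one report is semantically wrong.
retail = {"cust_id": "R-88", "last_name": "MacDonald"}         # family surname
commercial = {"cust_id": "C-12", "last_name": "Ajax Widgets"}  # previous company name
print([retail, commercial])
```
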
And you might say that, while as data professionals we don’t necessarily need to shoulder the blame for these variances, in a modern information-aware enterprise it is our responsibility to seek out ways to mitigate the impacts of legacy variation. In my next post, we will explore the concept of metadata and how a metadata repository may originally have been intended to address the aforementioned challenges. But as we will see, the answer is not so simple…
 
This is the first in a series of six posts from David Loshin. Read his next post.
Want to learn more about ER/Studio? Try it for yourself free for 14 days!
 
About the author:
David Loshin, president of Knowledge Integrity, Inc. (www.knowledge-integrity.com), is a recognized thought leader and expert consultant in the areas of analytics, big data, data governance, data quality, master data management, and business intelligence. Along with consulting on numerous data management projects over the past 15 years, David is also a prolific author on business intelligence best practices, with numerous books and papers on data management, including the second edition of “Business Intelligence – The Savvy Manager’s Guide”.