Integrating Data from Multiple Sources: Making Sense of Conformable Data – by David Loshin, Knowledge Integrity

Mar 4, 2015

This is the third post in this series. Read the previous post.

In my last post, the discussion focused on the ability to infer metadata for data sources to be ingested into the enterprise when little (or no) metadata is provided. Yet the topic of our discussion is not purely ingestion of data into the enterprise, but integration of that data. So presuming that we were able to assemble a comprehensive view of the metadata for the data sets to be brought into the organization, we would still need a way to assert that when those data sets are instantiated within our data management platforms, there are no fundamental barriers to integrating them into a cohesive information asset.
Integrating two data sets implies some expectations. Any corresponding pair of data elements taken from the two data sets to be aligned for transaction processing or analysis purposes must (a short sketch after this list shows how these checks might be captured):
  • Share the same data element concept. For example, if data set A has a data element named “CountryOfOrigin” and data set B has a data element named “CountryOfManufacture,” they would both have to share the same data element concept of “Country.”
  • Share the same value domain. Using the same example, we might expect that the two data elements both take their values from the ISO 3166 standard set of three-character country codes.
  • Share the same meaning. In our case, both definitions must designate the country in which a product was originally manufactured.
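To make these three checks concrete, here is a minimal sketch of a metadata descriptor that captures all three properties so strict equivalence can be tested. The descriptor fields and sample values are illustrative assumptions on my part, not part of the original discussion:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataElementMetadata:
    """Structural and business metadata for one data element."""
    name: str        # physical name in the source data set
    concept: str     # data element concept, e.g. "Country"
    domain: str      # reference value domain, e.g. "ISO 3166 alpha-3 codes"
    definition: str  # business meaning of the element

def strictly_equivalent(a: DataElementMetadata, b: DataElementMetadata) -> bool:
    """All three expectations hold: same concept, same value domain, same meaning."""
    return (a.concept == b.concept
            and a.domain == b.domain
            and a.definition == b.definition)

# The two elements discussed in the post, assuming both use alpha-3 codes
# and carry the same definition.
country_of_origin = DataElementMetadata(
    "CountryOfOrigin", "Country", "ISO 3166 alpha-3 codes",
    "country in which the product was originally manufactured")
country_of_manufacture = DataElementMetadata(
    "CountryOfManufacture", "Country", "ISO 3166 alpha-3 codes",
    "country in which the product was originally manufactured")

assert strictly_equivalent(country_of_origin, country_of_manufacture)
```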
While these expectations are reasonable, holding to them strictly limits our ability to integrate data sets from multiple sources when minor variations exist. That is the reason for loosening the constraints by introducing a concept of “conformability” of metadata for integration purposes: the data element concept, domain, and semantics of the two data elements are close enough that applying an interpretation or a transformation provides the necessary alignment for the specific business purpose.
The simplest example would be where the data element concepts are the same but the value domains differ. In our case, that might be where CountryOfOrigin uses the ISO 3166 three-character country codes while CountryOfManufacture uses the full country name from the same standard. Here, the standard itself provides the mapping from three-character code to full name, and the data elements can be tested to confirm that wherever a code appears in the first data element, the mapped country name appears in the corresponding value of the other.
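A sketch of that test might look like the following; the mapping table here is a three-entry excerpt I have supplied for illustration, whereas a real check would load the full ISO 3166 code list:

```python
# Excerpt of the ISO 3166 alpha-3 -> English short name mapping;
# a production check would load the full standard code list.
ISO_3166_ALPHA3_TO_NAME = {
    "USA": "United States of America",
    "DEU": "Germany",
    "CHN": "China",
}

def conformable_by_mapping(codes, names):
    """Wherever a code appears in one data element, the mapped country
    name must appear in the corresponding record of the other."""
    return all(ISO_3166_ALPHA3_TO_NAME.get(code) == name
               for code, name in zip(codes, names))

# Paired values pulled from corresponding records of the two data sets.
origin_codes = ["USA", "DEU", "CHN"]
manufacture_names = ["United States of America", "Germany", "China"]
print(conformable_by_mapping(origin_codes, manufacture_names))  # True
```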
On the other hand, differences in meaning may show that the data sets cannot be easily integrated. An example might be where CountryOfManufacture means the location in which the product was made, but CountryOfOrigin is the location from which the product was shipped. In some cases the difference in definition might be irrelevant, but perhaps for calculating tariffs the difference might be critical.

The upshot is that the process of integrating data from multiple sources needs a definition and a policy of conformability, accompanied by a step-by-step process for assessing that conformability. One can begin with the simple suggestions in this post and branch out from there by adding governance around the conformability policy and the assessment processes. However, realize that you will need the right kinds of tools at your disposal; neither the policy nor the process can be instituted without a foundation and tooling for capturing structural and business metadata.
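As a closing illustration, one way such a step-by-step assessment might be staged is sketched below. The verdict labels and the ordering of the checks are my own assumptions, not a prescribed methodology, and the function reuses the illustrative DataElementMetadata descriptor from the first sketch:

```python
def assess_conformability(a, b, value_mapping=None):
    """Staged assessment over two DataElementMetadata descriptors.
    Returns (verdict, reason-for-concern-or-None)."""
    # Step 1: the data element concepts must match outright.
    if a.concept != b.concept:
        return "not conformable", "different data element concepts"
    # Step 2: differing value domains are acceptable only with a mapping.
    if a.domain != b.domain and value_mapping is None:
        return "not conformable", "different value domains and no mapping"
    # Step 3: differing definitions require review against the business purpose.
    if a.definition != b.definition:
        return "needs review", "definitions differ; assess against the business purpose"
    if a.domain != b.domain:
        return "conformable via mapping", None
    return "conformable", None
```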

______________________________________________________________________________
Want to learn about ER/Studio? Try it for yourself free for 14 days!
You can also read this White Paper from David Loshin: Make the Most of Your Metadata
About the author:
David Loshin, president of Knowledge Integrity, Inc. (www.knowledge-integrity.com), is a recognized thought leader and expert consultant in the areas of analytics, big data, data governance, data quality, master data management, and business intelligence. Along with consulting on numerous data management projects over the past 15 years, David is also a prolific author regarding business intelligence best practices, with numerous books and papers on data management, including the second edition of “Business Intelligence – The Savvy Manager’s Guide”.