Integrating Data from Multiple Sources: Inferring Source Metadata – by David Loshin, Knowledge Integrity

Feb 12, 2015

This is the second post in this series. Read the previous post and the next post.
In my last post, we began to review the data element conformance challenges associated with integrating data from a variety of original sources. The biggest issues have to do with the ability to conform data sets, or determine whether they can be blended together in a cohesive and semantically consistent way. The challenge is that in some, if not many, of the cases, there is little or no metadata, leaving you, the user, in the proverbial lurch. So what is a data practitioner to do?
Before we start to “pre-engineer” a solution, it may be worth taking a step back to survey the landscape and see what resources you have at your disposal. Also, for the sake of simplicity, let’s limit this discussion to structured data, since that is already challenging enough.
The first resource is the data source itself – did the originator provide any information about where the data came from, or about the data set’s data elements? Even a rudimentary data dictionary or data element catalog is useful if it provides data element name, data type, size, and perhaps some kind of definition.
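To make this concrete, here is a minimal sketch in Python of what even a rudimentary data element catalog entry might capture; the structure and field names are my own illustration, not a standard:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataElementEntry:
    """One entry in a rudimentary data dictionary / data element catalog."""
    name: str                         # data element name, e.g. "CUST_ZIP"
    data_type: str                    # e.g. "CHAR", "INTEGER", "DATE"
    size: int                         # declared length or precision
    definition: Optional[str] = None  # free-text definition, if the source supplied one

# Example: what a source-supplied catalog row might look like
zip_code = DataElementEntry(name="CUST_ZIP", data_type="CHAR", size=5,
                            definition="Customer 5-digit US postal code")
```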
Alternatively, in the complete absence of any prescribed metadata, the data set becomes the de facto authority. Often the first row of a data set contains the column headers, and that provides some context about the data elements. Couple that with a good data profiling tool that can survey the data set’s elements statistically and report on some key facets of the corresponding value sets, including the items below (a brief profiling sketch follows the list):
  • Inferred data types – String patterns can be matched against standard forms. For example, if all of the values map to a series of alphanumeric or just alphabetic characters, the data type is likely to be some kind of character string. Other standard format patterns include integers, real/decimal numbers, and dates.
  • Data lengths – The maximum length observed among a column’s values suggests the size of the underlying data element, and can dictate the inferred data element length.
  • Inferred domain – Mapping a data element’s value set to a commonly used reference domain can provide some insight, especially if the reference domain is related to the column name.

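As promised above, here is a minimal, illustrative sketch of how a profiler might infer these facets for one column. The patterns, the sample reference domain, and the function name are my own assumptions for illustration, not any specific tool’s API:

```python
import re

# Hypothetical format patterns a profiler might test, most specific first
PATTERNS = [
    ("INTEGER",  re.compile(r"^-?\d+$")),
    ("DECIMAL",  re.compile(r"^-?\d+\.\d+$")),
    ("DATE",     re.compile(r"^\d{4}-\d{2}-\d{2}$")),  # ISO-8601 only, for brevity
    ("ALPHA",    re.compile(r"^[A-Za-z]+$")),
    ("ALPHANUM", re.compile(r"^[A-Za-z0-9]+$")),
]

def profile_column(values, reference_domain=None):
    """Infer data type, maximum length, and optional domain membership
    for one column's value set."""
    values = [v for v in values if v]        # ignore empty/null values
    inferred_type = "STRING"                 # fallback if no pattern fits all values
    for type_name, pattern in PATTERNS:
        if all(pattern.match(v) for v in values):
            inferred_type = type_name
            break
    max_length = max((len(v) for v in values), default=0)
    in_domain = None
    if reference_domain is not None and values:
        in_domain = sum(v in reference_domain for v in values) / len(values)
    return {"inferred_type": inferred_type,
            "max_length": max_length,
            "domain_match_ratio": in_domain}

# Example: a column named "STATE" checked against a US state-code domain
us_states = {"AL", "AK", "AZ", "CA", "NY", "TX"}   # truncated for illustration
print(profile_column(["CA", "NY", "TX", "CA"], reference_domain=us_states))
# -> {'inferred_type': 'ALPHA', 'max_length': 2, 'domain_match_ratio': 1.0}
```

A column whose values all match against a small reference domain, and whose header name echoes that domain, is a strong candidate for an inferred domain assignment.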
The process can be repeated for all of the data sources, and, as that information is gleaned, reviewed, and vetted, it should be captured in a robust metadata repository that can be shared among the teams that collaborate in the data integration processes. As the metadata incrementally accumulates across all of the sources, you gain a degree of visibility into the ways the data sources can be ingested, absorbed, and ultimately interpreted. At that point you can apply a set of tests to assess conformability across the different data sets – a topic we’ll explore in my next post.
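As a simple illustration of that incremental accumulation (again a sketch; a real metadata repository adds versioning, stewardship, lineage, and access control), vetted results could be keyed by source and element name. This reuses profile_column and us_states from the sketch above:

```python
# A toy in-memory "repository": vetted metadata keyed by (source, element name)
repository = {}

def record_metadata(source, element_name, profile, reviewed_by):
    """Capture reviewed profiling results so collaborating teams can share them."""
    repository[(source, element_name)] = {**profile, "reviewed_by": reviewed_by}

record_metadata("crm_extract", "STATE",
                profile_column(["CA", "NY", "TX"], reference_domain=us_states),
                reviewed_by="data steward team")
```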

______________________________________________________________________________
Want to learn about ER/Studio? Try it for yourself free for 14 days!
You can also read this White Paper from David Loshin: Make the Most of Your Metadata
About the author:
David Loshin, president of Knowledge Integrity, Inc. (www.knowledge-integrity.com), is a recognized thought leader and expert consultant in the areas of analytics, big data, data governance, data quality, master data management, and business intelligence. Along with consulting on numerous data management projects over the past 15 years, David is also a prolific author regarding business intelligence best practices, with numerous books and papers on data management, including the second edition of “Business Intelligence – The Savvy Manager’s Guide”.