This is the first post in this series.
The continued fascination with expanding an organization’s analytics capabilities often centers on the desire to identify new and interesting data sources originating outside the enterprise that can be integrated into a big data analytics application. The presumed wealth of information hidden within third-party or otherwise externally sourced data sets has inspired many aspiring data scientists to rapidly create processes for accessing and ingesting data sets whose volumes range from the massive down to the relatively small.
However, ingesting external data sets poses some unique challenges that are often overlooked by those analytical algorithmists with little formal training in data management. In particular, the fact that the origination points are outside of an organization’s administrative domain has some curious implications:
- Absence of governance – First, there is little or no knowledge about any controls for data usability imposed on the data set. In some cases where the data is acquired directly from the original owner, there may be some exchange regarding data organization (such as the stereotypical spreadsheet-based data dictionary providing column names, data types, and generally less-than-useful definitions). In most other cases, though, such as when data sets are harvested from various points along the World Wide Web, there is no meta-information provided to accompany the data sets.
- Editorial bias – Second, the creation of a data set for external consumption, by necessity, involves editorial decisions and biases that are not necessarily revealed to the downstream data consumers. Choices are not only made about the physical structure of the data (comma-separated values, quotes as field delimiters, etc.), but also about which data elements are included and which ones are excluded from the final artifact.
- Absence of metadata – Third, there are external data sets that are cobbled together dynamically (such as scraped data harvested using HTTP requests across different sites) that are completely devoid of any format structure or metadata.
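The physical-structure choices mentioned in the editorial-bias point can be made concrete with a minimal sketch (the sample record is fabricated). Even something as mundane as the producer's quoting policy is an editorial decision that a downstream consumer must know about in order to parse the file correctly:

```python
import csv
import io

# A fabricated record whose address field happens to contain a comma.
rows = [["Jane Q. Provider", "123 Main St, Suite 5"]]

buf = io.StringIO()
# QUOTE_MINIMAL quotes only fields that would otherwise be ambiguous.
# This is a producer-side choice; a consumer who assumes "split on comma"
# instead of using a CSV parser will silently mis-parse this record.
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)

print(buf.getvalue())
```

The comma-bearing field is emitted inside double quotes, while the other field is not; with `csv.QUOTE_ALL` the producer would have made a different, equally undocumented, choice.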
The issues created by these implications affect not only the ingestion and integration of the data, but also the use of the data within your organization's analytical context. One might think this is relatively straightforward, but even apparently simple aspects hide some critical complexity. Here is a quick example:
The US Department of Health and Human Services provides a number of data sets associated with health care providers via their Centers for Medicare and Medicaid Services web presence (cms.gov). One data set includes information about a medical provider’s National Provider Identifier (NPI). A different data set, called the Open Payments data set, contains information about pharmaceutical company payments to health care providers.
In the data dictionary for the NPI data, there is a data field called “Provider First Line Business Mailing Address,” which is (naturally) defined as the “provider’s first line business mailing address.” In the data dictionary for the Open Payments data, there is a column called “Recipient_Primary_Business_Street_Address_Line_1,” which is described as “The first line of the primary practice/business street address of the physician or teaching hospital (covered recipient) receiving the payment or other transfer of value.” As it turns out, even though in one case the data element is qualified as a “provider” address and in the other case as a “recipient” address, not only are these two data elements conformant (the recipient is the health care provider receiving a payment), but the Open Payments data value is probably sourced from the NPI data!
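A minimal sketch of why this pair defeats name matching (the two column names come from the CMS data dictionaries; the sample address values are fabricated): string similarity between the headers is unremarkable, while overlap among the actual observed values is a much stronger conformance signal.

```python
from difflib import SequenceMatcher

npi_col = "Provider First Line Business Mailing Address"
op_col = "Recipient_Primary_Business_Street_Address_Line_1"

def name_similarity(a: str, b: str) -> float:
    # Normalize case and treat underscores as spaces before comparing.
    norm = lambda s: s.lower().replace("_", " ")
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

# Fabricated sample values standing in for profiled column contents.
npi_values = {"123 MAIN ST", "456 OAK AVE", "789 ELM RD"}
op_values = {"123 MAIN ST", "789 ELM RD", "22 BIRCH LN"}

# Jaccard overlap of observed values: a value-level conformance signal
# that works even when the column names disagree.
value_overlap = len(npi_values & op_values) / len(npi_values | op_values)

print(round(name_similarity(npi_col, op_col), 2))
print(round(value_overlap, 2))
```

Profiling the values themselves, rather than trusting the labels, is what surfaces the provider/recipient pairing discussed above.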
This example shows just one case where relying on equivalent or similar column names to find conformant data elements is insufficient. There are numerous other cases, and in my next set of blog posts I will explore ways to assess the usability of data from external sources, and how to capture knowledge that can enable more effective analysis once those data sets are integrated into the enterprise.
About the author:
David Loshin, president of Knowledge Integrity, Inc. (www.knowledge-integrity.com), is a recognized thought leader and expert consultant in the areas of analytics, big data, data governance, data quality, master data management, and business intelligence. Along with consulting on numerous data management projects over the past 15 years, David is also a prolific author regarding business intelligence best practices, with numerous books and papers on data management, including the second edition of “Business Intelligence – The Savvy Manager’s Guide”.