The Importance of Data Lineage

by Oct 31, 2018

How many times have you been querying some data, or looking at a report, and asked yourself something like “Should I believe this data? I wonder where this data came from?” Well, what you are asking about is the lineage of the data.

The Merriam-Webster Dictionary defines lineage as “descent in a line from a common progenitor,” and the term is most often used in ancestry, tracking your family tree. Data lineage applies the same concept to your data.

A comprehensive log of your data’s lineage would track its life cycle, from the data’s origins through everywhere it moves over time. It would include any transformations and changes that occur along the way, as well as details about the processes that moved and transformed the data.

Without knowledge of where your data has been and what happened to it – its lineage – how can you be certain that the data is trustworthy for analytics, machine learning, reporting, or indeed, any business activity that uses the data?

An Analogy

Have you ever watched the Antiques Roadshow program on television? In this show, people bring their personal items to professional antique dealers to have them examined and evaluated. The participants hope to learn that their items are long-lost treasures of immense value. Sometimes they are; often they are not.

The antique dealers spend a lot of time not only examining the item but also talking to the owners. They ask questions about the item’s provenance, such as “Where did you get this item?” and “What can you tell me about its history?” Now, the item is sitting right there in front of them, yet they ask these questions. Why?

The details about an item’s provenance – its lineage, if you will – can provide knowledge about the authenticity and nature of the item. The dealer also carefully examines the item looking for markings and dates that provide clues to the item’s origin.

Sometimes the item comes with a letter or other document that provides additional information about it. Such documents usually make the antique dealer more excited about an item because they add details about its lineage: where and when it was purchased, who purchased it, who owned it, and so on.

Just like valuable antiques, it is important for the lineage of valuable data to be tracked and managed to know its authenticity and how it can be used. So, using our Antiques Roadshow example, the item being evaluated is the “data.” The answers to the antique dealer’s questions, the markings on the item, and any additional letters or documentation are the metadata that explains the lineage. Value can be assigned to an item, or our data, only after the metadata is discovered and evaluated.

Capturing Data Lineage

Data lineage shows the movement of data throughout its existence. You must be able to capture information about the data’s creation, which is easier if your organization creates the data. If you acquired the data from an external source, its recorded origin will simply be that source, unless the source also provides its own data lineage details.

You must then create processes that capture information about the data as it moves throughout your organization. Automated capture is best, but manually recording the details is better than nothing. You want to know when the data is moved to, or made available to, any system, including ETL, transactional, analytical, and any other type of system you may be using.
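To make this concrete, here is a minimal Python sketch of what capturing lineage might look like. It is an illustration only: the LineageEvent record, the record_move and tracked helpers, and the system names (vendor_feed, staging_db, warehouse) are all hypothetical and do not come from any particular product. The decorator stands in for automated capture, and the final call stands in for a manually recorded move.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from functools import wraps
from typing import List


@dataclass
class LineageEvent:
    """One recorded movement of a dataset (hypothetical structure)."""
    dataset: str         # what data moved
    source: str          # where it came from
    target: str          # where it went
    process: str         # which job or person moved it
    transformation: str  # what was done to it along the way
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# In this sketch the "repository" is just an in-memory list.
LINEAGE_LOG: List[LineageEvent] = []


def record_move(dataset, source, target, process, transformation="copy"):
    """Manual capture: append one lineage event to the log."""
    event = LineageEvent(dataset, source, target, process, transformation)
    LINEAGE_LOG.append(event)
    return event


def tracked(dataset, source, target, transformation):
    """Automated capture: log an event every time the wrapped step runs."""
    def decorator(step):
        @wraps(step)
        def run(*args, **kwargs):
            result = step(*args, **kwargs)
            record_move(dataset, source, target, step.__name__, transformation)
            return result
        return run
    return decorator


@tracked("currency_rates", "vendor_feed", "staging_db",
         "round rates to 4 decimal places")
def load_currency_rates(rows):
    return [{"code": r["code"], "rate": round(r["rate"], 4)} for r in rows]


loaded = load_currency_rates([{"code": "EUR", "rate": 1.13157}])

# A move that no automated job covers can still be recorded by hand.
record_move("currency_rates", "staging_db", "warehouse", "manual load by analyst")

for event in LINEAGE_LOG:
    print(event.dataset, event.source, "->", event.target, "via", event.process)
```

In a real environment the log would live in a metadata repository rather than in memory, but the information captured per move would be similar.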

Before you pursue capturing data lineage for any data, you need to understand the granularity of lineage required. Coarse-grained lineage may be sufficient for many types of data, where you only capture high-level details of the data’s lifecycle. However, for some types of data – for example, highly regulated data such as personally identifiable information (or PII) – you may need fine-grained data lineage, including each change to the data, who made the change, when it was made, and perhaps even the before- and after-image of the data. Such a requirement causes the discussion to move from data lineage to data access auditing.
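The difference in granularity can be sketched in a few lines of Python. This is a hypothetical illustration, not a prescribed format: the function names, the field names, and the sample customer record are all assumptions made for the example. The coarse entry records only that a dataset moved, while the fine-grained entries record each changed column along with who changed it, when, and the before and after values.

```python
from datetime import datetime, timezone


def coarse_entry(dataset, source, target):
    """Coarse-grained: one high-level entry per movement of a dataset."""
    return {
        "dataset": dataset,
        "source": source,
        "target": target,
        "at": datetime.now(timezone.utc).isoformat(),
    }


def fine_entries(dataset, key, before, after, changed_by):
    """Fine-grained: one entry per changed column, with before/after images."""
    now = datetime.now(timezone.utc).isoformat()
    entries = []
    for column in sorted(set(before) | set(after)):
        old, new = before.get(column), after.get(column)
        if old != new:
            entries.append({
                "dataset": dataset,
                "key": key,
                "column": column,
                "before": old,
                "after": new,
                "changed_by": changed_by,
                "at": now,
            })
    return entries


# Hypothetical PII record before and after an update.
before = {"name": "A. Smith", "email": "asmith@example.com"}
after = {"name": "A. Smith", "email": "a.smith@example.org"}

summary = coarse_entry("customers", "crm", "warehouse")
changes = fine_entries("customers", 42, before, after, "crm_sync_job")

for change in changes:
    print(change["column"], change["before"], "->", change["after"])
```

Note how quickly the fine-grained log grows compared with the single coarse entry, which is one reason to reserve it for data that truly requires it.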

 

Typically, a data modeling tool or metadata repository is used to track data lineage. With data lineage details available in your toolset, you can create diagrams from your data models that show where the data came from and its path through your systems, and thereby gain better knowledge of your data and how it can be used at your organization.
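Under the covers, a lineage diagram is essentially a directed graph of data flows. The Python sketch below shows one way such a graph could answer the question “where did this data come from?” It is a simplified illustration, not how any specific modeling tool or repository works, and every system and table name in it is made up for the example.

```python
from collections import defaultdict

# Each edge reads "data flows from source to target" (hypothetical names).
EDGES = [
    ("vendor_feed", "staging.currency_rates"),
    ("staging.currency_rates", "warehouse.dim_currency"),
    ("erp.orders", "warehouse.fact_sales"),
    ("warehouse.dim_currency", "reports.sales_by_region"),
    ("warehouse.fact_sales", "reports.sales_by_region"),
]


def build_upstream_index(edges):
    """Map each node to the set of nodes that feed it directly."""
    upstream = defaultdict(set)
    for source, target in edges:
        upstream[target].add(source)
    return upstream


def trace_origins(node, upstream):
    """Return every node that feeds `node`, directly or indirectly."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in upstream.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def original_sources(node, upstream):
    """Upstream nodes that have no recorded source of their own."""
    return sorted(n for n in trace_origins(node, upstream) if not upstream.get(n))


upstream = build_upstream_index(EDGES)
print(sorted(trace_origins("reports.sales_by_region", upstream)))
print(original_sources("reports.sales_by_region", upstream))
```

Here the report traces back through the warehouse and staging tables to two origins, the vendor feed and the ERP orders, which is the kind of path a lineage diagram makes visible.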

Data Lineage in ER/Studio Data Architect

One popular product for data modeling is IDERA’s ER/Studio Data Architect, which provides an easy-to-use visual interface for data modeling professionals to document, understand, and publish information about data models and databases. Using ER/Studio you can document data lineage showing the provenance of the data and its movement from point to point, and any intermediate steps in between.

There are several ways to capture and document data lineage with ER/Studio Data Architect. For example, you can import the external metadata from your ETL tools to capture the movement of data. Figure 1 shows the various resources that can be captured, such as MSIS, MicroStrategy, and many more.

Figure 1. Capturing data lineage from ETL.

 

The result of capturing ETL flows is a data lineage diagram that shows the movement and transformation of your data. For example, Figure 2 shows the movement of currency rate data in your organization.

Figure 2. Example data lineage diagram.

 

You can also track the lineage of data in your data models by recording information about how often the data is sourced, the last time it was updated, and the rules that are applied to the movement of the data, as shown in Figure 3.

Figure 3. Data lineage details.

 

There are many powerful capabilities for tracking data lineage in ER/Studio Data Architect, and this brief overview has just scratched the surface.

Summing It Up: Data Lineage

So, do you know the lineage of your important business data? Where did it come from? When was the last time it changed? What processes change it?

These are important questions that factor into the quality and usefulness of data. If you cannot answer these questions, can you truly trust the data?