If you’ve ever worked with data stored in JSON files, you probably know that extracting data from nested JSON files for processing can be daunting and time-consuming. Data stored in JSON files make up a whopping 80% or more of data collected by large-scale enterprises. In addition, standard BI Tools are designed for relational data and struggle to process JSON data.
The easiest way to do so is to extract the data from the files and store the data in a relational database like Snowflake, Google BigQuery, Databricks, etc. where it can be easily queried and processed using standard tools.
This blog explores an integrated automation solution combining features from ER/Studio, Collibra and Bluemetrix Data Manager, which simplifies this extraction process and allows you to extract data at scale from nested JSON files.
Code-Free JSON Data Extraction with Bluemetrix, IDERA and Collibra
Reporting on a JSON data is not easy. So, our mission is to take data from a large number of hierarchical JSON files, either in a simple folder structure or on a data lake platform, unpack it and load it into a collection of related tables in a database. Then we can run the BI tools we normally use against the data and even link that data with existing data sets.
There are three parts to the challenge:
- To design the structure of our relational database that will contain our data
- To unpack the data in our JSON files and load it into those tables
- To ensure that sensitive or private data is protected along the way
Doing this without tools would be time consuming and error prone. The right tools can make this a breeze. Let’s learn a little about those tools.
Bluemetrix Data Manager (BDM) is a platform-agnostic Spark-based ETL solution that simplifies the creation of data pipelines for the movement and transformation of data. It works with multiple data sources and targets, including JSON files.
ER/Studio is a data modelling tool for the design and documentation of data assets such as databases and JSON files using graphical business-friendly models.
Collibra Data Intelligence Cloud is the leading enterprise cloud platform for capturing and recording data governance.
Combining BDM with the data modelling functionality available in ER/Studio and the Data Governance features available in Collibra, provides a no-code automation solution that simplifies and expedites the extraction of data from nested JSON files.
Let’s examine in more detail how this works.
Step 1: Create Target Data Store and Models for Nested JSON Files
Our nested JSON files have data stored in a hierarchical structure. ER/Studio will reverse engineer this JSON and create a physical model of the source JSON and a logical model of the information within it. The user will enhance the logical model by adding primary keys and moving it through the process to 3rd Normal Form.
Step 2: Create Physical Model
Next, we automatically generate a physical model choosing Snowflake as the target technology from the logical model created in Step 1. You can then generate DDL code and deploy this to your Snowflake instance.
Step 3: Synchronize the Business Glossaries of Collibra and ER/Studio Team Server
ER/Studio Team Server has built in features to synchronize with Collibra. In this example, we have a business glossary term ’patron number’ which has been flagged as PII and has a Sensitivity Level as Critical. These values will have been set by Data Stewards in Collibra.
Step 4: Use Data Architect in ER/Studio
We map fields in ER/Studio’s physical and logical models to these glossary terms. You can also apply classification properties directly to fields.
Step 5: Publish in Collibra
The logical model of the information in our JSON along with the physical schemas of the JSON file and the target Snowflake database are now published to Collibra along with mappings to the Business Terms and direct classifications.
Step 6: Read These Models from Collibra into Bluemetrix Data Manager (BDM)
In BDM we read the schema information for the JSON files from Collibra, for the source and target database. With this information we then read the source and ingest the data into memory.
Step 7: Read the JSON files and use the models to explode the data and transform it into relational tables in Snowflake.
Once the data is available to us in memory, BDM will then explode this data and write it into a data frame for further processing.
Step 8: Identify PII tag in Collibra. BDM will read the data policies which apply to this data source from Collibra. In this example, we can see that the column patron number has PII tags, which means the data in this column is sensitive and needs to be tokenized.
We know this data needs to be Tokenized because a Ruleset has been created in BDM which states that any data that is tagged PII must be Tokenized. This means that before the data in this column is written to the table in Snowflake it will be automatically tokenized.
Step 9. Run the pipeline
Now we will run the pipeline and write the JSON data from the data frame into a relational table in Snowflake. And we can see here that the data has been converted into a tabular form, and a foreign key relationship has been created between the tables using the ISBN number. This means that any BI tool user can now execute standard queries against this normalized relational data. And per the data policy rules, the patron number data has been tokenized when it was written into the table.
Step 10: Capture the Lineage in Collibra
And finally, we will look at Collibra once more, where we can see the lineage that has been captured for this pipeline. We can see that the JSON files ingested have been turned into relational tables in Snowflake, while the appropriate fields have been tokenized before the data was written into Snowflake in compliance with the Data Policies in Collibra.
How can data-driven organisations benefit from the integration?
This integrated solution from Collibra, IDERA and Bluemetrix strives to make the impossible possible and the hard easy. We make extracting data from complex JSON files at scale simple and quick, while doing so in a secure manner which is compliant with Data Governance policies, as outlined in the previous 10 steps.
By converting the data into a tabular form, while establishing a foreign key relationship between the tables using the ISBN number, this means that any BI tool user can now execute standard queries against this normalized relational data. All of this allows analysts and business users analyse the data using their existing BI tools and skill sets, allowing them to extract the value from the data that was previously locked in the JSON files.
They can carry out all this work in a controlled and compliant manner, ensuring that the data is processed in compliance with the Data Policies in Collibra, where all the lineage is captured ensuring that activities can be tracked and available for auditing for GDPR purposes.
Bluemetrix, Collibra and IDERA’s partnership takes a proactive approach to JSON data extraction and simplifies the responsibilities for the Data Engineering and Data Ops teams, while ensuring data is accessible and can be shared seamlessly by the data users. To see how the integration works in action, watch the video here.