More and more of our clients are investing in Data Warehousing. Also, more are moving to cloud-based warehouses to enjoy benefits such as agility, cost reductions and availability. We now have some great products, like WhereScape, to help you to automate the design and execution of data warehouses, further reducing costs and improving efficiency.
But we need to be mindful of what we put into our shiny new warehouses. We have all seen ungoverned warehouses become bloated with data that is not understood, putting us at risk. Misunderstood or poor quality data can be very dangerous.
The warehouse can advertise data that is labelled poorly, has no context, is a calculated field using the wrong formula or just of poor quality. We also need to be mindful of security or regulation. Are we exposing confidential or sensitive data? If we are working in the cloud are we putting personal data in a location that may contravene our rules. Planning is the key.
Catalog your assets
What data assets exist in your organization? For each asset do we have useful knowledge such as the context of the data, the owner of the asset, a summary of content and any rules that may apply? ER/Studio can help you build a list of data sources and publish that knowledge to the organization.
Classify Your Assets
For each asset we need to know what data is within it. The metadata of the asset may not help you as it could be physical, technical names. Most database products have restrictions on naming, so compressed physical names are used.
EXTACNO_4 is a constriction of External Account Number. Other systems like SAP just have codes for tables and columns that have zero meaning. They are tools out there that will look at the data itself and try to guess the meaning of the data, some using AI technologies.
There is little substitute for proper documentation. Data Architects have this at the core of their world. They are the team that designs and documents data assets. ER/Studio is at the heart of this. Each data asset can have a physical model that shows the tables and columns with their physical names and also a logical model in business-friendly language with useful business metadata.
Define Your information
We all throw around language and assume we all mean the same thing but danger lies here. When you are asked for North American sales figures what do you really mean? How should this be calculated? A published glossary of accepted terms can help here to reduce ambiguity and therefore reduce risk.
Key Performance Indicators can be defined along with their calculations. Also information need classifying in terms of sensitivity, confidentiality etc. Some information may be related to people and be classified as personally identifiable information (PII).
Rules specifying the usage of the data should be specified. ER/Studio has a built-in Business Glossary where Business Terms can be defined and allows Data Architects to classify data assets against them. You can then see which data assets contain instances of a piece of information.
So ER/Studio can help in this planning process by producing and publishing a Data Catalog that can help answer the important questions:
For the data that we will store in the warehouse what information do we need? What data is required for calculated fields? What rules should we consider for that data?
Additionally, where is this information stored? Which of the data assets has the data in the right context and quality?