Data Modeling Still Important in a Big Data, Analytics World

by Craig S. Mullins, Oct 23, 2018

As more organizations embrace big data and analytics to gain insight from extremely large data sets, the tools and systems that we use to manage data have grown, changed, and multiplied. Instead of relying solely on relational, SQL database systems, we now also use NoSQL databases and Hadoop file systems to store ever-larger amounts of corporate data.

Now you would think that, given the towering importance of data in today’s organizations, data modeling would be viewed as extremely important by management and IT professionals. So it is somewhat ironic that the age of big data has coincided with a long-term slide in data administration and modeling in our enterprises.

This is not a situation that should continue to be tolerated.

What is Data Modeling?

Data modeling is the process of analyzing the “things” of interest to your organization and how these things relate to each other. The data modeling process results in the discovery and documentation of the data resources of your business. As you create your conceptual and logical data models, you are developing the lexicon of your organization's business.

A data model is built using components that act as abstractions of real-world things. The simplest data model consists of entities and relationships. As work on the data model progresses, additional detail and complexity are added including attributes, domains, constraints, keys, cardinality, requirements, and relationships. And importantly, definitions of everything in the data model.
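To make those components concrete, here is a minimal sketch in Python of how two entities and the relationship between them might be captured. The Customer and Order entities, along with their attributes, keys, and constraints, are hypothetical and purely for illustration:

```python
from dataclasses import dataclass
from datetime import date

# Entity: Customer. Each field is an attribute; the comments note the
# domain and constraint detail a logical data model would record.
@dataclass
class Customer:
    customer_id: int   # key: uniquely identifies a Customer
    name: str          # domain: non-empty text
    email: str         # constraint: must be unique across Customers

# Entity: Order. The customer_id field acts as a foreign key, expressing
# the relationship "a Customer places zero-to-many Orders" (cardinality 1:N).
@dataclass
class Order:
    order_id: int      # key: uniquely identifies an Order
    customer_id: int   # relationship: refers to exactly one Customer
    order_date: date   # domain: a valid calendar date
    total: float       # constraint: total >= 0
```

Even a toy sketch like this forces the definitions the data model demands: what a Customer is, what an Order is, and exactly how the two relate.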

Everybody should agree that all of these things are important. If we want to understand what data we have – and how to use it – a foundational model is required. The alternative is that the knowledge remains embedded in the brains of those who use the data on a daily basis. Or even worse, you are using data that you don’t really know and therefore shouldn’t really trust!

Issues with Big Data

Big data and analytics are an important part of modern IT. The amount of data created has grown, and continues to grow. Analysts at IDC estimate that the amount of data we use and manage doubles annually. Eric Schmidt, of Google, put it another way: “There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.”

Any way you look at it, the amount of data “out there” continues to expand. And performing analytics on that data can uncover heretofore unknown insights that lead to competitive advantage.

Furthermore, the big data used to power analytics is being adapted for use by AI and machine learning software that will further improve the return on our computing investment through automating processes and tasks, thereby increasing productivity and operational efficiencies.

So big data and analytics are here to stay. Issues can arise, however, when flexible-schema technologies like NoSQL and Hadoop are used. Flexibility is often a requirement when large amounts of data are being discovered, ingested, and moved into an organization: when one row (or record) of data can have a different schema than the next, you cannot apply a fixed schema model to the data.
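For example, consider two hypothetical event records arriving in the same feed (the field names here are invented for illustration). No single fixed relational schema fits both rows without analyzing them first:

```python
# Two records from the same (hypothetical) feed, each with a different shape.
record_a = {"user_id": 101, "event": "click", "page": "/home"}
record_b = {"user_id": 102, "event": "purchase",
            "items": [{"sku": "A-1", "qty": 2}], "coupon": "SAVE10"}
```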

Nevertheless, the programmer has to know what the data looks like. You cannot just throw a big lump of data at somebody and say, “Here is the data, now write me a program.” Well, you can say that, but then the programmer (or somebody else) has to analyze and document the structure of the data.
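As a sketch of what that analysis might look like in practice, the following Python function walks a batch of semi-structured records and tallies the fields and value types it observes, yielding a rough, discovered schema. This is an illustrative approach under simple assumptions, not any particular tool’s API:

```python
from collections import defaultdict

def infer_schema(records):
    """Tally the fields and value types observed across a batch of
    semi-structured records -- in effect, a crude discovered schema."""
    observed = defaultdict(set)
    for record in records:
        for field, value in record.items():
            observed[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in observed.items()}

# Applying it to the two hypothetical records shown earlier:
records = [
    {"user_id": 101, "event": "click", "page": "/home"},
    {"user_id": 102, "event": "purchase",
     "items": [{"sku": "A-1", "qty": 2}], "coupon": "SAVE10"},
]
print(infer_schema(records))
# -> {'user_id': ['int'], 'event': ['str'], 'page': ['str'],
#     'items': ['list'], 'coupon': ['str']}
```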

Hmmm. That sounds a lot like a data model, doesn’t it? Well, it should, because it is one. Instead of modeling up front, before any code is written, as is common in the relational world, big data modeling is sometimes performed after the fact, based on the queries that applications issue in program code or tools. What we want to avoid is having all of the knowledge of the data embedded in application programs, as was common before relational databases became popular in the 1980s.

Data modeling creates a system of record for enterprise data that is accessible by all, and not just those who understand the programming language du jour.

Why Data Modeling Is Still Needed

If I haven’t convinced you that data modeling is still important, then let me try one more tactic. Think about regulatory compliance: the processes and procedures your organization follows to ensure that it adheres to governmental laws and applicable industry regulations.

This includes regulations like HIPAA, which provides data privacy and security provisions for safeguarding medical information; PCI-DSS, which governs the technical and operational systems that handle credit and debit cardholder data; and GDPR, which protects the data and privacy of individuals within the European Union. These, and many other regulations, specify particular types and instances of data that must be protected or controlled in specific ways.

Without a data model that identifies and defines what data you have (including where it came from and who uses it), how can you ever hope to be in compliance with the industry and governmental regulations that apply to your business?

Start Today

If you do not model and define your organization’s data today, you can start doing so on a project-by-project basis. Incorporate the documentation of data as a component of every new project you undertake. Educate your development teams on the importance of data modeling and documentation, and make sure that a data model is required for every project moving forward.

Over time you can build data modeling into the fabric of your organization.

Craig S. Mullins
Mullins Consulting, Inc.
http://www.mullinsconsulting.com