Why is it important to keep your data consistent, and how to achieve it?
We are nearing the end of our Modern Data Warehouse journey. Today we’ll discuss the final aspect of your data project.
If you’re new to the series, just click here to check out the first article and start from the beginning!
If you’re in a hurry, check out the sentences in bold font for the essentials!
It all starts with reporting
Master Data Management (MDM) is a mature and well-developed area in data projects. And yet, it is often skipped or marked as a low priority feature. It often comes up when the objective is to show some data on dashboards. Sometimes, the story might start like this:
"Hey! We have all the necessary data. Let's build some reports! There is no need to do any data adjustment. Everything that we need is there."
We’ve heard this statement quite a few times. Often, there were data quality issues due to the lack of master dictionary management. This makes creating the necessary reports difficult.
Essentially, Master Data Management is the process of unifying your data so that it is consistently labeled and mapped to the relevant entities across all your systems.
It allows all your services to use the same data set for all purposes so that there are no gaps, inconsistencies, or duplicates in your records.
Why is it important?
The problem with Master Data might arise once your organization has at least two systems (e.g. CRM, accounting) that share common business objects, e.g. employee, country, account, etc. When you’re building a Modern Data Warehouse, what often happens is that many different business areas intersect and need to be somehow combined and connected.
If your organization doesn’t incorporate MDM into its data strategy, then at some point, difficulties and obstacles to using your data will appear. They will limit business development opportunities. They may also prevent your company from making the most of its data and getting ahead of your competitors.
Do you really need MDM?
Of course, it is possible to operate without an MDM solution to some extent, especially at the beginning, when there is not much data and the dictionaries are small. Potential issues can often be fixed easily, e.g. a discrepancy in a small dictionary can be aligned manually across all source systems.
A one-time alignment like this might be a solution, but in reality problems are usually more complex: dictionaries are large, frequently updated, and occur in multiple systems with different sets of attributes. The more objects there are, the larger the shared portion that should be maintained centrally.
Otherwise, processes and workflows will eventually become unmanageable. Data quality will be low, business users will be confused, and trust in the data will decrease. Preparing a single report will require a lot of effort. Your organization will stay stuck in data silos between departments and lose the opportunity to have a shared, big-picture view of what is going on.
This is just one example where you would need a Master Data Management system. But of course, not everyone has the same requirements.
As an organization grows, so does the awareness about Master Data Management. And not everyone is at the same stage.
Additionally, companies don’t always start building their system assets with MDM included right from the start. There are many reasons for this, such as time, money, priorities, maturity stage, etc. Still, the longer it takes to introduce an MDM solution, the more complex the integration might be.
MDM is an inseparable part of a Modern Data Warehouse engagement. Here, it is a must, not an option. Without it, the whole program will face issues from the very beginning, and will ultimately fail.
How you create your solution depends on many factors. It is not just about the right tool, but also the business processes (including business rules) and the responsibilities of dedicated roles. A proper tool should support not just the ETL, but also data management.
There are three typical scenarios for how master data (business objects/dictionaries) can be distributed, synchronized, and handled between particular systems.
Our business case
Let's consider the following situation. There are three different systems (System A, System B, System C) that operate on the same business object, i.e. Employee. An MDM solution integrates data between these systems. At the final stage, there is also a Data Warehouse that ingests dictionary data from the MDM solution. The goal is to achieve a global version (a single source of truth) for the Employee entity.
Employee objects are maintained separately in each system. There is no direct synchronization between them. Therefore, each source system has its own version of Employee data. Part of the Employee records exists in all systems (they are shared), while another part is unique to each system.
Let’s talk about how the Employee object might flow between systems A, B, C, MDM, and Data Warehouse.
Scenario 1 – no source system
In this approach, all Employee data from our systems is transferred to Master Data. None of the sources can be indicated as a master source system for the Employee entity, and none of them contain all of its records.
Within Master Data, everything is combined into one data set (or “superset”, “the master set”), which includes all records and excludes duplicates. In the final step, the information is distributed further for consumption.
In Modern Data Warehouse engagements, one of the destination systems will be a Data Warehouse. However, the master set can also be ingested by any other system by gathering it from the Master Data.
This scenario occurs most often. Objects are unified at the Master Data level and then propagated further (to a Data Warehouse) without feedback to the source systems. In other words, data is ingested and the integration process does not perform any adjustments at the data source level.
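Scenario 1 can be sketched in a few lines of code: records from each system flow into one master set, with duplicates collapsed by a matching key. This is a minimal illustration only; the field names and the use of email as the match key are assumptions, not part of any specific MDM product.

```python
# Scenario 1 sketch: combine Employee records from Systems A, B, and C
# into one deduplicated master set. The email match key is a hypothetical
# simplification; real MDM tools use configurable matching rules.

def build_master_set(*sources):
    """Merge employee records from all sources, excluding duplicates."""
    master = {}
    for system_name, records in sources:
        for record in records:
            key = record["email"].lower()  # hypothetical match key
            if key not in master:
                master[key] = {**record, "source_systems": [system_name]}
            else:
                # Shared record: remember every system it came from.
                master[key]["source_systems"].append(system_name)
    return list(master.values())

system_a = [{"email": "jan@corp.com", "name": "Jan Kowalski"}]
system_b = [{"email": "jan@corp.com", "name": "Jan Kowalski"},
            {"email": "eva@corp.com", "name": "Eva Nowak"}]
system_c = [{"email": "eva@corp.com", "name": "Eva Nowak"}]

master = build_master_set(("A", system_a), ("B", system_b), ("C", system_c))
print(len(master))  # 2 unique employees, each tracking its source systems
```

From here, the master set would be propagated onward, e.g. to the Data Warehouse.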
Scenario 2 – two-directional data update
Similarly to scenario 1, in this approach, Employee data from all three systems is transferred to Master Data. None of them can be indicated as a master source system for the Employee entity, and none of them contains all employee records.
Again, within Master Data, all data is combined into a single dataset including all records and excluding duplicates.
Here is where things are different. For Systems A and B, there is two-directional synchronization, i.e. unified records from Master Data are sent back to source systems, to update the source information.
This way, we can update these source systems with additional information.
Scenario 3 – single data source
In this approach, we can identify System C as a source system that simultaneously contains all Employee entries. System C is therefore the master data source for Employee data.
The data is transferred into the Master Data system and then propagated to Systems A and B, and the Data Warehouse.
Theoretically, all data synchronization from System C to Systems A and B could be done directly rather than through Master Data. Be aware, though, that this example covers only one specific entity. In a real Modern Data Warehouse project, there would be several dozen or several hundred of them. The more structured approach makes processing much more efficient and reduces the risk of errors.
Going deeper into master data set management: once the data is loaded into the MDM system, the next step is to unify and manage it. This happens in two ways, i.e. manually and automatically.
In an ideal scenario, data unification would happen automatically based on the developed logic. For example, Employee records from different systems are merged into one final entity, where all records receive a globally unique identifier. Afterward, the Employee entity can be consumed for processing by other systems.
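The automatic path described above can be sketched as follows: matched records are merged into one final entity, and each entity is assigned a globally unique identifier. The matching rule (exact, case-insensitive email match) and the "last non-empty value wins" merge policy are simplifying assumptions; real MDM tools apply configurable survivorship and fuzzy-matching rules.

```python
# Sketch of automatic unification with globally unique identifiers.
# The match key and merge policy are illustrative assumptions.
import uuid

def unify(records, match_key="email"):
    unified = {}
    for record in records:
        key = record[match_key].lower()
        # New match groups get a globally unique identifier.
        entity = unified.setdefault(key, {"global_id": str(uuid.uuid4())})
        # Merge attributes: last non-empty value wins.
        entity.update({k: v for k, v in record.items() if v})
    return list(unified.values())

records = [
    {"email": "jan@corp.com", "name": "Jan Kowalski", "phone": ""},
    {"email": "JAN@corp.com", "name": "Jan Kowalski", "phone": "+48 123"},
]
entities = unify(records)
print(len(entities))  # 1 entity: both records merged under one global_id
```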
Unfortunately, automated unification is not always possible. There will always be some records that need to be reviewed, and some decisions that must be taken by someone in a dedicated role.
That role is the Data Steward, who is responsible for managing particular entities. The Data Steward knows and understands particular MDM entities and their attributes very well from a business perspective, and takes care of each dictionary's final shape.
A frequent scenario where a Data Steward has to take action is when we need to do the mappings, e.g. between some objects from the source system and the master data dictionary.
Imagine a situation where you’ve built the master dictionary for your products. The master Product entity is defined within the MDM system. It contains the Products that we agreed to name in a consistent way across the enterprise.
Now, the source systems might contain Product names that are completely different. Therefore, to be able to analyze data according to the defined global Product definition, the source-system Product records need to be mapped to the global Product definitions. This should be done by a subject matter expert in the area (i.e. the Products area), that is, the Data Steward.
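The mapping step might look like this: known source names are translated to the global Product definitions, and anything unmapped is set aside for the Data Steward to review. All product names and mappings below are invented for illustration.

```python
# Sketch of mapping source-system product names to global Product
# definitions. Unmapped names are flagged for the Data Steward.
# All names and the mapping table are illustrative assumptions.

# Mapping maintained by the Data Steward: source name -> global name
MAPPING = {
    "NB-15 Pro": 'Laptop 15"',
    "Dock USB-C": "Docking Station",
}

def map_products(source_names):
    mapped, needs_review = [], []
    for name in source_names:
        if name in MAPPING:
            mapped.append(MAPPING[name])
        else:
            needs_review.append(name)  # Data Steward decides the mapping
    return mapped, needs_review

mapped, review = map_products(["NB-15 Pro", "Dock USB-C", "Unknown Gadget"])
print(review)  # ['Unknown Gadget'] goes to the Data Steward's queue
```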
How did you like this series?
And so we’ve arrived at the end of our Modern Data Warehouse journey! Starting from the concept of a Data Warehouse and why it’s important to modernize it, through the MVP phase of the project, establishing a framework for your engagement and organizing your work, right to today’s article on managing master data.
Did you enjoy this series? Is there anything else you’d like to know? Feel free to comment below and let us know your thoughts! Or, click here to get in touch with us.
Keep an eye out for our ebook on Modern Data Warehouse, coming soon! Follow us on LinkedIn to be the first to know when it’s released!
Written in collaboration with Paweł Borowiecki.