Reducing costs of Azure Databricks (FinOps in practice)
Big data is what powers companies. Lots of data mean lots of insights, which enables better decision-making. Having ...
How to run a Modern Data Warehouse project effectively?
Modern Data Warehouse development and deployment should be treated not as a single project but rather a program, which consists of multiple parts.
During the entire lifecycle, it is important to work according to a defined standard. This is why we developed Predica Data Domain Framework (PDDF), which is the focus of this article.
In a hurry? Focus on the sentences in bold font for a summary!
A Modern Data Warehouse project does not finish after the first release, it continues forever exactly like every Business Intelligence engagement.
The reason for this lies in constantly changing business requirements and an ever-evolving business environment.
Also, a Business Intelligence project touches all the organization and every department. The subsequent stages are developed and iteratively extend solution capabilities.
Imagine a team of 10 developers whose task is to integrate 10 relational database systems. Each one has achieved their goal, the systems are integrated. Therefore, from a business perspective, the requirement is met.
Nevertheless, if you look under the hood, there would be probably a lot of discrepancies as to how the solution was developed and delivered.
You could say: „So what? It works and business departments are satisfied. You are just looking for problems!”. I agree, user satisfaction is crucial, but the focus here is on the development team who deliver the solution.
A situation like this would impede their work and further development, and consequently, delivery time. Therefore, business users would get the requested solution later.
In the future, things like adding extensions, implementing updates, performing maintenance, etc., will become more complex and difficult. At some project stage, there would be no other way than to rewrite the solution.
This might be a catastrophic scenario for the project, its budget, and timeline, and it should be avoided at all costs.
To keep consistency and avoid discrepancies, you need to define a unified approach to development. If an integration of 10 relational database systems is taking place, then this is something repeatable. Some things can be automated, while others can be shared and reused.
This is not just about writing common, reusable code. It is also important to define naming conventions, folder structure, and layouts, etc. Standards should be defined from two perspectives.
This concerns the key elements of the project. An example can be a naming convention for the source systems. Throughout the project, they will be referred to multiple times at different stages. A good example is how should a data warehouse or other source systems be called and referred to, e.g. dw, dwh, or wh?
This perspective specifically concerns used services. For example, if Azure Data Factory is used to orchestrate the data pipelines, then how are the pipelines named and placed into folders? In addition, how are Azure Data Factory datasets named? How should we define linked services?
An answer to this is a proper framework. And, luckily, we have one ready-made!
Using these principles, we invented and developed the Predica Data Domain Framework. It provides standards and guidelines for building and deploying Business Intelligence solutions in general, and specifically for Data Warehouse Modernization projects.
To some extent, it also provides some generic functionalities like Testing Framework and Adoption Measurement.
The Predica Data Domain Framework (PDDF in short) can be defined as a set of rules, standards, and patterns that describe and indicate the path on how Data Domain projects should be organized, structured, and conducted.
It allows you to:
During each project, we always apply the best practices and standards in every area. For example, the infrastructure is not the main specialization of Data Domain, but we always design and build Azure architecture according to standards defined by Predica CoE.
Additionally, each new Predica employee is trained and committed to working according to these standards.
As mentioned at the beginning, a Modern Data Warehouse project is a continuous and long-term engagement. The PDDF works according to a similar principle. It also has its lifecycle and operates in line with MDW engagements.
During the MDW lifecycle, new challenges arise, e.g. new systems are integrated, and new solutions are used. For them, we need to define the new standards, new rules, and new naming conventions, and unify them with the existing framework. It is not possible to define everything in advance, but strong fundamentals give huge benefits and advantages during project development.
Here are the roles and processes which govern the framework, and, consequently, data projects.
Predica Data Domain Framework consists of three main areas, which are:
The person responsible for each of the listed areas is the Area Administrator. They lead, create visions, and plan standards that should be developed and introduced in a short- and long-term future.
There are also additional roles, such as:
A process developed 6 months ago might be obsolete or need updates. To cope with the dynamically changing situation, the PDDF includes a definition of the lifecycle process.
This way, we can introduce changes or new features to the framework in an organized and controlled manner.
Depending on the situation, the change might be small and can be introduced within 30 minutes. In more complex cases, or even building something totally from scratch, it might take several iterations to introduce it.
During dedicated meetings (elaboration workshops), new standards are designed or updated. The crucial element is to involve the right people. Except the Area Administrator, people who have experience within a specific area should be engaged in brainstorming sessions.
It is always good to look at the topic from multiple perspectives. Once the first version is ready or an update is finished, the new standard is added to Predica CoE repository and made available within the organization.
All subsequent projects are run according to the new standard. As for ongoing projects, the decision stays with Project Owners.
This is not the end of the story. The next step is the propagation of new knowledge within the organization.
To guarantee successful PDDF adoption within an organization, several steps and actions were taken. The whole process is strictly related to our organizational structure.
We put a lot of effort into building awareness of the framework. In our opinion, this is the first step to successful adoption.
PDDF is a set of rules which adapt on a daily basis and are not fixed. It can be extended, changed, or aligned to project needs. However, any changes or extensions should be applied according to the above process.
Frameworks like Predica Center of Excellence or PDDF allow us to build a unique, agile, and fast-acting team. If you want to build strong competence within Data Domain, adopting this approach is the necessary step.
The business world is continuously changing, and fast adoption is essential to stay ahead of the competition. PDDF lets our team communicate and work in a clear, unified way, to easily accomplish it.
Organizational measures are a good start, but it’s the project where the framework has the greatest impact.
It is not possible to monitor everything manually during a project. This is especially problematic for large engagement.
It might be a tedious, time-consuming, and error-prone task. Especially if a large amount of code is produced every day, and someone has to verify a large number of rules and requirements.
Therefore, to improve efficiency, we introduced a dedicated automation processes. It verifies whether standards are applied correctly.
Automation is applied during a code pull request to the repository. Appropriate policies are introduced to monitor if any rules are broken. In case of any policy breach, the code is automatically rejected.
A Testing Framework allows you to define your test code, which will be triggered by an Azure Data Factory pipeline, and log its results in a persistent SQL table.
All core functionalities are covered by the framework, but it’s perfectly possible to build custom logic on top of it, to automate data quality assessments even further.
The original implementation was included in data reloading of ADF pipelines during one of the data warehouse projects.
As a result, after each deployment, all relevant data quality test where ran fully automatically and provided easy to interpret results that where pointing directly to possible issues and errors in the updated ETL logic.
Assuming that each developer is responsible for providing tests that verify the quality of their work, the need to manually maintain a library of up-to-date testing scripts is gone with the only requirement being to assure the proper configuration of the framework’s workflow.
Another core assumption is that Tests consist of a finite number of steps, and the Test itself passes successfully only when they all finish without errors. Consider this when writing your logic, to make sure that the correct result is passed outside of your Stored Procedure or Notebook.
Tests also allow the use of Parameters which in essence are JSON files containing key/value pairs that allow the use of generic code in a different context.
For example, you can write a query that accepts a dynamic list of tables to calculate row counts. The said list can be provided for each execution within a JSON, thus allowing you to maintain a single Stored Procedure that can run your data quality checks against an endless amount of table combinations.
Of course, this example is simplified on purpose – each business case you will try to support can require a JSON parameter value in different shapes and forms.
The current version of the framework comes with a built-in Power BI report that displays an easy to digest summary of all tests run across all your environments. The example metrics include:
Adoption Measurement helps to understand how your organization utilizes the available data models and dashboards. It allows you to gain better knowledge on which reports are most popular, what is the peak usage time, which analytical tools are most used, and more.
On the other hand, you can also see which reports are not used and maybe should not be maintained anymore.
Adoption Measurement is build based on several Azure services. The solution collects user activity logs from all indicated Azure Analysis Services instances and Power BI services. The entire data history is stored and organized within a dedicated database.
On top of the mentioned data, the dedicated Azure Analysis Services model stores the most crucial and important measures and attributes. With that, power users can do a more sophisticated analysis.
However, within the Adoption Measurement solution, there is also a dedicated dashboard that provides comprehensive insight on user behavior from the three perspectives: Azure Analysis Services, Power BI Service Activities, and Power BI Service Users.
This, in essence, is the Predica Data Domain Framework. Adopting a standard like this at the outset of your Data Warehouse modernization project will help you maintain a unified standard throughout the engagement. This will help you avoid any incompatibility or inconsistency issues further down the line.
Check out the next article, where we describe step by step how to organize your project, from setting up your team to preparing your tools, to designing your infrastructure. Click Ctrl + D to bookmark this article for easy access!
Got some questions? Click here to get in touch! Or, you can check out our Data Warehouse Assessment Service here.
Written in collaboration with Paweł Borowiecki.
Read similar articles