Everything you need to know about Modern Data Warehouse (Part 4) Data Warehouse Project Framework

How to run a Modern Data Warehouse project effectively?

Modern Data Warehouse development and deployment should be treated not as a single project but rather a program, which consists of multiple parts.

During the entire lifecycle, it is important to work according to a defined standard. This is why we developed Predica Data Domain Framework (PDDF), which is the focus of this article.

In a hurry? Focus on the sentences in bold font for a summary!

Key points
  • Why do you need a framework for a data project?
  • What elements does it consist of?
  • How does it help you during and after a project?

A Modern Data Warehouse project does not finish after the first release; like every Business Intelligence engagement, it continues indefinitely.

The reason for this lies in constantly changing business requirements and an ever-evolving business environment.

Also, a Business Intelligence project touches the whole organization and every department. Subsequent stages are developed iteratively and extend the solution's capabilities.

Why is it important to have a common standard for a data project?

Imagine a team of 10 developers whose task is to integrate 10 relational database systems. Each one achieves their goal: the systems are integrated. Therefore, from a business perspective, the requirement is met.

Nevertheless, if you look under the hood, there would probably be a lot of discrepancies in how the solution was developed and delivered.

You could say: "So what? It works and business departments are satisfied. You are just looking for problems!" I agree, user satisfaction is crucial, but the focus here is on the development team that delivers the solution.

A situation like this would impede their work and further development, and consequently extend delivery time. As a result, business users would get the requested solution later.

In the future, adding extensions, implementing updates, performing maintenance, and similar tasks will become more complex and difficult. At some stage of the project, there would be no other way than to rewrite the solution.

This might be a catastrophic scenario for the project, its budget, and timeline, and it should be avoided at all costs.

HOW TO MANAGE A DATA PROJECT PROPERLY?

To keep consistency and avoid discrepancies, you need to define a unified approach to development. If 10 relational database systems are being integrated, this is repeatable work. Some things can be automated, while others can be shared and reused.

This is not just about writing common, reusable code. It is also important to define naming conventions, folder structures, layouts, and so on. Standards should be defined from two perspectives.

Project setup

This concerns the key elements of the project. An example is a naming convention for the source systems, which will be referred to multiple times at different project stages. For instance, how should a data warehouse or other source system be called and referred to: dw, dwh, or wh?

Tools

This perspective concerns the specific services used. For example, if Azure Data Factory is used to orchestrate the data pipelines, how are the pipelines named and organized into folders? How are Azure Data Factory datasets named? How should linked services be defined?
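
To make this tangible, here is a minimal sketch of how such conventions could be captured in a single, machine-readable place. The source system abbreviations and the pl_/ds_/ls_ prefixes below are hypothetical examples used for illustration, not the actual PDDF standard.

```python
# A hypothetical, simplified naming-convention definition. The abbreviations
# and the pl_/ds_/ls_ prefixes are illustrative assumptions, not PDDF's
# actual standard.
import re

# Project setup perspective: canonical abbreviations for source systems
SOURCE_SYSTEM_ABBREVIATIONS = {
    "data warehouse": "dwh",
    "customer relationship management": "crm",
    "enterprise resource planning": "erp",
}

# Tools perspective: naming patterns for Azure Data Factory objects
ADF_NAMING_RULES = {
    "pipeline": re.compile(r"^pl_[a-z0-9_]+$"),        # e.g. pl_load_dwh_sales
    "dataset": re.compile(r"^ds_[a-z0-9_]+$"),         # e.g. ds_crm_customers
    "linked_service": re.compile(r"^ls_[a-z0-9_]+$"),  # e.g. ls_sql_dwh
}

def is_valid_name(object_type: str, name: str) -> bool:
    """Check whether an ADF object name follows the agreed convention."""
    return bool(ADF_NAMING_RULES[object_type].match(name))

print(is_valid_name("pipeline", "pl_load_dwh_sales"))  # True
print(is_valid_name("dataset", "CustomerDataset1"))    # False
```

Keeping the rules in one shared definition like this means every developer refers to the same source of truth instead of inventing their own abbreviations.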

WHICH TOOL CAN YOU USE TO HELP YOU MANAGE A DATA PROJECT?

An answer to this is a proper framework. And, luckily, we have one ready-made!

“In computer programming, an application framework consists of a software framework used by software developers to implement the standard structure of application software.”

“. . . a software framework is an abstraction in which software providing generic functionality can be selectively changed by additional user-written code, thus providing application-specific software. It provides a standard way to build and deploy applications and is a universal, reusable software environment that provides particular functionality as part of a larger software platform to facilitate development of software applications, products and solutions.”

Source: Wikipedia

Using these principles, we invented and developed the Predica Data Domain Framework. It provides standards and guidelines for building and deploying Business Intelligence solutions in general, and specifically for Data Warehouse Modernization projects.

To some extent, it also provides generic functionalities, like the Testing Framework and Adoption Measurement.

What is Predica Data Domain Framework and why is it useful?

The Predica Data Domain Framework (PDDF for short) can be defined as a set of rules, standards, and patterns that describe how Data Domain projects should be organized, structured, and conducted.

It allows you to:

  • Organize the project in a logical, clean and consistent way
  • Take a coherent and consistent approach and actions
  • Reuse repeatable patterns (instead of reinventing the wheel)
  • Increase and assure high quality
  • Speed up the delivery process
  • Stand out from the crowd.

Note: the Predica Data Domain Framework is part of a wider concept at the Predica Group level. PDDF is a subset of the Predica Center of Excellence (Predica CoE), where a unified approach and standards are elaborated for each of our activity areas, i.e. Apps, SecureInfra, DevOps, and Data.

During each project, we always apply best practices and standards in every area. For example, infrastructure is not the main specialization of the Data Domain, but we always design and build Azure architecture according to the standards defined by the Predica CoE.

Additionally, each new Predica employee is trained and committed to working according to these standards.

A screenshot from the Predica Center of Excellence notebook. Our Center of Excellence collects best practices and standards for project management.

The lifecycle of a data warehouse project

As mentioned at the beginning, a Modern Data Warehouse project is a continuous and long-term engagement. The PDDF works according to a similar principle. It also has its lifecycle and operates in line with MDW engagements.

During the MDW lifecycle, new challenges arise: new systems are integrated, and new solutions are used. For these, we need to define new standards, rules, and naming conventions, and unify them with the existing framework. It is not possible to define everything in advance, but strong fundamentals provide huge benefits during project development.

Here are the roles and processes which govern the framework, and, consequently, data projects.

Roles

Predica Data Domain Framework consists of three main areas, which are:

  • DataOps – Modern Warehouse and Big Data
  • MLOps – Machine Learning and Artificial Intelligence
  • VAOps – Visualization and analytics layer.

The person responsible for each of the listed areas is the Area Administrator. They lead the area, create its vision, and plan the standards that should be developed and introduced in the short and long term.

There are also additional roles, such as:

  • PDDF Global Administrator – responsible for general maintenance and development of PDDF, not related to any specific PDDF area
  • Area SME (Subject Matter Expert) – a dedicated person (or people) responsible for a specific topic (e.g. Database Project standards), cooperating closely with Area Administrators.

Process

A process developed 6 months ago might be obsolete or need updates. To cope with the dynamically changing situation, the PDDF includes a definition of the lifecycle process.

This way, we can introduce changes or new features to the framework in an organized and controlled manner.

Depending on the situation, the change might be small and can be introduced within 30 minutes. In more complex cases, or when building something completely from scratch, it might take several iterations to introduce it.

During dedicated meetings (elaboration workshops), new standards are designed or existing ones are updated. The crucial element is to involve the right people. Besides the Area Administrator, people who have experience within the specific area should be engaged in the brainstorming sessions.

It is always good to look at the topic from multiple perspectives. Once the first version is ready or an update is finished, the new standard is added to the Predica CoE repository and made available within the organization.

All subsequent projects are run according to the new standard. As for ongoing projects, the decision stays with Project Owners.

This is not the end of the story. The next step is the propagation of new knowledge within the organization.

HOW TO ADOPT THE DATA DOMAIN FRAMEWORK WITHIN THE ORGANIZATION?

To guarantee successful PDDF adoption within an organization, several steps and actions were taken. The whole process is strictly related to our organizational structure.

We put a lot of effort into building awareness of the framework. In our opinion, this is the first step to successful adoption.

The key steps of Predica Data Domain Framework adoption

PDDF is a living set of rules, not a fixed one; it adapts on a daily basis. It can be extended, changed, or aligned to project needs. However, any changes or extensions should be applied according to the above process.

Frameworks like the Predica Center of Excellence or PDDF allow us to build a unique, agile, and fast-acting team. If you want to build strong competence within the Data Domain, adopting this approach is a necessary step.

The business world is continuously changing, and adapting fast is essential to stay ahead of the competition. PDDF lets our team communicate and work in a clear, unified way, which makes this much easier to accomplish.

HOW TO ADOPT THE DATA DOMAIN FRAMEWORK WITHIN THE PROJECT?

Organizational measures are a good start, but it’s the project where the framework has the greatest impact.

It is not possible to monitor everything manually during a project. This is especially problematic for large engagements.

It would be a tedious, time-consuming, and error-prone task, especially if a large amount of code is produced every day and someone has to verify a large number of rules and requirements.

Therefore, to improve efficiency, we introduced a dedicated automation process. It verifies whether the standards are applied correctly.

Automation is applied during a code pull request to the repository. Appropriate policies are introduced to monitor if any rules are broken. In case of any policy breach, the code is automatically rejected.
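
As an illustration, here is a hedged sketch of what such a check could look like: a small Python script, run as a pull-request validation step, that scans exported Azure Data Factory pipeline definitions and fails the build when a name breaks the convention. The folder layout and the pl_ prefix are assumptions made for this example, not the actual PDDF policy.

```python
# Hypothetical pull-request check: scan exported Azure Data Factory pipeline
# definitions and fail the validation build if any pipeline name breaks the
# naming convention. The "pipeline" folder and the pl_ prefix are assumptions.
import json
import re
import sys
from pathlib import Path

PIPELINE_NAME_PATTERN = re.compile(r"^pl_[a-z0-9_]+$")

def check_pipelines(repo_root: str) -> list:
    """Return a list of violation messages for badly named pipelines."""
    violations = []
    for path in Path(repo_root, "pipeline").glob("*.json"):
        name = json.loads(path.read_text())["name"]
        if not PIPELINE_NAME_PATTERN.match(name):
            violations.append(f"{path.name}: pipeline name '{name}' breaks the convention")
    return violations

if __name__ == "__main__":
    problems = check_pipelines(sys.argv[1] if len(sys.argv) > 1 else ".")
    for problem in problems:
        print(problem)
    # A non-zero exit code fails the pull-request validation build,
    # so non-compliant code is rejected automatically.
    sys.exit(1 if problems else 0)
```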

TESTING FRAMEWORK AS PART OF PDDF

A Testing Framework allows you to define your test code, which will be triggered by an Azure Data Factory pipeline, and log its results in a persistent SQL table.

All core functionalities are covered by the framework, but it’s perfectly possible to build custom logic on top of it, to automate data quality assessments even further.
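
For illustration only, here is a minimal sketch of how a single test outcome could be written to such a persistent SQL table from Python. The table name and columns are hypothetical; the actual framework logs results through its Azure Data Factory workflow.

```python
# Minimal illustration of logging one test outcome to a persistent SQL table.
# The table name and columns (dbo.TestExecutionLog, TestName, Environment,
# ExecutedAtUtc, Outcome, ErrorMessage) are hypothetical, not the framework's
# real schema.
from datetime import datetime, timezone
import pyodbc

def log_test_result(connection_string, test_name, environment, passed, error=""):
    """Insert a single test execution record into the log table."""
    conn = pyodbc.connect(connection_string)
    try:
        cursor = conn.cursor()
        cursor.execute(
            "INSERT INTO dbo.TestExecutionLog "
            "(TestName, Environment, ExecutedAtUtc, Outcome, ErrorMessage) "
            "VALUES (?, ?, ?, ?, ?)",
            (test_name, environment, datetime.now(timezone.utc),
             "Passed" if passed else "Failed", error),
        )
        conn.commit()
    finally:
        conn.close()
```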

The original implementation was included in the data reloading ADF pipelines during one of our data warehouse projects.

As a result, after each deployment, all relevant data quality tests were run fully automatically and provided easy-to-interpret results that pointed directly to possible issues and errors in the updated ETL logic.

A Testing Framework provides an easy way to execute both generic and custom data quality tests

What else can a Testing Framework be useful for?

Assuming that each developer is responsible for providing tests that verify the quality of their work, there is no need to manually maintain a library of up-to-date testing scripts; the only requirement is to ensure the proper configuration of the framework's workflow.

Another core assumption is that Tests consist of a finite number of steps, and a Test passes only when all of them finish without errors. Consider this when writing your logic, to make sure that the correct result is passed outside of your Stored Procedure or Notebook.

Tests also allow the use of Parameters, which are, in essence, JSON files containing key/value pairs that allow generic code to be reused in different contexts.

For example, you can write a query that accepts a dynamic list of tables and calculates their row counts. The list can be provided for each execution within a JSON, allowing you to maintain a single Stored Procedure that runs your data quality checks against any number of table combinations.

Of course, this example is simplified on purpose – each business case you try to support may require JSON parameter values of a different shape and form.
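
To make the pattern concrete, here is a hedged sketch of the same idea. The real framework drives a single Stored Procedure; this Python version only illustrates how a JSON parameter can steer a generic row-count check. The parameter shape, table names, and threshold are assumptions for the example.

```python
# Hypothetical illustration of a generic, JSON-parameterized row-count test.
# The parameter shape, table names, and minimum_rows threshold are assumptions.
import json
import pyodbc

PARAMETER_JSON = """
{
    "tables": ["dbo.DimCustomer", "dbo.FactSales"],
    "minimum_rows": 1
}
"""

def run_row_count_test(connection_string, parameter_json):
    """Return True when every listed table meets the minimum row count."""
    params = json.loads(parameter_json)
    conn = pyodbc.connect(connection_string)
    try:
        cursor = conn.cursor()
        all_passed = True
        for table in params["tables"]:
            # Table names cannot be bound as query parameters, so the row
            # count is read from catalog metadata, with the name passed to
            # OBJECT_ID as a regular parameter instead.
            cursor.execute(
                "SELECT SUM(row_count) FROM sys.dm_db_partition_stats "
                "WHERE object_id = OBJECT_ID(?) AND index_id IN (0, 1)",
                (table,),
            )
            row_count = cursor.fetchone()[0] or 0
            passed = row_count >= params["minimum_rows"]
            all_passed = all_passed and passed
            print(f"{table}: {row_count} rows -> {'Passed' if passed else 'Failed'}")
        return all_passed
    finally:
        conn.close()
```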

How to check the results of the Tests?

The current version of the framework comes with a built-in Power BI report that displays an easy-to-digest summary of all tests run across all your environments. The example metrics include:

  • A summary of your framework's implementation: the total number of defined tests, the average execution time, the count of logged executions, and the percentage of tests that passed their success criteria
  • The split of test outcomes across environments and dates, and a summarized view per test object (Stored Procedure, Databricks notebook)
  • A simplified version of the error log with descriptive information on what failed and why. This should point you to the issues at hand.

ADOPTION MEASUREMENT AS PART OF PDDF

Adoption Measurement helps you understand how your organization utilizes the available data models and dashboards. It gives you better insight into which reports are most popular, what the peak usage time is, which analytical tools are used most, and more.

On the other hand, you can also see which reports are not used and maybe should not be maintained anymore.

Adoption Measurement gives you an easy insight into metrics that are the most important on a daily basis

Adoption Measurement is built on several Azure services. The solution collects user activity logs from all indicated Azure Analysis Services instances and the Power BI service. The entire data history is stored and organized within a dedicated database.

On top of this data, a dedicated Azure Analysis Services model holds the most crucial measures and attributes, so power users can perform more sophisticated analysis.

The Adoption Measurement solution also includes a dedicated dashboard that provides comprehensive insight into user behavior from three perspectives: Azure Analysis Services, Power BI Service Activities, and Power BI Service Users.
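
For illustration, here is a minimal Python sketch of how Power BI user activity logs can be pulled with the Power BI admin activity events REST API, one building block such a solution can rely on. Token acquisition and the exact collection pipeline used by Adoption Measurement are out of scope; assume an access token with the required admin permissions is already available.

```python
# Minimal sketch: pull one day of Power BI user activity events through the
# admin REST API. Assumes an Azure AD access token with the required Power BI
# admin permissions is already available; the continuation link is followed
# until the whole day has been read.
import requests

ACTIVITY_EVENTS_URL = "https://api.powerbi.com/v1.0/myorg/admin/activityevents"

def get_activity_events(access_token, day):
    """Return all Power BI activity events for one UTC day (YYYY-MM-DD)."""
    headers = {"Authorization": f"Bearer {access_token}"}
    params = {
        "startDateTime": f"'{day}T00:00:00.000Z'",
        "endDateTime": f"'{day}T23:59:59.999Z'",
    }
    events = []
    response = requests.get(ACTIVITY_EVENTS_URL, headers=headers, params=params)
    response.raise_for_status()
    payload = response.json()
    events.extend(payload.get("activityEventEntities", []))
    while payload.get("continuationUri"):
        response = requests.get(payload["continuationUri"], headers=headers)
        response.raise_for_status()
        payload = response.json()
        events.extend(payload.get("activityEventEntities", []))
    return events
```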

Summary

This, in essence, is the Predica Data Domain Framework. Adopting a standard like this at the outset of your Data Warehouse modernization project will help you maintain a unified standard throughout the engagement. This will help you avoid any incompatibility or inconsistency issues further down the line.

Check out the next article, where we describe step by step how to organize your project, from setting up your team to preparing your tools and designing your infrastructure. Press Ctrl + D to bookmark this article for easy access!

Got some questions? Click here to get in touch! Or, you can check out our Data Warehouse Assessment Service here.

Written in collaboration with Paweł Borowiecki.

Key takeaways

  1. You need to manage your data project according to a single standard. It is a long-term, iterative engagement, so if you have inconsistencies among different parts, it will cause issues, and, in the worst-case scenario, you might need to start over.
  2. Our framework for data projects encompasses a range of roles, processes, and tools to help you manage your project. This includes procedures for implementing it within the organization and the project and monitoring project effectiveness after deployment.
  3. The framework ensures that everyone follows the same best practices and processes in their work, so even if you have separate teams, they share a common language. This way, there are no discrepancies between different parts of the project, so all the components connect seamlessly.

See other articles in our Modern Data Warehouse series:

  1. Introduction to Modern Data Warehouse
  2. How to design a Modern Data Warehouse architecture
  3. How to use the cloud to deliver actionable business insights
  4. How to keep development under control
  5. How to incorporate Master Data Management
