Everything You Need To Know About Modern Data Warehouse (Part 2)

Modern Data Warehouse Architecture

How to design a modern data architecture using Azure?

Last time, we talked about Data Warehouses, what they are, and why modernizing them is important. Today, we can get down to business, and talk projects!

I will go over a typical MVP project phase and discuss some of the challenges which are likely to appear. Don’t worry – we’ll go over the solutions too.

Key points:

  • How to get started with an MVP?
  • Which services are best to migrate and store data?
  • How to deliver the information on time?
  • Useful tips for Modern Data Warehouse projects.

Our topic for this article is building a unified data source. We’ll find out how to migrate the data and subsequently manage it. Let’s get started!

The goals of the first project phase (MVP)

The average Enterprise Data Warehouse project duration is about three years. It is not possible to do everything in one go. A solution like this has to be delivered in smaller parts. The goals of the first phase (MVP) might look a bit like this:

  • Build a global unified data model (data warehouse) which embraces data from a single business area (e.g. sales, marketing, HR, or other), but allows for expansion in the future
  • Integrate data from the largest Line of Business (LoB) system (~75% of data)
  • Data should be available by a specified local time (regardless of the time zone of the raw data source)
  • Fresh raw data from LoB systems is to be loaded regularly
  • Data should be available for consumption by all employees in scope (up to 10% of the total)
  • The analytical model will contain multiple years’ worth of data
  • Design a business area overview dashboard with all security measures applied (a priority aspect), that responds in less than a few seconds.

These are pretty standard requirements, but even so, they require a lot of work. As you probably expect, they also come with some challenges. Let’s see what they look like and how to address them.

How to move several TB of data from on-premises to the cloud, and how to store them effectively? Challenge 1

One of the aims of the MVP is transferring a large amount of data from source systems’ backups to the new cloud environment. The data is required for initial processing.

In one of the cases, we had to move around 50 TB of data from the biggest LoB system to the Azure Cloud. The whole process took a few weeks. The first step was to restore Oracle data backups, i.e. 5 files of 10 TB each. We did this in an on-premises infrastructure.

The next step was to extract all data from all source tables into files. At this point, three important things had to be planned:

  • Where will the files be stored, and how will they be structured and organized?
  • How should the files be divided? How much data should each file contain?
  • What file format should data be extracted to?

Let’s go over them one by one.

Where to store the files, and how to structure & organize them?

It is important to plan globally how raw data will be stored. Usually, when building a Modern Data Warehouse on Azure, the choice is to keep files in a Data Lake or Blob storage. So, our choice was to utilize Azure Data Lake Storage Gen2 to collect and store all raw data from all source systems.

The crucial next step is to plan and design the Data Lake folder structure and layout. We need to decide how files from particular source systems should be organized, i.e. in what folders, how these folders should be partitioned, how folders and individual files should be named.

All these details are very important because they determine the ease of navigation through a Data Lake.

How should the files be divided and how much data should each of them contain?

The main factor which helps plan the right file split is data volume and its growing pace. When it comes to dictionary data, the choice is always the same, i.e. one file for all data from a dictionary table, even if a table stores 2 million records or so.

Transaction data requires more attention. It’s possible to have transaction data files that store data just for one day, one month, or one year. So far, a granularity lower than one day was not needed.

What file format should we extract the data to?

The file format has a significant impact on the final data size. One of our attempts of data extraction involved the move from Oracle to CSV files, where the data size grew by 40-50%. At this point, we decided to try different formats. Otherwise, the initial 50 TB would grow up to 75 TB.

We moved the data into the Apache Parquet format instead. This way, we managed to reduce the data size by 90%. This was a significant change and as a result, the transfer to the cloud could be done in a relatively short time.

During the data extraction process, the files were partitioned and put into respective folders according to our Data Lake design. As the final step, they were transferred to the Azure Data Lake Store using AzCopy application, which took around 8 hours. The AzCopy can reflect the copied file and folder structure in the destination. Therefore, there is no need to organize files again.

What have we learned from this challenge?

Here are the key lessons we’ve learned from this step.

Key lessons from challenge 1: Design the Data Lake carefully, Analyze the data thoroughly, Think about the data file format, Compress data before migration, Use AzCopy for efficient transfer

Lessons learned from migrating data

How to control large amounts of data? Challenge 2

The aforementioned numbers might not be impressive at all. There are many other solutions with a much larger volume of processed data, no doubt. The point is that if the scale crosses a certain threshold, you no longer care if this is 10, 100 billion records, or more.  You need to refine the applied techniques and optimize constantly.

The Microsoft Azure ecosystem offers several solutions that could be utilized to handle such volumes of data. The current state is not everything. You need to take care of the future and choose a service that you can easily scale and accommodate to new realities.

For one of the projects, we initially chose Azure Synapse Analytics (at that time it was called Azure SQL Warehouse Gen2 or ADW in short). It seemed like a good choice because it provided processing and storage engine all in one.

We adopted the ELT (Extract, Load, Transform) approach. This means we first extracted all data from an on-premises Oracle-based source system into Azure Data Lake Store Gen2 (ADLS).

Data extraction and processing in the cloud

Data extraction was coordinated by the Azure Data Factory Copy Activity process. The raw data from transactional and dictionary tables was stored in the Apache Parquet format.

Note:

Storing data in a Data Lake has several benefits. For example, when you have to integrate multiple systems or souces, you certainly need to reload data from scratch. This happens especially when structures in the model need to be changed or some fixes need to be applied. In that scenario, ADLS lets you:

  • Offload source systems (there is no need to extract anything from source systems)
  • Reload data faster. Not all transactional source systems store the whole data history. Often a large volume of backups needs to be copied and restored. This is a time-consuming operation and you do not want to repeat it. With ADLS, you can avoid it. Do it once during the initial load, and then reload data from ADLS. Such an approach is much more efficient if you do it right
  • Save costs and infrastructure resources. Restoring a backup is a resource-intensive and costly operation. Maintaining and archiving backups also costs time and resources (people and infrastructure, especially in on-premises environments where they might be limited).

Using the PolyBase feature (from ADW), the data from ADLS was loaded into staging tables on ADW. Next, the ADW did the main data processing, i.e. calculating fact tables, and mapping surrogate keys for dimensions (from Master Data system). In the end, the fact and dim data was extracted to Azure Analysis (AAS) structures.

The final data destination is a Power BI dashboard, which consumes data from the AAS model using Live Connection. Data coordination, orchestration, and processing pipelines were handled by Azure Data Factory. In this process, we did everything according to recommendations and practices from the official documentation.

Almost done? Not really!

Extracting and data processing on time required many probing attempts to extract at maximum speed and deliver data on time. For that particular project milestone, the described procedure worked as expected and was sufficient. Nevertheless, it was the integration of just one system, with a few dozen more on the way.

After the MVP phase, we decided to upgrade our architecture and apply some modifications which we will describe in detail in the next section.

What have we learned from this challenge?

Here are the main conclusions from this stage:

Key lessons from challenge 2: Be careful of how much data you try to process at once, Check which service is best for copying your data files

Lessons learned from extracting data

How to deliver data on time? Challenge 3

A crucial business requirement is usually to deliver data to a Power BI dashboard at a specific time. Therefore, the solution had to get data from the source system, calculate dimension and fact tables, and process the analytical model that the Power BI service is connected to.

One of the difficulties we encountered was the limited time window when data could be processed. In one case, the data was collected from hundreds of locations, all operating at different times. As a result, data for all of them wasn’t available simultaneously.

Want to stay in touch? Get curated insights on the hottest topics in the industry, delivered directly to your inbox every two weeks! Sign me up!

Copying data to the cloud

Azure Data Factory has a core feature of Copy Activity, which lets you copy data between source and destination. It supports around 100 different systems. It is often included in Microsoft architectures as the first component that links your on-premises environment to Azure infrastructure.

ADF is presented as the service that moves your on-premises data to the cloud. And that is completely true, however performance, i.e. the speed of data extraction and transfer, is an issue here.

ADF Copy Activity performance is one of the slowest when the source is RDBMS and the sink is Azure Data Lake Storage. But we needed to copy raw data (saved as Apache Parquet files) from Oracle to ADLS, and then using the PolyBase feature, load them to Azure SQL Data Warehouse.

Data transfer capacity comparison

Data transfer capacity by service (adapted from Microsoft materials)

Surprise, surprise, the performance was really bad.

  • Copying data from one business day directly from Oracle to Apache Parquet/CSV file took between 4-6 minutes (from 3 to 7 million records) from the largest table. Oracle was not able to extract data faster. The script was run locally on the Oracle server.
  • Copying data directly from Oracle to ADLS using Azure Data Factory with Self-hosted Integration Runtime took 3 to 4 times longer (from 16 to 24 minutes)

The time window for the daily data process is about 1 hour and 15 minutes. Therefore, copying data for one table might take up to 30% of the available time just for one business day’s worth of data.

The whole daily ETL process includes many other activities that also need to be performed. The available processing window is tight, especially if it involves data from several systems.

We made many different attempts to improve ADF’s Copy Activity performance using settings like Data integration units (DTU), degree of copy parallelism, running multiple Copy Activities simultaneously, etc.

We also checked network connectivity between Azure, and on-premises and ADF Integration Runtime VM. We gained some advantage when the Oracle driver was updated on ADF. However, the improvement was about 10-15%, which was still far away from the target minimum.

At that moment we were sure this approach wouldn’t work in a longer perspective, especially if we connected more data sources.

Another possible solution that is often practiced is extracting data from the source system to files (CSV, Apache Parquet, etc.) using the on-premises infrastructure and exposing it on a network drive or FTP server. It could ensure that data is extracted in the fastest possible manner. However, this process was disconnected and not controlled by the ADF pipeline.

The real game-changer was the switch to Azure Databricks. We rebuilt and improved the data flow pipeline. We also shortened the path which data needs to go through during the ETL/ELT process.

Most architectures recommend putting data into Data Lake first. This approach just adds an additional step in data processing and simply makes it longer.

ADLS is still a crucial component of MDW architecture, but data is saved using Azure Databricks. Simultaneously, for full reloads it is possible to gather data from ADLS rather than from a source system.

Additionally, the whole burden of data processing is moved to Azure Databricks. Azure SQL Warehouse is only used as data storage for a unified warehouse model.

Below is the final architecture of a sample MVP:

Modern Data Warehouse MVP

An example Modern Data Warehouse MVP architecture diagram (click to view full-size)

What have we learned from this challenge?

Here are our key findings from this stage:

Key lessons from challenge 3: Use Databricks for copying and processing data, Databricks loads data faster from Azure Data Lake Store

Lessons learned from data processing

Go to part 3 for more!

Let’s wrap things up for today – I’m sure you have a lot to think about. We’ll pick this up in the next article and discuss analytics and security. Press Ctrl+D to add this post to favorites – and check out part 3 here.

Got any questions about this article? Just click here to send them – we will get back to you!

Written in collaboration with Paweł Borowiecki.

Key takeaways:

  1. Take the time to plan and design your Data Lake. The way in which you name and structure your data will determine how easy it will be to use it later.
  2. Azure Data Lake Store is great for storing data, providing benefits like speeding up data reload, and lowering costs.
  3. Azure Databricks is an unmatched service for processing data in the cloud.