Last time, we talked about Data Warehouses, what they are, and why modernizing them is important. Today, we can get down to business, and talk projects!
I will go over a typical MVP project phase and discuss some of the challenges which are likely to appear. Don’t worry – we’ll go over the solutions too.
Our topic for this article is building a unified data source. We’ll find out how to migrate the data and subsequently manage it. Let’s get started!
The average Enterprise Data Warehouse project duration is about three years. It is not possible to do everything in one go. A solution like this has to be delivered in smaller parts. The goals of the first phase (MVP) might look a bit like this:
These are pretty standard requirements, but even so, they require a lot of work. As you probably expect, they also come with some challenges. Let’s see what they look like and how to address them.
One of the aims of the MVP is to transfer a large amount of data from source systems’ backups to the new cloud environment. The data is required for initial processing.
In one case, we had to move around 50 TB of data from the biggest LoB system to the Azure Cloud. The whole process took a few weeks. The first step was to restore Oracle data backups, i.e. 5 files of 10 TB each. We did this on the on-premises infrastructure.
The next step was to extract all data from all source tables into files. At this point, three important things had to be planned:
Let’s go over them one by one.
It is important to plan globally how raw data will be stored. Usually, when building a Modern Data Warehouse on Azure, the choice is to keep files in a Data Lake or Blob storage. So, our choice was to utilize Azure Data Lake Storage Gen2 to collect and store all raw data from all source systems.
The crucial next step is to plan and design the Data Lake folder structure and layout. We need to decide how files from particular source systems should be organized, i.e. in what folders, how these folders should be partitioned, how folders and individual files should be named.
All these details are very important because they determine the ease of navigation through a Data Lake.
The main factors in planning the right file split are data volume and its growth rate. For dictionary data, the choice is always the same: one file for all data from a dictionary table, even if the table stores around 2 million records.
Transaction data requires more attention. It’s possible to have transaction data files that store data just for one day, one month, or one year. So far, a granularity lower than one day was not needed.
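To make the layout concrete, here is a minimal sketch of how such a path convention could be encoded; the folder names, the `raw/` prefix, and the `raw_path` helper are hypothetical illustrations, not the project's actual naming scheme:

```python
from datetime import date
from typing import Optional

def raw_path(system: str, table: str, day: Optional[date] = None) -> str:
    """Build a Data Lake path: dictionary tables get a single file,
    transaction tables get one folder per business day."""
    base = f"raw/{system}/{table}"
    if day is None:
        # dictionary table: one file holds all rows, regardless of size
        return f"{base}/{table}.parquet"
    # transaction table: daily partitioning keeps navigation and reloads simple
    return f"{base}/{day.year}/{day.month:02d}/{day.day:02d}/{table}.parquet"

print(raw_path("erp", "currencies"))
print(raw_path("erp", "orders", date(2020, 3, 15)))
```

A convention like this makes the partition granularity explicit in the path itself, so a daily reload only needs to touch one folder.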
The file format has a significant impact on the final data size. One of our extraction attempts exported the Oracle data to CSV files, which increased the data size by 40-50%. At that point, we decided to try different formats; otherwise, the initial 50 TB would have grown to about 75 TB.
We moved the data into the Apache Parquet format instead. This way, we managed to reduce the data size by 90%. This was a significant change and as a result, the transfer to the cloud could be done in a relatively short time.
During the data extraction process, the files were partitioned and put into the respective folders according to our Data Lake design. As the final step, they were transferred to Azure Data Lake Storage using the AzCopy application, which took around 8 hours. AzCopy can mirror the copied file and folder structure at the destination, so there is no need to reorganize the files afterwards.
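For reference, a transfer like this boils down to a single recursive AzCopy invocation; the sketch below wraps it in Python, with a hypothetical local folder, storage account, and SAS token placeholder:

```python
import subprocess

# Hypothetical local export folder and ADLS Gen2 destination (SAS token elided).
source = "/data/export/raw"
destination = "https://mydatalake.dfs.core.windows.net/raw?<SAS>"

# --recursive makes AzCopy mirror the whole local folder hierarchy at the
# destination, so the Data Lake layout prepared on-premises is preserved.
cmd = ["azcopy", "copy", source, destination, "--recursive"]
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to run the actual transfer
```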
Here are the key lessons we’ve learned from this step.
The aforementioned numbers might not seem impressive. Many other solutions process far larger data volumes, no doubt. The point is that once the scale crosses a certain threshold, it no longer matters whether you are dealing with 10 billion records, 100 billion, or more: you need to keep refining the applied techniques and optimizing constantly.
The Microsoft Azure ecosystem offers several services that can handle such volumes of data. The current state is not everything, though: you also need to plan for the future and choose a service that you can easily scale and adapt to new requirements.
For one of the projects, we initially chose Azure Synapse Analytics (at the time called Azure SQL Data Warehouse Gen2, or ADW for short). It seemed like a good choice because it provided a processing and storage engine in one.
We adopted the ELT (Extract, Load, Transform) approach. This means we first extracted all data from an on-premises Oracle-based source system into Azure Data Lake Store Gen2 (ADLS).
Data extraction was coordinated by the Azure Data Factory Copy Activity process. The raw data from transactional and dictionary tables was stored in the Apache Parquet format.
Storing data in a Data Lake has several benefits. For example, when you integrate multiple systems or sources, you will sooner or later need to reload data from scratch, especially when structures in the model change or fixes need to be applied. In that scenario, ADLS lets you:
Using the PolyBase feature of ADW, the data from ADLS was loaded into staging tables on ADW. Next, ADW did the main data processing, i.e. calculating fact tables and mapping surrogate keys for dimensions (from the Master Data system). Finally, the fact and dimension data was loaded into Azure Analysis Services (AAS) structures.
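The surrogate-key mapping step can be illustrated with a small pandas sketch. The table names, keys, and values below are invented for illustration; in the real pipeline this join ran inside ADW, but the logic is the same: replace the source's business key with the warehouse surrogate key while building the fact table:

```python
import pandas as pd

# Hypothetical dimension with surrogate keys assigned in the warehouse;
# store_code is the business key coming from the Master Data system.
dim_store = pd.DataFrame({
    "store_sk": [1, 2, 3],
    "store_code": ["WAW", "KRK", "GDA"],
})

# Raw fact rows still carry the source business key.
fact_raw = pd.DataFrame({
    "store_code": ["KRK", "WAW", "KRK"],
    "amount": [10.0, 25.5, 7.0],
})

# Swap the business key for the surrogate key, as done when calculating facts.
fact = fact_raw.merge(dim_store, on="store_code", how="left")[["store_sk", "amount"]]
print(fact)
```

A left join is used deliberately: rows whose business key is missing from the dimension survive with a null surrogate key, which makes mapping gaps visible instead of silently dropping data.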
The final data destination is a Power BI dashboard, which consumes data from the AAS model using Live Connection. Data coordination, orchestration, and processing pipelines were handled by Azure Data Factory. In this process, we did everything according to recommendations and practices from the official documentation.
Extracting and processing data on time required many trial runs before we reached the maximum extraction speed. For that particular project milestone, the described procedure worked as expected and was sufficient. Nevertheless, this was the integration of just one system, with a few dozen more on the way.
After the MVP phase, we decided to upgrade our architecture and apply some modifications which we will describe in detail in the next section.
Here are the main conclusions from this stage:
A crucial business requirement is usually to deliver data to a Power BI dashboard at a specific time. Therefore, the solution had to get data from the source system, calculate dimension and fact tables, and process the analytical model that the Power BI service is connected to.
One of the difficulties we encountered was the limited time window when data could be processed. In one case, the data was collected from hundreds of locations, all operating at different times. As a result, data for all of them wasn’t available simultaneously.
Copy Activity is a core feature of Azure Data Factory that lets you copy data between a source and a destination; it supports around 100 different systems. It is often included in Microsoft architectures as the first component linking your on-premises environment to the Azure infrastructure.
ADF is presented as the service that moves your on-premises data to the cloud. That is completely true; however, performance, i.e. the speed of data extraction and transfer, can be an issue.
ADF Copy Activity is at its slowest when the source is an RDBMS and the sink is Azure Data Lake Storage. That was exactly our path: we needed to copy raw data (saved as Apache Parquet files) from Oracle to ADLS and then, using the PolyBase feature, load it into Azure SQL Data Warehouse.
Surprise, surprise, the performance was really bad.
The time window for the daily data processing is about 1 hour and 15 minutes, so copying a single business day's worth of data for just one table could take up to 30% of the available time.
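Putting those two figures together shows how little room that leaves:

```python
window_min = 75        # daily processing window: 1 h 15 min
copy_share = 0.30      # worst-case share of the window taken by one table's copy
copy_min = window_min * copy_share
print(copy_min)        # 22.5 minutes for a single table, a single day of data
```

Roughly 22 of the 75 minutes gone on one copy, before any transformation work has started.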
The whole daily ETL process includes many other activities that also need to be performed. The available processing window is tight, especially if it involves data from several systems.
We made many attempts to improve ADF Copy Activity performance, tuning settings such as Data Integration Units (DIU) and the degree of copy parallelism, and running multiple Copy Activities simultaneously.
We also checked the network connectivity between Azure, the on-premises environment, and the ADF Integration Runtime VM. We gained some speed when the Oracle driver on ADF was updated; however, the improvement was only about 10-15%, still far from the target minimum.
At that moment, we were sure this approach wouldn't work in the long run, especially once we connected more data sources.
Another commonly used solution is extracting data from the source system to files (CSV, Apache Parquet, etc.) using the on-premises infrastructure and exposing them on a network drive or FTP server. This ensures that data is extracted as fast as possible; however, the process would be disconnected from, and not controlled by, the ADF pipeline.
The real game-changer was the switch to Azure Databricks. We rebuilt and improved the data flow pipeline. We also shortened the path which data needs to go through during the ETL/ELT process.
Most architectures recommend copying data into the Data Lake first. In our pipeline, that approach just added an extra step to data processing and made it longer.
ADLS is still a crucial component of the MDW architecture, but the data is now saved there by Azure Databricks. At the same time, full reloads can read data from ADLS rather than from the source system.
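That dispatch can be sketched in a few lines; the function name, container, and connection strings below are hypothetical placeholders, not the project's actual configuration:

```python
def read_location(table: str, full_reload: bool) -> str:
    """Decide where the load reads from: incremental loads hit the source
    system, full reloads replay raw Parquet already landed in ADLS."""
    if full_reload:
        # re-read the raw layer instead of hammering the source system
        return f"abfss://raw@mydatalake.dfs.core.windows.net/erp/{table}"
    # incremental load: go to the source database directly
    return f"oracle://erp-host:1521/ERP/{table}"

print(read_location("orders", full_reload=True))
print(read_location("orders", full_reload=False))
```

Keeping this choice in one place means a model fix or structural change never requires touching the source system again: the raw files in ADLS are the replayable source of truth.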
Additionally, the whole burden of data processing moved to Azure Databricks; Azure SQL Data Warehouse is now used only as storage for the unified warehouse model.
Below is the final architecture of a sample MVP:
Here are our key findings from this stage:
Let’s wrap things up for today – I’m sure you have a lot to think about. We’ll pick this up in the next article and discuss analytics and security. Press Ctrl+D to add this post to favorites – and check out part 3 here.
Got any questions about this article? Just click here to send them – we will get back to you!
Written in collaboration with Paweł Borowiecki.