Welcome to the third part of our data journey. Now you know what a Modern Data Warehouse is and how to successfully migrate data to the cloud, we can talk about getting our insights!
We will continue going through a typical Modern Data Warehouse project. If you missed out on the last part, get up to speed here.
Today we will find out how to analyze data, create user-friendly visualizations, and keep data safe for your users. Let’s begin!
Once the dimensions and facts are calculated, data is exposed for analysis using a dedicated service, such as Azure Analysis Services.
AAS provides additional analytical features that let users slice and dice data from different contexts, define aggregation measures, play with time intelligence, and look at the data from a hierarchical perspective.
In general, analytical databases form another abstraction layer that enhances your data analytics and reporting capabilities.
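To make "slice and dice", aggregation measures, and hierarchies concrete, here is a minimal sketch using pandas as a stand-in for an analytical model. The table, column, and hierarchy names are illustrative, not taken from the project described here.

```python
import pandas as pd

# Hypothetical fact table with a simple region -> store hierarchy
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "store":  ["N1", "N2", "S1", "S1"],
    "year":   [2023, 2024, 2023, 2024],
    "amount": [100.0, 150.0, 80.0, 120.0],
})

# "Slice": restrict the data to one member of a dimension
north = sales[sales["region"] == "North"]

# Aggregation measure: total sales at one level of the hierarchy
by_region = sales.groupby("region")["amount"].sum()

# Simplified time intelligence: year-over-year growth
by_year = sales.groupby("year")["amount"].sum()
yoy = by_year.pct_change()

print(by_region["North"])  # 250.0
```

A real analytical service evaluates such measures on demand against a compressed in-memory model, but the operations users perform map closely onto these three patterns.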
When it comes to the available data storage services on the Azure cloud, most of them have a sufficient amount of available space. For example, Azure SQL Database Hyperscale can store up to 100 TB of data, and Azure Synapse Analytics up to 1 PB. In the case of Azure Analysis Services, the biggest available service tier can hold up to 400 GB.
The important thing here is that you should leave some memory free (about 20-25%); otherwise, data processing will fail. Therefore, the usable memory is effectively around 300 GB. As a result, we had yet another challenge: how to fit several terabytes of data into ~0.3 TB of space.
The answer turned out to be surprisingly simple: removing one unnecessary column reduced the model size by 70%.
The implemented model stores 5 billion records in 200 GB of storage. That record count may not sound tremendous, yet without a few adjustments the data would not have fit even within 1 TB. Much of the credit goes to the VertiPaq engine, which compresses data at a high ratio.
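Why would dropping a single column shrink a model by 70%? Column stores like VertiPaq dictionary-encode each column, so a high-cardinality column (for example, a unique ID per row) compresses far worse than the rest. The sketch below illustrates the effect with pandas; the column names are hypothetical and the "category" dtype only mimics dictionary encoding, it is not VertiPaq.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

df = pd.DataFrame({
    # Low-cardinality column: dictionary-encodes well
    "store": pd.Series(rng.integers(0, 50, n).astype(str), dtype="category"),
    "amount": rng.integers(0, 1000, n).astype("int32"),
    # Unique per-row ID: every value is distinct, so it barely compresses
    "transaction_id": [f"txn-{i:09d}" for i in range(n)],
})

with_id = df.memory_usage(deep=True).sum()
without_id = df.drop(columns="transaction_id").memory_usage(deep=True).sum()
print(f"size reduction from dropping the ID column: {1 - without_id / with_id:.0%}")
```

If a column is not needed for analysis, checking its cardinality is usually the first place to look when a model is too large.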
However, this is not everything. When it comes to very large analytical models, you need to analyze the problem from a wider perspective. If you successfully fit the data inside the model, you are only halfway there. Good query performance is another milestone that has to be reached.
Here are the main takeaways from this stage.
Fast and responsive reports are probably the most crucial element of the whole Modern Data Warehouse engagement. This is the major connection point where the users gain business insights from all the data collected.
Anything behind reports and dashboards is hardly visible, yet it probably consumes 90-95% of the time and effort to provide and display results. However, it is the remaining 5-10% that determines the success or failure of a project.
If your reports contain errors or do not refresh within a few seconds, your project will fail and 100% of your effort will land in the trash. Users will not be interested and will go back to the old way of reporting, however old-fashioned it might be.
Model optimization is key
A large amount of data always leaves quite a mark on query performance. If the model is not optimized and measure definitions are poor, query times may grow dramatically as the data volume increases.
However, a well-designed model will not have any significant impact on performance even if data grows from 2 to 4, to 8 billion records.
Once the model is done, it might still need additional improvements to produce a fast and responsive report. Each selected visualization and related query might require additional analysis and optimization, or changes at a model level.
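One common model-level optimization behind this kind of scaling is pre-aggregation: a query answered from an aggregate table scans dimension-sized data instead of the full fact table, so response time stays roughly flat as facts grow. The sketch below illustrates the idea in pandas; the table and function names are illustrative, not the project's actual implementation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
facts = pd.DataFrame({
    "store": rng.integers(0, 100, 1_000_000),
    "amount": rng.random(1_000_000),
})

# Naive approach: every report query scans all fact rows
def total_for_store_naive(store: int) -> float:
    return float(facts.loc[facts["store"] == store, "amount"].sum())

# Optimized approach: aggregate once, then each query is a cheap lookup
pre_agg = facts.groupby("store")["amount"].sum()

def total_for_store_fast(store: int) -> float:
    return float(pre_agg[store])

# Both answer the same question; only the cost per query differs
assert abs(total_for_store_naive(7) - total_for_store_fast(7)) < 1e-6
```

Doubling the number of fact rows doubles the cost of the naive query, while the lookup against the aggregate stays the same size, which is why a well-designed model can absorb data growth gracefully.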
What is also important is how many visualizations you have and how they are organized. Sometimes it is better to combine multiple visualizations into one. This reduces the number of queries sent to the model and shortens the time needed to render the report.
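As a rough illustration of combining visuals, compare several single-measure queries against one query that returns all the measures at once. This is a pandas sketch with made-up names, not the actual report queries.

```python
import pandas as pd

sales = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "amount": [10.0, 20.0, 30.0, 40.0],
})

# Two visuals, two separate queries against the model:
totals = sales.groupby("month")["amount"].sum()
averages = sales.groupby("month")["amount"].mean()

# One combined visual, one query returning both measures at once:
combined = sales.groupby("month")["amount"].agg(total="sum", average="mean")

print(combined.loc["Jan", "total"])  # 30.0
```

Each visual on a dashboard typically issues its own query, so merging two visuals into one halves the query load for that part of the page.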
In one of the cases, the users needed Power BI dashboards with a live connection to Azure Analysis Services. The reports were used by people at every organizational level, starting with employees located at a particular location, all the way up to top management. In the end, they were utilized by several thousands of users daily.
The expectation was to get the reports refreshed within 3 seconds after an interaction, such as changing a filter value. Following the tips above, we were able to achieve this result even though the queried model contained 5 billion rows.
Here are our key findings from this step.
It seems that everything is done. Data flow pipelines are stable and process data at set time intervals. Insights are available on time and can be consumed from dedicated reports or through analytical models.
But there is one more important thing to do. We need to grant access to this information for a large number of users. Moreover, access levels should be different, depending on the role within the organization.
In one of our cases, the daily number of active users stood at several thousand. The expectation was to grant access to data based on the organization’s hierarchical structure. Each user could have a unique set of requirements.
Altogether, there were around 500,000 different security combinations that needed to be applied. More importantly, with over 25,000 employees, the company had a lot of personnel changes. Some people would leave the company, others would join it, someone would be promoted, etc.
Security measures needed to take these constant changes into account. They had to be reflected in the security model, and all adjustments had to happen automatically.
Creating dedicated AD groups for this many cases is not a good idea. There are external solutions that could handle and automate them, but we went for a more convenient option.
Our solution was to implement row-level security in the analytical model, combined with the organizational structure. Security is driven by the place and position to which a particular employee was assigned. We used the central employee database as a data source.
As a result, security requirements are evaluated automatically, based on the changes within the organization. We also made provisions for manually added exceptions, where different security rules were needed. Nevertheless, everything is managed and driven from one central place.
Using this approach, every employee has access to the information they need. An employee at a local branch sees data related to that branch. District Managers can see all data from their districts, and Brand Managers see all data related to their brand.
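The core of such a row-level security scheme can be sketched as a lookup against an assignment table followed by a filter on the matching scope column. All table, column, and user names below are hypothetical; the real implementation lives in the analytical model's security rules, not in application code.

```python
import pandas as pd

# Central employee database: who is assigned where, and at what scope
assignments = pd.DataFrame({
    "user":  ["alice", "bob", "carol"],
    "scope": ["branch", "district", "brand"],
    "value": ["B-001", "D-01", "Acme"],
})

# Fact table carrying the columns the scopes filter on
sales = pd.DataFrame({
    "branch":   ["B-001", "B-002", "B-003"],
    "district": ["D-01", "D-01", "D-02"],
    "brand":    ["Acme", "Acme", "Other"],
    "amount":   [100, 200, 300],
})

def visible_rows(user: str) -> pd.DataFrame:
    """Filter the fact table to the rows a user's assignment allows."""
    row = assignments.loc[assignments["user"] == user].iloc[0]
    return sales[sales[row["scope"]] == row["value"]]

print(len(visible_rows("alice")))  # branch-level user sees one branch
print(len(visible_rows("bob")))    # district manager sees the whole district
```

Because the filter is derived from the assignment table rather than hard-coded per user, personnel changes flow through automatically: updating a single row in the central employee database changes what that person can see.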
Here are our conclusions from this final stage.
Don’t miss part 4!
And that’s a wrap on this part of our journey! Now you know how data is analyzed in an analytical model, how to build fast and responsive reports, and how to keep data secure for thousands of users with different access levels.
Click here for part 4, where we go over a Data Domain Framework, to show you how you can make managing data projects much easier.
Got any questions? Just click here to send them – we will get back to you!
Written in collaboration with Paweł Borowiecki.