7 myths about analyzing data in the cloud

Will machine learning solve all your problems? Is AI the future of your business? Maybe. Or maybe not. In this article, I will share with you seven myths and misconceptions about analyzing data in the cloud.

For every myth, I will show you what reality looks like and how to tackle the problem. These tips will help you to be more effective in planning your projects.

Let the myth-busting begin!

Key points:

  • What common challenges arise when we want to analyze our data?
  • How to overcome these challenges to get reliable results?

Myth 1. Data can be easily transferred to the cloud

REALITY

The reality is quite different. One obstacle is that many people still distrust the cloud and don’t want to put their data “somewhere on the web”. The key factor in trusting the cloud seems to be the location of the data center.

Additionally, there is a lot of controversy about the transfer of personal data. Financial institutions are unwilling to keep sensitive information in the cloud.

Another known problem is the dispersion of data sources. Different formats and different owners of the data add that little extra bit to the complexity of the migration process.


SOLUTION

Without the transfer being settled, we can’t even think about advanced analytics. So, what is the solution?

First, we should take the right approach to PII (Personally Identifiable Information). If an organization is concerned about personal data, we may consider migrating just the IDs and running the calculations on those. Only when we need more detailed information, like a name or an address, do we take these IDs back on-premises and translate them.
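To make the ID-only idea concrete, here is a minimal sketch in Python. It is just one possible way to do it, not a description of any particular product: the field names (`name`, `address`, `monthly_spend`), the secret key, and the `pseudonymize` helper are all assumptions made up for illustration.

```python
import hashlib
import hmac


def pseudonymize(customers, secret_key):
    """Split customer records into a cloud-safe part and an on-premises lookup.

    `customers` is a list of dicts with the hypothetical fields
    "name", "address", and "monthly_spend".
    """
    cloud_rows = []     # goes to the cloud: ID plus non-identifying measures
    onprem_lookup = {}  # stays on-premises: ID -> personal details
    for row in customers:
        # Keyed hash, so the ID cannot be recomputed without the secret.
        identity = f'{row["name"]}|{row["address"]}'.encode("utf-8")
        token = hmac.new(secret_key, identity, hashlib.sha256).hexdigest()[:16]
        cloud_rows.append({"customer_id": token, "monthly_spend": row["monthly_spend"]})
        onprem_lookup[token] = {"name": row["name"], "address": row["address"]}
    return cloud_rows, onprem_lookup


# Made-up sample data.
customers = [
    {"name": "Anna Nowak", "address": "1 Main St", "monthly_spend": 120.0},
    {"name": "Jan Kowalski", "address": "2 Side St", "monthly_spend": 80.0},
]
cloud_rows, onprem_lookup = pseudonymize(customers, secret_key=b"keep-this-on-premises")

# In the cloud: the calculation runs on IDs only.
top_spender = max(cloud_rows, key=lambda r: r["monthly_spend"])

# Back on-premises: translate the ID only when details are really needed.
print(onprem_lookup[top_spender["customer_id"]]["name"])
```

The important design choice is that both the lookup table and the secret key never leave the on-premises environment, so the cloud side only ever sees opaque IDs.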

Data migration should be planned well ahead of an initiative. Even when we start with a POC project, we have to remember that taking the solution to production means transferring data to the cloud regularly, and that requires a lot of planning and agreements.

It’s very important that cloud providers such as Microsoft maintain proper communication with their customers. Cloud vendors should care about explaining the mechanisms and capabilities of the cloud in terms of security.

Another way to address the concern is to visit the Microsoft Trust Center. There is a lot of information there about how data is stored and processed in the cloud.

Myth 2. Data is immediately integrated and ready for analysis

REALITY

I wish it were! I have already mentioned the difficulty arising from the differences between data formats. We have Excel data, SQL databases, PDFs, JPGs, and all kinds of incomplete or inconsistent data with ambiguous sources of truth, all of which have to be cleaned and integrated before any analytics can start.

The complex loading process is another hurdle to overcome. Data sources will require different tools to periodically load the information to the cloud.

And finally, sometimes the data volume is so large that even with cloud-based solutions we must create appropriate settings and optimization mechanisms to migrate it.

SOLUTION

You have certainly heard that before going with advanced analytics, like machine learning, we should create a data warehouse. What we usually do when we start a new initiative is deploy a Modern Enterprise Data Warehouse.

This is the step where we gather all the data sources and create proper loading processes. Only when all the data sources are structured and ordered can we use machine learning to get reliable results.

While working with data, we follow certain patterns, which are a part of the Predica Data Domain Framework. It’s a complete repository, based on our expertise and the best DataOps practices. It contains rules and patterns that every one of us tries to follow while working with data. The PDDF includes common nomenclature, processes, and approaches to creating the services and environments.

Predica Data Domain Framework

This is what it looks like. You can see there is a lot of information. We have even created our own Testing Framework, just to monitor how the Modern Data Warehouse is performing. Only then can we be sure that the data is valid, and that we can use it in machine learning to get reliable results.
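To give you a feel for what automated data checks can look like, here is a minimal sketch in Python. It is not our Testing Framework, only an illustration of the idea: the `run_quality_checks` helper, the column names, and the sample batch are all hypothetical.

```python
def run_quality_checks(rows, key, required, min_rows=1):
    """Return a list of readable failures; an empty list means the batch passed."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"expected at least {min_rows} rows, got {len(rows)}")
    seen = set()
    for i, row in enumerate(rows):
        # Completeness: required columns must not be empty.
        for column in required:
            if row.get(column) in (None, ""):
                failures.append(f"row {i}: missing value in '{column}'")
        # Uniqueness: the business key must not repeat within the batch.
        if row.get(key) in seen:
            failures.append(f"row {i}: duplicate key {row.get(key)!r}")
        seen.add(row.get(key))
    return failures


# Made-up batch with one missing amount, one missing country, and one repeated key.
batch = [
    {"order_id": 1, "amount": 10.0, "country": "PL"},
    {"order_id": 2, "amount": None, "country": "DE"},
    {"order_id": 2, "amount": 5.0, "country": ""},
]
for failure in run_quality_checks(batch, key="order_id", required=["amount", "country"]):
    print(failure)
```

Checks like these would typically run after every load, so that a broken batch is caught before it reaches any model.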

Myth 3. Machine learning models can be applied quickly and business decisions based on them can be made immediately

REALITY

Here’s the ugly truth: they can’t. Machine learning gets easier and easier. Modern services like Azure ML allow us to create multiple machine learning models with a single click, but this can be treacherous. Without knowing what’s happening under the hood, it is very easy to misinterpret results and make wrong decisions.

Furthermore, tools like Azure ML can sometimes be used only in very specific cases. We cannot rely on automated services in every situation.

SOLUTION

So, what can we do? First of all, if we really want to make proper business decisions based on the models’ results, we have to be able to monitor their effectiveness. A Power BI dashboard is a good example of a solution that helps us understand the inner workings of a machine learning model.

Furthermore, we cannot rely on a single metric. We have to take into consideration different measures, specific to our business and our situation.

Finally, it is good to have a benchmark. Only then will we know how much value the machine learning model actually adds.
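Here is a small Python sketch of why a single metric and a missing benchmark can mislead. The churn labels and predictions are made up, and the benchmark is simply "always predict no churn"; treat it as an illustration, not a recommended evaluation setup.

```python
def evaluate(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical churn labels: 2 of 10 customers churn.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
baseline = [0] * 10                     # benchmark: always predict "no churn"
model = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]  # made-up model predictions

print("baseline", evaluate(y_true, baseline))
print("model   ", evaluate(y_true, model))
```

In this example the benchmark wins on accuracy (0.8 versus 0.7) even though it never finds a single churner, while the model catches half of them. Which of the two is more valuable depends on the business cost of a missed churner compared with a false alarm.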

Myth 4. Models are implemented right away and work in production

REALITY

Definitely not. In reality, models often end at the POC stage! Why is that? Because deploying a model and putting it into production is a lengthy process. Many agreements have to be reached, ownership has to be assigned, and a lot of decisions have to be made well ahead of the actual project. Many companies simply lose their patience.

The second issue – after the models are put into production, they are very rarely monitored. Meanwhile, the conditions may change, influencing the models and their efficiency.

COVID-19 is a perfect example of such a change in conditions. The pandemic disrupted machine learning models so much that, for a while, reliable forecasting became practically impossible. This is a major example, but there are a lot of smaller ones.

And if the models that are being used in production are not monitored, they will give false results. Sometimes they don’t go to production at all, because the decision-makers are discouraged.

Finally, when we create a model, our goal is to feed the results of the model to other systems. This rarely happens.

The models are calculated somewhere aside from anything working in the enterprise. They are not used in an automated way, as an input to a marketing campaign for instance. The model is simply getting lost somewhere in transit.

SOLUTION

Even when we start modeling at the POC stage, we have to plan for production from the very beginning and explain to all the decision-makers that stopping at the POC essentially means losing money: we have invested in a solution that has yet to bring us any value.

The second point I want to emphasize here is the mechanism for viewing results. Again, after putting the models into production, we want to be able to monitor them easily, so that we can tell management what kind of value each model brings. The answer is Machine Learning Operations (MLOps), an approach to building, managing, and monitoring models.
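As a very small illustration of production monitoring, the Python sketch below compares a model's recent forecast error with the error measured when the model was accepted. The weekly sales figures, the `needs_attention` helper, and the tolerance of 1.5 are assumptions chosen for the example; a real MLOps setup would track much more than one number.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def needs_attention(reference_mae, actual, predicted, tolerance=1.5):
    """Flag the model when its recent error exceeds `tolerance` times the reference error."""
    recent_mae = mean_absolute_error(actual, predicted)
    return recent_mae > tolerance * reference_mae, recent_mae


# Error measured when the model was accepted (hypothetical weekly sales).
reference_mae = mean_absolute_error([100, 110, 105, 95], [98, 112, 103, 97])

# A later period in which conditions changed and demand dropped sharply.
flag, recent_mae = needs_attention(
    reference_mae,
    actual=[60, 55, 50, 58],
    predicted=[101, 104, 99, 102],
)
print(flag, recent_mae)
```

A check like this can run on a schedule and feed a dashboard or an alert, so that a sudden change in conditions is noticed before anyone acts on stale forecasts.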

Myth 5. The organization eagerly adopts models and wants to implement machine learning

REALITY

Machine learning hype is very strong, but often the enthusiasm fades with the first bill. Let’s be honest, we need some investment to take advantage of advanced analytics. And sometimes we have to wait a long time for machine learning to bring any value.

I know countless organizations that tried out many machine learning models, and seeing that the results are not convincing, they stopped. But the truth is that if they waited a bit, or (perhaps) invested more, they could benefit from ML.

That’s why, when promoting machine learning in an organization, the results need to be very well explained. Usually, they are not. A black box with single numbers coming out of a machine does not tell us anything in particular.

The last thing, also very important – we have to give the model some time. It may happen that we do not see results straight away. This is the reality. That’s why machine learning cannot be a savior in all of your cases.

SOLUTION

How can we solve that? One good idea is to start in the areas where machine learning will bring value quickly.

One of our recent projects was for a finance department. We used machine learning not to build an advanced mechanism (e.g. to learn more about the customers), but to look for anomalies in the payments.

After four weeks we were able to catch a lot of duplicates, which led to saving a lot of money. This was very convincing for decision-makers, and right now we are continuing our journey with machine learning in this company.
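To show how little is needed to get a first list of suspicious payments, here is a minimal rule-based sketch in Python. To be clear, this is not the machine learning approach we used in that project. It is a simplified illustration, and the payment fields, sample records, and the 7-day window are all assumptions.

```python
from collections import defaultdict
from datetime import date


def find_possible_duplicates(payments, window_days=7):
    """Report pairs of payments with the same vendor and amount made within `window_days`."""
    groups = defaultdict(list)
    for p in payments:
        groups[(p["vendor"], round(p["amount"], 2))].append(p)
    suspects = []
    for group in groups.values():
        group.sort(key=lambda p: p["paid_on"])
        # Compare each payment with the next one in time order.
        for earlier, later in zip(group, group[1:]):
            if (later["paid_on"] - earlier["paid_on"]).days <= window_days:
                suspects.append((earlier["id"], later["id"]))
    return suspects


# Made-up payments: P1 and P2 look like the same invoice paid twice.
payments = [
    {"id": "P1", "vendor": "Acme", "amount": 1200.00, "paid_on": date(2021, 3, 1)},
    {"id": "P2", "vendor": "Acme", "amount": 1200.00, "paid_on": date(2021, 3, 3)},
    {"id": "P3", "vendor": "Acme", "amount": 1200.00, "paid_on": date(2021, 6, 1)},
    {"id": "P4", "vendor": "Globex", "amount": 99.90, "paid_on": date(2021, 3, 2)},
]
print(find_possible_duplicates(payments))
```

A rule like this only catches exact matches. Anomaly detection earns its place when duplicates are less obvious, for example when amounts or vendor names differ slightly.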

Again, if you want to promote machine learning, support those numbers with a dashboard. Power BI, Tableau, or any other visualization tool (Excel will do as well!) can help the decision-makers understand how the model works under the hood.

Finally, the roadmap. If we do not want to stop at the POC, we must (apart from creating the model itself) create a roadmap. Map out how the solution can be developed, how it can be beneficial for business departments, how you can improve it over time, and how to track the results.

Myth 6. A data analytics project can be planned from A to Z in one go

REALITY

A machine learning project can take an unexpected direction – sometimes it turns out that the applied method does not appropriately address the business problem we want to solve.

It may occur that the quality of data is too low to be easily modeled. If the examined phenomenon is rare, it will take some time to figure out how to model it.

The machine learning model, once built, has to be constantly adjusted and monitored. This requires mobilizing a team, and proper observations. That’s why we cannot always foresee what will happen in analytics projects.

SOLUTION

How can we try to solve that? We should not rely on ready-made solutions all the time, and instead, adjust the modeling to our problems.

What is more, we should introduce continuous integration and monitoring. The agile approach is crucial because it gives us modeling results very quickly. It allows us to constantly improve the model and to make sure that the final result will be satisfactory for everyone.

If this requires a change of the direction of the project, let’s do that. Let’s face it, sometimes we fail, especially in data analytics, but this is when we have to take all our expertise, take different directions, and believe that this time will be better.

Just to show you how it works in our case: I told you about the Predica Data Domain Framework. You can see that our “Wikipedia” has sections on database projects, Databricks projects, and machine learning projects. They always tell us what to do in case of doubt, including how to navigate and change the project direction if needed.

Predica Data Domain Framework

Myth 7. Machine learning is basically magic and can solve every problem

REALITY

So, in some cases machine learning or advanced analytics are not necessary – the problem can be solved with much easier methods. Introducing ML only for the hype will not give practical benefits.

The implementation may also fail when it kicks off with an overly complex approach. Even when you have a legitimate reason to use machine learning, do not start with complicated methods.

Finally, some organizations are simply not ready for machine learning because they do not have the internal competencies. Even if they start machine learning with an external vendor, it will be difficult for them to develop the models later and adjust them to changing business conditions.

SOLUTION

I told you about our frameworks, the DataOps and the MLOps. It would be wise to start with just that – a proper approach to data. If we have data and want to analyze it, let’s do it with a proper visualization, and use Power BI instead of machine learning, as a first step.

Furthermore, machine learning, like data science in general, is a process that requires a group of people. So, we need to make sure that we have people with different competencies who will be able to handle all the advanced tasks. In the end, if you want to make use of machine learning, you have to invest in skill sets and make sure that the competencies gradually increase.

Another example. Below is the Power BI dashboard for one of our Qatari clients. In this case, visuals alone helped us answer a lot of questions, with no ML needed!

Example dashboard in Power BI

I really encourage you to start with the easier methods first. I’m not saying that machine learning is wrong, because it’s very helpful, but in some cases, it should be a second step.

Key lessons about cloud data analytics

To sum up, I would like to tell you about our findings over the last three years. We have delivered multiple projects on the Azure platform, based on data and AI, and there are four key findings, which came out of those projects.

  1. Invest in DataOps and Machine Learning Ops. Try to come up with a framework, which will help you to handle all the models, monitor, manage, deploy, and improve them.
  2. Create a Data Framework. I have explained to you how Predica Data Domain Framework works. It’s a set of articles, solutions, and rules, which we all follow when creating new machine learning and data projects in general. Without it, the quality of our work would not be as good as it is.
  3. Engage data engineers in your team. Trust me, there are not many people who are great both at machine learning models and data preparation. That’s why a collective effort is necessary.
  4. Start small, with the simplest methods. Do not always try to answer your problems with machine learning.

If you would like to know more about this topic, or about anything else related to data and AI, go ahead and contact me. I will be happy to answer any questions you may have.

Key takeaways:

  1. It’s easy to have high expectations when it comes to advanced data analytics, but very often this is not the best solution you can choose to get insights from the data.
  2. Machine learning projects need a lot of preparation, investment, and time. Plan for that so you are not disappointed.
  3. The best advice is to start small: try simpler options like visualization tools first. If they don’t give you the answers, make sure that you are prepared for machine learning (in terms of your datasets, your frameworks, and your people) before you start such a project.