The information we gather as organizations can give us a lot of answers, but only if we know how to listen.
After exploring the worlds of DevOps and security, let’s talk about data.
Data is the new oil! But is any data good enough?
How do you know if the information you are using is the right kind of fuel for your organization?
Every company has three main assets: its people, its technology, and its data.
DevOps and SecOps are about ensuring that the technology supporting your business goals is deployed and secure.
What about the data that flows through the systems?
Organizations use data for a handful of reasons. How advanced a company is in using its information depends on its maturity stage, but almost all business use cases for data fall into one of three broad categories.
First, the organization becomes “data-aware.” There is a feeling that there is more to your data than what you are using at the moment. Questions are asked and answered by individuals digging through the information in Excel files.
There is no standard data model, and a lot of time is spent figuring out how to get the right data from many sources and how to use it.
When an organization grows, the amount of information it gathers and processes grows as well. The way data is collected and used also matures.
After the initial data processing in silos comes the next stage. Technology is used to build data warehouses and data marts. New people are onboarded to build data capabilities. You have, or perhaps are yourself, a Chief Data Officer.
The organization’s use of data keeps maturing. There are more business cases, and more people use data daily to find new insights or to control operations.
The company becomes “data-guided”!
Here’s the catch. As the usage of information matures, its maintenance becomes more complicated.
Does it sound familiar? Aren’t those the same issues that troubled application development? When speed and integration became an issue, we turned to DevOps to solve it.
Why not turn to DataOps? Does such a thing even exist?
Yes, it does!
DataOps is a process used by data and analytics teams, much like DevOps is for software teams. Its purpose is “to improve the quality and reduce the cycle time of data analytics.” I got this definition from Wikipedia.
Let’s explain it in a more straightforward way. Think about your organization. You know its business. You have an idea: based on data, you could improve a process and provide more value to the core business.
How do you test and confirm it? What do you need to check it? Your data warehouse test environment? A copy of your production data model and the data itself?
How long will it take for you to get it? Hours? Days? Weeks?
Let’s assume you’ve got a test environment. You’ve developed your change to the data model, and it’s brilliant!
Do you want everyone to use it? How do you do it, and how long does it take?
Do you have to go through a test environment deployment and manual validation? Would it take days? Weeks?
As an alternative: can you commit your change to a code repository? Can you have it deployed to your data warehouse by tomorrow?
When the change is deployed, there might be some side-effects. Errors, even.
How do you know if your data flow will still make sense from the business perspective? The fact that it runs doesn’t mean it still carries the right logic and business knowledge.
These are the questions you need to answer to make sure your business uses the correct data.
In the standard model, all of this takes time to develop, test, and deploy.
All of that time slows down your business in taking advantage of its data.
And all of these issues raise your business risk.
Ready for a change? This is where DataOps steps in.
Data operations is about innovating across your data value chain. It makes it easy to test and validate ideas, and to turn them into value for your organization.
Let’s discuss how you can introduce it to your processes.
Every journey starts with directions, and a common language helps you navigate. Build a shared repository of information and practices around data engineering in your organization.
Knowledge is no longer spread across many places and in people’s heads. It lives in a shared repository, where everyone can find and update it.
What is the goal? Your data team starts to organize and standardize its approach to solutions. Instead of hundreds of different ways of doing something, you now have your own data framework!
We’ve built our own Predica Data Domain Framework: a shared repository of best practices, available to everyone working on data projects.
A screenshot of our repository
Pro tip: don’t aim to standardize everything in one go. Don’t let this effort stop you from doing the actual work. It is a living standard; build it along the way as you deliver value from your data projects.
You can’t speed things up if you don’t turn your data projects into code. Everything lives as code nowadays, and the Azure cloud makes this much more manageable. Take a look at this reference data platform architecture:
Example data platform architecture
All of these are Azure services. If it is in the cloud, it is based on APIs and can be automated. Once your data, and the infrastructure that processes it, are described as code, you can proceed to the next step: delivering two crucial elements of your data operations, data flows as code and automated data testing.
Data flows can be described as code and deployed to your Azure services. This makes building new environments easy, and it standardizes and automates those deployments.
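To make this concrete, here is a minimal sketch of what a data flow described as code could look like. It is illustrative only: the DataFlow class, the field names, and the sample values are assumptions made for this example, not a specific Azure API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataFlow:
    """A data flow definition, versioned in the repository like any other code."""
    name: str
    source: str                      # e.g. files in a landing zone
    destination: str                 # e.g. a table in the data warehouse
    transformations: List[str] = field(default_factory=list)

# Hypothetical example: the same definition can be deployed to a dev, test,
# or production environment, which is what makes rebuilding environments repeatable.
daily_sales = DataFlow(
    name="daily-sales",
    source="landing/sales/*.csv",
    destination="dw.fact_sales",
    transformations=["deduplicate", "convert_currency"],
)
```

Because the definition lives in your repository, it gets reviewed like any other change and deployed the same way to every environment.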
Once data flows are treated as code, you can also build automated tests to verify the streams of data and their quality.
Instead of spending days executing manual data validation tests, you can make them part of the pipeline and run them every day, or at every deployment.
A sample dashboard for data testing performance
This builds much-needed trust in your data and saves time. Now you can check whether your information is correct every time a change is made to data flows or models. It also lets you find out if something was changed at the source. And all of this from a single dashboard!
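As an illustration, here is a minimal sketch of what such automated data tests could look like, using pandas and pytest. The table, columns, and file path are hypothetical; the point is simply that these checks run in the pipeline instead of being executed by hand.

```python
import pandas as pd

def load_fact_sales() -> pd.DataFrame:
    # Hypothetical: in a real pipeline this would query the deployed
    # warehouse environment instead of reading a local export.
    return pd.read_parquet("exports/fact_sales.parquet")

def test_no_missing_customer_ids():
    df = load_fact_sales()
    # Every sales row should reference a customer.
    assert df["customer_id"].notna().all(), "fact_sales contains rows without a customer"

def test_amounts_are_not_negative():
    df = load_fact_sales()
    # Negative amounts usually mean something changed at the source.
    assert (df["amount"] >= 0).all(), "negative sales amounts found"
```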
With your data project living as code, built on standard practices, and verified with automated tests, you are ready for the next step: deployment.
A typical data project deployment is time-consuming and laborious. But after transforming it into a DataOps project, you can deploy it like any other code, with a CI/CD pipeline.
A CI/CD pipeline for data
It allows you to build environments.
It allows you to execute data flow tests as part of the deployment.
Finally, it allows you to merge new changes and deliver value.
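To give a flavour of how the testing step fits into such a pipeline, here is a small, hypothetical gate script. The pipeline would call it after deploying to an environment, and a failing data test stops the release. The script name and the test folder are assumptions for this sketch, not part of any specific pipeline product.

```python
# run_data_flow_tests.py -- hypothetical gate step in the CI/CD pipeline.
# The pipeline calls this script after deployment; a non-zero exit code
# blocks the release until the data tests pass again.
import sys
import pytest

if __name__ == "__main__":
    exit_code = pytest.main(["tests/data_quality", "-v"])
    sys.exit(exit_code)
```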
With those four elements in place, you have started your organization’s journey into DataOps. Now you are ready to iterate and innovate on data, and to reduce your business risk. It also makes data projects fun and interesting for the teams working on them. They are no longer just SQL people; they are the DataOps team!
To help you with your journey into DataOps, here are a few links that will let you dig deeper into the topic:
If you’d like to take a step back and look more into the topic of DevOps, you can complete this questionnaire to find out how your teams are doing right now and where you could improve.
And if you have any questions about the tools we use, the Predica Data Domain Framework, or even the meaning of life, feel free to ask me. Simply contact me or post your question below. I will be sure to answer it.
Next time, we will talk about money!