Big data is what powers companies. Lots of data means lots of insights, which enable better decision-making.
Having lots of information helps organizations work more efficiently and come up with new solutions and ways to better help their customers. But there’s also the matter of cost – large-scale solutions generate large-scale expenses.
The good news is, any savings are large-scale too. Now, how to get them? Skills and knowledge are definitely helpful, but research and determination can be equally important. I will explain it using this real-life example from our recent project.
Earlier this year, I worked with a company providing airport services, such as cargo handling, ground services, etc. Their customers, i.e. airlines, can book their assistance via a business web portal. This solution is one of the key services for our client, generating lots of data.
The organization had 4 million public requests and 26.4 million messages daily, amounting to 3.42 TB of data per day. With this much data, observability was key.
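To put those volumes in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch: it assumes decimal units (1 TB = 10^12 bytes) and attributes all 3.42 TB to the 26.4 million messages, neither of which is stated above.

```python
# Back-of-the-envelope figures derived from the daily volumes quoted above.
# Assumptions (not from the project): 1 TB = 10**12 bytes, and the whole
# 3.42 TB is treated as message payload.
BYTES_PER_TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 26_400_000
data_per_day_bytes = 3.42 * BYTES_PER_TB

avg_message_kb = data_per_day_bytes / messages_per_day / 1_000
messages_per_second = messages_per_day / SECONDS_PER_DAY
throughput_mb_per_second = data_per_day_bytes / SECONDS_PER_DAY / 1_000_000

print(f"Average message size: {avg_message_kb:.1f} kB")
print(f"Average message rate: {messages_per_second:.0f} msg/s")
print(f"Average throughput:   {throughput_mb_per_second:.1f} MB/s")
```

Under those assumptions, that is roughly 130 kB per message, about 300 messages per second and close to 40 MB of data every second, around the clock.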
The organization needed specialized tools to manage, process, and generate insights from this multitude of signals and sources. This is why they decided to introduce a dedicated data platform.
Their first implementation of a dedicated service was functional but also pricey. Processing data cost them £1 M per year, just for this one solution. It came as no surprise that the company looked to lower the cost.
The revised approach relied on a custom platform built in-house. The solution was based on Event Hub and Databricks. Although this approach made the service slightly cheaper, the cost was still around £700,000 per year.
£700,000 per year is not exactly peanuts, so it’s no wonder the project met with a bit of pushback.
The company wanted to get the cost below £60,000. Not an easy challenge but there had to be a way to do it. Which sounded like fun, so, along with the team, I jumped on the chance to tackle it 🙂
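To see how steep that target was, the yearly figures quoted above can be compared directly. The helper `reduction_pct` is just an illustrative name for this sketch.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


# Yearly costs in GBP, as quoted in the text.
first_service = 1_000_000  # first dedicated service
in_house = 700_000         # in-house platform on Event Hub and Databricks
target = 60_000            # goal set by the company

print(f"In-house vs first service: {reduction_pct(first_service, in_house):.0f}% cheaper")
print(f"Target vs in-house:        {reduction_pct(in_house, target):.1f}% cheaper")
print(f"Target vs first service:   {reduction_pct(first_service, target):.0f}% cheaper")
```

In other words, the in-house rebuild had saved about 30%, while the target meant cutting more than 91% off what was left.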
I faced a classic FinOps problem: how to ensure the client only uses the resources they need, without overpaying for them?
Of course, there are never easy answers. So a difficult one had to do. I was determined to find it.
First, I took the approach of understanding the usage. How much data does Databricks store and what happens to it?
While processing data, Databricks simultaneously streams it to storage. As a result, write operations are the largest cost on the platform and, at the same time, the largest variable cost.
One way to optimize it was to change the frequency with which Azure Databricks takes a data snapshot. In other words, if the service did a “data autosave” less often, it should reduce the load on the system and consequently, the cost.
Sure enough, processing data at a lower frequency reduced the expense, but only marginally.
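A rough way to picture the "autosave" effect is to count how many commits a stream makes per day at different intervals. This sketch assumes the snapshot frequency corresponds to the micro-batch trigger interval in Spark Structured Streaming and that every micro-batch produces exactly one round of storage writes. The intervals are hypothetical, not the project's real settings.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def commits_per_day(trigger_interval_s: float) -> int:
    """Number of micro-batches per day, assuming one commit per trigger."""
    return int(SECONDS_PER_DAY // trigger_interval_s)


# Hypothetical trigger intervals in seconds.
for interval in (10, 60, 300):
    print(f"trigger every {interval:>3}s -> {commits_per_day(interval):>5} commits/day")

# In PySpark, the interval is set on the stream writer, for example:
#   df.writeStream.trigger(processingTime="60 seconds")
```

Fewer commits mean fewer write transactions, but each commit carries more data, which may be one reason the saving from this change alone stayed small.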
The first attempt was done, but the cost was not even close to the £60k target. It was time to dig deeper.
The next idea was to look under the hood and investigate Azure Databricks workflows, to see if they could be optimized. This is where things got really interesting.
To explain this, let’s first go over how Azure Databricks is set up. Here’s the architecture diagram from Microsoft:
Looking at this structure, I tried to figure out where else we might be able to cut costs.
VMs wouldn’t help much here, as the volume of data was more or less consistent. So instead, I turned to DBFS – Databricks File System – which uses blob storage for write operations.
Apache Spark, whose Structured Streaming engine handles streaming in Azure Databricks, comes by default with an HDFS-backed state store implementation. While it works perfectly well in most cases, it can sometimes lead to GC (garbage collection) pauses when memory is overloaded.
The good news is, for scenarios where the HDFS-backed store doesn’t quite fit, there is an alternative. In Azure Databricks, you can use RocksDB for stateful streaming. With large volumes of data (reaching millions of records at a time), it handles state more efficiently without getting bogged down.
We took advantage of this fact and changed the state management system to RocksDB. What do you know: the solution cost dropped well below the target (it’s now a five-digit figure per year). Result!
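For reference, switching the state store is a one-line configuration change. The sketch below is not the project's actual code: the helper names are made up for illustration, and it only shows the documented Spark setting and the provider class name documented for Databricks.

```python
# Spark setting that selects the Structured Streaming state store implementation.
STATE_STORE_KEY = "spark.sql.streaming.stateStore.providerClass"

# Provider class documented for Databricks. Open-source Spark 3.2+ ships its own:
# org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
DATABRICKS_ROCKSDB_PROVIDER = (
    "com.databricks.sql.streaming.state.RocksDBStateStoreProvider"
)


def rocksdb_state_store_conf() -> dict:
    """Return the configuration that swaps the default state store for RocksDB."""
    return {STATE_STORE_KEY: DATABRICKS_ROCKSDB_PROVIDER}


def apply_conf(spark, conf: dict) -> None:
    """Apply settings to a SparkSession (any object exposing `conf.set`)."""
    for key, value in conf.items():
        spark.conf.set(key, value)


# Usage in a notebook, before the streaming query is started:
#   apply_conf(spark, rocksdb_state_store_conf())
```

The setting has to be in place before the streaming query starts, since the provider is chosen when the query initializes its state.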
What did I learn besides saving the customer TONS of money? Never accept the defaults. The default for Azure Databricks (and some other Spark-based services) is the HDFS-backed state store, whereas RocksDB rocks for some workloads.
Never accept the defaults. Be a challenger!
In this case, my team and I managed to save the client money by optimizing their workload. We reviewed the architecture carefully and chose a different state store for data processing in Azure Databricks, one that was more efficient in this case. The change did not affect the solution’s output functionally, but it cut the client’s platform cost by over 90% compared to the original service.
Many thanks to Paweł Orzech for all his help and persistence in solving that mystery 🙂
If you’d like to know more, the technicalities are described in the documentation linked below. And if you’d rather talk to a human, just drop me a note and I will get back to you.