Cloud data workloads are like coffee: They come in many forms and flavors, each with a different price point. Just as a daily cappuccino habit can end up costing you many times more each month than brewing Folgers at home, the way you configure cloud-based data resources and run queries against them can have major implications for your overall cloud spending.
Unfortunately, figuring out whether your spending makes sense – for both coffee and cloud data – can be challenging. No one automatically tells you when you’re buying fancier coffee than you can afford or that you’re paying more for cloud data infrastructure than you need for the workloads you’re running.
Now, I’m not here to tell you how to make a coffee budget. But what I can tell you – because it’s part of the work I do every day – is how to manage cloud data costs. As I explain below, it all boils down to understanding what role each of your data workloads plays in your business, then allocating financial resources to them accordingly.
The challenge of cloud data cost optimization
Overspending on cloud data can occur due to simple mistakes, such as forgetting to delete a block storage volume after you no longer need it. This is a relatively simple type of spending error to correct because it’s typically easy to detect data resources that are not connected to any workloads.
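To illustrate how simple this kind of check can be, here is a minimal Python sketch. It assumes you have already exported an inventory of block storage volumes from your cloud provider; the `Volume` record, its fields, and the sample figures are hypothetical and stand in for whatever your provider's inventory or billing export actually contains.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Volume:
    """Hypothetical inventory record for one block storage volume."""
    volume_id: str
    attached_to: Optional[str]  # None means no workload is using the volume
    monthly_cost_usd: float


def find_orphaned_volumes(volumes: List[Volume]) -> List[Volume]:
    """Return the volumes that are not attached to any workload."""
    return [v for v in volumes if v.attached_to is None]


# Made-up sample data, for illustration only.
inventory = [
    Volume("vol-a", "orders-db", 40.0),
    Volume("vol-b", None, 16.0),
    Volume("vol-c", None, 80.0),
]

orphans = find_orphaned_volumes(inventory)
wasted = sum(v.monthly_cost_usd for v in orphans)

for v in orphans:
    print(f"{v.volume_id} is unattached and costs ${v.monthly_cost_usd:.2f}/month")
print(f"Total monthly spend on unattached volumes: ${wasted:.2f}")
```

A report like this only catches the easy cases. It says nothing about whether the volumes that are attached are worth what they cost.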
Where cloud data cost optimization gets more challenging – and where the root of a lot of overspending lies – is when it comes to ensuring that the data infrastructure you’re actively using is ideal for your needs.
That’s because it’s not always clear whether the business purpose of data workloads justifies their costs. There are many ways to configure data workloads, each with different cost implications. Without a great deal of context, it’s impossible to determine whether you’re using the best configuration based on the purpose of your data workloads.
Data cost management example
For example, consider a classic data use case: Querying transactional data. For this type of workload, there are multiple ways to host the data. You could put it in a data warehouse, for instance, or in various types of databases. There are also different approaches to querying the data. You could use query tools that are built into your data warehousing platform (if that’s where you store the data), or you could use external solutions. You can also devote varying levels of compute resources to the queries; more compute will typically result in faster queries.
Now, if your data workload is mission-critical – for example, if it’s part of a predictive analytics service that delivers product recommendations to your customers in real time, thereby contributing to revenue generation – you can probably justify spending a lot of money on it. In that case, you’d likely choose to store the data in a warehouse that is designed to optimize queries, and you’d devote plenty of compute resources to it.
But what if the data workload is less critical? What if, for instance, it’s part of an auditing process that your business performs periodically, but which doesn’t have to deliver results in real time? It would be a lot harder to justify paying for top-tier data infrastructure in that case.
In short, determining whether your cloud data is cost-optimized isn’t simply a matter of looking for obvious instances of unnecessary spending. It’s also about assessing whether the money you’re spending on data workloads in the cloud makes sense, given the business results that they help deliver.
Gaining visibility into data spending
To make that assessment, you need to know much more than what you’re spending on cloud data resources, or how your spending varies over time. You also need to know which business purpose the spending supports, as well as which stakeholders are responsible for the spending.
A basic step toward achieving this visibility is to tag all data-related cloud infrastructure in a meaningful way. Databases, block storage resources, object storage buckets and so on should be labeled with tags that identify which workloads they are part of and who is responsible for managing them.
That information is critical because you can pair it with spending metrics to figure out whether spikes in spending are justifiable or not.
For example, if you notice an uptick in the infrastructure costs associated with data queries, you can look at the tags on those queries to identify their purpose. Maybe they support fraud detection for purchases, and the increased cost is due to an increase in purchase volume. In that case, you could conclude that the cost is legitimate and move on.
But if the tags instead say that the queries are being run by your accounting department to prepare quarterly reports, you might instead make changes that reduce the costs of the queries – such as running them in batches or moving the data to a lower-cost database. The queries might take longer as a result, but that is likely to be acceptable, given the relationship between the queries and the business.
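As a rough sketch of how tags and spending metrics can be paired, the Python example below groups cost records by a `workload` tag and flags workloads whose spending jumped from one month to the next. The record layout, the tag names (`workload`, `owner`), the 25% threshold, and all of the figures are assumptions made for illustration; a real billing export will have its own schema.

```python
from collections import defaultdict

# Hypothetical cost records, as they might appear in a billing export.
cost_records = [
    {"month": "2024-01", "cost_usd": 1000.0,
     "tags": {"workload": "fraud-detection", "owner": "payments"}},
    {"month": "2024-02", "cost_usd": 1500.0,
     "tags": {"workload": "fraud-detection", "owner": "payments"}},
    {"month": "2024-01", "cost_usd": 400.0,
     "tags": {"workload": "quarterly-reports", "owner": "accounting"}},
    {"month": "2024-02", "cost_usd": 900.0,
     "tags": {"workload": "quarterly-reports", "owner": "accounting"}},
    {"month": "2024-02", "cost_usd": 50.0, "tags": {}},  # untagged resource
]


def cost_by_workload(records):
    """Sum cost per workload tag and month; missing tags go to 'untagged'."""
    totals = defaultdict(lambda: defaultdict(float))
    for record in records:
        workload = record["tags"].get("workload", "untagged")
        totals[workload][record["month"]] += record["cost_usd"]
    return totals


def find_spikes(totals, previous, current, threshold=0.25):
    """Return workloads whose relative cost increase exceeds the threshold."""
    spikes = {}
    for workload, by_month in totals.items():
        before = by_month.get(previous, 0.0)
        after = by_month.get(current, 0.0)
        if before > 0 and (after - before) / before > threshold:
            spikes[workload] = (after - before) / before
    return spikes


totals = cost_by_workload(cost_records)
spikes = find_spikes(totals, "2024-01", "2024-02")

for workload, increase in spikes.items():
    print(f"{workload}: spending up {increase:.0%} month over month")
```

The code only surfaces the spikes. Deciding whether a 50% increase for fraud detection is justified while a larger one for quarterly reporting is not remains a business judgment, which is why the tags need to carry the workload's purpose and owner.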
Reining in data costs permanently
Over the long term, you can use the insights you gain from identifying instances of excess data spending to improve your business’s overall approach to cloud data cost management.
For instance, you might realize that overspending is frequently due to situations where stakeholders scale up data resources in a bid to increase performance, without understanding the cost implications. To prevent that issue from recurring, you could make your organization’s cloud Identity and Access Management (IAM) policies stricter so that only certain employees have permission to scale up data infrastructure.
Conclusion: Getting data costs under control
Cloud data workloads can cost a lot or a little – and sometimes, there are good reasons for them to cost a lot. To know the difference, you need deep visibility into the business context of your data workloads and cloud infrastructure. When you can link data spending to business outcomes, you can systematically make effective determinations about whether the cost of each workload is justified by the value that the workload creates for your business.
Daniel Zagales is the Vice President of Data Engineering at 66degrees.