| Author: Kit Merker
Are you spending too much on cloud services? Is there a way to optimize your cloud spend and dramatically lower it? Here’s a hint: it has nothing to do with negotiating better deals (and bigger commitments) or reminding your developers to turn down unused infrastructure. This post outlines a different approach: by leveraging cloud-native infrastructure and pairing it with objective reliability data you can pay only for the infrastructure you need to meet user expectations. Read on to learn how to save money by setting realistic service level objectives (SLOs), keeping customers happy while also managing your cloud infrastructure bill.
One of the key challenges in cloud computing is how to optimize the available cloud resources to handle the demand placed on them. At the heart of the issue is the simple fact that no one wants an unhappy customer. Underprovisioning may lead to users experiencing an outage when there is a spike in traffic, so we tend to overprovision resources.
Unfortunately, that means we’re often spending too much on infrastructure, paying for overhead in a base footprint that is sized too conservatively. On the other hand, when demand rises faster than we can scale up, it can be very difficult to understand the impact on the customer experience while we take action.
Let’s explore how two complementary tools together can address this problem: Cloud-native Workloads and Service Level Objectives (SLOs). By applying SLOs to cloud-native workloads, we can better manage our cloud resources to achieve optimum reliability with minimal waste of money.
Harnessing the Fluidity of Cloud-Native Workloads
When you can move a workload from one set of resources to another, you can better control costs by assigning workloads to resources that are the most efficient and effective for the situation at hand. Conversely, when a workload is tightly bound to its underlying capacity (i.e., it’s not “fluid”), operators don’t have as many levers to pull to improve performance and utilization.
Let’s put this in more practical terms by comparing workloads on virtual machines (VMs) to container-based workloads. In a nutshell, container-based workloads are far more fluid than VM-based workloads and thus give operators much more flexibility in managing resources for maximum efficiency and reliability.
VM-based workloads are hard to scale out because they are still married to the concept of a computer. One of the big problems with VMs is that they don’t share an OS with their neighbors; each VM carries its own. Each VM must also bundle its own copies of packages, even when different workloads need different versions. As a result, workloads on VMs are more tightly bound to the underlying capacity, and efficiency suffers: they are heavier, boot more slowly, and take up a larger memory footprint.
In comparison, if you have a variety of workloads running in containers (Kubernetes clusters, for example), you can rapidly reallocate workloads within a fixed pool of computing resources. To put it another way, container-based apps let you bin-pack better onto hardware. This added flexibility can translate into better density and elasticity of infrastructure resources.
Example: Suppose you have a cluster that is being used for both a production commerce system and also a few dev/test deployments (for example, a staging environment). One big cluster is running all workloads side-by-side from a compute perspective, but the staging/dev workloads are hidden behind restricted load balancers. If you have a sudden spike in usage, the best way to free up capacity is to kill the dev/test workloads (which are being used by employees) and give the compute resources to the customers who are paying you. This can be done quickly and efficiently in a cloud-native environment. It is always faster to schedule new containers to an already booted VM than to provision and boot a whole new VM.
So the point is, if you are on this journey to cloud-native, the fluidity of container-based workloads will give you more levers to pull when it comes to managing reliability. That includes capabilities like autoscaling and strategic pod eviction schedules that allow you to evict lower priority services to support higher priority ones that are having issues. Essentially, this means containerized workloads enable you to create your own “preemptible” workload strategy. It’s a pretty simple thing to do once you know the relative priority of each workload. But how do you determine the relative priority of each service?
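As an illustrative sketch of that “preemptible” strategy (the workload names, CPU requests, and priority values below are hypothetical), the core policy can be as simple as draining the lowest-priority workloads first until enough capacity is freed:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    cpu_request: float  # CPU cores this workload reserves
    priority: int       # higher number = higher priority

def evict_to_free(workloads, cpu_needed):
    """Evict lowest-priority workloads until enough CPU is freed.

    Returns the names of the evicted workloads, in eviction order.
    """
    evicted, freed = [], 0.0
    # Walk workloads from lowest to highest priority
    for w in sorted(workloads, key=lambda w: w.priority):
        if freed >= cpu_needed:
            break
        evicted.append(w.name)
        freed += w.cpu_request
    return evicted

cluster = [
    Workload("checkout-prod", 8.0, priority=100),
    Workload("staging", 4.0, priority=10),
    Workload("dev-sandbox", 2.0, priority=5),
]
print(evict_to_free(cluster, cpu_needed=5.0))  # ['dev-sandbox', 'staging']
```

In Kubernetes, the same idea is expressed declaratively via PriorityClasses and pod preemption rather than hand-rolled logic, but the decision being made is the same one this sketch models.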
Using Service Level Objectives for Workload Prioritization
One way to approach this is to define SLOs and error budgets for each of your workloads/applications. As Google describes it, SLOs define the target values of metrics that matter. In this case, an SLO for a service might define a target availability rate, with the error budget being the maximum amount of time that a service can be unavailable while still meeting its objective. A critical, customer-facing service might have an SLO with a high availability target (such as 99.999%) and a very small error budget (0.001%), whereas a less critical service might have a lower availability target (such as 99.9%) and a larger error budget (0.1%).
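To make those targets concrete, an availability error budget translates directly into allowed downtime over a measurement window. A quick sketch, assuming a 30-day rolling window:

```python
def allowed_downtime_minutes(slo_pct, window_days=30):
    """Downtime permitted by an availability SLO over a rolling window."""
    error_budget = 1.0 - slo_pct / 100.0  # e.g. 99.9% -> 0.001
    return window_days * 24 * 60 * error_budget

# "Five nines" leaves well under a minute per month,
# while "three nines" allows roughly three quarters of an hour.
print(round(allowed_downtime_minutes(99.999), 2))  # 0.43
print(round(allowed_downtime_minutes(99.9), 2))    # 43.2
```

That two-orders-of-magnitude difference is exactly why the priority label matters: a 99.9% service can absorb a brief eviction or scale-up delay that a 99.999% service cannot.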
When prioritizing workloads, SLOs help you “label” the relative priorities and goals of different services. For instance, you might use a common nomenclature for the SLOs that tells you whether the workload/application:
- Impacts customer experience, or
- Is merely nice to have
Accompanying this label should be the precise metrics you will use to define whether the availability of the service is in its healthy range. By incorporating this SLO metadata into your analysis, you have richer information to make decisions based on demand as it’s happening. For example, SLO metrics can be used to evaluate the impact of changes to autoscaling policies, such as with the Horizontal Pod Autoscaler (HPA) feature in Kubernetes, to express how you want your services to scale in and out.
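As a sketch of that scaling logic, the core formula the HPA uses (desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)) can be modeled in a few lines. The replica bounds here are illustrative, and the real controller also applies tolerances and stabilization windows this sketch omits:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Replica count the HPA formula would aim for, clamped to bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# CPU utilization at 90% against a 60% target triggers a scale-out:
print(desired_replicas(4, current_metric=90, target_metric=60))  # 6
# Utilization at half the target lets the deployment scale in:
print(desired_replicas(4, current_metric=30, target_metric=60))  # 2
```

Pairing this with SLO metrics means you can tell whether a more aggressive target (and therefore fewer replicas) is actually hurting users, rather than guessing.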
Using SLOs to Cost Tune While Protecting Customer Experience
When cost tuning an application in practice, teams often find that a large share of the savings lives in overhead: the minimum CPU and memory footprint of each service, the minimum resources kept available (e.g., machine counts in a cluster, or containers in a pod or scaling group), and a standby reservoir of compute resources ready to absorb a spike in usage. The components of an application often scale in very different ways; even those designed to scale horizontally may behave divergently in a running system with varying live traffic, and the interactions can be complex. If you’re planning to adjust those settings, especially if you are trying to squeeze down the amount of overhead and therefore take on more risk to availability and performance, how can you measure the impact on the reliability of the user experience and downstream system performance? SLOs are the answer, because they can be designed to measure the reliability of the entire system as consumers of the services define it.
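As a minimal sketch of that measurement, an availability SLI can be computed from good versus total events and compared against the SLO target before and after each tuning step (the event counts below are hypothetical):

```python
def availability_sli(good_events, total_events):
    """Availability SLI as a percentage of successful events."""
    return 100.0 * good_events / total_events

def within_slo(good_events, total_events, slo_pct):
    """True if the measured SLI still meets the SLO target."""
    return availability_sli(good_events, total_events) >= slo_pct

# After tightening autoscaling overhead, recheck before squeezing further:
print(within_slo(good_events=999_420, total_events=1_000_000, slo_pct=99.9))  # True
print(within_slo(good_events=998_500, total_events=1_000_000, slo_pct=99.9))  # False
```

The point is not the arithmetic but the feedback loop: each reduction in overhead is only kept if the SLI stays inside the objective.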
Let’s say you have a service hosted in a public cloud that is one of your more expensive workloads. I would venture to guess that:
- The autoscaling could be adjusted to squeeze some overhead (and therefore cost) out.
- You are not squeezing it tight enough, because your teams are being conservative about performance.
- You should squeeze it tighter, and you can!
- It’s probably well over 10% of the compute portion of your bill (if your system is typical).
- You need a proven way to make sure you’re not injuring your customer experience.
Balancing Service and Costs, Protecting Margins
The deeper promise in cost tuning comes when you are able to use SLOs to protect your customer experience while adding flexibility to autoscaling. For example, in addition to evicting lower priority workloads such as staging or test environments, you may consider sharing overhead across workloads or deprioritizing batch workloads (long-running jobs such as data analysis, video encoding, or even billing and invoicing jobs—wherever you have backpressure mechanisms in place) to service temporary spikes in more interactive workloads. Another option is moving a portion of workloads to less-costly resources, for example, spot or preemptible compute products. Even though these changes add moving parts, and consequently some operational risk, much of that risk is alleviated by accurate measurement to ensure the service level is still met.
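A rough sketch of the math behind that last option, with hypothetical hourly rates (actual spot and preemptible discounts vary by provider and region):

```python
def blended_hourly_cost(nodes, on_demand_rate, spot_rate, spot_fraction):
    """Estimated hourly cost when a fraction of nodes runs on spot capacity."""
    spot_nodes = round(nodes * spot_fraction)
    on_demand_nodes = nodes - spot_nodes
    return on_demand_nodes * on_demand_rate + spot_nodes * spot_rate

# All on-demand versus half the fleet on spot capacity:
full = blended_hourly_cost(20, on_demand_rate=0.40, spot_rate=0.12, spot_fraction=0.0)
mixed = blended_hourly_cost(20, on_demand_rate=0.40, spot_rate=0.12, spot_fraction=0.5)
print(round(full, 2), round(mixed, 2))  # 8.0 5.2
```

The savings are easy to compute; the hard part is confirming that workloads moved to interruptible capacity still meet their SLOs, which is exactly what the measurement described above gives you.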
Saving money on infrastructure is all about understanding priorities and matching capacity to demand. If you’d like to talk about balancing reliability, costs and feature delivery, drop us a line at firstname.lastname@example.org