Nobl9 - News and updates on SRE, SLO and general reliability

Leveraging GitOps and Site Reliability Engineering

Written by Jason Bloomberg | Jul 20, 2021 7:00:00 AM

Jason Bloomberg, President, Intellyx

The enterprise IT challenge of the day: how to manage – and leverage – near-constant change?

In the previous BrainBlog post in this series, my colleague Jason English explained how important service-level objectives (SLOs) are to managing such change:

“SLOs support an iterative, DevOps delivery process that embraces constant change,” English wrote. “Continuous delivery of code to production is merged with continuous observability of the impact of each change in production, and the resulting SLIs can fulfill existing SLOs while helping to identify new SLOs for improvement.”

At the heart of this DevOps delivery process is CI/CD: the continuous integration and continuous deployment that results in working code ready for production.

Deployment isn’t the end of the process, however. Releasing code is the missing step: putting new software in front of customers and end-users, while ensuring it meets the ongoing objectives of the business. 

It is at the point of software release and thereafter that site reliability engineering (SRE) can leverage SLOs to balance these business needs with the technical measures of success that service level indicators (SLIs) represent.
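To make the relationship among SLIs, SLOs, and the error budget concrete, here is a minimal sketch in Python. The function names and figures are illustrative assumptions, not drawn from any particular tool:

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: the fraction of requests that succeeded."""
    return good_events / total_events

def remaining_error_budget(sli: float, slo_target: float) -> float:
    """1.0 means the budget is untouched; 0.0 means it is fully spent."""
    allowed_failure = 1.0 - slo_target   # e.g. 0.1% for a 99.9% SLO
    actual_failure = 1.0 - sli
    return 1.0 - (actual_failure / allowed_failure)

# 999,000 good requests out of 1,000,000 against a 99.9% availability SLO:
sli = availability_sli(999_000, 1_000_000)   # 0.999
budget = remaining_error_budget(sli, 0.999)  # 0.0: the budget is exactly spent
```

The business sets the objective (99.9% availability), the SLI reports the measured reality, and the gap between them is the error budget the SRE team has to spend on change.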

As organizations’ software deployments mature to take advantage of constant change, site reliability engineers increasingly focus on Kubernetes-powered cloud-native environments. However, the massive scale and ephemerality of the operational environment requires an end-to-end rethink of how to release software into production and operate it once it’s there.

Service-Level Objectives for Cloud-Native Computing

While most enterprises are currently in the midst of ramping up their Kubernetes deployments, certain industries are already looking ahead to the need for unprecedented scale. 

On the one hand, this explosive growth in business demand for ephemerality and scale is driving the exceptionally rapid maturation of the Kubernetes ecosystem.

On the other hand, all this cutting-edge technology has to actually work. And that’s where cloud-native operations fits in.

Cloud-native computing takes the established ‘infrastructure as code’ principle and extends it to model-driven, configuration-based infrastructure. Cloud-native also leverages the shift-left, immutable infrastructure principle.

While a model-driven, configuration-based approach to software deployment is necessary for achieving the goals of cloud-native computing, it is not sufficient on its own to handle the scale and ephemerality of software deployed in the cloud-native context.

Software teams must extend such configurability to production environments in a way that expects and deals with ongoing change in production. To this end, various ‘shift-right’ activities including canary deployments, blue/green rollouts, automated rollbacks, chaos engineering, and other techniques are necessary to both deal with and take advantage of ongoing, often unpredictable change in production environments.
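As an illustration of one such shift-right technique, the promotion decision for a canary deployment can be reduced to comparing the canary's SLI against the baseline. A toy sketch, where the function name and tolerance threshold are hypothetical:

```python
def promote_canary(canary_error_rate: float,
                   baseline_error_rate: float,
                   tolerance: float = 0.001) -> bool:
    """Promote only if the canary's error rate stays within
    a small tolerance of the baseline's error rate."""
    return canary_error_rate <= baseline_error_rate + tolerance

# Canary at 0.2% errors vs. a 0.15% baseline: within tolerance, so promote.
promote_canary(0.002, 0.0015)   # True
# Canary at 1% errors: well outside tolerance, trigger an automated rollback.
promote_canary(0.01, 0.0015)    # False
```

In practice the comparison would run over many SLIs and time windows, but the principle is the same: the decision to expand or roll back a release is made from live production signals, not pre-release test results.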

‘Shift-right’ (not to be confused with ‘shift-left’) refers to the fact that these actions take place after the release of software – in the live, production environment. The reality of modern, cloud-native computing is that change is so constant that the core of testing – making sure the software meets the business need – must take place in production.

SLOs are absolutely essential in such shift-right scenarios, as the balancing act between performance and user experience takes place directly in front of users. The DevOps and SRE teams must manage these factors in real-time in order to keep the software in production within the error budget on an ongoing basis.
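One common way teams keep software within the error budget in real time is burn-rate monitoring: comparing how fast the budget is being spent against a steady pace over the SLO window. A rough sketch, where the 2x alert threshold is an assumption rather than a standard:

```python
def burn_rate(budget_spent_fraction: float,
              window_fraction_of_period: float) -> float:
    """How fast the budget is burning relative to a steady spend.
    1.0 = on pace to spend exactly the budget; >1.0 = burning too fast."""
    return budget_spent_fraction / window_fraction_of_period

# 5% of the monthly error budget spent in 1% of the month:
rate = burn_rate(0.05, 0.01)   # roughly 5x the sustainable pace
alert = rate > 2.0             # page the SRE team before the budget is gone
```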

GitOps: Cloud-Native Model for Operations

Bringing together these best practice trends for operations in cloud-native environments is an approach we call GitOps.

GitOps is a cloud-native model for operations that takes into account model-driven, configuration-based deployments onto immutable infrastructure that support dynamic production environments at scale.

GitOps gets its name from Git, the hugely popular open-source source code management (SCM) tool. Yet, although SCM is primarily focused on the pre-release parts of the software lifecycle, GitOps focuses more on the Ops than the Git.

GitOps extends the Git-oriented best practices of the software development world to ops, aligning with the configuration-based approach necessary for cloud-native operations – only now, the team uses Git to manage and deploy the configurations as well as source code.

Such an approach promises to work at scale, as GitOps is well-qualified to abstract all the various differences among environments, deployments, and configurations necessary to deal with ephemeral software assets at scale.
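The mechanism behind this abstraction is reconciliation: desired state lives declaratively in Git, and a controller continuously converges the running environment toward it. A toy sketch of that loop, with all names and fields hypothetical:

```python
# Desired state, as declared in a Git-tracked configuration file:
desired = {"replicas": 3, "image": "shop:v2"}
# Actual state, as observed in the running environment:
actual = {"replicas": 2, "image": "shop:v1"}

def reconcile(desired: dict, actual: dict) -> dict:
    """Return the changes needed to make actual match desired."""
    return {k: v for k, v in desired.items() if actual.get(k) != v}

diff = reconcile(desired, actual)   # {'replicas': 3, 'image': 'shop:v2'}
# A real operator would now apply `diff` to the environment and re-observe,
# so every change flows from a commit rather than an ad-hoc command.
```

Because the environment is fully described in version control, any instance (or the whole fleet) can be redeployed from scratch, which is what makes the approach viable for ephemeral assets at scale.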

GitOps also promises a new approach to software governance that resolves long-standing bottlenecks. In traditional software development (including Agile), a quality gate or change control board review requirement can stop a software deployment dead in its tracks. 

Instead, GitOps abstracts the policies that lead to such slowdowns, empowering organizations to better leverage automation to deliver adequate software governance at speed.

The impact of GitOps on SRE – and SLOs in particular – is still largely experimental, but promising. GitOps requires that deployment personnel represent the entire production environment declaratively so that they can deploy or redeploy any or all of it following immutable infrastructure principles – a practice that will inevitably drive any organization’s approach to SRE.

From the SRE’s perspective, GitOps provides a shortcut to maintaining SLOs in shift-right deployments. As Murphy’s Law of Software states, anything that can go wrong, will go wrong – in production. GitOps gives SREs a way of managing the resulting error budgets in a fully declarative, governed manner.
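One way to picture "declarative, governed" error-budget management is to keep the SLO itself as configuration under version control, reviewed and deployed like any other code. A hypothetical sketch (the service name, fields, and policy values are all illustrative):

```python
# An SLO expressed as data, stored in a Git-tracked file alongside the
# rest of the environment's declarative configuration:
slo_manifest = {
    "service": "checkout-api",           # assumed service name
    "indicator": "availability",
    "objective": 0.999,                  # 99.9% over the window
    "window_days": 28,
    "on_budget_exhausted": "freeze_releases",
}

def validate_slo(manifest: dict) -> bool:
    """Reject manifests with impossible objectives before they merge."""
    return 0.0 < manifest["objective"] < 1.0 and manifest["window_days"] > 0

assert validate_slo(slo_manifest)
```

With the SLO in Git, changing an objective or an exhaustion policy goes through the same pull-request review and automated checks as a code change, which is precisely the governance-at-speed that GitOps promises.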

The Intellyx Take

There are many moving parts to the vision of cloud-native software release that this BrainBlog lays out. The scripting central to infrastructure-as-code gives way to declarative, model-driven representations that are themselves part of the Git-driven software deployment and release processes.

Combine GitOps with shift-right practices that routinely spend the error budget by testing code in production, and you quickly paint a picture of barely managed chaos.

Given the relative immaturity of the open source infrastructure underlying cloud-native computing, it’s easy to see how things could quickly go off the rails. It’s no wonder, then, that SRE and the engineers that practice it are in such demand today. 

Old ways of managing change sought to constrain the chaos in order to maintain the policies and requirements important to the business. 

Today, the ability to leverage change is a top business priority. SRE, leveraging SLOs and GitOps in cloud-native environments, is becoming the only way to deliver such change within the constraints of the business.

Copyright © Intellyx LLC. Nobl9 is an Intellyx customer. Intellyx retains final editorial control of this article.

Image Credit: Alok Sharma on Unsplash