

Oct 23, 2021 · Team Hephy · tutorials · Updated: Mar 4, 2024

Reconciling Drift through Helm Controller


I’ve had a number of people ask me about the strange case of Helm and Flux’s capability of drift detection and resolution, and this seems like a good topic to resurrect my blog over. The word on the street is that Helm hooks and 3-way merge mean Helm cannot reconcile drift, and this might honestly make some people have a bad day.

My feeling is that a Closed Loop is an important part of GitOps, but I recognize that it has not been adopted as part of the standard and not all people can agree on this sorta-purity-test as a requirement of “doing real GitOps.” That’s OK, we don’t have to agree; and as I’ll show, I don’t even feel strongly enough to keep myself internally consistent here.

Such differences of opinion can be one of the primary things that makes a community great!

Let’s get some background out of the way, then tie these things together with an actual use case that I personally had for introducing drift on purpose, to clarify some concepts and also show how one can productively violate the Closed Loop principle. I will demonstrate (with a caveat) how one can use Flux and Helm to reconcile drift with a Flux HelmRelease. This will be illustrative of differences between GitOps tools and a few more advanced features of Flux.

First, what is 3-way merge in this context, and what else about Helm makes reconciling drift such a difficult problem? Actually, since I don’t know you or your background, let’s back up even one step further:

What is Helm good for?

What is Helm doing for us at all, and why do we even use it?

Helm is for distributing Helm charts, and those are Kubernetes YAML manifests that come packaged together as a bundle of templates, which have all had holes punched through them in places where the authors expected we might need to make a decision about configuration. Charts also usually come packaged with generally sensible defaults.

So for the app dev who wants to try a “chart vendor’s” app in about the way it was intended to be used, the experience can be easy and mostly zero-configuration. And for an infrastructure engineer who needs more control over the end result, the experience can still be just about as nice from the very same artifact!

Helm users can install from the default values.yaml or they can override it (without any need for editing the chart) by passing --set or --values with their helm install or helm upgrade commands.

This pattern of overriding chart defaults is manifested (pardon the pun) in Helm Controller’s HelmRelease as a field, spec.values, and some neighbors. We have an example that installs Hephy Workflow and makes some configuration changes overriding defaults; these possible configurations are already accommodated in the Helm chart itself.
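To make that concrete, here is a minimal sketch of the shape of such a HelmRelease (the chart, repository, and values shown are illustrative placeholders, not the actual Hephy Workflow example):

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m
  chart:
    spec:
      chart: my-app
      version: "1.x"
      sourceRef:
        kind: HelmRepository
        name: my-chart-repo
  # spec.values overrides the chart's default values.yaml, much like
  # passing --values to helm install or helm upgrade would.
  values:
    replicaCount: 2
    ingress:
      enabled: true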

Some other changes are made in this example with patchesStrategicMerge, for configurations the chart author didn’t think of, or didn’t consider important enough to include: deleting a DaemonSet, setting loadBalancerIP in a Service, adding an annotation to an Ingress, or adding a PersistentVolumeClaim attachment to a Deployment. All such things are possible!
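For illustration only (the resource names here are made up), a postRenderers stanza on a HelmRelease can carry exactly these kinds of strategic-merge patches:

spec:
  postRenderers:
    - kustomize:
        patchesStrategicMerge:
          # Set a field the chart never exposed as a value.
          - apiVersion: v1
            kind: Service
            metadata:
              name: my-app
            spec:
              loadBalancerIP: 203.0.113.10
          # Drop a rendered resource entirely.
          - $patch: delete
            apiVersion: apps/v1
            kind: DaemonSet
            metadata:
              name: my-app-node-agent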

OK, enough background about Helm. Hypothetically, we are using a chart in our infrastructure, and we start with two collaborators in our story: the Chart author and the End-Users, who we can assume are some infrastructure engineers or app devs that install an app from a chart on their Kubernetes cluster.

The drift is coming from inside the house!
So what’s a 3-way merge?

Installing a Helm chart on a cluster means taking two visions and mashing them together in a well-defined way, combining the author’s defaults with any overrides from the end-user. This is true whether we are using built-in capabilities configured through values or others we have added via postRenderers.

But as already alluded to, this is neither the beginning nor the end of the story; the chart release runs as installed for some time, and it eventually gets changed one way or the other.

This change can come in one of two ways: an upgrade (the well-defined changes), or in the form of configuration drift.

A 3-way merge accounts for the possibility that intentional configuration drift sometimes happens between the point of install and subsequent upgrades, and assumes that when it happens, it has likely been done on purpose. The upshot is that the drift is considered worth preserving, even in the presence of specific well-defined configuration that disagrees!

Helm docs provide further background on how this is an improvement on earlier behavior. For me, the jury is still out.

The best coverage I found of how it can look when this goes awry was in an explanation provided by the devtron blog.

The tl;dr is that our declared configuration is only applied once until it changes; when a cluster operator overrides it between install and upgrade time, their configuration drift will win out, and our subsequent attempts to repair will only resolve the drift insofar as we present a new, overriding diff as compared against the prior installed configuration.

So, do we like configuration drift now?

If you know about Flux and GitOps, or let’s say even if you’re still reading and following me at all, it should come as no surprise that this behavior sounds like a strange and foreign concept (and perhaps even a wrong decision) to many people who have been long-time practitioners in the field of DevOps, or even in the Kubernetes space!

“Configuration drift is extremely bad, full-stop. I use [GitOps and] Kubernetes specifically to eliminate the possibility of configuration drift creeping in. That Helm has made this decision to explicitly allow and carry-forward any non-conflicting configuration drift that might have been introduced between releases, makes me question my decision to even use Helm.” — Some infrastructure engineer somewhere, probably.

So, does Helm Controller flip the script and enforce declarative principles? No, it does not… we value upstream compatibility about as highly as we value semantic versioning, and so “correct behavior” in Helm Controller means doing exactly what Helm’s upstream client would have done when it is run by a human operator and given similar inputs to those expressed within the HelmRelease artifact.

Can we even get away from this undesirable behavior by committing to never use Helm? No, we cannot… 3-way merges are not and were never Helm’s exclusive idea. If we are using Kubernetes with kubectl apply, then we can observe the same behavior, plainly described in the Kubernetes concepts docs, so it is not unique to Helm or Flux.

Is it really as bad as described? I’m not so sure, in fact I don’t even think I have quite understood all of this correctly. Writing this article is a way to explore the behavior for me. We’re out here testing the limits, as always.

And I’m honestly not here to cast a vote on the utility of this idea, only providing background so we can understand enough about merges and drifts to hopefully continue into the next idea without getting lost.

Let’s talk through a practical example of drift and how I have decided to incorporate the side-effects of this behavior into my home lab and daily life for better understanding.

A concrete example with Jenkins, Flux, Kustomize, and Helm Controller

From my collection of many hats, one that I have been wearing on-and-off is the role of Jenkins-CI administrator. I even presented at KubeCon CloudNativeCon NA 2021, and a video of the talk I gave there should be available soon.

The subject, “Jenkins with Declarative Everything,” is meant to imply that rather than configuring the system by hand, twiddling drop-downs and entering secrets into text fields in the Jenkins administrative interface, we should do as much configuration as possible in the GitOps way: by writing some change into a YAML file and pushing to Git, then letting automated agents reconcile it as an update into the cluster.

When we run a Jenkins server, having Jenkins is usually not the actual end goal of our efforts and in many cases, as in mine, this responsibility falls on someone who doesn’t even have this as an actual defined part of their job description. (App devs who are responsible for running your own CI infrastructure, raise your hands!)

Ideally the Jenkins admin job can be as small as possible after the thing is stood up and functioning properly. In a case where Jenkins is on a small cluster that may only see modest to moderate use, I found that checking on the instance about once a week was the sweet spot: the bare minimum of work required to keep it online and out of disrepair, in lieu of any in-depth or more comprehensive approach to monitoring for trouble in paradise.

Installing Jenkins through the Helm Chart was my method of choice, since it comes configured out of the box to run jobs as pods with a Kubernetes agent. Test jobs should run in a clean environment that gets recycled after each instantiation.

In spite of our best efforts to make Jenkins completely declarative, the reality is that Jenkins does now and has always carried all of its instantiated configuration around in a bag of XML, which lives in Kubernetes on a PersistentVolume that recent versions of the chart attach to the instance through a StatefulSet.

The chart does not provide a setting for replicas on the StatefulSet because it is not expected to be used in a Highly Available configuration. There is some treatment of this idea in the docs, but it is bound to be more work than simply incrementing replicas, since additional identical Jenkins controllers would need to share the data in the PersistentVolume somehow.

Scaling replicas when they don’t want to be scaled

That sounds very complicated, and this might have been a good reason for not exposing the replicas count as a setting in the chart, but what I really wanted wasn’t High Availability, as I never intended to scale Jenkins beyond a single replica. I wanted to scale StatefulSet.spec.replicas from 1 to 0 and back again, really only about once or twice a week.

Sometimes I would do this in response to somebody’s issue, when Jenkins is not responding for some reason and it needs to be kicked. Most of the time I would do this ritually, on a weekly basis or as-needed to provide for early detection and prevention, (in other words, to pre-empt and avoid those calls.)

I haven’t been in charge of anyone’s production Jenkins instance for a while, but I used to do this by hand: kubectl scale -n jenkins sts jenkins --replicas=0, and a few minutes later again with --replicas=1. Then I found that Helm Controller has support (because Helm itself also has support) for something called a PostRenderer.

This feature lets us override parts of the rendered manifests that a chart author hasn’t thought necessary to expose as values!

I patched my long, dramatic HelmRelease with a comparatively simple Kustomize patch to apply our own setting for spec.replicas, even though the chart does not include it as a setting in values.yaml.
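A sketch of that patch, abbreviated down to the relevant stanza (the jenkinsci HelmRelease and the jenkins StatefulSet names match the listings below; everything omitted here stays as it was):

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: jenkinsci
  namespace: flux-system
spec:
  # ...chart, interval, values, and the rest of the long, dramatic release omitted...
  postRenderers:
    - kustomize:
        patchesStrategicMerge:
          # The chart has no replicas value, so patch the rendered StatefulSet.
          - apiVersion: apps/v1
            kind: StatefulSet
            metadata:
              name: jenkins
            spec:
              replicas: 0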

This provides for a demonstration of the fact that Helm will not override my drift with any supplied configuration, unless it represents a change from the last installed version:

(⎈ |oidc@moo:blog):~ (ubuntu-vm * u=)$ kj get po
No resources found in jenkins namespace.
(⎈ |oidc@moo:blog):~ (ubuntu-vm * u=)$ kj scale --replicas=1 sts jenkins
statefulset.apps/jenkins scaled
(⎈ |oidc@moo:blog):~ (ubuntu-vm * u=)$ kj get po
NAME        READY   STATUS     RESTARTS   AGE
jenkins-0   0/2     Init:0/1   0          7s

The declarative configuration says “Jenkins has 0 replicas” (our own addition), and we have edited it to 1 replica.

(That’s our drift… let’s see if it survives as we’ve understood!)

(⎈ |oidc@moo:blog):~ (ubuntu-vm * u=)$ kff get hr
NAME                    READY   STATUS                             AGE
ambassador-edge-stack   True    Release reconciliation succeeded   24h
jenkinsci               True    Release reconciliation succeeded   24h
kube-oidc-proxy         True    Release reconciliation succeeded   24h
openvpn-as              True    Release reconciliation succeeded   24h
(⎈ |oidc@moo:blog):~ (ubuntu-vm * u=)$ flux reconcile hr jenkinsci
► annotating HelmRelease jenkinsci in flux-system namespace
✔ HelmRelease annotated
◎ waiting for HelmRelease reconciliation
✔ applied revision 3.8.4
(⎈ |oidc@moo:blog):~ (ubuntu-vm * u=)$ kj get po
NAME        READY   STATUS    RESTARTS   AGE
jenkins-0   1/2     Running   0          34s

We reconcile the HelmRelease, to confirm it will not override our manually introduced drift. (It doesn’t!)

(⎈ |oidc@moo:blog):~ (ubuntu-vm * u=)$ flux reconcile ks jenkins
► annotating Kustomization jenkins in flux-system namespace
✔ Kustomization annotated
◎ waiting for Kustomization reconciliation
✔ applied revision main/f6a2cbd6b7cea95bfe6cc8c98b2f60b96bd90789
(⎈ |oidc@moo:blog):~ (ubuntu-vm * u=)$ kj get po
NAME        READY   STATUS    RESTARTS   AGE
jenkins-0   2/2     Running   0          58s

Just to be sure, we take the Flux Kustomization that applied the HelmRelease and reconcile it, seeing the same behavior.

Though our declarative configuration describes a StatefulSet with zero replicas, the 3-way merge preserves the drift, and it will not be overridden by subsequent reconciliations until it represents a diff from the last applied configuration.

Writing drifts such that they are overridden

We’ve just seen something very powerful happen. Now is a good time to re-center ourselves on goals and ask whether this really does what we need, or whether we should keep looking.

I have created a calendar reminder for 7am on every Saturday and Sunday to shut Jenkins off, and the expectation is currently that I will do it manually. I used to run helm install by hand and maintained an old-world cron job which ensured that Jenkins was online every weekday morning, running kubectl scale --replicas....

There were reasons to set it up this way. This time, with GitOps, it will be different. But have we actually enabled a useful workflow? And is it worth it to handle any part of this manually, or should it be totally automated?

Why turn down Jenkins on the weekend?

We’ve explained that a goal is to turn Jenkins off and on again once in a while. This end should satisfy goals in other arenas: we like turning resources off when we’re not using them, and we don’t like working on the weekend.

(We also value flexibility, as we don’t like an un-thinking robot telling us when we can or cannot work.)

So it should be easy to override the schedule, and to do so and walk away should generally not result in extended periods of unexpected downtime or wasteful execution. In other words, if automation is turning resources up and down, it should not behave like a useless machine, and it should require as little intervention as possible to keep the system in operation.

I imagined scaling down the StatefulSet by hand on Saturday and Sunday morning, say about 7:30am, with a robotic process that turns it up again, 7 days a week at about 6:30am. Why manually? This is my weekly opportunity to check for issues like runaway processes, necessary upgrades, or administrative security alerts that pop up in the Jenkins UI.

If we don’t find value in this process after repeating it for a few weeks, the “turn-down” part can be automated but the checking-in cannot. We do not have or plan to implement sophisticated monitoring, so someone will need to verify that important updates are being applied and normal function continues to operate as expected. (Users may even handle this!)

So it should be clear that I expected this part of the process to remain manual, at least until it became someone else’s job. So, whenever that person comes along, what type of issues should we train them to watch out for?

Time for more background!

First and foremost, my Jenkins instance was hosted in the cheapest possible and yet most practical way I knew how: on a “Reserved” EC2 instance of type T2 or T3, which carries a balance of CPU and I/O credits.

Because this machine saw infrequent use (except on days that were very close to release time) it was practical, in spite of best-practices advice to the contrary, to run these Kubernetes nodes on burstable instances in the default credit-based “limited” mode. This worked as long as the load averages could remain low during periods of inactivity.

(Think: we should not be burning CPU credits at idle, else this resource is in a permanently wasteful form.)

I mentioned that sometimes things go wrong; we can have runaway jobs that fail to complete, and keep running or keep re-scheduling until cancelled. We can have someone who figures out a way to make Jenkins push a commit, which triggers a new job, which triggers a new commit, in a run-away loop. Both good reasons to decline enabling “Unlimited mode.”

I worked with our AWS experts and set up a simple monitoring solution so I could get an e-mail alert whenever the credit balance dipped beneath a threshold. The metrics are called CPUCreditUsage and CPUCreditBalance, and if you pay attention, this serves a dual purpose: one can also get a cheap sense of whether the Jenkins server is actually being used or not.
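The exact monitoring plumbing isn’t important here, but to give an idea of the shape of the thing, a CloudFormation-style alarm on CPUCreditBalance might look roughly like this (the threshold, instance ID, and SNS topic are placeholders; our real setup was whatever the AWS experts preferred):

Resources:
  JenkinsCpuCreditAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Jenkins node CPU credit balance is running low
      Namespace: AWS/EC2
      MetricName: CPUCreditBalance
      Dimensions:
        - Name: InstanceId
          Value: i-0123456789abcdef0            # placeholder instance
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      Threshold: 50
      ComparisonOperator: LessThanThreshold
      AlarmActions:
        - arn:aws:sns:us-east-1:123456789012:jenkins-alerts   # placeholder topic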

CPUCreditBalance has a ceiling per instance, so when it reaches the maximum, every Jenkins job that runs for longer than a few seconds becomes clearly visible on the graph as a downward blip, and the balance recovers during inactivity.

So we’d like to see the credit balance hit the ceiling, but it should not stay there for too long. On the other end, we should not see the balance hitting the floor too often, as that may represent an availability issue that could suggest we have not allocated enough resources to meet the actual demand.

Troubleshooting Jenkins

At least 10 times in my tenure as a production Jenkins admin, I saw CPU credits approaching zero, and at least eight times out of ten, the first step I took toward fixing it was to scale Jenkins down, then clean up any stray pods in Pending, Error, or Evicted states, and finally scale Jenkins back up again. This might have been “all it needed.”

Around 50% of the time or more, issues like this were, at the root, caused by insufficient attention to request and limit settings for memory and CPU, which always required some fine-tuning to right-size. Getting these settings right reduced the incidence of those random cascading trouble events, or kept them from escalating all the way to the “alerting” stage.

Sometimes (more rarely) we found these alerts were the result of an actual person pushing many commits for a few days straight. Alerting here was a good indicator that one might be nearing an actual burnout, and could be ready for a break! Or maybe it’s just that time of the quarter, and we’re doing a big release that needs lots of testing.

If you check the graph (and just use your imagination or trust me if you can’t see the graphic), you’ll see that some non-zero CPU credits are used when Jenkins is running, and fewer (but still non-zero) CPU credits are also used when Jenkins is not running. This is so, even without any active Jenkins job workloads present or pending in the running queue.

But in either case, the balance should be recovering credits at a positive rate, as there is no workload or “subject-under-test” and Jenkins ought not use more than the baseline CPU credit accrual, or our machine instance is likely too small.

If the trouble has lasted long enough to generate an alert from CPUCreditBalance, it may also be time to shut Jenkins down and leave it off a while. Turning Jenkins down on Saturday and Sunday helps ensure we have plenty of CPU credits on Monday morning, as long as the instance isn’t actually burning balance credits at the baseline without work to do.

(This idea is basic micro-economics, so it should hopefully make sense even without any technical background.)

Why not just turn it completely off?

Turning the instance off is another alternative that we considered, but instances do not accrue CPU credits while they are powered off, and we found they reset their credit balances when turned off for long enough, so that wouldn’t work. We preferred to allow “running out of credits” on Friday afternoon and gaining them back with a much-needed rest.

So we need idle periods where the instance can accrue credits, in order to account for production loads throughout the week. Jenkins (the EC2 node) is therefore production and has to run 24/7. However, Jenkins (the service) can have scheduled periods of downtime that we can also override to access the CPU credits when working on the weekend.

This arrangement gives us some enhanced degree of control and visibility over the testing machinery as a bottleneck!

Regularly scheduled downtime as configuration drift

That’s everything we need to know about how and why downtime affects our use of Jenkins, and how that interacts with the real people on our team, who we hope will also take advantage of Jenkins and make use of it in their daily workflows.

We have seen that Jenkins can be set declaratively to have a non-default scale of 0 and that we can override it with a drift which remains persistent. It then follows that we cannot expect the reconciler to restore a declared setting of 1, scaling Jenkins back up periodically, if we are visiting Jenkins on the weekend to turn it down. (Or can we?)

We can’t expect Helm to reconcile configuration drift without being provided new inputs because of 3-way merge. If we are careful about how we introduce the drift, though, we can actually use Flux to correct drift – a Flux Kustomization uses (server-side) apply, so it should definitely revert any drift that was created via kubectl apply!

This is true since there’s only one annotation for last-applied-configuration and all clients using apply will have to share it! This is probably going to be an abusive pattern, but we’re still going to try and test the limits to see how it works out. Drift created any other way, like kubectl scale, is still a thing but it is no longer in scope for this discussion.

What option do we go with?

We’re going to use a Flux Kustomization to apply the StatefulSet.spec.replicas count of 1 to Jenkins, outside of (after) our HelmRelease which we have seen will not revert changes introduced by outside configuration drift.

Then when we want to create the scheduled downtime, we will use kubectl apply to put the new configuration onto the cluster in a way that should be reverted on the next Kustomization.spec.interval.
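In the repository, that tracked intent can be as small as a deliberately partial manifest (the file layout is my own guess; the resource names match the rest of this post):

# jenkins-scale/jenkins-replicas.yaml
# Only the fields we want Flux to own. Because the StatefulSet is created by
# the HelmRelease first, this lands as a server-side apply patch rather than
# trying to create a new (and invalid) StatefulSet from scratch.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: jenkins
  namespace: jenkins
spec:
  replicas: 1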

Will this actually work? A quick reading on the actual difference between apply and patch semantics suggests it should, but let’s Flux Around and Find Out!

Undo that HelmRelease replicas postRenderer kustomization

We’re not going to need this patch anymore, we think, so let’s put things back the way they were in the default configuration.

We’ll keep the patch around as an example in case we need it again later, but it’s not patching anything now.

The state of our StatefulSet, before and after this change:

$ kj get po
NAME        READY   STATUS    RESTARTS   AGE
jenkins-0   2/2     Running   0          20h

We know the change went through despite there being no visible change, because an event confirms something actually happened:

$ kff describe hr jenkinsci|tail -5
Events:
  Type    Reason  Age   From             Message
  ----    ------  ----  ----             -------
  Normal  info    63s   helm-controller  Helm upgrade has started
  Normal  info    61s   helm-controller  Helm upgrade succeeded

Great. Now we need a place to apply the patch from, and a guarantee that the StatefulSet is first created successfully (before we can try patching it via an apply in order to shut it down).

In this pull request, we use an apply after Jenkins’ StatefulSet and PersistentVolumeClaim are ready, to adjust the scale.
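The gist of that pull request, reconstructed from memory rather than copied (the apiVersion will vary with your Flux version, and the exact health-check wiring is in the linked pull request):

apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: jenkins-scale
  namespace: flux-system
spec:
  # The interval doubles as "how long any scale drift is allowed to live".
  interval: 10m
  path: ./jenkins-scale
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  # Wait for the Kustomization that delivers the HelmRelease, so the
  # StatefulSet and its PersistentVolumeClaim exist before we patch them.
  dependsOn:
    - name: jenkins
  healthChecks:
    - apiVersion: apps/v1
      kind: StatefulSet
      name: jenkins
      namespace: jenkins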

$ kj get po -w
NAME        READY   STATUS    RESTARTS   AGE
jenkins-0   2/2     Running   0          21h
jenkins-0   2/2     Terminating   0          21h
jenkins-0   0/2     Terminating   0          21h
jenkins-0   0/2     Terminating   0          21h
jenkins-0   0/2     Terminating   0          21h

That worked! The patch scales Jenkins down to zero. We can now flip the bit, making the default scale 1 and try this again, now in the opposite direction, to simulate a Sunday morning scheduled downtime.

Adjusting the jenkins-scale Flux Kustomization.spec.interval lets us roughly determine how long the downtime should last, in a way that gets us most of the way to where we were going, without any need for introducing another CronJob yet.

First we configure the patch to scale replicas to 1. Merging this takes some time to have an effect because of the dependency ordering. But after a little while, things are quickly springing to life again:

$ kj get po -w
NAME        READY   STATUS    RESTARTS   AGE
jenkins-0   0/2     Pending   0          0s
jenkins-0   0/2     Pending   0          0s
jenkins-0   0/2     Init:0/1   0          0s
jenkins-0   0/2     Init:0/1   0          2s
jenkins-0   0/2     PodInitializing   0          20s
jenkins-0   1/2     Running           0          21s
jenkins-0   1/2     Running           0          42s
jenkins-0   2/2     Running           0          50s

Now we can try patching via apply to shut it down, and then trigger a sync manually to see if our Flux kustomization reverts the specific drift we’ve caused on the next reconciling interval, as we expect!

If so, we’re basically finished as we got what we came for, and there should be no more missing puzzle pieces.

$ echo '{"apiVersion":"apps/v1","kind":"StatefulSet","metadata":{"name":"jenkins","namespace":"jenkins"},"spec":{"replicas":0}}'| kubectl apply -f - --validate=false

We need to use --validate=false because we are applying a deliberately partial StatefulSet; without that flag, client-side validation rejects it, per the output of the apply run without it:

error: error validating "STDIN": error validating data: [
  ValidationError(StatefulSet.spec): missing required field "selector" in io.k8s.api.apps.v1.StatefulSetSpec,
  ValidationError(StatefulSet.spec): missing required field "template" in io.k8s.api.apps.v1.StatefulSetSpec,
  ValidationError(StatefulSet.spec): missing required field "serviceName" in io.k8s.api.apps.v1.StatefulSetSpec];
if you choose to ignore these errors, turn validation off with --validate=false

After the patch takes effect, we run flux reconcile to simulate the Kustomization.spec.interval (or, we can imagine that interval is set to something very long, and a new CronJob does this part for us daily at a scheduled time, say 6:30am.)

$ kj get po -w
$ echo '{"apiVersion":"apps/v1","kind":"StatefulSet","metadata":{"name":"jenkins","namespace":"jenkins"},"spec":{"replicas":0}}'| kubectl apply -f - --validate=false
$ flux reconcile ks jenkins-scale   # (in three separate terminal windows, we can watch the progress)
...

NAME        READY   STATUS    RESTARTS   AGE
jenkins-0   2/2     Running   0          34m
jenkins-0   2/2     Terminating   0          36m
jenkins-0   0/2     Terminating   0          36m
jenkins-0   0/2     Terminating   0          36m
jenkins-0   0/2     Terminating   0          36m
jenkins-0   0/2     Pending       0          0s
jenkins-0   0/2     Pending       0          0s
jenkins-0   0/2     Init:0/1      0          0s
jenkins-0   0/2     Init:0/1      0          2s
jenkins-0   0/2     PodInitializing   0          22s
jenkins-0   1/2     Running           0          23s
jenkins-0   1/2     Running           0          42s
jenkins-0   2/2     Running           0          50s

We’ve just done something which I thought we could not do! We added some drift to a HelmRelease with a manual process, and we used an automated (non-Helm-based) GitOps process to resolve it in a way that automatically repeats.

While this is far from anything I would term “strong policy enforcement,” it does the job, and it uses Flux to do it. The simple caveat is that we can only mitigate drift in a Helm chart if we have specifically anticipated the particular drift.

Ugly, but it works!

Sure it ain’t pretty, but as an example of what is possible with Flux, this article has covered an awful lot of ground.

If you have ideas for how it could be better, while keeping the goals and full Helm compatibility in mind, I’m all ears.

Left as exercises for the reader: we can wonder whether a basic CronJob resource could be added to make the process more precise or more automated, and even if there might be an interval setting that could make a CronJob unnecessary.
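As a starting point for that exercise, a rough sketch of the “turn-down” half as a CronJob could look something like this (the schedule, image, and ServiceAccount are all assumptions, and the RBAC for jenkins-scaler is left out; it simply repeats the partial kubectl apply from above so Flux can revert it later):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: jenkins-weekend-scale-down
  namespace: jenkins
spec:
  schedule: "30 7 * * 6,0"   # 07:30 on Saturday and Sunday
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: jenkins-scaler   # hypothetical; needs apply/patch rights on the StatefulSet
          restartPolicy: OnFailure
          containers:
            - name: scale-down
              image: bitnami/kubectl:1.22      # any image with kubectl works
              command:
                - /bin/sh
                - -c
                # The same partial apply as the manual step, so the drift shares
                # last-applied-configuration and the jenkins-scale Kustomization
                # reverts it on its next interval.
                - |
                  echo '{"apiVersion":"apps/v1","kind":"StatefulSet","metadata":{"name":"jenkins","namespace":"jenkins"},"spec":{"replicas":0}}' \
                    | kubectl apply -f - --validate=false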

Thanks for following through to the end, and I hope you enjoyed reading this as much as I have enjoyed writing it!