When deploying OpenShift clusters, there's nothing more frustrating than kicking off a deployment and realizing halfway through that it failed because of a bad config: a typo, an invalid region, or some other small, preventable error. I recently ran into this myself when I attempted to deploy a cluster into an unsupported AWS region, and I decided I'd finally had enough. There had to be a way to prevent this self-inflicted insanity.
What's The Problem?
Let's first take a look at the problem we're trying to solve.
In OpenShift, if you want to pass any customizations to the installer, you do so through a file called install-config.yaml. This file can be used to tune general cluster settings, but it is especially useful with cloud providers, where you need to specify things like regions and availability zones. In this specific instance I was working with AWS, so this example will be centered around that use case (although the approach is applicable anywhere you run the installer).
The particular issue that popped up was that I was attempting to deploy a cluster and hadn't given any thought to where I was deploying it (the wonders of cloud infrastructure!), as I just wanted a cluster to play around with. However, because of some of the requirements that OpenShift has, this was a problem. In a given release of OpenShift, only certain approved regions can support a deployment due to various requirements. As this specific region didn't meet all of the requirements and wasn't on the list of approved regions in the documentation, things did not go well.
Note: The list of approved regions for any given release can be found here.
So there I was, in a state of despair because I had sat there and wasted my precious time only for the cluster build to fail.
It was at this point that I decided enough was enough. I've got to stop doing this to myself. Surely there's a way to save future me some pain and suffering.
Luckily, as mentioned above, this documentation is available and easy enough to find. But I wanted things to be a bit more actionable. I want something that allows me to fight back against the YAML! So I got to thinking: what are some of my favorite tools to abuse YAML with?
Open Policy Agent and Rego
That's right! Time to write some Rego policies so that these types of things are checked when I push them into my version control of choice. That way, not only can I run these checks while I'm doing any sort of development, but I can also wire this up within my CI/CD processes so that anytime these things change, I can make sure they'll still be functional.
If you're not familiar with Open Policy Agent (OPA) and Rego, I highly suggest you take a look here. At a high level, OPA is a general purpose policy engine that allows you to store your policy as code in a language called Rego. With these tools, I can describe policies that look at the value contained in specific parts of my YAML to ensure it is contained in an approved list and show an appropriate message if things go wrong. So let's get started!
Install Config
The first thing we'll need is the problem child. Our install-config.yaml is the source of our issues here, so we'll want to make sure we have one to test against. For our purposes, we'll just steal the one provided as an example on the OpenShift docs site (which you can find here). We'll make one change to the example provided by the OpenShift team, swapping the region from us-west-2 to eu-north-1 for the purposes of our example:
apiVersion: v1
baseDomain: example.com
controlPlane:
  hyperthreading: Enabled
  name: master
  platform:
    aws:
      zones:
      - us-west-2a
      - us-west-2b
      rootVolume:
        iops: 4000
        size: 500
        type: io1
      type: m5.xlarge
  replicas: 3
compute:
- hyperthreading: Enabled
  name: worker
  platform:
    aws:
      rootVolume:
        iops: 2000
        size: 500
        type: io1
      type: c5.4xlarge
      zones:
      - us-west-2c
  replicas: 3
metadata:
  name: test-cluster
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineCIDR: 10.0.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  aws:
    region: eu-north-1
    userTags:
      adminContact: jdoe
      costCenter: 7536
pullSecret: '{"auths": ...}'
sshKey: ssh-ed25519 AAAA...
Now that we have this content, we can save it to a file called install-config.yaml, and we have the base of what we're going to test with. Next, let's write a policy that will make sure our region is supported. We'll write an initial policy for OpenShift 4.1 (where eu-north-1 was not supported) and then a separate policy for OpenShift 4.5 (where eu-north-1 is supported).
Make Things Actionable
Understanding Rego Policies
So let's get started with our policy. The first thing to do is understand what exactly makes up a policy. To do that, let's take a look at a basic example.
<action>[msg] {
    myvar = 1
    input.my.yaml.struct != myvar
    msg := sprintf("%s is broken. This is my message", [input.my.yaml.struct])
}
So what does all of that actually mean? Well, the first thing you'll want to decide is your action. Do you simply want to alert people to things? Or do you want to throw a hard error when something isn't as expected? This is where you decide the level of intervention you want to present to your user, and what you'll see here most frequently is warn or deny. When you use warn, you'll see the message associated with your policy, but it won't return an error code. On the flip side, if we use the deny action, it will sound the alarms and return an error code that alerts you to a policy violation (as well as still showing the message you've defined). It's important to think about this upfront: what is your strategy going forward for handling things like deprecations, where you may want to move a policy from throwing a warning to throwing an actual error?
Example:
<action>[msg] {
What you'll see next, inside of the brackets, is the actual content of your policy. There are a number of ways to store values inside of variables, so we won't cover all of them; if you're interested in what is available, please refer to the documentation here. The important takeaway is that you can define the set of values you consider approved (or disapproved) and then test the value in your YAML against it. In our example above, we're just storing the value 1 inside of myvar.
Example:
myvar = 1
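For reference, here are a few other binding forms you'll commonly see in Rego. Arrays, sets, and objects can all hold the values you want to compare against (a quick sketch with made-up values, not specific to our use case):

# Numbers, arrays, sets, and objects are all fair game in Rego
approved_count := 3
approved_names := ["master", "worker"]            # array: ordered, allows duplicates
approved_regions := {"us-east-1", "us-west-2"}    # set: unordered, unique members
defaults := {"replicas": 3, "type": "m5.xlarge"}  # object: key/value pairs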
The interesting part comes next. Now that we know what value we want to compare against, we need to decide what our "rule" is. Do we want to make sure that a certain part of our configuration matches something? Do we want to make sure it isn't set to a specific value? What happens here is the evaluation of whether your policy should be triggered or not. In our example, we just want to make sure that a config value in our YAML isn't equal to our value in myvar.
Example:
input.my.yaml.struct != myvar
The final part of our policy is the "thing" that should be thrown should the policy get triggered. In our case, we're just showing a message letting everyone know that things are not as they should be. We can do all kinds of formatting here, or we could just pass in a simple string. The important piece to take away is that this is what lets the user know what is going on, so make it useful!
Example:
msg := sprintf("%s is broken. This is my message",[input.my.yaml.struct])
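Assembled, and with the package declaration that every Rego file needs at the top (ConfTest looks at the main package by default), the full toy policy looks like this. The paths and values are still the made-up ones from above:

package main

deny[msg] {
    myvar = 1
    input.my.yaml.struct != myvar
    msg := sprintf("%s is broken. This is my message", [input.my.yaml.struct])
}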
Building A Policy For OpenShift install-config.yaml
So now that we understand what we're seeing when looking at a policy, let's build one for our use case. To remind ourselves what that is: we want to make sure that we are only providing appropriate values for an AWS region. If we don't have an appropriate value, we want all of the alarms to go off so that we don't waste our time waiting for a failed deployment. Since we want things to break quickly, that leads us to building a deny policy. To start, that means we'll have something that looks like this:
deny[msg] {
}
By itself, this isn't useful. Why? Because we haven't told it to do anything. So let's move to the next step. Now that we know we want to deny things, what is it that we want to deny? We need to make sure that the value provided is within a certain set of values that we know are approved. In this case, that means we need a bunch of strings that align with our approved AWS regions. We're going to start off with building a policy for an older version of OpenShift (4.1) because a) it gives us more broken options to play with and b) we want a set of policies that allows us to inspect changes when we're doing things like upgrades (i.e. moving from 4.1 to 4.2 to 4.x).
So after looking at the approved regions for OpenShift 4.1, we end up with this list:
- ap-northeast-1
- ap-northeast-2
- ap-south-1
- ap-southeast-1
- ap-southeast-2
- ca-central-1
- eu-central-1
- eu-west-1
- eu-west-2
- eu-west-3
- sa-east-1
- us-east-1
- us-east-2
- us-west-1
- us-west-2
So now that we know what values we're working with, we can add them to our policy. Since we're working with a group of unique strings, we can shove them into a set so that we can check whether the value in our config is a member of it. This ends up looking like this:

deny[msg] {
    regions := {"ap-northeast-1", "ap-northeast-2", "ap-south-1", "ap-southeast-1", "ap-southeast-2", "ca-central-1", "eu-central-1", "eu-west-1", "eu-west-2", "eu-west-3", "sa-east-1", "us-east-1", "us-east-2", "us-west-1", "us-west-2"}
}
With that, we know what we want to check against. But now we need the part of the config that we're checking. The install-config.yaml file contains the AWS region under the .platform.aws.region part of the config, so what we need to inspect is input.platform.aws.region. This says "for whatever file we're inspecting (input), grab the value at .platform.aws.region and compare it to something (the rest of our rule)". So how do we check that value against our set? Because regions is a set, regions[input.platform.aws.region] is true only when the provided region is a member, and negating it with not regions[input.platform.aws.region] triggers the rule only for regions outside the set. (A tempting alternative, input.platform.aws.region != regions[_], doesn't do what you'd hope: Rego reads it as "the region differs from some element", which is true even for supported regions, so the rule would fire on perfectly good configs.) This ends up giving us the below policy:
deny[msg] {
    regions := {"ap-northeast-1", "ap-northeast-2", "ap-south-1", "ap-southeast-1", "ap-southeast-2", "ca-central-1", "eu-central-1", "eu-west-1", "eu-west-2", "eu-west-3", "sa-east-1", "us-east-1", "us-east-2", "us-west-1", "us-west-2"}
    not regions[input.platform.aws.region]
}
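As an aside, if you'd rather keep the regions in an array (say, to preserve the documentation's ordering), the same membership test can be written with a small helper rule. A sketch of that alternative, with the list abbreviated and a placeholder message since we haven't built the real one yet:

deny[msg] {
    regions := ["ap-northeast-1", "ap-northeast-2", "us-west-2"]  # abbreviated; use the full 4.1 list
    not region_listed(regions, input.platform.aws.region)
    msg := "unsupported region"  # we'll build a friendlier message next
}

# True if elem appears anywhere in arr
region_listed(arr, elem) {
    arr[_] == elem
}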
Great! We're almost there! Now we just need to provide a useful message to let folks know what is going on. To do this, we just need to set the variable we're using for our message (msg) to what we want to display to our users (i.e. ourselves). We can do something like the below:
msg := sprintf("%s is not a supported region. Please reference the associated list for supported regions.", [input.platform.aws.region])
What this does is allow us to be a bit dynamic about the message we're presenting to users. Instead of just saying that they're using an unsupported region, we can use sprintf to provide a formatted message and pass in what was provided by the config file using input.platform.aws.region. In that manner, we let them know both what is going on and the problematic configuration value. Pulling all of this together, we end up with a final policy that looks like this:
package main

deny[msg] {
    regions := {"ap-northeast-1", "ap-northeast-2", "ap-south-1", "ap-southeast-1", "ap-southeast-2", "ca-central-1", "eu-central-1", "eu-west-1", "eu-west-2", "eu-west-3", "sa-east-1", "us-east-1", "us-east-2", "us-west-1", "us-west-2"}
    not regions[input.platform.aws.region]
    msg := sprintf("%s is not a supported region. Please reference the associated list for supported regions.", [input.platform.aws.region])
}
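And remember the earlier discussion about warn versus deny: if a region were merely headed for deprecation rather than already unsupported, the same shape under a warn rule would surface the message without failing the run. A hypothetical sketch (the deprecation list here is made up):

warn[msg] {
    deprecated := {"eu-north-1"}  # hypothetical deprecation list
    deprecated[input.platform.aws.region]
    msg := sprintf("%s is deprecated and may be unsupported in a future release.", [input.platform.aws.region])
}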
And with that, we almost have something actionable. We have our configuration file and a policy to run it against. But now you might be asking yourself: "How do we run this?"
Open Policy Agent and ConfTest
While Rego is the policy language we use to assemble our policies, we still need something to run those policies with. If you have a cluster and you want to actively evaluate policies, you can run an instance of Open Policy Agent and its associated tooling. In our case, however, we just want to check things on demand (or on some recurring basis, such as when changes get checked in or a pull request is submitted). For that, we can use another tool from the Open Policy Agent project called ConfTest. ConfTest allows us to specify a file or directory of files that we want to inspect, along with the set of policies to inspect them with. It then evaluates the policies and tells us the results (i.e. the messages, how many policies were checked, and whether they passed). This tool is much better suited to our use case, so this is what we'll proceed with. To get ConfTest, grab the latest release from here.
Note: Just make sure you grab the appropriate release for the OS that you are working with.
Pulling It All Together
So now we have it all: our configuration file, our policy, and the tool to bind them together. Let's take a look at how we do that. Given that we have our configuration file stored in install-config.yaml, our policy stored in a file called ocp-4.1.rego (ConfTest loads Rego policies from .rego files, so the policy file gets a .rego extension rather than .yaml), and conftest available on our path, we can just run the following:

conftest test -p ocp-4.1.rego install-config.yaml
Given that the above command executes successfully, we should then see our output. Since we've purposefully given it a file that violates our policy, we should expect a failure and see our deny message.
FAIL - install-config.yaml - main - eu-north-1 is not a supported region. Please reference the associated list for supported regions.
1 test, 0 passed, 0 warnings, 1 failure, 0 exceptions
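As for the OpenShift 4.5 policy we promised earlier, it's the same shape with the newer region list. I won't reproduce the full list here (pull it from the 4.5 documentation), but the key change for us is that eu-north-1 is now a member, so the deny rule no longer fires. A sketch, saved as something like ocp-4.5.rego:

package main

deny[msg] {
    # Abbreviated: copy the full supported-region list from the OpenShift 4.5 docs.
    # The change that matters for our example is that eu-north-1 is now included.
    regions := {"eu-north-1", "us-east-1", "us-west-2"}
    not regions[input.platform.aws.region]
    msg := sprintf("%s is not a supported region. Please reference the associated list for supported regions.", [input.platform.aws.region])
}

Running conftest test -p ocp-4.5.rego install-config.yaml against the same config should then come back clean.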
And just like that, we're now able to test our configurations against our policies. Before kicking off our cluster builds, we can run them against the policy for a specific version and know that we're in good shape. Right now we're only putting a policy in place for regions, but the sky is the limit for what you'd like to enforce. Does your organization only want certain machine types to be used? Do you want to make sure that each cluster contains a certain set of tags? Write as many policies as you'd like! Then just make sure to have a process in place that checks the results before you kick off a build, and save yourself all of that wasted time we've seen in the past.
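For instance, here are a couple of hedged sketches of the kinds of policies those questions imply. The field paths match our install-config.yaml above, but the approved instance types and the required tag are made up for illustration:

deny[msg] {
    # Hypothetical list of instance types your organization allows
    allowed_types := {"m5.xlarge", "c5.4xlarge"}
    not allowed_types[input.controlPlane.platform.aws.type]
    msg := sprintf("%s is not an approved control plane instance type.", [input.controlPlane.platform.aws.type])
}

deny[msg] {
    # Hypothetical required tag: every cluster must declare a cost center
    not input.platform.aws.userTags.costCenter
    msg := "All clusters must set platform.aws.userTags.costCenter."
}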