From Pebbles to Brickworks: a Story of Cloud Infrastructure Evolved

RJ Zaworski - Aug 18 '20 - Dev Community

You can build things out of pebbles. Working with so many unique pieces isn’t easy, but if you slather them with mortar and fit them together just so, it’s possible to build a house that won’t tumble down in the slightest breeze.

Like many startups, that’s where Koan’s infrastructure started. With lovingly hand-rolled EC2 instances sitting behind lovingly hand-rolled ELBs inside a lovingly — yes — hand-rolled VPC. Each came with its own quirks, software updates, and Linux version. Maintenance was a constant test of our technical acumen and patience (not to mention nerves); scalability was out of the question.

These pebbles carried us from our earliest prototypes to the first public iteration of Koan’s leadership platform. But there comes a day in every startup’s journey when its infrastructure needs to grow up.

Motivations

The wolf chased them down the lane and he almost caught them. But they made it to the brick house and slammed the door closed.

What we wanted were bricks, uniform commodities that can be replicated or replaced at will. Infrastructure built from bricks has some significant advantages over our pebbly roots:

  1. Visibility. Knowing who did what (and when) makes it possible to understand and collaborate on infrastructure. It’s also an absolute must for compliance. Repeatable, version-controlled infrastructure supplements application changelogs with a snapshot of the underlying infrastructure itself.
  2. Confidence. Not knowing — at least, not really knowing — your infrastructure makes every change a nervous one. For our part, we didn’t really know ours, which isn’t a great position to be in when that infrastructure needs to scale.
  3. Consistency. Pebbles come in all shapes and sizes. New environment variables, port allocations, permissions, directory structure, and dependencies must be individually applied and verified on each instance. This consumes development time and increases the risk of “friendly-fire” incidents from any inconsistencies between different hosts (see: #2).
  4. Repeatability. Rebuilding a pebble means replicating all of the natural forces that shaped it over the eons. Restoring our infrastructure after a catastrophic failure seemed like an impossible task—a suspicion that we weren’t in a hurry to verify.
  5. Scalability. Replacing and extending are two sides of the same coin. While it’s possible to snap a machine image and scale it out indefinitely, an eye to upkeep and our own mental health encouraged us to consider a fresh start. From a minimal, reasonably hardened base image.

Since our work at Koan is all about goal achievement, most of our technical projects start exactly where you’d expect: with a goal. Here, that goal was reproducible infrastructure (or something closer to it), documented and versioned as code. We had plenty of expertise with tools like terraform and ansible to draw on and felt reasonably confident putting them to use—but even with familiar tooling, our initially shaky foundation didn’t exactly discourage caution.

That meant taking things step by gradual step, establishing and socializing patterns that we intended to eventually adopt across all of our cloud infrastructure. That’s a story for future posts, but the journey had to start somewhere.

Dev today, tomorrow the world

That “somewhere” was our trusty CI environment, dev. Frequent, thoroughly tested releases are both a reasonable expectation and a point of professional pride for our development team. dev is where the QA magic happens, and since downtime on dev blocks review, we needed to keep disruptions to a minimum.

Before dev could assume its new form, we needed to be reasonably confident that we could rebuild it:

  • …in the right VPC
  • …with the right Security Groups assigned
  • …with our standard logging and monitoring
  • …and provisioned with a working instance of the Koan platform

Four little tests, and we’d have both a repeatable dev environment and a template we could extend out to production.

We planned to tackle dev in two steps. First, we would document (and eventually rebuild) our AWS infrastructure using terraform. Once we had a reasonably plausible configuration on our hands, we would then use ansible to deploy the Koan platform. The two-step approach deferred a longer-term dream of fully immutable resources, but it allowed us to address one big challenge (the infrastructure) while leaving our existing deployment processes largely intact.

Replacing infrastructure with Terraform

First, the infrastructure. The formula for documenting existing infrastructure in terraform goes something like this:

  1. Create a stub entry for an existing resource
  2. Use terraform import to attach the stub to the existing infrastructure
  3. Use terraform state and/or terraform plan to reconcile inconsistencies between the stub and reality
  4. Repeat until all resources are documented

Here’s how we documented the dev VPC's default security group, for example:

$ echo '
resource "aws_default_security_group" "default" {
  # reference to a VPC that isn't represented in our Terraform
  # configuration yet (passed in as a variable for now)
  vpc_id = var.vpc_id
}' >> main.tf

With the stub in place, we could import the existing security group and then run terraform plan to see the difference between the real infrastructure and our Terraform config:

$ terraform import aws_default_security_group.default sg-123456
$ terraform plan
# module.dev-appserver.aws_default_security_group.default will be updated in-place
  ~ resource "aws_default_security_group" "default" {
      ~ egress                 = [
          - {                                 
              - cidr_blocks      = [
                  - "0.0.0.0/0",
                ]                    
              - description      = ""
              - from_port        = 0
              - ipv6_cidr_blocks = []
              - prefix_list_ids  = []
              - protocol         = "-1"
              - security_groups  = []
              - self             = false
              - to_port          = 0
            },
        ]
        id                     = "sg-123456"
    # ...
    }

Using the diff as an outline, we could then fill in the corresponding aws_default_security_group.default entry:

# main.tf
resource "aws_default_security_group" "default" {
  vpc_id = var.vpc_id
  ingress {
    protocol  = "-1"
    self      = true
    from_port = 0
    to_port   = 0
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Re-running terraform plan, we could verify that the updated configuration matched the existing resource:

$ terraform plan
...
No changes. Infrastructure is up-to-date.
This means that Terraform did not detect any differences between
your configuration and real physical resources that exist. As a 
result, no actions need to be performed.

The keen observer will recognize a prosaic formula crying out for automation, a call we soon answered. But for our first, cautious steps, it was helpful to document resources by hand. We wrote the configurations, parameterized resources that weren’t imported yet, and double-checked (triple-checked) our growing Terraform configuration against the infrastructure reported by the aws CLI.
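
We’ll save the details of that automation for a later post, but a rough sketch captures the spirit of it: loop the import over a list of known resources. The resource names and IDs below are purely illustrative, and each resource still needs a stub entry in the configuration before terraform import will accept it.

$ cat security-groups.txt
app      sg-111111
bastion  sg-222222

$ while read -r name sg_id; do
>   terraform import "aws_security_group.${name}" "${sg_id}"
> done < security-groups.txt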

Sharing Terraform state with a small team

By default, Terraform tracks the state of managed infrastructure in a local tfstate file. This file contains both configuration details and a mapping back to the “live” resources (via IDs, resource names, and in Amazon’s case, ARNs) in the corresponding cloud provider. As a small, communicative team in a hurry, we felt comfortable bucking best practices and checking our state file right into source control. In almost no time we ran into collisions across git branches—a shadow of collaboration and locking problems to come—but we resolved to adopt more team-friendly practices soon. For now, we were up and running.
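
For the record, the more team-friendly practice we had in mind was a remote backend with state locking. A minimal sketch of that setup (the bucket and table names here are made up, and this was not yet part of our configuration) looks like:

# backend.tf (illustrative only)
terraform {
  backend "s3" {
    bucket         = "koan-terraform-state" # hypothetical bucket
    key            = "dev/terraform.tfstate"
    region         = "us-west-2"
    dynamodb_table = "terraform-locks"      # hypothetical lock table
    encrypt        = true
  }
}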

Make it work, make it right.


Provisioning an application with Ansible

With most of our dev infrastructure documented in Terraform, we were ready to fill it out. At this stage our attention shifted from the infrastructure itself to the applications that would be running on it—namely, the Koan platform.

Koan’s platform deploys as a monolithic bundle containing our business logic, interfaces, and the small menagerie of dependent services that consume them. Which services run on a given EC2 instance will vary from one to the next. Depending on its configuration, a production node might be running our REST and GraphQL APIs, webhook servers, task processors, any of a variety of cron jobs, or all of the above.

As a smaller, lighter facsimile, dev has no such differentiation. Its single, inward-facing node plays host to the whole kitchen sink. To simplify testing (and minimize the damage to dev), we took the cautious step of replicating this configuration in a representative local environment.

Building a local Amazon Linux environment

Reproducing cloud services locally is tricky. We can’t run EC2 on a developer’s laptop, but Amazon has helpfully shipped images of Amazon Linux—our bricks’ target distribution. With a little bit of fiddling and a lot of help from cloud-init, we managed to bring up reasonably representative Amazon Linux instances inside a local VirtualBox:

$ ssh -i local/ssh/id_rsa dev@localhost -p2222
Last login: Fri Sep 20 20:07:30 2019 from 10.0.2.2
       __|  __|_  )
       _|  (     /   Amazon Linux 2 AMI
      ___|\___|___|
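
Most of the fiddling amounted to handing cloud-init a user-data file on the seed ISO the VM boots with. A minimal sketch (the user name, groups, and key are illustrative rather than our exact configuration) creates the dev user we ssh in as:

#cloud-config
# user-data (illustrative sketch): create the "dev" user that local
# ssh sessions and ansible runs connect as
users:
  - name: dev
    groups: wheel
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - ssh-rsa AAAA... # public half of local/ssh/id_rsa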

At this point, we could create an ansible inventory assigning the same groups to our "local" environment that we would eventually assign to dev:

# local/inventory.yml
appservers:
  hosts:
    127.0.0.1:
      ansible_port: 2222
cron:
  hosts:
    127.0.0.1:
      ansible_port: 2222
# ...

If we did it all over again, we could likely save some time by skipping VirtualBox in favor of a detached EC2 instance. Then again, having a local, fast, safe environment to test against has already saved time in developing new ansible playbooks. The jury’s still out on that one.

Ansible up!

With a reasonable facsimile of our “live” environment in hand, we were finally down to the application layer. ansible thinks about hosts in terms of their roles: databases, webservers, or something else entirely. We started by separating out two “base” roles for our VMs generally (common) and our app servers in particular (backend), where:

  • The common role described monitoring, the runtime environment, and a default directory structure and permissions (see the sketch just below)
  • The backend role added a (versioned) release of the Koan platform
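
To make the split concrete, here is the sort of task list that lives in the common role. The paths and package names are illustrative, not copied from our playbook:

# roles/common/tasks/main.yml (illustrative sketch)
- name: Create a standard application directory
  file:
    path: /opt/koan # hypothetical path
    state: directory
    owner: dev
    group: dev
    mode: "0755"

- name: Install the CloudWatch agent for logging and monitoring
  yum:
    name: amazon-cloudwatch-agent
    state: present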

Additional roles layered on top represent each of our minimally dependent services (api, tasks, cron, and so on), which we then assigned to the local host:

# appservers.yml 
- hosts: all
  roles:
  - common
  - backend
- hosts: appservers
  roles:
  - api
- hosts: cron
  roles:
  - cron

We couldn’t bring EC2 out of the cloud, but bringing up a local instance that quacked a lot like EC2 was now as simple as:

$ ansible-playbook \
  --user=dev \
  --private-key ./local/ssh/id_rsa \
  --inventory local/inventory.yml \
  appservers.yml

From pebbles to brickwork

With our infrastructure in terraform, our deployment in ansible, and all of the confidence that local testing could buy, we were ready to start making bricks. The plan (and there’s always a plan!) was straightforward enough:

  1. Use terraform apply to create a new dev instance
  2. Add the new host to our ansible inventory and provision it
  3. Add it to the dev ELB and wait for it to join (assuming provisioning succeeded and health checks passed)
  4. Verify its behavior and make adjustments as needed
  5. Remove the old dev instance (our pebble!) from terraform
  6. Rinse and repeat in production
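
In command-line terms, steps 1 through 3 boiled down to something like the sketch below. The inventory path, load balancer name, and instance ID are placeholders:

$ terraform apply

$ ansible-playbook \
    --inventory dev/inventory.yml \
    appservers.yml

$ aws elb register-instances-with-load-balancer \
    --load-balancer-name dev-elb \
    --instances i-0123456789abcdef0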

The entire process was more hands-on than anyone really wanted, but given the indeterminate state of our existing infrastructure and our guiding philosophy of making it work before making it right, step one was simply waving dev out the door.

Make it work, make it right.

Conclusion

Off it went! With only a little back and forth to sort out previously unnoticed details, our new dev host took its place as brick #1 in Koan’s growing construction. We extracted the dev configuration into a reusable terraform module and by the end of the week our brickwork stretched all the way out to production.

In our next post, we'll dive deeper into how we imported volumes of undocumented infrastructure into Terraform.


Big thanks to Ashwin Bhat for early feedback, Randall Gordon and Andy Beers for helping turn the pets/cattle metaphor into something more humane, and EMAR DI on Unsplash for the cover image.

And if you’re into building software to help every team achieve its objectives, Koan is hiring!
