DevOps at Adzerk
“A system of local optimums is not an optimum system at all.” - Dr. E. Goldratt
DevOps, the mutant offspring of software development and the subsequent operations to keep that software functioning, is an aberration in most organizations. Most engineering groups create a cultural wall between the developers and the people tasked with installing and running the software. Instances where the disciplines are considered to be a unified skillset are few and far between. Looking around the landscape of technology companies, I am happy to see that this artificial division is eroding.
At Adzerk there is no separation between development and operations. All engineers are expected to develop and master skills of writing code and keeping the business operations functioning.
IS THIS YOU?
If you find yourself squirming uncomfortably at the idea of developers deploying code to production environments, you probably work in an environment that discourages mixing of dev and ops roles. Developers write the code. Operations engineers install the code.
Perhaps there’s a handoff, with a lot of ceremony or maybe a little. Perhaps you have coordinated release reviews where the heads of departments show up and grunt and nod and sign off on a release. Or maybe you also have a long verification process filled with checklists that takes days or weeks to finish. There are probably also volumes of expected documentation, which act as a contract designed to insulate your team from blame.
Worse still, if there is a problem with the deployed code there is a frustrating delay to fix it. Developers can’t work on the systems to debug the problem easily, or at all. Production staff insist on overly complicated rituals before they will expose themselves to yet another risky release of code. The feedback cycle on the software lengthens further still, delaying the delivery of actual value from the software.
Because the production staff is so overwhelmed with rising costs of complexity and risk of each deployment, they demand larger work buffers. In this case the buffers are carefully documented processes and ever more rigorous manual inspection of the product before they deploy. Since they cannot directly control the process of putting quality into the software, they react the only way they can - they must hire more people to make the workload manageable.
Meanwhile, the developers react to the demands of the production staff by grumbling about how production is “dragging their feet” on deploying fixes. Any problems in the deployment are clearly the fault of people who just don’t understand how to work with the product. Unable to rapidly adjust the product to a changing marketplace, the developers lengthen their own work buffers by creating longer cycles of “requirements gathering” and architecting features no one asked for.
The end result is that the business becomes slower to respond to change. Development costs begin to climb as projects become delayed. Engineers become frustrated and angry with the workplace. Attrition becomes a problem as the most talented people seek greener pastures.
Everyone sees problems but no answers. Everyone complains “We can’t get anything done!”
KANBAN
As bad as this sounds, there are ways to turn it around. As Andy explained in a previous post, Kanban is a powerful tool for visualizing and inspecting the flow of software delivery. By creating large and visible representations of how we work, we were able to avoid the trap of dividing our engineering team into functional silos. Rather than focus on the efficiency of each work unit (code created per developer), we chose to focus on the pace at which working software is delivered.
Eliminating the “traditional” specialization between development and production is an outcome of our Kanban adoption. Kanban is not the only way to address this problem, of course. Any framework will do, as long as it encourages inspection of how work is done and explicit conversations about how to improve that work. Those are the necessary conditions.
INFRASTRUCTURE AS CODE
As a corollary to DevOps as an organizing principle, we’ve adopted “Infrastructure as Code”. This means that the configuration and provisioning of our infrastructure are described and controlled by programming tools. We chose Chef, but there are many tools available, such as Puppet and CFEngine. Every developer has access to the infrastructure code and the server infrastructure itself.
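To make the idea concrete, here is a minimal sketch of what “described and controlled by programming tools” can look like: the desired state of a server is plain data, and a function works out what has to change to reach it. This is not Chef’s API or our actual configuration. The package names, versions, and the `plan` and `converge` functions are hypothetical, chosen only to illustrate the principle.

```python
from typing import Dict, List, Tuple

# Hypothetical desired state: package name -> version.
# In a real tool this description would live in version control.
DESIRED_PACKAGES: Dict[str, str] = {"webserver": "2.0", "timesync": "1.0"}


def plan(desired: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """List the actions needed to move `current` toward `desired`."""
    actions: List[str] = []
    for name, version in sorted(desired.items()):
        if name not in current:
            actions.append(f"install {name} {version}")
        elif current[name] != version:
            actions.append(f"upgrade {name} {current[name]} -> {version}")
    return actions


def converge(
    desired: Dict[str, str], current: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Return the new state and the actions taken to reach it.

    Packages that are installed but not described in `desired`
    are left untouched in this sketch.
    """
    actions = plan(desired, current)
    new_state = dict(current)
    new_state.update(desired)
    return new_state, actions


if __name__ == "__main__":
    state, actions = converge(DESIRED_PACKAGES, {"webserver": "1.0"})
    for action in actions:
        print(action)
```

Running `converge` a second time against the resulting state produces no actions. That repeatability is part of what makes it reasonable to give every developer access to the infrastructure code.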
If the thought of easy access to the servers that run your business makes you uneasy, it just means you are sufficiently paranoid. We use that unease to develop our version of “poka-yoke”, or error proofing mechanisms. Since we are responsible for the availability of the production environment, we make engineering choices to minimize fear.
CONTINUOUS DELIVERY
Our software is pulled from our GitHub source control repositories into a continuous integration pipeline. We apply a battery of fast unit tests. If those tests pass, we immediately and automatically deploy the code to a test environment. The pipeline runs another round of API tests followed by our slowest functional tests. If the pipeline detects a failure at any stage, it shuts down and notifies everyone.
If the tests pass then the code is promoted to a release. The pipeline then deploys the promoted code automatically to our production servers.
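The flow described above can be sketched as an ordered list of stages that stops at the first failure and notifies the team. This is a simplified illustration, not our actual pipeline code. `Stage`, `PipelineResult`, `run_pipeline`, and the `notify` callback are hypothetical names, and each stage is reduced to a function that returns True on success.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True when the stage succeeds


@dataclass
class PipelineResult:
    succeeded: bool
    completed: List[str] = field(default_factory=list)
    failed_stage: Optional[str] = None


def run_pipeline(
    stages: List[Stage], notify: Callable[[str], None]
) -> PipelineResult:
    """Run stages in order; stop and notify at the first failure."""
    completed: List[str] = []
    for stage in stages:
        if not stage.run():
            notify(f"Pipeline stopped: stage '{stage.name}' failed")
            return PipelineResult(False, completed, stage.name)
        completed.append(stage.name)
    return PipelineResult(True, completed, None)


if __name__ == "__main__":
    # Stage order follows the text; the lambdas stand in for real work.
    stages = [
        Stage("unit tests", lambda: True),
        Stage("deploy to test environment", lambda: True),
        Stage("API tests", lambda: True),
        Stage("functional tests", lambda: True),
        Stage("promote to release", lambda: True),
        Stage("deploy to production", lambda: True),
    ]
    result = run_pipeline(stages, notify=print)
    print("succeeded:", result.succeeded)
```

Ordering the stages from fastest to slowest means most mistakes are caught by the cheap checks, before the expensive functional tests ever run.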
EMERGENT PROPERTIES
Do we make mistakes? Of course! What we have found is that the longer the time a developer takes to commit changes, the bigger the changeset. And the more that changes, the greater the chance that a bug has been introduced. By forcing all changes to be applied as soon as possible we encourage small changes.
Small, incremental changes are also easy to debug. If a developer makes a mistake in 3 lines of code, we know we only have to search 3 lines of code for the problem. If your release cycle takes a month, how many lines of code did you change? How many lines of code do you have to inspect to find and fix the defect? Thousands? Tens of thousands? Millions?
The other advantage to making frequent releases is that the time to fix a problem is usually about 15 minutes. We don’t petition a committee to make a change, or fill out a form, or create a ticket proposing the change. We make the change and it’s deployed immediately. Our build and deployment infrastructure creates and reports the information we need to audit changes and fix problems.
We think the ability to react this rapidly is an immense advantage. This stance forces everyone to have explicit conversations about what constitutes value to the business. We measure the impact of changes almost immediately, rather than in weeks or months. The result is an engineering culture of achievement and pride instead of fear and frustration.
We learn rapidly from our work. We serve our customers better. Win!