As a fast-paced company with modern needs, Cedexis has developed an interesting balance of work between operations and development. Last year, there was a lot of talk about the concept of “NoOps,” which indicated (depending on your perspective) either the elimination of the traditional operations role or a description of what a well-oiled operations team should look like. In either case, the main message seemed to be that operations should automate as much as possible, thus reducing the man-hours that must be spent with eyes on the network.
A crucial part of the Internet discussion involved an article by Operations Engineer John Allspaw in which he criticized the concept of NoOps.
“I do find it disturbing,” said Ramin Khatibi in support of Allspaw, “that enough people have experienced Ops performed so poorly that faced with a working version they assume it must be something else.”
While I also don’t think that NoOps is a realistic approach, I’m not going to mount an argument against its existence. Instead, Cedexis uses its small technical team to the best effect, in a system we’ve been calling “LessOps.”
What is LessOps?
We wanted our LessOps approach to build more collaboration between the operations and development teams, including cross-training. We also hoped that integration between the two groups would lead to a stronger, more robust product. However, it would mean some changes that might not be that popular, such as adding developers to the on-call rotation.
When we first put our developers on call, the operations team was hesitant and skeptical. After all, not many developers have what we might call the operations mindset. Developers tend to be focused on individual problems or components of a system, while operations staff take a more holistic approach out of necessity. We were unsure how it would turn out. However, after a few months under this new system, we’ve enjoyed great benefits.
That Time We Had Zero Alerts
The biggest and most measurable benefit came a few weeks into the mixed on-call rotation: for the first time, we experienced periods with zero alerts and errors on our system health monitor. This happened as a result of several factors.
First, developers were put in charge of both designing and responding to monitors for components they’d written. Error messages written by developers tend to be aimed at people who already know the running code, so they can be difficult for others to interpret efficiently. Once engineers were the ones watching these messages, the messages exposed places in the system where unneeded monitoring and reporting were happening. A number of minor, easy-to-fix issues that operations had not paid sufficient attention to before were also quickly discovered and addressed.
This also led to another change that produced a more robust system: developers began writing the monitors for the components they had written. Operations staff know how difficult it is to monitor a system they don’t understand. With ops and developers working together, however, monitors can be written that give both teams meaningful information and direct each alert to the proper domain expert, which makes responses more effective.
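To make the idea of routing alerts to domain experts concrete, here is a minimal sketch in Python. It is not our production code: the `Monitor` class, the `owner` field, the `run_monitors` function, and the queue-depth threshold are all hypothetical names and values chosen for illustration. The point is only that each monitor carries the name of whoever understands the component, and that each failure message is written to be readable by someone who did not write the code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Monitor:
    name: str
    owner: str  # hypothetical: the team that wrote the component
    # The check returns None when healthy, otherwise a readable message.
    check: Callable[[], Optional[str]]


def run_monitors(monitors: List[Monitor]) -> Dict[str, List[str]]:
    """Run every check and group the failure messages by owner."""
    alerts: Dict[str, List[str]] = {}
    for monitor in monitors:
        try:
            problem = monitor.check()
        except Exception as exc:
            # A check that crashes is itself worth an alert.
            problem = f"check raised {type(exc).__name__}: {exc}"
        if problem is not None:
            alerts.setdefault(monitor.owner, []).append(
                f"{monitor.name}: {problem}"
            )
    return alerts


def queue_depth_check() -> Optional[str]:
    depth = 1200  # stand-in for a real measurement
    if depth < 1000:
        return None
    return f"queue depth {depth} exceeds 1000"


monitors = [
    Monitor("ingest-queue", "dev-ingest", queue_depth_check),
    Monitor("dns-resolver", "ops", lambda: None),
]
alerts = run_monitors(monitors)
```

In this sketch the healthy monitor produces nothing, and the failing one lands in the list for the team that owns it, so whoever is on call knows immediately which domain expert to involve.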
These monitors have also become part of our software development lifecycle. In production, they operate as live unit tests, and as the system is developed over time, they provide a set of legacy regression tests. These alert us if any change in the system has produced unexpected behavior in components we may not actively monitor or develop anymore.
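One way to picture a monitor that doubles as a regression test is a check that takes its data source as a parameter. The sketch below is an assumption about how this could be structured, not a description of our actual monitors: `check_report_shape`, the `status` and `latency_ms` fields, and the 500 ms limit are all hypothetical. In production the check would be handed a function that fetches a live report; in a test it is handed canned data.

```python
from typing import Callable, Dict, List


def check_report_shape(fetch_report: Callable[[], Dict[str, object]]) -> List[str]:
    """Return a list of problems with a report; an empty list means healthy."""
    problems: List[str] = []
    report = fetch_report()
    for key in ("status", "latency_ms"):
        if key not in report:
            problems.append(f"missing field: {key}")
    if report.get("status") != "ok":
        problems.append(f"unexpected status: {report.get('status')!r}")
    latency = report.get("latency_ms")
    if isinstance(latency, (int, float)) and latency > 500:
        problems.append(f"latency {latency} ms exceeds 500 ms")
    return problems


def test_check_report_shape() -> None:
    # Regression test: the same check, fed known-good and known-bad reports.
    good = {"status": "ok", "latency_ms": 42}
    slow = {"status": "ok", "latency_ms": 900}
    assert check_report_shape(lambda: good) == []
    assert check_report_shape(lambda: slow) == ["latency 900 ms exceeds 500 ms"]


test_check_report_shape()
```

Because the check does not care where the report comes from, the same function can keep running against a component long after anyone is actively developing it.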
A Culture of Documentation
The cross-training between operations and development also produces much stronger documentation. We have learned, as have many teams, that good documentation needs to be a core element of the team’s culture. Everyone on the team has realized the benefit of stable, consistent documentation that is kept up to date.
Overall, we have found that the LessOps approach, with operations and engineering working closely with each other, has produced more automation, fewer system faults, and more free time for our ops staff to play LAN games. Plus, the teams work extremely well together, with architecture and design decisions made more efficiently as a group, and problems being solved faster with domain experts quickly on hand.