
Battle Plan 2018: Illuminate Blind Spots and Unknown Unknowns

By Josh Gray, Chief Architect at Cedexis

– Originally published in DevOps Digest – 

There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know.

Bonus points if you know who came up with that tongue twister. He was talking about terrorists, but we’re here to discuss a different sort of war — the Battle for Bandwidth. These days, application and content delivery requires special tactics, an integrated strategy, and well-sourced intelligence. And the unknown unknowns are the true enemy because they inevitably lead to outages, slowdowns, and mutinous customers.

In early November, a major outage caused by a minor configuration error (a route leak, to be exact) at global backbone provider Level 3 created widespread connection issues on both U.S. coasts. Comcast, Verizon, Cox, and Vonage customers were particularly affected.

One small error can have mighty ripple effects, and the cause isn’t always apparent to network admins and enterprise customers. The time it took to return the Down Detector maps from angry red to mellow yellow could have been shortened by looking at Real User Measurements (crowdsourced telemetry), realizing it wasn’t a single site or ISP, and following a logic tree to find the culprit.

With Global Server Load Balancing, your delivery network is smart enough to see the barricade around the corner and switch routes on the fly — saving the day (and making the other guys look a bit dazed and confused).

Blind spots can be hiding more than outages. Your crack team of DevOps commandos can’t run successful release missions if they can’t check what’s really going on in the field. You don’t want them dashing around in the dark without a robust tactical plan based on all the parameters you can assess — when you turn unknown unknowns into known knowns from your various data streams, you can put them to work.

Continuous deployment isn’t for the faint of heart — you better have your Kevlar and your night vision goggles. Companies like Salesforce are releasing updates dozens of times a day, but even a handful a week requires a careful strategy. You can use RUM to test an update by initially limiting roll-out to one data center. Check for 4xx/5xx errors. If you’re seeing problems, you can check both user experience with your app (non-updated versions) in other places, and user experience at the same data center where you are testing the updated version, to deduce the source of trouble.
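
To make that logic concrete, here is a minimal Python sketch of the comparison, assuming hypothetical RUM beacons that record data center, app version, and whether the response was an error; the field names, thresholds, and data are illustrative, not a Cedexis API.

```python
from statistics import mean

# Hypothetical RUM samples: each beacon records the serving data center,
# the app version it hit, and whether the response was a 4xx/5xx error.
rum_samples = [
    {"datacenter": "us-east-1", "version": "2.4.1", "error": True},
    {"datacenter": "us-east-1", "version": "2.4.1", "error": False},
    {"datacenter": "us-east-1", "version": "2.4.0", "error": False},
    {"datacenter": "us-west-2", "version": "2.4.0", "error": False},
    # ... in practice, thousands of beacons collected from real user sessions
]

def error_rate(samples, **filters):
    """Share of sampled requests that returned an error, for samples matching filters."""
    matched = [s for s in samples if all(s[k] == v for k, v in filters.items())]
    return mean(1 if s["error"] else 0 for s in matched) if matched else 0.0

# The canary: the updated version, rolled out to a single data center.
canary      = error_rate(rum_samples, datacenter="us-east-1", version="2.4.1")
# Baseline 1: the non-updated version everywhere else.
elsewhere   = error_rate(rum_samples, version="2.4.0")
# Baseline 2: any non-updated traffic still served from the canary data center.
same_dc_old = error_rate(rum_samples, datacenter="us-east-1", version="2.4.0")

if canary > 3 * max(elsewhere, 0.01):
    if same_dc_old > 3 * max(elsewhere, 0.01):
        print("The whole data center looks unhealthy -- suspect infrastructure, not the release.")
    else:
        print("Only the updated version is failing -- halt the roll-out.")
else:
    print("Canary errors are in line with the baseline -- widen the roll-out.")
```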

One of the biggest unknown unknowns in traffic management is what’s going on in places you haven’t served recently. If a story about Boise causes traffic to spike there, and that’s not normally an audience hotspot for your service, chances are you won’t have any measurements of your own to go on. Community intelligence turns these dark corners of your empire into known knowns through automated crowdsourcing of quality of experience metrics. When combined with real-time server health checks and third-party data streams, you have a powerful ability to make efficient, economical routing decisions, even for destinations you don’t have any history with.
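
A rough sketch of that fallback, in Python, with entirely hypothetical data and function names (this is not the Cedexis API): prefer your own measurements when you have them, and lean on community data for regions you have never served.

```python
# Hypothetical latency measurements (ms) keyed by (location, platform).
own_rum = {
    ("new-york", "cdn-a"): 42,
    ("new-york", "cdn-b"): 55,
    # No entries for Boise -- we have not served traffic there recently.
}

# Crowd-sourced community measurements cover locations we have never served.
community_rum = {
    ("boise", "cdn-a"): 61,
    ("boise", "cdn-b"): 48,
}

healthy = {"cdn-a": True, "cdn-b": True}  # from real-time server health checks


def pick_platform(location, platforms):
    """Prefer our own RUM; fall back to community data for unfamiliar regions."""
    candidates = []
    for p in platforms:
        if not healthy.get(p):
            continue  # never route to a platform that is failing its health check
        latency = own_rum.get((location, p))
        if latency is None:
            latency = community_rum.get((location, p))
        if latency is not None:
            candidates.append((latency, p))
    return min(candidates)[1] if candidates else None


print(pick_platform("boise", ["cdn-a", "cdn-b"]))  # -> "cdn-b", from community data
```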

The more insight and intelligence can be used to accelerate the acquisition of known knowns, the better it is for your business and your bottom line. In the New Year, we should be less accepting of blind spots. They’re expensive — they cost us time, money, and customers. Nobody has enough human problem solvers around to keep putting out fires and rigging up one-off workarounds. Our best talent should be working on the next release, the next big idea, or the next major dilemma (Net Neutrality game changers, anyone?) — not floundering around trying to guess what’s holding up traffic. You can’t control what you can’t see, and on the hybrid IT battlefield, control keeps you on top of the hill. We’re pretty sure Donald Rumsfeld would agree.

To learn more:

10 Ways to Make Your Outage Emergency Room Fun

Originally published on the DevOps Digest website. By Andrew Marshall, Director of Product Marketing at Cedexis


It’s 3:47am. You and the rest of the Ops team have been summoned from your peaceful slumber to mitigate an application delivery outage. As you catch up on the frantic emails, Slack chats, and text messages from everyone on your international sales team (Every. Single. One.), your mind races as you switch to problem solving mode. It’s time to start thinking about how to make this mitigation FUN!

1. GENTLY REMIND EVERYONE THAT THE SAAS-DELIVERED AUTOMATED APPLICATION DELIVERY STRATEGY YOU PROPOSED WOULD HAVE PREVENTED THIS

No need to rub it in, but … if you had turned on a software-defined application delivery platform for your hybrid infrastructure, you’d all be sound asleep right now. Automated real-time delivery decisions and failover would be nice, right? Just sayin’.

2. DROP A GENERATIONALLY-DIVISIVE GIPHY IN THE EMERGENCY SLACK CHANNEL

Your coworkers’ opinions on gaming consoles, Spiderman movies, and music are different than yours! The perfect animated GIF will remind them of your erudite tastes, while you have their attention. Extra credit if you can work in some low-key shade about not listening to your equally sophisticated opinions on optimized outage mitigation.

3. HOLD AN IMPROMPTU CLOUD HEALTH DASHBOARD PAGE CUSTOMIZATION CONTEST

Oh hey, look at that! Your cloud provider’s health dashboard page says everything is fine…because it’s powered by the services that went down. Help your team vent their creative energy (and frustration) with some fun customized MS Paint updates to the offending page. Bonus points for art that reminds everyone of the value of a multi-cloud strategy powered by a programmable application delivery platform.

4. USE HEALTH METRICS TO CURE ILLS INSTEAD OF JUST CONDUCTING ANOTHER REALLY GOOD POST-MORTEM

Tired of “learning lessons” from these emergency room drills? You depend on NGINX for local load balancing (LLB), but don’t have a way to use those LLB health metrics and data to automate global delivery. Disjointed delivery intelligence means you don’t know how your apps will land with users. You need an end user-centric approach to app delivery that automates the best delivery path and ensures failover is in place at all times. Micro-outages often fly under your passive monitoring radar, but that doesn’t mean your users don’t notice them. An active, integrated app delivery approach re-routes automatically before you lose business. Post-mortems are fun…but so is making sure your apps survive the last mile. Arguably more so?

5. OPEN (AND THEN CLOSE) SOME TICKETS FOR THE SALES TEAM

“Sales needs to sell more stuff.” You’ll feel better.

6. BUILD A COUNTDOWN CLOCK FOR WHEN YOU CAN FINALLY REPLACE YOUR OLD ADC HARDWARE

Sure, your Mode 1 ADC hardware was a sunk cost, so you’re stuck with it for a while. But you’re one unnecessary emergency closer to having a fully software-defined application delivery platform for your hybrid cloud. And now you’re even closer. And closer … Tick. Tock.

7. LEVERAGE REAL USER MEASUREMENTS (RUM) TO FORESEE AND FIX PROBLEMS BEFORE THEY HAPPEN

Probably best to do this during normal work hours. User experience data from around the world can detect degrading, sluggish resources in real-time, and user-centric app delivery logic powered by RUM can make quick re-routing decisions automatically. No more getting woken up after the application crashes. While you’re wishing you had RUM on your side, you can look up some fun facts from the countries experiencing app outages. Did you know Luxembourgish is an official language?

8. SEND AN EMAIL

You too can be the weird colleague who sends emails with crazy, middle-of-the-night time stamps.

9. DO SOME VIRTUAL WINDOW SHOPPING

Browse around to see what you could have purchased with the money you were just forced to spend on unplanned cloud instance provisioning in order to keep your app running. That desktop Nerf missile launcher (or 700 of them) would have been pretty nice.

10. SWEAR OFF THE GAME OF CHANCE

You’ve just proven it’s not that much fun after all. Don’t just dump everything onto one cloud and call it done. Clouds go down for so many reasons. Use an application delivery platform control layer to build in the capability to auto-switch to an available resource, while you sleep soundly. Running on multi-cloud without an abstracted control layer removes most of the value of the cloud. Swear off the game of chance. Out loud. Right now.

 

More information:

More Than Science Fiction: Why We Need AI for Global Data Traffic Management

Originally published on the Product Design and Development website

By Josh Gray, Chief Architect, Cedexis


Blade Runner 2049 struck a deep chord with science fiction fans. Maybe it’s because there’s so much talk these days of artificial intelligence and automation — some of it doom and gloom, some of it utopian, with every shade of promise and peril in between. Many of us, having witnessed the Internet revolution first hand — and still in awe of the wholesale transformation of commerce, industry, and daily life — find ourselves pondering the shape of the future. How can the current pace of change and expansion be sustained? Will there be a breaking point? What will it be: cyber (in)security, the death of net neutrality, or intractable bandwidth saturation?

Only one thing is certain: there will never be enough bandwidth. Our collectively insatiable need for streaming video, digital music and gaming, social media connectivity, plus all the cool stuff we haven’t even invented yet, will fill up whatever additional capacity we create. The reality is that there will always be buffering on video — we could run fiber everywhere, and we’d still find a way to fill it up with HD, then 4K, then 8K, and whatever comes next.

Just like we need smart traffic signals and smart cars in smart cities to handle the debilitating and dangerous growth of automobile traffic, we need intelligent apps and networks and management platforms to address the unrelenting surge of global Internet traffic. To keep up, global traffic management has to get smarter, even as capacity keeps growing.

Fortunately, we have Big Data metrics, crowd-sourced telemetry, algorithms, and machine learning to save us from breaking the Internet with our binge watching habits. But, as Isaac Asimov pointed out in his story Runaround, robots must be governed. Otherwise, we end up with a rogue operator like HAL, an overlord like Skynet, or (more realistically) the gibberish intelligence of the experimental Facebook chatbots. In the case of the chatbots, the researchers learned a valuable lesson about the importance of guiding and limiting parameters: they had neglected to specify use of recognizable language, so the independent bots invented their own.

In other words, AI is exciting and brimming over with possibilities, but needs guardrails if it is to maximize returns and minimize risks. We want it to work out all the best ways to improve our world (short of realizing that removing the human race could be the most effective pathway to extending the life expectancy of the rest of Nature).

It’s easy to get carried away by grand futuristic visions when we talk about AI. After all, some of our greatest innovators are actively debating the enormous dangers and possibilities. But let’s come back down to earth and talk about how AI can at least make the Internet work for the betterment of viewers and publishers alike, now and in the future.

We are already using basic AI to bring more control to the increasingly abstract and complex world of hybrid IT, multi-cloud, and advanced app and content delivery. What we need to focus on now is building better guardrails and establishing meaningful parameters that will reliably get our applications, content, and data where we want them to go without outages, slowdowns, or unexpected costs. Remember, AI doesn’t run in glorious isolation, unerring in the absence of continual course adjustment: this is a common misconception that leads to wasted effort and disappointing or possibly disastrous results. Even Amazon seems to have fallen prey to the set-it-and-forget-it mentality: ask yourself, how often does their shopping algorithm suggest the exact same item you purchased yesterday? Their AI parameters may need periodic adjustment to reliably suggest related or supplementary items instead.

For AI to be practically applied, we have to be sure we understand the intended consequences. This is essential from many perspectives:  marketing, operations, finance, compliance, and business strategy. For instance, we almost certainly don’t want automated load balancing to always route traffic for the best user experience possible — that could be prohibitively expensive. Similarly, sometimes we need to route traffic from or through certain geographic regions in order to stay compliant with regulations. And we don’t want to simply send all the traffic to the closest, most available servers when users are already reporting that quality of experience (QoE) there is poor.

When it comes right down to it, the thing that makes global traffic management work is our ability to program the parameters and rules for decision-making — as it were, to build the guardrails that force the right outcomes. And those rules are entirely reliant upon the data that flows in.  To get this right, systems need access to a troika of guardrails: real-time comprehensive metrics for server health, user experience health, and business health.

System Guardrails

Real-time systems health checks are the first element of the guardrail troika for intelligent traffic routing. Accurate, low-latency, geographically dispersed synthetic monitoring answers the essential server availability question reliably and in real time: is the server up and running at all?

Going beyond ‘On/Off’ confidence, we need to know the current health of those available servers. A system that is working fine right now may be approaching resource limits, and a simple On/Off measurement won’t catch this. Without knowing the current state of resource usage, a load balancer can send so much traffic to a near-capacity resource that it goes down, potentially setting off a cascading effect that takes down other working resources.

Without scriptable load balancing, you have to dedicate significant time to shifting resources around in the event of DDoS attacks, unexpected surges, launches, repairs, etc. — and problems mount quickly if someone takes down a resource for maintenance but forgets to make the proper notifications and preparations ahead of time. Dynamic global server load balancers (GSLBs) use real-time system health checks to detect potential problems, route around them, and send an alert before failure occurs so that you can address the root cause before it gets messy.
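
As a rough illustration of treating “up” and “safe to send more traffic to” as separate questions, here is a minimal Python sketch; the fields, thresholds, and headroom margin are assumptions, not an actual health-check format.

```python
from dataclasses import dataclass

@dataclass
class HealthSample:
    """One synthetic health-check result for a server or platform."""
    reachable: bool       # the basic on/off answer
    cpu_util: float       # 0.0 - 1.0, from the provider's monitoring
    connections: int      # current concurrent connections
    max_connections: int  # configured capacity

def routable(sample: HealthSample, headroom: float = 0.2) -> bool:
    """Eligible for new traffic only if it is up AND has spare capacity.

    Sending traffic to a server that is up but nearly saturated can push it
    over the edge and start a cascading failure, so keep a safety margin.
    """
    if not sample.reachable:
        return False
    cpu_ok = sample.cpu_util <= 1.0 - headroom
    conn_ok = sample.connections <= sample.max_connections * (1.0 - headroom)
    return cpu_ok and conn_ok

# A server can pass the on/off check yet still be excluded from rotation:
print(routable(HealthSample(True, cpu_util=0.55, connections=400, max_connections=1000)))  # True
print(routable(HealthSample(True, cpu_util=0.95, connections=980, max_connections=1000)))  # False
```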

Experience Guardrails

The next input to the guardrail troika is Real User Measurements (RUM), which provide information about Internet performance at every step between the client and the clouds, data centers, or CDNs hosting applications and content. Simply put, RUM is the critical measurement of the experience each user is having. As they say, the customer is always right, even when Ops says the server is working just fine. To develop true traffic intelligence, you have to go beyond your own system. This data should be crowd-sourced by collecting metrics from thousands of Autonomous System Numbers, delivering billions of RUM data points each day.

Community-sourced intelligence is necessary to see what’s really happening both at the edges of the network and in the big messy pools of users where your own visibility may be limited (e.g., countries with thousands of ISPs, like Brazil, Russia, Canada, and Australia). Granular, timely, real user experience data is particularly important at a time when there are so many individual peering agreements and technical relationships, all of which could be the source of unpredictable performance and quality.
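
To show what “granular” means in practice, here is a small Python sketch that aggregates hypothetical crowd-sourced beacons by country, ASN, and platform, so that two ISPs in the same country can get different routing answers; the data and field layout are illustrative only.

```python
from collections import defaultdict

# Hypothetical crowd-sourced RUM beacons: (country, asn, platform, latency_ms).
beacons = [
    ("BR", 28573, "cdn-a", 180),
    ("BR", 28573, "cdn-b", 95),
    ("BR", 26599, "cdn-a", 70),
    ("RU", 12389, "cdn-a", 210),
    # ... billions per day, across thousands of autonomous systems
]

# Bucket measurements per (country, ASN, platform) so routing decisions
# can differ between two ISPs in the same country.
buckets = defaultdict(list)
for country, asn, platform, latency in beacons:
    buckets[(country, asn, platform)].append(latency)

def p50(values):
    """Median latency for one bucket."""
    values = sorted(values)
    return values[len(values) // 2]

scores = {key: p50(vals) for key, vals in buckets.items()}
print(scores[("BR", 28573, "cdn-b")])  # 95 -- cdn-b wins for this ISP...
print(scores[("BR", 26599, "cdn-a")])  # 70 -- ...while cdn-a wins for another
```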

Business Guardrails

Together, system and experience data inform intelligent, automated decisions so that traffic is routed to servers that are up and running, demonstrably providing great service to end users, and not in danger of maxing out or failing. As long as everything is up and running and users are happy, we’re at least halfway home.

We’re also at the critical divide where careful planning to avoid unintended consequences comes into play. We absolutely must have the third element of the troika: business guardrails.

After all, we are running businesses. We have to consider more than bandwidth and raw performance: we need to optimize the AI parameters to take care of our bottom line and other obligations as well. If you can’t feed cost and resource usage data into your global load balancer, you won’t get traffic routing decisions that are as good for profit margins as they are for QoE. As happy as your customers may be today, their joy is likely to be short-lived if your business exhausts its capital reserves and resorts to cutting corners.

Beyond cost control, automated intelligence is increasingly being leveraged in business decisions around product life cycle optimization, resource planning, responsible energy use, and cloud vendor management. It’s time to put all your Big Data streams (e.g., software platforms, APM, NGINX, cloud monitoring, SLAs, and CDN APIs) to work producing stronger business results. Third party data, when combined with real-time systems and user measurements, creates boundless possibilities for delivering a powerful decisioning tool that can achieve almost any goal.
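
Putting the troika together, here is a toy Python decision function in which system health and business rules (compliance region, cost ceiling) act as hard guardrails, while measured user experience and unit cost are blended into a score; every name, weight, and number is hypothetical.

```python
# Candidate platforms with system, experience, and business data attached.
platforms = {
    "cloud-a": {"healthy": True,  "rum_ms": 80, "cost_per_gb": 0.085, "regions": {"us", "eu"}},
    "cloud-b": {"healthy": True,  "rum_ms": 65, "cost_per_gb": 0.120, "regions": {"us"}},
    "cdn-c":   {"healthy": False, "rum_ms": 50, "cost_per_gb": 0.040, "regions": {"us", "eu"}},
}

def choose(platforms, required_region, max_cost_per_gb,
           latency_weight=1.0, cost_weight=200.0):
    best, best_score = None, float("inf")
    for name, p in platforms.items():
        # System guardrail: never route to an unhealthy platform.
        if not p["healthy"]:
            continue
        # Business guardrails: compliance region and cost ceiling are hard limits.
        if required_region not in p["regions"] or p["cost_per_gb"] > max_cost_per_gb:
            continue
        # Experience + business: blend user-measured latency with unit cost.
        score = latency_weight * p["rum_ms"] + cost_weight * p["cost_per_gb"]
        if score < best_score:
            best, best_score = name, score
    return best

print(choose(platforms, required_region="eu", max_cost_per_gb=0.10))  # -> "cloud-a"
```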

Conclusion

Decisions made without context rarely produce optimal results, and then only by sheer luck. Most companies have developed their own special blend of business and performance priorities (and anyone who hasn’t, probably should). Adding an automated control layer provides comprehensive, up-to-the-minute visibility and control, which helps any Ops team achieve cloud agility, performance, and scale while staying in line with business objectives and budget constraints.

Simply find the GSLB with the right decisioning capabilities, as well as the capacity to ingest and use System, Experience, and Business data in real-time, then build the guardrails that optimize your environment for your unique needs.

When it comes to practical applications of AI, global traffic management is a great place to start. We have the data, we have the DevOps expertise, and we are developing the ability to set and fine-tune the parameters. Without it, we might break the Internet. That’s a doomsday scenario we all want to avoid, even those of us who love the darkest of dystopian science fiction.

About Josh Gray: Josh Gray has worked as a leader at various startups as well as in large enterprise settings such as Microsoft, where he was awarded multiple patents. As VP of Engineering for Home Comfort Zone, his team designed and developed systems that were featured in Popular Science, HGTV, and Ask This Old House, and won #1 Cool Product at introduction at the Pacific Coast Builders Show. Josh has been part of many other startups and built on his success by becoming an angel investor in the Portland community. He continues his run of success as Chief Architect at Cedexis. LinkedIn profile

 

With a Multi-cloud Infrastructure, Control is Key

By Andrew Marshall, Cedexis Director of Product Marketing

Ask any developer or DevOps manager about their first experiences with the public cloud and it’s likely they’ll happily share some memories of quickly provisioning some compute instances for a small project or new app. For (seemingly) a few pennies, you could take advantage of the full suite of public cloud services—as well as the scalability, elasticity, security and pay-as-you-go pricing model. All of this made it easy for teams to get started with the cloud, saving both IT budget and infrastructure setup time. Public cloud providers AWS, Azure, Google Cloud, Rackspace and others made it easy to innovate.

Fast forward several years and the early promise of the cloud is still relevant: services have expanded, costs have (in some cases) been reduced and DevOps teams have adapted to the needs of their team by spinning up compute instances whenever they’re needed. But for many companies, the realities of their hybrid-IT infrastructure necessitate support for more than one public cloud provider. To make this work, IT Ops needs a control layer that sits on top of their infrastructure and can deliver applications to customers over any architecture, including multi-cloud. This is true regardless of why teams need to support multi-cloud environments.

Prepare for the Worst

As any IT Ops manager (or anyone who has lost access to their web app) knows, outages and cloud service degradation happen. Modern Ops teams need a plan in place for when they do. Many companies choose to utilize multiple public cloud providers to ensure their application is always available to worldwide customers, even during an outage. The process of manually re-routing traffic to a second public cloud in the event of an outage is cumbersome, to say the least. Adding an app delivery control plane on top of your infrastructure allows companies to seamlessly and automatically deliver applications over multiple clouds, factoring in real-time availability and performance.
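
A minimal sketch of that automation in Python, assuming a hypothetical /healthz endpoint on each origin; a real GSLB would use distributed synthetic probes and RUM rather than a single in-process check, but the decision shape is the same.

```python
import urllib.request

# Hypothetical origins for the same app in two public clouds.
origins = [
    {"host": "app.cloud-a.example.com", "priority": 1},
    {"host": "app.cloud-b.example.com", "priority": 2},
]

def is_available(host: str, timeout: float = 2.0) -> bool:
    """Minimal synthetic probe: does the health endpoint answer with HTTP 200?"""
    try:
        with urllib.request.urlopen(f"https://{host}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def resolve() -> str:
    """Pick the best origin right now: the primary if healthy, otherwise the standby.

    Evaluated on every routing decision (for example by a DNS-based GSLB),
    this turns a 3:47am manual re-route into an automatic failover.
    """
    for origin in sorted(origins, key=lambda o: o["priority"]):
        if is_available(origin["host"]):
            return origin["host"]
    return origins[0]["host"]  # everything failed its probe: fall back to primary and page a human
```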

Support Cloud-driven Innovation

Ops teams often support many different agile development teams, often in multiple countries or from new acquisitions. When this is the case, it’s likely that the various teams are using many architectures on more than one public cloud. Asking some dev teams to switch cloud vendors is not very DevOps-y. A better option is to control app delivery automation with a cloud-agnostic control plane that sits on top of any cloud, data center or CDN architectures. This allows dev teams to work in their preferred cloud environment, without worrying about delivery.

Avoid Cloud Vendor Lock-in

Public cloud vendors such as Amazon Web Services or Microsoft Azure aren’t just infrastructure-as-a-service (IaaS) vendors, they sell (or resell) products and services that could very well compete with your company’s offering. In the beginning, using a few cloud instances didn’t seem like such a big deal. But now that you’re in full production and depend on one of these cloud providers for your mission-critical app, this no longer feels like a great strategy. Adding a second public cloud to your infrastructure lessens your dependence on a single cloud vendor who you may be in “coopetition” with.

Multiple-vendor sourcing is a proven business strategy in many other areas of IT, giving you more options during price and SLA negotiations. The same is true for IaaS. Cloud services change often, as new services are added or removed, and price structures change. Taking control over these changes in public cloud service offerings, pricing models and SLAs is another powerful motivator for Ops teams to move to a multi-cloud architecture. An application delivery automation platform that can ingest and act on cloud service pricing data is essential.

Apps (and How They’re Delivered) Have Changed

Monolithic apps are out. Modern, distributed apps that are powered by microservices are in. Similarly, older application delivery controllers (ADCs) were built for a static infrastructure world, before the cloud (and SaaS) were commonly used by businesses. Using an ADC for application delivery requires a significant upfront capital expense, limits your ability to rapidly scale and hinders the flexibility to support dynamic (i.e. cloud) infrastructure. Using ADCs for multiple cloud environments compounds these issues exponentially. A software-defined application delivery control layer eliminates the need for older ADC technology and scales directly with your business and infrastructure.

Regain Control

Full support for multi-cloud in production may sound daunting. After all, Ops teams already have plenty to worry about daily. Adding a second cloud vendor requires a significant ramp-up period to get ready for production-level delivery, and it brings new protocols, alerts, competencies and other things you need to think about. You can’t be knee-deep in the details of each cloud and still manage infrastructure. Adding in the complexity of application delivery over multiple clouds can be a challenge, but much less so if you use a SaaS-based application delivery platform. With multi-cloud infrastructure, control is key.

Learn more about our solutions for multi-cloud architectures and discover our application delivery platform.

You can also download our latest ebook, “Hybrid Cloud, the New Normal,” for free here.

Cloud-First + DevOps-First = 81% Competitive Advantage

We recently ran across a fascinating article by Jason Bloomberg, a recognized expert on agile digital transformation, that examines the interplay between Cloud-First and DevOps-First models. That article led us, in turn, to an infographic centered on some remarkable findings from a CA Technologies survey of 900-plus IT pros from around the world. The survey set out to explore the synergies between Cloud and DevOps, specifically in regard to software delivery. You can probably guess why we snapped to attention.

The study found that 20 percent of the organizations represented identified themselves as being strongly committed to both Cloud and DevOps, and their software delivery outperformed other categories (Cloud only, DevOps only, and slow adopters) by 81 percent. This group earned the label “Delivery Disruptors” for their outlying success at maximizing agility and velocity on software projects. On factors of predictability, quality, user experience, and cost control, the Disruptor organizations soared above those employing traditional methods, as well as Cloud-only and DevOps-only methods, by large percentages. For example, Delivery Disruptors were 117 percent better at cost control than Slow Movers, and 75 percent better in this category than the DevOps-only companies.

These findings, among others, got us to thinking about the potential benefits and advantages such Delivery Disruptors can gain from adding Cedexis solutions into their powerful mix. Say, for example, you have agile dev teams working on new products and apps and you want to shorten the execution time for new cloud projects. To let your developers focus on writing code, you need an app delivery control layer that supports multiple teams and architectures. With the Cedexis application delivery platform, you can support agile processes, deliver frequent releases, control cloud and CDN costs, guarantee uptime and performance, and optimize hybrid infrastructure. Your teams get to work their way, in their specific environment, without worrying about delivery issues looming around every corner.

Application development is constantly changing thanks to advances like containerization and microservice architecture — not to mention escalating consumer demand for seamless functionality and instant rich media content. And in a hybrid, multi-cloud era, infrastructure is so complex and abstracted that delivery intelligence has to be embedded in the application (you can read more about what our Chief Architect, Josh Gray, has to say about delivery-as-code here).

To ensure that an app performs as designed, and end users have a high quality experience, agile teams need to automate and optimize with software-defined delivery. Agile teams can achieve new levels of delivery disruption by bringing together global and local traffic management data (for instance, RUM, synthetic monitoring results, and local load balancer health), application performance management, and cost considerations to ensure the optimal path through datacenters, clouds, and CDNs.

Imagine the agility and speed a team can contribute to digital transformation initiatives with fully automated app delivery based on business rules, actionable data from real user and synthetic testing, and self-healing network optimizations. Incorporating these capabilities with a maturing Cloud-first and DevOps-first approach will likely put the top performers so far ahead of the rest of the pack, they’ll barely be on the same racetrack.

 

 

Shellshocked! Big problem. Cedexis’ swift response.

For 22 years Shellshock lay dormant. This bug made it possible to “take control of hundreds of millions of machines around the world, potentially including Macintosh computers and smartphones that use the Android operating system”*.


From the Cedexis viewpoint, our DevOps team sprang into action. In short order:

  • The Shellshock vulnerability in ‘bash’ was disclosed publicly on Sept 24th.
  • Ubuntu, our operating system provider, issued a security update for ‘bash’.
  • All of our systems were patched and updated by 1pm today (Sept 25th).

The inventor of Bash, Brian J. Fox, joked in an interview Thursday that his first reaction to the Shellshock discovery was, “Aha, my plan worked.”

Indeed.

Rapid Deployment That Won’t Delay Lunch

Part of our disaster mitigation plan has included developing a rapid server deployment process. In addition to saving us a lot of time when launching a new server, rapid deployment ensures that existing servers stay up-to-date with the latest configuration changes and software releases. It also means that when we need to expand our services, populate a new data center or CDN, or recover from a crash, we can be on top of things quickly. Our servers are located in over 40 data centers and cloud providers worldwide and collect over a billion-and-a-half measurements every day. These measurements are crucial to the decision-making power of Openmix.

Deploying on Ubuntu with Puppet

Our deployment plan has several steps. First, we begin with the latest Ubuntu Long Term Support (LTS) release available. Each of our service providers already supports Ubuntu LTS, typically by providing a pre-built virtual machine image. Their configurations can vary from provider to provider, so we have built specialized bootstrap shell scripts that we can use to deploy our own very basic configuration. These scripts are customized by both cloud provider and topographic location of the server. Much of our configuration is based around the hostname of the new server, which the bootstrap script sets based on the location and service provider of the new host.
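
The bootstrap itself is a shell script, but the naming logic it applies looks roughly like this Python sketch; the naming scheme shown is hypothetical.

```python
def build_hostname(role: str, provider: str, region: str, index: int) -> str:
    """Derive a hostname that encodes provider and location, e.g. 'radar01-aws-us-east'.

    Downstream configuration (Puppet node matching, monitoring labels, etc.)
    can then key off the hostname alone.
    """
    return f"{role}{index:02d}-{provider}-{region}"

print(build_hostname("radar", "aws", "us-east", 1))    # radar01-aws-us-east
print(build_hostname("radar", "azure", "eu-west", 7))  # radar07-azure-eu-west
```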

The bootstrap process also installs and configures Puppet Enterprise from Puppet Labs. This tool provides a configuration management system that allows us to specify the state we want each server to be in. Everything from specific applications to firewall configurations can be defined in Puppet modules, which are gathered into site manifests. The granularity is incredible—even the permissions and ownership of individual files can be specified.

Using Puppet, we are then able to make sure that the new server is working the way we want it to. We know, even before deploying any Cedexis software, that the server is in a known state and running smoothly. We’re also able to keep an eye on hundreds of servers across dozens of data centers at once using Puppet, which makes it an invaluable tool. We love using Puppet, and if you’d like more information about how and why we selected it, you should read our Puppet Labs case study.

Installing the Cedexis Tools

The next step is to use Puppet to add the Cedexis apt repositories to the new server’s configuration. We prepare our software releases as Debian packages, which gives us all of the power of dpkg, the Debian package management system. This allows us to specify dependencies, track software versions, and automatically deploy new packages and revisions. Once the server is aimed at our apt repositories, installing the Cedexis tools is quick and painless.

The new server is just about ready. Because Openmix makes decisions using only very recent measurements, only a small amount of data needs to be transferred to the new server before it can begin making intelligent decisions and joining the rest of our workforce.

We monitor with both Puppet and our own monitoring software. If problems develop, we notice them quickly. From beginning to end, the process of deploying a new server takes about an hour, and the tools we have chosen give us the confidence that each new server is playing nicely with the Cedexis network and doing its job appropriately.

Life at Cedexis: LessOps, not NoOps

As a fast-paced company with modern needs, Cedexis has developed an interesting balance of work between operations and development. Last year, there was a lot of talk about the concept of “NoOps,” which indicated (depending on your perspective) either an elimination of the traditional operations role or a description of what must be a well-oiled, well-working operations team. In either case, the main message seemed to be that operations should focus on automation as much as possible, thus reducing the need for actual man-hours with eyes on the network.

A crucial part of the Internet discussion involved an article by Operations Engineer John Allspaw in which he criticized the concept of NoOps.

“I do find it disturbing,” said Ramin Khatibi in support of Allspaw, “that enough people have experienced Ops performed so poorly that faced with a working version they assume it must be something else.”

While I also don’t think that NoOps is a realistic approach, I’m not going to mount an argument against its existence. Instead, Cedexis uses its small technical team to the best effect, in a system we’ve been calling “LessOps.”

What is LessOps?

We wanted our LessOps approach to build more collaboration between the operations and development teams, including cross-training. We also hoped that integration between the two groups would lead to a stronger, more robust product. However, it would mean some changes that might not be that popular, such as adding developers to the on-call rotation.

When we first put our developers on call, the operations team was hesitant and skeptical. After all, not many developers have what we might call the operations mindset. Developers tend to be focused on individual problems or components of a system, while operations staff take a more holistic approach out of necessity. We were unsure how it would turn out. However, after a few months under this new system, we’ve enjoyed great benefits.

That Time We Had Zero Alerts

The biggest and most measurable benefit is that after a few weeks with the mixed on-call rotation, for the first time we were experiencing periods with zero alerts and errors on our system health monitor. This happened as a result of several factors.

First, developers were put in charge of both designing and responding to monitors for components they’d written. Error messages written by developers have a tendency to be written for those who are dealing with their own running code. They can be difficult for others to interpret efficiently. These messages, now monitored by the developers who wrote them, pointed out places in the system where unneeded monitoring and reporting was happening. A number of minor, easy-to-fix issues that operations had not paid sufficient attention to before were also quickly discovered and addressed.

This also led to another change that produced a more robust system. Developers began writing the monitors for the components they had written. Operations teams know how difficult it is to monitor a system they don’t understand. With ops and developers working together, however, monitors could be written that provide both teams with meaningful information that facilitates more effective responses to alerts by directing them to the proper domain experts.

These monitors have also become part of our software development lifecycle. In production, they operate as live unit tests, and as the system is developed over time, they provide a set of legacy regression tests. These keep us alerted if any changes in the system have produced unpredictable behavior in components we may not actively monitor or develop anymore.
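
As an illustration of a monitor doubling as a regression test, here is a small Python sketch of a freshness check a developer might write for their own measurement-ingest component; the function name, threshold, and message are hypothetical.

```python
import time

def check_measurement_freshness(latest_timestamp: float, max_age_s: float = 120.0) -> tuple[bool, str]:
    """A component-level monitor written by the component's own developer.

    Returns (ok, message). In production it runs continuously as a live check;
    the same function can be called from the test suite as a regression test,
    so behavior changes elsewhere in the system surface as alerts here.
    """
    age = time.time() - latest_timestamp
    if age > max_age_s:
        return False, f"measurements are {age:.0f}s old (limit {max_age_s:.0f}s) -- check the ingest pipeline"
    return True, "measurement feed is fresh"

ok, message = check_measurement_freshness(latest_timestamp=time.time() - 30)
print(ok, message)  # True measurement feed is fresh
```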

A Culture of Documentation

The cross-training between operations and development also produces much stronger documentation. We have learned, as have many teams, that good documentation needs to be a core element of the team’s culture. Everyone on the team has realized the benefit of stable, consistent documentation that is kept up to date.

Overall, we have found that the LessOps approach, with operations and engineering working closely with each other, has produced more automation, fewer system faults, and more free time for our ops staff to play LAN games. Plus, the teams work extremely well together, with architecture and design decisions made more efficiently as a group, and problems being solved faster with domain experts quickly on hand.