Posts

New Feature: Reason Code reporting

Cedexis’ global load balancing solution, Openmix, makes over 2.5 billion real-time delivery decisions every day. These routing decisions are based on a combination of the Radar community’s 14 billion daily real user measurements and the business logic defined by our customers.

One thing we hear time and time again is “It’s great that you are making all these decisions, but it would be very valuable to know why you are switching pathways.”  The “why” is hugely valuable in understanding the “what” (Decisions) and the “when” (Time) of the Openmix decision-routing engine.

And so, we bring you: Reason Codes.

Reason Codes in Openmix applications are used to log and identify the decisions being made, so you can easily establish why users were routed to particular providers or geographic locations.  Reason Codes reflect factors such as Geo overrides, Best Round Trip Time, Data Problems, Preferred Provider Availability, or whatever other logic is built into your Openmix applications. Being able to see which Reason Codes (the “why”) drove which decisions lets you see clearly where problems are arising in your delivery network, and make adjustments where necessary.

Providing these types of insights is core to Cedexis’ DNA, so we are pleased to announce the general availability of Reason Codes as part of the Decision Report.  You can now view Reason Codes as both Primary and Secondary Dimensions, as well as through a specific filter for Reason Codes.

As a Cedexis Openmix user, you’ll want to get in on this right away. Being able to see that Openmix routed users from your preferred cloud or CDN provider to another one because of a specific event (perhaps a data outage in the UK) allows you to understand what transpired over a given time period. No more second-guessing why decisions spiked in a certain country or network. Using Reason Codes, you can now easily see which applications are over- and under-performing, and why.

Here is an example of how you can start gaining insights.

You will notice in the first screenshot below that for a period of time, there was a spike in the number of decisions that Openmix made for two of the applications.

Now all you have to do is switch the Primary Dimension from Application to Reason Code, and you can quickly see that “Routed based [on] Availability data” was the main reason Openmix re-routed users.

Drilling down further, you can add Country as your Secondary Dimension and see that this was happening primarily in the United States.

All of a sudden, you’re in the know: there wasn’t just ‘something going on’ – there was a major Availability event in the US. Now it’s time to hunt down your rep from that provider and find out what happened, what the plan is to prevent it in the future, and how you can adjust your network to ensure continued excellent service for all your users.

Introducing the All New Sonar: a cloud-native synthetic testing tool for any infrastructure

I never guess. It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.
Sir Arthur Conan Doyle

Synthetic monitoring built for hybrid cloud
Sonar tests all of your endpoints: in your public clouds, private clouds, data centers, or CDNs. This provides a comprehensive and uniform view of the overall health of your application delivery, no matter what the status of your various infrastructure components happens to be.
Sonar’s proactive testing acts like a virtual end user, testing to see how an application, video, or large file download would be experienced by your global customers. Being able to test your app from nine locations worldwide helps ensure your data has incredibly low latency, and therefore is actually usable for your app delivery strategy.

Ultra-low-latency synthetic monitoring, refreshed as often as every other second
Public cloud users are probably used to having access to some sort of synthetic app testing functionality as a core part of the services offered by their cloud provider. While many cloud services check for availability every 30 to 120 seconds, Sonar offers checks as frequently as every two seconds. Data that’s updated every few minutes really isn’t meaningful for a solution that needs to make real-time, automated delivery decisions. Not to mention the question of data objectivity when the source information comes from the provider of the infrastructure being monitored.
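To make the idea concrete, here is a minimal sketch of what a frequent synthetic availability check might look like (TypeScript, illustrative only – this is not Sonar’s implementation, and the endpoint URL is a hypothetical placeholder):

    // Poll a hypothetical endpoint every two seconds and record availability and latency.
    const ENDPOINT = "https://app.example.com/health"; // hypothetical health-check URL
    const INTERVAL_MS = 2_000; // every two seconds, versus the typical 30-120 seconds

    async function checkOnce(): Promise<void> {
      const started = Date.now();
      try {
        const res = await fetch(ENDPOINT, { method: "HEAD" });
        console.log(`up=${res.ok} status=${res.status} latencyMs=${Date.now() - started}`);
      } catch {
        console.log(`up=false latencyMs=${Date.now() - started}`); // unreachable counts as down
      }
    }

    setInterval(checkOnce, INTERVAL_MS); // requires Node 18+ or a browser for global fetch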

Monitoring is passive. Cedexis is insight + action.
What makes Sonar different from other synthetic testing agents is that Sonar data can be used to shape application delivery decisions in real time. Data collected by Sonar feeds directly into the Cedexis application delivery platform, which uses fully user-configurable algorithms to route traffic to the endpoints that deliver the best customer experience at the lowest operational cost. Owing to the frequent health checks and the rapid calculation of optimal traffic routes, Cedexis provides the lowest-latency cloud-based application delivery service available, with automated delivery decisions made to route around traffic congestion less than 10 seconds after problems initially arise. By contrast, most cloud services, with less frequent synthetic checks and slower decisioning engines, may be expected to take two to four times as long to respond to emerging issues.

Better data means better decisions.
Delivering applications over the internet, like all interactions with complex, dynamic systems, ultimately meets success or failure based on the data you use for making decisions. In this case, decisions are the “real-time” application delivery choices your platform makes to ensure apps and video reach your customers in a way that produces a great user experience. Using real user monitoring like Radar – the world’s largest real-time user experience community – provides data you can use to make automated delivery decisions on your hybrid infrastructure. But to enable your application delivery logic to fully understand and optimize delivery for all of your customers and potential customers worldwide, you need to proactively test networks. That’s where Cedexis’ Sonar functionality comes in.

The three pillars of Application Delivery
The Cedexis application delivery platform is built on three powerful services:

  • Radar: the world’s largest community of instantaneous and actionable user experience data
  • Fusion: a powerful 3rd party data ingestion tool that makes APM, Local Load Balancer, cloud metrics, and any other dataset actionable in delivery logic
  • [NEW!] Sonar: a massively scalable and architecture-agnostic synthetic testing tool that is immune to the latency issues of proprietary cloud tools

 

The Cedexis application delivery platform automates and optimizes the customer experience for apps, video, and static content while minimizing cloud and content delivery costs. It does this by combining billions of real user data points from over 50,000 networks, Sonar synthetic testing data, and any other dataset you use, so delivery is optimized based on real user data from our entire network (not just your own customers).
If you haven’t created a Cedexis portal account yet, now’s the time. You can set up your global application delivery in a few minutes and see how Sonar works for yourself.   

Caching at The Edge: The Secret Accelerator

Think about how much data has to move between a publisher and a whole audience of eager viewers, especially when that content is either being streamed live, or is a highly-anticipated season premiere (yes, we’re all getting excited for the return of GoT). Now ask yourself where there is useless repetition, and an opportunity to make the whole process more efficient for everyone in the process.

Do so, and you come up with the Streaming Video Alliance-backed concept of Open Caching.

The short explanation is this: popular video content is detected and cached by ISPs at the edge; then, when consumers want to watch that content, they are served from local caches, instead of forcing everyone to pass a net-new version from origin to CDN to ISP. The amazing thing is how much of a win/win/win it really is:

  • Publishers and CDNs don’t have to deliver as much traffic to serve geographically-centered audiences
  • ISPs don’t have to pull multiple identical streams from publishers and CDNs
  • Consumers get their video more quickly and reliably, as it is served from a source that is much closer to them

A set of trials opened up in January, featuring some of the biggest names in streaming video: ViaSat, Viacom, Charter, Verizon, Yahoo, Limelight Networks, MLBAM, and Qwilt.

If this feels a bit familiar, it should: Netflix has essentially built exactly this (they call it Netflix Open Connect) by placing hardware within IXPs and ISPs around the world – some British researchers have mapped it, and it’s fascinating. And, indeed, Netflix recently doubled down in India, deploying cached versions of its catalog (or at least the most-used elements of it) all around that country.  The bottom line is that the largest streaming video provider (accounting for as much as 37% of all US Internet traffic) understands that the best experience is delivered by having the content closer to the consumer.

As it turns out, ISPs are flocking to this technology for all the reasons one might expect: it gives them back some control over their networks, and provides an opportunity to get off the backhaul treadmill. By pulling, say, a live event one time, caching it at the edge, and then delivering from that edge cache, they can substantially reduce their network volume and make end customers happy.


And yet – most publishers are only vaguely aware that this is happening (if you’re already up to speed on ISP caching, consider yourself ahead of the curve). Part of the reason is that when ISPs cache content that has traveled their way through a CDN, they preserve the headers – so the traffic isn’t necessarily identifiable as having been cached. And if you have video monitoring at the client, those headers are what gets reported, potentially making the performance of a given CDN look even better than it already is, because content is being served at the edge by the ISP. The ISP, in other words, is not only making the publisher look good with excellent QoE – it’s also making the CDN look like a rock star!

To summarize: the caching that is happening at the ISP level is like a double-super-secret accelerator for your content, whose impact is currently difficult to measure.

It’s also, however, pretty easy to break. Publishers who opt to secure all their traffic essentially eliminate the opportunity for the ISP to cache their content, because the caching intelligence can’t identify what the file is or whether it needs caching. Now, that’s not to say the challenge is insurmountable – APIs and integrations exist that allow the ISP to re-enter the fray, decrypt that secure transmission, and get back to work making everyone look good by delivering quickly and effectively to end consumers.

So if you aren’t yet up to speed on open caching, now is the time to do a little research. Pop over to the Streaming Video Alliance online and learn more about their Open Caching working group today – there’s nothing like finding out you deployed a secret weapon, without even knowing you did it.

 

Don’t Be Afraid of Microservices!

Architectural trends are to be expected in technology. From the original all-in-one-place COBOL behemoths half the world just learned existed because of Hidden Figures, to three-tiered architecture, to hyper-tier architecture, to Service Oriented Architecture… really, it’s enough to give anyone a headache.

And now we’re in a time of what Gartner very snappily calls Mesh App and Service Architecture (or MASA). Whether everyone else is going for that particular nomenclature is less relevant than the reality that we’ve moved on from web services and SOA toward containerization, de-coupling, and the broadest possible use of microservices.

Microservices sound slightly disturbing, as though they’re very, very small components, of which one would need dozens if not hundreds to do anything. Chris Richardson of Eventuate, though, recently begged us not to assume that just because of the name these units are tiny. In fact, it makes more sense to think of them as ‘hyper-targeted’ or ‘self-contained’ services: their purpose should be to execute a discrete set of logic, which can exist in isolation, and simply provide easily-accessed public interfaces. So, for instance, one could imagine a microservice whose sole purpose was to find the best match from a video library for a given user: requesting code would provide details on the user, the service would return the recommendation. Enormous amounts of sophistication may go into ingesting the user-identifying data, relating it to metadata, analyzing past results, and coming up with that one shining, perfect recommendation…but from the perspective of the team using the service, they just need to send a properly-formed request, and receive a properly-formed response.
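As a toy illustration of that request/response contract (hypothetical names throughout, not any particular product), a recommendation microservice might be as small as this TypeScript sketch:

    import { createServer } from "node:http";

    // Placeholder logic: a real service would consult viewing history, metadata, models, etc.
    function recommend(userId: string): { userId: string; title: string } {
      return { userId, title: `Suggested title for ${userId}` };
    }

    // The entire public surface is one endpoint: send user details, get a recommendation back.
    createServer((req, res) => {
      const url = new URL(req.url ?? "/", "http://localhost");
      if (req.method === "GET" && url.pathname === "/recommendation") {
        const userId = url.searchParams.get("userId") ?? "anonymous";
        res.writeHead(200, { "Content-Type": "application/json" });
        res.end(JSON.stringify(recommend(userId)));
      } else {
        res.writeHead(404);
        res.end();
      }
    }).listen(8080); // e.g. GET /recommendation?userId=42

From the consuming team’s perspective, that single well-formed request and response is the whole contract; everything behind it can evolve independently.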

The apps we all rely upon on those tiny little computers we carry around in our pocketbooks or pockets (i.e. smart phones) fundamentally rely on microservices, whether or not their developers thought to describe them that way. That’s why they sometimes wake up and spring to life with goodness…and sometimes seem to drag, or even fail to get going. They rely upon a variety of microservices – not always based at their own home location – and it’s the availability of all those microservices that dictates the user experience. If one microservice fails, and is not dealt with elegantly by the code, the experience becomes unsatisfactory.

If that feels daunting, it shouldn’t – one company managed to build the whole back end of a bank on this architecture.

Clearly, the one point of greatest risk is the link to the microservice – the API call, if you will. If the code calls a static endpoint, the risk is that that endpoint isn’t available for some reason, or at least isn’t responding at an acceptable speed. This is why there are any number of solutions for trying to ensure the microservice is available, often spread between authoritative DNS services (which essentially take all the calls for a given location and then assign them to backend resources based on availability) and application delivery controllers (generally physical devices that perform the same service). Of course, if either is down, life gets tricky quickly.

In fact, the trick to planning for highly available microservices is to address them through locations that are managed by a cloud-based application delivery service. In other words, as the microservice is required, a call goes out to a location that combines both synthetic and real-user measurements to determine the most performant source and redirect the traffic there. This compounds the benefits of the microservice architecture: not only can the microservice itself be maintained and updated independently of the apps that use it, but the network and infrastructure necessary to its smooth and efficient delivery can also be tweaked without affecting existing users.
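A minimal sketch of what the calling side might look like under that approach (the hostnames are hypothetical; the delivery-managed name is resolved by the traffic management layer, with a static backup retained for emergencies):

    // Call the microservice via a delivery-managed hostname; fail over to a backup if the
    // primary is unreachable or too slow. Hostnames here are placeholders.
    const PRIMARY = "https://recs.managed-delivery.example.com/recommendation";
    const FALLBACK = "https://recs-backup.example.com/recommendation";

    async function fetchWithTimeout(url: string, ms: number): Promise<Response> {
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(), ms);
      try {
        return await fetch(url, { signal: controller.signal });
      } finally {
        clearTimeout(timer);
      }
    }

    export async function getRecommendation(userId: string): Promise<unknown> {
      const query = `?userId=${encodeURIComponent(userId)}`;
      try {
        const res = await fetchWithTimeout(PRIMARY + query, 500); // fail fast on a slow path
        if (res.ok) return res.json();
      } catch {
        // primary unreachable or too slow; fall through to the backup
      }
      return (await fetchWithTimeout(FALLBACK + query, 2000)).json();
    }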

Microservices are the future. To make the most of them, first ensure that they independently address discrete purposes; then make sure that their delivery is similarly defined and flexible, without recourse to updating the apps that use them; then settle back and watch performance meet innovation.

Live and Generally Available: Impact Resource Timing

We are very excited to be officially launching Impact Resource Timing (IRT) for general availability.

IRT is Impact’s powerful window into the performance of different sources of content for the pages in your website. For instance, you may want to distinguish the performance of your origin servers relative to cloud sources, or advertising partners; and by doing so, establish with confidence where any delays stem from. From here, you can dive into Resource Timing data sliced by various measurements over time, as well as through a statistical distribution view.

What is Resource Timing? Broadly speaking, resource timing measures latency within an application (i.e. the browser). It uses JavaScript as the primary mechanism to instrument various time-based metrics of all the resources requested and downloaded for a single website page by an end user. Individual resources are objects such as JS, CSS, images, and other files that the website page requests. The faster the resources are requested and loaded on the page, the better the quality of experience (QoE) for users.  By contrast, resources that cause longer latency can produce a negative QoE.  By analyzing resource timing measurements, you can isolate the resources that may be causing degradation, so your organization can fix them.

Resource Timing Process:

Cedexis IRT makes it easy for you to track resources from identified sources, normally identified by domain (*.myDomain.com), by sub-domain (e.g. images.myDomain.com), and by the provider serving your content. In this way, you can quickly group together types of content and identify the source of any latency. For instance, you might find that origin-located content is being delivered swiftly while cloud-hosted images are slowing down the load time of your page; in that situation, you would be in a position to consider a range of solutions, including adding a secondary cloud provider and a global server load balancer to protect QoE for your users.
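Under the hood, the raw material comes from the browser’s standard Resource Timing API. Here is a rough sketch (plain browser TypeScript, not the Impact tag itself) of grouping per-resource durations by hostname, much as the report does:

    // Group every resource loaded by the page by hostname and average its load duration.
    const byHost = new Map<string, { count: number; totalDuration: number }>();

    for (const entry of performance.getEntriesByType("resource") as PerformanceResourceTiming[]) {
      const host = new URL(entry.name).hostname; // e.g. images.myDomain.com
      const bucket = byHost.get(host) ?? { count: 0, totalDuration: 0 };
      bucket.count += 1;
      bucket.totalDuration += entry.duration; // responseEnd - startTime, in milliseconds
      byHost.set(host, bucket);
    }

    for (const [host, { count, totalDuration }] of byHost) {
      console.log(`${host}: ${count} resources, avg ${(totalDuration / count).toFixed(1)} ms`);
    }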

Some benefits of tracking Resource Timing:

  • See which hostnames – and thus which classes of content – are slowing down your site.
  • Determine which resources impact your overall user experience.
  • Correlate resource performance with user experience.

Impact Resource Timing from Cedexis allows you to see how content sources are performing across various measurement types such as Duration, TCP Connection Time, and Round Trip Time. IRT reports also give you the ability to drill down further by Service Providers, Locations, ISPs, User Agent (device, browsers, OS) and other filters.

Check out our User Guide to learn more about our Measurement Type calculations.

There are two primary reports in this release of Impact Resource Timing: the Performance report, which gives you a trending view of resource timing over time, and the Statistical Distribution report, which presents Resource Timing data through a statistical distribution view.  Both reports have very dynamic reporting capabilities that allow you to easily pinpoint resource-related issues for further analysis.


Using the Performance report, you can isolate which grouped resources are causing potential end user experience issues by hostname, page, or service provider, and when the issue happened. Drill down even further to see whether it was a global issue, localized to a specific location, or confined to certain user devices or browsers.

IRT is now available for all in the Radar portal – take it for a spin and let us know your experiences!

Why The Web Is So Congested

If you live in a major city like London, Tokyo, or San Francisco, you learn one thing early: driving your car through the city center is about the slowest possible way to get around. Which is ironic, when you think about it, as cars only became popular because they made it possible to get around more quickly. There is, it seems, an inverse relationship between efficiency and popularity, at least when it comes to goods that pass through a public commons like roads.

Or like the Internet.

Think about all that lovely 4K video you could be consuming if there was nothing between you and your favorite VOD provider but a totally clear fiber optic cable. But unless you live in a highly over-provisioned location, that’s exactly what’s not going on; rather, you’re lucky to get a full HD picture, and even luckier if it stays at 1080p, without buffering, all the way through. Why? Because you’re sharing a public commons – the Internet – and its efficiency is being chewed away by popularity.

Let’s do some math to illustrate this:

  • Between 2013 and January 2017 the number of web users increased by 1.4 billion people to just over 3.7 billion. Today Internet penetration is at 50% (or put another way – half the world isn’t online yet)
  • In 2013, the average amount of Internet data per person was 7.9GB per month; by 2015 it was 9.9GB, with Cisco expecting it to reach over 25GB by 2020 – so assume something in the range of 15GB by 2017.
  • Logically, then, in 2013 web traffic would have been around 2.3B * 7.9GB per month (roughly 18.2 exabytes); by 2017 it would have been 3.7B * 17GB per month (roughly 62.9 exabytes)
  • If we assume another billion Internet users by 2020, we’re looking at 4.7B * 25GB per month – or a full 117.5 exabytes

In just seven years, monthly web traffic will have grown more than sixfold (based on the math, anyway: Cisco is estimating closer to 200 exabytes monthly by 2020).
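A quick back-of-the-envelope check of those figures (TypeScript, using the numbers quoted above):

    // 1 exabyte = 1e9 gigabytes, so (users * GB per user per month) / 1e9 = EB per month.
    const exabytesPerMonth = (users: number, gbPerUser: number): number =>
      (users * gbPerUser) / 1e9;

    console.log(exabytesPerMonth(2.3e9, 7.9).toFixed(1)); // ~18.2 EB/month (2013)
    console.log(exabytesPerMonth(3.7e9, 17).toFixed(1));  // ~62.9 EB/month (2017)
    console.log(exabytesPerMonth(4.7e9, 25).toFixed(1));  // ~117.5 EB/month (2020 projection)
    console.log((exabytesPerMonth(4.7e9, 25) / exabytesPerMonth(2.3e9, 7.9)).toFixed(1)); // ~6.5x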

And that is why the web is so busy.

But it doesn’t describe why the web is congested. Congestion happens when there is more traffic than transit space – which is why, as cities get larger and more populous, governments add lanes to major thoroughfares, meeting the automobile demand with road supply.

Unfortunately, unlike cars on roads, Internet traffic doesn’t travel in straight lines from point to point. So even though infrastructure providers have been building out capacity at a madcap pace, it’s not always connected in a way that makes transit efficient. And, unlike roads, digital connections are not built out of concrete, and often become unavailable – sometimes for long enough to cause consternation and PR challenges, and sometimes just for a minute or so, stymying a relative handful of customers.

For information to get from A to B, it has to traverse any number of interconnected infrastructures, from ISPs to the backbone to CDNs, and beyond. Each is independently managed, meaning that no individual network administrator can guarantee smooth passage from beginning to end. And with all the traffic that has been – and will continue to be – added to the Internet, it has become essentially a guarantee that some portion of content requests will bump into transit problems along the way.

Let’s also note that the modern Internet is characterized less by cat memes, and more by the delivery of information, functionality, and ultimately, knowledge. Put another way, the Internet today is all about applications: whether represented as a tile on a smart phone home screen, or as a web interface, applications deliver the intelligence to take the sum total of all human knowledge that is somewhere on the web and turn it into something we can use. When you open social media, the app knows who you want to know about; when you consult your sports app, it knows which teams you want to know about first; when you check your financial app, it knows how to log you in from a fingerprint and which account details to show first. Every time that every app is asked to deliver any piece of knowledge, it is making requests across the Internet – and often multiple requests of multiple sources. Traffic congestion doesn’t just endanger the bitrate of your favorite sci fi series – it threatens the value of every app you use.

Which is why real-time predictive traffic routing is becoming a topic that web native businesses are digging deeper into. Think of it as Application Delivery for the web – a traffic cop that spots congestion and directs content around it, so that it’s as though it never happened. This is the only way to solve for efficient routing around a network of networks without a central administrator: assume that there will be periodic roadblocks, and simply prepare to take a different route.

The Internet is increasingly congested. But by re-directing traffic to the pathways that are fully available, it is possible to get around all those traffic jams. And, actually, it’s possible to do today.

Find out more by reading the story of how Rosetta Stone improved performance for over 60% of their worldwide customers.

 

Amazon Outage: The Aftermath

Amazon’s AWS S3 storage service had a major, widely reported, multi-hour outage yesterday in its US-East-1 data center. The S3 service in this particular data center was one of the very first services Amazon launched when it introduced cloud computing to the world more than 10 years ago. It has grown exponentially since, storing over a trillion objects and servicing a million requests per second in support of thousands of web properties (this article alone lists over 100 well-known properties that were impacted by this outage).

Amazon has today published a description of what happened. The summary is that this was caused by human error: one operator, following a published runbook procedure, mistyped a command parameter, setting a sequence of failure events in motion. The outage started at 9:37 am PST.  A nearly complete S3 service outage lasted more than three hours, and full recovery of other S3-dependent AWS services took several hours more.

A few months ago, Dyn taught the industry that single-sourcing your authoritative DNS creates the risk the military describes as “two is one, one is none.” This S3 incident underscores the same lesson for object storage. No service tier is immune. If a website, content, service, or application is important, redundant alternative capability at all layers is essential. And this requires appropriate capabilities to monitor and manage that redundancy. After all, failover capacity is only as good as the system’s ability to detect the need to fail over, and then to actually do so. This has been at the heart of Cedexis’ vision since the beginning, and as we continue to expand our focus in streaming/video content and application delivery, this will continue to be an important and valuable theme as we seek to improve the Internet experience of every user around the world.

Even the very best, most experienced services can fail. And with increasing deconstruction of service-oriented architectures, the deeply nested dependencies between services may not always be apparent. (In this case, for example, the AWS status website had an underlying dependency on S3 and thus incorrectly reported the service at 100% health during most of the outage.)

We are dedicated to delivering data-driven, intelligent traffic management for redundant infrastructure of any type. Incidents like this should continue to remind the digital world that redundancy, automated failover, and a focus on the customer experience are fundamental to the task of delivering on the continued promise of the Internet.

Make Mobile Video Stunning with Smart Load Balancing

If there’s one thing about which there is never an argument it’s this: streaming video consumers never want to be reminded that they’re on the Internet. They want their content to start quickly, play smoothly and uninterrupted, and be visually indistinguishable from traditional TV and movies. Meanwhile, the majority of consumers in the USA (and likely a similar proportion worldwide) prefer to consume their video on mobile devices. And as if that wasn’t challenging enough, there are now suggestions that live video consumption will grow – according to Variety by as much as 39 times! That seems crazy until you consider that Cisco predicted video would represent 82% of all consumer Internet traffic by 2020.

It’s no surprise that congestion can result in diminished viewing quality, leading over 50% of all consumers to, at some point, experience buffer rage from the frustration of not being able to play their show.

Here’s what’s crazy: there’s tons of bandwidth out there – but it’s stunningly hard to control.

The Internet is a best-efforts environment, over which even the most effective Ops teams can wield only so much control, because so much of it is either resident with another team, or is simply somewhere in the amorphous ‘cloud’.  While many savvy teams have sought to solve the problem by working with a Content Delivery Network (CDN), the sheer growth in traffic has meant that some CDNs are now dealing with as much traffic as the whole Internet transferred just a few years ago…and are themselves now subject to their own congestion and outage challenges. For this reason, plenty of organizations now contract with multiple CDNs, as well as placing their own virtual caching servers in public clouds, and even deploying their own bare-metal CDNs in data centers where their audiences are centered.

With all these great options for delivering content, Ops teams must make real-time decisions on how to balance the traffic across them all. The classic approaches to load balancing have been (with many thanks to Nginx):

  • Availability – Any servers that cannot be reached are automatically removed from the list of options (this prevents total link failure).
  • Round Robin – Requests are distributed across the group of servers sequentially.
  • Least Connections – A new request is sent to the server with the fewest current connections to clients. The relative computing capacity of each server is factored into determining which one has the least connections.
  • IP Hash – The IP address of the client is used to determine which server receives the request.
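As a quick illustration, here is a minimal sketch of two of those classic strategies (TypeScript, illustrative only):

    interface Server { host: string; activeConnections: number; capacity: number; }

    // Round Robin: hand requests to servers in sequence.
    function makeRoundRobin(servers: Server[]): () => Server {
      let i = 0;
      return () => servers[i++ % servers.length];
    }

    // Least Connections: pick the server with the fewest connections relative to its capacity.
    function leastConnections(servers: Server[]): Server {
      return servers.reduce((best, s) =>
        s.activeConnections / s.capacity < best.activeConnections / best.capacity ? s : best
      );
    }

    const pool: Server[] = [
      { host: "10.0.0.1", activeConnections: 12, capacity: 100 },
      { host: "10.0.0.2", activeConnections: 3, capacity: 50 },
    ];
    const nextServer = makeRoundRobin(pool);
    console.log(nextServer().host, leastConnections(pool).host);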

You might notice something each of those has in common: they all focus on the health of the system, not on the quality of the experience actually being had by the end user. Anything that balances based on availability tends to be driven by what is known as synthetic monitoring, which is essentially one computer checking that another computer is available.

But we all know that just because a service is available doesn’t mean that it is performing to consumer expectations.

That’s why the new generation of Global Server Load Balancer (GSLB) solutions goes a step further. Today’s GSLB uses a range of inputs, including:

  • Synthetic monitoring – to ensure servers are still up and running
  • Community Real User Measurements – a range of inputs from actual customers of a broad range of providers, aggregated, and used to create a virtual map of the Internet
  • Local Real User Measurements – inputs from actual customers of the provider’s own service
  • Integrated 3rd party measurements – including cost bases and total traffic delivered for individual delivery partners, used to balance traffic based not just on quality, but also on cost

Combined, these data sources allow video streaming companies not only to guarantee availability, but also to tune their total network for quality, and to optimize within that for cost. Or put another way – streaming video providers can now confidently deliver the quality of experience consumers expect and demand, without breaking the bank to do it.
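As a rough, hypothetical sketch of how those inputs might be combined into a routing decision (the field names and weights are illustrative assumptions, not the actual Cedexis algorithm):

    interface ProviderMetrics {
      name: string;
      available: boolean;     // synthetic monitoring
      communityRttMs: number; // community real user measurements
      localRttMs: number;     // local real user measurements
      costPerGB: number;      // integrated 3rd-party data (e.g. contract pricing)
    }

    function pickProvider(candidates: ProviderMetrics[]): ProviderMetrics | undefined {
      const usable = candidates.filter((c) => c.available); // availability is a hard gate
      // Lower score is better: blend measured latency with delivery cost.
      const score = (c: ProviderMetrics) =>
        0.5 * c.communityRttMs + 0.3 * c.localRttMs + 0.2 * (c.costPerGB * 1000);
      return usable.sort((a, b) => score(a) - score(b))[0];
    }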

When you know that you are running across the delivery pathway with the highest quality metrics, at the lowest cost, based on the actual experience of your users – that’s a stunning result. And it’s only possible with smart load balancing, combining traditional synthetic monitoring with the real-time feedback of users around the world, and the 3rd party data you use to run your business.

If you’d like to find out more about smart load balancing, keep looking around our site. And if you’re going to be at Mobile World Congress at the end of the month, make an appointment to meet with us there so we can show you smart load balancing in real life.

Network Resilience for the Cloud

If we’ve learned nothing else over the last few weeks, it has been that the Internet is an unruly, inherently insecure network. The ups and downs of Dyn – taken offline by hackers, yet subsequently purchased for what appears to be north of half a billion dollars – remind us that we are still a generation or two away from comfortable consistency. More importantly, they remind cloud businesses that they are relying upon a network over which they have only limited control.

Peter Deutsch and others at Sun Microsystems proposed, a number of years ago, the Fallacies of Distributed Computing.  They are:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn’t change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

The briefest look through this list tells you that these are brilliantly conceived, and as valid today as they were when first introduced in 1994 (in fairness, number 8 was added in 1997). Indeed, turn them upside down and you can already see the poster to go on every Operations team’s wall, reminding them that:

  1. The Internet is inherently unreliable
  2. Internet latency is a fact of life, and must be anticipated
  3. Bandwidth is shared, precious, and limited
  4. No Internet-connected system is 100% secure
  5. Internet topology changes quicker than the staircases at Hogwarts
  6. There are so many administrators of the Internet there may as well be none
  7. Transport always has a cost – your job is to keep it low
  8. The Internet consists of an infinite number of misfitting pieces

In a recent paper commissioned by Cedexis from The FactPoint Group, a new paradigm is proposed: stop building fault-tolerant systems, and start building failure-tolerant systems. Simply stated, an internally-managed network can be constructed with redundancy and failover capabilities, with a reasonable goal of near-100% consistent service.  Cloud architectures, however, have so many moving parts and interdependencies that no amount of planning can eliminate failures. Cloud architecture, therefore, requires a design that assumes failures will happen and plans for them.


(Click here if you’d like to read the paper in full)

This means there’s more to this than load balancing – we’re really talking about resource optimization. Take, for example, caching within a private cloud. First, we can use a Global Traffic Manager (GTM) to maximize Quality of Experience (QoE) by routing traffic along the pathways that will deliver content most quickly and efficiently. Second, we can use intelligent caching to protect against catastrophic failure: a well-tuned Varnish server, for instance, can continue to serve cached content while an unavailable origin server is repaired and put back into service. If DNS services go down, the GTM can use Real User Measurements (RUM) to spot the problem and direct requests to the right Varnish server (a well-constructed decision set can contain IP addresses for just such emergencies). The Varnish server can check for the availability of its origin and, if DNS problems prevent it from sourcing fresh content, serve cached content instead.
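A simplified sketch of the “serve cached content while the origin is unavailable” idea (TypeScript, illustrative only – a stand-in for what a tuned Varnish instance does, with a hypothetical origin hostname):

    // Prefer fresh origin content; if the origin (or its DNS) is unreachable, serve the stale copy.
    const cache = new Map<string, { body: string; fetchedAt: number }>();

    async function serveWithFallback(path: string, originBase = "https://origin.example.com") {
      try {
        const res = await fetch(originBase + path);
        if (!res.ok) throw new Error(`origin returned ${res.status}`);
        const body = await res.text();
        cache.set(path, { body, fetchedAt: Date.now() }); // refresh the cache on success
        return body;
      } catch {
        const stale = cache.get(path);
        if (stale) return stale.body; // origin unavailable: keep serving, even if stale
        throw new Error("origin unavailable and no cached copy exists");
      }
    }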

Will this solve every challenge? Assuredly not – but multi-layered preparation for failure greatly improves our chances of protecting against extended outages. Meanwhile, the agility adopted by Operations teams as they prepare for failure means a more subtle, sophisticated set of network architectures, which lend themselves to far greater resiliency.

As applications increasingly become a tightly-knit conglomeration of web-connected services and resources, planning for failure is not a choice, it is an imperative. Protecting against the variety of threats to the shared Internet requires agility, forethought, and the zen-like acceptance that failure is inevitable.

What is the difference between the optimist and the pessimist?

Answer: nothing. Except the pessimist is better informed.

This old Russian joke is funny because it has some truth to it. The pessimist understands that things will fail. The pessimist is eventually always right, since eventually everything fails. There is a reason that most good system admins and operations professionals are pessimists. Everything eventually fails.

Today’s discussion is about the availability (or lack thereof) of CDNs (Content Delivery Networks) and cloud services.

As we will see in a moment, clouds and CDNs can and do have availability issues. Regularly. These issues do not exhibit themselves as major outages that make the newspapers. Rather, they show up in the thousands of micro-outages (or reachability issues) between ISPs and the clouds/CDNs.

I recently went and looked at Cedexis Live to get a sense of how many Cloud outages we might see in a 10 day period. I randomly chose June 25th to July 5th. Over that time there were 156 Availability lapses and 24 Latency fluctuations.

[Screenshot: Cedexis Live – cloud availability and latency issues, June 25 to July 5]

In the world of CDNs the micro-outages during this time frame came even hotter and heavier!

[Screenshot: Cedexis Live – CDN availability and latency issues over the same period]

As you can see – 638 Availability issues and 546 significant Latency fluctuations!

We have talked about Cedexis Live before and if you have not had a chance to see how messy the internet can be, I urge you to go check it out.

Sum of Availability

One of the best examples of an innovative company that has pursued this strategy to increase its Availability is Amplience.

This rapidly growing company has solved many of the really hard problems for its customer base, one of which is availability.


They solved the problem of availability by combining the natural availability provided by each provider into a 100% available solution. This is woven into their broader product in an integrated fashion, so that their customers do not even know it is there – and yet it all works flawlessly together.
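The arithmetic behind that claim is simple: an outage only hurts when every provider is down at once. A quick sketch (TypeScript, assuming the providers fail independently):

    // Combined availability = 1 - probability that all providers are down simultaneously.
    function combinedAvailability(availabilities: number[]): number {
      const allDown = availabilities.reduce((p, a) => p * (1 - a), 1);
      return 1 - allDown;
    }

    console.log(combinedAvailability([0.99, 0.99]));        // 0.9999 – two 99% providers give four nines
    console.log(combinedAvailability([0.995, 0.99, 0.98])); // ~0.999999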


You can learn more about how Amplience uses Cedexis here to see how it might work for you.

I will leave you with another great quote on the difference between the optimist and the pessimist.

A pessimist is a man who has been compelled to live with an optimist.
-Elbert Hubbard, 1927

Don’t be an optimist. Be a realist. Things always fail, but the good news is you can do something about it.