Posts

More Than Science Fiction: Why We Need AI for Global Data Traffic Management

Originally published on the Product Design and Development website

by Josh Gray, Chief Architect, Cedexis

Blogpost

Blade Runner 2049 struck a deep chord with science fiction fans. Maybe it’s because there’s so much talk these days of artificial intelligence and automation — some of it doom and gloom, some of it utopian, with every shade of promise and peril in between. Many of us, having witnessed the Internet revolution first hand — and still in awe of the wholesale transformation of commerce, industry, and daily life — find ourselves pondering the shape of the future. How can the current pace of change and expansion be sustainable? Will there be a breaking point? What will it be: cyber (in)security, the death of net neutrality, or intractable bandwidth saturation?

Only one thing is certain: there will never be enough bandwidth. Our collectively insatiable need for streaming video, digital music and gaming, social media connectivity, plus all the cool stuff we haven’t even invented yet, will fill up whatever additional capacity we create. The reality is that there will always be buffering on video — we could run fiber everywhere, and we’d still find a way to fill it up with HD, then 4K, then 8K, and whatever comes next.

Just like we need smart traffic signals and smart cars in smart cities to handle the debilitating and dangerous growth of automobile traffic, we need intelligent apps and networks and management platforms to address the unrelenting surge of global Internet traffic. To keep up, global traffic management has to get smarter, even as capacity keeps growing.

Fortunately, we have Big Data metrics, crowd-sourced telemetry, algorithms, and machine learning to save us from breaking the Internet with our binge watching habits. But, as Isaac Asimov pointed out in his story Runaround, robots must be governed. Otherwise, we end up with a rogue operator like HAL, an overlord like Skynet, or (more realistically) the gibberish intelligence of the experimental Facebook chatbots. In the case of the chatbots, the researchers learned a valuable lesson about the importance of guiding and limiting parameters: they had neglected to specify use of recognizable language, so the independent bots invented their own.

In other words, AI is exciting and brimming over with possibilities, but needs guardrails if it is to maximize returns and minimize risks. We want it to work out all the best ways to improve our world (short of realizing that removing the human race could be the most effective pathway to extending the life expectancy of the rest of Nature).

It’s easy to get carried away by grand futuristic visions when we talk about AI. After all, some of our greatest innovators are actively debating the enormous dangers and possibilities. But let’s come back down to earth and talk about how AI can at least make the Internet work for the betterment of viewers and publishers alike, now and in the future.

We are already using basic AI to bring more control to the increasingly abstract and complex world of hybrid IT, multi-cloud, and advanced app and content delivery. What we need to focus on now is building better guardrails and establishing meaningful parameters that will reliably get our applications, content, and data where we want them to go without outages, slowdowns, or unexpected costs. Remember, AI doesn’t run unerringly in glorious isolation, free of any need for continual course adjustment: that is a common misconception, and it leads to wasted effort and disappointing or even disastrous results. Even Amazon seems to have fallen prey to the set-it-and-forget-it mentality: ask yourself, how often does their shopping algorithm suggest the exact same item you purchased yesterday? Their AI parameters may need periodic adjustment to reliably suggest related or supplementary items instead.

For AI to be practically applied, we have to be sure we understand the intended consequences. This is essential from many perspectives: marketing, operations, finance, compliance, and business strategy. For instance, we almost certainly don’t want automated load balancing to always route traffic for the best user experience possible — that could be prohibitively expensive. Similarly, sometimes we need to route traffic from or through certain geographic regions in order to stay compliant with regulations. And we don’t want to simply send all the traffic to the closest, most available servers when users are already reporting that quality of experience (QoE) there is poor.

When it comes right down to it, the thing that makes global traffic management work is our ability to program the parameters and rules for decision-making — as it were, to build the guardrails that force the right outcomes. And those rules are entirely reliant upon the data that flows in. To get this right, systems need access to a troika of guardrails: real-time comprehensive metrics for server health, user experience health, and business health.
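To make that concrete, here is a minimal TypeScript sketch of what programmable guardrails might look like. The types, rule values, and the pickPlatform helper are all illustrative assumptions, not the Cedexis Openmix API:

```typescript
// Illustrative sketch only: these types, thresholds, and rules are hypothetical,
// not part of any Cedexis product API.
interface Candidate {
  name: string;
  available: boolean;      // from synthetic system health checks
  latencyMs: number;       // from real user measurements
  region: string;          // where the platform serves from
  costPerGB: number;       // from business/contract data
}

type Guardrail = (c: Candidate) => boolean;

// Guardrails encode the rules: system health, regulatory compliance, budget.
const guardrails: Guardrail[] = [
  (c) => c.available,                      // server health
  (c) => c.region !== "restricted-region", // compliance (placeholder rule)
  (c) => c.costPerGB <= 0.05,              // budget ceiling (placeholder number)
];

function pickPlatform(candidates: Candidate[]): Candidate | undefined {
  // Only platforms that pass every guardrail are eligible...
  const eligible = candidates.filter((c) => guardrails.every((g) => g(c)));
  // ...and among those, the best user experience wins.
  return eligible.sort((a, b) => a.latencyMs - b.latencyMs)[0];
}
```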

System Guardrails

Real-time systems health checks are the first element of the guardrail troika for intelligent traffic routing. Accurate, low-latency, geographically dispersed synthetic monitoring answers the essential server availability question reliably and in real time: is the server up and running at all?

Going beyond ‘On/Off’ confidence, we need to know the current health of those available servers. A system that is working fine right now may be approaching resource limits, and a simple On/Off measurement won’t know this. Without knowing the current state of resource usage, a system can cause so much traffic to flow to this near-capacity resource that it goes down, potentially setting off a cascading effect that takes down other working resources.

Without scriptable load balancing, you have to dedicate significant time to shifting resources around in the event of DDoS attacks, unexpected surges, launches, repairs, etc. — and problems mount quickly if someone takes down a resource for maintenance but forgets to make the proper notifications and preparations ahead of time. Dynamic global server load balancers (GSLBs) use real-time system health checks to detect potential problems, route around them, and send an alert before failure occurs so that you can address the root cause before it gets messy.
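As a rough illustration of that route-around-and-alert behavior, here is a small TypeScript sketch; the HealthReport shape and the 85% threshold are assumptions for the example, not values from any real GSLB:

```typescript
// Hypothetical health report; a real GSLB would consume richer telemetry.
interface HealthReport {
  endpoint: string;
  up: boolean;         // the basic on/off synthetic check
  utilization: number; // 0..1, how close the server is to its resource limits
}

const CAPACITY_THRESHOLD = 0.85; // illustrative headroom limit

function routableEndpoints(reports: HealthReport[]): string[] {
  // Warn about servers that are up but close to their limits, before they fail.
  for (const r of reports) {
    if (r.up && r.utilization >= CAPACITY_THRESHOLD) {
      console.warn(`${r.endpoint} is near capacity (${Math.round(r.utilization * 100)}%)`);
    }
  }
  // Route only to servers that are both up and have headroom left.
  return reports
    .filter((r) => r.up && r.utilization < CAPACITY_THRESHOLD)
    .map((r) => r.endpoint);
}
```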

Experience Guardrails

The next input to the guardrail troika is Real User Measurements (RUM), which provide information about Internet performance at every step between the client and the clouds, data centers, or CDNs hosting applications and content. Simply put, RUM is the critical measurement of the experience each user is having. As they say, the customer is always right, even when Ops says the server is working just fine. To develop true traffic intelligence, you have to go beyond your own system. This data should be crowd-sourced by collecting metrics from thousands of Autonomous System Numbers, delivering billions of RUM data points each day.

Community-sourced intelligence is necessary to see what’s really happening both at the edges of the network and in the big, messy pools of users where your own visibility may be limited (e.g. countries with thousands of ISPs like Brazil, Russia, Canada, and Australia). Granular, timely, real user experience data is particularly important at a time when there are so many individual peering agreements and technical relationships, any of which could be the source of unpredictable performance and quality.
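For a sense of where such measurements come from, here is a minimal browser-side sketch of a RUM beacon. The collector URL and payload fields are hypothetical assumptions; the real Radar client is more sophisticated than this:

```typescript
// Minimal browser-side RUM beacon sketch; payload shape and collector endpoint
// are illustrative assumptions, not the Cedexis Radar client.
function reportRealUserMeasurement(resourceUrl: string, collectorUrl: string): void {
  const entries = performance.getEntriesByName(resourceUrl) as PerformanceResourceTiming[];
  if (entries.length === 0) return;

  const last = entries[entries.length - 1];
  const payload = JSON.stringify({
    url: resourceUrl,
    durationMs: last.duration,                      // total fetch time as the user experienced it
    ttfbMs: last.responseStart - last.requestStart, // time to first byte
    timestamp: Date.now(),
  });

  // sendBeacon survives page unloads, so measurements are not lost mid-navigation.
  navigator.sendBeacon(collectorUrl, payload);
}
```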

Business Guardrails

Together, system and experience data inform intelligent, automated decisions so that traffic is routed to servers that are up and running, demonstrably providing great service to end users, and not in danger of maxing out or failing. As long as everything is up and running and users are happy, we’re at least halfway home.

We’re also at the critical divide where careful planning to avoid unintended consequences comes into play. We absolutely must have the third element of the troika: business guardrails.

After all, we are running businesses. We have to consider more than bandwidth and raw performance: we need to optimize the AI parameters to take care of our bottom line and other obligations as well. If you can’t feed cost and resource usage data into your global load balancer, you won’t get traffic routing decisions that are as good for profit margins as they are for QoE. As happy as your customers may be today, their joy is likely to be short-lived if your business exhausts its capital reserves and resorts to cutting corners.
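One way to picture feeding cost data into the decision is a blended score across experience and business metrics. The interface, weights, and normalization below are placeholder assumptions a real deployment would tune, not settings from an actual load balancer:

```typescript
// Hypothetical cost-aware ranking; the weights are business knobs, not real defaults.
interface PlatformMetrics {
  name: string;
  latencyMs: number;  // experience data (real user measurements)
  costPerGB: number;  // business data (contracts, invoices)
}

function rankByBlendedScore(platforms: PlatformMetrics[], costWeight = 0.3): PlatformMetrics[] {
  const maxLatency = Math.max(...platforms.map((p) => p.latencyMs));
  const maxCost = Math.max(...platforms.map((p) => p.costPerGB));
  // Normalize each metric to 0..1 so they can be blended, then rank lowest score first.
  const score = (p: PlatformMetrics) =>
    (1 - costWeight) * (p.latencyMs / maxLatency) + costWeight * (p.costPerGB / maxCost);
  return [...platforms].sort((a, b) => score(a) - score(b));
}
```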

Beyond cost control, automated intelligence is increasingly being leveraged in business decisions around product life cycle optimization, resource planning, responsible energy use, and cloud vendor management. It’s time to put all your Big Data streams (e.g., software platforms, APM, NGINX, cloud monitoring, SLAs, and CDN APIs) to work producing stronger business results. Third party data, when combined with real-time systems and user measurements, creates boundless possibilities for delivering a powerful decisioning tool that can achieve almost any goal.

Conclusion

Decisions made out of context rarely produce optimal results, and then only by sheer luck. Most companies have developed their own special blend of business and performance priorities (and anyone who hasn’t probably should). Automating an added control layer provides comprehensive, up-to-the-minute visibility and control, which helps any Ops team achieve cloud agility, performance, and scale while staying in line with business objectives and budget constraints.

Simply find the GSLB with the right decisioning capabilities, as well as the capacity to ingest and use System, Experience, and Business data in real-time, then build the guardrails that optimize your environment for your unique needs.

When it comes to practical applications of AI, global traffic management is a great place to start. We have the data, we have the DevOps expertise, and we are developing the ability to set and fine-tune the parameters. Without that discipline, we might break the Internet. That’s a doomsday scenario we all want to avoid, even those of us who love the darkest of dystopian science fiction.

About Josh Gray: Josh Gray has worked both as a leader in various startups and in large enterprise settings such as Microsoft, where he was awarded multiple patents. As VP of Engineering for Home Comfort Zone, he led a team that designed and developed systems featured in Popular Science, on HGTV, and on Ask This Old House, and that won #1 Cool Product at introduction at the Pacific Coast Builders Show. Josh has been a part of many other startups and built on his success by becoming an angel investor in the Portland community. Josh continues his run of success as Chief Architect at Cedexis. LinkedIn profile

 

Announcing Cedexis Netscope: Advanced Network Performance and Benchmarking Analysis

The Cedexis Radar community collects tens of billions of real user monitoring data points each day, giving Cedexis users unparalleled insight into how applications, videos, websites, and large file downloads are actually being experienced by their users. We’re excited to announce a product that offers a new lens into the Radar community dynamic data set: Cedexis Netscope.

Know how your service stacks up, down to the IP subnet
Metrics like network throughput, availability, and latency don’t tell the whole story of how your service is performing, because they are network-centric, not user-centric: however comprehensively you track network operations, what matters is the experience at the point of consumption. Cedexis Netscope provides you with additional user-centric context to assess your service, namely the ability to compare your service’s performance to the results of the “best” provider in your market. With up-to-date Anonymous Best comparative data, you’ll have a data-driven benchmark to use for network planning, marketing, and competitive analysis.

Highlight your Service Performance:

  • Relative to peers in your markets
  • In specific geographies
  • Compared with specific ISPs
  • Down to the IP subnet
  • Including both IPv4 and IPv6 addresses
  • With comprehensive data on latency and throughput
  • Covering both static and dynamic delivery

Actionable insights
Netscope provides detailed performance data that can be used to improve your service for end users. IT Ops teams can use automated or custom reports to view performance from your ASN versus peer groups in the geographies you serve. This lets you fully understand how you stack up versus the “best” service provider, using the same criteria. Real-time logs organized by ASN can be used to inform instant service repairs or for longer-term planning.

Powered by: the world’s largest user experience community
Real User Monitoring (RUM) means fully understanding how internet performance impacts customer satisfaction and engagement. Cedexis gathers RUM data from each step between the client and any of the clouds, data centers, and CDNs hosting your applications to build a holistic picture of internet health. Every request creates more data, continuously updating this unique real-time virtual map of the web.

Data and alerts, your way
To effectively evaluate your service and enable real-time troubleshooting, Netscope lets you roll up data by the ASN, country, region, or state level. You can zoom in within a specific ASN at the IP subnet level, to dissect the data in any way your business requires. This data will be stored in the cloud on an ongoing basis. Netscope also allows users to easily set up flexible network alerts for performance and latency deviations.

Netscope helps ISP Product Managers and Marketers better understand:

  • How well users connect to the major content distributors
  • How well users and businesses connect to public clouds (AWS, Google Cloud, Azure, etc.)
  • When, where, and how often outages and throughput issues happen
  • What happens during different times of day
  • Where the risks lie during big events (FIFA World Cup, live events, video/content releases)
  • How service on mobile looks versus web
  • How the ISP stacks up vs. “the best” ISP in the region

Bring advanced network analysis to your network
Netscope provides the critical data set you need for network planning and enhancement. With its real-time understanding of worldwide network health, Netscope gives you the context and actionable data you need to delight customers and increase your market share.

Ready to use this data with your team?

Set up a demo today

 

Why CapEx Is Making A Comeback

The meteoric rise of both the public cloud and SaaS has brought along a strong preference for OpEx vs. CapEx. To recap: OpEx means you stop paying for a thing up front, and instead just pay as you go. If you’ve bought almost any business software lately you know the drill: you walk away with a monthly or annual subscription, rather than a DVD-ROM and a permanent or volume license.

But the funny thing about business trends is the frequency with which they simply turn upside down and make the conventional wisdom obsolete.

Recently, we have started seeing interest in getting out of pay-as-you-go (rather unimaginatively shortened to PAYGO) as a model, and moving back toward making upfront purchases, then holding on for the ride as capital items get amortized.

Why? It’s all about economies of scale.

Imagine, if you will, that you are able to rent an office building for $10 a square foot, then rent out the space for $15 a square foot. Seems like a decent deal, with a 50% return on your outlay; but of course you’re also on the hook for servicing the customers, the space, and so forth. You’ll get a certain amount of relief as you share janitorial services across the space, of course, but your economic ceiling is stuck at 50%.

Now imagine that you purchase that whole building for $10M and rent out the space for $15M. Your debt payment may cut into profits for a few years, but at some point you’re paid off – and every year’s worth of rent thereafter is essentially all profit.

The first scenario puts an artificial boundary on both risk and reward: you’re on the hook for a fixed amount of rental cost, and can generate revenues only up to 150% of your outlay. You know how much you can lose, and how much you can gain. By contrast, in the second scenario, neither risk nor reward is bounded: with ownership comes risk (finding asbestos in the walls, say), as well as unlimited potential (raise rental prices and increase the profit curve).
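A toy calculation makes the difference plain. The numbers echo the example above, and the $1.5M annual rental income assumed for the purchase scenario is purely for illustration:

```typescript
// OpEx: lease at $10/sqft, sublease at $15/sqft -- return is capped at 50% of outlay, forever.
const opexReturnOnOutlay = (15 - 10) / 10; // 0.5

// CapEx: buy the building for $10M; assume (hypothetically) $1.5M/year in rental income.
const purchasePrice = 10_000_000;
const annualRentalIncome = 1_500_000;

// Negative until the purchase is paid off, then unbounded upside thereafter.
function cumulativeProfit(years: number): number {
  return years * annualRentalIncome - purchasePrice;
}
```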

This basic model applies to many cloud services – and to no small degree explains why so many companies are able to pop up: their growth scales with the services they provision.

If you were to fire up a new streaming video service that showed only the oeuvre of, say, Nicolas Cage, you’d want to have a fairly clear limit on your risk: maybe millions of people will sign up, but then again maybe they won’t. In order to be sure you’ve maximized the opportunity, though, you’ll need a rock-solid infrastructure to ensure your early adopters get everything they expect: quick video start times, low re-buffering ratios, and excellent picture resolution. It doesn’t make sense to build all that out anew: you’re best off popping storage onto a cloud, maybe outsourcing CMS and encoding to an Online Video Platform (OVP), and delegating delivery to a global content delivery network (CDN). In this way you can have a world-class service without having to pony up for servers, encoders, points of presence (POPs), load balancers, and all the other myriad elements necessary to compete.

In the first few months, this would be great – your financial risk is relatively low as you target your demand generation at the self-proclaimed “total Cage-heads”. But as you reach a wider and wider audience, and start to build a real revenue stream, you realize: the ongoing cost of all those outsourced, opex-based services is flattening the curve that could bring you to profitability. By contrast, spinning up a set of machines to store, compute, and deliver your content could set a relatively fixed cost that, as you add viewers, would allow you to realize economies of scale and unbounded profit.

We know that this is a real business consideration because Netflix already did it. Actually, they did it some time ago: while they do much (if not most) of their computation through cloud services, they decided in 2012 to move away from commercial CDNs in favor of their own Open Connect, and announced in 2016 that all of their content delivery needs were covered by their own network. Not only did this reduce their monthly opex bill, it also gave them control over the technology they used to guarantee an excellent quality of experience (QoE) for their users.

So for businesses nearing this OpEx vs. CapEx inflection point, the time really has arrived to put pencil to paper and calculate the cost of going it alone. The technology is relatively easy to acquire and manage, from server machines, to local load balancers and cache servers, and on up to global server load balancers. You can see a little bit more about how to actually build your own CDN here.

Opex solutions are absolutely indispensable in getting new services off the starting line; but it’s always worth keeping an eye on the economics, because with a large enough audience, going it alone becomes the smarter play.

Whither Net Neutrality

In case you missed it during what we shall carefully call a busy news week, the FCC voted to overturn net neutrality rules. While this doesn’t mean net neutrality is dead – there’s now a long period of public comments, and any number of remaining bureaucratic hoops to jump through – it does mean that it’s at best on life support.

So what does this mean anyway? And does it actually matter, or is it the esoteric nattering of a bunch of technocrats fighting over the number of angels that can dance on the head of a CAT-5 cable?

Let’s start with a level set: what is net neutrality? Wikipedia tells us that it is

“the principle that Internet service providers and governments regulating the Internet should treat all data on the Internet the same, not discriminating or charging differentially by user, content, website, platform, application, type of attached equipment, or mode of communication.”

Thought of another way, the idea is that ISPs should simply provide a pipe through which content flows, and have neither opinion about nor business interest in what that content is – just as a phone company doesn’t care who calls you or what they’re talking about.

The core argument for net neutrality is that, if ISPs can favor one content provider over another, they will create insurmountable barriers to entry for feisty, innovative new market entrants: Facebook, for example, could pay Comcast to make its social network run twice as smoothly as a newly-minted competitor’s. Meanwhile, that same ISP could, in an unregulated environment, accept payment to favor the advertisements of one political candidate over another, or simply block access to material for any reason at all.

On the other side, there are all the core arguments of de-regulation: demonstrating adherence to regulations siphons off productive dollars; regulations stifle innovation and discourage investment; regulations are a response to a problem that hasn’t even happened yet, and may never occur (there’s a nice layout of these arguments at TechCrunch here). Additionally, classic market economics suggests that, if ISPs provide a service that doesn’t match consumers’ needs, then those consumers will simply take their business elsewhere.

It doesn’t much matter which side of the argument you find yourself on: as long as there is an argument to be had, it is going to be important to have control over one’s traffic as it traverses the Internet. Radar’s ability to track, monitor, and analyze Internet traffic will be vital whether net neutrality is the law of the land or not. Consider these opposing situations, both of which rest on the same comparison (sketched below the list):

  • Net neutrality is the rule. Tracking the average latency and throughput for your content, then comparing it to the numbers for the community at large, will swiftly alert you if an ISP is treating your competition preferentially.
  • Net neutrality is not the rule. Tracking the average latency and throughput for your content, then comparing it to the numbers for the community at large, will swiftly alert you if an ISP is not providing the level of preference for which you have paid (or that someone else has upped the ante and taken your leading spot).
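In both scenarios the underlying check is the same: compare your own delivery metrics against the community baseline, ISP by ISP. Here is a rough TypeScript sketch; the data shapes and the 20 ms tolerance are illustrative assumptions, not the Radar API:

```typescript
// Hypothetical comparison of your own per-ISP latency against a community baseline.
interface IspSample {
  isp: string;
  medianLatencyMs: number;
}

function flagDivergentIsps(
  mine: IspSample[],
  community: IspSample[],
  toleranceMs = 20 // how much slower than the community baseline you will accept
): string[] {
  const baseline = new Map<string, number>();
  for (const s of community) baseline.set(s.isp, s.medianLatencyMs);

  // Return the ISPs where your traffic runs notably slower than the community's.
  return mine
    .filter((s) => {
      const communityLatency = baseline.get(s.isp);
      return communityLatency !== undefined && s.medianLatencyMs > communityLatency + toleranceMs;
    })
    .map((s) => s.isp);
}
```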

Activists from both sides will doubtless be wrestling with how to regulate (or not!) the ISP market for years to come. Throughout, and even after the conclusion of that fight, it’s vital for every company that delivers content across the public Internet to be, and remain, aware of the service their users are receiving – and who to call when things aren’t going according to plan.

Find out more about becoming a part of the Radar community by clicking here to create an account.

New Feature: Apply Filters

The Cedexis UI team spends considerable time looking for ways to help make our products both useful and efficient. One of the areas they’ve been concentrating on is improving the experience of applying several sets of filters to a report, which historically has led to a reload of the report every time a user has changed the filter list.

So we are excited to be rolling out a new reporting feature today called Apply Filters. With a focus on improved usability and efficiency, this feature lets you select (or deselect) all of your desired filters first and then click the Apply Filters button to re-run the report. By selecting all your filters at once, you save time and avoid the confusion of trying to remember which filters you selected while the report continuously refreshes itself.

The Apply Filters button has an off-state and an on-state. The off-state button is a lighter green version that you will see before any filter selections are made; the on-state becomes enabled once a filter selection has been made. Once you click Apply Filters and the report has finished re-running with the selected filters, the button returns to the off-state.

We have also placed the Apply Filters button at both the top and bottom of the Filters area. The larger button at the bottom stays fixed in place, so no matter how many filter options you have open, it will always be easily accessible.


We hope you’ll agree this makes reports easier to use, and will save you time as you slice-and-dice your way to a deep and broad understanding of how traffic is making its way across the public internet.

Want to check out the new filters feature, but don’t have a portal account? Sign up here for free!

Together, we’re making a better internet. For everyone, by everyone.

Which Is The Best Cloud or CDN?

Oh no, you’re not tricking us into answering that directly – it’s probably the question we hear more often than any other. The answer we always provide: it depends.

Unsatisfying? Fair enough. Rather than handing you a fish, let us show you how to go haul in a load of bluefin tuna.

What a lot of people don’t know is that, for free, you can answer this sort of thing all by yourself on the Cedexis portal. Just create an account, click through on the email we send, and you’re off to the races (go on – go do it now, we’ll wait…it’s easier to follow along when you have your own account).

The first thing you’ll want to do is find the place where you get all this graphical statistical goodness: click Radar, then select Performance Report, as shown below.

With this surprisingly versatile (and did we mention free?) tool, you can answer all the questions you ever had about traffic delivery around the world. For instance, suppose you’re interested in working out which continents have the best and worst availability. Simply change the drop-down near the top left to show ‘Continent’ instead of ‘Platform’, and voila – an entirely unsurprising result:

Now that’s a pretty broad brush. Perhaps you’d like to know how a particular group of countries or states looks relative to one another – simply select those countries or states from the Location section on the right-hand side of the screen. Do the same with Platforms (that’s the cloud providers and CDNs), and switch your view from Availability to Throughput or Latency to see how the various providers are doing when they are available.

So, if you’re comparing a couple of providers, in a couple of states, you might end up with something that looks like this:

Be careful though – across 30 days, measured day to day, it looks like there’s not much difference to be seen, nor much improvement to be found by using multiple providers. Make sure you dig in a little deeper – maybe to the last 7 days, 48 hours, or even 24 hours. Look what can happen when you focus in on, for instance, a 48-hour period:

There are periods there where having both providers in your virtual infrastructure would mean the difference between serving your audience really well, and being to all intents and purposes unavailable for business.

If you’ve never thought about using multiple traffic delivery partners in your infrastructure – or have considered it, but rejected it in the absence of solid data – today would be a great day to go poke around. More and more operations teams are coming to the realization that they can eliminate outages, guarantee consistent customer quality, and take control over the execution and cost of their traffic delivery by committing to a Hybrid Cloud/CDN strategy.

And did we mention that all this data is free for you to access?

 

Live and Generally Available: Impact Resource Timing

We are very excited to be officially launching Impact Resource Timing (IRT) for general availability.

IRT is Impact’s powerful window into the performance of different sources of content for the pages in your website. For instance, you may want to distinguish the performance of your origin servers relative to cloud sources, or advertising partners; and by doing so, establish with confidence where any delays stem from. From here, you can dive into Resource Timing data sliced by various measurements over time, as well as through a statistical distribution view.

What is Resource Timing? Broadly speaking, resource timing measures latency within an application (i.e. the browser). It uses JavaScript as the primary mechanism to instrument various time-based metrics for all the resources requested and downloaded for a single website page by an end user. Individual resources are objects such as JS, CSS, images, and other files that a website page requests. The faster the resources are requested and loaded on the page, the better the quality of experience (QoE) for users. By contrast, resources that cause longer latency can produce a negative QoE. By analyzing resource timing measurements, you can isolate the resources that may be causing degradation issues for your organization to fix.

Resource Timing Process:

Cedexis IRT makes it easy for you to track resources from identified sources – by domain (*.myDomain.com), by sub-domain (e.g. images.myDomain.com), and by the provider serving your content. In this way, you can quickly group together types of content and identify the source of any latency. For instance, you might find that origin-located content is being delivered swiftly, while cloud-hosted images are slowing down the load time of your page; in such a situation, you would now be in a position to consider a range of solutions, including adding a secondary cloud provider and a global server load balancer to protect QoE for your users.
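Under the hood, this kind of grouping can be done with the browser’s standard Resource Timing API. The sketch below is illustrative rather than Cedexis code, but it shows the basic idea of rolling entries up by hostname:

```typescript
// Group Resource Timing entries by hostname to see which content sources
// are slowing the page down. Illustrative only; not the Cedexis IRT implementation.
function slowestHostnames(topN = 5): Array<{ host: string; avgDurationMs: number }> {
  const entries = performance.getEntriesByType("resource") as PerformanceResourceTiming[];
  const byHost = new Map<string, number[]>();

  for (const entry of entries) {
    const host = new URL(entry.name).hostname; // e.g. images.myDomain.com
    byHost.set(host, [...(byHost.get(host) ?? []), entry.duration]);
  }

  return [...byHost.entries()]
    .map(([host, durations]) => ({
      host,
      avgDurationMs: durations.reduce((sum, d) => sum + d, 0) / durations.length,
    }))
    .sort((a, b) => b.avgDurationMs - a.avgDurationMs)
    .slice(0, topN);
}
```

In practice, IRT does this kind of aggregation for you, across real users rather than a single page load.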

Some benefits of tracking Resource Timing:

  • See which hostnames – and thus which classes of content – are slowing down your site.
  • Determine which resources impact your overall user experience.
  • Correlate resource performance with user experience.

Impact Resource Timing from Cedexis allows you to see how content sources are performing across various measurement types such as Duration, TCP Connection Time, and Round Trip Time. IRT reports also give you the ability to drill down further by Service Providers, Locations, ISPs, User Agent (device, browsers, OS) and other filters.

Check out our User Guide to learn more about our Measurement Type calculations.

There are two primary reports in this release of Impact Resource Timing: the Performance report, which gives you a trending view of resource timing over time, and the Statistical Distribution report, which presents Resource Timing data through a statistical distribution view. Both reports have dynamic reporting capabilities that allow you to easily pinpoint resource-related issues for further analysis.


Using the Performance report, you can isolate which grouped resources are causing potential end-user experience issues – by hostname, page, or service provider – and see when the issue happened. Drill down even further to determine whether an issue was global, localized to a specific location, or confined to certain user devices or browsers.

IRT is now available for all in the Radar portal – take it for a spin and let us know your experiences!

Better OTT Quality At Lower Cost? That Would Be Video Voodoo

According to the CTA, streaming video now claims as many subscribers as traditional Pay TV. Another study, from the Leichtman Research Group, proposed that more households have streaming video than have a DVR. However accurate – or wonkily constructed – these statistics, what’s not up for grabs is that more people than ever are getting a big chunk of their video entertainment over the Web. As the infamous AWS outage demonstrated, this means that providers are constantly at risk of seeing their best-laid plans laid low by someone else’s poor typing skills.

Resiliency isn’t a nice-to-have, it’s a necessity. Services that were knocked out last week owing to AWS’ challenges were, to some degree, lucky: they may have lost out on direct revenue, but their reputations took no real hit, because the core outage was so broadly reported. In other words, everyone knew the culprit was AWS. But it turns out that outages happen all the time – smaller, shorter, more localized ones, which don’t draw the attention of the global media, and which don’t supply a scapegoat. In those circumstances, a CDN glitch is invisible to the consumer, and is therefore not considered: when the consumer’s video doesn’t work, only the publisher is available to take the blame.

It’s for this reason that many video publishers that are Cedexis customers first start to look at breaking from the one-CDN-to-rule-them-all strategy, and look to diversify their delivery infrastructure. As often as not, this starts as simply adding a second provider: not so much as an equal partner, but as a safety outlet and backup. Openmix intelligently directs traffic, using a combination of community data (the 6 billion measurements we collect from web users around the world each day) and synthetic data (e.g. New Relic and CDN records). All of a sudden, even though outages don’t stop happening, they do stop being noticeable because they are simply routed around. Ops teams stop getting woken up in the middle of the night, Support teams stop getting sudden call spikes that overload the circuits, and PR teams stop having to work damage control.

But a funny thing happens once the outage distractions stop: there’s time to catch a breath, and realize there’s more to this multi-CDN strategy than just solving a pain. When a video publisher can seamlessly route between more than one CDN, based on each one’s ability to serve customers at an acceptable quality level, there is a natural economic opportunity to choose the best-cost option – in real time. Publishers can balance traffic based simply on per-Gig pricing; ensure that commits are met, but not exceeded until every bit of pre-paid bandwidth throughout the network is exhausted; and distribute sudden spikes to avoid surge pricing. Openmix users have reported cost savings that reach low to mid double-digit percentages – while delivering a superior, more consistent, more reliable service to their users.
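A simplified sketch of that commit-aware logic follows; the contract fields and selection rules are assumptions for illustration, not how Openmix is actually configured:

```typescript
// Hypothetical commit-aware CDN selection: burn down pre-paid commits first,
// then fall back to the cheapest overage rate.
interface CdnContract {
  name: string;
  healthy: boolean;         // from health checks and real user measurements
  committedGB: number;      // pre-paid traffic for this billing period
  deliveredGB: number;      // traffic already delivered this period
  overageCostPerGB: number; // price once the commit is exhausted
}

function chooseCdn(contracts: CdnContract[]): CdnContract | undefined {
  const healthy = contracts.filter((c) => c.healthy);

  // Prefer any CDN that still has pre-paid commit remaining,
  // picking the one furthest from its commit so all commits get met.
  const underCommit = healthy
    .filter((c) => c.deliveredGB < c.committedGB)
    .sort((a, b) => (b.committedGB - b.deliveredGB) - (a.committedGB - a.deliveredGB));
  if (underCommit.length > 0) return underCommit[0];

  // Every commit is met: minimize the overage bill.
  return healthy.sort((a, b) => a.overageCostPerGB - b.overageCostPerGB)[0];
}
```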

Call it Video Voodoo: it shouldn’t be possible to improve service reliability and reduce the cost of delivery…and yet, there it is. It turns out that eliminating a single point of failure introduces multiple points of efficiency. And, indeed, we’ve seen great results for companies that already have multiple CDN providers: simply avoiding overages on each CDN until all the commits are met can deliver returns that fundamentally change the economics of a streaming video service.

And changing the economics of streaming is fundamental to the next round of evolution in the industry. Netflix, the 800-pound gorilla, has turned over more than $20 billion in revenue over the last three years, and generated less than half a billion in net margin, a 5% rate; Hulu (privately and closely held) is rumored to have racked up $1.8B in losses so far and still to be generating red ink on some $2B in revenues. The bottom line is that delivering streaming video is expensive, for any number of reasons. Any engine that can measurably, predictably, and reliably eliminate cost is not just intriguing for streaming publishers – it is mandatory to at least explore.

Amazon Outage: The Aftermath

Amazon AWS S3 Storage Service had a major, widely reported, multi-hour outage yesterday in its US-East-1 data center. The S3 service in this particular data center was one of the very first services Amazon launched when it introduced cloud computing to the world more than 10 years ago. It has grown exponentially since then, storing over a trillion objects and servicing a million requests per second in support of thousands of web properties (this article alone lists over 100 well-known properties that were impacted by this outage).

Amazon has today published a description of what happened. The summary is that this was caused by human error: one operator, following a published runbook procedure, mistyped a command parameter, setting a sequence of failure events in motion. The outage started at 9:37 am PST. A nearly complete S3 service outage lasted more than three hours, and full recovery of other S3-dependent AWS services took several hours more.

A few months ago, Dyn taught the industry that single-sourcing your authoritative DNS creates the risk the military describes as “two is one, one is none.” This S3 incident underscores the same lesson for object storage. No service tier is immune. If a website, content, service, or application is important, redundant alternative capability at every layer is essential, along with the appropriate capabilities to monitor and manage that redundancy. After all, fail-over capacity is only as good as the system’s ability to detect the need to fail over, and then actually do so. This has been at the heart of Cedexis’ vision since the beginning, and as we continue to expand our focus in streaming/video content and application delivery, this will continue to be an important and valuable theme as we seek to improve the Internet experience of every user around the world.

Even the very best, most experienced services can fail. And with increasing deconstruction of service-oriented architectures, the deeply nested dependencies between services may not always be apparent. (In this case, for example, the AWS status website had an underlying dependency on S3 and thus incorrectly reported the service at 100% health during most of the outage.)

We are dedicated to delivering data-driven, intelligent traffic management for redundant infrastructure of any type. Incidents like this should continue to remind the digital world that redundancy, automated failover, and a focus on the customer experience are fundamental to the task of delivering on the continued promise of the Internet.

How To Deliver Content for Free!

OK, fine, not for free per se, but using bandwidth that you’ve already paid for.

Now, the uninitiated might ask what’s the big deal – isn’t bandwidth essentially free at this point? And they’d have a point – the cost per Gigabyte of traffic moved across the Internet has dropped like a rock, consistently, for as long as anyone can remember. In fact, Dan Rayburn reported in 2016 seeing prices as low as ¼ of a penny per gigabyte. Sounds like a negligible cost, right?

As it turns out, no. As time has passed, the amount of traffic passing through the Internet has grown. This is particularly true for those delivering streaming video: consumers now turn up their noses at sub-broadcast-quality resolutions, and expect at least an HD stream. To put this into context, moving from HD as a standard to 4K (which keeps threatening to take over) would result in the amount of traffic quadrupling. So while CDN prices per gigabyte might drop 25% or so each year, a publisher delivering 400% of the traffic is still looking at an increasingly large delivery bill: four times the traffic at three-quarters of the unit price still works out to roughly three times the cost.

It’s also worth pointing out that the cost of delivery relative to delivering video through a traditional network, such as cable or satellite, is surprisingly high. An analysis by Redshift for the BBC clearly identifies the likely reality that, regardless of the ongoing reduction in per-terabyte pricing, “IP service development spend is likely to increase as [the BBC] faces pressure to innovate”, meaning that online viewers will be consuming more than their fair share of the pie.

Take back control of your content…and your costs

So, the price of delivery is out of alignment with viewership, and is increasing in practical terms. What’s a streaming video provider to do?

Allow us to introduce Varnish Extend, a solution combining the powerful Varnish caching engine, which already helps deliver 25% of the world’s websites, with Openmix, the real-time, user-driven predictive load balancing system that uses billions of user measurements a day to direct traffic along the best pathway.

Cedexis and Varnish have both found that the move to the cloud left a lot of broadcasters and OTT providers with unused bandwidth available on premises. By making it easy to transform an existing data center into a private CDN Point of Presence (PoP), Varnish Extend empowers companies to make the most of all the bandwidth they have already paid for, by setting up Varnish nodes on premises, or on cloud instances that offer lower operational costs than CDN bandwidth.

This is especially valuable for broadcasters/service providers whose service is limited to one country: the global coverage of a CDN may be overkill, when the same quality of experience can be delivered by simply establishing POPs in strategic locations in-country.

Unlike committing to an all-CDN environment, using a private CDN infrastructure like Varnish Extend supports scaling to meet business needs – costs are based on server instances and decisions, not on the amount of traffic delivered. So as consumer demands grow, pushing for greater quality, the additional traffic doesn’t push delivery costs over the edge of sanity.

A global server load balancer like Openmix automatically checks available bandwidth on each Varnish node as well as each CDN, along with each platform’s performance in real-time. Openmix also uses information from the Radar real user measurement community to understand the state of the Internet worldwide and make smart routing decisions.

Your own private CDN – in a matter of hours

Understanding the health of both the private CDN and the broader Internet makes it a snap to dynamically switch end-users between Varnish nodes and CDNs, ensuring that cost containment doesn’t come at the expense of customer experience – simply establish a baseline of acceptable quality, then allow Openmix to direct traffic to the most cost-effective route that will still deliver on quality.
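The decision itself can be pictured as a two-step filter: enforce the quality floor, then pick the cheapest remaining route. The node shape and thresholds below are illustrative assumptions, not the Openmix application API:

```typescript
// Sketch of "quality floor first, then cost" routing between private Varnish
// nodes and commercial CDNs. All names and numbers are illustrative.
interface DeliveryNode {
  name: string;                   // e.g. "varnish-paris" or "cdn-a"
  availableBandwidthMbps: number; // headroom reported by the node or CDN
  p95LatencyMs: number;           // from real user measurements
  costPerGB: number;              // near zero for bandwidth already paid for on premises
}

function routeRequest(
  nodes: DeliveryNode[],
  estimatedMbps: number,
  maxLatencyMs: number
): DeliveryNode | undefined {
  return nodes
    .filter((n) => n.availableBandwidthMbps >= estimatedMbps) // enough capacity left
    .filter((n) => n.p95LatencyMs <= maxLatencyMs)            // meets the quality baseline
    .sort((a, b) => a.costPerGB - b.costPerGB)[0];            // cheapest acceptable route wins
}
```

With on-premises bandwidth priced close to zero, the private PoP naturally wins whenever it has capacity and meets the quality floor, and traffic overflows to CDNs only when needed.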

Implementing Varnish Extend is surprisingly simple (some customers have implemented their private CDN in as little as four hours):

  1. Deploy Varnish Plus nodes within an existing data centre or on a public cloud.
  2. Configure Cedexis Openmix to leverage these nodes as well as existing CDNs.
  3. Result: end-users are automatically routed to the best delivery node based on performance, costs, etc.

Learn in detail how to implement Varnish Extend

Sign up for Varnish Software – Cedexis Summit in NYC

References/Recommended Reading: