
Battle Plan 2018: Illuminate Blind Spots and Unknown Unknowns

By Josh Gray, Chief Architect at Cedexis

– Originally published in DevOps Digest – 

There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know.

Bonus points if you know who came up with that tongue twister. He was talking about terrorists, but we’re here to discuss a different sort of war — the Battle for Bandwidth. These days, application and content delivery requires special tactics, an integrated strategy, and well-sourced intelligence. And the unknown unknowns are the true enemy because they inevitably lead to outages, slowdowns, and mutinous customers.

In early November, a major outage caused by a minor configuration error (a route leak, to be exact) at global backbone provider Level 3 created widespread connection issues on both U.S. coasts. Comcast, Verizon, Cox, and Vonage customers were particularly affected.

One small error can have mighty ripple effects, and the cause isn't always apparent to network admins and enterprise customers. The time it took to return the Downdetector maps from angry red to mellow yellow could have been shortened by looking at Real User Measurements (crowdsourced telemetry), realizing the problem wasn't confined to a single site or ISP, and following a logic tree to find the culprit.

With Global Server Load Balancing, your delivery network is smart enough to see the barricade around the corner and switch routes on the fly — saving the day (and making the other guys look a bit dazed and confused).

Blind spots can hide more than outages. Your crack team of DevOps commandos can't run successful release missions if they can't see what's really going on in the field. You don't want them dashing around in the dark without a robust tactical plan based on every parameter you can assess. When your various data streams turn unknown unknowns into known knowns, you can put those known knowns to work.

Continuous deployment isn't for the faint of heart – you'd better have your Kevlar and your night-vision goggles. Companies like Salesforce release updates dozens of times a day, but even a handful a week requires a careful strategy. You can use RUM to test an update by initially limiting the rollout to a single data center, then watching for 4xx/5xx errors. If you see problems, compare the experience of users on the non-updated version elsewhere with the experience of users hitting the updated version at the test data center to deduce the source of the trouble.
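To make that concrete, here is a minimal sketch of what such a canary check might look like. The RumSample shape, the field names, and the 2x threshold are all illustrative, not any particular product's API:

```typescript
// Hypothetical canary check: compare the HTTP error rate (4xx/5xx) reported
// by RUM beacons at the canary data center against the rest of the fleet.
// The RumSample shape and the 2x threshold are illustrative, not a real API.
interface RumSample {
  dataCenter: string;
  httpStatus: number;
}

function errorRate(samples: RumSample[]): number {
  if (samples.length === 0) return 0;
  const errors = samples.filter((s) => s.httpStatus >= 400).length;
  return errors / samples.length;
}

// Flag the canary if its error rate exceeds twice the baseline's.
function canaryLooksHealthy(samples: RumSample[], canary: string, maxRatio = 2.0): boolean {
  const canarySamples = samples.filter((s) => s.dataCenter === canary);
  const baseline = samples.filter((s) => s.dataCenter !== canary);
  // Floor the baseline rate so a perfectly clean fleet still yields a usable threshold.
  return errorRate(canarySamples) <= maxRatio * Math.max(errorRate(baseline), 0.001);
}
```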

One of the biggest unknown unknowns in traffic management is what’s going on in places you haven’t served recently. If a story about Boise causes traffic to spike there, and that’s not normally an audience hotspot for your service, chances are you won’t have any measurements of your own to go on. Community intelligence turns these dark corners of your empire into known knowns through automated crowdsourcing of quality of experience metrics. When combined with real-time server health checks and third-party data streams, you have a powerful ability to make efficient, economical routing decisions, even for destinations you don’t have any history with.
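Here is a rough sketch of the kind of routing decision this enables, with data shapes invented for illustration: prefer your own RUM data for a network/geo pair, fall back to community measurements when you have none, and never send traffic to an endpoint failing its health checks.

```typescript
// Illustrative routing decision for a (network, geo) pair: prefer your own
// recent RUM data, fall back to community measurements when you have none,
// and never route to an endpoint that fails its health check.
// The EndpointStats shape is invented for this sketch.
interface EndpointStats {
  endpoint: string;          // e.g. a CDN or data center
  medianLatencyMs: number;   // from RUM, first-party or community
  healthy: boolean;          // from real-time server health checks
}

function pickEndpoint(
  ownData: EndpointStats[] | undefined,
  communityData: EndpointStats[]
): string | undefined {
  // Lean on the community when first-party coverage is missing (e.g. Boise).
  const source = ownData && ownData.length > 0 ? ownData : communityData;
  const candidates = source
    .filter((e) => e.healthy)
    .sort((a, b) => a.medianLatencyMs - b.medianLatencyMs);
  return candidates[0]?.endpoint; // fastest healthy option, if any
}
```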

The more insight and intelligence you can bring to bear on accelerating the acquisition of known knowns, the better for your business and your bottom line. In the New Year, we should be less accepting of blind spots. They're expensive – they cost us time, money, and customers. Nobody has enough human problem solvers around to keep putting out fires and rigging up one-off workarounds. Our best talent should be working on the next release, the next big idea, or the next major dilemma (Net Neutrality game changers, anyone?) – not floundering around trying to guess what's holding up traffic. You can't control what you can't see, and on the hybrid IT battlefield, control keeps you on top of the hill. We're pretty sure Donald Rumsfeld would agree.

To learn more:

Improving Website Performance using Global Traffic Management Requires Sufficient Real User Measurements

At Cedexis, we often talk about our Radar community and the vast number of Real User Measurements (RUM) we take – billions every day. Is that enough? Too many? How many measurements are sufficient? These are valid questions, and as with many questions, the answer is "it depends". It depends on what you are doing with the RUM data. Many companies that deploy RUM use it to analyze a website's performance using Navigation Timing and Resource Timing; Cedexis does this, too, with its Impact product. That type of analysis may not require billions of measurements a day.
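For the curious, here is a minimal sketch of that style of page-level analysis. The timing attributes come from the W3C Navigation Timing Level 2 API; the beacon endpoint is hypothetical:

```typescript
// Minimal page-timing beacon built on the Navigation Timing Level 2 API.
// The "/rum-collect" endpoint is a hypothetical collector, not a real URL.
window.addEventListener("load", () => {
  // loadEventEnd is only populated after the load handler returns,
  // so defer the read by one tick.
  setTimeout(() => {
    const [nav] = performance.getEntriesByType(
      "navigation"
    ) as PerformanceNavigationTiming[];
    if (!nav) return;
    const timings = {
      dns: nav.domainLookupEnd - nav.domainLookupStart,
      tcp: nav.connectEnd - nav.connectStart,
      ttfb: nav.responseStart - nav.requestStart,
      pageLoad: nav.loadEventEnd - nav.startTime,
    };
    // Ship the numbers home without blocking the page.
    navigator.sendBeacon("/rum-collect", JSON.stringify(timings));
  }, 0);
});
```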

However, making RUM data actionable by using it for Global Traffic Management is another matter. For that, it is incredibly important to have data from as many of the networks that make up the Internet as possible. If the objective is a strong representative sample of the "last mile", it turns out you need a pretty large number. Let's take a closer look at how many.

The Internet is a network of networks. Roughly 51,000 distinct networks make up what we call the Internet today. These networks are named, or at least numbered, by a designator called an ASN, or Autonomous System Number. Each ASN really represents a set of unified routing policies. As our friend Wikipedia states:

“Within the Internet, an autonomous system (AS) is a collection of connected Internet Protocol (IP) routing prefixes under the control of one or more network operators on behalf of a single administrative entity or domain that presents a common, clearly defined routing policy to the Internet.”

Every ISP has one or more ASNs – usually more. There were 51,468 ASNs in the world as of August 2015. How does that look when you spread whatever number of RUM measurements you can obtain across all of them? A perfect monitoring solution should tell you, for each network, whether your users are experiencing something bad – for instance, high latency from the network they are using.

If you were able to spread the measurements out to cover each network evenly (which you cannot), you would get something like the graph below.

[Figure: expected measurements per network per day at various daily RUM volumes]

In the left-hand column, you see the number of RUM measurements you get per day; the labels on the bars show the number of measurements per network you can expect.

So, if you distributed your RUM measurements over all the networks in the world and you had only 100,000 page visits a day, you would get roughly two measurements per network per day. This is abysmal from a monitoring perspective.
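The arithmetic is easy to check. Here is a quick back-of-the-envelope sketch, assuming the idealized even spread discussed above:

```typescript
// Back-of-the-envelope: measurements per network per day, assuming a
// perfectly even spread over every ASN (the idealization noted above).
const TOTAL_ASNS = 51_468; // ASN count cited in this post (August 2015)

for (const perDay of [100_000, 1_000_000, 50_000_000, 1_000_000_000, 4_000_000_000]) {
  const perNetwork = perDay / TOTAL_ASNS;
  const minutesBetween = (24 * 60) / perNetwork;
  console.log(
    `${perDay.toLocaleString()}/day -> ${perNetwork.toFixed(1)} per network/day, ` +
      `one every ${minutesBetween.toFixed(0)} minutes`
  );
}
// First line of output: "100,000/day -> 1.9 per network/day, one every 741 minutes"
```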

With so many ASNs, it's easy to see why synthetic measurements are hopeless here. Even if you had 200 locations for your synthetic measurements and three networks per location, that would only give you 600 ASN/geo pairings. Cedexis dynamically monitors over seven million ASN/geo pairings every day.

One issue, however, is that RUM measurements are not distributed equally. We have been assuming that you can spread your measurements evenly over all 51k networks, but that's not how RUM works. RUM takes measurements from wherever your users actually are, and it turns out that any given site has a much more limited view of the ASNs we have been discussing. To understand this better, let's look at a real example.

Assume you have a site that generates over 130 million page views a day. The data below is from a Cedexis client and was collected over a 24-hour period in October 2015.

134 million is a pretty good number, and you’re a smart technologist who implemented your own RUM tag – you are tracking information about your users, so you can improve the site. You also use your RUM to monitor your site for availability. Your site has significant users in Europe and North and South America, so you’re only really tracking the RUM data from those locations for now. So, what is the spread of where your measurements come from?

Of the roughly 51k ASNs in the world, your site can expect measurements from approximately 1,800 different networks on any given day (specifically 1,810 on this day for this site).

[Figure: ISPs and ASNs contributing measurements on this day; circle size indicates measurements per minute]

The diagram above breaks down the ISPs and ASNs that participated in the monitoring on this day – the size of each circle shows the number of measurements per minute. At the high end are Comcast and Orange S.A., with 4,457 and 6,377 measurements per minute, respectively. The bottom 108 networks each produced fewer than one measurement every two minutes. Again, that's with 134 million page views a day.

The disparity between the top measurement-producing networks and the bottom is stark. As you can see in the table below, almost 30% of your measurements come from only 10 networks, while the bottom 1,000 networks produce just 2% of the measurements.

[Table: distribution of measurements by network – the top 10 networks account for almost 30%, the bottom 1,000 for 2%]

What is the moral here? RUM obtains measurements from the networks where the people are, and far fewer from networks where they aren't. And every site has a different demographic, so the mix of networks users arrive from for Site A is not necessarily the same as for Site B. Any single site that deploys a RUM tag will not get enough measurements from enough networks to make an intelligent decision about how to route traffic. It simply will not have enough data.

This is the value of the Cedexis Radar community. By combining measurements from many sites (over 800 and rising), the Radar community builds a complete map of how ALL the ASN/geo pairings are performing – over seven million ASN/geo pairings a day – and our clients can use this shared data for route optimization with Cedexis Openmix. This is what we mean when we say "Making the Internet Better for Everyone, By Everyone". The community measurements allow every individual site (which may only see 1,800 of the 51k networks) to effectively see them all!
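Conceptually, the community effect is just a union across sites: each site contributes the ASN/geo pairs its own users happen to arrive from, and the combined set approaches full coverage. A toy sketch, with data shapes invented for illustration:

```typescript
// Toy model of the community effect: each site only sees the ASN/geo pairs
// its own users arrive from, but the union across many sites approaches
// full coverage. The data shapes are invented for illustration.
type AsnGeo = string; // e.g. "AS701|US"

function communityCoverage(perSite: Map<string, Set<AsnGeo>>): Set<AsnGeo> {
  const combined = new Set<AsnGeo>();
  for (const pairs of perSite.values()) {
    for (const pair of pairs) combined.add(pair);
  }
  return combined; // one site sees ~1,800 networks; 800+ sites together see far more
}
```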

The Radar community is free and open to anyone. If you are not already a member, we urge you to sign up for a free account and see the most complete view of the Internet available today. While you are at it, check out Radar Live, our real-time view of Internet traffic outages.

 

RUM never sleeps – why this is important for your video streaming

I stole that line, "RUM never sleeps". I stole it from one of the best writers in the field of web performance: Tammy Everts, or @tameverts for the Twitter folks. She gave me the nod to use it. Full disclosure is required, though, and I am fulfilling that requirement: Tammy works for SOASTA, and they also use RUM – for a different, albeit important, purpose. The blog post where she coined the line is here, and it's definitely worth your time. I want you to stay and read this post first, however, so I made that link open in another tab. Pretty sneaky of me, huh?

So why did I steal that line? Because it is such a great way to describe what Cedexis RUM provides. We measure CDNs and clouds from over 48k networks. Cedexis takes billions of measurements of these public infrastructure providers every day, so that we can route around problems we see in real time. Not a billion a month. Not a billion a year. Four to six billion a day. That is a big number.


Why is it important to have a big number? Why does size matter here? Simply put, it's in the math. With 195 countries and over 48k networks, ensuring coverage for your retail website, your videos, or your gaming downloads requires a very large number of measurements. We have performed a couple of hundred million measurements in the US, over thousands of networks, just while I have been writing this post! Pretty cool. We benchmark all the clouds and CDNs in real time. The only way to do this is to have a firehose of RUM!

Let’s look at why this is important. Taking video as an example – let’s think through what happens with billions of RUM measurements vs. a synthetic solution (or even a RUM solution with fewer measurements).

For our thought experiment, let's take a hypothetical video streaming company (or OVP) with 1,500 customers streaming VOD and live content worldwide. Cedexis regularly sees traffic on over 48k networks, but for the sake of argument let's round that down to the 40k most important networks and ISPs in the world. Like any good streaming company, they use three or four CDNs to ensure that clients in various regions get the best performance, and to ensure 100% availability in case of a micro-outage (or worse). They use monitoring data to route customers to the best-performing CDN with the best availability.

A perfect monitoring solution should tell you, for each network, whether your users are experiencing something bad – for instance, high latency to one of the CDNs – and be able to choose a better CDN for your customers. The VP of Operations at this video company hires a synthetic monitoring firm and decides that a million measurements a day should be adequate. After all, a million is a lot! It's also quite expensive to run a million synthetic measurements a day, but let's leave that aside for the moment. Re-buffering and slow video start times must be avoided to compete in this space; that's table stakes. Performance continues to be a differentiator, and the smart VP of Operations knows this. It's worth the money.

An increase in startup delay beyond two seconds causes viewers to abandon the video.

The problem is that even if you could spread the measurements out to cover each network evenly (which you cannot – I'll cover that in another post), you get something like the graph below.

[Figure: 1 million synthetic measurements a day spread evenly over 40k networks yields about 25 measurements per network per day]

The expensive one-million-per-day synthetic monitor gives you a measly 25 measurements per day per network – basically one per hour! If one of your CDNs starts to experience a micro-outage across one or more ISPs (which we have documented here and here), you may not even KNOW the CDN is having problems for 45-50 minutes. By then, the customers who had been directed to that failing CDN/ISP pairing would be long gone.
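The detection-lag math is worth spelling out. A quick sketch under the thought experiment's assumptions (even spread, alarm raised on a single failed probe):

```typescript
// Detection lag under the thought experiment's assumptions: one million
// synthetic checks per day spread evenly over 40,000 networks, alarming
// on a single failed probe (generous assumptions in both directions).
const checksPerDay = 1_000_000;
const networks = 40_000;

const perNetworkPerDay = checksPerDay / networks;          // 25 checks/day
const minutesBetweenChecks = (24 * 60) / perNetworkPerDay; // ~58 minutes

// An outage starts mid-interval on average, so the expected wait for the
// next probe on the affected network is about half the interval.
const expectedLagMinutes = minutesBetweenChecks / 2;       // ~29 minutes
console.log({ perNetworkPerDay, minutesBetweenChecks, expectedLagMinutes });
```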

Let's say you wise up and realize that you must use RUM to get full ISP (last-mile) coverage. Even then, there are volume thresholds you must reach to get the kind of coverage you need.

It takes 4-6 billion measurements a day for a video website with global distribution to have the coverage needed to ensure 100% availability and the best possible performance. The hypothetical company in the thought experiment above would have been causing significant buffering, slower video start times, and lower bit-rate streams. The evidence for this can be found in this blog. Even at a billion measurements a day, you do not get enough coverage to adequately blanket the ISPs you need to cover. Seconds matter. At 50 million measurements a day, some ISPs may only get a probe every two minutes or so. The Internet (and video in particular) works in seconds. I will leave you with a key quote.

81% of web users stop watching a video when it buffers.

As you can see, seconds matter. Your monitoring solution must be that responsive if you are to route away from failing public infrastructure (or an ISP peering point, or any of the 100-plus things that can go wrong).

RUM vs Synthetic – why people matter


“Cedexis Leads the Pack”.

It is nice to hear – especially when it comes from a prestigious analyst firm. If you have not seen the report, it's worth the read. Cedexis was recognized for innovations in the monitoring space. While we are clearly honored, it's somewhat ironic, because we give our Real User Monitoring (RUM) away for FREE to all Radar community members. What the report clearly shows is why RUM is significantly better than synthetic monitoring for certain jobs. This is not to say that synthetic monitoring does not have a place, but for real-time traffic routing, RUM is the best solution. Let me give you an example of why this is true.

As an experiment, let's take six global CDNs and point synthetic monitoring agents at them. The six CDNs are Akamai, Limelight, Level3, Edgecast, ChinaCache, and Bitgravity. I am not going to list their results by name, as we are not trying to call anyone out; I mention them here just so the reader knows we are talking about true global CDNs. I am also not going to name the synthetic monitoring company, but suffice it to say they are a major player in the space.

We point 88 agents, located all over the world, at the small test object we benchmark on each of these six CDNs. Now we can compare the synthetic agents' measurements to the Cedexis Radar measurements for the same network in the same country, each downloading the same object. The only differences are the volume of measurements and the location of the agent. A synthetic agent measures about once every 5 minutes, whereas Radar measurements can exceed 100 per second from a single AS. And, of course, the synthetic agents sit in big data centers, while Radar runs in real users' browsers.

One more point on the methodology: since we are focused on HTTP response, we took DNS resolution time out and concentrated on pure wire time – that is, first byte plus connect time. DNS resolution and TCP setup happen once per domain or TCP stream, whereas response time impacts every object on the page.
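For reference, here is a sketch of how that wire-time metric could be derived in the browser from the Resource Timing API. The test-object URL is a placeholder, and cross-origin objects need a Timing-Allow-Origin response header for these attributes to be populated:

```typescript
// Sketch of the wire-time metric described above, derived in the browser
// from the Resource Timing API for a fetched test object. The URL is a
// placeholder; cross-origin objects need a Timing-Allow-Origin response
// header or these attributes come back as zero.
function wireTimeMs(testObjectUrl: string): number | undefined {
  const entry = performance
    .getEntriesByName(testObjectUrl)
    .find((e): e is PerformanceResourceTiming => e.entryType === "resource");
  if (!entry) return undefined;
  const connect = entry.connectEnd - entry.connectStart;      // TCP connect
  const firstByte = entry.responseStart - entry.requestStart; // request sent -> first byte
  return connect + firstByte; // DNS lookup time deliberately excluded
}
```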

We will look at a single network in the US: ASN 701, "UUNET – MCI Communications Services Inc. d/b/a Verizon Business" (USA). This is a backbone network that covers major metropolitan areas all over the US. Cedexis Radar has received billions of measurements from browsers sitting on this network within the US.

[Table: synthetic vs. Radar RUM response times and rankings for the six CDNs on ASN 701]

Clearly, CDNs are much faster inside a big data center than they are in our homes! More interesting are the changes in rank: notice how CDN1 moves from #5 to #1 under RUM! The scale also changes dramatically: the synthetic agents' data would have you believe CDN6 is nearly 6x slower than the fastest CDNs, yet measured from the last mile it is only about 20% slower.

So if you had these six CDNs in your multi-CDN federation and were doing latency-based load balancing on these synthetic measurements, the people on this network would be poorly served. CDN1 would get very little (if any) of the traffic from this network, even though it is actually the fastest CDN for these users. RUM matters because that's where the people are! Measuring from the data center obscures this important point.
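To illustrate, here is a minimal latency-based balancer over per-network measurements. The sample format is invented, but the logic is the point: feed it the data-center synthetic numbers and it sends this network's users to the wrong CDN; feed it last-mile RUM and CDN1 wins.

```typescript
// Minimal latency-based balancer: pick the CDN with the lowest median
// latency for the user's network. The sample format is invented; the
// answer depends entirely on where the samples come from.
function median(values: number[]): number {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function fastestCdn(latenciesByCdn: Map<string, number[]>): string | undefined {
  let best: { cdn: string; ms: number } | undefined;
  for (const [cdn, samples] of latenciesByCdn) {
    if (samples.length === 0) continue;
    const ms = median(samples);
    if (!best || ms < best.ms) best = { cdn, ms };
  }
  return best?.cdn; // data-center samples and last-mile RUM disagree here
}
```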

Synthetic agents can do many wonderful things, but measuring actual web performance (for actual, real people) is not among them. Performance isn't about being fastest on a specific backbone network from a data center; it's about being fastest on the networks that serve your subscribers. The actual people.

RUM-based monitoring provides a much truer view of the actual performance of a web property than synthetic, agent-based monitoring does. We urge you to deploy our Radar tag and see for yourself who is performing best right now. Our real-time RUM measurements provide the best possible view into how global CDNs compare with each other in every region of the world.