Why Micro-Outages are the speed bumps of the Internet

In previous posts I have written about the concept of “Micro-Outages”. These outages:

  1. Usually last between 5 and 20 minutes.
  2. Are sometimes limited to a subset of the 40k+ ASNs/ISPs that exist in the world.
  3. Typically do not violate an SLA (mostly because of their short duration), or if they do, only minimally.
  4. Sometimes go undetected by synthetic monitoring.

There are many reasons for micro-outages – everything from a poorly executed planned maintenance event, to DNS misconfigurations, to fat-fingered routes inserted into BGP by sleepy sysadmins, to fiber cuts – and these outages can have a serious impact on performance. “Performance, you say? Don’t you mean availability?” Glad you asked! In today’s note we want to show you the impact on performance when availability is reduced.

Availability is simply defined as the ability to connect to a remote device. The great thing about the Internet is that it was built from the ground up on a “best effort” model – so when a connection attempt is made from point A to point B and the connection fails (it times out, for instance), point A will typically retry. When we measure availability it’s not just on or off – rather, it is characterized as a connection attempt succeeding X times out of Y tries. So generally we talk about it in terms of:

  • 10 out of 10 connection attempts – 100% availability. Good availability.
  • 7 out of 10 connection attempts – 70% availability. Moderate availability.
  • 5 or fewer out of 10 connection attempts – 50% availability or less. Poor availability.
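
As a rough sketch of how you might score an availability sample like this (the helper names and exact cut-offs here are mine, purely for illustration), in Python:

```python
def availability(successes: int, attempts: int = 10) -> float:
    """Fraction of connection attempts that succeeded, e.g. 7 of 10 -> 0.70."""
    return successes / attempts

def bucket(avail: float) -> str:
    """Rough buckets matching the list above (cut-offs are illustrative)."""
    if avail >= 0.9:
        return "good"
    if avail > 0.5:
        return "moderate"
    return "poor"

print(bucket(availability(7)))   # 7 of 10 attempts -> "moderate"
```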

When we measure a specific platform (say, a global CDN or a regional cloud) we can see when it has micro-outages, and we can report on them as we did in the previous posts:

  • Green – Approximately 100% availability
  • Yellow – Approximately 75% availability
  • Red – Less than 40% availability
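
If you wanted to reproduce that coloring yourself, a minimal mapping might look like the sketch below – the exact thresholds are my reading of the approximate buckets above, not a spec:

```python
def cell_color(availability_pct: float) -> str:
    """Map an availability percentage to the colors used in these charts."""
    if availability_pct >= 95:   # roughly 100%
        return "green"
    if availability_pct >= 40:   # degraded but not out
        return "yellow"
    return "red"                 # less than 40%
```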

For this blog we are taking a different, broader approach – we are going to graph availability for more than 100 platforms and map how good (or bad) availability dictates good (or bad) performance.

So let’s take a 24-hour period. That’s the horizontal axis. Next let’s take the 50 top-talker ISPs. That’s the vertical axis. So what you see below is 24 hours for those top 50 ISPs. Every block is a minute. For every hour we are measuring the ability of that network (ISP) to make connections to over 100 platforms. Those platforms are clouds and content delivery networks (CDNs).
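
Conceptually the chart is just a 2-D grid of ISPs by minutes, where each cell holds the share of successful connections across all monitored platforms in that time slice. A minimal sketch of that structure (the `measurements` shape and field names are assumptions for illustration, not our actual schema):

```python
from collections import defaultdict

# measurements: iterable of (isp, minute_of_day, platform, connected) tuples -
# a hypothetical flattened form of the per-attempt data described above.
def build_grid(measurements):
    """Return {isp: [availability per minute, 1440 entries]} for one 24-hour day."""
    success = defaultdict(lambda: [0] * 1440)
    total = defaultdict(lambda: [0] * 1440)
    for isp, minute, platform, connected in measurements:
        total[isp][minute] += 1
        if connected:
            success[isp][minute] += 1
    return {
        isp: [s / t if t else None for s, t in zip(success[isp], total[isp])]
        for isp in total
    }
```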

So when a specific ISP can connect to every platform 100% of the time, we can call that a perfect day – or a perfect game, in baseball nomenclature – and it’s represented as a green block. Likewise, if a network has poor connectivity to the majority of platforms, its color is red. For instance, looking at the next picture you can see two networks with quite different days:

Availability issues

Arrow (1) points at a network that had perfect availability for the entire day – the so-called perfect game. Arrow (2) points at a network that struggled with availability starting in the late morning and finally got back to green late in the evening.

In the middle you see many networks with typical diurnal flows of congestion – meaning people wake up and start using the Internet, that network gets congested, and connection attempts go from 10 out of 10 to 7 out of 10 or even lower for short periods. These are marked by the yellowish-to-orange blocks. Network managers are always making decisions based on CAPEX and customer requirements. Many network admins will not upgrade hardware and capacity until there are distinct problems, and consumer connection retries are, in many cases, not a strong enough reason.

So now let’s look at the same time period – but rather than measure connection attempts, let’s measure latency. Again, over the exact same time period and the exact same ISPs: what latency was experienced, as captured by real user measurement (RUM) data.

So as you can see, on the networks where users were experiencing high availability, latency was reasonably low (mostly). And where there was poor connectivity or availability, there was always poor latency. In fact, if you look closely, the rule seems to be something like: poor availability always predicts poor latency, and good availability mostly predicts good latency.
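
One simple way to sanity-check that rule (a sketch only – `rows` here is a hypothetical set of per-ISP, per-hour pairs of availability and median RUM latency) is to group the latency numbers by availability bucket and compare medians:

```python
from statistics import median

def latency_by_availability(rows):
    """rows: iterable of (availability_pct, latency_ms) pairs.
    Returns the median latency seen at poor, moderate and good availability."""
    buckets = {"poor": [], "moderate": [], "good": []}
    for avail_pct, latency_ms in rows:
        if avail_pct < 40:
            buckets["poor"].append(latency_ms)
        elif avail_pct < 95:
            buckets["moderate"].append(latency_ms)
        else:
            buckets["good"].append(latency_ms)
    return {k: median(v) if v else None for k, v in buckets.items()}
```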

An example of the “mostly” in that latter claim can be seen by taking a look at arrow (3) in the diagram below:

Here is an example of a network that has decent connectivity (availability) but is REALLY slow. The latency on this network is horrible – but if you look back at the previous diagrams you will see pretty good availability. But the exception proves the rule, as they say. This is an exception – for the most part, poor connectivity predicts poor latency and good connectivity predicts good latency.

Hope this is interesting – I find it to be. Either way, I would love to hear some feedback! We have a tremendous amount of data and would love to hear about what interests you. With over 2 billion RUM measurements of every major infrastructure player on the Internet, I am sure we can find some stories that would be of interest.

Thanks

-p