Posts

Improving Your Web Performance is Not a Game

Understanding how to improve the performance of your web app or OTT stream within specific geographies or on specific networks is no small matter. There are many variables that can impact that performance. But of all the things you can do, improving reachability to your CDN(s) is perhaps the most important. Seventy-five percent of the latency in a page load occurs outside of the data center. In other words, if you want to improve your performance, figure out how to reduce the latency outside the cloud, data center or Content Delivery Network (CDN) where your app lives.

Multi-CDN Done Right

The good news is that there are actually ways to do this. But to understand them you have to understand how the internet is architected. So let me digress for a second.

The Last Mile and The Middle Mile

See the picture above. It represents the internet. There are more than 50,000 networks that make up the internet. Some of them are end-user networks (or eyeball networks) and many of them are middle-mile and Tier 1 networks that specialize in long haul. How they are connected to one another is one of the most important things you should understand about the internet. These connections are called peering relationships, and they can be paid or unpaid depending on the relationship between the two companies. The number of networks crossed to get to a destination is referred to as the hop count, and these hops are the basic building blocks that Border Gateway Protocol (BGP) uses to select paths through the Internet.

As you can see in the picture above, a user trying to reach the lower cloud instance from the ISP in the upper left would cross four hops, whereas reaching it from the ISP in the lower left would take only three. But that does not mean that the lower ISP has a faster route. Because of outages between networks, lack of deployed capacity or congestion, the users of the lower ISP might actually find it faster to traverse the eight-hop path to the upper cloud because latency is lower via that route.
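To make this concrete, here is a minimal sketch (the paths and numbers are hypothetical, not taken from the picture) contrasting a BGP-style choice, which prefers the fewest hops, with a latency-based choice, which prefers whatever path real users actually experience as fastest:

```typescript
// Hypothetical candidate paths from an end-user ISP to a cloud instance.
interface Path {
  name: string;
  asHops: number;    // number of networks (ASes) crossed
  latencyMs: number; // round-trip latency observed by real users
}

const paths: Path[] = [
  { name: "short path via congested peering", asHops: 3, latencyMs: 180 },
  { name: "long path via healthy peering", asHops: 8, latencyMs: 95 },
];

// BGP-style choice: fewest hops wins, regardless of measured latency.
const byHops = [...paths].sort((a, b) => a.asHops - b.asHops)[0];

// Latency-based choice: whatever real-user measurements say is fastest.
const byLatency = [...paths].sort((a, b) => a.latencyMs - b.latencyMs)[0];

console.log(`Fewest hops picks:      ${byHops.name} (${byHops.latencyMs} ms)`);
console.log(`Measured latency picks: ${byLatency.name} (${byLatency.latencyMs} ms)`);
```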

Why is the last mile important? Because it is precisely these ISPs and networks that are often the best places to look to improve performance – not always by just increasing bandwidth from that provider, but through intelligent routing. It’s also important because it’s where the users are, and if you run a website, you probably care about where your users are coming from. In this sense, it’s not just which geographies they come from, but also which ISPs they come from. This information is crucial for scaling your service successfully. It’s also where your users actually experience your site’s performance. You can simulate this with synthetic measurements, but there are many problems with that type of simulation. Last-mile RUM measurements are important for exactly these reasons.

Understanding the architecture of the Internet is important for trying to understand how to improve your performance.

And Real User Measurements (RUM) are the key to understanding the internet because they show the real performance of your site from those geographies and networks. You can download my newest eBook for free to learn about the last mile of RUM.
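To give a rough idea of what a last-mile RUM probe looks like in practice, here is a minimal browser-side sketch (the probe URL is hypothetical, and real RUM systems do considerably more): it fetches a small test object and reads its timing from the Resource Timing API.

```typescript
// Minimal RUM-style probe: fetch a small test object from a candidate
// platform and read its timing entry from the Resource Timing API.
async function measureProbe(url: string): Promise<number | null> {
  await fetch(url, { cache: "no-store" }); // bypass the browser cache

  const entries = performance.getEntriesByName(url) as PerformanceResourceTiming[];
  const entry = entries[entries.length - 1];
  if (!entry) return null;

  // duration spans from the start of the request to the end of the response.
  return entry.duration;
}

// Hypothetical small object hosted on the platform being measured.
measureProbe("https://probe.example-cdn.com/r50.gif").then((ms) => {
  if (ms !== null) {
    // A real RUM client would beacon this back, tagged with the user's
    // geography and ASN, for aggregation (e.g. via navigator.sendBeacon).
    console.log(`Probe took ${ms.toFixed(1)} ms`);
  }
});
```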

So the last piece of this puzzle is understanding that different content delivery solutions (clouds and CDNs) have different strengths and weaknesses geographically and from a peering perspective. 

But what if you could take the best performers in each geography and network and not play any games trying to figure out how to do it? There’s no Tetris involved in Cedexis’ solutions. We allow you to route traffic to the best performing public infrastructure for every user, no matter what geography or network they are coming from.
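Conceptually, the routing decision can be as simple as the sketch below (the platform names, scores and thresholds are made up for illustration): for each geography-and-network pair, pick the platform with the best recent real-user latency, skipping platforms whose availability has dropped.

```typescript
// Hypothetical recent RUM scores for one (country, ASN) pair.
interface PlatformScore {
  platform: string;
  medianLatencyMs: number; // recent real-user latency
  availability: number;    // fraction of successful probes, 0..1
}

// Pick the fastest platform among those meeting an availability floor.
function choosePlatform(scores: PlatformScore[], minAvailability = 0.95): string {
  const healthy = scores.filter((s) => s.availability >= minAvailability);
  const candidates = healthy.length > 0 ? healthy : scores; // degrade gracefully
  return candidates.reduce((best, s) =>
    s.medianLatencyMs < best.medianLatencyMs ? s : best
  ).platform;
}

// Example: scores for one ISP in one country (values are invented).
const decision = choosePlatform([
  { platform: "CDN A", medianLatencyMs: 42, availability: 0.99 },
  { platform: "CDN B", medianLatencyMs: 35, availability: 0.8 }, // fast but flaky
  { platform: "Cloud C", medianLatencyMs: 55, availability: 1.0 },
]);
console.log(`Route this user population to: ${decision}`); // "CDN A"
```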

We don’t play games. You should not either – at least with your web and app performance.

Why Micro-Outages Are the Speed Bumps of the Internet

In previous posts I have written about the concept of “Micro-Outages”. These outages:

  1. Usually last between 5 and 20 minutes.
  2. Are sometimes limited to a subset of the 40,000+ ASNs/ISPs that exist in the world.
  3. Typically do not violate an SLA (mostly for duration reasons), or if they do, only minimally.
  4. Sometimes go undetected by synthetic monitoring.

There are many reasons for micro-outages. Everything from a planned maintenance event that was poorly executed, to DNS misconfigurations, to fat-fingered routes inserted into BGP by sleepy sysadmins, to fiber cuts; these outages can have a serious impact on performance. “Performance, you say? Don’t you mean availability?” Glad you asked! In today’s note we want to show you the impact on performance when availability is reduced.

Availability is simply defined as the ability to connect to a remote device. The great thing about the Internet is that from the ground up it was built as a “best effort” system, so when a connection attempt is made from point A to point B and the connection is rejected (or times out, for instance), point A will typically retry. When we measure availability, it’s not just on or off; rather, it is characterized as some number of successful connections out of a number of attempts. So generally we talk about it in terms of:

  • 10 out of 10 connection attempts succeed – 100% availability. Good availability.
  • 7 out of 10 connection attempts succeed – 70% availability. Moderate availability.
  • 5 or fewer out of 10 connection attempts succeed – 50% availability or less. Poor availability.

When we measure a specific platform (say a global CDN or a regional cloud) we can see when it has micro-outages, and we can report on them as we did in the previous posts:

  • Green – Approximately 100% availability
  • Yellow – Approximately 75% availability
  • Red – Less than 40% availability
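As a rough sketch of how a single measurement window turns into one of those colors (the thresholds mirror the legend above; the exact band boundaries in between are an assumption, not Cedexis’ published cut-offs):

```typescript
// Compute availability for one platform over one measurement window
// and map it to the color bands used in the heat maps.
function availability(successes: number, attempts: number): number {
  return attempts === 0 ? 0 : (successes / attempts) * 100;
}

function colorBand(availabilityPct: number): "green" | "yellow" | "red" {
  if (availabilityPct >= 95) return "green";  // approximately 100%
  if (availabilityPct >= 40) return "yellow"; // degraded, e.g. ~70-75%
  return "red";                               // less than 40%
}

// Example: 7 of 10 connection attempts succeeded in this window.
const pct = availability(7, 10);
console.log(`${pct}% availability -> ${colorBand(pct)}`); // 70% availability -> yellow
```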

For this blog we are taking a different, broader approach: we are going to graph availability for more than 100 platforms and map how good (or bad) availability dictates good (or bad) performance.

So let’s take a 24-hour period. That’s the horizontal axis. Next let’s take the top 50 “top talker” ISPs. That’s the vertical axis. So what you see below is 24 hours for those top 50 ISPs. Each block is a minute. For each minute we are measuring the ability of that network (ISP) to make connections to more than 100 platforms. Those platforms are clouds and Content Delivery Networks.
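A rough sketch of how such a grid could be assembled from raw probe results (the field names are assumptions for illustration, not Cedexis’ actual schema): bucket each measurement by ISP and minute, then compute availability per cell.

```typescript
// One raw probe result: did a user on a given ISP reach a given platform?
interface Probe {
  isp: string;
  timestampMs: number;
  success: boolean;
}

// Build a per-ISP, per-minute availability grid for a 24-hour window.
function buildGrid(probes: Probe[], dayStartMs: number): Map<string, number[]> {
  const minutesPerDay = 24 * 60;
  const counts = new Map<string, { ok: number; total: number }[]>();

  for (const p of probes) {
    const minute = Math.floor((p.timestampMs - dayStartMs) / 60_000);
    if (minute < 0 || minute >= minutesPerDay) continue;

    if (!counts.has(p.isp)) {
      counts.set(p.isp, Array.from({ length: minutesPerDay }, () => ({ ok: 0, total: 0 })));
    }
    const cell = counts.get(p.isp)![minute];
    cell.total += 1;
    if (p.success) cell.ok += 1;
  }

  // Convert counts to availability percentages (NaN where no data exists).
  const grid = new Map<string, number[]>();
  counts.forEach((cells, isp) => {
    grid.set(isp, cells.map((c) => (c.total ? (c.ok / c.total) * 100 : NaN)));
  });
  return grid;
}
```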

So when a specific ISP can connect to every platform 100% of the time, we can call that a perfect day – or a perfect game, in baseball nomenclature – and it’s represented as a green block. Likewise, if a network has poor connectivity to the majority of platforms, its color is red. For instance, looking at the next picture you can see two networks with quite different days:

Availability issues

Arrow (1) points at a network that had essentially perfect availability for the entire day – the so-called perfect game. Arrow (2) points at a network that struggled with availability starting in the late morning and finally got back to green late in the evening.

In the middle you see many networks that have typical diurnal flows of congestion – meaning people wake up and start using the internet, that network gets congested, and connection attempts go from 10 out of 10 to 7 out of 10 to even lower for short periods. These are marked by the yellowish to orange blocks. Network managers are always making decisions based on CAPEX and customer requirements. Many network admins will not upgrade hardware and capacity until there are distinct problems. Consumer connection retries are, in many cases, not a strong enough reason.

So now let’s look at the same time period, but rather than measuring connection attempts, let’s measure latency. Again, over the exact same time period and the exact same ISPs, what latency was experienced according to real end-user (RUM) measurements?

So as you can see, on the networks where users were experiencing high availability, latency was (mostly) reasonably low. And where there was poor connectivity or availability, there was always poor latency. In fact, if you look closely, the rule seems to be something like: poor availability always predicts poor latency, and good availability mostly predicts good latency.

An example of why that latter claim is only “mostly” true can be seen by taking a look at arrow (3) in the diagram below:

Here is an example of a network that has decent connectivity (availability) but is REALLY slow. The latency on this network is horrible, yet if you look back at the previous diagrams you will see pretty good availability. But the exception proves the rule, as they say. This is an exception; for the most part, poor connectivity predicts poor latency.

Hope this is interesting. I find it to be. Either way, I would love to hear some feedback! We have a tremendous amount of data and would love to hear about what interests you. With over 2 billion RUM measurements covering every major infrastructure player on the Internet, I am sure we can find some stories that would be of interest.

Thanks

-p

Peering into Availability Problems #3: The Final Installment

Recently, we’ve shown you how even though you may rely on a content delivery network (CDN) with a good SLA, there’s a lot that happens on the Internet that can affect how content availability looks from the perspective of end-users. So far, we’ve taken a look at an incident that affected much of Europe and another that afflicted North and South America. Using heat maps built from data obtained with Cedexis Real end-User Measurements (RUM), we’ll turn to another part of the world, and look at a more abrupt incident occurring across Asia.

Identifying the Incident
As in our first two articles, we define an incident as a reduced ability to reach a platform from a set of networks (ISPs), where a platform is defined as a CDN, a cloud service, or a private or public datacenter.
We identify an incident using the same method: for any 6-hour average, if 5 consecutive one-minute averages are significantly under that 6-hour average (by 20% or more), it counts as an event. Using this approach, Cedexis identified an incident that affected availability for a major CDN starting at 9:16am on April 19, with a duration of about 9 minutes.
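A rough sketch of that detection rule, assuming we already have one-minute availability averages for a platform (how the 6-hour baseline window is aligned is an assumption here; the 20% and 5-minute thresholds mirror the description above):

```typescript
// Flag incidents: 5 consecutive one-minute averages that fall 20% or more
// below the trailing 6-hour average.
const WINDOW_MINUTES = 6 * 60;
const CONSECUTIVE = 5;
const DROP_FRACTION = 0.2;

// minuteAvgs[i] is the availability average (0..100) for minute i.
function detectIncidentStarts(minuteAvgs: number[]): number[] {
  const starts: number[] = [];
  let run = 0;

  for (let i = WINDOW_MINUTES; i < minuteAvgs.length; i++) {
    const window = minuteAvgs.slice(i - WINDOW_MINUTES, i);
    const baseline = window.reduce((sum, v) => sum + v, 0) / window.length;

    const isLow = minuteAvgs[i] <= baseline * (1 - DROP_FRACTION);
    run = isLow ? run + 1 : 0;

    // Record the minute at which the run of low readings began.
    if (run === CONSECUTIVE) starts.push(i - CONSECUTIVE + 1);
  }
  return starts;
}
```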

Outage in the Pacific Rim
The heat maps use the following legend to depict availability:
• Green – Approximately 100% availability
• Yellow – Approximately 75% availability
• Red – Less than 40% availability

Figure 1 – Continent Level View of Incident

Right away, we can see that this incident is confined to Asia, and its impact is immediate and severe. Like a heart attack, its symptoms come on very quickly. CDN availability is reduced to less than 40% within the span of a minute and remains at that level until 9:25, when it suddenly returns to approximately 100%. We’ll drill down to find out which countries were most affected.

Figure 2 – Country Level View of Incident

When we look at the entire continent, we can see that availability is affected most in the Pacific Rim countries, with most of them experiencing severe availability problems from the outset. Of the Pacific Rim countries, only Russia doesn’t experience availability problems, while Cambodia and Hong Kong experience only minor ones. This suggests that the ISPs in those countries are perhaps multi-homed for their routes to the CDN’s services.
When we drill down to the ISP level, the heat maps show a similar picture.

Figure 3 – ISP Level View of Incident

ISPs throughout the Pacific Rim countries experienced an almost complete inability to reach the CDN, with only one ISP in Vietnam maintaining good availability throughout the incident. Similar to what we saw at the continent and country levels, when the incident ends, availability returns to nearly 100% immediately.

The Impact of Availability Problems
Users in the countries affected by this incident likely experienced severe performance degradation when trying to reach content delivered by the CDN. This could manifest in a number of ways, from slow-loading or stalled web pages to unresponsive applications and interruptions of audio and video streams. While the problem could have originated outside the CDN’s control, it seems unlikely: too many ISPs in too many locations were seeing the same problem at the same time. From the CDN’s point of view, there may have been no downtime; end users, however, would have had a markedly different experience during this time.

RUM does not directly diagnose the cause of problems like these, but the insight into how users experience content minute by minute is very revealing. The failure could be in an ISP, a data center, or any number of transit points. It does not matter to users who failed them, just that they see your site as down or don’t get their content.

We don’t have to tell you that outages, even small ones, impact your business. Video streaming, e-commerce, and mobile apps are all affected. In fact, availability issues are cited as one of the common reasons that users abandon e-commerce shopping carts.

When end users experience problems with content availability, their frustration can have a significant impact on your business, so you need the kind of insight that Real end-User Measurements with Cedexis Radar can give you. Read about how this works on our website. More specifically, go see which CDNs are performing best in which countries right now!

If you found this interesting please feel free to drop me a line at pete@cedexis.com. With billions of data points every day and hundreds of similar stories to this one we are always looking for input on what people are interested in hearing about.

New report from Arcus Advisors on content delivery marketplace evolution

Our friends at Arcus Advisors have released an interesting report on the state, and evolution, of the Content Delivery marketplace. The study breaks the market into three time/technology periods with an exploration of the success factors and challenges related to each.

Of particular interest is the opinion that the currently emerging 3rd phase content delivery offerings are being heavily influenced by mobile and enterprise content delivery needs, with new technologies emerging as both embedded elements of the network and as abstracted components of SDN offerings.

Executive Summary:

The current content delivery marketplace is a difficult operating environment with considerable headwinds to growth. It is political, has piracy problems, and faces extreme price sensitivity from consumers.

Opportunities in the enterprise space benefit from many of the same trends as the consumer market, but do not have the same piracy and political problems. Enterprises also have a history of paying for quality and security.

There are no CDNs in the mobile network right now. There is only delivery to mobile devices. The lack of infrastructure is an opportunity; but the technology, economics, real estate, and politics of wireless are considerable headwinds for any company operating in the space.

The first phase of CDNs comprised pure-play network operators such as Akamai. The second phase, which is concluding now, was defined by ISPs trying to copy what CDNs were already doing and encountering great headwinds. Growth in the next phase of content delivery will come from operators who combine storage, computing power, and delivery, or who have a unique focus. There are several companies already operating in this new space. CDN functionality will become more deeply integrated with existing network equipment and commodity hardware, with an SDN/app approach to adding new services.

The growth of the Internet is causing changes in the content delivery marketplace. Current leaders are under pressure to adapt and new providers are emerging to address new markets and opportunities.

For information about the complete study, please contact: slandman(at)arcusadvisors.com

Which Clouds have the best peering in the US?

Collecting a billion measurements a day from over 30,000 networks provides insight I wish I’d had when managing hosted security operations at my previous job. We had over 2,500 physical and virtual instances across 3 main data centers. Each data center was, of course, multi-homed, and we bought transit and peering from a redundant set of bandwidth providers. But how well were those providers performing? What were all those peering arrangements and transit costs actually buying us? Agent-based monitoring from the usual suspects could confirm that I was well peered to backbone networks and major data centers, but how well connected were my data centers to real people’s homes, offices and cafes?

In honor of Cloud Connect I’ve taken some Cedexis Radar data and made suggestions for those considering Azure and Amazon as cloud vendors. How good are their peering arrangements? How effective are they at delivering dynamic content to the last mile?

When you try to ask this question using Cedexis Radar, the first thing you realize is that, to be meaningful, you actually have to ask a more granular question: from which ISP and over what time frame? The per-ISP and per-minute variability in Radar measurements is a constant source of amazement to me. So to start, we’ll choose the week of Monday, January 30, 2012 and three major US ISPs: AT&T, Verizon and Qwest. We’ll measure HTTP Response Time – a measurement we define as the time to request and download, over a warmed-up socket, a 50-byte file. I’ve included both of Azure’s US locations, all three of Amazon’s US locations and, for comparison, the Internap/Voxel US location.
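As a rough browser-side sketch of that kind of measurement (the object URL is hypothetical, and Radar’s actual probe logic differs in its details): fetch the small object once so DNS, TCP and TLS are already set up, then time a second fetch of it end to end.

```typescript
// Approximate the "HTTP Response Time" measurement: time to request and
// download a small object over an already-warmed-up connection.
async function httpResponseTime(url: string): Promise<number> {
  // First request warms up DNS, TCP and TLS so they don't skew the timing
  // (this assumes the connection is kept alive and reused).
  await fetch(url, { cache: "no-store" });

  // Time the second request end to end.
  const start = performance.now();
  const res = await fetch(url, { cache: "no-store" });
  await res.arrayBuffer(); // ensure the body is fully downloaded
  return performance.now() - start;
}

// Hypothetical 50-byte test object on the platform being measured.
httpResponseTime("https://us-east.example-platform.com/r50.bin").then((ms) => {
  console.log(`HTTP Response Time: ${ms.toFixed(1)} ms`);
});
```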

In each graph, the Y axis is average HTTP Response time in milliseconds measured from millions of browsers sitting in each network.

We’ll start (of course) with AT&T, the largest US network by subscriber base:

AT&T US (ASN 7132)

It’s interesting that EC2 US-West starts well and then trails off, and that Azure North Central outperforms EC2 US-East. I’d have guessed that the AWS Virginia facility would be the consistent winner.

 

Verizon US (ASN 19262)

To be clear, this is Verizon’s home and business network (not their mobile network). It’s surprising that, except for a midweek hiccup, Voxel’s US location edges out Amazon US-East.

Qwest US (ASN 209)

In this case, EC2 US-West is consistently the fastest, even though that AWS region would have been a poor choice for reaching visitors coming from AT&T or Verizon (at least over the course of this week), and Voxel’s US location is as much as 40% slower than the top choice.

Of course, our stance is that spreading your risk and your performance across multiple cloud providers is the best bet, and we’ve built a set of tools to make a multi-provider strategy and its execution possible. We can even show you how well your own peering providers are doing. Wish I’d had this in the last job!

 

Sample Country Report: Example of France

Philosophically, our goal is to show companies how fast they could be by leveraging an effective multi-platform strategy. We’ve been playing around with the idea of replacing our public charts with a set of Country Reports that answer a specific set of questions.

Here’s an example for France…

ISP Marketshare:
Where are my end-users (most likely to be) coming from within this country?

ISP Performance:
What are average page load times for end-users coming from these ISPs?

Web Benchmarks:
On average, how do the biggest sites in the world compare for end-users in this country?

Cloud Performance & Availability:
Where should I deploy my applications in order to deliver the best results to this country?

CDN Performance & Availability:
Features aside (although features are often the most important consideration), what can I do about my static content to achieve the best results in this country?

Dynamic Content Acceleration (coming soon):
Which technologies can have the biggest impact on end-user perceived performance of my dynamic content?