Cloud Performance: Measure It and Take Action

Having spent nearly a decade with the traditional set of site monitoring tools entrenched in large-scale web sites including MSN, Bing, and Windows Live (you know where I worked), I’ve formed a few opinions on what’s important to measure and monitor.  First and foremost, I chose to focus my energy on the customer experience.  If you’re engaged in an activity which does not clearly impact the customer experience in a positive way, you should ask yourself whether you should be doing the activity at all.  Impacting the customer experience requires measuring the customer experience.  Recently some esteemed industry colleagues at Cloudsleuth measured cloud performance using a backbone & synthetic agent approach.   Our point of view and cloud benchmark results are markedly different and here’s why.

Let’s start with the user.  Does this scene sound remotely familiar? You’re sitting in a room full of engineering leads and architects who build and run your live site.  You’re talking about site performance and the big, bad “S” word comes up:  customers are saying the site is “SLOW”. So like a good engineer you search for actionable data.   You ask yourself, “What can I do to make the site experience better?”  You pull up the usual suspects, the browser plug-ins and the developer tool du jour, and go to work optimizing JavaScript loading and static file sizes with the many site optimization tools out there.  Here’s a waterfall snapshot from Chrome that should look familiar.

Then someone notices some content that is seemingly out of your control in a particular browser session on a developer’s machine.  It’s a third-party ad image or rich content coming from a cloud or CDN that is slowing down the site, presumably for many end-users.   At this point, you either say it’s someone else’s problem or you direct your attention to more “front-end” issues in the page or “back-end” issues in the mysterious data center.  I’ve personally sat in many meetings in which there was no further thought beyond the page and the servers.  Are all users affected this way?  Which users (in which geographies) are experiencing the worst latency?  What happened to the network in the equation?  It turns out most folks lump the network into front-end analysis.  Is that the right approach?  Is the network (broadly speaking, the “cloud”) that sits between your end-user and service bits REALLY completely out of your control?

Real user monitoring (RUM) is too passive.  In a pure sense, passive monitoring means that the measurement does not interfere with the user experience of the page load.  (In some cases, RUM measurements have been known to add latency to the page load.)  Typical RUM data focuses on the page load, which allows the front-end developer to inspect each page component and optimize page delivery at the browser.  RUM page load data is a great way to determine whether you’re doing what YSlow says you should be doing, such as minimizing requests and optimizing CSS, JS, image, and cookie usage.  But a large piece of the problem (the public network between the browser and your service) is not actively monitored.

Cedexis Radar provides actionable network insights into real user experience.  In addition to the page load, a remote probe measures the networks used to deliver the content to the user.  This experience is measured from the browser (passively) to inform real business decisions such as where to intelligently route user traffic in real-time across the entire world of public networks.  We consider this a real user experience expressed in terms of network latency, not page load, because this data is actionable.   Once content is loaded into the user’s browser, your code behaves as the browser chooses to let it behave, from AJAX to HTML5.  A remote probe, however, detects a network anomaly (a slow or unreliable public cloud/CDN) and intelligently decides on the site owner’s behalf how to improve the user experience.  Voila! Dynamic user experience improvement at your fingertips.

The public network is no longer out of your control.  Imagine a world where the engineering discussion of “front-end and back-end” suddenly transformed into “front-end, network, and back-end.”  With a comprehensive view from a community of remote probes, you can make an informed decision about a particular cloud or delivery platform.  The historical network probe data is useful in planning decisions to choose specific platforms in specific regions.  In real-time, the network probe data can be used to make decisions to route traffic according to your business rules.
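To make the routing idea concrete, here is a minimal sketch in Python.  It is not how Cedexis Radar works internally; it only assumes that probe data arrives as (region, platform, latency in ms) samples and that a "business rule" is simply a set of platforms you are willing to use.  The function names, platform names, and numbers are all hypothetical.

```python
from statistics import median

# Hypothetical probe samples: (region, platform, latency in ms).
# These numbers are invented for illustration, not real measurements.
SAMPLES = [
    ("EU", "cdn_a", 120), ("EU", "cdn_a", 140), ("EU", "cdn_a", 130),
    ("EU", "cloud_b", 90), ("EU", "cloud_b", 95), ("EU", "cloud_b", 300),
    ("APAC", "cdn_a", 210), ("APAC", "cdn_a", 190), ("APAC", "cdn_a", 200),
    ("APAC", "cloud_b", 260), ("APAC", "cloud_b", 250), ("APAC", "cloud_b", 270),
]


def median_latency(samples):
    """Group samples by (region, platform) and return the median latency of each group."""
    grouped = {}
    for region, platform, latency_ms in samples:
        grouped.setdefault((region, platform), []).append(latency_ms)
    return {key: median(values) for key, values in grouped.items()}


def choose_platform(samples, region, allowed=None):
    """Return the platform with the lowest median latency for a region.

    `allowed` stands in for a business rule: if given, only platforms in
    that set are considered. Returns None when no platform qualifies.
    """
    candidates = {
        platform: latency
        for (sample_region, platform), latency in median_latency(samples).items()
        if sample_region == region and (allowed is None or platform in allowed)
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    for region in ("EU", "APAC"):
        print(region, "->", choose_platform(SAMPLES, region))
```

The median is used here rather than the mean so that a single slow sample does not flip the decision; that is a design choice of this sketch, not something the probe data dictates.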

We can not only tell you how fast you are, but how fast you *could be* if you used other CDN or Cloud Providers.

Know the improvement to your user (and your business) before you make any changes.  We provide data on page load times, but we also specialize in giving you meaningful network latency measurements.  Then we go one better and give you a comparison, from the perspective of your own web visitors, of how your existing content delivery or hybrid cloud strategy would look if you used different vendors or data centers.

So here’s an example of actionable network latency data.  Remember that slow-loading image you saw from Chrome?  It turns out users in some parts of the world see an even poorer experience and others see a completely different experience even when using a CDN.  But I thought one of the most basic rules in web performance was “use a CDN”.  It isn’t.  It’s “use multiple CDNs (and clouds).”  Here’s a sample of our Response Time data over the course of 2011, showing Rackspace, EC2, and Azure as the Top 3 clouds.  The height of each bar shows the measured Response Time (shorter is better) and the shade of the bar’s color shows the standard deviation of the measured Response Time (darker means more variance).
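If you want to build the same kind of summary from your own measurements, a rough sketch follows.  It assumes the bar height is a mean of the Response Time samples (the chart may use a different statistic) and uses the population standard deviation for the shading.  The `summarize` helper and the provider names in the usage example are hypothetical, and the sample values are placeholders rather than real results.

```python
from statistics import mean, pstdev


def summarize(samples_by_provider):
    """Return (provider, (mean_ms, stdev_ms)) pairs, fastest mean first.

    Providers with no samples are skipped.
    """
    summary = {
        provider: (mean(values), pstdev(values))
        for provider, values in samples_by_provider.items()
        if values
    }
    # Sort by mean response time so the "shortest bar" comes first.
    return sorted(summary.items(), key=lambda item: item[1][0])


if __name__ == "__main__":
    # Placeholder response times in ms, invented for illustration only.
    data = {
        "cloud_x": [100, 100, 100],
        "cloud_y": [40, 60],
        "cloud_z": [],
    }
    for provider, (avg_ms, spread_ms) in summarize(data):
        print(f"{provider}: mean={avg_ms:.1f} ms, stdev={spread_ms:.1f} ms")
```

Two providers with similar means can still feel very different to users if one has a much larger standard deviation, which is why the chart encodes both.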

So what’s the take-away here?  Measure network performance from the perspective of real web visitors, not from the network backbone or from data centers that your real users never visit your site from.   Different cloud platforms yield varying performance characteristics, so hedge your bets with multiple providers.  Finally, keep an eye on Cedexis, since we’ll be keeping an eye on the cloud for you!