China's Web Outage: Defense against the Risk of Third-Party Services

By Heiko Specht, Compuware APM Center of Excellence

On Tuesday, January 21, one of the biggest Internet outages in history, if not the largest, struck China. Essentially, the Internet went dark for one of the strongest and fastest-growing economies for one full business day. Since succeeding in China is a top priority for many companies, this outage affected organizations in all corners of the world. Not only were revenues lost, but organizations also took a huge hit to their brands and were seen as unresponsive, even if they had nothing to do with causing the outage. What's more, global advertisers lost significant investments as major consumer-facing Chinese websites went down one after another.

Major outages like this are nothing new, but they have increased in frequency in the past year. Major technology players from Microsoft to Apple's iCloud service to Amazon EC2 have all experienced major outages since this past summer, resulting in unplanned downtime for consumers and businesses alike. One would think that massive investments in technology infrastructure would minimize service disruptions like these, or at least contain them. But they haven't. The truth is, these outages and their widespread collateral damage signify the escalating dangers of an increasingly interconnected and complex Web world. More specifically, the China outage is a perfect case in point of something that's becoming increasingly obvious: the perils of relying so heavily on third-party services and applications interwoven into modern websites.

What Exactly Happened?

At around 3 p.m. local time on January 21, two-thirds of all domain requests in China were routed to a single IP address in Wyoming, which promptly collapsed under load. This was believed to be a domain name system (DNS) attack, the biggest of its type in history. Not all domains were affected; mainly it was those ending in .com and .net, while those ending in .com.cn were partially affected.

Unfortunately, most of the Chinese websites that were not directly impacted ended up going down as well. Here's why: many of the affected domains hosted third-party services relied upon by thousands of Chinese websites. One example is analytics engines. Never mind that the analytics engines weren't working, meaning that companies lost a whole day's worth of data that could have been used to increase conversions. That was just the tip of the iceberg. Like dominoes, these "poisoned" third-party services brought down the websites they were feeding into, even those that were not directly affected by the attack. Another third-party service that went dark was PayPal. This meant that any website integrating PayPal on its back end could not process transactions for a full eight hours, which was a moot point anyway because these websites were likely inaccessible.

If third-party services make a website so vulnerable, why use them? Like it or not, for most companies, third-party services are here to stay. It's far easier to sign a contract with an advertising firm to help optimize the display of ads on a site than to try to design such a system internally. Items such as analytics, social media, Web fonts and popular JavaScript libraries are often drawn from services that websites don't directly control, but rely on to work efficiently and reliably at all times. When these external services have an issue, it's the website owner that takes the hit to revenues and reputation, not the third party. In fact, according to some estimates, organizations control only one-third of the time required to load a page; the rest is consumed by third-party services and content outside their direct control.
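To make an estimate like this concrete for a given page, an organization could tally resource timings by host. The following is only a rough sketch: the function name `third_party_share`, the sample URLs and the durations are all hypothetical, and summing per-resource durations ignores parallel downloads, so the result is an approximation of exposure rather than a true share of load time.

```python
from urllib.parse import urlparse


def third_party_share(entries, first_party_hosts):
    """Return the fraction of summed resource time spent on hosts we do not control.

    entries: iterable of (url, duration_ms) pairs, e.g. exported from a page waterfall.
    first_party_hosts: set of hostnames the organization operates itself.
    """
    total = 0.0
    third_party = 0.0
    for url, duration_ms in entries:
        host = urlparse(url).hostname or ""
        total += duration_ms
        if host not in first_party_hosts:
            third_party += duration_ms
    # Avoid dividing by zero when no timings were collected.
    return third_party / total if total else 0.0


# Hypothetical timings for a single page view (illustrative values only).
sample_entries = [
    ("https://www.example.com/index.html", 300),
    ("https://static.example.com/app.js", 200),
    ("https://analytics.example.net/tracker.js", 600),
    ("https://fonts.example.org/font.woff", 400),
]
share = third_party_share(sample_entries, {"www.example.com", "static.example.com"})
print(f"Third-party share of summed resource time: {share:.0%}")
```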

Lessons Learned

In this era of increased dependence on third-party services, is there anything organizations can do to experience the benefits while protecting and insulating their Web performance? Fortunately, with certain approaches, the negative impact of major Web service outages can be mitigated. For example:

Organizations need to be better about getting ahead of website performance issues

Given all the performance-impacting elements standing between the data center and the end user - e.g. the cloud, CDNs, ISPs, devices and browsers - the end-user perspective is the only reliable vantage point from which to gauge performance. New-generation application performance management (APM) tools can deliver this view, and it's important to work with technology providers that offer performance views across key geographies and user segments.

Organizations must closely evaluate and monitor third-party services

Before a third-party service is enlisted, organizations should carefully test its performance. One way is to compare website performance before and after the third-party service is added, in order to gauge the overall performance impact. If a performance degradation is identified, organizations must work with the third-party provider to fix the problem before the service is implemented.
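A before-and-after comparison of this kind can be sketched in a few lines. This is a minimal illustration, assuming load-time samples have already been collected by some measurement tool; the function name `performance_impact`, the 200 ms budget and the sample numbers are hypothetical, and the median is just one reasonable summary statistic.

```python
from statistics import median


def performance_impact(baseline_ms, with_service_ms, budget_ms):
    """Compare median page load time before and after adding a third-party service."""
    before = median(baseline_ms)
    after = median(with_service_ms)
    delta = after - before
    return {
        "before_ms": before,
        "after_ms": after,
        "delta_ms": delta,
        # True when the added cost stays within the agreed performance budget.
        "within_budget": delta <= budget_ms,
    }


# Hypothetical load-time samples in milliseconds (illustrative values only).
baseline = [1200, 1180, 1250, 1220, 1190]
with_widget = [1450, 1500, 1480, 1700, 1460]

result = performance_impact(baseline, with_widget, budget_ms=200)
print(result)
```

In this made-up data set the median rises by more than the budget allows, which is the signal to go back to the provider before rollout.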

Monitoring third-party services in production is also important, not only to validate service level agreements (SLAs) but also to identify third-party performance issues as they occur and take appropriate action. As the China example illustrates, the "ripple effect" of third-party performance issues is often unavoidable. But that doesn't mean the impact can't be contained or minimized. Organizations should have contingency plans in place so that when a serious performance problem is detected, offending third-party services can quickly be removed. While they can be extremely valuable when performing well, many third-party services (such as analytics) are not worth having if it means frustrating customers.
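One common way to implement such a contingency plan is a circuit-breaker-style switch that stops including a service after repeated slow or failed calls. The sketch below is an assumption about how that might look, not a description of any particular product: the class name `ThirdPartyBreaker`, its thresholds and the injectable `clock` parameter are all hypothetical.

```python
import time


class ThirdPartyBreaker:
    """Disable a third-party service for a cooldown period after repeated problems."""

    def __init__(self, max_failures=3, slow_ms=2000, cooldown_s=300, clock=time.monotonic):
        self.max_failures = max_failures
        self.slow_ms = slow_ms
        self.cooldown_s = cooldown_s
        self.clock = clock          # injectable so the behavior can be tested
        self.failures = 0           # consecutive slow or failed calls
        self.disabled_until = None  # None means the service is currently allowed

    def record(self, elapsed_ms, ok=True):
        """Record one monitored call to the third-party service."""
        if ok and elapsed_ms <= self.slow_ms:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            # Trip the breaker: stop serving the integration until the cooldown ends.
            self.disabled_until = self.clock() + self.cooldown_s
            self.failures = 0

    def enabled(self):
        """Return True if the page should still include the third-party service."""
        return self.disabled_until is None or self.clock() >= self.disabled_until


# Usage: check the breaker before rendering the integration into a page.
analytics_breaker = ThirdPartyBreaker(max_failures=3, slow_ms=2000, cooldown_s=300)
analytics_breaker.record(elapsed_ms=350)
print(analytics_breaker.enabled())
```

The cooldown lets the service come back automatically once the provider has recovered, so nobody has to remember to re-enable it by hand.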

The end-user experience needs to be top-of-mind in all third-party service decisions

In general, websites should keep third-party services to a minimum. Before adding a third-party service, organizations always need to ask themselves whether the added feature or functionality is worth the potential increase in overall vulnerability and lost conversions. In this vein, there needs to be constant communication between performance monitoring teams and the teams who request and depend on these third-party services. This is the key to making the smartest decisions that will protect and promote revenues above all else.

Additionally, when a third-party service is implemented, there are certain design steps organizations can take to proactively reduce risk exposure. For example, by understanding the load order of elements on a site and making sure third-party services and applications are on the bottom, organizations can protect and enhance perceived customer load time, even when a third-party service does suddenly go awry.
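As a rough illustration of the load-order idea, a build or templating step could sort a page's resources so that everything hosted on the organization's own domains is emitted first and third-party resources come last. The function name `order_for_loading` and the URLs below are hypothetical, and a real page would also need to account for dependencies between scripts.

```python
from urllib.parse import urlparse


def order_for_loading(urls, first_party_hosts):
    """Return urls with first-party resources first and third-party resources last.

    The relative order within each group is preserved.
    """
    first_party, third_party = [], []
    for url in urls:
        host = urlparse(url).hostname or ""
        if host in first_party_hosts:
            first_party.append(url)
        else:
            third_party.append(url)
    return first_party + third_party


# Hypothetical resource list for one page (illustrative URLs only).
page_resources = [
    "https://analytics.example.net/t.js",
    "https://www.example.com/app.js",
    "https://social.example.org/widget.js",
    "https://www.example.com/main.css",
]
ordered = order_for_loading(page_resources, {"www.example.com"})
for url in ordered:
    print(url)
```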

As a final note here, to ensure better performance for feature-rich websites and applications, many organizations rely on content delivery networks (CDNs) strategically located in key geographies. Ironically, CDNs represent another third-party service and another potential point of failure, especially since they're likely serving multiple customers experiencing "flash" traffic events. Here, again, measuring performance from the true end-user perspective on the other side of a CDN is critical to protecting and maximizing these investments.

Leverage industry resources

Free services such as Outage Analyzer can identify third-party service outages and the corresponding regional impacts (see image). For example, around the 2013 holiday season, Outage Analyzer identified one third-party outage that impacted hundreds of domains. Services like this may not prevent major outages from happening, but they can help organizations at least see when a widespread performance issue is not their own, and give them a head start in putting contingency plans into place and communicating proactively with customers.

Prepare for the unexpected

Kai-Fu Lee, Google's former China head, recently wrote: "To have a chance in China, the American company must empower the local team to be responsive, autonomous, localized and ready for combat." Indeed, the Chinese market is huge, tough and hypercompetitive, and the recent outage likely left many global companies feeling as if their hands were tied. The lax reaction by the international media only served to downplay the significance of the event to those beyond China's physical borders. But make no mistake, the hit to the revenues and brands of companies around the world was huge.

To a certain extent, major Web events like the one that just happened in China are unavoidable. But the fact is, the China outage is just one more example of the little "Internet storms" that are brewing all the time and of their impact on modern websites. Fortunately, this impact can be anticipated, contained and minimized with the right approaches. As Kai-Fu Lee implies, organizations must be prepared for the unexpected and refuse to cede ultimate responsibility for their website success, wherever they may be doing business.

Heiko Specht is a technology expert at the Compuware APM Center of Excellence.