Wednesday, April 5, 2017

Have expectations of five-nines reliability diminished?

On the evening of March 8th, AT&T experienced a widespread outage of 911 emergency calling service for mobile customers across a significant portion of the U.S. Some callers were simply unable to reach an emergency operator. Media reports suggested the outage affected at least 14 states and Washington DC. AT&T Mobility confirmed that service interruptions had prevented callers from reaching 911 emergency centres, but it did not disclose the extent of the problem or its cause.

During the outage, which lasted several hours, FCC chairman Ajit Pai reached out directly to Randall Stephenson, AT&T's CEO, and took to Twitter to express his alarm at the situation. The FCC has since launched an investigation to track down the root cause of the outage. Until the report is published, it is not known which systems failed, whether the issue cascaded from one facility to the next, how quickly the problem was detected, or how the network recovered.

AT&T is not the only major U.S. carrier in the news for 911 connectivity trouble. Another major story concerns 911 'ghost' calls from T-Mobile subscribers in the Dallas area. The alarming situation, as reported by The Washington Post, has meant that T-Mobile users have been placed on hold for extended periods of time. At one point in March, 442 callers in Dallas reportedly were placed on hold for an average of 38 minutes. A technical fault in the city's 911 centre is being blamed for the deaths of at least two people. Worse, the problem apparently has happened before, perhaps dating back several months, and there has not been a sufficient effort to fix it. Whether the faulty equipment is ultimately found to be in the city's emergency response centre, in the carrier network, or in some interface between the two, the result is that the public has been placed in danger by diminished networking standards.

For big public cloud providers, recent months have not been great for reliability

On February 28th, Amazon Web Services suffered a widespread outage of its S3 web-based storage system. The anomaly involved 'high error rates' with S3 in US-EAST-1, which brought down many high-visibility websites including Business Insider, Quora, Slack and others. While other S3 regions were not affected, the number of websites now relying on AWS infrastructure is remarkably high and rising. In fact, one estimate puts the figure at 165,344 websites and 137,396 unique domains now running on AWS S3. On the positive side, Amazon publishes up-to-the-minute information on service availability worldwide. The company is also quite responsive in posting technical updates while service is being restored. Rather than waiting months for a fault-finding report, AWS has posted technical assessments within hours of resolution. For this latest S3 outage, the blame was attributed to human error: an S3 team member using an established playbook executed a command meant to remove a small number of servers for one of the S3 subsystems used by the S3 billing process. The command had unintended consequences. A full restart was required, resulting in hours of downtime for customers, many of whom had real-time, mission-critical applications.

Meanwhile, on March 16th Microsoft Azure experienced a storage incident that disrupted services in 26 of the public cloud's 28 regions. The disruption has since been characterised as two separate incidents, the first having a global impact and the second being confined to its US East region. Last September, Azure experienced a different sort of problem, a DNS error that affected many of its cloud services worldwide for several hours.

None of the outages cited above appears to have been caused by malicious intent. Deliberate attacks are, of course, a prime concern for network reliability, especially given that DDoS attacks continue to grow in size and sophistication. For instance, the October 2016 Mirai botnet attack on Dyn's DNS infrastructure reportedly involved tens of millions of discrete IP addresses from IoT devices.

Are capex budgets sufficient for maintaining five-nines?

For decades, the expectation has been that emergency calls would always get connected, even on Mother's Day, when traffic volumes spike to the highest levels of the year, or if key equipment were to fail. Five-nines (99.999%) reliability translates to system downtime of less than 5.26 minutes per year. The standard was achieved and maintained through excellence in engineering and in management, and many aspired to six-nines reliability, the equivalent of about 31.5 seconds of downtime per year.
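The downtime figures quoted above follow from simple arithmetic, sketched here in Python. The function name and the 365.25-day year are assumptions made for illustration; a 365-day year gives slightly smaller values.

```python
# Allowed downtime per year for an availability target expressed in "nines".
# Assumption: an average year of 365.25 days (525,960 minutes).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(nines: int) -> float:
    """Return the permitted downtime in minutes per year for N nines.

    Five nines means 99.999% availability, i.e. an unavailable
    fraction of 10^-5 of the year.
    """
    unavailable_fraction = 10 ** -nines
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        minutes = allowed_downtime_minutes(n)
        print(f"{n} nines: {minutes:.2f} min/year ({minutes * 60:.1f} s/year)")
```

Under these assumptions, five nines works out to roughly 5.26 minutes per year and six nines to a little over 31.5 seconds, in line with the figures in the text.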

AT&T is currently undergoing a historic transformation to a virtualised network architecture, and the company talks about its Network 3.0 cloud-centric vision as a guiding force for itself and the rest of the telecom industry. Earlier this month, AT&T stated that it has already converted 34% of its network functionality to SDN and is on the way to 75% by 2020. The network virtualisation goal for year-end 2017 is 55%. It is unclear if or when the 911 connectivity systems will become part of this transformation.

One of the touted benefits of the new virtualised systems is rapid and easy fail-over. Using pods of generic x86-based systems rather than closed, purpose-built legacy systems should provide better than 1-to-1 redundancy. On the other hand, every component of the traditional systems was designed for high availability.

One question that perhaps the FCC investigation will address is whether sufficient capex is being dedicated to maintaining the legacy systems until the new architecture is fully deployed and proven to be equally reliable. Last summer, the Communications Workers of America (CWA) reached out to regulators in New York, New Jersey, Maryland, Delaware, Pennsylvania, Virginia, and Washington, DC, arguing that Verizon has been under-investing in its copper access network since at least 2008. The complaint alleged that Verizon's spending on its Fios fibre infrastructure came at the expense of maintenance for its aging copper networks, which still serve some 8 million customers, to whom the company still has a statutory obligation to provide safe and reliable service.

Public cloud providers have no such regulatory requirements to achieve five-nines, but they do maintain service level agreements with their customers. Hour-long outages are quite costly, and competitive pressures are even more costly. In the future, as billions of devices such as self-driving cars, delivery drones in flight, and in-home medical equipment come online, the need for always-on networking will be more acute than ever.