Cloud Providers Work To Disperse Points Of Failure

Outages at CloudFlare and Microsoft's Azure in the past month underscore how a single weak point in cloud infrastructure can cause widespread chaos

Dark Reading Staff, Dark Reading

March 15, 2013


On March 3, a router glitch caused Web infrastructure service CloudFlare to disappear from the Internet for more than an hour, cutting off every website protected by the firm's widespread network. A week earlier, every customer of Microsoft Azure lost secure access to the service's storage network for half a day, after three critical certificates were accidentally allowed to expire.

While companies deal with system failures every day, a single point of failure in a cloud service or its infrastructure does not affect just one company: every firm that relies on the affected provider faces a potential outage. The recent incidents underscore the danger that cloud services can take down dependent parts of the Internet if providers do not seek out and disperse points of failure, says Matthew Prince, co-founder and CEO of CloudFlare.

"The challenge, when you have these systemic problems, is that -- when, in Azure's case, they let the domain certificate expire and it affected all their customers at once or, in our case, when we hit a bug with the software that is running on our routers -- instead of having a very limited impact, you have an impact against all the service, generally," Prince says. "Even though the time of the impact may be short, it impacts a large number of customers."

While responding to an attack on March 3, CloudFlare issued a single worldwide update to its filtering rules that triggered a bug in the firm's Juniper routers, causing them to crash and knocking offline the thousands of websites the company protects. Nearly 3 percent of Web requests are routed through CloudFlare's network, Prince says; unsurprisingly, within 10 minutes the company had received some 10,000 tweets from customers.

"It looks really, really bad when we go offline," Prince says. "If we take our network offline, that's 3 percent of Web requests that suddenly stop, and people notice that. So what we are constantly trying to do is trying to make sure that there are no single points of failure that could happen."

[A wave of high-profile breaches of cloud-based services during the past few months is a reality check for entrusting your data with these providers. See The Dark Side Of The Cloud.]

Cloud providers need to study and model the ways their services can fail -- including under attack -- and plan for outages so that downtime does not cause chaos but instead produces a predictable, controlled response, said Tim Rains, director of Microsoft's Trustworthy Computing group, in an e-mail interview. Microsoft has outlined its methods for designing reliable cloud services in a whitepaper highlighting the need for resilience, data integrity, and recoverability.

"The challenge for cloud providers is to anticipate these failures will invariably happen and design cloud services so that when something does go wrong, the impact to customers is avoided or minimized," Rains said.

In its own analysis of the Azure incident, Microsoft pledged to begin tracking, through its support system, certificates that are in danger of expiring. CloudFlare, for its part, will no longer push new filtering rules worldwide all at once; instead, the company will roll changes out gradually, starting with the routers in the data centers currently seeing the attack the rules are designed to stop.
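Catching a certificate before it lapses is simple to automate. As a hypothetical illustration -- the endpoints and the 30-day warning threshold here are assumptions, not Microsoft's actual tooling -- a monitoring script can pull each endpoint's TLS certificate and flag any that are close to expiring:

    import ssl
    import socket
    from datetime import datetime, timedelta

    def cert_expiry(host, port=443):
        """Fetch a host's TLS certificate and return its expiry datetime (UTC)."""
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        return datetime.utcfromtimestamp(ssl.cert_time_to_seconds(cert["notAfter"]))

    # Hypothetical endpoints to watch; the 30-day threshold is an assumption.
    for host in ("storage.example.com", "api.example.com"):
        expires = cert_expiry(host)
        if expires - datetime.utcnow() < timedelta(days=30):
            print(f"WARNING: certificate for {host} expires {expires:%Y-%m-%d}")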

Yet the companies that use cloud providers also need to bear some of the responsibility, says John Howie, chief operating officer of the Cloud Security Alliance.

"It really is the consumers' responsibility to factor that in and have redundant and reliable business options," he says.

Many service providers -- especially infrastructure-as-a-service and platform-as-a-service providers -- advise customers to maintain backups of their cloud infrastructure, he says. In April 2011, an outage in Amazon's Elastic Compute Cloud (EC2) made any company that relied on the US East (Northern Virginia) region, without redundancy elsewhere, difficult to reach.
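The redundancy Howie describes can be as modest as running a warm standby in a second region and shifting traffic when the primary's health check stops answering. A minimal sketch, with placeholder URLs standing in for real regional deployments:

    import urllib.request

    # Hypothetical regional endpoints; these URLs are placeholders, not real services.
    REGIONS = [
        "https://us-east.example.com/api/health",
        "https://us-west.example.com/api/health",
    ]

    def first_healthy_region(endpoints, timeout=5):
        """Return the first region that answers its health check, so traffic
        can be routed there; raise if every region is down."""
        for url in endpoints:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.status == 200:
                        return url
            except OSError:
                continue  # region unreachable; try the next one
        raise RuntimeError("no healthy region available")

    if __name__ == "__main__":
        print("routing traffic to", first_healthy_region(REGIONS))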

In the end, cloud providers -- many of which aim for 99.9 percent uptime, or "three nines" -- are likely to offer individual companies a more reliable service than those companies attain for themselves, the CSA's Howie says.

"Organizations have to ask if they can do better than that internally -- three nines is tough to get to," he says. "The question is whether you are better to defend your company yourself, or will the cloud provider do a better job. I would put my money on the cloud provider."


About the Author

Dark Reading Staff

Dark Reading

Dark Reading is a leading cybersecurity media site.

