Rackspace says service interruption was caused by human error
Provider works to identify root cause and pledges to follow through on SLAs

As Rackspace engineers attempted to recreate in a lab the problem with the company’s Dallas-area infrastructure that caused an outage for many of its customers on Dec. 18, some customers were quick to write angry posts on their blogs. The data center service provider apologized on its own blog and said customers with uptime SLAs would be compensated accordingly.

This was one of several outages Rackspace customers experienced this year, including incidents in June, July and November, which were all caused by problems with power infrastructure at its facilities.

Unlike the previous incidents, this week’s was a networking issue.

A ROUTING LOOP
The outage was caused by a routing loop that was created during a test Rackspace engineers ran as part of the network integration process between the company’s Dallas-area data center and its new facility in Chicago. Rackspace said the problematic router was located in a peering facility outside its own Dallas data center and that the problem affected about 20 percent of the company’s connectivity in the Dallas-Fort Worth area.

“We were trying to bring a new data center online,” said Rackspace Chief Disruption Officer Rob La Gesse. “We were just doing a test. It went bad. Got no excuse for that.”

According to Rackspace, the loop was caused by a software problem rather than by the physical router itself, so the redundancy features of the physical network infrastructure could not prevent it. The router was used for peering and backbone connectivity, and the loop kept some customers’ Web sites from being reached.
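For readers unfamiliar with the term, a routing loop can be illustrated with a short sketch. This is a toy model only, not Rackspace’s topology or configuration: the router names and the `trace_route` helper are hypothetical. Each router holds a single next-hop entry for a destination, and a loop appears when two routers point at each other, so traffic circulates instead of arriving. Because the hardware is healthy in both cases, hardware redundancy alone would not help.

```python
def trace_route(next_hop, source, destination, max_hops=16):
    """Follow next-hop entries from source toward destination.

    Returns (path, status), where status is one of
    "delivered", "loop", "no_route" or "ttl_expired".
    """
    path = [source]
    if source == destination:
        return path, "delivered"
    seen = {source}
    current = source
    for _ in range(max_hops):
        nxt = next_hop.get(current)
        if nxt is None:
            return path, "no_route"
        path.append(nxt)
        if nxt == destination:
            return path, "delivered"
        if nxt in seen:
            # The packet has come back to a router it already visited.
            return path, "loop"
        seen.add(nxt)
        current = nxt
    return path, "ttl_expired"


# Hypothetical next-hop tables for one destination ("customer").
healthy = {"edge": "peering", "peering": "backbone", "backbone": "customer"}
# A bad software configuration: "backbone" sends traffic back to "peering".
looped = {"edge": "peering", "peering": "backbone", "backbone": "peering"}

healthy_result = trace_route(healthy, "edge", "customer")
looped_result = trace_route(looped, "edge", "customer")
print(healthy_result)
print(looped_result)
```

In the second table every router is up and forwarding, yet no packet reaches the customer, which is why the fix described in the article was to route traffic around the misbehaving device.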

“Any outage is painful to me personally and to Rackspace as a company,” La Gesse said. “We spend a lot of money and a lot of time and a lot of resources on making sure that we are available but the Internet is hard and a lot of people are kind of missing the fact that keeping all these complex systems aligned and functional isn’t easy.”

He said Rackspace intended to follow through with compensating customers that had 100-percent-uptime agreements with the company. “We will pay off that SLA at predetermined rates.”

Connectivity issues were first detected by the company at 3:37 p.m. CST. About 25 minutes later, the aforementioned router was identified as the source of the problem. Traffic was rerouted and the affected systems were restored at 4:12 p.m.
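The timeline above implies an outage of roughly 35 minutes. A minimal sketch of that arithmetic follows; the identification time is derived from the article’s “about 25 minutes later” and is therefore approximate, and the variable names are illustrative.

```python
from datetime import datetime, timedelta

FMT = "%H:%M"
# Times as reported (CST); strptime fills in a placeholder date.
detected = datetime.strptime("15:37", FMT)
restored = datetime.strptime("16:12", FMT)
# "About 25 minutes later" in the article, so this value is approximate.
identified = detected + timedelta(minutes=25)

total_minutes = int((restored - detected).total_seconds() // 60)
fix_minutes = int((restored - identified).total_seconds() // 60)

print(f"Detection to restoration: {total_minutes} minutes")
print(f"Identification to restoration: about {fix_minutes} minutes")
```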

ONE OF THE OLDEST CLIENTS STAYS LOYAL
“Since we had so many services there, I could tell it was the network,” recalled Scott Beale, founder and CEO of Laughing Squid, one of Rackspace’s oldest clients. The company provides Web hosting services, as well as a resource for art-, culture- and technology-related content.

Laughing Squid has been using Rackspace services since May of 1999. Currently, Rackspace provides the company with shared managed servers and cloud hosting infrastructure services.

Beale said he was sure Rackspace engineers did all that was in their power to restore normal operations and to keep clients up to date as soon as they could. Beale kept his customers updated via Twitter postings.

“Everybody goes down,” he said. “This stuff happens and unfortunately this year they had a lot more problems than normal.”

He added that the recent downtime instances did not cause Laughing Squid to consider switching providers. “We’re sticking around. That’s for sure.”

BLOGS BEAT NOTIFICATION SYSTEMS TO THE CHASE
According to La Gesse, Laughing Squid is one of about 60,000 customers that use services in Rackspace’s Dallas-Fort Worth data center.

While estimating that those who worked in the facility learned of the issue within seconds, he said there was usually a delay of about five minutes before his colleagues notified him.

“In that time frame, I already had received a couple of customer calls,” he said.

Besides its Web site, Rackspace uses Twitter to notify its customers of problems, and La Gesse said a “tweet” was posted within 10 minutes of the moment the issue became known. There is also a paid monitoring and notification service called Rack Watch, which proactively monitors individual customer sites and alerts the customers whenever there are issues.

Still, most customers are quicker to learn of the problems from other customers through sites like Twitter. “The information is out there before (Rack Watch) customers get notified because the Internet is … fast,” La Gesse said. “Customers are faster than systems can be.”

James Melvin, president of Apparent Networks – a young company that monitors performance of cloud-based service providers – said Rackspace “were quite timely in delivering information about this.”

In general, Melvin said Rackspace was a top-level provider among the companies Apparent Networks monitored, including Amazon Web Services, Google and GoGrid. “From what we’ve been able to see, Rackspace is one of the highest-performance providers you see out there.”

Melvin placed GoGrid at the top of the list in terms of cloud-service performance, followed closely by Rackspace and Amazon.

Apparent Networks CTO Matt Stevens said networking issues such as the one Rackspace faced on Dec. 18 were not common but added that problems were inevitable when engineers were faced with having to manage such complex networks.

“When you get into any kind of a multilayer distributed network, the reality is they are tough to manage,” he said. “You have an increase in complexity. You have an increased chance of configuration error.”

La Gesse (of Rackspace) agreed that such routing loops were fairly uncommon and attributed the problem to human error. “Industry-wide, I bet that most … outages we see are caused by human failure and I don’t see that changing any time soon.

“Humans are part of the Internet and we’re not going to get humans out of the Internet for a long time. It’s still about people and I hope it stays that way.”

Keywords: Rackspace, Rackspace outage, Rackspace downtime, routing loop, Apparent Networks, human error, colocation, cloud computing, data center downtime, Dallas data center, Laughing Squid


© DatacenterDynamics 2010