Salesforce's database outage: Why it happened and how to prevent another one

How important is sticking to the script? Important enough that when a script was inadvertently changed last month, Salesforce customers were 'victims' of an outage on one of the company's main servers, facing millions of dollars in losses.

Why the outage? Apparently, it was due to a change to a script for a database used by customers of Salesforce's Pardot SaaS marketing automation platform. The change 'broke' permissions and authorizations on databases, apparently throwing them open to anyone who could log in. As Salesforce worked to fix the problem, it took the system offline for nearly 20 hours.

Nightmare scenario

According to a statement by Salesforce, which otherwise has been very reticent about the issue: “On May 17, 2019, Salesforce blocked access to certain instances that contained customers affected by a database script deployment that inadvertently gave users broader data access than intended.”

Salesforce could have, and should have, done better. Ironically, the outage affected Salesforce's Pardot marketing automation system, which automatically looks for data on leads from Salesforce CRM information, and helps users build campaigns to reach new customers. The key here is that Pardot is automated, meaning that all the data involved is examined and evaluated in order to make accurate recommendations to users. And while automation is an integral component of Pardot, Salesforce failed to use it where it really counted - conducting an automated test of how the script changes would affect customers.

For companies whose business lives in the cloud - especially SaaS companies - “outage” is perhaps one of the most nightmarish words in their vocabulary. An outage means an inability to do business, an inability to collect payments, and an inability to satisfy customers – who, if they find themselves unable to get what they need from a company, may decide to take their business elsewhere.

Outage-related losses are not “on paper” or otherwise theoretical; they are as real as can be. According to Statista, 24 percent of global enterprises polled said that a single hour of server downtime in 2017 and 2018 cost them between $301,000 and $400,000. And, for 14 percent of companies, that loss topped $5 million – an hour! The stockholders of those companies are no doubt praying that their firms' data was not stored on Salesforce's NA14 server, where many instances of Salesforce apparently live.

While heavenly intervention would certainly be welcome, the victims of the Salesforce outage would probably want a more terrestrial guarantee from the company that such outages will be prevented in the future. This is far from the first time Salesforce has been down, although the latest outage is considered by most to be its worst ever. What could the company have done differently?

A wider issue

Indeed, this is a problem that goes beyond Salesforce, and beyond databases. Before any change is made to the configuration of a database, virtual machine, network, DNS, storage, or any other service of a system that serves departments, entire organizations, or hundreds of online customers, it's imperative to examine the impact of that change. Dependencies, scripts, algorithms, and much more could be the “victims” of changes that are not vetted in advance.

That seems to be what happened in the Salesforce outage. The security dependencies that protected data from unauthorized personnel were broken, and data was available to anyone in an organization who logged into the company's Salesforce account. The details of how that came about, of course, are likely to remain a Salesforce secret; but it's clear that those responsible for changing the database script did not vet the results of that change in advance.

But, as the change was done manually, there would have been no way for IT staff at Salesforce to determine the impact of the change anyway. With thousands of customers dependent on that database script, how could an IT team - even a huge one - examine all the scripts and permissions on the server? The answer, of course, is that they're unlikely to be able to. But the problem still needs resolving – and any resolution will require three elements:

Automation: As mentioned, there is too much code for any team to take into consideration when they test their work. An automated system that will parse the disparate details of an environment and test them against checkpoints is a basic requirement for performing any serious QA.
Visibility: Fortunately, many checkpoints do allow for automated testing, but are they testing for the right elements, or for all elements? And what about the dependencies between those elements? A good checkpoint is like night-vision goggles - with them, you become aware of issues and problems that are already inherent in the uploaded code, but wouldn't be visible without your special equipment.
Knowledge: Since newly uploaded code is constantly being fed through the pipeline, new issues are constantly surfacing. By automating and constantly updating the testing process and its content, and ensuring it's aligned with the up-to-date knowledge base and vendor best practices, you become aware of risks and problems before they impact your business. In addition, the ability to automatically record the circumstances and issues those tests uncovered makes it easier to resolve problems when they crop up, and provides better guidance on what to avoid in the future before deploying code in a production environment.

A proactive approach like this takes all these elements into account: it provides automated scanning and testing of the software-defined environment, as well as its staging environment, reveals the risks that need to be dealt with, and maintains and grows the knowledge base to ensure that current problems do not repeat themselves.
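To make the idea of an automated pre-deployment check concrete, here is a minimal sketch in Python. It assumes a team can capture a snapshot of effective permissions in a staging environment before and after a candidate script runs. Every name in it (Snapshot, Finding, find_broadened_access, gate_deployment, and the sample profile and permission names) is hypothetical and chosen for illustration; nothing here reflects Salesforce's actual tooling or data model.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical model: a snapshot maps each user profile to the set of
# (object, action) pairs it is allowed to perform. Illustrative only.
Permission = Tuple[str, str]
Snapshot = Dict[str, FrozenSet[Permission]]


@dataclass(frozen=True)
class Finding:
    """One profile that gained access it did not have before the change."""
    profile: str
    broadened: FrozenSet[Permission]


def find_broadened_access(before: Snapshot, after: Snapshot) -> List[Finding]:
    """Compare two snapshots and report every profile whose access grew."""
    findings = []
    for profile, after_perms in sorted(after.items()):
        # A profile absent from `before` is treated as having had no access.
        gained = after_perms - before.get(profile, frozenset())
        if gained:
            findings.append(Finding(profile, frozenset(gained)))
    return findings


def gate_deployment(before: Snapshot, after: Snapshot) -> bool:
    """Allow the change only if it broadens nobody's access."""
    return not find_broadened_access(before, after)


if __name__ == "__main__":
    # Assumed staging snapshots taken before and after running the script.
    before = {"marketing_user": frozenset({("leads", "read")})}
    after = {
        "marketing_user": frozenset({("leads", "read"), ("leads", "modify_all")})
    }
    for finding in find_broadened_access(before, after):
        print(finding.profile, "gained", sorted(finding.broadened))
    print("deploy allowed:", gate_deployment(before, after))
```

A check of this kind covers only one narrow risk, a change that grants broader access than intended. A real pipeline would likely need many such checkpoints, kept up to date as new issues are recorded.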

Salesforce knows how to automate processes - that's what Pardot is all about. To avoid future outages, it should apply the same strategy to the systems its customers are dependent upon.
