Microsoft Azure is currently experiencing problems that have left users unable to access its cloud service.
The company claims that the outage only affects its South Central US region, but users are not so sure - and, to make matters worse, the Azure status page appears to be down globally. (Update: Microsoft has published a post-mortem. Read an analysis here. The status page is now available, intermittently.) Further details are available in the updates below.
Maybe host the status page elsewhere?
The status page is still accessible via Google Cache. On it, Microsoft notes: "Starting at 09:29 UTC on 04 Sep 2018 a subset of customers in South Central US may experience difficulties connecting to resources hosted in this region. Engineers have isolated an issue with cooling in one part of the data center, which caused a localized spike in temperature, as the preliminary root-cause.
"Automated data center procedures to ensure data and hardware integrity went into effect when temperatures hit a specified threshold and critical hardware entered a structured power down process. The impact to the cooling system has been isolated and is in the process of being mitigated. Engineers are continuing to work towards restoration of services. The next update will be provided at 14:00 UTC or as events warrant."
The Azure Support Twitter page, which is still referring people to the inaccessible status page, has called the problem a "networking issue." It also claims that the outage only affects South Central US customers, although some users claim it has impacted other regions, including West Europe.
Update: Azure Support notes on Twitter: "Engineers are working on mitigating a networking issue in South Central US which has had downstream impact to AAD and other services in the other regions. Customers have been notified in the portal." Customers, however, have replied by noting that the portal is offline.
Overheating took Microsoft offline for 16 hours in 2013, while last year saw the cloud service suffer downtime due to a fire suppression incident.
Update 08:28 Wednesday 5 UTC: Azure Support writes: "Engineers are recovering impacted scale units and remaining Storage-dependent services in South Central US. Some services are gradually recovering. Mitigation efforts continue."
In a Twitter reply, Azure Support also clarified why other regions have been impacted: "At some level all of our Data centers are connected. So if one [fails] it will fall over to the other data centers. Also [a] customer in Europe might have some resources hosted in the affected Data Center."
On its status page, which now appears to load fully, Microsoft gave further details about the outage.
"A severe weather event, including lightning strikes, occurred near one of the South Central US datacenters. This resulted in a power voltage increase that impacted cooling systems. Automated datacenter procedures to ensure data and hardware integrity went into effect and critical hardware entered a structured power down process."
The severe weather impacted its San Antonio, Texas data center. Increasingly frequent extreme weather events and rising sea levels are expected to impact data centers and Internet infrastructure more and more often as anthropogenic climate change continues to affect the planet. Lightning strikes and severe weather have taken data centers offline before, including Fujitsu's Perth data center, the Singapore Stock Exchange's data center, and a Google data center. Last year, ABB's Bruno Roland told us the best ways to fend off lightning strikes, including the installation of Type 1 + 2 surge arresters to protect incoming power lines and sensitive circuits.
In this instance, nearly 40 Azure services were impacted, along with Office 365 services including Exchange, SharePoint and Teams. Microsoft is still working to fully return services to normal.
The company writes: "Engineers are prioritizing the restoration of Storage resources in order to recover all services with dependencies on these impacted resources. As storage mitigation continues to progress, a necessary extended mitigation phase is required. The current mitigation workflow is outlined below:
1) Restore power to the South Central US datacenter (COMPLETED)
2) Recover software load balancers for Azure Storage scale units in South Central US (COMPLETED)
3) Recover impacted Azure Storage scale units in South Central US. (Mostly complete)
4) Recover the remaining Storage-dependent services in South Central US (Mostly complete)"
Update 15:00 Wednesday 5 UTC: Microsoft writes: "Engineers have restored storage availability for the majority of impacted services, and customers should be continuing to see improvements to service availability." Comments on social media suggest that some users are still experiencing problems.
Update 07:00 Thursday 6 UTC: Microsoft's latest comments remain the same as the above update. A software update pushed out after the outage also caused problems with Office 365 and Skype.
Update 07:00 Friday 7 UTC: "Engineers have restored storage availability for the majority of impacted services, and customers should be continuing to see improvements to service availability. Services outside of the region such as Azure Active Directory, Visual Studio Team Services, and Azure Resource Manager may have experienced impact, but this impact has been mitigated."
Update 18:00 Friday 7 UTC: "Services are operating normally," Microsoft Azure states.
This story is developing; we will update as we learn more.