A simple coding error interrupted Microsoft Azure DevOps services in the South Brazil region for about ten hours. Eric Mattingly, a software engineering manager at Microsoft, apologized for the outage on Friday and disclosed its cause: a typo led to the deletion of seventeen production databases.
Azure DevOps provides an integrated set of services and tools for managing software projects, from planning and development to testing and deployment. Mattingly explained that Azure DevOps engineers sometimes take snapshots of production databases to investigate reported issues or test performance improvements. A background job runs daily and deletes snapshots once they pass a certain age. Recently, Azure DevOps engineers carried out a code upgrade that replaced the deprecated Microsoft.Azure.Management.* packages with the supported Azure.ResourceManager.* NuGet packages. The result was a large pull request that swapped API calls in the old packages for their equivalents in the new ones.
The typo was in this pull request: it replaced the call that deletes a snapshot database with a call that deletes the Azure SQL Server hosting the database. Azure DevOps has tests designed to catch such issues, but Mattingly said that because the faulty code runs only under certain conditions, the existing tests did not cover it.
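To make the failure mode concrete, here is a minimal sketch of a snapshot cleanup job together with the kind of check that exercises the "old snapshot exists" branch. It is not Microsoft's code and does not use the Azure SDK: `FakeSqlServer`, `cleanup_old_snapshots`, the `snapshot-` naming prefix, and the seven-day retention period are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class FakeSqlServer:
    """In-memory stand-in for a SQL server (hypothetical, not an Azure SDK type)."""
    name: str
    databases: dict = field(default_factory=dict)  # database name -> creation time
    deleted: bool = False

    def delete_database(self, db_name: str) -> None:
        # Removes a single database and leaves the server in place.
        del self.databases[db_name]

    def delete_server(self) -> None:
        # Removes the server and everything on it; the cleanup job
        # should never need to call this.
        self.deleted = True
        self.databases.clear()


def cleanup_old_snapshots(server, now, max_age, snapshot_prefix="snapshot-"):
    """Delete snapshot databases older than max_age and return their names."""
    removed = []
    for db_name, created in list(server.databases.items()):
        if db_name.startswith(snapshot_prefix) and now - created > max_age:
            server.delete_database(db_name)  # database-level delete only
            removed.append(db_name)
    return removed


# Seed the fake server with an old snapshot so the deletion branch really runs.
now = datetime(2024, 1, 1)
server = FakeSqlServer("scale-unit-sql", {
    "prod-01": now - timedelta(days=400),
    "snapshot-prod-01": now - timedelta(days=30),
    "snapshot-prod-02": now - timedelta(days=1),
})
removed = cleanup_old_snapshots(server, now, max_age=timedelta(days=7))
```

The point of seeding an old snapshot is that a test without one never reaches the delete call, so a swapped call would go unnoticed. With it in place, a test can assert that only the expired snapshot is gone and that the server and the production database are untouched.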
A few days later, the change was deployed to a customer-facing environment, the South Brazil scale unit (a cluster of role-specific servers). That environment contained an old snapshot database that triggered the bug, causing the background job to delete “the entire Azure SQL Server and all seventeen production databases”.
All data was recovered, but recovery took more than ten hours. Mattingly gave several reasons. One is that customers cannot restore an Azure SQL Server themselves, so the on-call Azure engineers had to handle it, which took about an hour. Another is that the databases had different backup configurations: some were set up for zone-redundant backup and others for the newer geo-zone-redundant backup, and reconciling this mismatch added significantly to the recovery time.
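A mixed backup configuration of this kind can be spotted ahead of time with a simple inventory check. The sketch below is illustrative only: the database names and the `older-setting` / `newer-setting` labels are placeholders, and nothing here queries Azure.

```python
from collections import defaultdict


def group_by_backup_setting(databases):
    """Group database names by their backup redundancy setting."""
    groups = defaultdict(list)
    for name, setting in databases.items():
        groups[setting].append(name)
    return dict(groups)


def has_mixed_backup_settings(databases):
    """Return True when the databases do not all share one backup setting."""
    return len(set(databases.values())) > 1


# Hypothetical inventory: database name -> backup redundancy label.
inventory = {
    "db-a": "older-setting",
    "db-b": "newer-setting",
    "db-c": "older-setting",
}
groups = group_by_backup_setting(inventory)
mixed = has_mixed_backup_settings(inventory)
```

Run periodically, a check like this would flag a scale unit whose databases would need different restore procedures before an incident forces the question.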
To prevent a recurrence, Mattingly said, Microsoft has rolled out a number of fixes and configuration changes. He again apologized to all customers affected by the outage.