On May 03, 2022, between 13:21 and 14:12 UTC, a majority of AMER and EMEA customers using Jira Software, Jira Service Management, or Confluence Cloud, as well as all customers of Bitbucket Cloud, were unable to search for users and teams, mention users and teams, assign issues and tickets, and perform some JQL queries. The incident was triggered by a throughput degradation that followed a routine upgrade of a primary database cluster in our eu-west region, compounded by the failure of multiple levels of redundancy. Our us-east and eu-west regions were affected.
The incident was detected within one minute by our monitoring systems and our on-call engineer responded approximately four minutes after impact began. The team was able to mitigate this issue by shedding some of the load on our us-east region by reversing the failover for team search. Our SLOs returned to normal after approximately 51 minutes. Over the following 48 hours, our engineers monitored the situation and worked to restore the redundant systems to normal operation.
The overall impact was between May 03, 2022, 13:21 UTC and May 03, 2022, 14:12 UTC. The incident caused service disruption to AMER and EMEA customers using Jira Software, Jira Service Management, or Confluence Cloud as well as all customers of Bitbucket Cloud. Specifically, the features below were impacted during this time:
Jira Software & Jira Service Management
The user search service, a central service responsible for storing and returning user search results from our cloud products, has two redundant database clusters provisioned in each of five globally distributed regions. Each database cluster is intended to be large enough to support all of a region's search traffic on its own in case of a database failure or database maintenance. Over the couple of weeks before the incident, we saw latency for search responses increase in our eu-west region, which generally indicates under-scaled clusters.
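This sizing rule can be sketched in a few lines. The numbers and names below are illustrative assumptions, not Atlassian's actual figures: the point is only that either cluster in a region must be able to absorb the region's full peak traffic by itself.

```python
# Hypothetical per-region peak search traffic, in queries per second.
PEAK_SEARCH_QPS = {"eu-west": 8_000, "us-east": 20_000}

def cluster_is_safely_sized(capacity_qps: float, region: str) -> bool:
    """True if a single cluster could serve all of the region's peak
    search traffic on its own, e.g. during failure or maintenance of
    the other cluster in the region."""
    return capacity_qps >= PEAK_SEARCH_QPS[region]
```

Under this rule, a cluster that can serve only part of a region's peak is under-scaled even if the region as a whole is currently keeping up.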
The team began the process of scaling up our clusters in our eu-west region. We began work on May 02, 2022 at 19:44 UTC to scale up our primary database cluster, and updated our configuration so that all search traffic in our eu-west region was served from our redundant database cluster. The process started when our eu-west traffic was off-peak, giving us a 10-hour window to minimize risk.
This routine upgrade was expected to provide a 20% increase in throughput. Instead, for a reason still under investigation by Amazon Web Services, throughput in that database cluster dropped by at least 75%. This left our eu-west region without a healthy primary database cluster, and with a redundant database cluster not yet scaled to handle peak traffic. The team spent the next 10 hours, with support from AWS, attempting to fix the degraded database cluster before peak traffic arrived.
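The gap between the planned and observed outcome can be put in rough numbers, normalizing to the cluster's pre-upgrade throughput (the figures below simply restate the percentages above):

```python
baseline = 1.00                    # pre-upgrade throughput, normalized
expected = baseline * 1.20         # upgrade planned for a 20% increase
observed = baseline * (1 - 0.75)   # instead, at least a 75% drop

# The redundant cluster therefore had to cover nearly all of the
# region's capacity before its own scale-up had finished.
shortfall = expected - observed    # 0.95 of baseline capacity missing
```

In other words, the region ended the maintenance window with roughly a quarter of its pre-upgrade primary capacity rather than 1.2 times it.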
Unfortunately, the team was unable to recover the performance of our primary database cluster, and the redundant database cluster could not handle peak load in our eu-west region. As soon as search latencies exceeded SLOs, the team triggered a failover of all eu-west traffic to our us-east region. Our us-east region's traffic was off-peak, and the us-east database clusters are significantly larger than those in other regions; these larger clusters are designed to let us safely absorb region failovers. The failover occurred on May 03, 2022 at 08:37 UTC. Approximately 5 hours later, on May 03, 2022 at 13:25 UTC, the on-call engineer was paged due to multiple node failures in our us-east region.
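The failover decision above follows a simple rule: once search latency in a region breaches its SLO, that region's traffic is redirected to a larger region. A minimal sketch, with an assumed threshold (not Atlassian's real SLO values):

```python
P99_LATENCY_SLO_MS = 500  # hypothetical p99 latency target

def should_fail_over(p99_latency_ms: float) -> bool:
    """Trigger a cross-region failover once search latency in the
    region exceeds its SLO."""
    return p99_latency_ms > P99_LATENCY_SLO_MS
```

The trade-off is that the receiving region must be sized for its own peak plus the failed-over traffic, which is why the us-east clusters are deliberately larger.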
In rare situations, the database cluster can have a node failure due to a variety of causes. The database cluster can self-heal by relocating its data onto new nodes. The user search team has built a mechanism to force all traffic to the redundant cluster in that region immediately after a node failure is detected. This mechanism allows the user search service to meet its latency and reliability SLOs in that region and reduces the load on the unhealthy database cluster to expedite its recovery.
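The mechanism above can be sketched as a routing rule, using assumed names: on a detected node failure in a region's primary cluster, force all of that region's user-search traffic onto the redundant cluster so the unhealthy cluster can relocate its data and recover.

```python
def select_cluster(primary_node_failed: bool) -> str:
    """Choose which cluster serves a region's user-search traffic.

    On a node failure in the primary, all traffic is shed to the
    redundant cluster so the primary can self-heal undisturbed."""
    if primary_node_failed:
        return "redundant"
    return "primary"
```

Note the implicit assumption that bites later in this incident: the redundant cluster must be able to carry whatever load the primary was serving at the moment of failure.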
Unfortunately, this put four times the normal amount of traffic on our redundant database cluster in our us-east region. This load is well outside of our current design specifications, and the database cluster could not keep up. User search requests began to time out, and the products that make requests to this core user search service were unable to return search results to our customers. The team mitigated the issue by shedding some of the load on our us-east region: we reversed the failover for team search only, a low-touch but resource-intensive query. This stabilized the us-east cluster enough to handle peak traffic. Our SLOs returned to normal after approximately 51 minutes. Over the following 48 hours, our engineers monitored the situation and worked to restore the redundant systems to normal operation.
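The mitigation amounts to routing by query type rather than all-or-nothing failover. A minimal sketch, with assumed region and query-type names: ordinary user search stays failed over to us-east, while the resource-intensive team-search queries are sent back to eu-west, shedding enough load for us-east to survive its peak.

```python
def route_search(query_type: str) -> str:
    """Pick a serving region per query type after the partial
    fail-back: team search goes home, everything else stays
    failed over to the larger region."""
    if query_type == "team_search":
        return "eu-west"    # failover reversed for team search only
    return "us-east"        # all other search traffic remains failed over
```

Shedding one expensive query class is a smaller, lower-risk change than reversing the whole failover onto a still-degraded region.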
While we have a number of redundancies in place, there is more we can do to prevent cascading failures across multiple redundant systems in the future. We are prioritizing the following improvement actions to avoid repeating this type of incident:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Atlassian Customer Support