Users are not able to search for users, assign fields, or mention users
Incident Report for Jira Software
Postmortem

SUMMARY

On May 03, 2022, between 13:21 and 14:12 UTC, a majority of AMER and EMEA customers using Jira Software, Jira Service Management, or Confluence Cloud, as well as all customers of Bitbucket Cloud, were unable to search for users and teams, mention users and teams, assign issues and tickets, and perform some JQL queries. The incident was triggered by a severe throughput drop following a routine upgrade of a primary database cluster in our eu-west region, followed by the failure of multiple levels of redundancy. Our us-east and eu-west regions were affected.

The incident was detected within one minute by our monitoring systems and our on-call engineer responded approximately four minutes after impact began. The team was able to mitigate this issue by shedding some of the load on our us-east region by reversing the failover for team search. Our SLOs returned to normal after approximately 51 minutes. Over the following 48 hours, our engineers monitored the situation and worked to restore the redundant systems to normal operation.

IMPACT

The overall impact occurred between May 03, 2022, 13:21 UTC and May 03, 2022, 14:12 UTC. The incident caused service disruption to AMER and EMEA customers using Jira Software, Jira Service Management, or Confluence Cloud, as well as all customers of Bitbucket Cloud. Specifically, the features below were impacted during this time:

Jira Software & Jira Service Management

  • Assigning issues and tickets to users
  • Mentioning users or teams
  • Some JQL queries involving user details (for example, queries filtering on assignee or reporter)
  • Searching for users
  • Searching for teams

Confluence Cloud

  • Mentioning users or teams
  • Searching for users
  • Searching for teams

Bitbucket Cloud

  • Mentioning users or teams
  • Searching for users
  • Searching for teams

ROOT CAUSE

The user search service, a central service responsible for storing and returning user search results for our cloud products, has a primary and a redundant database cluster provisioned in each of five globally distributed regions. Each database cluster is intended to be large enough to support all search traffic in its region in case of a database failure or database maintenance. In the weeks leading up to the incident, we saw search response latency increase in our eu-west region, which generally indicates under-scaled clusters.
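
To make the sizing invariant above concrete, the simplified sketch below models a region with a primary and a redundant cluster, each of which must be able to carry the region's full peak search traffic on its own. The class names and throughput figures are illustrative assumptions, not our actual topology or numbers.

    # Simplified model of the per-region redundancy invariant described above.
    # All names and figures are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Cluster:
        name: str
        capacity_qps: int          # sustained query throughput the cluster can serve

    @dataclass
    class RegionDeployment:
        region: str
        peak_traffic_qps: int      # expected regional peak search traffic
        primary: Cluster
        redundant: Cluster

        def is_failure_tolerant(self) -> bool:
            # Either cluster alone must cover peak traffic, so maintenance on or
            # failure of the other cluster does not degrade the service.
            return (self.primary.capacity_qps >= self.peak_traffic_qps
                    and self.redundant.capacity_qps >= self.peak_traffic_qps)

    eu_west = RegionDeployment(
        region="eu-west",
        peak_traffic_qps=10_000,                        # hypothetical figure
        primary=Cluster("eu-west-primary", 12_000),     # hypothetical figure
        redundant=Cluster("eu-west-redundant", 9_000),  # hypothetical: under-scaled
    )
    print(eu_west.is_failure_tolerant())  # False -> a scale-up is needed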

The team began scaling up the clusters in our eu-west region. Work started on May 02, 2022 at 19:44 UTC: we began scaling up our primary database cluster and updated our configuration so that all search traffic in eu-west was served from the redundant database cluster. The work started while eu-west traffic was off-peak, giving us a 10-hour window to minimize risk.
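
The configuration change involved can be pictured roughly as in the sketch below: during maintenance of the primary cluster, regional search traffic is pointed at the redundant cluster. The configuration shape and function name are hypothetical, not our actual routing system.

    # Hypothetical sketch of redirecting a region's search traffic to its
    # redundant cluster while the primary is being scaled up.
    ROUTING = {
        "eu-west": {"serve_from": "primary"},   # normal operation
    }

    def begin_primary_maintenance(region: str) -> None:
        # Point all search traffic in the region at the redundant cluster so
        # the primary can be scaled up without serving live queries.
        ROUTING[region]["serve_from"] = "redundant"

    begin_primary_maintenance("eu-west")   # eu-west now served by the redundant cluster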

This routine upgrade was expected to provide a 20% increase in throughput. However, for a reason still under investigation by Amazon Web Services, throughput in that database cluster instead dropped by at least 75%. This left our eu-west region without a usable primary database cluster, and with a redundant database cluster not yet scaled to handle peak traffic. The team spent the next 10 hours, with support from AWS, attempting to restore the degraded database cluster before peak traffic arrived.

Unfortunately, the team was unable to recover the performance of our primary database cluster, and the redundant database cluster was unable to handle peak load in our eu-west region. As soon as search latencies exceeded SLOs, the team triggered a failover of all eu-west traffic to our us-east region. Our us-east region's traffic was off-peak, and the us-east database clusters are significantly larger than the database clusters in other regions; these larger clusters are designed to let us safely absorb region failovers. The failover occurred on May 03, 2022 at 08:37 UTC. Approximately 5 hours later, on May 03, 2022 at 13:25 UTC, the on-call engineer was paged due to multiple node failures in our us-east region.
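
The failover decision can be pictured roughly as follows; the latency threshold, metric, and route table are illustrative assumptions rather than our actual tooling.

    # Hypothetical sketch of an SLO-triggered regional failover.
    LATENCY_SLO_MS = 200                     # assumed p99 latency objective
    REGION_ROUTE = {"eu-west": "eu-west", "us-east": "us-east"}

    def check_and_failover(region: str, observed_p99_ms: float, target: str) -> None:
        # If the region can no longer meet its latency SLO, send its search
        # traffic to a larger region provisioned to absorb failovers.
        if observed_p99_ms > LATENCY_SLO_MS:
            REGION_ROUTE[region] = target

    check_and_failover("eu-west", observed_p99_ms=850.0, target="us-east")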

In rare situations, a database cluster can experience node failures due to a variety of causes, and it can self-heal by relocating its data onto new nodes. The user search team has built a mechanism that forces all traffic in a region to the redundant cluster immediately after a node failure is detected. This mechanism allows the user search service to meet its latency and reliability SLOs in that region and reduces the load on the unhealthy database cluster to expedite its recovery.
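
In simplified form, that mechanism behaves roughly like the sketch below; the detection inputs and names are assumptions, not the actual implementation.

    # Hypothetical sketch of forcing traffic to the redundant cluster as soon
    # as a node failure is detected in the serving cluster.
    def on_node_health_change(region_routing: dict, healthy_nodes: int, total_nodes: int) -> None:
        # Any lost node triggers the switch: the unhealthy cluster is relieved
        # of traffic so it can relocate data onto new nodes and recover faster,
        # while the redundant cluster keeps latency within SLO.
        if healthy_nodes < total_nodes:
            region_routing["serve_from"] = "redundant"

    routing = {"serve_from": "primary"}
    on_node_health_change(routing, healthy_nodes=5, total_nodes=6)  # -> "redundant"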

Unfortunately, this put four times the normal amount of traffic on our redundant database cluster in our us-east region. This amount of load is well outside our current design specifications, and the database cluster could not keep up. User search requests began to time out, and the products that rely on this core user search service were unable to return search results to our customers. The team mitigated the issue by shedding some of the load on our us-east region: the failover was reversed for team search only, a low-volume but resource-intensive query type. This reduction was enough to stabilize the us-east cluster so it could handle peak traffic. Our SLOs returned to normal after approximately 51 minutes. Over the following 48 hours, our engineers monitored the situation and worked to restore the redundant systems to normal operation.
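
The load-shedding step can be sketched roughly as follows, with the query-type names and routing table as illustrative assumptions: only team-search traffic is routed back to its home region, while user search remains in us-east.

    # Hypothetical sketch of partially reversing a failover for one query type.
    QUERY_ROUTE = {
        "user-search": "us-east",   # failed over from eu-west
        "team-search": "us-east",   # failed over from eu-west
    }

    def shed_team_search_load() -> None:
        # Team search is low-volume but resource-intensive, so returning it to
        # its home region removes a disproportionate share of us-east load.
        QUERY_ROUTE["team-search"] = "eu-west"

    shed_team_search_load()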

REMEDIAL ACTIONS PLAN & NEXT STEPS

While we have a number of redundancies in place, there is more we can do to prevent cascading failures across multiple redundant systems. We are prioritizing the following improvements to avoid repeating this type of incident:

  • Temporarily scale up all database clusters above design specification in every region hosting the user search service, so the service can handle multiple region failovers gracefully.
  • Set up a periodic review of database cluster performance to decide whether clusters should be temporarily or permanently scaled up to keep up with demand.
  • Reprioritize our efforts toward the rollout of an upcoming architecture change designed to significantly improve the performance and resiliency of the service.
  • Update the user search service team's runbooks on how to better respond to multi-region failures.
  • Continue working with AWS to fully understand the root cause of the failed database cluster upgrade.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted May 17, 2022 - 16:19 UTC

Resolved
Service is now fully operational
Posted May 04, 2022 - 04:18 UTC
Update
We are continuing to monitor for any further issues.
Posted May 03, 2022 - 15:35 UTC
Monitoring
Between 1:21 PM UTC and 2:12 PM UTC, some customers experienced issues with user search, assigning fields, and user mentions. The root cause was a scaling failure in the service responsible for user search across multiple products. We have deployed a fix to mitigate the issue and have verified that the services have recovered. The conditions that caused the issue have been addressed, and we are actively working on a permanent fix. The issue has been resolved and the service is operating normally.
Posted May 03, 2022 - 15:33 UTC
Identified
We are investigating an issue with user search, assigning fields, and user mentions that is impacting some Confluence, Jira Service Management, Jira Software, and Bitbucket customers. We will provide more details within the next hour.
Posted May 03, 2022 - 14:19 UTC
This incident affected: Create and edit, and Search.