User searches in EU West failing
Incident Report for Jira Software
Postmortem

SUMMARY

On November 2, 2021, from 1:36 PM UTC to 6:30 PM UTC, some customers using Confluence Cloud, Jira Cloud, and certain developer APIs had a degraded experience or were unable to perform user searches or team searches in user fields (such as assignee fields, user filters, or mentions) or in CQL queries. A similar outage occurred on October 11, 2021, and we know that a repeated incident is unacceptable.

Several of our products and developer APIs use an internal service called Cross-Product User Search (CPUS). The CPUS service is responsible for providing GDPR-compliant user and team search to our customers and internal services.

At 1:09 PM UTC, prior to the incident, we were alerted that a queue owned by CPUS had begun to backlog beyond normal operating latency. The events in this queue tell the CPUS service that a user modified their profile, account access, or privacy settings in any of the affected products. This influx of events is normal during peak times and during cloud migrations. As part of our standard operating procedure, we increased the throughput of the service responsible for processing these events in all affected regions beyond our standard auto-scaling limits. This action is considered a normal operational task and, until this incident, had not caused an outage. However, in our western Europe region, prod-euwest, the database clusters could not handle the significant increase in updates.

In each region, CPUS comprises redundant database clusters, a searcher service, and an indexing service. To maintain data integrity, both database clusters are updated simultaneously by the indexing service. Unfortunately, neither database cluster was able to handle the increase in updates from the indexer, which caused a significant increase in search latency. At 1:38 PM UTC, two minutes after the increase in latency, we were paged. Since both database clusters in prod-euwest were considered non-operational, we made the decision to fail over all search requests in prod-euwest to prod-east at 2:07 PM UTC. This added roughly 50 milliseconds of latency to every request from our EMEA customers but allowed those customers to perform user and team searches again at 2:12 PM UTC. We spent the next several hours diagnosing the root cause and clearing out the backlog of events. At 6:13 PM UTC, we reverted the failover, and at 6:30 PM UTC we returned to normal operating latency for our EMEA customers.
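
For illustration, the dual-write pattern described above can be sketched as follows. This is a minimal Python sketch with hypothetical names; it is not the actual CPUS indexer, but it shows why a surge of updates loads both clusters at the same time, leaving no lightly loaded copy to serve searches.

    # Minimal sketch of the dual-write indexing pattern described above.
    # Class and method names are hypothetical placeholders.

    class IndexingError(Exception):
        """Raised when an update could not be applied to a cluster."""

    class Cluster:
        def __init__(self, name):
            self.name = name
            self.healthy = True

        def apply(self, event):
            # In the real service this would write the profile or
            # permission change into the cluster's search index.
            if not self.healthy:
                raise IndexingError(f"{self.name} rejected update")

    def index_event(event, clusters):
        """Apply one profile-change event to every redundant cluster.

        Both clusters receive every update so that either one can serve
        search traffic alone; the trade-off, seen in this incident, is
        that a spike in updates stresses both clusters simultaneously.
        """
        failures = []
        for cluster in clusters:
            try:
                cluster.apply(event)
            except IndexingError as exc:
                failures.append((cluster.name, exc))
        return failures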

Similar to the outage on October 11, 2021, we experienced node drops in our database clusters. This time, we gathered enough evidence to explain why these node drops caused a considerable increase in search latency, and we began updating the cluster configuration to apply a long-term fix in all regions.

IMPACT

The overall impact was on November 2, 2021, between 1:36 PM UTC and 6:30 PM UTC, with a slower but operational search experience for our EMEA customers starting at 2:12 PM UTC. During this time, the outage affected Confluence Cloud, the Jira Cloud family of products (Jira Software, Jira Service Management, Jira Work Management), team search developer APIs, and people directory search developer APIs. Depending on the service or experience, this resulted in an HTTP 503 response (Service Unavailable), an HTTP 500 response (Internal Server Error), increased response times, or requests that failed due to timeout. This included client requests for user mentions, user or team searches, lookups of users in user fields such as Assignee in Jira, certain CQL queries in Confluence, and third-party apps that use the Atlassian user search APIs.

ROOT CAUSE

Internally, CPUS indexes and performs search queries against two redundant database clusters in each of our five service regions. This redundancy is key to maintaining our 99.99% availability SLO and zero downtime during upgrades.

During normal operation, user search traffic is split evenly between these two database clusters. If our auto-failover mitigation system detects that one of the two database clusters is in an unhealthy state, all traffic is redirected to the healthy database cluster and we monitor the situation until both database clusters are healthy again. This operation is usually seamless, and our customers are not impacted. Auto-failovers do happen occasionally and are a response to issues outside of Atlassian's control. Even with the updated auto-failover system developed after the October 11 incident, the system was unable to fail over because it detected issues in both clusters, and manual intervention was still required.
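
The decision logic can be illustrated with a short sketch. This is a deliberately simplified Python example with hypothetical names, not the real mitigation system; it only shows why two simultaneously unhealthy clusters fall outside what the automation can handle.

    # Simplified sketch of the auto-failover routing decision described above.
    def route_search_traffic(primary_healthy, secondary_healthy):
        """Return the share of search traffic each cluster should receive,
        or None when automation cannot recover and an operator must step in
        (the situation in prod-euwest during this incident)."""
        if primary_healthy and secondary_healthy:
            return {"primary": 0.5, "secondary": 0.5}   # normal 50/50 split
        if primary_healthy:
            return {"primary": 1.0, "secondary": 0.0}   # seamless failover
        if secondary_healthy:
            return {"primary": 0.0, "secondary": 1.0}   # seamless failover
        return None  # both unhealthy: page the on-call for manual intervention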

As our customer base grows, so does our user search database. We discovered that the configuration that determines how each database cluster distributes the data stored in its index was no longer sufficient for the scale of our clusters. This outdated configuration made node recovery excessively stressful, since more data had to be redistributed during a failure. Additionally, these recoveries took longer because the disk read throughput in our service configuration was not high enough to handle node recovery, indexing, and searching simultaneously.
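
As a concrete but hypothetical illustration: in many distributed search stores, the setting that controls data distribution is a primary shard count that is fixed when an index is created, so changing it requires creating a new index and reindexing into it, which is why a rebuild appears in the action items below. The sketch assumes an Elasticsearch-style settings API; the report does not name the underlying technology, and the URL, index name, and values are placeholders.

    # Hypothetical illustration only; endpoint, index name, and values are placeholders.
    import requests

    NEW_INDEX_URL = "https://cpus-cluster.example.internal/user-search-v2"

    settings = {
        "settings": {
            "index": {
                # More primary shards means less data has to be moved per
                # node during recovery at the cluster's current scale.
                "number_of_shards": 12,
                "number_of_replicas": 1,
            }
        }
    }

    # Create the new index with the updated distribution settings, then
    # reindex existing documents into it (the rebuild listed below).
    response = requests.put(NEW_INDEX_URL, json=settings, timeout=30)
    response.raise_for_status()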

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity.  After the immediate impact of this outage was resolved, the incident response team completed a technical analysis of the root cause and contributing factors. The team has conducted a post-incident review to determine how we can avoid the impact of this kind of outage in the future. The following is a list of high-priority action items that will be implemented to augment existing testing, monitoring, and deployment practices:

  • Update the configuration that controls how data is distributed within each of the 10 CPUS database clusters.
  • Update the disk read throughput on all 10 CPUS database clusters. We have already completed the 6 high-risk clusters, and the remaining 4 low-risk clusters are being updated now.
  • Rebuild the event search index on all 10 CPUS database clusters. This step applies the new cluster configuration from the first action and redistributes the existing data. We have completed the reindex in our highest-risk database clusters and are now working through the remaining clusters in order of risk. The disk read throughput upgrade gives us ample time to complete this rebuild safely.
  • Investigate options to automatically delay event processing when our indexing service detects increased latency, prioritizing user search requests (a rough sketch of one possible approach follows this list).
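
The last item above can be illustrated with a rough sketch of one possible backpressure approach. The names, thresholds, and queue interface below are hypothetical; this is an option under investigation, not a shipped design.

    # Rough sketch of latency-aware backpressure for event processing.
    import time

    SEARCH_LATENCY_BUDGET_MS = 200   # illustrative threshold
    BACKOFF_SECONDS = 5              # illustrative pause between checks

    def process_events(queue, indexer, search_latency_p99_ms):
        """Drain profile-change events while yielding to search traffic.

        When observed search latency exceeds the budget, event indexing
        pauses so cluster capacity is spent on user searches first; the
        backlog is drained once latency returns to normal.
        """
        while not queue.empty():
            if search_latency_p99_ms() > SEARCH_LATENCY_BUDGET_MS:
                time.sleep(BACKOFF_SECONDS)   # delay indexing, keep searches fast
                continue
            indexer.index(queue.get())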

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Nov 10, 2021 - 19:20 UTC

Resolved
Between 1330 and 1430 GMT, we experienced issues related to user searches for EMEA customers for Confluence, Jira Work Management, Jira Service Management, and Jira Software. The issue has been resolved and the service is operating normally.
Posted Nov 02, 2021 - 18:48 UTC
Monitoring
We are investigating cases of degraded performance for some Confluence, Jira Work Management, Jira Service Management, and Jira Software Cloud customers. We will provide more details within the next hour.
Posted Nov 02, 2021 - 14:24 UTC
This incident affected: Create and edit, Search, Administration, Mobile, and Automation for Jira.