Jira Cloud performance degraded
Incident Report for Jira Software
Postmortem

SUMMARY

Between 11:22 AM UTC on September 21, 2021, and 10:26 PM UTC on September 22, 2021, some Atlassian customers using Jira Software, Jira Service Management, and Jira Work Management products were intermittently unable to create issues or receive emails, and experienced overall performance degradation. The event was triggered by a feature rollout. 

The change introduced minor modifications to the permission check component. During the production rollout, misbehaving safe-rollout checks impacted some customers in the US-East and US-West2 regions. The incident was detected within 1 hour and 20 minutes by an internal alerting system and mitigated by reversing the feature rollout, which put Atlassian systems back into a known good state. The total time to resolution was about 35 hours and 4 minutes (9 hours and 3 minutes to resolve the incident on its first occurrence, then 12 hours and 1 minute when the same incident recurred the following day).

IMPACT

The overall impact was between 11:22 AM UTC on September 21, 2021, and 10:26 PM UTC on September 22, 2021, for customers of the Jira Software, Jira Service Management, and Jira Work Management products. The incident caused service disruption only for customers in the US-East and US-West2 regions, who were intermittently unable to create issues or receive emails and experienced performance degradation. 

ROOT CAUSE

The issue was caused by a bug in a component responsible for safe feature rollouts: JQL Consistency Checks. When the new feature, introducing minor modifications to the permission check component, was rolled out and enabled for checks on a small percentage of the production fleet in two regions, the asynchronous background jobs validating the feature started failing. These failing jobs also increased memory pressure on the servers, leading to long JVM Garbage Collection pauses and eventually to server crashes. The long Garbage Collection pauses affected the database replication latency health checks, which in turn degraded the performance of remaining database operations in Jira Software, Jira Service Management, and Jira Work Management products, in some cases resulting in a complete outage. In summary, the root cause of the incident was a bug in the JQL Consistency Checks component used to validate new features being rolled out to production.
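For illustration only: this report does not publish internal code, but a minimal Java sketch along the lines below, assuming a hypothetical background job that compares the legacy and new permission-check paths and buffers every mismatch in memory, shows how a misbehaving validation can translate into heap pressure and long JVM Garbage Collection pauses. All names and structure here are our own assumptions, not Atlassian's implementation.

    // Hypothetical illustration, not Atlassian's code: an asynchronous
    // consistency-check job that validates a new permission-check path
    // against the legacy one and keeps every mismatch in memory.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;

    public class PermissionConsistencyCheck {

        // Assumed shape of a recorded discrepancy between the two code paths.
        record Mismatch(String issueKey, boolean legacyResult, boolean newResult) {}

        // Unbounded buffer: when the feature under validation misbehaves at
        // scale, this list grows for the lifetime of the job and drives heap
        // pressure and long GC pauses.
        private final List<Mismatch> mismatches = new ArrayList<>();

        public CompletableFuture<Void> runAsync(List<String> issueKeys) {
            return CompletableFuture.runAsync(() -> {
                for (String key : issueKeys) {
                    boolean legacy = legacyPermissionCheck(key);
                    boolean candidate = newPermissionCheck(key);
                    if (legacy != candidate) {
                        mismatches.add(new Mismatch(key, legacy, candidate));
                    }
                }
            });
        }

        // Placeholders standing in for the real permission evaluators.
        private boolean legacyPermissionCheck(String issueKey) { return true; }
        private boolean newPermissionCheck(String issueKey) { return false; }
    }

Capping, sampling, or streaming such results out of process is one way the "harden the component" action described below could prevent this kind of memory pressure.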

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this issue was not identified beforehand because the change involved a very specific kind of legacy case that was not covered by our automated continuous deployment test suites or manual test scripts.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • Harden the component responsible for safe feature rollouts (JQL Consistency Checks) to prevent increased memory pressure.

The following improvement actions will help lower time-to-detect and time-to-resolution for this class of incidents:

  • Fix the alert that detects background workers running out of memory (already completed); a sketch of the kind of heap check behind such an alert follows this list.
  • Provision a dedicated machine for analyzing heap dumps, which has enough RAM to load them into memory.
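
As a minimal sketch of the kind of in-process heap check such an alert could rely on (an assumption on our part; this report does not describe Atlassian's actual alerting implementation), a watchdog can periodically sample JVM heap usage and raise an alert when it exceeds a threshold:

    // Hypothetical watchdog, for illustration only: samples JVM heap usage
    // and emits an alert when it exceeds an assumed 90% threshold.
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class HeapPressureWatchdog {

        private static final double ALERT_THRESHOLD = 0.90; // assumed threshold

        public static void main(String[] args) {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();

            scheduler.scheduleAtFixedRate(() -> {
                MemoryUsage heap = memory.getHeapMemoryUsage();
                long max = heap.getMax();
                if (max > 0 && (double) heap.getUsed() / max > ALERT_THRESHOLD) {
                    // In production this would page on-call through the
                    // alerting system rather than print to stderr.
                    System.err.printf("ALERT: heap usage above %.0f%% of max%n",
                            ALERT_THRESHOLD * 100);
                }
            }, 0, 30, TimeUnit.SECONDS);
        }
    }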

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Oct 12, 2021 - 16:27 UTC

Resolved
This incident has been resolved.
Posted Sep 22, 2021 - 06:40 UTC
Update
We have performed mitigation actions to reduce the performance impact, although a root cause has not yet been identified.
Posted Sep 21, 2021 - 19:58 UTC
Investigating
We are investigating cases of degraded performance for some Jira Cloud customers. We will provide more details within the next hour.
Posted Sep 21, 2021 - 18:42 UTC