Between 11:22 AM UTC on September 21, 2021, and 10:26 PM UTC on September 22, 2021, some Atlassian customers using Jira Software, Jira Service Management, and Jira Work Management products were intermittently unable to create issues or receive emails, and experienced overall performance degradation. The event was triggered by a feature rollout.
The change introduced minor modifications to the permissions check component. During the production rollout, misbehaving safe-rollout checks impacted some customers in the US-East and US-West2 regions. The incident was detected after 1 hour and 20 minutes by an internal alerting system and mitigated by reversing the feature rollout, which returned Atlassian systems to a known good state. The total time to resolution was about 35 hours and 4 minutes (9 hours and 3 minutes to resolve the incident on the first occurrence, then 12 hours and 1 minute when the same incident occurred the following day).
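As a quick check of the durations quoted above, the arithmetic can be reproduced in a few lines of Python. This is a minimal sketch that assumes the 35-hour figure is the full window from the start of impact to final resolution, and that the two figures in parentheses are the active periods of each occurrence; the helper name format_duration is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Start and end of the impact window, as stated in this report (UTC).
start = datetime(2021, 9, 21, 11, 22, tzinfo=timezone.utc)
end = datetime(2021, 9, 22, 22, 26, tzinfo=timezone.utc)


def format_duration(delta):
    """Render a timedelta as whole hours and minutes (hypothetical helper)."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours and {minutes} minutes"


# Full window from first impact to final resolution.
impact_window = end - start

# Assumption: the two per-occurrence figures are separate active periods.
first_occurrence = timedelta(hours=9, minutes=3)
second_occurrence = timedelta(hours=12, minutes=1)
active_time = first_occurrence + second_occurrence

print(format_duration(impact_window))  # 35 hours and 4 minutes
print(format_duration(active_time))    # 21 hours and 4 minutes
```

Under that reading, the two occurrences account for 21 hours and 4 minutes of the 35-hour window, and the remainder is the gap between the first resolution and the recurrence.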
The overall impact occurred between 11:22 AM UTC on September 21, 2021, and 10:26 PM UTC on September 22, 2021, for customers of Jira Software, Jira Service Management, and Jira Work Management products. The incident caused service disruption only for customers in the US-East and US-West2 regions, who were intermittently unable to create issues or receive emails and experienced performance degradation.
The issue was caused by a bug in JQL Consistency Checks, a component responsible for safe feature rollouts. When the new feature, a minor modification to the permission check component, was rolled out and enabled for checks on a small percentage of the production fleet in two regions, the asynchronous background jobs that validate the feature started failing. They also increased memory pressure on the servers, leading to long JVM Garbage Collection pauses and eventually to server crashes. The long JVM Garbage Collection pauses affected the database replication latency health checks, which in turn degraded the performance of the remaining database operations in Jira Software, Jira Service Management, and Jira Work Management, or caused a complete outage. The root cause of the incident was the bug in the JQL Consistency Checks component used to validate new features being rolled out to production.
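To illustrate the general pattern described above, the following is a minimal sketch of a sampled background consistency check. It is not Atlassian's implementation: the class ConsistencyChecker, its parameters, and the bounded queue are all assumptions made for illustration. The sketch shows how validation work queued in the background consumes memory alongside normal request handling, and how a size limit on that queue is one way to cap that pressure.

```python
import queue
import random


class ConsistencyChecker:
    """Hypothetical shadow checker: serves the old result, validates the new one later."""

    def __init__(self, old_check, new_check, sample_rate, max_pending):
        self.old_check = old_check
        self.new_check = new_check
        self.sample_rate = sample_rate  # fraction of requests to validate
        # Bounded queue: pending validation work cannot grow without limit.
        self.pending = queue.Queue(maxsize=max_pending)
        self.mismatches = 0
        self.dropped = 0

    def handle(self, request, rng=random.random):
        """Answer with the existing logic; optionally enqueue a background check."""
        result = self.old_check(request)
        if rng() < self.sample_rate:
            try:
                self.pending.put_nowait((request, result))
            except queue.Full:
                # Shed validation work instead of adding memory pressure.
                self.dropped += 1
        return result

    def drain(self):
        """Run queued checks; a failing check is counted, never propagated."""
        while True:
            try:
                request, expected = self.pending.get_nowait()
            except queue.Empty:
                return
            try:
                if self.new_check(request) != expected:
                    self.mismatches += 1
            except Exception:
                self.mismatches += 1


# Usage: validate every request, but keep at most two checks pending.
checker = ConsistencyChecker(
    old_check=lambda r: True,
    new_check=lambda r: r % 2 == 0,
    sample_rate=1.0,
    max_pending=2,
)
for request_id in range(5):
    checker.handle(request_id)
checker.drain()
print(checker.dropped, checker.mismatches)
```

In this sketch, dropping checks when the queue is full trades validation coverage for stability of the serving path; whether that trade-off suits a given system is a design decision outside the scope of this report.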
We know that outages are impactful to your productivity. While we have a number of testing and preventative processes in place, this issue wasn’t identified because the change was related to a very specific kind of legacy case that was not picked up by our automated continuous deployment suites and manual test scripts.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
The following improvement actions will help lower time-to-detect and time-to-resolution for this class of incidents:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support