Jira Software, Jira Service Management, and Jira Work Management are unavailable to some customers
Incident Report for Jira Software
Postmortem

SUMMARY

On October 19, 2021, between 09:20 PM and 11:27 PM UTC, some Atlassian customers using Jira Software, Jira Service Management, and Jira Work Management experienced performance degradation. The event was triggered by a sudden spike in requests from a tenant, which the metric pipeline was unable to handle.

This impacted customers in the us-west-2 region. The incident was detected within a minute by an internal alert system and mitigated by rolling over the EC2 instances and turning off the metric publisher, which put Atlassian systems into a known good state. The total time to resolution was about one hour and 47 minutes.

IMPACT

The incident affected Jira Software, Jira Service Management, and Jira Work Management on October 19, 2021, between 09:20 PM and 11:27 PM UTC. Service was disrupted only for customers in the us-west-2 region, who experienced performance degradation.

ROOT CAUSE

The issue was caused by the metric pipeline not scaling to handle a sudden spike in requests. This created back pressure in handling callbacks from the metric pipeline, which resulted in long JVM garbage collection pauses and thread saturation. As a result, the affected Jira customers encountered delays and experienced performance degradation.
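
To illustrate the failure mode in general terms: when request-handling threads wait on a slow metrics path, pending work accumulates in memory and threads stay occupied. One common way to avoid this is a bounded buffer that drops metrics when full instead of blocking the caller. The Python sketch below is an illustration only and does not describe Atlassian's implementation; the class name DroppingMetricBuffer and its methods are hypothetical.

```python
import queue
import threading


class DroppingMetricBuffer:
    """Hypothetical bounded buffer: callers never block on metric publishing."""

    def __init__(self, capacity: int) -> None:
        # A bounded queue caps memory use during a traffic spike.
        self._queue: "queue.Queue[dict]" = queue.Queue(maxsize=capacity)
        self._lock = threading.Lock()
        self.dropped = 0

    def publish(self, metric: dict) -> bool:
        """Enqueue a metric without blocking; return False if it was dropped."""
        try:
            self._queue.put_nowait(metric)
            return True
        except queue.Full:
            # Shed the metric rather than hold up the request thread.
            with self._lock:
                self.dropped += 1
            return False

    def drain(self, max_items: int) -> list:
        """Remove up to max_items metrics for a background sender to ship."""
        items = []
        while len(items) < max_items:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return items
```

The trade-off in this sketch is deliberate: some metrics are lost under load, but the request path stays responsive and memory use stays bounded.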

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages are impactful to your productivity. While we have a number of testing and preventative processes in place, this issue wasn’t identified because it was related to a very specific case that was not picked up by our automated continuous deployment suites and manual test scripts.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • Introduce circuit breakers to stop propagating resource scalability issues.
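
For readers unfamiliar with the pattern, a circuit breaker stops calling a failing dependency for a cooling-off period so that its problems do not spread to callers. The following Python sketch is a generic illustration under stated assumptions, not the design Atlassian plans to ship: the names CircuitBreaker and CircuitOpenError, the failure threshold, and the reset timeout are all hypothetical, and the class is not thread-safe as written.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after a timeout."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                # Still cooling off: fail fast without touching the dependency.
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each call to the dependency, for example breaker.call(send_metrics, batch), where send_metrics is a placeholder for whatever function reaches the downstream service, and treat CircuitOpenError as a signal to skip or defer the work.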

Furthermore, to minimize the impact of breaking our environments and control the blast radius, we will implement additional preventative measures to:

  • Control sudden spikes and request overload from a single tenant; and
  • Add alerts to detect resource saturation more quickly.
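
As a general illustration of per-tenant spike control, a token bucket gives each tenant a steady request rate plus a small burst allowance, so one tenant's surge cannot consume shared capacity. The Python sketch below is a minimal example under assumed parameters and is not Atlassian's implementation; TenantRateLimiter, rate_per_sec, and burst are hypothetical names, and the class is not thread-safe as written.

```python
import time
from typing import Callable, Dict, Tuple


class TenantRateLimiter:
    """Hypothetical per-tenant token bucket limiter."""

    def __init__(
        self,
        rate_per_sec: float,
        burst: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate_per_sec
        self._burst = float(burst)
        self._clock = clock
        # tenant_id -> (tokens remaining, time of last refill)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, tenant_id: str) -> bool:
        """Return True if the tenant may make a request now."""
        now = self._clock()
        tokens, last = self._buckets.get(tenant_id, (self._burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self._burst, tokens + (now - last) * self._rate)
        if tokens >= 1.0:
            self._buckets[tenant_id] = (tokens - 1.0, now)
            return True
        self._buckets[tenant_id] = (tokens, now)
        return False
```

Requests for which allow returns False would typically be rejected or queued; because each tenant has its own bucket, other tenants are unaffected.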

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Nov 08, 2021 - 16:06 UTC

Resolved
Between 9:50 pm UTC and 11:36 pm UTC, we experienced a partial outage of Jira Work Management, Jira Service Management, and Jira Software affecting a small group of customers. The issue has been resolved and the service is operating normally.
Posted Oct 19, 2021 - 23:38 UTC
Investigating
We are investigating an outage that is impacting some Jira Work Management, Jira Service Management, and Jira Software Cloud customers. We will provide more details within the next hour.
Posted Oct 19, 2021 - 23:32 UTC