On October 19, 2021, between 09:20 PM and 11:27 PM UTC, some Atlassian customers using the Jira Software, Jira Service Management, and Jira Work Management products experienced performance degradation. The event was triggered by a sudden spike in requests from a single tenant, which the metric pipeline was unable to handle.
This impacted customers in the us-west-2 region. The incident was detected within a minute by an internal alerting system and mitigated by rolling over the EC2 instances and turning off the metric publisher, which returned Atlassian systems to a known good state. The total time to resolution was about two hours and seven minutes.
The overall impact occurred on October 19, 2021, between 09:20 PM and 11:27 PM UTC, and affected the Jira Software, Jira Service Management, and Jira Work Management products. The incident caused service disruption only to customers in the us-west-2 region, who experienced performance degradation.
The issue was caused by the metric pipeline not scaling in response to a sudden spike in requests. This created back pressure in handling callbacks from the metric pipeline, which resulted in long JVM Garbage Collection pauses and thread saturation. As a result, the affected Jira customers encountered delays and degraded performance.
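To illustrate the failure mode, the sketch below shows one common way to bound this kind of back pressure on the JVM: a thread pool with a capped work queue that sheds excess metric callbacks during a spike, instead of letting them accumulate on the heap and drive long GC pauses. This is a minimal, hypothetical example; the class name, pool sizes, and queue capacity are illustrative assumptions, not Atlassian's actual configuration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedMetricExecutor {
    // Submits `tasks` metric callbacks to a pool whose queue holds at most
    // `queueCap` pending items; returns how many callbacks were shed.
    static int submitWithBackpressure(int tasks, int queueCap) throws InterruptedException {
        AtomicInteger dropped = new AtomicInteger();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCap),
                // Shed load on overflow rather than queueing without bound
                // (an unbounded queue is what lets heap usage and GC pauses grow).
                (task, exec) -> dropped.incrementAndGet());

        // Simulate a sudden spike: all callbacks arrive at once.
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(5); // stand-in for real callback work
                } catch (InterruptedException ignored) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(60, TimeUnit.SECONDS);
        return dropped.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // A spike of 10,000 callbacks against a queue capped at 100:
        // most are shed, but the workers and heap stay healthy.
        System.out.println("dropped=" + submitWithBackpressure(10_000, 100));
    }
}
```

Under this scheme a spike degrades metric completeness rather than service latency, which is usually the right trade-off for telemetry.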
We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this issue was not identified beforehand because it involved a very specific case that was not picked up by our automated continuous deployment suites or manual test scripts.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
Furthermore, to minimize the impact of breaking our environments and to control the blast radius, we will implement additional preventative measures to
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support