On August 19, 2024, between 6:00 am and 9:58 am UTC, Jira customers located in the EU West 1 region experienced intermittent failures when loading boards and backlogs in team-managed projects. The failures were caused by a bug introduced into a backend service. The bug increased load on downstream services, which then rejected some requests due to rate limiting.
The incident was detected within five minutes by automated monitoring systems and mitigated by rolling back the faulty service, which returned Atlassian systems to a known good state. The total time to resolution was approximately four hours.
All customers in the EU West 1 region experienced elevated error rates when trying to access Jira team-managed boards and backlogs on Monday, August 19, between 6:00 am UTC and 10:00 am UTC. Customers may have noticed the board and backlog views failing to load due to 429 and 500 error responses, although the page may eventually have loaded after multiple retries.
The issue was caused by a change to a service backing the team-managed experiences. Specifically, a caching layer was accidentally removed, which caused a large increase in the number of requests sent to a downstream service.
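To illustrate why removing a caching layer multiplies downstream traffic, here is a minimal sketch. It is not Atlassian's implementation: the names CountingDownstream, TtlCache, fetch_board_config and simulate, the 60-second TTL, and the project key are all hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class CountingDownstream:
    """Hypothetical downstream service that counts the calls it receives."""

    def __init__(self) -> None:
        self.calls = 0

    def fetch_board_config(self, project_id: str) -> dict:
        self.calls += 1
        return {"project": project_id, "columns": ["To Do", "In Progress", "Done"]}


class TtlCache:
    """Minimal in-memory cache with a time-to-live, placed in front of a loader."""

    def __init__(
        self,
        loader: Callable[[str], dict],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh entry: no downstream call
        value = self._loader(key)  # miss or expired: one downstream call
        self._entries[key] = (now + self._ttl, value)
        return value


def simulate(page_loads: int, use_cache: bool) -> int:
    """Return how many downstream calls a burst of page loads produces."""
    downstream = CountingDownstream()
    cache = TtlCache(downstream.fetch_board_config, ttl_seconds=60.0)
    fetch = cache.get if use_cache else downstream.fetch_board_config
    for _ in range(page_loads):
        fetch("PROJ-1")  # assumed project key
    return downstream.calls
```

In this sketch, comparing simulate(1000, use_cache=True) with simulate(1000, use_cache=False) shows the difference: with the cache in place the downstream service sees one call for the burst, and without it every page load becomes a downstream call, which is the kind of increase that can trip a rate limiter.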
We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified because the change was related to the introduction of a feature flag that was not picked up by our automated continuous deployment suites and manual test scripts.
New feature changes are usually placed behind feature flags and rolled out progressively to customers to allow for increased safety when making new changes. However, in this case, the bug that caused this incident resulted from an unintended change in behaviour when this flag was introduced into our code base.
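One way to guard against this class of bug is to check that introducing a flag does not change downstream call volume in either flag state. The sketch below is an assumption about how such a check could look, not a description of Atlassian's code or test suites: CallCounter, make_board_loader, downstream_calls and both loader variants are hypothetical names.

```python
from typing import Callable, Dict, Tuple

Loader = Callable[[str], dict]


class CallCounter:
    """Hypothetical downstream client that counts calls on both code paths."""

    def __init__(self) -> None:
        self.calls = 0

    def legacy_load(self, project_id: str) -> dict:
        self.calls += 1
        return {"project": project_id, "source": "legacy"}

    def new_load(self, project_id: str) -> dict:
        self.calls += 1
        return {"project": project_id, "source": "new"}


def make_board_loader(flag_enabled: Callable[[], bool], downstream: CallCounter) -> Loader:
    """Build a loader where the flag only selects the fetch path.

    The cache wraps both branches, so adding the flag cannot bypass it.
    """
    cache: Dict[Tuple[str, str], dict] = {}

    def load(project_id: str) -> dict:
        variant = "new" if flag_enabled() else "legacy"
        key = (variant, project_id)  # keep variants separate in the cache
        if key not in cache:
            fetch = downstream.new_load if variant == "new" else downstream.legacy_load
            cache[key] = fetch(project_id)
        return cache[key]

    return load


def downstream_calls(flag_value: bool, page_loads: int) -> int:
    """Count downstream calls for repeated loads with the flag fixed."""
    downstream = CallCounter()
    load = make_board_loader(lambda: flag_value, downstream)
    for _ in range(page_loads):
        load("PROJ-1")  # assumed project key
    return downstream.calls
```

A test suite could assert that downstream_calls(False, n) and downstream_calls(True, n) both stay at the cached baseline, so a change that drops the cache while adding the flag would fail before deployment.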
We are prioritizing the following improvement actions to avoid repeating this type of incident:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support