Jira Team managed projects, Boards & Backlogs not loading correctly

Incident Report for Jira

Postmortem

Summary

On August 19, 2024, between 6:00 am and 9:58 am UTC, Jira customers in the EU West 1 region experienced intermittent failures when loading boards and backlogs in team-managed projects. The failures were triggered by a bug introduced into a backend service. The bug increased load on downstream services, which then rejected some requests through rate limiting.

The incident was detected within five minutes by automated monitoring systems and mitigated by rolling back the faulty service, which returned Atlassian systems to a known good state. The total time to resolution was approximately four hours.

Impact

All customers in the EU West 1 region experienced elevated error rates when trying to access Jira team-managed boards and backlogs on Monday, August 19, 2024, between 6:00 am and 10:00 am UTC. Customers may have noticed the board and backlog views failing to load because of 429 and 500 error responses, although the page may have loaded eventually after multiple retries.

Root Cause

The issue was caused by a change to a service backing the team-managed experiences. Specifically, a caching layer was accidentally removed, which caused a large increase in the number of requests sent to a downstream service.
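
To illustrate the mechanism in general terms, the following is a minimal Python sketch of how removing a read-through cache multiplies downstream traffic. It is not Atlassian's code: the names `CountingDownstream`, `TtlCache`, `fetch_board`, and `simulate`, the board ID, and the 30-second time-to-live are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class CountingDownstream:
    """Hypothetical stand-in for a downstream service; it only counts requests."""

    def __init__(self) -> None:
        self.calls = 0

    def fetch_board(self, board_id: str) -> dict:
        self.calls += 1
        return {"id": board_id, "columns": ["To Do", "In Progress", "Done"]}


class TtlCache:
    """Tiny read-through cache with a per-entry time-to-live (assumed 30 s)."""

    def __init__(self, loader: Callable[[str], dict], ttl_seconds: float = 30.0) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # cache hit: no downstream request
        value = self._loader(key)  # cache miss: exactly one downstream request
        self._entries[key] = (now, value)
        return value


def simulate(page_loads: int, use_cache: bool) -> int:
    """Return how many downstream requests `page_loads` views of one board generate."""
    downstream = CountingDownstream()
    cache = TtlCache(downstream.fetch_board)
    for _ in range(page_loads):
        if use_cache:
            cache.get("BOARD-1")
        else:
            downstream.fetch_board("BOARD-1")  # cache bypassed: every view hits downstream
    return downstream.calls
```

In this sketch, `simulate(1000, use_cache=True)` sends a single downstream request, while `simulate(1000, use_cache=False)` sends 1000. An amplification of this kind can push a downstream service past its rate limits.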

Remedial Actions Plan & Next Steps

We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified because the change was related to the introduction of a feature flag that was not picked up by our automated continuous deployment suites or manual test scripts.

New features are usually placed behind feature flags and rolled out progressively to customers, which makes changes safer. In this case, however, the bug that caused the incident resulted from an unintended change in behaviour introduced when the flag itself was added to our code base.
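
One general way to catch this class of regression is a test that runs the same scenario with the flag off and on, and asserts that the number of downstream requests stays the same. The Python sketch below is a hypothetical illustration only. `BoardService`, `new_path_enabled`, `downstream_calls_for`, and `check_flag_parity` are invented names, and the sketch does not reproduce the actual bug, whose details are not described in this report.

```python
from typing import Callable, Dict


class BoardService:
    """Hypothetical service whose new code path is gated by a feature flag."""

    def __init__(self, downstream: Callable[[str], dict], new_path_enabled: bool) -> None:
        self._downstream = downstream
        self._new_path_enabled = new_path_enabled
        self._cache: Dict[str, dict] = {}

    def _cached_fetch(self, board_id: str) -> dict:
        # Only call downstream the first time a board is requested.
        if board_id not in self._cache:
            self._cache[board_id] = self._downstream(board_id)
        return self._cache[board_id]

    def load_board(self, board_id: str) -> dict:
        board = self._cached_fetch(board_id)  # both flag states go through the cache
        if self._new_path_enabled:
            return {**board, "layout": "v2"}  # assumed new behaviour behind the flag
        return board


def downstream_calls_for(flag_value: bool, loads: int = 50) -> int:
    """Load the same board `loads` times and count the downstream requests made."""
    calls = 0

    def downstream(board_id: str) -> dict:
        nonlocal calls
        calls += 1
        return {"id": board_id}

    service = BoardService(downstream, new_path_enabled=flag_value)
    for _ in range(loads):
        service.load_board("BOARD-1")
    return calls


def check_flag_parity() -> None:
    """Fail if either flag state sends more than one downstream request."""
    for flag_value in (False, True):
        calls = downstream_calls_for(flag_value)
        assert calls == 1, f"flag={flag_value}: expected 1 downstream call, got {calls}"
```

If a later change made one flag state call `self._downstream` directly instead of `_cached_fetch`, `check_flag_parity()` would fail for that state before the change reached customers.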

We are prioritizing the following improvement actions to avoid repeating this type of incident:

Improving the test coverage in our services to enforce caching within the service.
Improving the scaling configurations of our services to allow them to handle large increases in load and make them more resilient to spikes in traffic.
Increasing the coverage of rate limiting within our services.
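
As a general illustration of the rate limiting mentioned above, here is a minimal token-bucket sketch in Python. It is one common technique and is not necessarily what Atlassian's services use. `TokenBucket`, `handle_request`, and the rate and capacity values are assumptions made for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket limiter: allows short bursts up to `capacity`, then a steady rate."""

    def __init__(self, rate_per_second: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_second
        self._capacity = float(capacity)
        self._tokens = float(capacity)  # start with a full bucket
        self._clock = clock
        self._last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should be rejected."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def handle_request(bucket: TokenBucket) -> int:
    """Return an HTTP-style status: 200 if admitted, 429 if rate limited."""
    return 200 if bucket.try_acquire() else 429
```

With `TokenBucket(rate_per_second=2.0, capacity=3)`, a burst of five immediate requests yields three 200 responses followed by two 429 responses, and the bucket then refills at two tokens per second. The injectable `clock` parameter is a design choice that lets tests control time without sleeping.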

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Aug 28, 2024 - 00:14 UTC

Resolved

Between 19 Aug 2024 - 06:00 UTC and 19 Aug 2024 - 10:00 UTC, boards and backlogs were not loading for Team Managed projects. The issue has been resolved and the service is operating normally.
If you face any challenges, please reach out to us via a support ticket.

Posted Aug 19, 2024 - 10:21 UTC

Monitoring

We've identified the root cause and rolled out a fix. We're currently monitoring the behavior.

Posted Aug 19, 2024 - 10:12 UTC

Update

We are continuing to investigate the issue and identify the root cause. Jira Software (JSW) and Jira Service Management (JSM) team-managed projects are affected.

Posted Aug 19, 2024 - 09:29 UTC

Update

We are continuing to investigate the issue and identify the root cause.

Posted Aug 19, 2024 - 08:27 UTC

Investigating

We are currently investigating an issue where boards and backlogs are not loading for Team Managed projects. The team is currently working on identifying the root cause and resolving it.

Posted Aug 19, 2024 - 07:27 UTC

This incident affected: Viewing content.