Atlassian Cloud login is timing out for some customers
Incident Report for Jira
Postmortem

SUMMARY

On August 22, 2024, between 17:04 and 18:55 UTC, Atlassian customers using Jira, Jira Service Management, and Confluence experienced intermittent failures during login. Other features that failed intermittently were inviting teammates to Jira or Confluence, Jira Service Management helpseeker sign-up using email domain, and creating requests in the Jira portal.

The event was triggered by a faulty database configuration change, which caused approximately 25% of new login attempts to fail. Automated monitoring detected the incident within five minutes, and it was mitigated by reverting the database configuration, which returned Atlassian systems to a known-good state. The total time to resolution was about one hour and 51 minutes.

IMPACT

The impact occurred on August 22, 2024, between 17:04 and 18:55 UTC, and affected Jira, Jira Service Management, and Confluence. The incident disrupted service for customers across all regions, where new logins failed intermittently.

ROOT CAUSE

The issue was caused by a faulty database auto-scaling configuration change. As a result, approximately 25% of new Atlassian login attempts received HTTP 5xx errors.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified because its effect depended on the system's traffic load. The change reduced the database's autoscaling capacity. The reduced capacity was sufficient under low load, but not once traffic increased. Our automated testing didn’t put enough load on the system to detect the issue.
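As a rough illustration of why a capacity reduction can pass low-load tests and still fail in production, the sketch below models a simple capacity ceiling. The function name and all traffic figures are hypothetical and were chosen only to show the shape of the problem; they are not measurements from this incident or a description of Atlassian's systems.

```python
def failure_rate(offered_rps: float, capacity_rps: float) -> float:
    """Fraction of requests rejected when offered load exceeds capacity.

    Simplified model (an assumption): every request above the capacity
    ceiling fails, and every request below it succeeds.
    """
    if offered_rps <= 0:
        return 0.0
    return max(0.0, offered_rps - capacity_rps) / offered_rps


# Hypothetical capacity ceiling after the configuration change.
CAPACITY_RPS = 750.0

# Low load (as in a light automated test) shows no failures;
# the problem only appears once traffic exceeds the ceiling.
for offered in (200.0, 750.0, 1000.0):
    rate = failure_rate(offered, CAPACITY_RPS)
    print(f"{offered:.0f} rps -> {rate:.0%} failed")
```

Under this model, a test that never pushes offered load above the ceiling reports zero failures, which is why testing at realistic peak traffic matters for this class of change.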

We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • Streamline the login flow so that inviting teammates, helpseeker sign-ups, and similar features are not part of the main workflow;
  • Improve the deployment pipelines so that changes to resources such as databases can be deployed and tested independently of the code for faster delivery.

Furthermore, we deploy our changes progressively to avoid broad impact. This works well for code changes, but in this case the change was to a global resource, the database configuration, which is deployed universally. To minimize the impact of changes to our environments, we will implement additional preventative measures such as:

  • Update the testing guidelines so that any database or infrastructure change must complete more thorough performance and reliability testing;
  • Fix the traffic monitoring dashboards so that they show traffic for the correct tables and indices.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Aug 30, 2024 - 21:34 UTC

Resolved
Between 17:08 UTC and 18:55 UTC, logins timed out for some customers of Confluence, Jira Service Management, and Jira. The issue has been resolved and the service is operating normally.
Posted Aug 22, 2024 - 19:26 UTC
Investigating
We are investigating reports of intermittent errors when logging in to create new sessions for some Confluence, Jira Service Management, and Jira Cloud customers. Once we identify the root cause, we will provide more details.
Posted Aug 22, 2024 - 18:38 UTC
This incident affected: Authentication and User Management.