Jira Family is unavailable across shards

Incident Report for Jira

Postmortem

Summary

On January 18, 2024, starting at 06:34 UTC, customers using Atlassian Marketplace and the Jira family of products may have experienced intermittent failures.

A scheduled database upgrade on an internal Atlassian Marketplace service resulted in degraded performance for that service. This degraded performance manifested as increasing response times and, eventually, timeouts. This service degradation then cascaded upstream and resulted in requests timing out across the Jira family of products, impacting product experiences.

Impact

Jira family:

On January 18, 2024, at 07:14 UTC, the impact on product experiences hit critical alerting thresholds.

This impact would have resulted in performance degradation, service unavailability or, in some instances, full service disruption. Customers would have experienced this as failed page loads or failed interactions with the products.

All end user impact related to product functionality was fully resolved by 10:30 UTC.

Marketplace:

On January 18, 2024, starting from 06:34 UTC, there were impacts to customer functionality related to app management (install, trial, uninstall, update, purchase). There were also impacts to Marketplace partner functionality such as app management and account management.

We resolved the underlying service degradation and restored full service by 15:13 UTC. We then monitored closely for further impact until we officially closed the incident at 16:15 UTC.

Root Cause

The issue was caused by a scheduled database upgrade within the central service that supports the Atlassian Marketplace. The upgrade occurred during a scheduled maintenance window between 06:30 UTC and 08:30 UTC on January 18, 2024.

One of the database upgrade steps triggered degraded performance of the Marketplace service. As the performance degraded, this created back pressure on clients of this service. This back pressure eventually drove request timeouts. Our global edge compounded the issue by retrying requests on timeout, which further increased the load on the service.

Overall, this resulted in degraded performance and an effective outage of this service. Attempts to roll back the change were not immediately effective under heavy load.
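
To illustrate the retry amplification described above, here is a minimal sketch of a bounded retry policy with capped exponential backoff and jitter. It is not Atlassian's edge configuration; the function names, retry count, and delay values are all hypothetical and chosen only to show how limiting and spacing out retries keeps a struggling service from receiving an unbounded multiple of its normal load.

```python
import random
import time


def backoff_delays(max_retries, base=0.1, cap=5.0, rng=random.random):
    """Return one delay per retry: capped exponential backoff with full jitter.

    All parameter values here are illustrative assumptions.
    """
    return [rng() * min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]


def call_with_retries(operation, max_retries=2, sleep=time.sleep, **backoff):
    """Call operation(); on TimeoutError, retry at most max_retries times.

    The total number of attempts is bounded by max_retries + 1, so a timeout
    can never multiply load on the downstream service beyond that factor.
    """
    for delay in backoff_delays(max_retries, **backoff):
        try:
            return operation()
        except TimeoutError:
            # Wait a randomized, growing interval before trying again.
            sleep(delay)
    # Final attempt: if it also times out, the error propagates to the caller.
    return operation()
```

With max_retries=2, a request that keeps timing out is attempted three times in total and then fails, rather than being retried indefinitely.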

Atlassian products depend on this Marketplace service for some user-facing capabilities. In the case of this incident, the dependency was a licensing check that the Jira family makes against the back-end service for some Marketplace apps.

Jira should degrade gracefully when downstream services are degraded or unavailable. For this dependency, we do not have sufficient isolation between downstream impact and the front-end user experience, which caused the impact to experiences in the Jira family. We were able to recover Jira ahead of the Marketplace service recovery by breaking that hard dependency without losing end-user capability.

Remedial actions & next steps

We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified because of a difference in load between our staging and production environments.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

For Marketplace:

Improve concurrency handling within the service, tune and test the system for scale and performance, and apply protective measures like rate limiting.
Harden the database migration procedure to avoid unexpected downtime, including monitoring for expected vs unexpected alerts during upgrades.
Implement faster rollback / roll forward procedures for this type of service impact.
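
As an illustration of the rate-limiting measure mentioned above, the following is a minimal token-bucket sketch. It is an assumption about one common way to implement rate limiting, not a description of the Marketplace service; the class name and the rate and capacity values are hypothetical.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable for testing
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A service guarded this way rejects excess requests quickly instead of queuing them, which limits how far a surge of retried requests can push it past its capacity.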

For Jira:

Isolate the product from the impact of this dependency by routing through a single logical proxy responsible for ensuring appropriate circuit-breaking behaviour and timeouts.
Identify and remediate any other issues of a similar class.
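
The circuit-breaking behaviour referred to above can be sketched as follows. This is a generic illustration under stated assumptions, not the proxy Atlassian plans to build: the class, its thresholds, and the idea of passing a fallback for the licensing check are all hypothetical.

```python
import time


class CircuitBreaker:
    """Fail fast to a fallback after repeated failures of a downstream call."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_after = reset_after              # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: skip the downstream call entirely.
                return fallback()
            # Half-open: the wait has elapsed, so allow one trial call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial.
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In this sketch, once the downstream call has failed enough times, callers receive the fallback immediately instead of waiting for a timeout, so the front end stays responsive while the downstream service recovers.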

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Jan 26, 2024 - 20:22 UTC

Resolved

Between 06:52 UTC and 10:29 UTC, we experienced an outage for Jira Work Management, Jira Software, and Jira Product Discovery. The issue has been resolved and the service is operating normally.

Posted Jan 18, 2024 - 15:56 UTC

Monitoring

Our engineering team has implemented fixes for Jira, and the service should now be recovered. We will continue to monitor all systems.

Posted Jan 18, 2024 - 10:52 UTC

Identified

We have identified the root cause of the issue and are working on a fix.

Posted Jan 18, 2024 - 09:49 UTC

Investigating

We are investigating an outage that is impacting Jira Work Management, Jira Software, Jira Service Management and Jira Product Discovery. We will provide more details within the next hour.

Posted Jan 18, 2024 - 08:35 UTC