On January 18, 2024, starting at 06:34 UTC, customers using Atlassian Marketplace and the Jira family of products may have experienced intermittent failures.
A scheduled database upgrade on an internal Atlassian Marketplace service degraded that service's performance, which showed up as increasing response times and, eventually, timeouts. The degradation then cascaded upstream, causing requests to time out across the Jira family of products and impacting product experiences.
On January 18, 2024, at 07:14 UTC, the impact on product experiences hit critical alerting thresholds.
This impact resulted in performance degradation, service unavailability, or, in some instances, full service disruption. Customers would have experienced this as failed page loads or failed interactions with the products.
All end user impact related to product functionality was fully resolved by 10:30 UTC.
On January 18, 2024, starting from 06:34 UTC, there were impacts to customer functionality related to app management (install, trial, uninstall, update, purchase). There were also impacts to Marketplace partner functionality such as app management and account management.
We resolved the underlying service degradation and restored full service by 15:13 UTC. We then monitored closely for further impact until we officially closed the incident at 16:15 UTC.
The issue was caused by a scheduled database upgrade within the central service that supports the Atlassian Marketplace. The upgrade occurred during a scheduled maintenance window between 06:30 UTC and 08:30 UTC on January 18, 2024.
One of the database upgrade steps triggered degraded performance of the Marketplace service. As performance degraded, it created back pressure on clients of this service, which eventually drove request timeouts. Our global edge compounded the problem by retrying on timeout, which further increased the load on the service.
Overall, this resulted in degrading performance and an effective outage of this service. Attempts to roll back the change were not immediately effective under heavy load.
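To illustrate how retry-on-timeout can be kept from amplifying load on an already degraded service, here is a minimal Python sketch of two common mitigations: a retry budget and capped exponential backoff with jitter. It is not a description of Atlassian's edge or of the fix that was applied. The names RetryBudget, requests_per_retry, min_retries, and backoff_delay are hypothetical, and the budget here counts over the lifetime of the object rather than a sliding time window.

```python
import random


class RetryBudget:
    """Hypothetical helper: cap retries at a fraction of first-attempt requests.

    With requests_per_retry=10, at most one retry is allowed per ten
    requests seen, plus a small fixed allowance (min_retries).
    Counters are cumulative; a production version would likely use a
    sliding time window instead.
    """

    def __init__(self, requests_per_retry=10, min_retries=1):
        self.requests_per_retry = requests_per_retry
        self.min_retries = min_retries
        self.requests = 0
        self.retries = 0

    def record_request(self):
        # Call once for every first attempt (not for retries).
        self.requests += 1

    def try_spend(self):
        # Return True and consume budget if a retry is currently allowed.
        allowed = self.min_retries + self.requests // self.requests_per_retry
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))
```

A caller would check try_spend() before each retry and sleep for backoff_delay(attempt) seconds when it returns True. When the budget is exhausted, the caller fails fast instead of adding more load to the struggling service.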
Atlassian products depend on this Marketplace service for some user-facing capabilities. In this incident, the relevant dependency was a licensing check that the Jira family of products makes against the back-end service for some Marketplace apps.
Jira should degrade gracefully when downstream services are degraded or unavailable. For this dependency, we did not have sufficient isolation between downstream failures and the front-end user experience, which caused the impact to experiences in the Jira family. We were able to recover Jira ahead of the Marketplace service recovery by breaking that hard dependency without losing end user capability.
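One general way to turn a hard dependency like this into a soft one is to wrap the downstream call in a circuit breaker with a fallback. The Python sketch below is illustrative only and does not describe how Jira or the Marketplace service is implemented. LicenseCheckBreaker, fetch, app_key, and the choice to default to "licensed" when no cached answer exists are all assumptions made for the example.

```python
import time


class LicenseCheckBreaker:
    """Hypothetical circuit breaker around a downstream licensing check.

    fetch(app_key) is an assumed callable that returns True/False or
    raises on timeout or error. After failure_threshold consecutive
    failures the circuit opens, and calls are answered from the
    fallback for reset_after seconds without touching the downstream.
    """

    def __init__(self, fetch, failure_threshold=3, reset_after=30.0,
                 default=True, clock=time.monotonic):
        self.fetch = fetch
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.default = default  # assumption: fail open when nothing is cached
        self.clock = clock
        self.failures = 0
        self.opened_at = None
        self.cache = {}

    def _fallback(self, app_key):
        # Prefer the last known good answer; otherwise use the default.
        return self.cache.get(app_key, self.default)

    def is_licensed(self, app_key):
        # While the circuit is open, skip the downstream call entirely.
        if (self.opened_at is not None
                and self.clock() - self.opened_at < self.reset_after):
            return self._fallback(app_key)
        try:
            result = self.fetch(app_key)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return self._fallback(app_key)
        # Success: close the circuit and remember the answer.
        self.failures = 0
        self.opened_at = None
        self.cache[app_key] = result
        return result
```

With this shape, a page load that needs the licensing answer gets a fast response even while the downstream service is timing out. After reset_after seconds, a single trial call is let through to probe whether the service has recovered.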
We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified because of a difference in load between our staging and production environments.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support