Automations and rules issue

Incident Report for Jira

Postmortem

SUMMARY

From June 13, 2023, 06:49 PM UTC to June 14, 2023, 02:20 AM UTC, Atlassian customers using Jira Software, Jira Service Management, Jira Work Management, Confluence and Trello with services hosted in the AWS us-east-1 region were impacted by Automation rule degradation. The event was triggered by increased error rates and latencies for AWS Lambda function invocations in the us-east-1 region. Some other AWS services also experienced increased error rates and latencies as a result of the degraded Lambda function invocations. The incident was automatically detected by multiple monitoring systems within 6 minutes, which paged the on-call teams. Recovery of the affected AWS Lambda service began after 116 minutes, at 08:45 PM UTC on June 13. Full recovery of all AWS services occurred at 10:37 PM UTC on June 13, after the backlog of asynchronous Lambda events had been processed. Some Jira tenants with large event backlogs experienced delays in schedule-based rule reruns. Full recovery of all Atlassian Cloud services was announced at June 14, 2023, 02:20 AM UTC.

IMPACT

The overall impact window ran from June 13, 2023, 06:49 PM UTC to June 14, 2023, 02:20 AM UTC. Product-specific impacts are listed below.

  • Jira Software, Jira Service Management, Jira Work Management - Automation rules were not executed for approximately 2 hours, between Jun 13, 06:49 PM UTC and Jun 13, 08:45 PM UTC. Jira automation events generated during this period could not be rerun. After AWS Lambda recovered, some larger tenants still experienced delays in schedule-based and event-based rules because of a large backlog of events. Full recovery was at June 14, 2023, 02:20 AM UTC.
  • Confluence - Automation rules were not executed for approximately 2 hours, between Jun 13, 06:49 PM UTC and Jun 13, 08:45 PM UTC. Once AWS services were restored, Confluence automation recovered, and the Confluence automation events generated during this period were rerun and processed. Full recovery was at June 14, 2023, 12:41 AM UTC.
  • Jira Product Discovery - Automation rules were not executed for approximately 2 hours, between Jun 13, 06:49 PM UTC and Jun 13, 08:45 PM UTC. Jira automation events generated during this period could not be rerun. Sending feedback or filing a support ticket from within the application did not work.
  • Trello - Email-to-board delays, and upload failures for card cover images, board backgrounds, custom sticker images and custom emoji, plus attachment preview generation failures. Trello automation was unaffected. Full recovery was at June 13, 2023, 10:08 PM UTC.

The service disruption lasted for 7 hours and 31 minutes, from June 13, 2023, 06:49 PM UTC to June 14, 2023, 02:20 AM UTC, and affected customers with services hosted in the US-EAST-1 region.
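The durations in this report follow from the quoted UTC timestamps. As a minimal sketch for checking them, the Python snippet below computes the elapsed time from the start of impact to each recovery milestone. The helper and constant names are illustrative only and are not part of any Atlassian or AWS tooling.

```python
from datetime import datetime, timezone


def utc(year, month, day, hour, minute):
    """Build a timezone-aware UTC timestamp."""
    return datetime(year, month, day, hour, minute, tzinfo=timezone.utc)


# Timestamps taken from this report (all UTC).
IMPACT_START = utc(2023, 6, 13, 18, 49)
MILESTONES = {
    "AWS Lambda recovery began": utc(2023, 6, 13, 20, 45),
    "Trello full recovery": utc(2023, 6, 13, 22, 8),
    "AWS services full recovery": utc(2023, 6, 13, 22, 37),
    "Confluence full recovery": utc(2023, 6, 14, 0, 41),
    "Atlassian Cloud full recovery": utc(2023, 6, 14, 2, 20),
}


def hours_minutes(start, end):
    """Return the elapsed time between two timestamps as (hours, minutes)."""
    total_minutes = int((end - start).total_seconds()) // 60
    return divmod(total_minutes, 60)


for label, moment in MILESTONES.items():
    hours, minutes = hours_minutes(IMPACT_START, moment)
    print(f"{label}: {hours} h {minutes} min after impact start")
```

Under these timestamps, AWS Lambda recovery began 1 hour 56 minutes (116 minutes) after impact start, and full recovery of Atlassian Cloud services came 7 hours 31 minutes after impact start.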

ROOT CAUSE

Atlassian uses Amazon Web Services (AWS) as a cloud service provider. The root cause was an issue with a subsystem responsible for capacity management for AWS Lambda in the US-EAST-1 region, which also impacted 104 AWS services. Automation rules were impacted because the Automation service is hosted exclusively in this region.

No Atlassian-driven events in the lead-up have been identified as causing or contributing to this incident.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • Increase the reliability of message delivery from Jira to the Automation platform, and the recoverability of those messages, to improve recovery times.
  • Create a plan for multi-region impact mitigation for Automation.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted 2 years ago. Jun 27, 2023 - 03:06 UTC

Resolved

Between 6:49 PM UTC and 8:45 PM UTC, AWS API Gateway and AWS Lambda were down. During this time, events from Jira products and Confluence were not received by Automation, so these events will not trigger automation. The AWS services recovered from the outage at 8:45 PM UTC. Automation events for Confluence from that period have been reprocessed. Events for Jira, however, were not received and cannot be recovered. After the AWS services came back online, Automation started processing events triggered after 8:45 PM UTC. Because of the volume, some events became backlogged, but these have now been processed successfully. All new executions since the recovery are confirmed to be operating normally. For any further issues, please reach out to Atlassian support.
Posted 2 years ago. Jun 14, 2023 - 03:02 UTC

Update

From 6:49 PM UTC to 8:45 PM UTC, AWS API Gateway and AWS Lambda were down. During this time, events from Jira products and Confluence were not received by Automation, so these events will not trigger automation. The AWS services recovered from the outage at 8:45 PM UTC. New executions since the recovery are confirmed to be operating normally. Automation events that were backlogged during the incident are still being processed. We will provide a further update once all backlogged events have been processed.
Posted 2 years ago. Jun 14, 2023 - 00:29 UTC

Update

AWS services have recovered from an outage. Automation rules in Jira products and Confluence are now recovering as well. New executions are backlogged and will be delayed. We will provide further updates shortly.
Posted 2 years ago. Jun 13, 2023 - 22:25 UTC

Update

We are continuing to work on a fix for this issue.
Posted 2 years ago. Jun 13, 2023 - 21:55 UTC

Identified

AWS services are recovering from an outage. This is impacting the execution of automation rules for Jira products, Confluence, and Trello while backlogged requests are processed. We will provide further updates shortly.
Posted 2 years ago. Jun 13, 2023 - 21:53 UTC

Update

We are continuing to investigate this issue.
Posted 2 years ago. Jun 13, 2023 - 19:59 UTC

Investigating

We are investigating an issue with automation that is impacting many Atlassian Cloud customers. We suspect that this is due to AWS Lambda degradation. We will provide more details within the next hour.
Posted 2 years ago. Jun 13, 2023 - 19:59 UTC
This incident affected: Automation for Jira.