On June 3rd, between 09:43pm and 10:58pm UTC, Atlassian customers using multiple products were unable to access their services. The event was triggered by a change to the infrastructure API Gateway, which is responsible for routing traffic to the correct application backends.
The incident was detected by the automated monitoring system within five minutes and mitigated by correcting a faulty release feature flag, which put Atlassian systems into a known good state. The first communications were published on the Statuspage at 11:11pm UTC. The total time to resolution was about 75 minutes.
The overall impact lasted from 09:43pm to 10:58pm UTC. The system was initially in a degraded state between 09:43pm and 10:17pm UTC, followed by a total outage between 10:17pm and 10:58pm UTC.
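As a quick check of the timeline above, the durations of each phase can be computed directly from the reported timestamps. This is only an illustrative sketch: the helper name `minutes_between` is made up for this example, and the times are written in 24-hour form.

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM (24-hour, UTC) times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Timestamps taken from the incident timeline (09:43pm, 10:17pm, 10:58pm UTC).
degraded_minutes = minutes_between("21:43", "22:17")
outage_minutes = minutes_between("22:17", "22:58")
total_minutes = minutes_between("21:43", "22:58")

print(degraded_minutes, outage_minutes, total_minutes)
```

The degraded and total-outage phases together account for the roughly 75 minutes to resolution.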
The incident caused service disruption to customers in all regions and affected the following products:
A policy used in the infrastructure API Gateway was being updated in production via a feature flag. The combination of an erroneous value entered in the feature flag and a bug in the code resulted in the API Gateway not processing any traffic.
This created a total outage, where all users started receiving 5XX errors for most Atlassian products.
Once the problem was identified and the feature flag updated to the correct values, all services started seeing recovery immediately.
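The failure mode described above, a bad flag value meeting code that does not handle it, can be illustrated with a minimal sketch of defensive flag handling. This is not a description of Atlassian's gateway code. The `RoutingPolicy` type, its `max_body_bytes` field, the `KNOWN_GOOD` default, and both function names are hypothetical and exist only for this example. The idea is that a flag value is validated before use, and an invalid value falls back to a known good policy instead of stopping traffic.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoutingPolicy:
    # Hypothetical policy setting, used only for illustration.
    max_body_bytes: int

# Hypothetical last known good policy to fall back to.
KNOWN_GOOD = RoutingPolicy(max_body_bytes=1_048_576)

def parse_policy_flag(raw: object) -> RoutingPolicy:
    """Validate a raw feature-flag value; raise ValueError if it is unusable."""
    if not isinstance(raw, dict):
        raise ValueError("flag value must be a mapping")
    value = raw.get("max_body_bytes")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError("max_body_bytes must be a positive integer")
    return RoutingPolicy(max_body_bytes=value)

def effective_policy(raw: object, fallback: RoutingPolicy = KNOWN_GOOD) -> RoutingPolicy:
    """Return the policy from the flag, or the fallback if the flag is invalid."""
    try:
        return parse_policy_flag(raw)
    except ValueError:
        # An erroneous flag value should not stop request processing.
        return fallback

print(effective_policy({"max_body_bytes": 2048}))
print(effective_policy({"max_body_bytes": "oops"}))
```

With this shape, an erroneous value degrades to the previous behavior rather than leaving the gateway unable to route requests, and the validation error can be logged or alerted on separately.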
We know that outages impact your productivity. While we have several testing and preventative processes in place, this specific issue wasn’t identified because the change did not go through our regular release process and instead was incorrectly applied through a feature flag.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
We apologize to customers whose services were affected by this incident and are taking immediate steps to address the above gaps.
Thanks,
Atlassian Customer Support