On 21/02/2023, between 2:30am and 4:15am UTC, Atlassian customers using Jira Software, Jira Service Management, Jira Work Management and Confluence Cloud products were unable to view issues or pages. The event was triggered by a change to Atlassian's Network (Edge) infrastructure in which an incorrect security credential was deployed. This impacted requests to Atlassian's Cloud originating from the Europe and South Asia regions. The incident was detected within 21 minutes by monitoring and was mitigated by a failover to other Edge regions and a rollback of the failed deployment, which returned Atlassian systems to a known good state. The total time to resolution was about 1 hour and 45 minutes.
The failed change impacted 3 of the 14 Atlassian Cloud regions (Europe/Frankfurt, Europe/Dublin, and India/Mumbai). Between 2:30am and 4:15am UTC on 21/02/2023, end-users may have experienced intermittent errors or complete service disruption for multiple Cloud products. Because traffic is directed to Atlassian Cloud using DNS latency-based records, only traffic originating from locations close to Europe and India was impacted.
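The effect of latency-based routing described above can be illustrated with a minimal sketch. It assumes a simplified model in which each client is sent to whichever region answers fastest. The region labels, latency figures, and function names are hypothetical and are not taken from Atlassian's configuration or from the incident data.

```python
# Hypothetical region labels standing in for the three affected regions.
AFFECTED = {"eu-frankfurt", "eu-dublin", "in-mumbai"}


def pick_region(latencies_ms):
    """Return the region with the lowest latency for this client."""
    return min(latencies_ms, key=latencies_ms.get)


def is_impacted(latencies_ms, affected=AFFECTED):
    """A client is impacted only if its closest region is an affected one."""
    return pick_region(latencies_ms) in affected


# Illustrative latencies in milliseconds (made up for this example).
berlin_client = {"eu-frankfurt": 12, "eu-dublin": 35, "us-east": 95}
sydney_client = {"ap-sydney": 9, "in-mumbai": 150, "us-west": 140}
```

In this model the Berlin client resolves to an affected region and sees errors, while the Sydney client is routed elsewhere and is unaffected, even though an affected region appears in its candidate list.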
A change to our Network Infrastructure used faulty credentials. As a result, customer authentication requests could not be validated, and requests were returned with 500 or 503 errors. Our investigation found that the health-check and tests that should have prevented the faulty credentials from reaching the production environment contained a bug and never indicated a fault.
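The report does not describe the actual bug, so the following is only a hypothetical illustration of how a health-check can end up never indicating a fault, alongside a "fail closed" variant that treats any error or unexpected result as unhealthy. All function names here are invented for the example.

```python
def buggy_health_check(probe):
    """Anti-pattern: errors are swallowed, so the check can never fail."""
    try:
        probe()
    except Exception:
        pass  # the failure is silently ignored
    return True


def strict_health_check(probe):
    """Fail closed: healthy only if the probe runs and explicitly returns True."""
    try:
        return probe() is True
    except Exception:
        return False


def probe_with_bad_credential():
    # Stand-in for validating a request with the newly deployed credential.
    raise PermissionError("credential rejected")


def probe_ok():
    # Stand-in for a probe that validates successfully.
    return True
```

With a rejected credential, `buggy_health_check` still reports healthy and would let a rollout proceed, whereas `strict_health_check` reports unhealthy and gives a deployment pipeline a signal to stop.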
We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified in our dev and staging environments because the new credentials were only valid for production.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
Furthermore, we deploy our changes progressively (by cloud region) to avoid broad impact, but in this case our detection and health-checks did not work as expected. To minimize the impact of breaking changes to our environments, we will implement additional preventative measures such as:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support