On June 04, 2024, between 07:28 and 08:32 UTC, some Atlassian customers using Confluence Cloud products in parts of the EU Central region could not access product functionality. The event was triggered by temporarily reduced capacity following a rollback deployment, which left too few nodes to handle the load. Our alerts detected the incident within two minutes, and we mitigated it by rolling forward the release and manually adding more nodes, which returned Atlassian systems to a known-good state. The total time to resolution was about 64 minutes.
The incident affected customers using Confluence Cloud on June 04, 2024, between 07:28 and 08:32 UTC. It caused service disruption for some EU Central region customers, resulting in reduced functionality, slower response times, and limited access when loading Confluence pages, space overview pages, and the home page.
When we deploy and release changes across regions, we follow a progressive rollout strategy, which helps us catch issues proactively. If an issue is found during deployment, the deployment is paused and an automated rollback to the previous release is triggered. In this case, there was a timeout from our configuration to our cloud provider for one region, which forced an automated rollback to the previous release in all regions. When that rollback ran in the EU Central region, there were not enough compute nodes to handle the high traffic. Other regions rolled back successfully with no customer impact.
Thus, the root cause of the incident was a failure to scale our nodes to sufficient capacity following an automated rollback of a release.
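For readers who want a concrete picture of the rollout behavior described above, the following is a minimal sketch, not Atlassian's actual deployment pipeline. The `Rollout` class, the `deploy` callback, the region names, and the release labels are all hypothetical. It only shows the general pattern: releases go out region by region, and a failure in any one region (such as a timeout) triggers a rollback to the previous release in every region.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Rollout:
    """Hypothetical progressive rollout with an all-region rollback."""
    regions: List[str]
    previous: str   # release that was running before the rollout
    candidate: str  # release being rolled out
    running: Dict[str, str] = field(default_factory=dict)  # region -> release

    def __post_init__(self) -> None:
        # Every region starts on the previous release.
        for region in self.regions:
            self.running.setdefault(region, self.previous)

    def run(self, deploy: Callable[[str, str], bool]) -> bool:
        """Deploy one region at a time; stop and roll back on the first failure."""
        for region in self.regions:
            if not deploy(region, self.candidate):
                self.rollback_all()
                return False
            self.running[region] = self.candidate
        return True

    def rollback_all(self) -> None:
        # A failure in one region returns all regions to the previous release.
        for region in self.regions:
            self.running[region] = self.previous


def flaky_deploy(region: str, release: str) -> bool:
    # Assumption for illustration: the deployment fails in one region only.
    return region != "region-c"


rollout = Rollout(
    regions=["region-a", "region-b", "region-c"],
    previous="release-1",
    candidate="release-2",
)
succeeded = rollout.run(flaky_deploy)
```

In this sketch, the regions that had already received the new release are also returned to the previous one. The sketch does not model node capacity, which is the part that fell short in the EU Central region during this incident.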
We know that outages impact your productivity. While we have several testing and preventative processes in place, this specific issue wasn’t identified because it occurred in our deployment pipeline and was not picked up by our regular automated continuous deployment suites and manual test scripts.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
We apologize to customers whose services were impacted by this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support