Service disruption

Incident Report for Confluence

Postmortem

Summary

On June 04, 2024, between 07:28 and 08:32 UTC, some Atlassian customers using Confluence Cloud products in parts of the EU Central region could not access product functionality. The event was triggered by temporarily reduced capacity following a rollback deployment, which left too few nodes to handle the load. Our alerts detected the incident within two minutes, and we mitigated it by rolling forward the release and manually adding more nodes, which put Atlassian systems into a known-good state. The total time to resolution was about 64 minutes.

IMPACT

The incident affected customers using Confluence Cloud on June 04, 2024, between 07:28 and 08:32 UTC. It caused service disruption for some EU Central region customers, resulting in reduced functionality, slower response times, and limited access when loading Confluence pages, space overview pages, and the home page.

ROOT CAUSE

When we deploy and release our changes to all the different regions, we execute a progressive rollout strategy. This helps us catch any issues proactively. If an issue is found during deployment, the deployment is paused and an automated rollback to the previous release is triggered. In this case, there was a timeout from our configuration to our cloud provider for one region, which forced an automated rollback to the previous release in all the regions. When that rollback happened in the EU Central region, there were not enough compute nodes to handle the high traffic. Other regions rolled back successfully with no customer impact.

Thus, the root cause of the incident was a failure to scale our nodes to sufficient capacity, caused by an automated rollback of a release.
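The rollout behavior described above can be sketched roughly as follows. This is an illustrative simplification, not Atlassian's deployment pipeline: the function name `progressive_rollout` and the `deploy` and `rollback` callbacks are hypothetical, and the sketch assumes that a failure in any one region rolls back every region in the list.

```python
from typing import Callable, List


def progressive_rollout(
    regions: List[str],
    deploy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy region by region; on the first failure, roll back all regions.

    Returns True if every region deployed, False if a rollback was triggered.
    All names here are hypothetical and exist only for illustration.
    """
    for region in regions:
        if not deploy(region):
            # Pause the rollout and return every region to the previous
            # release, mirroring the "rollback in all the regions" behavior.
            for r in regions:
                rollback(r)
            return False
    return True
```

In this sketch the rollback step does not check whether each region has enough capacity for the previous release, which is the kind of gap the root cause describes.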

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. While we have several testing and preventative processes in place, this specific issue was not identified because it originated in our deployment pipeline and was not picked up by our regular automated continuous deployment suites or manual test scripts.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

Introduce more aggressive scaling of our nodes to handle incoming traffic if we roll back to an older release
Address gaps in deployment and testing processes, and add more post-deployment validations to handle these cases
Have a fallback to an alternative source when we cannot render the experiences due to these failures
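As a rough illustration of the first action, a capacity check could run before a rollback proceeds. The sketch below is an assumption-laden example, not the planned implementation: the function names, the per-node throughput figure, and the 25% headroom default are all hypothetical.

```python
import math


def nodes_required(
    requests_per_second: float,
    per_node_rps: float,
    headroom: float = 0.25,
) -> int:
    """Nodes needed to serve the given traffic with spare headroom.

    `per_node_rps` and `headroom` are hypothetical tuning values.
    """
    if per_node_rps <= 0:
        raise ValueError("per_node_rps must be positive")
    return math.ceil(requests_per_second * (1 + headroom) / per_node_rps)


def nodes_to_add_before_rollback(
    current_nodes: int,
    requests_per_second: float,
    per_node_rps: float,
    headroom: float = 0.25,
) -> int:
    """How many nodes to add before rolling back (0 if capacity suffices)."""
    needed = nodes_required(requests_per_second, per_node_rps, headroom)
    return max(0, needed - current_nodes)
```

For example, with 1,000 requests per second, 100 requests per second per node, and 10 nodes running, this sketch would ask for 3 additional nodes before the rollback continues.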

We apologize to customers whose services were impacted by this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Jun 12, 2024 - 18:26 UTC

Resolved

Between 07:30 and 08:30 UTC, we experienced errors accessing Confluence Cloud in the EU region. The issue has been resolved and the service is operating normally.

Posted Jun 04, 2024 - 09:21 UTC

Monitoring

A fix has been implemented and we're monitoring the results.

Posted Jun 04, 2024 - 08:37 UTC

Investigating

We are investigating an issue accessing Confluence across the EU region.

Posted Jun 04, 2024 - 08:15 UTC