Service Disruptions Affecting Confluence in Asia Pacific region

Incident Report for Confluence

Postmortem

SUMMARY

On January 18, 2024, between 01:12 am UTC and 02:12 am UTC, Atlassian customers using Confluence Cloud in the APAC region experienced degraded performance and were unable to access core product functionality. The event was triggered by a deployment of a downstream dependency service that could not scale with the increase in traffic. The incident was detected within 18 minutes by an automated monitoring system and mitigated by manually scaling out nodes, which returned Atlassian systems to a known good state. The total time to resolution was about one hour.

In response to this incident, we helped scale the service and put a deployment block in place to prevent the service from being deployed to production again until the issue was resolved.

On January 25, 2024, between 01:05 am UTC and 01:42 am UTC, a separate automated deployment process ran to deploy services that had not been deployed in the previous seven days. This deployment also caused the downstream service to run at a lower-than-desired capacity, resulting in degraded performance in Confluence. The issue was detected within 10 minutes by an automated monitoring system and mitigated by manually scaling out nodes. The total time to resolution was about 37 minutes.

IMPACT

Customers were impacted on January 18, 2024, between 01:12 am UTC and 02:12 am UTC, and again on January 25, 2024, between 01:05 am UTC and 01:42 am UTC. These incidents disrupted service for customers in the APAC region, who may have noticed timeouts and failed requests when viewing pages, creating pages, and using other Confluence Cloud functionality.

ROOT CAUSE

The issue was caused by a deployment of a downstream service that had not scaled to meet the growing traffic. As a result, Confluence Cloud saw timeouts and errors in its requests to that service, and users received HTTP 500 errors.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. We are prioritizing the following improvement actions to avoid repeating this type of incident.

Ensure the right capacity for the target service pool during deployment.
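
As an illustration only, a pre-deployment capacity check of this kind could be sketched as follows. This is a minimal sketch under stated assumptions, not Atlassian's actual tooling: the function names, the per-node throughput figure, and the headroom percentage are all hypothetical.

```python
# Hypothetical pre-deployment capacity gate. All names and numbers here are
# illustrative assumptions, not part of any real Atlassian system.

def required_nodes(peak_rps: int, per_node_rps: int, headroom_pct: int = 20) -> int:
    """Return the node count needed to serve peak traffic plus headroom.

    Uses integer arithmetic so the result is a true ceiling with no
    floating-point rounding surprises.
    """
    if peak_rps < 0 or per_node_rps <= 0 or headroom_pct < 0:
        raise ValueError("peak_rps and headroom_pct must be >= 0, per_node_rps > 0")
    demand = peak_rps * (100 + headroom_pct)   # traffic scaled by headroom, in 1/100 rps
    capacity_per_node = per_node_rps * 100     # same unit as demand
    return -(-demand // capacity_per_node)     # ceiling division


def deployment_allowed(target_nodes: int, peak_rps: int, per_node_rps: int,
                       headroom_pct: int = 20) -> bool:
    """Allow the deployment only if the target pool is large enough."""
    return target_nodes >= required_nodes(peak_rps, per_node_rps, headroom_pct)


if __name__ == "__main__":
    # Assumed example figures: 1000 rps peak, 100 rps per node, 20% headroom.
    print(required_nodes(1000, 100))          # nodes needed for the target pool
    print(deployment_allowed(8, 1000, 100))   # undersized pool is rejected
```

A gate like this would run before traffic is shifted to the newly deployed pool, so an undersized pool blocks the rollout instead of surfacing as timeouts for customers.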

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Jan 31, 2024 - 01:25 UTC

Resolved

Between 1:05 AM UTC and 1:48 AM UTC, some customers in the Asia Pacific region experienced degraded performance for Confluence. The root cause was a bug in our deployment process, and we also found that a previous mitigation put in place for a similar issue was not functioning as expected. These issues have been resolved and the service is now operating normally.

Posted Jan 26, 2024 - 02:54 UTC

Monitoring

We have identified the root cause of the performance degradation for Confluence in the Asia Pacific region and have mitigated the problem. We are now monitoring this incident closely.

Posted Jan 26, 2024 - 02:21 UTC

Investigating

We are investigating cases of degraded performance for some Confluence Cloud customers. We will provide more details within the next hour.

Posted Jan 26, 2024 - 01:34 UTC

This incident affected: View Content, Create and Edit, Comments, Search, and Administration.