Service Disruptions Affecting Confluence

Incident Report for Confluence

Postmortem

Summary

On February 12, 2024, between 1:45pm and 2:54pm UTC, some Atlassian customers were unable to access Confluence Cloud. The event was triggered by a Redis cache failure that caused a spike in connection retries to Redis. The connection spike, combined with multiple nodes syncing data to Redis at the same time, depleted the thread pool. This impacted some customers in the US East and EU regions. The incident was detected within two minutes by monitoring and mitigated by scaling up the webserver nodes, which returned Atlassian systems to a known good state. The total time to resolution was about one hour and nine minutes.

IMPACT

The impact occurred on February 12, 2024, between 1:45pm and 2:54pm UTC. Some customers in the US East and EU regions experienced a service disruption and were unable to access Confluence.

ROOT CAUSE

Confluence keeps a set of feature-controlling data locally on every node and in a Redis cache that serves as the persistent store. A Redis failover from the primary to a read replica triggered a process on all nodes to sync data from the local cache to Redis, each with a large payload. This produced a spike in connections with multiple retries, which in turn caused CPU spikes and thread pool depletion. As a result, Confluence was either unavailable or degraded (very slow to load).
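
The thread pool depletion described above can be illustrated with a minimal sketch. It is not Confluence code: the function names, the pool size, and the use of Python's `concurrent.futures` are all assumptions chosen for illustration. Every worker in a small fixed-size pool is occupied by a sync call that blocks until the (simulated) Redis recovers, so an ordinary page request submitted afterwards cannot start.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def demo_pool_depletion(pool_size: int = 2, wait_s: float = 0.2) -> dict:
    """Fill a fixed-size pool with blocked sync tasks, then submit a request."""
    # Hypothetical signal that the cache has become reachable again.
    redis_recovered = threading.Event()

    def sync_to_redis() -> str:
        # Stand-in for a sync call that holds a worker thread while it waits.
        redis_recovered.wait()
        return "synced"

    def handle_page_view() -> str:
        # Stand-in for a normal user request that needs a free worker.
        return "page rendered"

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # One blocked sync task per worker: the pool is now fully occupied.
        syncs = [pool.submit(sync_to_redis) for _ in range(pool_size)]
        request = pool.submit(handle_page_view)
        try:
            request.result(timeout=wait_s)
            starved = False
        except FutureTimeout:
            # The request is still queued because no worker is free.
            starved = True
        # Simulate recovery: blocked workers finish and the queue drains.
        redis_recovered.set()
        served = request.result(timeout=5)
        sync_results = [f.result(timeout=5) for f in syncs]
    return {"starved": starved, "served": served, "syncs": sync_results}


if __name__ == "__main__":
    print(demo_pool_depletion())
```

In this toy model the request is only served once the blocked workers are released, which loosely mirrors why adding webserver capacity restored service.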

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. We are prioritizing the following improvement actions to avoid repeating this type of incident.

Rely on streaming to sync the local cache to Redis, which prevents connections from being held for a long duration.
Increase the Redis instance type from t3.micro to t3.medium, which provides more memory to accommodate a write-storm scenario.
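
As a rough illustration of the first action, the sketch below sends the local cache in small batches rather than as one large payload, so that each write is short. The report does not describe the actual streaming design, so this is only one possible reading: `InMemoryStore`, `write_batch`, `sync_in_chunks`, and the chunk size are hypothetical names and values, and the in-memory class stands in for a real Redis client.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Item = Tuple[str, str]


class InMemoryStore:
    """Hypothetical stand-in for a Redis client; records each batch it receives."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.batch_sizes: List[int] = []

    def write_batch(self, items: List[Item]) -> None:
        # A real client would send these items in one short round trip.
        self.data.update(items)
        self.batch_sizes.append(len(items))


def chunks(items: Iterable[Item], size: int) -> Iterator[List[Item]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def sync_in_chunks(local_cache: Dict[str, str], store: InMemoryStore,
                   chunk_size: int = 100) -> int:
    """Write the local cache to the store in small batches; return the batch count."""
    batches = 0
    for batch in chunks(local_cache.items(), chunk_size):
        store.write_batch(batch)
        batches += 1
    return batches


if __name__ == "__main__":
    cache = {f"flag:{i}": "on" for i in range(250)}
    store = InMemoryStore()
    print(sync_in_chunks(cache, store, chunk_size=100), store.batch_sizes)
```

Smaller batches bound how long any single write occupies a connection, at the cost of more round trips.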

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Mar 07, 2024 - 18:53 UTC

Resolved

Between 13:36 UTC and 14:36 UTC, we experienced degraded performance in Confluence Cloud. The issue has been resolved and the service is operating normally.

Posted Feb 12, 2024 - 15:07 UTC

Investigating

We are investigating cases of degraded performance/accessibility for some Confluence Cloud customers. We will provide more details within the next 2 hours.

Posted Feb 12, 2024 - 14:24 UTC

This incident affected: View Content, Create and Edit, and Comments.