On February 12, 2024, between 1:45pm and 2:54pm UTC, some Atlassian customers were unable to access Confluence Cloud. The event was triggered by a Redis cache failure, which caused a spike in connection retries to Redis. The connection spike, combined with multiple nodes syncing data to Redis, depleted the threadpool. This impacted some customers in the US East and EU regions. The incident was detected within two minutes by monitoring and mitigated by scaling up the webserver nodes, which put Atlassian systems into a known good state. The total time to resolution was about one hour and nine minutes.
The impact occurred on February 12, 2024, between 1:45pm and 2:54pm UTC. Some customers in the US East and EU regions experienced a service disruption and were unable to access Confluence.
Confluence keeps a set of data that controls features locally on every node, and in a Redis cache that acts as the persistent store. A Redis failover from the primary to a read replica triggered a process on all nodes to sync data from the local cache to Redis with a large payload. This caused spikes in connections with multiple retries, which in turn caused spikes in CPU usage and threadpool depletion. As a result, Confluence was unavailable or degraded (very slow to load).
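To illustrate why simultaneous retries from many nodes can amplify a cache failover, the sketch below shows one common way to spread reconnection attempts out: capped exponential backoff with random jitter. It is a general illustration only. It does not describe the retry logic Confluence uses or the fix Atlassian applied, and the names backoff_delay and call_with_retries, along with the base, cap, and max_attempts values, are hypothetical.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=5.0, rng=None):
    """Return a randomized delay in seconds for the given retry attempt.

    The upper bound doubles with each attempt and is capped, and the
    actual delay is drawn uniformly between 0 and that bound so that
    many nodes do not retry at the same instant.
    """
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_retries(operation, max_attempts=5, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=None):
    """Call operation(), retrying on ConnectionError with jittered backoff.

    Hypothetical helper: 'operation' stands in for any call to a cache.
    The last ConnectionError is re-raised once max_attempts is reached.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up instead of retrying indefinitely
            sleep(backoff_delay(attempt, base, cap, rng))
```

Bounding the number of attempts and randomizing the wait between them limits how many connection attempts arrive at the cache at once after a failover, at the cost of some nodes taking longer to reconnect.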
We know that outages impact your productivity. We are prioritizing the following improvement actions to avoid repeating this type of incident.
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support