Confluence not accessible
Incident Report for Confluence
Postmortem

SUMMARY

On January 20, 2023, between 08:57 AM and 09:29 AM UTC, Atlassian customers using Confluence Cloud couldn’t access their Confluence sites. The event was triggered by thread depletion in some of the GraphQL nodes in the Europe (Frankfurt) region. Our system automatically stopped sending requests to these unhealthy nodes, but the redirected traffic then overloaded the remaining healthy nodes. Those nodes experienced high CPU usage, slow response times, and high error rates, which in turn rendered Confluence sites unresponsive.

The incident impacted customers in the Europe (Frankfurt) region. It was detected within seven minutes by the GraphQL high CPU detection system and mitigated by aggressively scaling up the service in that region, which put Atlassian systems into a known good state. The total time to resolution was 32 minutes.

IMPACT

Confluence Cloud was impacted on January 20, 2023, between 08:57 AM UTC and 09:29 AM UTC. The incident caused service disruption to customers in the Europe (Frankfurt) region.

ROOT CAUSE

The issue was caused by thread pool depletion in some of Confluence’s GraphQL nodes. The depletion was triggered by a recent code change in the GraphQL service, which was meant to improve performance by fetching authorization tokens from our authorization service asynchronously, with all token fetch operations handled by a thread pool. During the incident, Confluence’s authorization service experienced spikes in new connections and returned timeout errors when the GraphQL service tried to fetch tokens. Because these fetches were executed in a thread pool, we ended up with thread pool depletion on our GraphQL service, making some of the nodes unresponsive.
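The depletion mechanism described above can be illustrated with a minimal sketch. It is not the actual GraphQL service code, which this report does not show. The pool size, the request count, and the names `simulate_token_fetches` and `fetch_token` are hypothetical, and the stalled authorization service is modelled with a simple event rather than a real network call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def simulate_token_fetches(pool_size: int, num_requests: int) -> dict:
    """Run token fetches on a bounded pool while the auth service is stalled.

    Returns how many fetches had started and how many were still queued at
    the moment every worker thread was blocked, plus the number that
    completed once the stall ended.
    """
    auth_service_recovers = threading.Event()  # set when the stall ends
    lock = threading.Lock()
    started = 0

    def fetch_token() -> str:
        nonlocal started
        with lock:
            started += 1
        # Stand-in for a network call that hangs until the service responds.
        auth_service_recovers.wait()
        return "token"

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [pool.submit(fetch_token) for _ in range(num_requests)]

        # Wait (up to 5 s) until every worker that can take a fetch has one.
        expected_started = min(pool_size, num_requests)
        deadline = time.monotonic() + 5.0
        while time.monotonic() < deadline:
            with lock:
                if started >= expected_started:
                    break
            time.sleep(0.01)

        with lock:
            snapshot = {"running": started, "queued": num_requests - started}

        # End the stall so the queue drains and the executor can shut down.
        auth_service_recovers.set()
        snapshot["completed"] = sum(1 for f in futures if f.result() == "token")

    return snapshot


if __name__ == "__main__":
    print(simulate_token_fetches(pool_size=4, num_requests=10))
```

With a pool of 4 workers and 10 pending fetches, 4 fetches occupy every thread and the other 6 wait in the queue until the stalled dependency responds. In a real service, any other work that needs the same pool would wait behind them as well.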

While our system automatically rotated the unhealthy nodes out of service and replaced them with new nodes, the ongoing traffic overloaded the remaining healthy nodes and rendered them unresponsive as well.
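The overload on the remaining nodes follows from simple arithmetic: the same traffic is divided among fewer nodes. The sketch below assumes traffic is spread evenly across healthy nodes. The node count, request rate, and per-node capacity are hypothetical figures chosen for illustration, not measurements from this incident.

```python
import math


def per_node_load(total_rps: float, healthy_nodes: int) -> float:
    """Requests per second each node receives when traffic is spread evenly."""
    if healthy_nodes <= 0:
        raise ValueError("no healthy nodes left to serve traffic")
    return total_rps / healthy_nodes


def max_tolerable_failures(total_rps: float, node_count: int,
                           node_capacity_rps: float) -> int:
    """How many nodes can be removed before the rest exceed their capacity."""
    nodes_needed = math.ceil(total_rps / node_capacity_rps)
    return max(node_count - nodes_needed, 0)


if __name__ == "__main__":
    total_rps = 8000.0         # hypothetical regional traffic
    node_count = 10            # hypothetical fleet size
    node_capacity_rps = 1000.0 # hypothetical per-node capacity

    print(per_node_load(total_rps, node_count))      # all nodes healthy
    print(per_node_load(total_rps, node_count - 3))  # three nodes rotated out
    print(max_tolerable_failures(total_rps, node_count, node_capacity_rps))
```

Under these assumed numbers each node handles 800 requests per second when all ten are healthy, and the fleet can lose two nodes before the rest exceed their capacity. Removing three pushes every remaining node past 1000 requests per second, so nodes that were healthy become overloaded simply because they absorb the redirected traffic.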

The faulty service is a GraphQL gateway service for all Confluence Cloud API traffic. Since all the nodes were either not serving traffic or too overloaded to serve it, many GraphQL requests failed, rendering Confluence Cloud in the region inaccessible.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We appreciate that outages impact your productivity and apologize to our partners and customers whose services were impacted during this incident. We are taking immediate steps to avoid a similar situation and improve the platform’s performance and availability.

While we have a number of testing and preventative processes in place to avoid this type of situation, this specific issue was triggered only under particular load conditions, when the authorization service failed to respond to token fetch requests and the majority of the GraphQL nodes experienced thread depletion.

We are prioritizing the following actions to avoid this type of incident in the future:

  • Turn off the async token fetch immediately
  • Look into faster automatic node replacement when nodes are unhealthy
  • Audit GraphQL and authorization services to confirm they can handle increasing loads in different regions

Thank you for your continued support.

Atlassian Customer Support

Posted Jan 31, 2023 - 18:10 UTC

Resolved
The incident has been resolved and the systems are fully operational. A post-incident review will be published later.
Posted Jan 20, 2023 - 12:01 UTC
Monitoring
Confluence services are now operational. Our engineering team are investigating logged errors on the platform during the outage window and we are closely monitoring this situation.
Posted Jan 20, 2023 - 10:21 UTC
Investigating
We are investigating an issue where Confluence sites for a number of customers are not loading. We will provide additional information within the next 2 hours.
Posted Jan 20, 2023 - 09:49 UTC
This incident affected: View Content, Create and Edit, Comments, Authentication and User Management, Search, Administration, Notifications, Marketplace Apps, Purchasing & Licensing, Signup and Mobile (iOS App, Android App).