On January 20, 2023, between 08:57 AM and 09:29 AM UTC, Atlassian customers using Confluence Cloud couldn’t access their Confluence sites. The event was triggered by thread depletion in some of the GraphQL nodes in the Europe (Frankfurt) region. Our system automatically stopped sending requests to these unhealthy nodes, but the ongoing traffic then overloaded the remaining healthy nodes. Those nodes experienced high CPU, slow response times, and high error rates, which in turn rendered Confluence sites unresponsive.
The incident impacted customers in the Europe (Frankfurt) region. It was detected within seven minutes by the GraphQL high CPU detection system and mitigated by aggressively scaling up the service in that region, which put Atlassian systems into a known good state. The total time to resolution was 32 minutes.
Confluence Cloud was affected on January 20, 2023, between 08:57 AM UTC and 09:29 AM UTC. The incident disrupted service for customers in the Europe (Frankfurt) region.
The issue was caused by thread pool depletion in some of Confluence’s GraphQL nodes. The depletion was triggered by a code change in the GraphQL service, which fetches authorization tokens from our authorization service on asynchronous threads to improve performance. The service had recently switched to using a thread pool to handle all token fetch operations. During the incident, Confluence’s authorization service experienced spikes in new connections and returned timeout errors when the GraphQL service tried to fetch tokens. Because these fetches ran in a thread pool, the pool on our GraphQL service was depleted, which made some of the nodes unresponsive.
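The depletion mechanism described above can be illustrated with a minimal Python sketch. It is not Atlassian's code: the pool size, the delay, and the names fetch_token and queue_wait_when_pool_is_busy are hypothetical, and time.sleep stands in for a token fetch that blocks until the authorization service times out. It only shows that once slow fetches hold every worker in a bounded pool, even an instant fetch has to wait for a free thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 2             # hypothetical; the report does not give real pool sizes
SLOW_FETCH_SECONDS = 0.3  # stands in for a fetch that blocks until it times out


def fetch_token(delay_seconds: float) -> str:
    """Simulate a blocking call to an authorization service."""
    time.sleep(delay_seconds)
    return "token"


def queue_wait_when_pool_is_busy() -> float:
    """Return how long an instant fetch waits while slow fetches hold all workers."""
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        # Occupy every worker thread with a slow fetch.
        slow = [pool.submit(fetch_token, SLOW_FETCH_SECONDS) for _ in range(POOL_SIZE)]
        started = time.monotonic()
        # This fetch needs no time of its own, but no worker is free to run it.
        fast = pool.submit(fetch_token, 0.0)
        fast.result()
        waited = time.monotonic() - started
        for future in slow:
            future.result()
    return waited


if __name__ == "__main__":
    waited = queue_wait_when_pool_is_busy()
    print(f"instant fetch waited {waited:.2f}s for a free worker")
```

In this sketch the instant fetch is delayed by roughly the duration of the slow fetches. With a sustained stream of slow fetches, the queue would keep growing, which is the condition the report calls thread pool depletion.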
While our system automatically rotated the unhealthy nodes out of service and replaced them with new nodes, the ongoing traffic overloaded the remaining healthy nodes and rendered them unresponsive as well.
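The arithmetic behind this kind of overload can be sketched as follows. All numbers are hypothetical and chosen only for illustration, since the report does not state node counts, traffic volumes, or per-node capacity. The sketch also assumes traffic is spread evenly across healthy nodes.

```python
def per_node_load(total_rps: float, healthy_nodes: int) -> float:
    """Requests per second each healthy node receives under even balancing."""
    if healthy_nodes <= 0:
        raise ValueError("at least one healthy node is required")
    return total_rps / healthy_nodes


def is_overloaded(total_rps: float, healthy_nodes: int, capacity_rps: float) -> bool:
    """True when the per-node share of traffic exceeds what one node can serve."""
    return per_node_load(total_rps, healthy_nodes) > capacity_rps


if __name__ == "__main__":
    TOTAL_RPS = 900.0     # hypothetical regional traffic
    CAPACITY_RPS = 100.0  # hypothetical per-node capacity

    for healthy in (10, 7):
        load = per_node_load(TOTAL_RPS, healthy)
        state = "overloaded" if is_overloaded(TOTAL_RPS, healthy, CAPACITY_RPS) else "ok"
        print(f"{healthy} healthy nodes: {load:.1f} rps per node ({state})")
```

With these assumed figures, ten nodes run below capacity, but removing three of them pushes the per-node share above capacity even though total traffic is unchanged. Replacement nodes only help once they are serving traffic, which is why scaling up aggressively was the mitigation.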
The faulty service is a GraphQL gateway service for all Confluence Cloud API traffic. Because all the nodes were either not serving traffic or too overloaded to serve traffic, many of the GraphQL requests failed, which rendered Confluence Cloud in the region inaccessible.
We appreciate that outages impact your productivity, and we apologize to our partners and customers whose services were impacted during this incident. We are taking immediate steps to avoid a similar situation and improve the platform’s performance and availability.
While we have a number of testing and preventative processes in place to avoid this type of situation, this specific issue was triggered only under particular load conditions: when the authorization service failed to respond to token fetch requests and the majority of the GraphQL nodes experienced thread depletion.
We are prioritizing the following actions to avoid this type of incident in the future:
Thank you for your continued support.
Atlassian Customer Support