On January 13, 2023, between 00:27 and 01:25 UTC, Atlassian customers using Confluence Cloud couldn’t access their Confluence sites. The event was triggered by a bug in the GraphQL service deploy script. The bug caused the script to fetch the wrong version of the configuration file, which was incompatible with the application code. As a result, the service failed every incoming API request with an authentication error.
The incident impacted customers in the U.S. West (N. California), Europe (Frankfurt), and Asia Pacific (Singapore) regions. The incident was detected within 3 minutes by the frontend error detection system and mitigated by rolling back the last deployment of our GraphQL service, which put Atlassian systems into a known good state. The total time to resolution was 58 minutes.
The impact lasted from 00:27 UTC to 01:25 UTC on January 13, 2023, and affected Confluence Cloud products. The incident caused service disruption to customers in the U.S. West (N. California), Europe (Frankfurt), and Asia Pacific (Singapore) regions.
The issue was caused by a bug in the GraphQL service deploy script. As a precaution, code changes to the GraphQL service must soak in our staging environment before they are released to production. On the day of a production release, the GraphQL team selects a commit that has soaked for long enough to release.
Because of the bug, the deploy script deployed the application source code from the commit the team chose to release, but fetched the configuration file from the latest commit. The latest configuration file change was incompatible with the application code and caused the service to fail every incoming API request with an authentication error.
The faulty service is a GraphQL gateway service for all Confluence Cloud API traffic. Because it returned 401 errors for every request, all GraphQL requests failed, which made Confluence Cloud inaccessible in the regions where the change had been deployed.
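The mismatch described above can be illustrated with a minimal sketch. It is not the actual deploy script: the commit names, version strings, and function names below are hypothetical, and the repository is modeled as an in-memory dictionary purely for illustration. It contrasts resolving the configuration file from the latest commit with resolving both the code and the configuration from the same release commit.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Repository contents at one commit (hypothetical model)."""
    code_version: str
    config_version: str


# Hypothetical history, oldest first. "c3" is the latest commit and carries
# a configuration change that the code at "c2" is assumed not to understand.
HISTORY = {
    "c1": Snapshot(code_version="app-1", config_version="cfg-1"),
    "c2": Snapshot(code_version="app-2", config_version="cfg-1"),
    "c3": Snapshot(code_version="app-3", config_version="cfg-2"),
}
LATEST = "c3"


def build_release_unpinned(release_commit: str) -> tuple[str, str]:
    """Failure mode: code from the chosen commit, config from the latest one."""
    return HISTORY[release_commit].code_version, HISTORY[LATEST].config_version


def build_release_pinned(release_commit: str) -> tuple[str, str]:
    """Both artifacts are resolved from the same release commit."""
    snapshot = HISTORY[release_commit]
    return snapshot.code_version, snapshot.config_version


def is_consistent(release_commit: str, bundle: tuple[str, str]) -> bool:
    """Check that a bundle matches what the release commit actually contains."""
    snapshot = HISTORY[release_commit]
    return bundle == (snapshot.code_version, snapshot.config_version)
```

In this model, releasing the soaked commit "c2" with the unpinned function pairs "app-2" with "cfg-2", a combination that never ran in staging, while the pinned function yields the pair that was soaked. A check like `is_consistent` is one way such a mismatch could be caught before traffic is served.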
We appreciate that outages impact your productivity. We apologize to our partners and customers whose services were impacted during this incident, and we are taking immediate steps to improve the platform’s performance and availability.
While we have a number of testing and preventative processes in place to avoid this type of situation, this specific bug wasn’t discovered because it is rare for the service to have a configuration file change that is not backward compatible with the code. A combination of the bug in the deploy script and the non-backward-compatible configuration file change caused this problem in production.
We are prioritizing the following improvement actions to avoid this type of incident in the future:
Furthermore, we deploy our changes progressively (by percentage and by region) to avoid a broad impact, but in this case our error rate detection did not work as expected. To minimize the impact of breaking changes to our environments, we will implement additional preventative measures such as:
Thank you for your continued support.
Atlassian Customer Support