Confluence sites are unavailable

Incident Report for Confluence

Postmortem

SUMMARY

On January 13, 2023, between 00:27 and 01:25 UTC, Atlassian customers using Confluence Cloud couldn’t access their Confluence sites. The event was triggered by a bug in the GraphQL service deploy script. The bug caused the script to fetch a version of the configuration file that was incompatible with the application code, and as a result the service responded to all incoming API requests with authentication errors.

The incident impacted customers in U.S. West (N. California), Europe (Frankfurt), and Asia Pacific (Singapore) regions. The incident was detected within 3 minutes by the frontend error detection system and mitigated by rolling back the last deployment of our GraphQL service, which put Atlassian systems into a known good state. The total time to resolution was 58 minutes.

IMPACT

Confluence Cloud was impacted on January 13, 2023, between 00:27 and 01:25 UTC. The incident caused service disruption to customers in the U.S. West (N. California), Europe (Frankfurt), and Asia Pacific (Singapore) regions.

ROOT CAUSE

The issue was caused by a bug in the GraphQL service deploy script. As a precaution, code changes to the GraphQL service have to soak in our staging environment before they are released to production. On the day of a production release, the GraphQL team selects a commit that has soaked for long enough to be released.

Because of the bug, the deploy script deployed the application source code from the commit the team chose to release, but fetched the configuration file from the latest commit. The latest configuration file change was incompatible with that application code, and the service responded to all incoming API requests with authentication errors.
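The mismatch described above can be illustrated with a minimal sketch. It is not the actual deploy script: the in-memory `REPO`, the commit ids, and the function names are all hypothetical, chosen only to show the difference between reading the configuration from the latest commit and reading it from the same commit as the application code.

```python
from dataclasses import dataclass

# Hypothetical in-memory model of a repository: commit id -> files at that commit.
# Real deploy tooling would read from version control; this is illustration only.
REPO = {
    "c1": {"app": "app-v1", "config": "config-v1"},  # soaked in staging
    "c2": {"app": "app-v2", "config": "config-v2"},  # latest commit, not yet soaked
}
COMMIT_ORDER = ["c1", "c2"]  # oldest to newest


@dataclass(frozen=True)
class Release:
    app: str
    config: str


def build_release_buggy(release_commit: str) -> Release:
    """Reproduces the failure mode: code from the chosen commit, config from the latest."""
    latest = COMMIT_ORDER[-1]
    return Release(app=REPO[release_commit]["app"], config=REPO[latest]["config"])


def build_release_pinned(release_commit: str) -> Release:
    """Reads both artifacts from the same commit, so they were soaked together."""
    files = REPO[release_commit]
    return Release(app=files["app"], config=files["config"])
```

In this toy model, releasing commit `c1` with the buggy function pairs `app-v1` with `config-v2`, a combination that never ran together in staging, while the pinned function always returns a pair taken from one commit.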

The faulty service is the GraphQL gateway for all Confluence Cloud API traffic. Because it returned 401 errors for every request, all GraphQL requests failed, which made Confluence Cloud inaccessible in the regions where the faulty version was deployed.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We appreciate that outages impact your productivity. We apologize to our partners and customers whose services were impacted during this incident, and we are taking immediate steps to improve the platform’s performance and availability.

While we have a number of testing and preventative processes in place to avoid this type of situation, this specific bug wasn’t discovered because it is rare for the service to have a configuration file change that is not backward compatible with the code. The combination of the bug in the deploy script and the non-backward-compatible configuration file change caused this problem in production.

We are prioritizing the following improvement actions to avoid this type of incident in the future:

Fix the bug in the deploy script so that the configuration file is fetched together with the application code, ensuring they are always compatible with each other.
Update the on-call runbook and continue to train on-call engineers on how to perform a quick rollback of a faulty service to minimize incident duration.
Audit all Confluence Cloud microservices to confirm that every deployed microservice has the correct anomaly detection and progressive rollout setup.

Furthermore, we deploy our changes progressively (by percentage and by region) to avoid a broad impact, but in this case our error rate detection did not work as expected. To minimize the impact of breaking changes to our environments, we will implement additional preventative measures such as:

Bolster error detection mechanisms on the outer HTTP layer. We currently have an error detector in the GraphQL layer, but this time the requests failed in the HTTP layer without reaching GraphQL request handling. We need both types of detectors to cover different scenarios.
Bolster deployment validation to verify core GraphQL functionality during deployment, so that a deployment is immediately halted and rolled back if the validation fails.
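The two measures above can be sketched together as a rough illustration. This is not Atlassian's implementation: the function names, the 5% threshold, the stage labels, and the simplifying assumption that a 401 response never reaches the GraphQL handler are all hypothetical. The sketch shows why a detector that only sees requests reaching GraphQL handling can stay silent, and how an HTTP-layer check could halt a progressive rollout at its first stage.

```python
from typing import Callable, Iterable, List, Optional, Sequence, Tuple


def http_error_rate(status_codes: Iterable[int]) -> float:
    """Fraction of responses with a 4xx or 5xx status, measured at the outer HTTP layer."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    return sum(1 for code in codes if code >= 400) / len(codes)


def graphql_error_rate(status_codes: Iterable[int]) -> float:
    """Error rate as seen by a detector inside GraphQL request handling.

    Assumption for illustration: requests rejected with 401 never reach the
    GraphQL handler, so this detector does not observe them at all.
    """
    reached_handler = [code for code in status_codes if code != 401]
    return http_error_rate(reached_handler)


def progressive_rollout(
    stages: Sequence[str],
    observe: Callable[[str], Iterable[int]],
    threshold: float = 0.05,  # hypothetical halt threshold
) -> Tuple[str, List[str], Optional[str]]:
    """Advance stage by stage; halt and report a rollback at the first unhealthy stage.

    `observe` is a hypothetical callable returning the HTTP status codes
    seen while a stage is live. Returns (status, completed_stages, failed_stage).
    """
    completed: List[str] = []
    for stage in stages:
        if http_error_rate(observe(stage)) > threshold:
            return ("rolled_back", completed, stage)
        completed.append(stage)
    return ("deployed", completed, None)
```

With traffic that is entirely 401 responses, `graphql_error_rate` reports 0.0 in this model while `http_error_rate` reports 1.0, so only the HTTP-layer check would stop the rollout before it widens.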

Thank you for your continued support.

Atlassian Customer Support

Posted Jan 24, 2023 - 00:41 UTC

Resolved

This incident has been resolved and a Post Incident Review will be published later.

Posted Jan 13, 2023 - 02:43 UTC

Monitoring

We identified the issue and recovered the services responsible at around 5:25 PM Pacific time.

Confluence instances have recovered and we are now monitoring.

Posted Jan 13, 2023 - 02:09 UTC

Investigating

We have received reports of Confluence being unavailable.

The team is currently investigating the situation to recover the service to a normal state.

Further updates will be shared as they are made available.

Posted Jan 13, 2023 - 01:32 UTC

This incident affected: View Content, Create and Edit, Comments, Authentication and User Management, Search, Administration, Notifications, Marketplace Apps, Purchasing & Licensing, Signup and Mobile (iOS App, Android App).