Multiple products experiencing elevated error rates for cloud customers
Incident Report for Confluence
Postmortem

SUMMARY

Between December 7, 2021, at 15:54 UTC and December 8, 2021, at 01:55 UTC, Atlassian Cloud services that use AWS services in the US-EAST-1 region experienced a failure. This affected customers using Atlassian Access, Bitbucket Cloud, Compass, Confluence Cloud, the Jira family of products, and Trello. Products were unable to operate as expected, resulting in partial or complete degradation of services. The event was triggered by an AWS networking outage in US-EAST-1 that affected multiple AWS services and made AWS APIs and the AWS Management Console inaccessible. The incident was first reported by Atlassian Access, whose monitoring detected faults accessing DynamoDB in the region. Affected Atlassian services recovered on a service-by-service basis from December 7, 2021, at 21:50 UTC, as the underlying AWS services began to recover. Full recovery of Atlassian Cloud services was announced on December 8, 2021, at 01:55 UTC.

IMPACT

The overall impact occurred between December 7, 2021, at 15:54 UTC and December 8, 2021, at 01:55 UTC. The incident caused partial to complete service disruption of Atlassian Cloud services in the US-EAST-1 region. Product-specific impacts are listed below.

The primary impact for customers of Jira Software, Jira Service Management, and Jira Work Management hosted in the US-EAST-1 region was the inability to scale up, which caused slow response times for web requests and delays in background job processing, including webhooks in the AP region. Customers accessing Jira experienced significant latency, and some experienced service unavailability while the incident took place.

Jira Align experienced an email outage for US customers because the AWS outage affected many AWS services, including Simple Email Service. A small percentage of Jira Align emails were not sent as a result.

Bitbucket Pipelines was unavailable, and pipeline steps failed to execute.

For Jira Automation, tenants' rule executions were delayed because CloudWatch was affected.

Confluence experienced minor impact because upstream services affected user management, search, notifications, and media. Confluence also saw elevated error rates related to the inability to scale up, and GraphQL requests had higher latencies.

Trello email-to-board and dashcards features experienced degraded performance.

Atlassian Access reported that product transfers between organizations failed intermittently. Admins were unable to update features such as IP Allowlist, Audit Logs, Data Residency, Custom Domain Email Notification, and Mobile Application Management, although users could still access and view these features. During the incident, emails to admins were delayed, and there was a degraded experience when creating and deleting API tokens.

Statuspage was largely unaffected. However, notification workers could not scale up and communications to customers were delayed, though they could be replayed later. The incident also impacted users trying to sign in to manage portals and private pages.

Compass experienced a minor impact on its ability to write to its primary database store. No core features were affected. 

Atlassian's customers in US-EAST-1 could have experienced stale data in production for approximately 30 seconds, against an expected 5 seconds at p99, because of delayed token resolution.

The provisioning of new cloud tenants was also impacted until the affected services recovered.

ROOT CAUSE

The issue was caused by a problem with several network devices within AWS's internal network. These devices were receiving more traffic than they were able to process, which led to elevated latency and packet loss. This in turn affected multiple AWS services that Atlassian's platform relies on, causing service degradation and disruption to the products mentioned above. For more information regarding the root cause, see AWS's Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region.

No relevant Atlassian-driven events in the lead-up have been identified as causing or contributing to this incident.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. We are taking immediate steps to improve the Atlassian platform's resiliency and availability to reduce the impact of such an event in the future. Atlassian's Cloud services already run in several regions (US EAST and WEST, AP, EU CENTRAL and WEST, among others), and data is replicated across regions to increase resilience against outages of this magnitude. Even so, we have identified and are taking actions that include improvements to our region failover process. This will minimize the impact of future outages on Atlassian's Cloud services and provide better support for our customers.

We are prioritizing the following actions to avoid repeating this type of incident:

  • Strengthen our cross-region resiliency and disaster recovery plans, including continuing to practice region failover in production and investigating and implementing better resilience strategies for services, such as Active/Active or Active/Passive (a simplified sketch of the Active/Passive pattern follows this list).
  • Improve and adopt multi-region architecture for services that require it.
  • Run wargaming exercises that simulate this outage to assess the customer view of the incident. This will allow us to create further action items to improve our region failover process.
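
For illustration only, the minimal sketch below shows the general shape of an Active/Passive failover decision: a health probe against the primary region and a promotion step for the standby once failures persist. The endpoint, thresholds, and promote_standby step are hypothetical and do not describe Atlassian's actual failover tooling; in practice this decision is typically driven by managed health checks and DNS or load-balancer routing rather than a hand-rolled loop.

import time
import requests

# All names, URLs, and thresholds below are hypothetical and for
# illustration only; they do not describe Atlassian's actual tooling.
PRIMARY_HEALTH_URL = "https://service.us-east-1.example.com/healthz"
FAILURE_THRESHOLD = 3          # consecutive failed probes before failing over
PROBE_INTERVAL_SECONDS = 10


def primary_is_healthy() -> bool:
    """Probe the primary region's health endpoint."""
    try:
        response = requests.get(PRIMARY_HEALTH_URL, timeout=2)
        return response.status_code == 200
    except requests.RequestException:
        return False


def promote_standby() -> None:
    """Placeholder for the real failover step, e.g. repointing DNS or a
    load balancer at the standby region and promoting its data replicas."""
    print("Primary region unhealthy: routing traffic to the standby region")


def monitor() -> None:
    consecutive_failures = 0
    while True:
        if primary_is_healthy():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURE_THRESHOLD:
                promote_standby()
                break
        time.sleep(PROBE_INTERVAL_SECONDS)


if __name__ == "__main__":
    monitor()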

We apologize to customers whose services were impacted during this incident. 

Thanks,

Atlassian Customer Support

Posted Dec 16, 2021 - 15:46 UTC

Resolved
Between 2021/12/07 17:40 UTC and 2021/12/08 12:45 UTC, we experienced elevated error rates for some operations. The issue has been resolved and the service is operating normally.
Posted Dec 08, 2021 - 01:54 UTC
Update
Between 2021/12/07 17:40 UTC and 2021/12/08 12:45 UTC, we experienced elevated error rates for some operations. The issue has been resolved and the service is operating normally.
Posted Dec 08, 2021 - 01:52 UTC
Monitoring
We have started to see recovery for this issue involving elevated error rates for multiple products. We are now monitoring closely and expect recovery shortly.
Posted Dec 07, 2021 - 22:35 UTC
Update
We continue to work on resolving the incident with elevated error rates for multiple products. We have identified the root cause and will provide additional updates as soon as possible.
Posted Dec 07, 2021 - 20:40 UTC
Update
We continue to work on resolving the incident with elevated error rates for multiple products. We have identified the root cause and will provide additional updates as soon as possible.
Posted Dec 07, 2021 - 18:56 UTC
Identified
We are currently investigating an incident resulting in elevated error rates for multiple products. We will provide additional updates as soon as possible.
Posted Dec 07, 2021 - 17:40 UTC
This incident affected: Authentication and User Management, Search, Administration, Purchasing & Licensing, and Signup.