Authentication failures impacting multiple Atlassian Cloud products
Incident Report for Confluence
Postmortem

SUMMARY

On May 31, 2021, between 5:43 AM and 7:05 AM UTC, a number of Jira Software, Jira Service Management, Jira Work Management, Confluence, Bitbucket, Trello, Opsgenie and Statuspage customers were unable to log in to their Atlassian accounts. Additionally, Jira Service Management customer accounts could not log in to the Customer portal during that period.

The event was triggered by an outage at one of Atlassian’s third-party authentication providers. Our automated monitoring systems detected the incident within four minutes. We mitigated the issue by disabling the problematic feature with the provider, and the provider then deployed a fix. These mitigations returned our systems to a known good state. The total time to resolution was about 1 hour and 22 minutes.

IMPACT

The impact occurred between 5:43 AM and 7:05 AM UTC on May 31, 2021, and affected the Jira family of products, Confluence, Bitbucket, Trello, Opsgenie and Statuspage. The incident disrupted service for customers in all regions where logins were impacted. The key impacted areas were:

  • Most Jira Service Management customer accounts could not log in to the Customer portal during the outage.
  • Bitbucket APIs were impacted: APIs that rely on Atlassian account authentication, in particular, experienced sporadic failures. Our investigation identified that 15,514 distinct users were impacted.
  • The Atlassian account login page experienced a failure rate close to 20%.

ROOT CAUSE

The issue was caused by a database outage at one of our third-party authentication providers. The provider attributed the outage to the database hitting a scaling limit that was not properly configured.

We also acknowledge that our internal fallbacks did not behave fully as expected. While they mitigated some of the impact, we had expected a lower impact, particularly on Bitbucket API users.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • The third-party provider conducted a post-mortem and will be taking several steps to improve the prevention, detection and mitigation of issues of a similar nature. This includes improved processes and monitoring, in addition to improvements to the existing database backup and configuration strategy.
  • We are improving our fallback strategy and coverage so that logins are more resilient if our third-party provider faces a similar outage.

We apologize to those customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Jun 10, 2021 - 23:49 UTC

Resolved
On May 31, 2021, between 5:43 AM and 7:05 AM UTC, a number of Jira Software, Jira Service Management, Jira Work Management, Confluence, Bitbucket, Trello, Opsgenie and Statuspage customers were unable to log in to their Atlassian accounts. Additionally, Jira Service Management customer accounts could not log in to the Customer portal during that period.

The event was triggered by an outage at one of Atlassian’s third-party authentication providers. Our automated monitoring systems detected the incident within four minutes. We mitigated the issue by disabling the problematic feature with the provider, and the provider then deployed a fix. These mitigations returned our systems to a known good state. The total time to resolution was about 1 hour and 22 minutes.
Posted May 31, 2021 - 05:30 UTC