Increased authentication errors across multiple products

Incident Report for Confluence

Postmortem

SUMMARY

On 21/02/2023, between 2:30am and 4:15am UTC, Atlassian customers using Jira Software, Jira Service Management, Jira Work Management and Confluence Cloud products were unable to view issues or pages. The event was triggered by a change to Atlassian's Network (Edge) infrastructure in which an incorrect security credential was deployed. This impacted requests to Atlassian's Cloud originating from the Europe and South Asia regions. The incident was detected within 21 minutes by monitoring and mitigated by a failover to other Edge regions and a rollback of the failed deployment, which returned Atlassian systems to a known good state. The total time to resolution was about 1 hour and 45 minutes.

IMPACT

The failed change impacted 3 of the 14 Atlassian Cloud regions (Europe/Frankfurt, Europe/Dublin, and India/Mumbai). Between 2:30am and 4:15am UTC on 21/02/2023, end users may have experienced intermittent errors or complete service disruption across multiple Cloud products. Because traffic is directed to Atlassian Cloud using DNS latency-based records, only traffic originating from locations close to Europe and India was impacted.

ROOT CAUSE

A change to our Network Infrastructure used faulty credentials. As a result, customer authentication requests could not be validated, and requests were returned with 500 or 503 errors. Our investigation found that the health checks and tests that should have prevented the faulty credentials from reaching the production environment contained a bug and never indicated a fault.
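
As an illustration only, and not a description of Atlassian's actual health checks or of the specific bug, the sketch below shows one way a credential check can be written so that it fails closed. The function name `credential_health_check` and the `probe` callable are hypothetical; `probe` is assumed to send one request authenticated with the newly deployed credential and return the HTTP status code.

```python
from typing import Callable


def credential_health_check(probe: Callable[[], int]) -> bool:
    """Return True only when an authenticated probe request succeeds.

    `probe` is a hypothetical callable that sends one request using the
    newly deployed credential and returns the HTTP status code.
    """
    try:
        status = probe()
    except Exception:
        # Fail closed: a probe that cannot run counts as a failed check,
        # so an error in the check itself can never report "healthy".
        return False
    # Only a 2xx response is treated as healthy; 500 and 503 are faults.
    return 200 <= status < 300
```

The design choice here is that every path other than an explicit success reports a fault, so a broken check blocks a deployment rather than letting it through.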

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified in our dev and staging environments because the new credentials were only valid for production.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

Improving end-to-end healthchecks
Faster rollback of our infrastructure deployment
Improved monitoring

Furthermore, we deploy our changes progressively (by cloud region) to avoid broad impact, but in this case our detection and health checks did not work as expected. To minimize the impact of breaking changes to our environments, we will implement additional preventative measures such as:

Canary and shakedown deployments with automated rollback
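
As a rough sketch of what a region-by-region rollout with automated rollback can look like (an assumption-laden illustration, not Atlassian's deployment tooling), the example below deploys to one region at a time, starting with a canary, and reverts everything deployed so far as soon as a health check fails. The names `progressive_rollout`, `deploy`, `is_healthy`, and `rollback` are hypothetical.

```python
from typing import Callable, List, Sequence


def progressive_rollout(
    regions: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Deploy region by region; roll back everything on the first failure.

    The first entry in `regions` acts as the canary. All callables are
    hypothetical hooks supplied by the caller. Returns the regions left
    running the new version (empty if the rollout was rolled back).
    """
    deployed: List[str] = []
    for region in regions:
        deploy(region)
        deployed.append(region)
        if not is_healthy(region):
            # Automated rollback: revert in reverse order of deployment
            # and stop before any further region is touched.
            for done in reversed(deployed):
                rollback(done)
            return []
    return deployed
```

With this shape, a fault detected in the canary region stops the rollout before any other region receives the change.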

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Mar 01, 2024 - 08:24 UTC

Resolved

Between 2:30 UTC and 4:26 UTC, we experienced increased authentication errors for Confluence, Jira Work Management, Jira Service Management, Jira Software, and Atlassian Bitbucket. The issue has been resolved and the service is operating normally.

Posted Feb 21, 2024 - 04:57 UTC

Monitoring

We have identified the root cause of the authentication errors and have mitigated the problem. We are now monitoring this closely and will provide further updates within the hour.

Posted Feb 21, 2024 - 04:22 UTC

Investigating

We are investigating authentication issues impacting some Confluence, Jira Work Management, Jira Service Management, Jira Software, and Atlassian Bitbucket Cloud customers. We will provide more details within the next hour.

Posted Feb 21, 2024 - 03:41 UTC

This incident affected: Authentication and User Management.