Issue with Automation and Connect Apps
Incident Report for Confluence
Postmortem

Summary

On February 28, 2024, between 12:17 UTC and 15:23 UTC, Jira and Confluence apps built on the Connect platform were unable to perform actions on behalf of users. Some apps may have retried these actions and eventually succeeded, while others may have failed the requests outright. The incident was detected within four minutes by automated monitoring of service reliability and mitigated by manually scaling the service, which returned Atlassian systems to a known good state. The total time to resolution was about three hours and six minutes.

Technical Summary

On February 28, 2024, between 12:17 UTC and 15:23 UTC, Jira and Confluence apps built on the Connect platform were unable to perform the token exchanges required for app-initiated user impersonation requests. The event was triggered by the failure of the oauth-2-authorization-server service to scale as load increased. The unavailability of the service, combined with apps retrying failed requests, created a feedback loop that compounded the impact of the scaling failure. The problem affected customers in all regions. The incident was detected within four minutes by automated monitoring of service reliability and mitigated by manually scaling the service, which returned Atlassian systems to a known good state. The total time to resolution was about three hours and six minutes.
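For context, the exchange that was failing is the OAuth 2.0 JWT bearer flow that Connect apps use to impersonate users: the app signs a JWT assertion with its shared secret and trades it for a short-lived access token. The sketch below follows Atlassian's public user impersonation documentation (linked in the remediation section); the function name and error handling are illustrative, not Atlassian's code.

```typescript
// Illustrative sketch of the Connect user impersonation token exchange,
// based on Atlassian's public documentation; not Atlassian's internal code.
import jwt from "jsonwebtoken";

const AUTH_SERVER = "https://oauth-2-authorization-server.services.atlassian.com";

async function getImpersonationToken(
  clientId: string,      // the app's OAuth client ID
  sharedSecret: string,  // the app's shared secret from installation
  baseUrl: string,       // e.g. "https://your-site.atlassian.net" (hypothetical)
  userAccountId: string  // Atlassian account ID of the user to impersonate
): Promise<string> {
  // Sign a JWT assertion identifying the app, the site, and the user.
  const assertion = jwt.sign(
    {
      iss: `urn:atlassian:connect:clientid:${clientId}`,
      sub: `urn:atlassian:connect:useraccountid:${userAccountId}`,
      tnt: baseUrl,
      aud: AUTH_SERVER,
    },
    sharedSecret,
    { algorithm: "HS256", expiresIn: 60 }
  );

  // Exchange the assertion for a short-lived access token.
  const res = await fetch(`${AUTH_SERVER}/oauth2/token`, {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: new URLSearchParams({
      grant_type: "urn:ietf:params:oauth:grant-type:jwt-bearer",
      assertion,
    }),
  });
  if (!res.ok) {
    // During the incident these calls surfaced 502/503 errors and timeouts.
    throw new Error(`Token exchange failed: HTTP ${res.status}`);
  }
  const { access_token } = (await res.json()) as { access_token: string };
  return access_token;
}
```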

IMPACT

The impact window was February 28, 2024, between 12:17 UTC and 15:23 UTC, and affected Connect apps for Jira and Confluence that relied on the user impersonation feature. The incident caused service disruption to customers in all regions. Apps that made requests to act on behalf of users would have seen some of those requests fail throughout the incident. Where apps had retry mechanisms in place, these requests may have eventually succeeded once the service returned to a good state. Impacted apps received HTTP 502 and 503 errors, as well as request timeouts, when calling the oauth-2-authorization-server service.
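Because immediate retries of these failures fed the overload loop described above, a retry mechanism for this endpoint should use capped exponential backoff with jitter and honour any Retry-After header. The helper below is a minimal sketch of that pattern; the name, attempt limit, and status-code list are our assumptions, not guidance from the report.

```typescript
// Illustrative retry helper: capped exponential backoff with full jitter,
// honouring Retry-After, for transient 429/502/503 failures.
async function fetchWithBackoff(
  url: string,
  init: RequestInit,
  maxAttempts = 5
): Promise<Response> {
  for (let attempt = 1; ; attempt++) {
    try {
      const res = await fetch(url, init);
      // Retry only transient statuses; return everything else as-is.
      if (![429, 502, 503].includes(res.status) || attempt === maxAttempts) {
        return res;
      }
      // Honour the server's Retry-After hint when present, otherwise
      // back off exponentially (capped at 30s) with full jitter.
      const retryAfter = Number(res.headers.get("Retry-After"));
      const backoffMs =
        retryAfter > 0
          ? retryAfter * 1000
          : Math.random() * Math.min(30_000, 1000 * 2 ** attempt);
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
    } catch (err) {
      // Network errors and timeouts: same backoff, up to the attempt cap.
      if (attempt === maxAttempts) throw err;
      await new Promise((resolve) =>
        setTimeout(resolve, Math.random() * Math.min(30_000, 1000 * 2 ** attempt))
      );
    }
  }
}
```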

Product functionality such as automation rules in Automation for Jira is partially built on the Connect platform, and some of it was impacted. During the impact window, automation rules configured to execute on behalf of a user (rather than as Automation for Jira) failed to authenticate. Rules that failed were recorded in the Automation Audit Log. Additionally, manually triggered rules would have failed to trigger; these failures do not appear in the Automation Audit Log. Overall, approximately 2% of all rules run during the impact window were affected. Automation for Confluence was not impacted.

ROOT CAUSE

The issue was caused by an increase in traffic to the oauth-2-authorization-server service in the US-East region, combined with the service failing to autoscale in response to the increased load. As the service began to fail requests, apps retried them, which further increased the load. Adding processing resources (scaling out the nodes) allowed the service to handle the increased load and restored availability.

While we have a number of testing and preventative processes in place, this specific issue was not identified beforehand because these load conditions had not been encountered previously. The service has operated for many years in its current configuration and had never experienced this particular failure mode, in which traffic ramped faster than our ability to scale. As a result, the scaling controls had never been exercised, and when they were needed they did not proactively scale the oauth-2-authorization-server service because the CPU scaling threshold was never reached.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • The CPU threshold for scaling the service has been lowered significantly so that scaling begins much earlier as service load increases in each region.
  • We are updating our scaling policy to use step scaling so that capacity is added more rapidly when load increases sharply (see the sketch after this list).
  • We have increased the minimum number of nodes for the service and will monitor service behaviour to determine the optimal minimum scaling value.
  • We will analyse occurrences of rate limiting to determine whether apps respond to it appropriately. The service's rate limiting is described at https://developer.atlassian.com/cloud/confluence/user-impersonation-for-connect-apps/#rate-limiting.
  • Longer term, we will explore network-based rate limiting to prevent a misbehaving app from overloading the service.
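The report does not describe Atlassian's scaling stack, but to make the step-scaling item concrete, here is a hypothetical sketch using AWS Application Auto Scaling for an ECS service. Every name and threshold is illustrative. Unlike a single-threshold policy, step scaling adds larger capacity increments the further the metric overshoots the alarm threshold, which is what lets capacity catch up with a sharp traffic ramp.

```typescript
// Hypothetical step scaling policy (AWS Application Auto Scaling, ECS);
// all resource names and thresholds are illustrative, not Atlassian's.
import {
  ApplicationAutoScalingClient,
  PutScalingPolicyCommand,
} from "@aws-sdk/client-application-auto-scaling";

async function configureStepScaling(): Promise<void> {
  const client = new ApplicationAutoScalingClient({ region: "us-east-1" });

  await client.send(
    new PutScalingPolicyCommand({
      PolicyName: "token-service-step-scale-out",            // hypothetical
      ServiceNamespace: "ecs",
      ResourceId: "service/example-cluster/example-service", // hypothetical
      ScalableDimension: "ecs:service:DesiredCount",
      PolicyType: "StepScaling",
      StepScalingPolicyConfiguration: {
        AdjustmentType: "PercentChangeInCapacity",
        MetricAggregationType: "Average",
        Cooldown: 60,
        // Bounds are relative to the CloudWatch alarm threshold (e.g. 40% CPU):
        // the further CPU overshoots the threshold, the bigger the step.
        StepAdjustments: [
          { MetricIntervalLowerBound: 0, MetricIntervalUpperBound: 20, ScalingAdjustment: 10 },
          { MetricIntervalLowerBound: 20, MetricIntervalUpperBound: 40, ScalingAdjustment: 30 },
          { MetricIntervalLowerBound: 40, ScalingAdjustment: 50 },
        ],
      },
    })
  );
}

configureStepScaling().catch(console.error);
```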

We apologize to customers, partners, and developers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Mar 07, 2024 - 04:53 UTC

Resolved
Between ~12:15 UTC and ~15:20 UTC, Jira Cloud Automation rules and Connect apps impersonating users were failing. We have scaled up the underlying services and confirmed that we are no longer observing failures.
Posted Feb 28, 2024 - 17:16 UTC
Identified
Starting from 13:15 UTC, an issue with Automation and Connect Apps has been affecting certain cloud products.

We have scaled up the underlying services and we're seeing an improvement in response times and success rates. We continue to investigate the root cause and will provide the next update by 18:00 UTC.
Posted Feb 28, 2024 - 16:24 UTC
Update
We continue to investigate the issue with Automation and Connect Apps affecting certain cloud products. We are actively working to resolve this issue as quickly as possible.
Posted Feb 28, 2024 - 15:12 UTC
Investigating
We are investigating an issue with Automation and Connect Apps that is impacting some Cloud products. We will provide more details within the next hour.
Posted Feb 28, 2024 - 13:41 UTC
This incident affected: Marketplace Apps and Confluence Automations.