On February 28, 2024, between 12:17 UTC and 15:23 UTC, Jira and Confluence apps built on the Connect platform were unable to perform actions on behalf of users (user impersonation). Apps with retry mechanisms may have eventually completed these actions; apps without them failed the affected requests. The incident was detected within four minutes by automated monitoring and resolved in about three hours and six minutes.
On February 28, 2024, between 12:17 UTC and 15:23 UTC, Jira and Confluence apps built on the Connect platform were unable to perform the token exchanges required for app-initiated user impersonation requests. The event was triggered by the failure of the oauth-2-authorization-server service to scale as load increased. The unavailability of this service, combined with apps retrying failing requests, created a feedback loop that compounded the impact of the service not scaling. The problem affected customers in all regions. The incident was detected within four minutes by automated monitoring of service reliability and mitigated by manually scaling the service, which returned Atlassian systems to a known good state. The total time to resolution was about three hours and six minutes.
The impact window was February 28, 2024, between 12:17 UTC and 15:23 UTC, and affected Connect apps for Jira and Confluence that relied on the user impersonation feature. The incident caused service disruption to customers in all regions. Apps that made requests to act on behalf of users would have seen some of those requests fail throughout the incident. Where apps had retry mechanisms in place, these requests may have eventually succeeded once the service returned to a good state. Impacted apps received HTTP 502 and 503 errors, as well as request timeouts, when calling the oauth-2-authorization-server service.
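For context, the sketch below illustrates roughly how a Connect app obtains a user impersonation token via an OAuth 2.0 JWT-bearer exchange, and where the 502/503 errors and timeouts described above would have surfaced. This is a minimal Python illustration, not production code: the token endpoint URL, the JWT claim layout, and all identifiers are assumptions for the example.

```python
import time

import jwt       # PyJWT, used to sign the assertion with the app's shared secret
import requests

# Assumed values for illustration only; a real app receives its credentials at installation.
TOKEN_ENDPOINT = "https://oauth-2-authorization-server.services.atlassian.com/oauth2/token"  # assumed URL
OAUTH_CLIENT_ID = "example-oauth-client-id"       # hypothetical
SHARED_SECRET = "example-shared-secret"           # hypothetical
USER_ACCOUNT_ID = "5b10a2844c20165700ede21g"      # hypothetical Atlassian account id


def request_impersonation_token() -> str:
    """Exchange a signed JWT assertion for an access token that acts as the user."""
    now = int(time.time())
    # Claim layout is an assumption based on the Connect JWT-bearer grant.
    assertion = jwt.encode(
        {
            "iss": f"urn:atlassian:connect:clientid:{OAUTH_CLIENT_ID}",
            "sub": f"urn:atlassian:connect:useraccountid:{USER_ACCOUNT_ID}",
            "aud": TOKEN_ENDPOINT,
            "iat": now,
            "exp": now + 60,
        },
        SHARED_SECRET,
        algorithm="HS256",
    )
    resp = requests.post(
        TOKEN_ENDPOINT,
        data={
            "grant_type": "urn:ietf:params:oauth:grant-type:jwt-bearer",
            "assertion": assertion,
        },
        timeout=10,
    )
    # During the incident, this call is where apps saw HTTP 502/503 responses and timeouts.
    resp.raise_for_status()
    return resp.json()["access_token"]
```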
Product functionality such as automation rules in Automation for Jira is partially built on the Connect platform, and some of this functionality was impacted. During the impact window, Automation rules configured to run on behalf of a user (rather than as Automation for Jira) failed to authenticate. Rules that failed were recorded in the Automation Audit Log. Additionally, manually triggered rules would have failed to trigger; these failures do not appear in the Automation Audit Log. Overall, this affected approximately 2% of all rules run during the impact window. Automation for Confluence was not impacted.
The issue was caused by an increase in traffic to the oauth-2-authorization-server service in the US-East region, combined with the service not autoscaling in response to the increased load. As the service began to fail requests, apps retried those requests, which further increased the load on the service. Adding processing resources (scaling out the nodes) allowed the service to handle the increased load and restored availability.
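To illustrate how app-side retries can either compound or relieve this kind of overload, the sketch below shows a request helper that retries 502/503 responses and timeouts with capped exponential backoff and full jitter instead of retrying immediately. The function name and parameters are hypothetical; this is a general pattern, not a prescription for any specific app.

```python
import random
import time

import requests

RETRYABLE_STATUS = {502, 503, 504}


def post_with_backoff(url: str, data: dict, max_attempts: int = 5) -> requests.Response:
    """POST with capped exponential backoff and full jitter.

    Spreading retries out over time (rather than re-sending immediately) avoids
    the feedback loop described above, where failed requests are retried while
    the service is already overloaded.
    """
    for attempt in range(max_attempts):
        try:
            resp = requests.post(url, data=data, timeout=10)
            if resp.status_code not in RETRYABLE_STATUS:
                return resp
        except requests.Timeout:
            pass  # treat a timeout like a retryable error
        if attempt < max_attempts - 1:
            # Full jitter: wait a random amount up to an exponentially growing cap.
            time.sleep(random.uniform(0, min(30.0, 2.0 ** attempt)))
    raise RuntimeError(f"request to {url} still failing after {max_attempts} attempts")
```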
While we have a number of testing and preventative processes in place, this specific issue was not identified beforehand because these load conditions had not been encountered previously. The service has operated for many years in its current configuration and had never experienced this particular failure mode, in which traffic ramped up faster than our ability to scale. As a result, the scaling controls had never been exercised, and when they were needed they did not proactively scale the oauth-2-authorization-server service because the CPU scaling threshold was never reached.
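To make the scaling gap concrete, the sketch below applies a generic target-tracking formula, desired = ceil(current * observed / target), to two signals at once. The thresholds, metric names, and numbers are hypothetical and do not describe our actual scaling configuration; the point is that a CPU-only threshold can remain unreached while per-replica request volume has already exceeded capacity.

```python
import math


def desired_replicas(current_replicas: int,
                     cpu_utilization: float, cpu_target: float,
                     requests_per_replica: float, request_target: float,
                     max_replicas: int = 50) -> int:
    """Target-tracking style calculation: scale to whichever signal demands more.

    Each signal yields ceil(current * observed / target); a CPU-only policy
    misses spikes where CPU stays below target even though request volume has
    grown past what the current replicas can serve.
    """
    from_cpu = math.ceil(current_replicas * cpu_utilization / cpu_target)
    from_requests = math.ceil(current_replicas * requests_per_replica / request_target)
    return max(1, min(max_replicas, max(from_cpu, from_requests)))


# Hypothetical example: CPU at 45% against a 70% target would not trigger a
# scale-up, but 1,800 req/s per replica against a 1,000 req/s target would.
print(desired_replicas(10, cpu_utilization=45, cpu_target=70,
                       requests_per_replica=1800, request_target=1000))  # -> 18
```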
We know that outages impact your productivity. To avoid repeating this type of incident, we are prioritizing improvement actions focused on the scaling behavior of the oauth-2-authorization-server service and on its resilience to retry-driven load.
We apologize to customers, partners, and developers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support