On Thursday, June 25, 2021, between 8:00 AM and 11:28 AM UTC, customers of the Jira family of products, Confluence and Bitbucket were unable to search for users or select entries in user pickers. Customers using Confluence were also unable to search for content, or experienced very slow loading of search results.
The event was triggered by a change rolled out for Bitbucket, which introduced pre-fetching of users to provide recommendations for approvers in new pull requests. The change generated a high volume of queries that were not optimized for our search infrastructure and overwhelmed it. The impact was limited to customers of those products whose requests were routed to our US East region because of their geographic proximity to it. The incident was detected within 42 minutes by our automated monitoring system and mitigated by identifying and rolling back the changes that triggered the event, which returned Atlassian systems to a known good state. The total time to resolution was about 3 hours and 28 minutes.
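The durations quoted above can be checked against the stated start and end times. The short sketch below does that arithmetic; it assumes the 42-minute detection window is measured from the 8:00 AM UTC start, which the report does not state explicitly.

```python
from datetime import datetime, timedelta

# Start and end of impact as stated in the report (UTC).
start = datetime(2021, 6, 25, 8, 0)
end = datetime(2021, 6, 25, 11, 28)

# Total time to resolution.
total_minutes = int((end - start).total_seconds() // 60)
hours, minutes = divmod(total_minutes, 60)

# Assumption: detection is counted from the start of impact.
latest_detection = start + timedelta(minutes=42)

print(f"time to resolution: {hours} h {minutes} min")
print(f"detected no later than: {latest_detection:%H:%M} UTC")
```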
The overall impact was between 8:00 AM and 11:28 AM UTC on Thursday, June 25, 2021, and affected the Jira family of products, Confluence and Bitbucket. The incident disrupted service only for customers whose requests were routed to our US East region because of their geographic proximity to it; these customers couldn’t search for users or content. Product-specific impact areas were the following:
The issue was triggered by a change being progressively rolled out in Bitbucket to add user recommendations for approvers in new pull requests. As a result of the change, the products mentioned above could not reach the search infrastructure for the purpose of user and group lookups.
User lookup powers user search and, by extension, user pickers across the Jira family of products, Confluence and Bitbucket. Because user lookup requests timed out and eventually failed, customers were not able to see results for their user searches or see items to select in user pickers.
Group lookup is a dependency for content search in Confluence. Because group lookup requests timed out and eventually failed, customers were not able to see results for their content searches or experienced very slow loading of search results.
The root cause of the incident was that the search infrastructure in our US East region was overwhelmed by a high volume of non-optimized queries introduced by the rollout of changes to Bitbucket. As those changes progressively propagated to customers whose user and content search queries were routed to US East because of their geographic proximity to it, the resource consumption generated by the resulting queries eventually reached a point where the infrastructure could no longer process requests and started failing.
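The way a progressive rollout can push a shared backend past its limit partway through can be illustrated with a small model. This is only a sketch: the function name, the rollout stages and every number in it are hypothetical and are not taken from the incident data.

```python
def first_overloaded_stage(baseline_qps, extra_qps_at_full_rollout,
                           capacity_qps, stages):
    """Return the first rollout fraction at which load exceeds capacity.

    Load is modelled as the existing baseline plus the extra queries the
    new feature generates, scaled by the fraction of customers who have
    received the change. Returns None if no stage exceeds capacity.
    """
    for fraction in stages:
        total_qps = baseline_qps + extra_qps_at_full_rollout * fraction
        if total_qps > capacity_qps:
            return fraction
    return None


# Hypothetical figures: 1000 qps of existing traffic, 2000 extra qps once
# the feature reaches everyone, and a region that can sustain 1800 qps.
stages = [0.05, 0.25, 0.50, 1.00]
print(first_overloaded_stage(1000, 2000, 1800, stages))
```

In this model the early stages look healthy and the failure only appears once enough customers have the change, which is one reason a regression of this kind can pass the first rollout steps.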
An attempt at rerouting search infrastructure traffic to the next closest US region did not improve the situation until the source of the queries was fully identified and the changes were rolled back. Following the rollback, the infrastructure quickly recovered and our systems reached a known good state leading to the resolution of the incident.
We know that outages are impactful to your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified because the change was related to a very specific type of query reaching our search infrastructure at scale. Its impact was not picked up by our automated continuous deployment suites and manual test scripts.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
We deploy our changes progressively (by cloud region) to avoid broad impact. However, in this case, our detection did not work as expected. To minimize the impact of breaking changes to our environments, we will implement additional preventative measures such as:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve our platform’s performance and availability.
Thanks,
Atlassian Customer Support