On October 11, 2021, between 14:02 and 14:50 UTC, customers using Confluence Cloud, Jira Cloud, and certain developer APIs were unable to perform user searches or team searches in user fields (such as assignee fields, user filters, or mentions) or in CQL queries. The event was triggered by a full database outage in the “prod-euwest” region for the service responsible for user search, also known as Cross-Product User Search (CPUS). This region serves traffic for a majority of our EMEA customers. The database degradation itself was caused by problematic data nodes in one database cluster. Additionally, due to a recent database client migration, the search request timeout was misconfigured; this caused an unnecessary buildup of stale requests and was a contributing cause of the length of the outage.
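As a minimal illustration of why the misconfigured timeout mattered (all names, values, and mechanics here are hypothetical, not Atlassian's actual client configuration): when a client-side request timeout is set too high, requests to a degraded backend are held open rather than failing fast, and stale in-flight requests accumulate.

```python
import concurrent.futures
import time

def search_backend(query, backend_latency):
    """Simulates a degraded backend that responds slowly."""
    time.sleep(backend_latency)
    return f"results for {query!r}"

def search(query, timeout, backend_latency=0.5):
    """Issue one search with a client-side timeout.

    Raises concurrent.futures.TimeoutError if the backend does not
    answer within `timeout` seconds.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(search_backend, query, backend_latency)
        return future.result(timeout=timeout)

# Intended configuration: a short timeout fails fast, so callers can
# retry against a healthy replica instead of piling up.
try:
    search("assignee:alice", timeout=0.1)
except concurrent.futures.TimeoutError:
    print("fast timeout: stale request abandoned quickly")

# A misconfigured (too-long) timeout holds the request open for the
# backend's full degraded latency, tying up connections and threads.
print(search("assignee:alice", timeout=5.0))
```

With many concurrent callers, the second configuration is what lets stale requests build up during a database outage.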
The incident was detected by our automated monitoring systems approximately five minutes before customer impact, and our engineers were alerted less than one minute (T+1) after customer impact began. The issue was mitigated by manually switching the user search database to a redundant database cluster in the same region. The total time to resolution was approximately 48 minutes. The duration of this outage was longer than expected because our redundant cluster was in "maintenance mode" and our auto-failover mechanisms were not triggered.
The overall impact was on October 11, 2021, between 14:02 and 14:50 UTC, and affected Confluence Cloud, the Jira Cloud family of products (Jira Software, Jira Service Management, Jira Work Management), team search developer APIs, and people directory search developer APIs. The incident caused service disruption to our EMEA customers when they performed certain search operations. Depending on the service or experience, this resulted in an HTTP 503 response (Service Unavailable), an HTTP 500 response (Internal Server Error), increased response time, or request failure due to timeout. This includes client requests for user mentions, user or team searches, lookups of users in user fields such as Assignee in Jira, certain CQL queries in Confluence, and third-party apps that use the Atlassian user search APIs.
Product-specific impacts included:
Confluence Cloud
EMEA users had issues when:
Jira Software Cloud
EMEA users had issues with searching or looking up users when:
Jira Service Management
EMEA users were unable to:
Team Search API:
People Directory API:
ROOT CAUSE
Several of our products and developer APIs use an internal service called Cross-Product User Search (CPUS). The CPUS service is responsible for providing GDPR-compliant user and team search to our customers and internal services. Internally, CPUS indexes and performs search queries against 2 redundant database clusters in each of our 5 service regions. This redundancy is key to maintaining our 99.99% availability SLO and zero downtime during upgrades.
During normal operation, user search traffic is split evenly between these database clusters. If our auto-failover mitigation system detects one of those database clusters is in an unhealthy state (which usually indicates a database node failure that is being repaired), all traffic is redirected to the healthy cluster and our engineers are alerted to monitor the situation until both clusters are healthy again. This operation is usually seamless and our customers are not impacted. Auto-failovers do happen occasionally and are a response to issues outside of Atlassian's control.
During maintenance, we keep the service online by routing all search queries to a single database cluster while we perform maintenance on the redundant database cluster. Once maintenance is complete, we return the service in that region to its normal state, serving search queries evenly across both database clusters. When a database cluster is not handling queries due to upgrades, we put that database cluster in 'maintenance mode', and auto-failover to it is disabled to prevent users from being served stale or incorrect data during an outage.
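The routing behavior described above can be sketched as follows. This is a hypothetical illustration, not Atlassian's actual implementation; the `Cluster` fields and `route` function are invented for this sketch. It shows how a maintenance-mode flag that excludes a cluster from serving also removes it as an auto-failover target, which is the failure mode this incident hit.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    healthy: bool = True
    maintenance: bool = False

def route(clusters):
    """Return the clusters eligible to serve search queries."""
    # Maintenance-mode clusters are excluded so users are never served
    # stale data; as a side effect, auto-failover cannot select them.
    candidates = [c for c in clusters if not c.maintenance]
    serving = [c for c in candidates if c.healthy]
    return serving  # empty list means an outage: page an engineer

a, b = Cluster("cluster-a"), Cluster("cluster-b")

# Normal operation: traffic is split evenly between both clusters.
assert [c.name for c in route([a, b])] == ["cluster-a", "cluster-b"]

# Auto-failover: cluster-a degrades, all traffic shifts to cluster-b.
a.healthy = False
assert [c.name for c in route([a, b])] == ["cluster-b"]

# This incident's failure mode: the redundant cluster was left in
# maintenance mode, so when the other cluster degraded there was no
# eligible target for auto-failover.
b.maintenance = True
assert route([a, b]) == []
```

The final assertion is the state the system was in during the outage: recovery required an engineer to manually clear the maintenance flag and route traffic to the redundant cluster.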
After investigation, we identified three unique points of failure in our systems that caused the outage for some of our customers.
The CPUS team resolved this outage by configuring CPUS to send all search queries to the database cluster that had been unintentionally left in maintenance mode. Since that database cluster had not served queries for a significant amount of time, it took approximately 15 minutes for all requests to perform within normal reliability tolerances. Once the unhealthy database cluster returned to a healthy state, we returned it to rotation, which re-enabled our auto-failover mechanism.
We know that outages impact your productivity. Additionally, we apologize that we failed to properly notify our customers through our public Statuspage during the incident; this was due to user error, and we are adding an action item to make sure the same mistake doesn’t happen again. After the immediate impact of this outage was resolved, the incident response team completed a technical analysis of the root cause and contributing factors. The team has conducted a post-incident review to determine how we can avoid and/or reduce the impact of this kind of outage in the future. The following is a list of high-priority action items that will be implemented to augment existing testing, monitoring, and deployment practices:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support