Poor performance and timeouts when running user searches
Incident Report for Confluence
Postmortem

SUMMARY

On October 11, 2021, between 14:02 and 14:50 UTC, customers using Confluence Cloud, Jira Cloud, and certain developer APIs were unable to perform user searches or team searches in user fields (such as assignee fields, user filters, or mentions) or in CQL queries. The event was triggered by a full database outage in the “prod-euwest” region for the service responsible for user search, also known as Cross-Product User Search (CPUS). This region serves traffic for a majority of our EMEA customers. The database degradation itself was caused by problematic data nodes in one database cluster. Additionally, due to a recent database client migration, the search request timeout was misconfigured. This caused an unnecessary buildup of stale requests and was a contributing cause of the length of the outage.

The incident was detected by our automated monitoring systems approximately five minutes prior to customer impact, and our engineers were alerted less than one minute (T+1) after customer impact began. The issue was mitigated by manually switching the user search database to a redundant database cluster in the same region. The total time to resolution was approximately 48 minutes. The duration of this outage was longer than expected because our redundant cluster was in "maintenance mode" and our auto-failover mechanisms were not triggered.

IMPACT

The overall impact was on October 11, 2021, between 14:02 and 14:50 UTC, and affected Confluence Cloud, the Jira Cloud family of products (Jira Software, Jira Service Management, Jira Work Management), team search developer APIs, and people directory search developer APIs. The incident caused service disruption for our EMEA customers when they performed certain search operations. Depending on the service or experience, this resulted in an HTTP 503 response (Service Unavailable), an HTTP 500 response (Internal Server Error), increased response time, or request failure due to timeout. This includes client requests for user mentions, user or team searches, lookups of users in user fields such as Assignee in Jira, certain CQL queries in Confluence, and third-party apps that use the Atlassian user search APIs.

Product-specific impacts included:

  • Confluence Cloud

    • EMEA users had issues when:

      • mentioning other users
      • searching for content
  • Jira Software Cloud

    • EMEA users had issues with searching or looking up users when:

      • inviting users to a project
      • assigning issues to a user
      • selecting a user in JQL
  • Jira Service Management

    • EMEA users were unable to:

      • invite an agent
      • raise a ticket on behalf of another user through the Help Centre
      • select a user in JQL input fields while configuring features such as Reports, SLAs, Approvals, and Queues (attempting to do so might have resulted in an error)
      • add request participants to tickets
      • add customers to a project or to an organization
  • Team Search API:

    • A majority of developer API calls failed to return a successful response
  • People Directory API:

    • A small number of developer API calls failed to return a successful response

ROOT CAUSE

Several of our products and developer APIs use an internal service called Cross-Product User Search (CPUS). The CPUS service is responsible for providing GDPR-compliant user and team search to our customers and internal services. Internally, CPUS indexes and performs search queries against 2 redundant database clusters in each of our 5 service regions. This redundancy is key to maintaining our 99.99% availability SLO and zero downtime during upgrades.

During normal operation, user search traffic is split evenly between these database clusters. If our auto-failover mitigation system detects that one of these database clusters is in an unhealthy state (which usually indicates a database node failure that is being repaired), all traffic is redirected to the healthy cluster and our engineers are alerted to monitor the situation until both clusters are healthy again. This operation is usually seamless and our customers are not impacted. Auto-failovers do happen occasionally and are a response to issues outside of Atlassian's control.

During maintenance, we are able to keep the service online by routing all search queries to a single database cluster while we perform maintenance on the redundant database cluster. Once maintenance is complete, we return the service in that region to a normal state by serving search queries evenly across both database clusters. When a database cluster is not handling queries due to upgrades, we put that database cluster in 'maintenance mode', and auto-failover to it is disabled to prevent users from being served stale or incorrect data during an outage.
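
To make this behaviour concrete, the following is a minimal sketch of the routing and maintenance-mode logic described above, assuming hypothetical names (ClusterState, route_search_query) and a deliberately simplified model; it is not the actual CPUS implementation.

```python
# Minimal sketch of the routing behaviour described above; all names are
# hypothetical and simplified, not the actual CPUS implementation.
from dataclasses import dataclass
from enum import Enum


class ClusterState(Enum):
    HEALTHY = "healthy"          # serving queries normally
    UNHEALTHY = "unhealthy"      # e.g. problematic data nodes being repaired
    MAINTENANCE = "maintenance"  # taken out of rotation for upgrades


@dataclass
class Cluster:
    name: str
    state: ClusterState


def route_search_query(clusters: list[Cluster], request_id: int) -> Cluster:
    """Split traffic evenly across eligible clusters; fail over when one drops out."""
    # Clusters in maintenance mode are never failover candidates, so users are
    # not served stale or incorrect data while an upgrade is in progress.
    eligible = [c for c in clusters if c.state is ClusterState.HEALTHY]
    if not eligible:
        # The combination seen in this incident: one cluster unhealthy, the
        # other unintentionally left in maintenance mode.
        raise RuntimeError("no eligible database cluster for user search")
    # Normal operation: two eligible clusters, traffic split evenly.
    # Failover: one eligible cluster, which receives all traffic.
    return eligible[request_id % len(eligible)]


if __name__ == "__main__":
    clusters = [
        Cluster("cluster-a", ClusterState.HEALTHY),
        Cluster("cluster-b", ClusterState.HEALTHY),
    ]
    print(route_search_query(clusters, request_id=7).name)  # -> cluster-b
```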

After investigation, we identified three distinct points of failure in our systems that caused the outage for some of our customers.

  • The auto-failover mechanism failed to detect that one of the two database clusters entered an unhealthy state, and so the automatic failover to the healthy cluster was not performed. This specific kind of failure was not considered during the design of the auto-failover mechanism.
  • After recent maintenance on one of the clusters in the EMEA region, the cluster was left in maintenance mode due to human error. This meant the cluster was not available for auto-failover even if a failover had been executed when the other cluster became unhealthy.
  • The logic to handle a parameter that configures the search request timeout was incorrect after a database client migration. This resulted in a backlog of search requests that had to be processed even though the clients had already received a timeout response, as illustrated in the sketch after this list.
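
The sketch below illustrates the third point, assuming hypothetical names and timings rather than the actual CPUS database client: when the request timeout is accepted but never enforced on the database side, callers receive timeout responses while their queries continue to occupy workers, and later requests queue behind that stale work.

```python
# Minimal sketch, assuming hypothetical names and timings; this is not the
# actual CPUS database client. It only shows how an unenforced search timeout
# lets stale work pile up behind requests that have already failed.
import concurrent.futures
import time

CALLER_TIMEOUT_S = 2   # how long the caller waits for a search response
QUERY_DURATION_S = 10  # how long a query runs against the degraded cluster


def search_query(request_timeout_s: float) -> str:
    # Buggy post-migration behaviour: the timeout parameter is accepted but
    # never enforced, so the query runs to completion regardless.
    time.sleep(QUERY_DURATION_S)
    return "results"


pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
futures = [pool.submit(search_query, CALLER_TIMEOUT_S) for _ in range(10)]

for future in futures:
    try:
        future.result(timeout=CALLER_TIMEOUT_S)
    except concurrent.futures.TimeoutError:
        # The caller has already been handed a timeout response, but the query
        # still occupies a worker; later requests queue behind this stale work.
        pass

pool.shutdown(wait=False, cancel_futures=True)
```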

The CPUS team was able to resolve this outage by configuring CPUS to send all search queries to the database cluster that was unintentionally left in maintenance mode. Since that database cluster had not served queries for a significant amount of time, it took approximately 15 minutes for all requests to perform within normal reliability tolerances. Once the unhealthy database cluster returned to a healthy state, we returned it to the rotation, which re-enabled our auto-failover mechanism.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. Additionally, we apologize that we failed to properly notify our customers through our public StatusPage regarding the incident; this was due to user error, and we are adding an action item to make sure the same mistake doesn’t happen again. After the immediate impact of this outage was resolved, the incident response team completed a technical analysis of the root cause and contributing factors. The team has conducted a post-incident review to determine how we can avoid and/or reduce the impact of this kind of outage in the future. The following is a list of high-priority action items that will be implemented to augment existing testing, monitoring, and deployment practices:

  • Update the logic in the database client to properly handle the search request timeout parameter, and add tests to ensure this behaviour does not regress over time (a sketch of this kind of test follows this list).
  • Update and test operating run-books to ensure that clusters in maintenance mode are returned to the rotation in a timely manner by removing the diffusion of responsibility; if an engineer is unable to return the cluster on their own, they will call out a specific team member to return the cluster to the rotation. We will automate this process as much as possible.
  • Reconfigure an internal alert that lets the CPUS team know that a database cluster has been in maintenance mode for an extended period of time. Today, that alert is triggered once, and only in one communication channel. We will update this alert so that it is announced periodically and in multiple channels that we expect our on-call personnel to actively monitor.
  • Improve cluster auto-failover behavior to consider cluster performance instead of just cluster state.
  • Add clarification within our internal incident management policy and patch our internal publishing tools to ensure StatusPage is properly updated during similar incidents.
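
As a sketch of the kind of regression test referred to in the first action item above (the function and option names are hypothetical, not the actual CPUS database client code):

```python
# Hypothetical regression-test sketch for the first action item; the function
# and option names are illustrative, not the actual CPUS database client code.
import unittest

DEFAULT_SEARCH_TIMEOUT_MS = 2000


def build_query_options(search_timeout_ms=None):
    """Translate the configured search request timeout into database client options."""
    timeout_ms = DEFAULT_SEARCH_TIMEOUT_MS if search_timeout_ms is None else search_timeout_ms
    return {"index": "users", "timeout_ms": timeout_ms}


class SearchTimeoutOptionsTest(unittest.TestCase):
    def test_configured_timeout_reaches_the_database_client(self):
        self.assertEqual(build_query_options(500)["timeout_ms"], 500)

    def test_a_timeout_is_always_set_so_queries_cannot_run_unbounded(self):
        # Guards against the regression class behind this incident: a search
        # query that keeps running on the cluster after the caller has timed out.
        self.assertEqual(build_query_options()["timeout_ms"], DEFAULT_SEARCH_TIMEOUT_MS)


if __name__ == "__main__":
    unittest.main()
```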

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Oct 20, 2021 - 01:15 UTC

Resolved
This is a retroactive notification for a resolved incident. Between 14:05 and 14:50 UTC on 2021/10/11, customers with Jira and Confluence Cloud sites hosted in Europe experienced poor performance and timeouts when running user searches. Ecosystem Apps carrying out such searches were affected too.

The issue is now mitigated, and the team will perform a PIR to isolate the root cause and identify actions to avoid repeat incidents.
Posted Oct 12, 2021 - 00:57 UTC
This incident affected: Search.