Create content via API failing on Confluence Cloud

Incident Report for Confluence

Postmortem

SUMMARY

On June 22, 2022, from 01:08 AM UTC to 03:42 AM UTC, some customers using Bitbucket Pipelines, Confluence Cloud, Forge, and the Jira Cloud family of products (Jira Software, Jira Service Management, Jira Work Management) were affected by an incident. Bitbucket Pipelines saw an increase in build failures, while Jira, Confluence, and Forge experienced performance and functionality degradation. The event was triggered by our internal Artifact Repository Manager becoming unavailable during a scheduled multi-availability zone disaster recovery test. Customers across all regions were affected. The incident was detected within two minutes by monitoring and mitigated by restarting the Artifact Repository service, which recovered the affected products. The total time to resolution was about three hours.

IMPACT

The overall impact occurred between June 22, 2022, 01:08 AM UTC, and June 22, 2022, 05:58 AM UTC, on Bitbucket Pipelines, Confluence Cloud, Forge, and the Jira Cloud family of products (Jira Software, Jira Service Management, Jira Work Management). The outage of the internal Artifact Repository Manager caused scalability problems in these products and prevented us from building or deploying new versions of our services. As a result, performance and functionality were degraded for most of these products.

ROOT CAUSE

The issue was caused by an outage of the internal Artifact Repository Manager during the planned multi-availability zone disaster recovery test. As a result, the products listed above could not access the Docker images and other artifacts they need to scale up, which caused partial degradation or complete unavailability of services for some customers. Restarting the internal Artifact Repository Manager caused downtime for that service but led to a successful recovery.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages may impact your productivity. After the immediate impact of this outage was resolved, the incident response team completed a technical analysis of the root cause and contributing factors. The team has conducted a post-incident review to determine how we can avoid the impact of this kind of outage in the future.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

We raised a critical issue with the vendor who provides our Artifact Management software to improve the application's resilience to availability zone failures.
We are working on improving our disaster recovery plan so that we can mitigate such incidents faster.
We are reviewing our test strategies so that we can catch similar issues at an early stage.

To minimize the impact of such incidents on our customers, we will implement additional preventative measures such as:

Development of a redundant caching mechanism for our platform system to improve the scalability and reliability of our products.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Jul 07, 2022 - 23:54 UTC

Resolved

This incident has been resolved.

Posted Jun 22, 2022 - 13:03 UTC

Update

We are continuing to monitor for any further issues.

Posted Jun 22, 2022 - 11:29 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jun 22, 2022 - 11:15 UTC

Investigating

From 01:30 AM UTC to 09:05 AM UTC, we experienced issues creating content via API on Confluence Cloud. We have identified the root cause and mitigated the problem. We are now monitoring closely.

Posted Jun 22, 2022 - 11:14 UTC