On Sept 27th 2021 between 08:11 and 13:06 UTC, customers using Confluence Cloud were unable to create or edit pages. The incident was triggered by a deployment to pick up an environment variable change in a critical collaborative editing service. The new stack was unable to handle the traffic transferred from the existing stack. Automated alerts detected the impact 15 minutes after it began. The response team blocked traffic to the service until it could return to a healthy state. The total time to resolution was about 4 hours and 55 minutes.
IMPACT
Page creation and editing in Confluence Cloud were disrupted between Sept 27th 2021, 08:11 UTC and Sept 27th 2021, 13:06 UTC. The response team had mitigated the impact for most customers by 10:40 UTC, approximately 2.5 hours after the initial detection. Some customers in the EU continued to be affected until we resumed all traffic at 13:06 UTC.
ROOT CAUSE
The critical real-time collaboration service could not handle incoming requests after a redeployment. Architectural limitations of the service prevent progressive rollouts. Instead, traffic is cut over to the new stack with some jitter. The new stack was slow to respond, and client retries added further load, which prevented it from stabilizing.
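For illustration only, the sketch below shows one common way to keep retries from piling onto a slow stack: capped exponential backoff with full jitter. It is not the service's actual retry logic; the function name `backoff_delays` and its parameters are hypothetical and chosen for this example.

```python
import random


def backoff_delays(base=0.5, cap=30.0, attempts=6, rng=None):
    """Return capped exponential backoff delays with full jitter, in seconds.

    Hypothetical helper: the defaults are illustrative, not values taken
    from the service described in this report.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The upper bound doubles on each attempt but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: choose uniformly in [0, ceiling] so that clients
        # retrying at the same moment spread out instead of arriving together.
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

A client would sleep for each delay in turn before its next attempt. Randomizing the wait spreads retries over time, which reduces the chance that a recovering stack is hit by synchronized waves of requests.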
We know that outages are disruptive to your productivity, and we apologize to all customers who were impacted by this incident. We are prioritizing the following improvement actions to avoid repeating this type of service disruption in the future:
To minimize the impact of breaking changes to our environments, we have a larger architectural change to this service in progress. This change will enable progressive rollouts and address the long-term scale and reliability limitations of the service.
Thank you,
Atlassian Customer Support