Content Cloud issues are being investigated

Incident Report for Purple Publish

Postmortem

Summary

On 30.04.2025 at around 16:10 CEST, we experienced a disruption in our publishing pipeline that delayed content processing for several hours. The cause was unexpected behavior in our event handling system when processing large-scale taxonomy updates.

Root Cause

Our system is designed to generate content update events whenever a taxonomy is modified. In this incident, a customer with a large volume of content updated a taxonomy that was referenced across many content items. The resulting high number of update events caused strain on the processing service, which then took longer than expected to acknowledge events from Kafka.

Because of these delays, Kafka reassigned the same events to other service instances. Each reassigned event was processed again and generated its update events a second time, significantly inflating the workload. This loop continued until we stopped event processing and applied a hotfix, originally scheduled for release on May 12, which corrected the event handling behavior.

Once the fix was applied, the system stopped generating duplicate events. However, it took several hours to process the backlog that had already accumulated.
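
The amplification described above can be illustrated with a small simulation. This is a minimal sketch and not the actual hotfix, whose details are not covered in this report. It assumes one common mitigation, skipping events whose ID has already been handled. All names (FanOutProcessor, simulate, evt-1, taxonomy-1) are hypothetical, and the in-memory set of seen IDs stands in for whatever shared store a multi-instance service would need.

```python
from dataclasses import dataclass, field


@dataclass
class FanOutProcessor:
    """Toy stand-in for the processing service; every name here is hypothetical."""

    # taxonomy id -> ids of the content items that reference it
    references: dict[str, list[str]]
    # Assumption: the mitigation is to ignore event ids that were already handled.
    deduplicate: bool = True
    seen_event_ids: set[str] = field(default_factory=set)
    emitted: list[tuple[str, str]] = field(default_factory=list)

    def handle(self, event_id: str, taxonomy_id: str) -> None:
        # A redelivered event carries the same event_id as its first delivery.
        if self.deduplicate and event_id in self.seen_event_ids:
            return
        self.seen_event_ids.add(event_id)
        # Fan out: one content update event per referencing content item.
        for content_id in self.references.get(taxonomy_id, []):
            self.emitted.append((event_id, content_id))


def simulate(deduplicate: bool, deliveries: int = 3, items: int = 1000) -> int:
    """Deliver the same taxonomy event several times; return the events emitted."""
    references = {"taxonomy-1": [f"content-{i}" for i in range(items)]}
    processor = FanOutProcessor(references, deduplicate=deduplicate)
    for _ in range(deliveries):
        processor.handle("evt-1", "taxonomy-1")
    return len(processor.emitted)


if __name__ == "__main__":
    print("without deduplication:", simulate(deduplicate=False))
    print("with deduplication:", simulate(deduplicate=True))
```

With the default values, three deliveries of one taxonomy event that is referenced by 1000 content items produce 3000 update events without deduplication and 1000 with it. The sketch only shows why redelivery multiplies the workload by the fan-out size; it makes no claim about how the production service tracks processed events.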

Resolution

We applied the hotfix to our event processing system to prevent duplicate event generation.
Once in place, the system resumed normal operation and processed the remaining backlog.
All publishing operations returned to normal as of 18:00 CEST.
Post-processing operations were finished by 22:30 CEST.

Posted May 02, 2025 - 09:28 CEST

Resolved

Summary:
Our publishing pipeline experienced delays due to an unexpected technical issue. A bug in our system caused an unusually high volume of events to be generated, which temporarily overloaded our processing pipeline.

Impact:
This resulted in delays in post-processing of published content for a few hours while the system processed the backlog of events. We understand this may have affected your workflows, and we sincerely apologize for the inconvenience.

Resolution:
Our engineering team quickly identified and fixed the issue to prevent further event generation. Once the fix was in place, the system gradually cleared the backlog, and publishing operations returned to normal.

Posted Apr 30, 2025 - 05:00 CEST