On 30.04.2025 at around 16:10 CEST, we experienced a disruption in our publishing pipeline that caused significant delays in content processing for several hours. This was due to an unexpected behavior in our event handling system when processing large-scale taxonomy updates.
Our system is designed to generate content update events whenever a taxonomy is modified. In this incident, a customer with a large volume of content updated a taxonomy that was referenced across many content items. The resulting high number of update events caused strain on the processing service, which then took longer than expected to acknowledge events from Kafka.
Due to these delays, Kafka re-assigned the same events to other service instances. This led to duplicate generation of update events, significantly inflating the workload. This loop continued until we stopped event processing and applied a hotfix—originally scheduled for release on May 12—which corrected the event handling behavior.
Once the fix was applied, the system stopped generating duplicate events. However, it took several hours to process the backlog that had already accumulated.