Mastodonapp.uk content processing failure Thursday 29th August 2024 20:10:10


Our content processing service has experienced a catastrophic failure, resulting in feeds not being updated. We are queueing all of these actions, and as soon as we have remediated the issue we will start to catch up on the content that needs processing.

Content processing is now fully operational and working as expected.

We are once again seeing performance issues at the database layer of MastodonAppUK, which is also disrupting our ability to process content on Universeodon.com. We are working to mitigate the impact now.

We have managed to slightly increase our ingress capacity for content processing. We're currently running approximately 8 minutes behind live on all our queues, with the exception of ingress, which is around 21 hours behind. We expect this queue to take a couple of hours to fully clear and will continue to monitor.

We have increased capacity on our database infrastructure but are still hitting bottlenecks. As a result, we've scaled back our processing workers to prioritise the default content queue and are currently running a small amount of capacity for the ingress queue to try to catch it up. This means all queues other than ingress are currently running around 30 minutes behind, with ingress around 22 hours behind. We will continue to adjust the scaling to ensure the site remains online and operational while we process all this content.
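For readers curious how this kind of prioritisation can work, here is a minimal sketch of splitting a fixed pool of workers between queues by weight. It is an illustration only, not our actual tooling: the function name `allocate_workers`, the worker count and the weights are all hypothetical. Only the queue names `default` and `ingress` come from the update above.

```python
def allocate_workers(total_workers, weights):
    """Split a fixed number of workers across queues in proportion to integer weights.

    Hypothetical helper: each queue gets the floor of its proportional share,
    and any workers left over by rounding go to the highest-weighted queue.
    """
    if total_workers < 0:
        raise ValueError("total_workers must not be negative")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    # Floor of each queue's proportional share.
    allocation = {
        queue: (total_workers * weight) // total_weight
        for queue, weight in weights.items()
    }

    # Hand any rounding remainder to the highest-priority queue.
    leftover = total_workers - sum(allocation.values())
    top_queue = max(weights, key=weights.get)
    allocation[top_queue] += leftover
    return allocation


# Hypothetical numbers: 20 workers, default weighted four times heavier than ingress.
print(allocate_workers(20, {"default": 8, "ingress": 2}))
```

Raising the weight on `default` keeps live content moving, while the small share left on `ingress` lets that backlog drain slowly without adding much database load.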

We have identified a capacity bottleneck on the router that serves part of our database infrastructure. We are scaling this up now and hope this will alleviate some of the bottleneck issues.

It appears that content processing has put too much pressure on our database infrastructure, causing major outages across the site. We are looking to scale back content processing to restore access to the site.

We have powered on our legacy content processing server, which is starting to work through the backlog. It looks like around midnight on the 29th August 2024 the new content processing services had a major failure, resulting in the vast majority of content processing jobs failing to execute. We currently have a backlog of a little over 1.1 million events, which is likely to increase as we process content and additional processing is required. We suspect it will take a few hours to get caught up. We are going to monitor the infrastructure and queues over the coming hours to ensure full recovery.
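As a rough guide to how a catch-up time like this can be estimated, the sketch below divides the backlog by the net drain rate (jobs processed minus new jobs arriving). The function name and both hourly rates are hypothetical placeholders, not measurements from this incident. Only the 1.1 million backlog figure comes from the update above.

```python
def estimate_drain_hours(backlog, processed_per_hour, arriving_per_hour):
    """Estimate hours needed to clear a backlog at a steady net drain rate.

    Returns None when new jobs arrive at least as fast as they are processed,
    because the backlog would never clear under those conditions.
    """
    net_rate = processed_per_hour - arriving_per_hour
    if net_rate <= 0:
        return None
    return backlog / net_rate


# Hypothetical rates: 500,000 jobs processed and 100,000 new jobs arriving per hour.
hours = estimate_drain_hours(1_100_000, 500_000, 100_000)
print(f"Estimated time to clear: {hours:.2f} hours")
```

Because processing content can itself create further jobs, the arrival rate tends to rise during catch-up, so an estimate like this is best treated as a lower bound.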