Universeodon Content Delays Tuesday 5th November 2024 15:00:00


We are currently recovering from an incident with our content processing service. This is resulting in a delay to newly ingested content and intermittent delays to other content. We're continuing to re-balance our content processing capacity across the various types of content we handle in order to expedite delivery of this content to the site.
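For those wondering what "re-balancing" looks like in practice: content is processed through weighted job queues (Mastodon instances use Sidekiq queues such as ingress for newly federated content), and shifting weight toward a backlogged queue drains it faster without starving the rest. The sketch below is purely illustrative, with made-up queue names, weights, and jobs rather than our actual tooling:

```python
import random
from collections import deque

# Illustrative only: the real pipeline uses Sidekiq job queues, but the idea
# of weighting worker capacity toward a backlogged queue is the same.
QUEUES = {
    "ingress": deque(),   # newly federated content waiting to be ingested
    "default": deque(),   # timeline fan-out and other core jobs
    "push": deque(),      # outbound deliveries
}

# Higher weight = workers pick this queue more often.
WEIGHTS = {"ingress": 6, "default": 3, "push": 1}

def pick_queue() -> str:
    """Choose a queue at random, biased by its weight."""
    names = list(WEIGHTS)
    return random.choices(names, weights=[WEIGHTS[n] for n in names], k=1)[0]

def work_one_job() -> bool:
    """Pop and 'process' one job, preferring the weighted-random pick."""
    name = pick_queue()
    if not QUEUES[name]:
        # Fall back to any non-empty queue so capacity is never idle.
        name = next((n for n, q in QUEUES.items() if q), None)
        if name is None:
            return False  # everything is drained
    job = QUEUES[name].popleft()
    print(f"processed {job} from {name}")
    return True

if __name__ == "__main__":
    QUEUES["ingress"].extend(f"status-{i}" for i in range(5))
    QUEUES["default"].append("fanout-1")
    while work_one_job():
        pass
```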

The database storage migration fully resolved the issue.

Our content processing is once again falling behind, with queues running between 30 minutes and over 1 hour behind. Ingress is currently at least 1 hour behind. I'm working to adjust scaling; however, significantly higher load than usual at this time, combined with our known database issues, is resulting in poor performance all around.

All feeds have now caught up. We will continue to observe for a short time before we sign off for the night, and we will check back in tomorrow morning to ensure nothing falls over for too long a period.

All of our content feeds, with the exception of ingress, have once again caught up fully. We're now running approx 30 mins behind real time on our ingress queue, which continues to shrink. I'm hoping that in the next 20-30 mins we'll have that queue down to real time as well and can get content processing back to its normal state.

We're now down to approx 55 mins of backlog on our ingress pipeline and have cleared a large portion of it. We're currently seeing around a 10 min delay on processing other content, which may mean your feed is slightly outdated. This will resolve itself shortly; once the ingress pipeline is cleared, the freed compute capacity will ensure any other content in the queue is processed, which should keep things running more smoothly tonight.

We are continuing to see a steady reduction in our ingress queue and a roughly parallel increase in our other queues as posts get processed into timelines and other aspects of the application. We're currently still tracking around 1 hour behind on ingress, and we're seeing a small spike where some timelines and other basic functionality are between 5 and 10 mins behind real time. A refresh will often fix this; alternatively, it will clear once the queues catch up. We're continuing to adjust the scaling of the service behind the scenes to account for these fluctuations.

We are monitoring the status of the queues, and current data suggests the balance we have is slowly burning down our ingress queue while keeping our other queues at near real time (anywhere up to 2-3 mins of lag currently). We will continue to monitor and update as appropriate.

I've managed to reset our content processing scale back to a sensible default and cleared all queues except ingress, which is currently around an hour behind based on the oldest item in the queue. I'm now going to attempt some delicate scaling to bring us back up without breaking too much.

We're starting to see major disruption to the site, likely the result of an increase in traffic to the website itself further stressing our capacity. We've paused content processing for a moment while we work out how to scale it back up more safely without crashing things further.

We have reverted our additional server, as the capacity constraints on our database are currently too great to allow us to scale beyond where we already were. We're looking to restore full service as a matter of critical urgency.

I'm currently trying to increase our database's capacity without needing to take the entire service offline for an extended period. The hope is that this will unlock additional capacity for the second content processing server, which is currently unable to connect because we have saturated the capacity of our database. Updates to follow.
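For context on the "unable to connect" symptom: this is the typical pattern when a PostgreSQL-backed service runs out of connection headroom, so new workers are refused while existing ones keep running slowly. The snippet below is a hedged sketch, assuming a Postgres backend reachable via the psycopg2 client and using a placeholder DSN; it is not a description of our actual monitoring:

```python
import psycopg2  # assumes a PostgreSQL backend; adjust the DSN for your setup

# Placeholder connection string, not our real server.
DSN = "host=localhost dbname=mastodon user=monitor"

def connection_headroom(dsn: str) -> tuple[int, int]:
    """Return (current_connections, max_connections) for the database."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM pg_stat_activity;")
            current = cur.fetchone()[0]
            cur.execute("SHOW max_connections;")
            maximum = int(cur.fetchone()[0])
    return current, maximum

if __name__ == "__main__":
    current, maximum = connection_headroom(DSN)
    print(f"{current}/{maximum} connections in use")
    if maximum - current < 10:
        print("headroom nearly exhausted; new workers may fail to connect")
```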

We have confirmed that part of the impact of this issue relates to ongoing performance issues on our database. We are currently building out an additional content processing server to attempt to pick up some of the slack and reduce the backlog. We're still tracking an approx 2 hour lag on new content being ingested through our queues, with all other queues remaining near real time.