Universeodon - Connectivity Issues - Tuesday 16th September 2025 20:21:04


We are seeing connectivity issues between Universeodon's old compute cluster and our new compute environment, which holds part of our data storage; as a result, the site is offline. We are working urgently to resolve the situation.

The planned maintenance tonight has been cancelled and will be re-scheduled, as the site is now working as expected.

We will raise a new incident here and announce via @wild1145@mastodonapp.uk and via server announcements when the new maintenance will take place.

We do still need to re-build the feeds for all users; however, we will pick that up under the existing incident here.

Universeodon queues are now tracking at under 10 mins. We still have a lot of retries to execute, so there may still be some delays, but it looks like we're mostly back to where we should be.

We will shortly spin up a new incident to cover the database move to the new infrastructure and we expect the site to be down for a few hours while we do this.

Due to the overnight database connectivity issues, we have lost an unknown number of jobs which tried and failed to run.

Our queues this morning are proving difficult to fully process, as we have around 152k retry jobs alongside a rapidly growing set of standard queues. We expect fairly slow processing for the next few hours but will continue to monitor.
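
For anyone curious how figures like these are read, the sketch below shows one way to check the retry backlog and queue lengths straight out of Redis. It is only a minimal illustration, assuming Sidekiq's standard Redis layout (a sorted set named "retry", a set "queues" listing queue names, and one list per queue named "queue:<name>") and the redis-py client; the connection details are placeholders, not our real hosts.

```python
# Minimal sketch, assuming Sidekiq's standard Redis layout and redis-py.
import redis

r = redis.Redis(host="redis.internal", port=6379, db=0)  # hypothetical host

# The retry set is a sorted set scored by the time of the next retry attempt.
print(f"retry jobs waiting: {r.zcard('retry')}")

# Each standard queue is a plain Redis list of JSON-encoded jobs.
for name in sorted(q.decode() for q in r.smembers("queues")):
    print(f"queue {name}: {r.llen('queue:' + name)} jobs")
```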

Following this database issue, and the fact that it has happened multiple times in recent memory, including over the last couple of days, we will look to take the site offline for maintenance later today and migrate the database to a new server in the hope that we can remedy some of these connectivity issues.

We are now seeing around 50 mins of lag on the default queue and around 1 hour on our ingress and push queues. As a result, we're scaling back our ingress capacity to prioritise the other queues and free up capacity, to ensure the WebUI works as expected and content is pushed where it needs to go. We currently have a total of around 285k jobs queued across our queues.
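
Queue lag of this kind is typically estimated from the age of the oldest waiting job in each queue. The sketch below is a minimal illustration of that idea, assuming Sidekiq's standard Redis layout (oldest job at the tail of each "queue:<name>" list, with an "enqueued_at" epoch timestamp in seconds) and the redis-py client; the host and queue names are placeholders.

```python
# Minimal sketch of estimating queue lag from the oldest job's enqueued_at.
import json
import time

import redis

r = redis.Redis(host="redis.internal", port=6379, db=0)  # hypothetical host

def queue_lag_minutes(queue: str) -> float:
    """Minutes between now and when the oldest job in the queue was enqueued."""
    oldest = r.lindex(f"queue:{queue}", -1)
    if oldest is None:
        return 0.0  # an empty queue has no lag
    return (time.time() - json.loads(oldest)["enqueued_at"]) / 60

for queue in ("default", "ingress", "push"):
    print(f"{queue}: ~{queue_lag_minutes(queue):.0f} mins behind")
```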

We are seeing some reduction in speed and responsiveness on Universeodon; however, our intention for the time being is to keep content processing ticking over at its current rate. We are making good progress through our ingress queue: it's now down to around 3 hours of latency from live, with around 184k jobs in the queue. Our default queue is currently lagging at just under 30 mins from live, with just under 110k jobs in the queue. We will review the situation in a couple of hours and, if needed, re-balance the queues again to re-prioritise the default queue and ensure timelines aren't significantly impacted beyond any existing impact.

Ingress queues on Universeodon are currently tracking at approx 4 hours behind real-time with the default queue currently slightly delayed at just under 20 mins. We've adjusted our balance of queues to prioritise our default queue and non-ingress queues as the ingress queue will often cause other jobs to spawn and can rapidly increase the job count on the other queues.

MastodonAppUK continues to look okay in terms of queue length and delays, with real-time processing continuing.

All queues except the ingress queue are now back to real-time processing. Ingress is currently around 6 hours behind live; we're going to try to increase the processing capacity a little further to hopefully clear through the backlog a bit quicker.

We are still going to need to do a feed re-build due to previously lost Redis data; however, this will happen once the database has been migrated to the new infrastructure.

Queues on Universeodon are coming down slowly. We've again increased processing capacity, as things appear to be running a lot more reliably now.

Ingress is around 7 hours behind live and the default queue (Timelines and other aspects) is currently tracking approx 20 mins behind live. All other queues are processing around real time and are within our expected ranges.

The queues on MastodonAppUK have now cleared and we're processing as normal in real time.

We've just increased some of our queue processing again on Universeodon in the hopes that it will help work through the queues there now that we are more confident with our original content processing configuration.

We have reverted one of our original tuning changes: we have increased the number of content processor processes running while reducing the number of threads each has, so that the thread count matches the original CPU configuration of the VMs. This appears to have made a noticeable difference in performance and reduced the issues crossing the boundary between our old and new infrastructure. MastodonAppUK is now fully operating back on the original infrastructure platform, with Universeodon seeing a 2x increase in processing capacity without any noticeable impact to site / end-user performance at this time. We will continue to monitor.
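
To illustrate the arithmetic with entirely made-up numbers: total processing concurrency is processes multiplied by threads per process, so the same (or greater) overall capacity can be reached with more processes and fewer threads each, keeping each process's thread count in line with the VM's vCPUs.

```python
# Illustrative arithmetic only, using hypothetical numbers.
vcpus_per_vm = 4  # hypothetical vCPU count per VM

configs = {
    "tuned (fewer processes, more threads)": {"processes": 2, "threads": 16},
    "reverted (more processes, fewer threads)": {"processes": 8, "threads": 4},
}

for label, cfg in configs.items():
    total = cfg["processes"] * cfg["threads"]
    within_cpu = cfg["threads"] <= vcpus_per_vm
    print(f"{label}: {total} workers in total, threads within vCPU count: {within_cpu}")
```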

We are seeing increasing queue sizes on Universeodon and MastodonAppUK, with some intermittent issues on Universeodon's website. We have scaled again to the maximum safe point without doing any other work, so we are hoping to keep the queues somewhat manageable. Ingress is currently tracking around 9 hours behind on Universeodon, while all other queues are between 30 and 45 mins behind at this time.

For MastodonAppUK, we are going to activate our legacy content processing nodes on the old infrastructure, in the hope that this will both work down the queue and allow us to temporarily disable the new content processing nodes. As our Redis and database servers are still on the old cluster, it should be as if nothing had ever moved, and we hope this will give some much-needed capacity to the Universeodon service. Queues are currently around 40 mins behind on ingress and 1 hour behind for the feeds and default queues.

The majority of the stability issues have cleared overnight, and with most of the MastodonAppUK backlog cleared, the demand on connectivity between the two sites has dropped considerably. We're slowly ramping up content processing on Universeodon to catch up on ingress, which was not running overnight, and to try to keep on top of the MastodonAppUK queues.

We will look to migrate the Universeodon database server this afternoon / evening onto the new compute cluster as we hope that will help the overall performance of the environment and will allow us to route all Universeodon traffic locally within the new DC.

We have managed to get the site online after bringing the database server's connectivity fully back online. We've re-enabled a very small amount of content processing capacity; however, this is already proving to cause Universeodon to struggle. We're not sure why Universeodon is struggling more than MastodonAppUK, but we're monitoring the situation and will have to look to expedite the migration of all other aspects of the sites to the new compute cluster.

We're now experiencing an issue with the Universeodon database server, which is no longer connecting to the network. This is preventing any sort of access and means the site remains fully offline. We are working as fast as possible to resolve the matter.

We have confirmed the issue is related to our Sidekiq processing; we've now suspended all content processing and switched over to the new website infrastructure. We will start to bring the Universeodon content processing back online, and then bring the MastodonApp content processing back online, to ensure stability.

We have reversed the change to our load balancer configuration and are re-routing traffic for the web services back through our old infrastructure, as we are unable to bring the new server online for reasons which are not yet clear. We will continue investigating.

We are continuing to have significant issues getting the new web infrastructure to serve public traffic and the root cause is currently unknown. We are working as hard as we can to restore service.

In an effort to stabilise the site itself and to ensure events destined for Universeodon.com are delivered, we have suspended the active processing of content while we re-build the content processing servers on the new compute cluster. Content will then be processed on our new infrastructure, where we should be able to better catch up with the backlog.

We think we've identified the issue as relating to traffic routing between our old and new clusters. Our MastodonApp.UK content processing still talks to our Redis server in the old environment, while our web and content processing services talk to Redis in our new environment; we think the delays to MastodonApp.UK, together with our work to increase processing capacity there, have significantly impacted service here.
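
One rough way to see why that routing matters: content processing of this kind makes many small Redis round trips per job, so even a few extra milliseconds between sites adds up quickly. The probe below is only a hypothetical sketch using redis-py; both URLs are placeholders standing in for the old- and new-environment Redis servers.

```python
# Hypothetical probe comparing Redis round-trip times from a processing node
# to the old-DC and new-DC Redis servers; both URLs are placeholders.
import time

import redis

def avg_ping_ms(url: str, samples: int = 50) -> float:
    """Average time for a simple PING round trip, in milliseconds."""
    client = redis.Redis.from_url(url)
    start = time.perf_counter()
    for _ in range(samples):
        client.ping()
    return (time.perf_counter() - start) / samples * 1000

for label, url in (("old DC", "redis://redis.old.internal:6379/0"),
                   ("new DC", "redis://redis.new.internal:6379/0")):
    print(f"{label}: ~{avg_ping_ms(url):.1f} ms per round trip")
```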

We are now working to migrate Universeodon content processing onto the new compute environment, which we hope will minimise some of this disruption, and we will continue to monitor the situation.