Real-time service status from across the ATLAS Media Group portfolio
We are seeing outages across our compute cluster. We are unsure what the root cause of the issue is and are currently investigating.
We are now confident that full service has been restored.
I've made a further fix to our underlying infrastructure and believe I've managed to rectify the intermittent connectivity fault. I still need to resolve issues with a large number of members' home timelines, but things should be a lot smoother now that this issue is resolved.
Universeodon is partially operational at this time. We are still seeing intermittent issues for which the root cause remains unclear. Due to a series of other urgent issues that have come up, I've not yet had time to troubleshoot fully, but I will look to get this intermittent outage resolved as soon as possible.
Universeodon.com continues to have major disruption and ongoing issues. We have had to remove the corrupted Redis database, which has resulted in any queued Sidekiq / content processing data being lost (we expect this to be almost no data); however, it has also resulted in some feeds, such as home feeds, being lost and needing re-creation. We are running into major issues when we try to regenerate the home feeds and do not have the capacity to continue working on the problem right now.
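For context on what re-creating home feeds involves: Mastodon keeps each user's home timeline in Redis, so the scale of the loss can be gauged by counting the surviving keys. A minimal sketch of that check, assuming a standard Mastodon key layout and a locally reachable Redis instance (the connection details and key pattern are assumptions, not our exact production configuration):

```python
import redis

# Sketch only: count surviving home-timeline keys after the Redis data loss.
# Mastodon stores each home feed in a sorted set keyed "feed:home:<account_id>";
# the connection details below are placeholders, not our production config.
r = redis.Redis(host="localhost", port=6379, db=0)

surviving = sum(1 for _ in r.scan_iter(match="feed:home:*", count=1000))
print(f"Home feed keys still present: {surviving}")
```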
We have been able to fully restore MastodonApp.UK to normal operations. Unfortunately, we're having much bigger issues getting Universeodon.com operational due to a corrupted component (Redis), which is currently resulting in a full service outage while we work to minimise the data corruption there.
No incidents reported
No incidents reported
No incidents reported
No incidents reported
No incidents reported
No incidents reported
No incidents reported
No incidents reported
We are investigating an issue impacting multiple services due to shared storage becoming unavailable. We will update as soon as we know more.
We are working to resolve issues with one of the nodes in our French region. We have started the process of moving clients onto a new, operational node; however, our VPSCP is having issues with the migration. We have engaged the vendor that owns the VPSCP software and are awaiting further updates from them so we can resolve this issue.
Starting at 19:30 BST, we will begin migrating the Universeodon.com database server to new infrastructure, enabling us to use additional capacity that has been provisioned.
We are currently experiencing a major outage on our content processing services. We are looking to restore service ASAP and are actively investigating this issue.
No incidents reported
Our content processing service has experienced a catastrophic failure, resulting in feeds not being updated. We are queueing all of these actions as a backlog, and as soon as we remediate the issue we will start catching up on the content that needs processing.
Content processing is now fully operational and working as expected.
We are once again seeing performance issues in the database layer of MastodonApp.UK, which is also disrupting our ability to process content on Universeodon.com. We are working to mitigate the impact now.
We have managed to slightly increase our ingress capacity for content processing. We are currently running approximately 8 minutes behind live on all of our queues, with the exception of ingress, which is around 21 hours behind. We expect this queue to take a couple of hours to fully clear and will continue to monitor.
We have increased capacity on our database infrastructure but are still hitting bottlenecks. As a result, we have scaled back our processing workers to prioritise the default content queue and are running a small amount of capacity for the ingress queue to try to catch it up. This means all queues other than ingress are currently running around 30 minutes behind, with ingress currently around 22 hours behind. We will continue to adjust the scaling to ensure the site remains online and operational while we process all of this content.
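For those interested in how we track the queue lag figures above: Sidekiq keeps its pending jobs in Redis lists named queue:<name>, so each queue's backlog can be read directly. A minimal sketch of that check, assuming a standard Sidekiq/Redis setup (the connection details are placeholders, not our production configuration):

```python
import redis

# Sketch: read Sidekiq queue backlogs straight from Redis. Pending jobs sit in
# lists named "queue:<queue name>"; the connection details are placeholders.
r = redis.Redis(host="localhost", port=6379, db=0)

for queue in ("default", "ingress", "push", "pull", "mailers"):
    backlog = r.llen(f"queue:{queue}")
    print(f"{queue:>8}: {backlog} jobs waiting")
```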
We have identified a capacity bottleneck on the router that serves part of our database infrastructure. We are scaling this up now and hope it will alleviate some of the bottleneck issues.
It appears that content processing has put too much pressure on our database infrastructure, causing major outages across the site. We are looking to scale back content processing to restore access to the site.
We have powered on our legacy content processing server, which is starting to work through the backlog. It looks like around midnight on 29 August 2024 the new content processing services had a major failure, resulting in the vast majority of content processing jobs failing to execute. We currently have a backlog of a little over 1.1 million events, which is likely to increase as we process content and additional processing is required. I suspect it will take a few hours to get caught up. We will monitor the infrastructure and queues over the coming hours to ensure full recovery.
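For a rough sense of what "a few hours" means here: the catch-up time is simply the backlog divided by sustained throughput. A back-of-the-envelope sketch, where the throughput figure is an assumed number for illustration rather than a measurement from our workers:

```python
# Back-of-the-envelope catch-up estimate; the throughput figure is assumed,
# not a measurement from our workers.
backlog_jobs = 1_100_000
assumed_jobs_per_second = 75  # hypothetical sustained processing rate

hours_to_clear = backlog_jobs / assumed_jobs_per_second / 3600
print(f"Estimated catch-up time: {hours_to_clear:.1f} hours")  # ~4.1 hours
```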